Shiyu Huang

ORCID: 0009-0008-9541-0622
Research Areas
  • Multimodal Machine Learning Applications
  • Advanced Image and Video Retrieval Techniques
  • Image Enhancement Techniques
  • Human Pose and Action Recognition
  • Image Retrieval and Classification Techniques
  • Video Surveillance and Tracking Methods
  • Advanced Image Processing Techniques
  • Advanced Vision and Imaging
  • Video Analysis and Summarization

Dalian Polytechnic University
2024

In recent years, vision language models (VLMs) have made significant advancements in video understanding. However, a crucial capability, fine-grained motion comprehension, remains under-explored in current benchmarks. To address this gap, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models. MotionBench evaluates models' motion-level perception through six primary categories of motion-oriented question types and includes data collected from...

10.48550/arxiv.2501.02955 preprint EN arXiv (Cornell University) 2025-01-06

Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding, including CogVLM2, CogVLM2-Video, and GLM-4V. As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both pre-training and post-training stages, supporting input resolution up to $1344 \times...

10.48550/arxiv.2408.16500 preprint EN arXiv (Cornell University) 2024-08-29

Currently, deep learning methods for low‐light image enhancement mainly focus on the illumination of images while neglecting the problems of noise and feature loss. To address this issue, this paper proposes a novel network called DAF‐Retinex, based on Retinex‐Net. To address noise, unlike traditional denoising methods, DAF‐Retinex uses a fully convolutional neural network to denoise the reflection component; additionally, a denoising loss function is introduced to suppress noise. For preserving details and extracting features,...

10.1049/ipr2.13110 article EN cc-by IET Image Processing 2024-05-17

We present a general strategy for aligning visual generation models, both image and video generation, with human preference. To start with, we build VisionReward, a fine-grained and multi-dimensional reward model. We decompose human preferences in images and videos into multiple dimensions, each represented by a series of judgment questions, linearly weighted and summed to an interpretable and accurate score. To address the challenges of video quality assessment, we systematically analyze various dynamic features of videos, which helps VisionReward surpass...

10.48550/arxiv.2412.21059 preprint EN arXiv (Cornell University) 2024-12-30
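
The VisionReward abstract above describes answers to judgment questions being linearly weighted and summed into one score. A minimal sketch of that idea follows. The function name, the yes/no encoding (1.0 for yes, 0.0 for no), and the example weights are all assumptions made for illustration; they are not taken from the paper or its code.

```python
from typing import Sequence


def weighted_preference_score(answers: Sequence[bool], weights: Sequence[float]) -> float:
    """Combine yes/no judgment answers into one score via a weighted sum.

    Hypothetical helper: the encoding and weights are illustrative assumptions.
    """
    if len(answers) != len(weights):
        raise ValueError("answers and weights must have the same length")
    # Encode "yes" as 1.0 and "no" as 0.0, then sum each answer times its weight.
    return sum(w * (1.0 if a else 0.0) for a, w in zip(answers, weights))


# Made-up answers to three judgment questions, with made-up weights.
answers = [True, False, True]
weights = [0.5, 0.25, 0.25]
score = weighted_preference_score(answers, weights)
print(score)
```

Because each question contributes its own weighted term, the final score can be traced back to the individual judgments, which is one plausible reading of the abstract's "interpretable" claim.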