- Multimodal Machine Learning Applications
- Image Retrieval and Classification Techniques
- Advanced Image and Video Retrieval Techniques
- Human Pose and Action Recognition
- Advanced Vision and Imaging
- Rough Sets and Fuzzy Logic
- Anatomy and Medical Technology
- 3D Shape Modeling and Analysis
- Data Mining Algorithms and Applications
- Domain Adaptation and Few-Shot Learning
Hong Kong University of Science and Technology
2025
University of Hong Kong
2025
Beijing Jiaotong University
2021-2024
Temporal sentence grounding in videos (TSGV) faces challenges because public TSGV datasets contain significant temporal biases, which are attributed to the uneven distributions of target moments. Existing methods generate augmented videos in which moments are forced to have varying locations. However, since the video lengths are given only small variations, merely changing the locations results in poor generalization to videos of different lengths. In this paper, we propose a novel training framework complemented by diversified data...
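The abstract is truncated here. As a rough illustration of the kind of augmentation it argues for (diversifying both the location and the length of target moments, not just the location), below is a minimal sketch; the function name, the resampling strategy, and the uniform sampling ranges are assumptions for illustration, not the paper's actual procedure.

```python
import numpy as np

def augment_moment(video_feats, start, end, rng=None, scale_range=(0.5, 1.5)):
    """Place the target moment at a random location with a random length.

    video_feats: (T, D) array of per-frame (or per-clip) features.
    start, end:  frame indices of the annotated target moment.
    Returns the augmented feature sequence and the new (start, end).
    Illustrative sketch only, not the paper's method.
    """
    assert end > start
    rng = rng or np.random.default_rng()
    moment = video_feats[start:end]
    context = np.concatenate([video_feats[:start], video_feats[end:]], axis=0)

    # Vary the moment length by nearest-neighbour resampling of its frames.
    scale = rng.uniform(*scale_range)
    new_len = max(1, int(round(len(moment) * scale)))
    idx = np.linspace(0, len(moment) - 1, new_len).round().astype(int)
    moment = moment[idx]

    # Vary the moment location by re-inserting it at a random offset
    # inside the remaining context.
    insert_at = rng.integers(0, len(context) + 1)
    augmented = np.concatenate(
        [context[:insert_at], moment, context[insert_at:]], axis=0)
    return augmented, (insert_at, insert_at + new_len)
```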
Recent advances in vision-language pre-training have significantly enhanced model capabilities on grounded object detection. However, these studies often pre-train with coarse-grained text prompts, such as plain category names and brief phrases. This limitation curtails the model's capacity for fine-grained linguistic comprehension and leads to a significant decline in performance when faced with detailed descriptions or contextual information. To tackle these problems, we develop DoGA: Detect objects Grouped...
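To make the coarse-versus-fine prompt distinction concrete, here is a minimal sketch of how a grounded detector scores candidate regions against text prompts, which may be either plain category names or detailed descriptions; the `text_encoder` interface and the cosine-similarity scoring are generic assumptions, not DoGA's architecture.

```python
import torch
import torch.nn.functional as F

def classify_regions(region_feats, text_encoder, prompts):
    """Score detected regions against text prompts by cosine similarity.

    region_feats : (R, D) visual features of candidate boxes.
    text_encoder : any module mapping a list of strings to (P, D) embeddings,
                   e.g. the text tower of a vision-language detector.
    prompts      : list of strings; either plain category names ("dog") or
                   detailed descriptions ("a brown dog lying on a red couch").
                   The abstract's point is that models pre-trained only on
                   the former degrade on the latter.
    """
    text_feats = F.normalize(text_encoder(prompts), dim=-1)   # (P, D)
    region_feats = F.normalize(region_feats, dim=-1)          # (R, D)
    return region_feats @ text_feats.t()                      # (R, P) scores
```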
Composed image retrieval aims at performing the retrieval task given a reference image and a complementary text piece. Since composing both sources of information can accurately model a user's search intent, composed retrieval can perform target-specific search and be potentially applied to many scenarios such as interactive product search. However, two key challenging issues must be addressed. One is how to fuse the heterogeneous image and text pieces of the query into a common feature space. The other is how to bridge the gap between the query pieces and the images in the database. To address these issues,...
The composed query image retrieval task aims to retrieve the target image in a database using a query that composes two different modalities: a reference image and a sentence declaring which details of the reference need to be modified or replaced with new elements. Tackling this task requires learning a multimodal embedding space that places semantically similar targets and queries close together while pushing dissimilar ones as far away as possible. Most existing methods start from the perspective of model structure and design clever interactive modules to promote better fusion of the two modalities. However,...
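A minimal sketch of the metric-learning objective described above: composed (reference + text) query embeddings are pulled toward their target image embeddings and pushed away from the other images in the batch. The in-batch InfoNCE-style loss and the temperature value are illustrative assumptions, not necessarily the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def composed_retrieval_loss(query_emb, target_emb, temperature=0.07):
    """In-batch contrastive loss for composed query image retrieval.

    query_emb  : (B, D) embeddings of composed queries (reference image
                 fused with the modification sentence).
    target_emb : (B, D) embeddings of the corresponding target images.
    Matching pairs share the same batch index; all other targets in the
    batch act as negatives, so similar query/target pairs are pulled
    together and dissimilar ones pushed apart.
    """
    q = F.normalize(query_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)
    logits = q @ t.t() / temperature                      # (B, B) similarities
    labels = torch.arange(q.size(0), device=q.device)     # diagonal = positives
    return F.cross_entropy(logits, labels)
```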
Composed image retrieval aims at retrieving the desired images given a reference image and a text piece. To handle this task, two important subprocesses should be modeled reasonably. One is to erase the details of the reference image that are irrelevant with respect to the text piece; the other is to replenish the desired details in it. Existing methods neglect to distinguish between the two and implicitly put them together to solve the composed retrieval task. To model them explicitly and in order, we propose a novel method which contains three key components, i.e., a Multi-semantic Dynamic Suppression module (MDS),...
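The erase/replenish idea can be sketched as a gated composition of image and text features: a learned gate suppresses the parts of the reference feature the text says should change, and a residual branch adds the new semantics. This TIRG-style gating is an illustrative stand-in under those assumptions, not the paper's MDS module.

```python
import torch
import torch.nn as nn

class EraseReplenish(nn.Module):
    """Illustrative gated composition of reference-image and text features."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.residual = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                      nn.Linear(dim, dim))

    def forward(self, img_feat, text_feat):
        joint = torch.cat([img_feat, text_feat], dim=-1)
        kept = self.gate(joint) * img_feat    # erase text-irrelevant details
        added = self.residual(joint)          # replenish the new details
        return kept + added
```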
Composed image retrieval (CIR) is an emerging and challenging research task that combines two modalities, a reference image and a modification text, into one query to retrieve the target image. In online shopping scenarios, a user would use the text as feedback to describe the difference between the desired image and the reference. To handle this task, two main problems need to be addressed. One is the localization problem: how to precisely find those spatial areas of the reference image mentioned by the text. The other is how to effectively modify the semantics of those areas based on the text. However,...
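The localization step can be pictured as text-to-region cross-attention over the reference image's spatial feature grid, producing weights over the areas the modification text refers to. The single-head module below is a sketch under that assumption, not the paper's actual design.

```python
import torch
import torch.nn as nn

class TextGuidedLocalizer(nn.Module):
    """Text queries attend over a spatial feature grid to locate the
    areas mentioned by the modification text (single-head for clarity)."""

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, text_feat, img_grid):
        # text_feat: (B, D) sentence embedding; img_grid: (B, HW, D) regions
        q = self.q(text_feat).unsqueeze(1)                  # (B, 1, D)
        k, v = self.k(img_grid), self.v(img_grid)           # (B, HW, D)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
        located = attn @ v                                   # (B, 1, D)
        return located.squeeze(1), attn.squeeze(1)           # feature, weights
```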
Composed image retrieval (CIR) aims at fusing a reference image and text feedback to search for the desired images. Compared with general image retrieval, it can model users' intent more comprehensively and retrieve target images more accurately, which has significant impact in various real-world applications such as E-commerce and Internet search. However, because of the heterogeneous semantic gap, the synthetic understanding and fusion of both modalities are difficult to implement. In this work, to tackle the problem, we propose an end-to-end framework...
This paper investigates the research task of reconstructing a 3D clothed human body from a monocular image. Due to the inherent ambiguity of single-view input, existing approaches leverage pre-trained SMPL(-X) estimation models or generative models to provide auxiliary information for reconstruction. However, these methods capture only general geometry and overlook specific geometric details, leading to inaccurate skeleton reconstruction, incorrect joint positions, and unclear cloth wrinkles. In response to these issues, we...