Gangjian Zhang

ORCID: 0000-0003-1503-4513
Research Areas
  • Multimodal Machine Learning Applications
  • Image Retrieval and Classification Techniques
  • Advanced Image and Video Retrieval Techniques
  • Human Pose and Action Recognition
  • Advanced Vision and Imaging
  • Rough Sets and Fuzzy Logic
  • Anatomy and Medical Technology
  • 3D Shape Modeling and Analysis
  • Data Mining Algorithms and Applications
  • Domain Adaptation and Few-Shot Learning

Hong Kong University of Science and Technology
2025

University of Hong Kong
2025

Beijing Jiaotong University
2021-2024

Temporal sentence grounding in videos (TSGV) faces challenges because public TSGV datasets contain significant temporal biases, which are attributed to the uneven distributions of target moments. Existing methods generate augmented videos in which target moments are forced to have varying locations. However, since the video lengths in the given datasets vary only slightly, changing locations alone results in poor generalization to videos of varying lengths. In this paper, we propose a novel training framework complemented by diversified data...

10.48550/arxiv.2501.06746 preprint EN arXiv (Cornell University) 2025-01-12

10.1109/icassp49660.2025.10888960 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

Recent advances in vision-language pre-training have significantly enhanced model capabilities in grounded object detection. However, these studies often pre-train with coarse-grained text prompts, such as plain category names and brief phrases. This limitation curtails the model's capacity for fine-grained linguistic comprehension and leads to a significant decline in performance when faced with detailed descriptions or contextual information. To tackle these problems, we develop DoGA: Detect objects Grouped...

10.1609/aaai.v39i6.32603 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2025-04-11

Composed image retrieval aims at performing an image retrieval task given a reference image and a complementary text piece. Since composing both kinds of information can accurately model the users' search intent, composed image retrieval can perform target-specific retrieval and could be applied to many scenarios such as interactive product search. However, two key challenging issues must be addressed in this setting. One is how to fuse the heterogeneous image and text piece in the query into a feature space. The other is how to bridge the gap between the text pieces in the query and the images in the database. To address these issues,...

10.1145/3474085.3475659 article EN Proceedings of the 30th ACM International Conference on Multimedia 2021-10-17

The composed query image retrieval task aims to retrieve the target image in the database using a query that composes two different modalities: a reference image and a sentence declaring that some details of the reference image need to be modified and replaced by new elements. Tackling this task requires learning a multimodal embedding space in which semantically similar targets and queries are close and dissimilar ones are as far apart as possible. Most existing methods start from the perspective of model structure and design clever interactive modules to promote better fusion of the modalities. However,...

10.1109/tip.2024.3359062 article EN IEEE Transactions on Image Processing 2024-01-01

Composed image retrieval aims at retrieving the desired images, given a reference image and a text piece. To handle this task, two important subprocesses should be modeled reasonably. One is to erase irrelevant details of the reference image with respect to the text piece, and the other is to replenish the desired details in the image. Existing methods neglect to distinguish between the two subprocesses and implicitly put them together to solve the composed image retrieval task. To model the two subprocesses explicitly and in order, we propose a novel method which contains three key components, i.e., Multi-semantic Dynamic Suppression module (MDS),...

10.1109/tip.2022.3204213 article EN IEEE Transactions on Image Processing 2022-01-01

Composed image retrieval (CIR) is an emerging and challenging research task that combines two modalities, a reference image and a modification text, into one query to retrieve the target image. In online shopping scenarios, a user would use the modification text as feedback to describe the difference between the reference image and the desired image. To handle the task, two main problems need to be addressed. One is the localization problem: how to precisely find the spatial areas of the image mentioned by the text. The other is how to effectively modify the image semantics based on the text. However,...

10.1109/tmm.2023.3273466 article EN IEEE Transactions on Multimedia 2023-05-05

Composed image retrieval (CIR) aims at fusing a reference image and text feedback to search for the desired images. Compared to general image retrieval, it can model the users' search intent more comprehensively and retrieve the target images more accurately, which has significant impacts in various real-world applications, such as e-commerce and Internet search. However, because of the existing heterogeneous semantic gap, the joint understanding and fusion of both modalities are difficult to implement. In this work, to tackle the problem, we propose an end-to-end framework...

10.1109/tmm.2022.3208742 article EN IEEE Transactions on Multimedia 2022-09-22

10.1109/icme57554.2024.10687702 article EN 2024 IEEE International Conference on Multimedia and Expo (ICME) 2024-07-15

This paper investigates the research task of reconstructing the 3D clothed human body from a monocular image. Due to the inherent ambiguity of single-view input, existing approaches leverage pre-trained SMPL(-X) estimation models or generative models to provide auxiliary information for reconstruction. However, these methods capture only the general geometry and overlook specific geometric details, leading to inaccurate skeleton reconstruction, incorrect joint positions, and unclear cloth wrinkles. In response to these issues, we...

10.48550/arxiv.2412.03103 preprint EN arXiv (Cornell University) 2024-12-04