- Multimodal Machine Learning Applications
- Video Analysis and Summarization
- Gaze Tracking and Assistive Technology
- Advanced Image and Video Retrieval Techniques
- Visual perception and processing mechanisms
- Human Pose and Action Recognition
- Visual Attention and Saliency Detection
- Image Processing Techniques and Applications
- Neural dynamics and brain function
- Digital Imaging for Blood Diseases
- Domain Adaptation and Few-Shot Learning
- Cell Image Analysis Techniques
Renmin University of China
2023
Beijing Institute of Big Data Research
2023
Tianjin University
2020-2021
Video moment localization aims to retrieve the target segment of an untrimmed video according to a natural language query. Weakly supervised methods have gained attention recently, as the precise temporal location is not always available. However, one of the greatest challenges encountered by weakly supervised methods lies in the mismatch between video and language induced by the coarse annotations. To refine the vision-language alignment, recent works contrast the cross-modality similarities driven by reconstructing masked queries between positive and negative...
Video sentence grounding aims to localize a video segment that semantically aligns with the given language query from a video. Most existing works simply let video and language interact only once, at a single early stage. As a result, not only are the multi-level dependencies within videos left unexplored, since the interactions act fixedly on a specific level, but the guiding role of language is also neglected. To tackle these issues, we propose an efficient network, namely the Temporal-enhanced Cross-modality Fusion Network (TCFN). By directly modulating temporal...
Paired video and language data naturally exhibit temporal concurrency, which requires modeling the temporal dynamics within each modality and the temporal alignment across modalities simultaneously. However, most existing video-language representation learning methods focus only on discrete semantic alignment, which encourages aligned semantics to be close in the latent space, or on temporal context dependency, which captures short-range coherence, and so fail to build temporal concurrency. In this paper, we propose to learn video-language representations by modeling video-language pairs as Temporal...
Visual scanning plays an important role in sampling visual information from the surrounding environment for many everyday sensorimotor tasks, such as driving. In this paper, we consider the problem of the visual scanning mechanism underpinning sensorimotor tasks in 3D dynamic environments. We exploit the use of eye tracking data as a behaviometric, indicating the visuo-motor behavioral measure in the context of virtual driving. A new metric of visual scanning efficiency (VSE), which is defined as a mathematical divergence between a fixation distribution and a distribution of optical flows induced by...
Visual scanning plays an important role in sampling visual information from the surrounding environment for many everyday sensorimotor tasks, such as walking and car driving. In this paper, we consider the problem of the visual scanning mechanism underpinning sensorimotor tasks in 3D dynamic environments. We exploit the use of eye tracking data as a behaviometric, indicating the visuo-motor behavioral measures in the context of virtual driving. A new metric of visual scanning efficiency (VSE), which is defined as a mathematical divergence between a fixation distribution and a distribution of optical...
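To make the idea of a divergence between a fixation distribution and an optical-flow distribution concrete, here is a minimal sketch. The abstract is truncated and does not say which divergence the authors use, so the Kullback-Leibler divergence below is an assumption, and the function name `kl_divergence` and the toy histograms are hypothetical illustrations rather than the published VSE definition.

```python
from math import log

def kl_divergence(p, q, eps=1e-12):
    """KL divergence D(p || q) in nats between two histograms.

    Assumption: KL is used purely for illustration; the source only
    says "a mathematical divergence" and does not name one.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    p_sum, q_sum = sum(p), sum(q)
    total = 0.0
    for pi, qi in zip(p, q):
        # Normalize to probabilities and floor at eps to avoid log(0).
        pi = max(pi / p_sum, eps)
        qi = max(qi / q_sum, eps)
        total += pi * log(pi / qi)
    return total

# Hypothetical example: fixation counts and optical-flow magnitudes
# accumulated over the same four screen regions.
fixation_hist = [12, 30, 5, 3]
flow_hist = [10, 25, 10, 5]
score = kl_divergence(fixation_hist, flow_hist)
```

A lower value here means the two histograms are more alike. How the authors map the divergence to an efficiency score is not recoverable from the truncated abstract.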
Eye movement behavior, which provides visual information acquisition and processing, plays an important role in the sensorimotor tasks, such as driving, that human beings perform in everyday life. In the procedure of performing sensorimotor tasks, eye movement is contributed through a specific coordination of head and eye in gaze changes, with head motions preceding eye movements. Notably, we believe that this coordination in essence indicates a kind of causality. In this paper, we investigate transfer entropy to set up a quantity for measuring a unidirectional causality from head motion...
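Transfer entropy from a source series to a target series measures how much the source's past reduces uncertainty about the target's next value beyond what the target's own past already explains. The sketch below is a plug-in estimate for discrete sequences with a history length of one. The discretization, the history length, and the names `transfer_entropy`, `head`, and `eye` are assumptions for illustration, not the authors' implementation.

```python
from collections import Counter
from math import log2

def transfer_entropy(source, target):
    """Plug-in transfer entropy (bits) from source to target, history length 1.

    TE = sum p(y1, y0, x0) * log2( p(y1 | y0, x0) / p(y1 | y0) )
    where y1 is the target's next value, y0 its current value,
    and x0 the source's current value.
    """
    if len(source) != len(target):
        raise ValueError("series must have equal length")
    triples = list(zip(target[1:], target[:-1], source[:-1]))
    n = len(triples)
    joint = Counter(triples)                                  # (y1, y0, x0)
    past_both = Counter((y0, x0) for _, y0, x0 in triples)    # (y0, x0)
    next_and_past = Counter((y1, y0) for y1, y0, _ in triples)
    past_self = Counter(y0 for _, y0, _ in triples)
    te = 0.0
    for (y1, y0, x0), count in joint.items():
        p_joint = count / n
        p_given_both = count / past_both[(y0, x0)]
        p_given_self = next_and_past[(y1, y0)] / past_self[y0]
        te += p_joint * log2(p_given_both / p_given_self)
    return te

# Hypothetical binarized signals: the eye series copies the head series
# with a one-step delay, mimicking head motion preceding eye movement.
head = [0, 1, 1, 0, 1, 0, 0, 1]
eye = [0] + head[:-1]
te_head_to_eye = transfer_entropy(head, eye)
```

Real gaze data would need to be discretized first, and plug-in estimates are biased on short series, so this only illustrates the quantity itself.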