- Video Analysis and Summarization
- Multimodal Machine Learning Applications
- Advanced Image and Video Retrieval Techniques
- Domain Adaptation and Few-Shot Learning
- Human Pose and Action Recognition
- Data Quality and Management
- Music and Audio Processing
- Advanced Vision and Imaging
- Multimedia Communication and Technology
- Machine Learning and Data Classification
- Natural Language Processing Techniques
North University of China
2024
Sun Yat-sen University
2022-2024
Spatiotemporal attention learning remains a challenging task in video question answering (VideoQA), as it requires sufficient understanding of cross-modal spatiotemporal information. Existing methods usually leverage different mechanisms to reveal potential associations between video and question. While these methods effectively remove irrelevant information from the attention, they ignore pseudo-related information within the interaction attention. To address this problem, we propose a novel energy-based refined-attention...
The joint task of video moment retrieval and highlight detection is challenging: it requires building a model that not only captures contextual information between sequences in time but also has the ability to understand and judge significance. This paper addresses these problems from three aspects. First, we design a parameter-free cross-modal statistical correlation interaction method. A novel saliency enhancement function is defined to quantify differences in important features associated with...
Outfit collocation requires considering the interrelationship and adaptability among the attributes of component items. However, with numerous diverse fashion items, accurately capturing attribute features and modeling the complex relationships between them become key challenges. To address these challenges, we propose a novel scheme, Decoupling-driven Multi-level Attribute Parsing, for interpretable outfit collocation. First, we decouple a series of attributes from each item's visual feature in a fully supervised manner, which can improve...
Spatiotemporal attention learning has always been a challenging research task in video question answering (VideoQA). It needs to consider not only the modelling of local neighbourhood dependencies between adjacent frames but also long-term dependencies between nonadjacent frames. Although existing methods are usually good at modelling one of these temporal aspects, they cannot simultaneously and effectively model both. To address this issue, we first derive a novel statistic-driven difference-aware generation function, which can...