- Multimodal Machine Learning Applications
- Domain Adaptation and Few-Shot Learning
- Human Pose and Action Recognition
- Anomaly Detection Techniques and Applications
- Advanced Image and Video Retrieval Techniques
- Advanced Neural Network Applications
- Natural Language Processing Techniques
- Subtitles and Audiovisual Media
- Speech and Dialogue Systems
- Mobile Health and mHealth Applications
- AI in Service Interactions
- Image Retrieval and Classification Techniques
- Generative Adversarial Networks and Image Synthesis
- Topic Modeling
- Text and Document Classification Technologies
- Nutritional Studies and Diet
Zhejiang University of Technology
2021-2023
Zhejiang University
2023
The discrimination of instance embeddings plays a vital role in associating instances across time for online video instance segmentation (VIS). Instance embedding learning is directly supervised by the contrastive loss computed upon the contrastive items (CIs), which are sets of anchor/positive/negative embeddings. Recent online VIS methods leverage CIs sourced from one reference frame only, which we argue is insufficient for learning highly discriminative embeddings. Intuitively, a possible strategy to enhance CIs is replicating the inference phase during training. To...
This paper presents ConvBench, a novel multi-turn conversation evaluation benchmark tailored for Large Vision-Language Models (LVLMs). Unlike existing benchmarks that assess individual capabilities in single-turn dialogues, ConvBench adopts a three-level multimodal capability hierarchy, mimicking human cognitive processes by stacking up perception, reasoning, and creativity. Each level focuses on a distinct capability, mirroring the progression from basic perception to logical reasoning...
Large Vision-Language Models (LVLMs) show significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation. However, existing evaluation benchmarks cover a limited number of tasks testing rudimentary capabilities, falling short of tracking LVLM development. In this study, we present MMT-Bench, a comprehensive benchmark designed to assess LVLMs across massive multimodal tasks requiring expert knowledge and deliberate visual recognition, localization, reasoning, and planning....
Most instance segmentation models are not end-to-end trainable due to either the incorporation of proposal estimation (RPN) as a pre-processing step or non-maximum suppression (NMS) as a post-processing step. Here we propose a novel method termed ISDA. It reshapes the task into predicting a set of object masks, which are generated via a traditional convolution operation with learned position-aware kernels and features of objects. Such kernels and features are learned by leveraging a deformable attention network with multi-scale representation. Thanks to the introduced...
Named entity disambiguation (NED) finds the specific meaning of an entity mention in a particular context and links it to a target entity. With the emergence of multimedia, the modalities of content on the Internet have become more diverse, which poses difficulties for traditional NED, and the vast amounts of information make it impossible to manually label every kind of ambiguous data to train a practical NED model. In response to this situation, we present MMGraph, which uses multimodal graph convolution to aggregate visual and contextual language information for accurate...
Compared with the progress made on human activity classification, much less success has been achieved on human interaction understanding (HIU). Apart from the fact that the latter task is more challenging, the main cause is that recent approaches learn human interactive relations via shallow graphical representations, which are inadequate to model complicated interactive relations. This paper proposes a deep consistency-aware framework aimed at tackling the grouping and labelling inconsistencies in HIU. This framework consists of three...
With increasing health concerns about diet, it is worthwhile to develop an intelligent assistant that can help users eat healthier. Such an assistant can automatically give personal advice on the user's diet and generate a report about eating on a regular basis. To boost research on such an assistant, we establish a real-world foodlog database using various methods such as filtering, clustering, and a graph convolutional network. This database is built on lifelog and medical data and is named Real-World Multimodal Foodlog (RWMF). It contains 7500...
A comprehensive understanding of human-to-human interactions of interest in video streams, such as queuing, handshaking, fighting and chasing, is of immense importance to the surveillance of public security in regions like campuses, squares and parks. Different from conventional human interaction recognition, which uses choreographed videos as inputs, neglects concurrent interactive groups, and performs detection and recognition in separate stages, we introduce a new task named human-to-human interaction detection (HID). HID is devoted to detecting subjects,...