- Multimodal Machine Learning Applications
- Topic Modeling
- Natural Language Processing Techniques
- Domain Adaptation and Few-Shot Learning
- Anomaly Detection Techniques and Applications
- Speech Recognition and Synthesis
- Advanced Image and Video Retrieval Techniques
- Semantic Web and Ontologies
- Medical Image Segmentation Techniques
- Network Security and Intrusion Detection
- Usability and User Interface Design
- Interactive and Immersive Displays
- Data-Driven Disease Surveillance
- Advanced Graph Neural Networks
- Image Retrieval and Classification Techniques
- Complex Network Analysis Techniques
- Speech and Dialogue Systems
Zhejiang University
2023-2024
Alibaba Group (Cayman Islands)
2023
Recent advancements in Multimodal Large Language Models (MLLMs) have been utilizing Visual Prompt Generators (VPGs) to convert visual features into tokens that LLMs can recognize. This is achieved by training the VPGs on millions of image-caption pairs, where VPG-generated tokens are fed into a frozen LLM to generate the corresponding captions. However, this image-captioning-based objective inherently biases the VPG to concentrate solely on the primary visual contents sufficient for caption generation, often neglecting...
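The training setup this abstract describes — a small trainable VPG feeding soft tokens into a frozen LLM under a captioning loss — reduces to a few lines. A minimal sketch, assuming toy dimensions, a stand-in frozen LM, and an attention-pooling VPG (`ToyVPG` and all sizes here are hypothetical, not the paper's architecture):

```python
import torch
import torch.nn as nn

class ToyVPG(nn.Module):
    """Hypothetical Visual Prompt Generator: attention-pools patch features
    into a fixed number of soft tokens in the LLM's embedding space."""
    def __init__(self, vis_dim=256, llm_dim=512, n_tokens=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_tokens, llm_dim))
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, patch_feats):                       # (B, N, vis_dim)
        v = self.proj(patch_feats)                        # (B, N, llm_dim)
        attn = torch.softmax(self.queries @ v.transpose(1, 2), dim=-1)
        return attn @ v                                   # (B, n_tokens, llm_dim)

# Stand-in for the frozen LLM: embeddings + encoder + LM head, none trainable.
vocab, llm_dim = 1000, 512
embed = nn.Embedding(vocab, llm_dim)
lm = nn.TransformerEncoder(nn.TransformerEncoderLayer(llm_dim, 8, batch_first=True), 2)
head = nn.Linear(llm_dim, vocab)
for module in (embed, lm, head):
    for p in module.parameters():
        p.requires_grad = False

vpg = ToyVPG()
patches = torch.randn(4, 49, 256)                 # fake vision-encoder features
caption = torch.randint(0, vocab, (4, 12))        # fake caption token ids

soft_tokens = vpg(patches)                        # the only trainable path
hidden = lm(torch.cat([soft_tokens, embed(caption)], dim=1))
logits = head(hidden)[:, soft_tokens.size(1):, :]

# Image-captioning objective (causal shifting omitted for brevity).
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab), caption.reshape(-1))
loss.backward()                                   # gradients reach only the VPG
```

Because the loss gradients reach only the VPG parameters, a caption-sufficient summary of the image is all the VPG is pressured to produce — which is the bias the abstract points out.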
Dynamic graph-based data are ubiquitous in the real world, such as social networks, financial systems, and traffic flows. Fast and accurately detecting anomalies in these dynamic graphs is of vital importance. However, despite the promising results that current anomaly detection methods have achieved, there are two major limitations when coping with dynamic graphs. The first limitation is that topological structures and temporal dynamics have been modeled separately, resulting in less expressive features for anomaly detection. The second is that models...
Prompt tuning is a parameter-efficient method that learns soft prompts and conditions frozen language models to perform specific downstream tasks. Though effective, prompt tuning under few-shot settings on the one hand heavily relies on a good initialization of the soft prompts; on the other hand, it can easily overfit the training samples, thereby undermining generalizability. Existing works leverage pre-training or supervised meta-learning to initialize soft prompts, but they fail to data-efficiently generalize to unseen tasks. To address the above...
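For reference, the mechanism in the first sentence — trainable soft prompts prepended to the input of a frozen model — is itself very small. A minimal sketch with stand-in modules (the classifier head, sizes, and readout position are assumptions for illustration, not this paper's few-shot setup):

```python
import torch
import torch.nn as nn

dim, vocab, n_prompt, n_classes = 128, 500, 10, 4

# Frozen "language model": embedding table + encoder + classifier head.
embed = nn.Embedding(vocab, dim)
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(dim, 4, batch_first=True), 2)
classifier = nn.Linear(dim, n_classes)
for m in (embed, encoder, classifier):
    for p in m.parameters():
        p.requires_grad = False

# The only trainable parameters: a handful of soft prompt vectors.
soft_prompt = nn.Parameter(torch.randn(n_prompt, dim) * 0.02)
opt = torch.optim.Adam([soft_prompt], lr=1e-3)

def forward(token_ids):                      # token_ids: (B, L)
    x = embed(token_ids)                     # (B, L, dim)
    p = soft_prompt.expand(x.size(0), -1, -1)
    h = encoder(torch.cat([p, x], dim=1))    # prepend prompts, run frozen LM
    return classifier(h[:, 0])               # read out at the first prompt position

# One few-shot training step on fake data.
ids = torch.randint(0, vocab, (8, 16))
labels = torch.randint(0, n_classes, (8,))
loss = nn.functional.cross_entropy(forward(ids), labels)
opt.zero_grad(); loss.backward(); opt.step()
```

Because `soft_prompt` is the only tensor the optimizer sees, everything the few-shot data can influence is concentrated in those few vectors — which is why both the initialization and the overfitting issues above are so acute.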
Recent studies have shown that dense retrieval models, lacking dedicated training data, struggle to perform well across diverse tasks, as different tasks often entail distinct search intents. To address this challenge, in this work we introduce ControlRetriever, a generic and efficient approach with a parameter-isolated architecture, capable of directly controlling dense retrieval models to perform varied tasks, harnessing the power of instructions that explicitly describe retrieval intents in natural language. Leveraging the foundation of ControlNet, which...
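ControlRetriever's ControlNet-style, parameter-isolated design is not reproduced here, but the core idea — conditioning a dense retriever on a natural-language description of the search intent — can be illustrated with a toy bi-encoder (`ToyEncoder` and the concatenation scheme are assumptions made for the sketch, not the paper's method):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoder(nn.Module):
    """Mean-pooled embedding encoder standing in for a real dense retriever."""
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)

    def forward(self, ids):                       # ids: (B, L)
        return F.normalize(self.embed(ids).mean(dim=1), dim=-1)

enc = ToyEncoder()
instruction = torch.randint(0, 1000, (1, 12))     # token ids of an intent description
query = torch.randint(0, 1000, (1, 8))
docs = torch.randint(0, 1000, (5, 32))

# Conditioning by concatenation: the instruction describing the search intent
# is prepended to the query before encoding, so the same query can be steered
# toward different intents by swapping the instruction.
q_vec = enc(torch.cat([instruction.expand(query.size(0), -1), query], dim=1))
d_vecs = enc(docs)
scores = q_vec @ d_vecs.T                         # cosine similarities, (1, 5)
ranking = scores.argsort(dim=-1, descending=True)
```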
Recent years have seen a surge of interest in anomaly detection for tackling industrial defect detection, event detection, etc. However, existing unsupervised anomaly detectors, particularly those for the vision modality, face significant challenges due to redundant information and a sparse latent space. Conversely, the language modality performs well thanks to its comparatively compact, low-redundancy data. This paper tackles the aforementioned challenges from a multimodal point of view. Specifically, we propose Cross-modal Guidance (CMG), which consists of Entropy...
For multimodal LLMs, the synergy of visual comprehension (textual output) and generation (visual output) presents an ongoing challenge. This is due to a conflicting objective: for comprehension, the MLLM needs to abstract the visuals; for generation, it needs to preserve the visuals as much as possible. Thus, the objective poses a dilemma for visual tokens. To resolve the conflict, we propose encoding images into morph-tokens, which serve a dual purpose: they act as visual prompts instructing the MLLM to generate texts; and they take on a different, non-conflicting role as complete...
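To make the objective dilemma concrete before the fix: a single shared token sequence trained under both a text loss (which rewards abstraction) and a pixel-reconstruction loss (which rewards preserving detail) is pulled in opposite directions. A toy sketch of that conflicting setup (all modules and sizes are illustrative, not the morph-token mechanism itself):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, vocab = 64, 200
tokenizer = nn.Linear(3 * 16 * 16, dim)        # flattened patches -> visual tokens
text_head = nn.Linear(dim, vocab)              # comprehension branch (stand-in LM head)
pixel_head = nn.Linear(dim, 3 * 16 * 16)       # generation branch (reconstruction)

patches = torch.randn(4, 10, 3 * 16 * 16)      # 4 images, 10 flattened patches each
caption = torch.randint(0, vocab, (4, 10))

tokens = tokenizer(patches)                    # one shared token sequence, (4, 10, dim)

# Comprehension: tokens drive text prediction, which favors abstracting away detail.
loss_text = F.cross_entropy(text_head(tokens).reshape(-1, vocab), caption.reshape(-1))

# Generation: the same tokens must carry enough detail to rebuild the pixels.
loss_recon = F.mse_loss(pixel_head(tokens), patches)

loss = loss_text + loss_recon                  # the two roles fight over one representation
loss.backward()
```

Morph-tokens, as described above, avoid this tug-of-war by letting the tokens play a different, non-conflicting role in each branch.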
The swift advancement in Multimodal LLMs (MLLMs) also presents significant challenges for effective knowledge editing. Current methods, including intrinsic editing and external resorting, each possess strengths and weaknesses, struggling to balance the desired properties of reliability, generality, and locality when applied to MLLMs. In this paper, we propose UniKE, a novel multimodal editing method that establishes a unified perspective and paradigm for intrinsic editing and external resorting. Both types of knowledge are conceptualized as vectorized key-value...
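Treating knowledge as vectorized key-value memories has a standard concrete form: a linear layer W acts as an associative memory mapping keys to values, and an edit is a minimal update forcing W to return a new value for a given key. A small self-contained sketch of such a rank-one edit (a generic technique, not necessarily UniKE's exact update rule):

```python
import torch

def rank_one_edit(W: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Minimally (in Frobenius norm) update memory W so that W_new @ k == v.

    W: (d_out, d_in) linear associative memory; k: (d_in,) key; v: (d_out,) value.
    """
    residual = v - W @ k                          # what the memory currently gets wrong
    return W + torch.outer(residual, k) / (k @ k)

torch.manual_seed(0)
W = torch.randn(8, 16)                            # pretrained "knowledge" weights
k, v = torch.randn(16), torch.randn(8)            # a new fact as a key-value pair
W_new = rank_one_edit(W, k, v)
assert torch.allclose(W_new @ k, v, atol=1e-5)    # the edit is now stored

# Locality check: directions orthogonal to k are untouched by the update.
k_orth = torch.randn(16); k_orth -= (k_orth @ k) / (k @ k) * k
assert torch.allclose(W @ k_orth, W_new @ k_orth, atol=1e-5)
```

The two asserts double as a reliability/locality check: the edited key now returns the new value, while orthogonal directions are unaffected — two of the properties the abstract says editing methods must balance.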
In recent times, Vision-Language Models (VLMs) have been trained under two predominant paradigms. Generative training has enabled Multimodal Large Language Models (MLLMs) to tackle various complex tasks, yet issues such as hallucinations and weak object discrimination persist. Discriminative training, exemplified by models like CLIP, excels in zero-shot image-text classification and retrieval, yet struggles with scenarios requiring fine-grained semantic differentiation. This paper addresses these...
Instruction-based image editing aims to modify specific image elements following natural language instructions. However, current models in this domain often struggle to accurately execute complex user instructions, as they are trained on low-quality data with limited editing types. We present AnyEdit, a comprehensive multi-modal instruction editing dataset, comprising 2.5 million high-quality editing pairs spanning over 20 editing types and five domains. We ensure the diversity and quality of the AnyEdit collection through three aspects: initial...
Video Large Language Models (Video-LLMs) have recently shown strong performance in basic video understanding tasks, such as captioning and coarse-grained question answering, but struggle with compositional reasoning that requires multi-step spatio-temporal inference across object relations, interactions, and events. The hurdles to enhancing this capability include extensive manual labor, the lack of compositionality in existing data, and the absence of explicit reasoning supervision. In this paper, we propose STEP, a novel...
Digital agents are increasingly employed to automate tasks in interactive digital environments such as web pages, software applications, and operating systems. While text-based agents built on Large Language Models (LLMs) often require frequent updates due to platform-specific APIs, visual agents leveraging Multimodal Large Language Models (MLLMs) offer enhanced adaptability by interacting directly with Graphical User Interfaces (GUIs). However, these agents face significant challenges in visual perception, particularly when handling...