- Multimodal Machine Learning Applications
- Topic Modeling
- Natural Language Processing Techniques
- Domain Adaptation and Few-Shot Learning
- Anomaly Detection Techniques and Applications
- Speech Recognition and Synthesis
- Advanced Image and Video Retrieval Techniques
- Semantic Web and Ontologies
- Medical Image Segmentation Techniques
- Network Security and Intrusion Detection
- Usability and User Interface Design
- Interactive and Immersive Displays
- Data-Driven Disease Surveillance
- Advanced Graph Neural Networks
- Image Retrieval and Classification Techniques
- Complex Network Analysis Techniques
- Speech and Dialogue Systems
Zhejiang University
2023-2024
Alibaba Group (Cayman Islands)
2023
Recent advancements in Multimodal Large Language Models (MLLMs) have been utilizing Visual Prompt Generators (VPGs) to convert visual features into tokens that LLMs can recognize. This is achieved by training the VPGs on millions of image-caption pairs, where VPG-generated tokens are fed into a frozen LLM to generate the corresponding captions. However, this image-captioning-based objective inherently biases the VPG to concentrate solely on the primary visual contents sufficient for caption generation, often neglecting...
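The training setup this abstract describes — a small trainable VPG feeding soft tokens into a frozen LLM under a captioning loss — reduces to a few lines. A minimal sketch, assuming toy dimensions, a stand-in frozen LM, and an attention-pooling VPG (`ToyVPG` and all sizes here are hypothetical, not the paper's architecture):

```python
import torch
import torch.nn as nn

class ToyVPG(nn.Module):
    """Hypothetical Visual Prompt Generator: attention-pools patch features
    into a fixed number of soft tokens in the LLM's embedding space."""
    def __init__(self, vis_dim=256, llm_dim=512, n_tokens=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_tokens, llm_dim))
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, patch_feats):                       # (B, N, vis_dim)
        v = self.proj(patch_feats)                        # (B, N, llm_dim)
        attn = torch.softmax(self.queries @ v.transpose(1, 2), dim=-1)
        return attn @ v                                   # (B, n_tokens, llm_dim)

# Stand-in for the frozen LLM: embeddings + encoder + LM head, none trainable.
vocab, llm_dim = 1000, 512
embed = nn.Embedding(vocab, llm_dim)
lm = nn.TransformerEncoder(nn.TransformerEncoderLayer(llm_dim, 8, batch_first=True), 2)
head = nn.Linear(llm_dim, vocab)
for module in (embed, lm, head):
    for p in module.parameters():
        p.requires_grad = False

vpg = ToyVPG()
patches = torch.randn(4, 49, 256)                 # fake vision-encoder features
caption = torch.randint(0, vocab, (4, 12))        # fake caption token ids

soft_tokens = vpg(patches)                        # the only trainable path
hidden = lm(torch.cat([soft_tokens, embed(caption)], dim=1))
logits = head(hidden)[:, soft_tokens.size(1):, :]

# Image-captioning objective (causal shifting omitted for brevity).
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab), caption.reshape(-1))
loss.backward()                                   # gradients reach only the VPG
```

Because the loss gradients reach only the VPG parameters, a caption-sufficient summary of the image is all the VPG is pressured to produce — which is the bias the abstract points out.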
Dynamic graph-based data are ubiquitous in the real world, such as social networks, financial systems, and traffic flows. Fast and accurately detecting anomalies in these dynamic graphs is of vital importance. However, despite the promising results that current anomaly detection methods have achieved, there are two major limitations when coping with dynamic graphs. The first limitation is that topological structures and temporal dynamics have been modeled separately, resulting in less expressive features for anomaly detection. The second is that models...
Prompt tuning is a parameter-efficient method that learns soft prompts and conditions frozen language models to perform specific downstream tasks. Though effective, prompt tuning under few-shot settings on the one hand heavily relies on a good initialization of the soft prompts; on the other hand, it can easily overfit the training samples, thereby undermining generalizability. Existing works leverage pre-training or supervised meta-learning to initialize soft prompts, but they fail to data-efficiently generalize to unseen tasks. To address the above...
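For reference, the mechanism in the first sentence — trainable soft prompts prepended to the input of a frozen model — is itself very small. A minimal sketch with stand-in modules (the classifier head, sizes, and readout position are assumptions for illustration, not this paper's few-shot setup):

```python
import torch
import torch.nn as nn

dim, vocab, n_prompt, n_classes = 128, 500, 10, 4

# Frozen "language model": embedding table + encoder + classifier head.
embed = nn.Embedding(vocab, dim)
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(dim, 4, batch_first=True), 2)
classifier = nn.Linear(dim, n_classes)
for m in (embed, encoder, classifier):
    for p in m.parameters():
        p.requires_grad = False

# The only trainable parameters: a handful of soft prompt vectors.
soft_prompt = nn.Parameter(torch.randn(n_prompt, dim) * 0.02)
opt = torch.optim.Adam([soft_prompt], lr=1e-3)

def forward(token_ids):                      # token_ids: (B, L)
    x = embed(token_ids)                     # (B, L, dim)
    p = soft_prompt.expand(x.size(0), -1, -1)
    h = encoder(torch.cat([p, x], dim=1))    # prepend prompts, run frozen LM
    return classifier(h[:, 0])               # read out at the first prompt position

# One few-shot training step on fake data.
ids = torch.randint(0, vocab, (8, 16))
labels = torch.randint(0, n_classes, (8,))
loss = nn.functional.cross_entropy(forward(ids), labels)
opt.zero_grad(); loss.backward(); opt.step()
```

Because `soft_prompt` is the only tensor the optimizer sees, everything the few-shot data can influence is concentrated in those few vectors — which is why both the initialization and the overfitting issues above are so acute.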
Recent studies have shown that dense retrieval models, lacking dedicated training data, struggle to perform well across diverse tasks, as different tasks often entail distinct search intents. To address this challenge, in this work we introduce ControlRetriever, a generic and efficient approach with a parameter-isolated architecture, capable of directly controlling dense retrieval models to perform varied tasks, harnessing the power of instructions that explicitly describe retrieval intents in natural language. Leveraging the foundation of ControlNet, which...
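ControlRetriever's ControlNet-style, parameter-isolated design is not reproduced here, but the core idea — conditioning a dense retriever on a natural-language description of the search intent — can be illustrated with a toy bi-encoder (`ToyEncoder` and the concatenation scheme are assumptions made for the sketch, not the paper's method):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoder(nn.Module):
    """Mean-pooled embedding encoder standing in for a real dense retriever."""
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)

    def forward(self, ids):                       # ids: (B, L)
        return F.normalize(self.embed(ids).mean(dim=1), dim=-1)

enc = ToyEncoder()
instruction = torch.randint(0, 1000, (1, 12))     # token ids of an intent description
query = torch.randint(0, 1000, (1, 8))
docs = torch.randint(0, 1000, (5, 32))

# Conditioning by concatenation: the instruction describing the search intent
# is prepended to the query before encoding, so the same query can be steered
# toward different intents by swapping the instruction.
q_vec = enc(torch.cat([instruction.expand(query.size(0), -1), query], dim=1))
d_vecs = enc(docs)
scores = q_vec @ d_vecs.T                         # cosine similarities, (1, 5)
ranking = scores.argsort(dim=-1, descending=True)
```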
Recent years have seen a surge of interest in anomaly detection for tackling industrial defect detection, event detection, etc. However, existing unsupervised anomaly detectors, particularly those for the vision modality, face significant challenges due to redundant information and a sparse latent space. Conversely, the language modality performs well thanks to its comparatively compact, low-redundancy data. This paper tackles the aforementioned challenges from a multimodal point of view. Specifically, we propose Cross-modal Guidance (CMG), which consists of Entropy...
For multimodal LLMs, the synergy of visual comprehension (textual output) and generation (visual output) presents an ongoing challenge. This is due to a conflicting objective: for comprehension, the MLLM needs to abstract the visuals; for generation, it needs to preserve the visuals as much as possible. Thus, the objective poses a dilemma for visual tokens. To resolve the conflict, we propose encoding images into morph-tokens, which serve a dual purpose: they act as visual prompts instructing the MLLM to generate texts; and they take on a different, non-conflicting role as complete...
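To make the objective dilemma concrete before the fix: a single shared token sequence trained under both a text loss (which rewards abstraction) and a pixel-reconstruction loss (which rewards preserving detail) is pulled in opposite directions. A toy sketch of that conflicting setup (all modules and sizes are illustrative, not the morph-token mechanism itself):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, vocab = 64, 200
tokenizer = nn.Linear(3 * 16 * 16, dim)        # flattened patches -> visual tokens
text_head = nn.Linear(dim, vocab)              # comprehension branch (stand-in LM head)
pixel_head = nn.Linear(dim, 3 * 16 * 16)       # generation branch (reconstruction)

patches = torch.randn(4, 10, 3 * 16 * 16)      # 4 images, 10 flattened patches each
caption = torch.randint(0, vocab, (4, 10))

tokens = tokenizer(patches)                    # one shared token sequence, (4, 10, dim)

# Comprehension: tokens drive text prediction, which favors abstracting away detail.
loss_text = F.cross_entropy(text_head(tokens).reshape(-1, vocab), caption.reshape(-1))

# Generation: the same tokens must carry enough detail to rebuild the pixels.
loss_recon = F.mse_loss(pixel_head(tokens), patches)

loss = loss_text + loss_recon                  # the two roles fight over one representation
loss.backward()
```

Morph-tokens, as described above, avoid this tug-of-war by letting the tokens play a different, non-conflicting role in each branch.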
The swift advancement in Multimodal LLMs (MLLMs) also presents significant challenges for effective knowledge editing. Current methods, including intrinsic editing and external resorting, each possess strengths and weaknesses, struggling to balance the desired properties of reliability, generality, and locality when applied to MLLMs. In this paper, we propose UniKE, a novel multimodal editing method that establishes a unified perspective and paradigm for intrinsic editing and external resorting. Both types of knowledge are conceptualized as vectorized key-value...
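Treating knowledge as vectorized key-value memories has a standard concrete form: a linear layer W acts as an associative memory mapping keys to values, and an edit is a minimal update forcing W to return a new value for a given key. A small self-contained sketch of such a rank-one edit (a generic technique, not necessarily UniKE's exact update rule):

```python
import torch

def rank_one_edit(W: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Minimally (in Frobenius norm) update memory W so that W_new @ k == v.

    W: (d_out, d_in) linear associative memory; k: (d_in,) key; v: (d_out,) value.
    """
    residual = v - W @ k                          # what the memory currently gets wrong
    return W + torch.outer(residual, k) / (k @ k)

torch.manual_seed(0)
W = torch.randn(8, 16)                            # pretrained "knowledge" weights
k, v = torch.randn(16), torch.randn(8)            # a new fact as a key-value pair
W_new = rank_one_edit(W, k, v)
assert torch.allclose(W_new @ k, v, atol=1e-5)    # the edit is now stored

# Locality check: directions orthogonal to k are untouched by the update.
k_orth = torch.randn(16); k_orth -= (k_orth @ k) / (k @ k) * k
assert torch.allclose(W @ k_orth, W_new @ k_orth, atol=1e-5)
```

The two asserts double as a reliability/locality check: the edited key now returns the new value, while orthogonal directions are unaffected — two of the properties the abstract says editing methods must balance.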
In recent times, Vision-Language Models (VLMs) have been trained under two predominant paradigms. Generative training has enabled Multimodal Large Language Models (MLLMs) to tackle various complex tasks, yet issues such as hallucinations and weak object discrimination persist. Discriminative training, exemplified by models like CLIP, excels in zero-shot image-text classification and retrieval, yet struggles with scenarios requiring fine-grained semantic differentiation. This paper addresses these...
Instruction-based image editing aims to modify specific image elements following natural language instructions. However, current models in this domain often struggle to accurately execute complex user instructions, as they are trained on low-quality data with limited editing types. We present AnyEdit, a comprehensive multi-modal instruction editing dataset, comprising 2.5 million high-quality editing pairs spanning over 20 editing types and five domains. We ensure the diversity and quality of the AnyEdit collection through three aspects: initial...
Video Large Language Models (Video-LLMs) have recently shown strong performance in basic video understanding tasks, such as captioning and coarse-grained question answering, but struggle with compositional reasoning that requires multi-step spatio-temporal inference across object relations, interactions, and events. The hurdles to enhancing this capability include extensive manual labor, the lack of compositionality in existing data, and the absence of explicit reasoning supervision. In this paper, we propose STEP, a novel...
Digital agents are increasingly employed to automate tasks in interactive digital environments such as web pages, software applications, and operating systems. While text-based agents built on Large Language Models (LLMs) often require frequent updates due to platform-specific APIs, visual agents leveraging Multimodal Large Language Models (MLLMs) offer enhanced adaptability by interacting directly with Graphical User Interfaces (GUIs). However, these agents face significant challenges in visual perception, particularly when handling...