Kaihang Pan

ORCID: 0009-0001-2967-4573
Research Areas
  • Multimodal Machine Learning Applications
  • Topic Modeling
  • Natural Language Processing Techniques
  • Domain Adaptation and Few-Shot Learning
  • Anomaly Detection Techniques and Applications
  • Speech Recognition and Synthesis
  • Advanced Image and Video Retrieval Techniques
  • Semantic Web and Ontologies
  • Medical Image Segmentation Techniques
  • Network Security and Intrusion Detection
  • Usability and User Interface Design
  • Interactive and Immersive Displays
  • Data-Driven Disease Surveillance
  • Advanced Graph Neural Networks
  • Image Retrieval and Classification Techniques
  • Complex Network Analysis Techniques
  • Speech and dialogue systems

Zhejiang University
2023-2024

Alibaba Group (Cayman Islands)
2023

Recent advancements in Multimodal Large Language Models (MLLMs) have been utilizing Visual Prompt Generators (VPGs) to convert visual features into tokens that LLMs can recognize. This is achieved by training the VPGs on millions of image-caption pairs, where the VPG-generated tokens of images are fed into a frozen LLM to generate the corresponding captions. However, this image-captioning based objective inherently biases the VPG to concentrate solely on the primary contents sufficient for caption generation, often neglecting...

10.48550/arxiv.2308.04152 preprint EN other-oa arXiv (Cornell University) 2023-01-01
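
A minimal, hedged sketch of the VPG-style pipeline this abstract describes: a small trainable projector turns frozen vision-encoder features into soft tokens that a frozen LLM consumes as a prefix under a captioning loss. The query-attention projector and all names below are illustrative assumptions, not the paper's actual components.

```python
# Hedged sketch of a VPG-style pipeline: a lightweight projector turns frozen
# vision-encoder features into "visual tokens" that a frozen LLM consumes as a
# prefix; only the projector is trained, with an image-captioning objective.
import torch
import torch.nn as nn

class VisualPromptGenerator(nn.Module):
    def __init__(self, vision_dim: int, llm_dim: int, num_tokens: int = 32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, vision_dim) * 0.02)
        self.attn = nn.MultiheadAttention(vision_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_feats):                      # (B, N_patches, vision_dim)
        b = vision_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)   # learnable queries
        pooled, _ = self.attn(q, vision_feats, vision_feats)
        return self.proj(pooled)                          # (B, num_tokens, llm_dim)

def caption_loss(frozen_llm, vpg, vision_feats, caption_ids):
    visual_tokens = vpg(vision_feats)                               # soft visual prefix
    text_embeds = frozen_llm.get_input_embeddings()(caption_ids)
    inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
    prefix_labels = torch.full(visual_tokens.shape[:2], -100,      # ignore prefix in loss
                               dtype=caption_ids.dtype)
    labels = torch.cat([prefix_labels, caption_ids], dim=1)
    return frozen_llm(inputs_embeds=inputs_embeds, labels=labels).loss
```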

Dynamic graph-based data are ubiquitous in the real world, such as social networks, finance systems, and traffic flow. Fast and accurately detecting anomalies in these dynamic graphs is of vital importance. However, despite the promising results current anomaly detection methods have achieved, there are two major limitations when coping with dynamic graphs. The first limitation is that topological structures and temporal dynamics have been modeled separately, resulting in less expressive features for anomaly detection. The second is that models...

10.1109/tkde.2023.3328645 article EN IEEE Transactions on Knowledge and Data Engineering 2023-10-30
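
To make the first limitation concrete, here is an illustrative sketch (not the paper's model) of encoding topology and temporal dynamics jointly: a simple graph convolution per snapshot, a GRU across snapshots, and an edge-scoring head. All module names are placeholders.

```python
# Joint structural + temporal encoding for dynamic-graph anomaly detection (sketch).
import torch
import torch.nn as nn

class DynamicGraphEncoder(nn.Module):
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.gcn = nn.Linear(in_dim, hid_dim)
        self.gru = nn.GRU(hid_dim, hid_dim, batch_first=True)
        self.scorer = nn.Sequential(nn.Linear(2 * hid_dim, hid_dim),
                                    nn.ReLU(), nn.Linear(hid_dim, 1))

    def forward(self, adjs, feats):
        # adjs: (T, N, N) normalized adjacency per snapshot; feats: (T, N, in_dim)
        per_step = []
        for A, X in zip(adjs, feats):
            per_step.append(torch.relu(A @ self.gcn(X)))    # structural encoding
        H = torch.stack(per_step, dim=1)                     # (N, T, hid_dim)
        H, _ = self.gru(H)                                   # temporal encoding
        return H[:, -1]                                      # latest node states

    def edge_anomaly_score(self, node_states, src, dst):
        pair = torch.cat([node_states[src], node_states[dst]], dim=-1)
        return torch.sigmoid(self.scorer(pair)).squeeze(-1)  # higher = more anomalous
```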

Prompt tuning is a parameter-efficient method, which learns soft prompts and conditions frozen language models to perform specific downstream tasks. Though effective, prompt tuning under few-shot settings on the one hand heavily relies on a good initialization of soft prompts. On the other hand, it can easily overfit the training samples, thereby undermining generalizability. Existing works leverage pre-training or supervised meta-learning to initialize soft prompts, but they fail to data-efficiently generalize to unseen tasks. To address the above...

10.18653/v1/2023.findings-emnlp.75 article EN cc-by 2023-01-01
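
A minimal sketch of the soft-prompt mechanism referenced in this abstract, assuming a HuggingFace-style frozen causal LM; `gpt2`, the prompt length, and the learning rate are placeholders, and only the prompt embeddings receive gradients.

```python
# Soft prompt tuning: learnable prompt embeddings are prepended to the input
# embeddings of a frozen language model; only the prompt is optimized.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder backbone, not the paper's model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
for p in model.parameters():            # freeze the backbone
    p.requires_grad = False

num_prompt_tokens = 20
embed_dim = model.get_input_embeddings().embedding_dim
soft_prompt = nn.Parameter(torch.randn(num_prompt_tokens, embed_dim) * 0.02)

def forward_with_prompt(input_ids, labels):
    tok_embeds = model.get_input_embeddings()(input_ids)           # (B, T, D)
    batch = tok_embeds.size(0)
    prompt = soft_prompt.unsqueeze(0).expand(batch, -1, -1)        # (B, P, D)
    inputs_embeds = torch.cat([prompt, tok_embeds], dim=1)         # (B, P+T, D)
    prompt_labels = torch.full((batch, num_prompt_tokens), -100,   # ignore prompt in loss
                               dtype=labels.dtype)
    full_labels = torch.cat([prompt_labels, labels], dim=1)
    return model(inputs_embeds=inputs_embeds, labels=full_labels)

optimizer = torch.optim.AdamW([soft_prompt], lr=3e-2)  # only the prompt is trained
```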

Recent studies have shown that dense retrieval models, lacking dedicated training data, struggle to perform well across diverse retrieval tasks, as different tasks often entail distinct search intents. To address this challenge, in this work we introduce ControlRetriever, a generic and efficient approach with a parameter-isolated architecture, capable of controlling dense retrieval models to directly perform varied retrieval tasks, harnessing the power of instructions that explicitly describe search intents in natural language. Leveraging the foundation of ControlNet, which...

10.48550/arxiv.2308.10025 preprint EN other-oa arXiv (Cornell University) 2023-01-01
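
An illustrative sketch of parameter-isolated, instruction-controlled dense retrieval in a ControlNet-like style: the base query encoder stays frozen while a trainable side branch injects instruction information through a zero-initialized projection. The encoder checkpoint, pooling, and dimensions below are assumptions, not ControlRetriever's actual design.

```python
# Instruction-controlled dense retrieval with a frozen base encoder and a
# trainable, zero-initialized side branch (ControlNet-style parameter isolation).
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

base_name = "sentence-transformers/all-MiniLM-L6-v2"   # placeholder dual encoder
tokenizer = AutoTokenizer.from_pretrained(base_name)
base_encoder = AutoModel.from_pretrained(base_name)
for p in base_encoder.parameters():
    p.requires_grad = False                              # frozen retrieval backbone

control_branch = AutoModel.from_pretrained(base_name)   # trainable copy
zero_proj = nn.Linear(384, 384)                          # 384 = MiniLM hidden size
nn.init.zeros_(zero_proj.weight); nn.init.zeros_(zero_proj.bias)

def encode_query(query: str, instruction: str) -> torch.Tensor:
    q = tokenizer(query, return_tensors="pt")
    qi = tokenizer(instruction + " " + query, return_tensors="pt")
    base = base_encoder(**q).last_hidden_state[:, 0]          # frozen [CLS] embedding
    control = control_branch(**qi).last_hidden_state[:, 0]    # instruction-aware branch
    return base + zero_proj(control)   # starts identical to the base retriever
```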

Recent years have seen a surge of interest in anomaly detection for tackling industrial defect detection, event detection, etc. However, existing unsupervised anomaly detectors, particularly those for the vision modality, face significant challenges due to redundant information and a sparse latent space. Conversely, the language modality performs well due to its relatively single data. This paper tackles the aforementioned challenges from a multimodal point of view. Specifically, we propose Cross-modal Guidance (CMG), which consists of Entropy...

10.48550/arxiv.2310.02821 preprint EN other-oa arXiv (Cornell University) 2023-01-01

For multimodal LLMs, the synergy of visual comprehension (textual output) and generation (visual output) presents an ongoing challenge. This is due to a conflicting objective: for comprehension, the MLLM needs to abstract the visuals; for generation, it needs to preserve the visuals as much as possible. Thus, the objective poses a dilemma for visual-tokens. To resolve the conflict, we propose encoding images into morph-tokens that serve a dual purpose: they act as visual prompts instructing the MLLM to generate texts; and they take on a different, non-conflicting role as complete...

10.48550/arxiv.2405.01926 preprint EN arXiv (Cornell University) 2024-05-03
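
A conceptual sketch of the dual-purpose idea above: one token sequence feeds both a comprehension loss (as a soft prompt to a frozen LLM) and a generation loss (reconstructing image latents), so the two objectives no longer compete over the same representation. All functions and names are hypothetical, not the paper's architecture.

```python
# One visual-token sequence, two non-conflicting objectives (sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

def dual_purpose_losses(frozen_llm, morph_tokens, caption_ids,
                        image_latents, recon_head: nn.Module):
    # comprehension branch: tokens act as a soft prefix for caption generation
    text_embeds = frozen_llm.get_input_embeddings()(caption_ids)
    inputs_embeds = torch.cat([morph_tokens, text_embeds], dim=1)
    prefix_labels = torch.full(morph_tokens.shape[:2], -100,
                               dtype=caption_ids.dtype)
    comprehension_loss = frozen_llm(
        inputs_embeds=inputs_embeds,
        labels=torch.cat([prefix_labels, caption_ids], dim=1)).loss
    # generation branch: the same tokens are decoded back toward image latents
    generation_loss = F.mse_loss(recon_head(morph_tokens), image_latents)
    return comprehension_loss + generation_loss
```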

10.1145/3626772.3657745 article Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval 2024-07-10

The swift advancement in Multimodal LLMs (MLLMs) also presents significant challenges for effective knowledge editing. Current methods, including intrinsic knowledge editing and external knowledge resorting, each possess strengths and weaknesses, struggling to balance the desired properties of reliability, generality, and locality when applied to MLLMs. In this paper, we propose UniKE, a novel multimodal editing method that establishes a unified perspective and paradigm for intrinsic knowledge editing and external knowledge resorting. Both types of knowledge are conceptualized as vectorized key-value...

10.48550/arxiv.2409.19872 preprint EN arXiv (Cornell University) 2024-09-29
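
A simplified sketch of viewing knowledge as vectorized key-value pairs: a feed-forward layer is read as a key-value memory, and an "edit" appends an extra key/value vector that fires on the edited input pattern. This illustrates the unified perspective only and is not UniKE itself.

```python
# Feed-forward layer as key-value memory, with edits as appended key/value pairs.
import torch
import torch.nn as nn

class KeyValueMemoryFFN(nn.Module):
    def __init__(self, d_model, d_mem):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(d_mem, d_model) * 0.02)    # "keys" (input weights)
        self.values = nn.Parameter(torch.randn(d_mem, d_model) * 0.02)  # "values" (output weights)

    def forward(self, h):                                # h: (B, d_model)
        scores = torch.relu(h @ self.keys.t())           # which memories activate
        return scores @ self.values                      # weighted sum of values

    @torch.no_grad()
    def insert_edit(self, new_key: torch.Tensor, new_value: torch.Tensor):
        # append one key-value pair encoding the edited fact
        self.keys = nn.Parameter(torch.cat([self.keys, new_key[None]], dim=0))
        self.values = nn.Parameter(torch.cat([self.values, new_value[None]], dim=0))
```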

In recent times, Vision-Language Models (VLMs) have been trained under two predominant paradigms. Generative training has enabled Multimodal Large Language Models (MLLMs) to tackle various complex tasks, yet issues such as hallucinations and weak object discrimination persist. Discriminative training, exemplified by models like CLIP, excels in zero-shot image-text classification and retrieval, yet struggles with complex scenarios requiring fine-grained semantic differentiation. This paper addresses these...

10.48550/arxiv.2411.00304 preprint EN arXiv (Cornell University) 2024-10-31
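
An illustrative sketch of combining the two paradigms mentioned above: a generative next-token loss plus a discriminative InfoNCE contrastive loss over pooled image/text embeddings. The weighting and pooling are assumptions, not the paper's recipe.

```python
# Joint generative + discriminative objective (sketch).
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))   # symmetric InfoNCE

def joint_loss(generative_lm_loss, img_emb, txt_emb, alpha=1.0):
    # generative_lm_loss: the usual captioning / next-token loss from the MLLM
    return generative_lm_loss + alpha * contrastive_loss(img_emb, txt_emb)
```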

Instruction-based image editing aims to modify specific image elements with natural language instructions. However, current models in this domain often struggle to accurately execute complex user instructions, as they are trained on low-quality data with limited editing types. We present AnyEdit, a comprehensive multi-modal instruction editing dataset, comprising 2.5 million high-quality editing pairs spanning over 20 editing types and five domains. We ensure the diversity and quality of the AnyEdit collection through three aspects: initial...

10.48550/arxiv.2411.15738 preprint EN arXiv (Cornell University) 2024-11-24

Video Large Language Models (Video-LLMs) have recently shown strong performance in basic video understanding tasks, such as captioning and coarse-grained question answering, but struggle with compositional reasoning that requires multi-step spatio-temporal inference across object relations, interactions, and events. The hurdles to enhancing this capability include extensive manual labor, the lack of compositionality in existing data, and the absence of explicit reasoning supervision. In this paper, we propose STEP, a novel...

10.48550/arxiv.2412.00161 preprint EN arXiv (Cornell University) 2024-11-29

Digital agents are increasingly employed to automate tasks in interactive digital environments such as web pages, software applications, and operating systems. While text-based agents built on Large Language Models (LLMs) often require frequent updates due to platform-specific APIs, visual agents leveraging Multimodal LLMs (MLLMs) offer enhanced adaptability by interacting directly with Graphical User Interfaces (GUIs). However, these agents face significant challenges in visual perception, particularly when handling...

10.48550/arxiv.2412.10342 preprint EN arXiv (Cornell University) 2024-12-13

Prompt tuning is a parameter-efficient method, which learns soft prompts and conditions frozen language models to perform specific downstream tasks. Though effective, prompt tuning under few-shot settings on the one hand heavily relies on a good initialization of soft prompts. On the other hand, it can easily overfit the training samples, thereby undermining generalizability. Existing works leverage pre-training or supervised meta-learning to initialize soft prompts, but they fail to data-efficiently generalize to unseen tasks. To address the above...

10.48550/arxiv.2303.12314 preprint EN other-oa arXiv (Cornell University) 2023-01-01