Xuwu Wang

ORCID: 0000-0003-3363-570X
Research Areas
  • Topic Modeling
  • Multimodal Machine Learning Applications
  • Natural Language Processing Techniques
  • Advanced Graph Neural Networks
  • Advanced Image and Video Retrieval Techniques
  • Domain Adaptation and Few-Shot Learning
  • Image Retrieval and Classification Techniques
  • Data Quality and Management
  • Biomedical Text Mining and Ontologies
  • Bayesian Modeling and Causal Inference
  • Human Pose and Action Recognition
  • Machine Learning in Healthcare
  • Web Data Mining and Analysis
  • Mental Health via Writing
  • Complex Network Analysis Techniques
  • Anomaly Detection Techniques and Applications
  • Context-Aware Activity Recognition Systems
  • Gaussian Processes and Bayesian Inference
  • Data Mining Algorithms and Applications
  • Machine Learning and Data Classification
  • Antiplatelet Therapy and Cardiovascular Diseases
  • Software Engineering Research
  • Atrial Fibrillation Management and Outcomes
  • Lipoproteins and Cardiovascular Health
  • Visual Attention and Saliency Detection

Fudan University
2019-2024

Chinese PLA General Hospital
2010

Recent years have witnessed the resurgence of knowledge engineering, which is featured by the fast growth of knowledge graphs. However, most existing knowledge graphs are represented with pure symbols, which hurts the machine's capability to understand the real world. The multi-modalization of knowledge graphs is an inevitable key step towards the realization of human-level machine intelligence. The results of this endeavor are Multi-modal Knowledge Graphs (MMKGs). In this survey on MMKGs constructed from texts and images, we first give definitions of MMKGs, followed by the preliminaries...

10.1109/tkde.2022.3224228 article EN IEEE Transactions on Knowledge and Data Engineering 2022-11-24
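
The survey's core premise, that purely symbolic entities lack grounding in the real world, can be made concrete with a toy data structure. Below is a minimal sketch of an MMKG record in which a symbolic entity carries textual and visual groundings; the class and field names are illustrative assumptions, not taken from the survey.

```python
from dataclasses import dataclass, field

@dataclass
class MMEntity:
    """A symbolic entity grounded in text and images (illustrative only)."""
    name: str                                             # symbolic label
    description: str = ""                                 # textual grounding
    image_uris: list[str] = field(default_factory=list)   # visual grounding

@dataclass
class Triple:
    head: MMEntity
    relation: str
    tail: MMEntity

paris = MMEntity("Paris", "Capital city of France", ["paris_01.jpg"])
tower = MMEntity("Eiffel Tower", "Landmark in Paris", ["tower_01.jpg"])
mmkg = [Triple(tower, "locatedIn", paris)]   # a one-triple MMKG
```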

Visual grounding focuses on establishing fine-grained alignment between vision and natural language, which has essential applications in multimodal reasoning systems. Existing methods use pre-trained, query-agnostic visual backbones to extract visual feature maps independently, without considering the query information. We argue that the features extracted by these backbones and the features really needed for visual grounding are inconsistent. One reason is that there are differences between the pre-training tasks and visual grounding. Moreover, since the backbones are query-agnostic, it is difficult...

10.1109/cvpr52688.2022.01506 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022-06-01
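
To make the query-aware idea concrete: instead of a query-agnostic backbone, the language query can modulate the visual feature map during extraction. The sketch below shows one generic way to do this with cross-attention in PyTorch; it illustrates the principle the abstract argues for, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class QueryAwareBlock(nn.Module):
    """Lets the flattened visual feature map attend to query tokens,
    so the extracted features depend on the query (generic illustration)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # visual: (B, H*W, dim) flattened feature map; query: (B, L, dim) tokens
        attended, _ = self.attn(visual, query, query)  # vision attends to language
        return self.norm(visual + attended)            # residual fusion

feats = torch.randn(2, 49, 256)   # e.g. a 7x7 feature map, flattened
words = torch.randn(2, 10, 256)   # query token embeddings
print(QueryAwareBlock(256)(feats, words).shape)  # torch.Size([2, 49, 256])
```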

Multimodal named entity recognition (MNER) aims to detect and classify named entities in multimodal scenarios. It requires bridging the gap between natural language and the visual context, which presents a two-fold challenge: cross-modal alignment is diversified, and cross-modal interaction is sometimes implicit. Existing MNER methods are vulnerable to such implicit interactions and prone to overlook the significant features involved. To tackle this problem, we novelly propose to refine the attention by identifying and highlighting task-salient features. The...

10.1109/icme52920.2022.9859972 article EN 2022 IEEE International Conference on Multimedia and Expo (ICME) 2022-07-18
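
One generic way to "highlight task-salient features" is to gate the visual regions before text attends to them. The following sketch uses an assumed, simplified mechanism (a learned sigmoid gate) for illustration; it is not the refinement method proposed in the paper.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Cross-modal attention where a learned gate re-weights image regions
    before text attends to them (assumed mechanism, for illustration)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, text: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # text: (B, Lt, dim) token features; image: (B, Li, dim) region features
        salient = image * self.gate(image)        # down-weight non-salient regions
        ctx, _ = self.attn(text, salient, salient)
        return text + ctx                         # fused text representation

out = GatedCrossAttention(128)(torch.randn(2, 12, 128), torch.randn(2, 36, 128))
print(out.shape)  # torch.Size([2, 12, 128])
```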

Xuwu Wang, Junfeng Tian, Min Gui, Zhixu Li, Rui Wang, Ming Yan, Lihan Chen, Yanghua Xiao. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022.

10.18653/v1/2022.acl-long.328 article EN cc-by Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2022-01-01

Low-Rank Adaptation (LoRA) is widely used for adapting large language models (LLMs) to specific domains due to its efficiency and modularity. Meanwhile, vanilla LoRA struggles with task conflicts in multi-task scenarios. Recent works adopt Mixture of Experts (MoE) by treating each LoRA module as an expert, thereby mitigating task interference through multiple specialized LoRA modules. While effective, these methods often isolate knowledge within individual tasks, failing to fully exploit the shared knowledge across...

10.48550/arxiv.2501.15103 preprint EN arXiv (Cornell University) 2025-01-25
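
For readers unfamiliar with the MoE-over-LoRA setup this abstract builds on, the sketch below shows a frozen linear layer augmented with several low-rank experts and a token-level softmax router. It is a generic illustration of the baseline design, with illustrative hyperparameters, not the method proposed in this preprint.

```python
import torch
import torch.nn as nn

class MoELoRALinear(nn.Module):
    """A frozen linear layer plus several LoRA experts mixed by a router."""
    def __init__(self, base: nn.Linear, n_experts: int = 4, r: int = 8):
        super().__init__()
        self.base = base.requires_grad_(False)   # frozen pretrained weight
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(n_experts, d_in, r) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, r, d_out))
        self.router = nn.Linear(d_in, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, L, d_in); gates: per-token mixture weights over experts
        gates = torch.softmax(self.router(x), dim=-1)            # (B, L, E)
        delta = torch.einsum("bld,edr,ero->bleo", x, self.A, self.B)
        return self.base(x) + torch.einsum("ble,bleo->blo", gates, delta)

layer = MoELoRALinear(nn.Linear(64, 64))
print(layer(torch.randn(2, 5, 64)).shape)  # torch.Size([2, 5, 64])
```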

Image-text retrieval is a challenging cross-modal task that has aroused much attention. While traditional methods cannot break down the barriers between different modalities, Vision-Language Pre-trained (VLP) models greatly improve image-text retrieval performance based on massive image-text pairs. Nonetheless, VLP-based methods are still prone to produce retrieval results that cannot be aligned with the entities. Recent efforts try to fix this problem at the pre-training stage, which is not only expensive but also impractical due to the unavailability of full...

10.1145/3539597.3570481 article EN 2023-02-22
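
As background, VLP-style retrieval typically ranks candidates by embedding similarity. The sketch below shows that baseline scoring step, cosine similarity over L2-normalized embeddings; it does not implement the entity-alignment fix this paper studies.

```python
import numpy as np

def rank_images(text_emb: np.ndarray, image_embs: np.ndarray):
    """Rank images for one text query by cosine similarity."""
    t = text_emb / np.linalg.norm(text_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = imgs @ t                    # cosine similarity per image
    return np.argsort(-scores), scores   # best-matching indices first

rng = np.random.default_rng(0)
order, scores = rank_images(rng.normal(size=64), rng.normal(size=(5, 64)))
print(order)  # indices of images, most similar first
```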

Low-dimensional embeddings of knowledge graphs and behavior graphs have proved remarkably powerful in a variety of tasks, from predicting unobserved edges between entities to content recommendation. The two types of graphs can contain distinct and complementary information for the same entities/nodes. However, previous works focus either on knowledge graph embedding or behavior graph embedding, while few consider both in a unified way. Here we present BEM, a Bayesian framework that incorporates the information from both types of graphs. To be more specific, BEM takes as prior the pre-trained...

10.1145/3357384.3358014 article EN 2019-11-03
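
The "takes as prior" phrasing suggests a conjugate-Gaussian reading: the knowledge graph embedding acts as a prior mean that is updated by the behavior graph embedding. The sketch below is a textbook Gaussian posterior update used purely for illustration; BEM's actual model is more involved.

```python
import numpy as np

def gaussian_posterior(prior_mu, prior_var, obs, obs_var):
    """Conjugate update: N(prior_mu, prior_var) prior, N(z; obs, obs_var) likelihood."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / obs_var)
    post_mu = post_var * (prior_mu / prior_var + obs / obs_var)
    return post_mu, post_var

kg_emb = np.array([0.2, -1.0, 0.5])    # prior mean from the knowledge graph
beh_emb = np.array([0.4, -0.6, 0.1])   # observation from the behavior graph
mu, var = gaussian_posterior(kg_emb, prior_var=1.0, obs=beh_emb, obs_var=0.5)
print(mu)  # posterior mean, weighted toward the lower-variance source
```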

Referring expression comprehension aims to align natural language queries with visual scenes, which requires establishing fine-grained correspondence between vision and language. This has important applications in multi-modal reasoning systems. Existing methods typically use text-agnostic visual backbones to extract features independently, without considering the specific text input. However, we argue that the features extracted can be inconsistent with the referring expression, which hurts multimodal understanding. To address this, we first...

10.1145/3660638 article EN ACM Transactions on Multimedia Computing Communications and Applications 2024-04-25

In this paper, we introduce "InfiAgent-DABench", the first benchmark specifically designed to evaluate LLM-based agents on data analysis tasks. The benchmark contains DAEval, a dataset consisting of 311 questions derived from 55 CSV files, and an agent framework that incorporates LLMs to serve as data analysis agents. We adopt a format-prompting technique, ensuring the questions to be closed-form so that they can be automatically evaluated. Our extensive benchmarking of 23 state-of-the-art LLMs uncovers the current challenges they encounter in this task. In addition, we have developed DAAgent,...

10.48550/arxiv.2401.05507 preprint EN other-oa arXiv (Cornell University) 2024-01-01
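
Format prompting works because a fixed answer pattern turns grading into a string match. Below is a minimal sketch of that idea; the @answer[...] tag and prompt wording are hypothetical, not DAEval's actual template.

```python
import re

PROMPT_SUFFIX = (
    "\nAnswer the question using the CSV file. "
    "End your response with a single line of the form @answer[<value>]."
)

def extract_answer(response: str) -> str | None:
    """Pull the closed-form answer out of the fixed pattern, if present."""
    m = re.search(r"@answer\[(.*?)\]", response)
    return m.group(1).strip() if m else None

def grade(response: str, gold: str) -> bool:
    """Automatic evaluation: exact match on the extracted answer."""
    pred = extract_answer(response)
    return pred is not None and pred == gold

print(grade("The mean is 4.2.\n@answer[4.2]", "4.2"))  # True
```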

With the explosive growth of multi-modal information on the Internet, unimodal search cannot satisfy the requirements of Internet applications. Text-image retrieval research is needed to realize high-quality and efficient retrieval between different modalities. Existing text-image retrieval research is mostly based on general vision-language datasets (e.g., MS-COCO, Flickr30K), in which the query utterances are rigid and unnatural (i.e., verbose and formal). To overcome this shortcoming, we construct a new Compact and Fragmented Query challenge dataset (named...

10.48550/arxiv.2403.13317 preprint EN arXiv (Cornell University) 2024-03-20

Low-Rank Adaptation (LoRA) has emerged as a popular technique for fine-tuning large language models (LLMs) to various domains, due to its modular design and widespread availability on platforms like Huggingface. This modularity has sparked interest in combining multiple LoRAs to enhance LLM capabilities. However, existing methods for LoRA composition primarily focus on task-specific adaptations that require additional training, and current model merging techniques often fail to fully leverage LoRA's modular nature, leading...

10.48550/arxiv.2409.16167 preprint EN arXiv (Cornell University) 2024-09-24
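
The simplest training-free composition this abstract alludes to is a weighted sum of the low-rank updates delta_W = B @ A merged into the base weight. The sketch below shows that baseline; it is not the method proposed in this preprint.

```python
import numpy as np

def merge_loras(W, loras, weights):
    """W: (out, in) base weight; loras: list of (B, A) pairs with
    B of shape (out, r) and A of shape (r, in)."""
    delta = sum(w * (B @ A) for w, (B, A) in zip(weights, loras))
    return W + delta   # single merged weight matrix, no retraining

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))
loras = [(rng.normal(size=(16, 4)), rng.normal(size=(4, 16))) for _ in range(2)]
W_merged = merge_loras(W, loras, weights=[0.5, 0.5])
print(W_merged.shape)  # (16, 16)
```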

Large language models (LLMs) have become increasingly pivotal across various domains, especially in handling complex data types. This includes structured data processing, as exemplified by ChartQA and ChatGPT-Ada, and multimodal unstructured data processing, as seen in Visual Question Answering (VQA). These areas have attracted significant attention from both industry and academia. Despite this, there remains a lack of unified evaluation methodologies for these diverse scenarios. In response, we introduce BabelBench, an...

10.48550/arxiv.2410.00773 preprint EN arXiv (Cornell University) 2024-10-01

Recent years have witnessed the resurgence of knowledge engineering, which is featured by the fast growth of knowledge graphs. However, most existing knowledge graphs are represented with pure symbols, which hurts the machine's capability to understand the real world. The multi-modalization of knowledge graphs is an inevitable key step towards the realization of human-level machine intelligence. The results of this endeavor are Multi-modal Knowledge Graphs (MMKGs). In this survey on MMKGs constructed from texts and images, we first give definitions of MMKGs, followed by the preliminaries...

10.48550/arxiv.2202.05786 preprint EN other-oa arXiv (Cornell University) 2022-01-01

Low-dimensional embeddings of knowledge graphs and behavior graphs have proved remarkably powerful in a variety of tasks, from predicting unobserved edges between entities to content recommendation. The two types of graphs can contain distinct and complementary information for the same entities/nodes. However, previous works focus either on knowledge graph embedding or behavior graph embedding, while few consider both in a unified way. Here we present BEM, a Bayesian framework that incorporates the information from both types of graphs. To be more specific, BEM takes as prior the pre-trained...

10.48550/arxiv.1908.10611 preprint EN other-oa arXiv (Cornell University) 2019-01-01

Visual grounding focuses on establishing fine-grained alignment between vision and natural language, which has essential applications in multimodal reasoning systems. Existing methods use pre-trained, query-agnostic visual backbones to extract visual feature maps independently, without considering the query information. We argue that the features extracted by these backbones and the features really needed for visual grounding are inconsistent. One reason is that there are differences between the pre-training tasks and visual grounding. Moreover, since the backbones are query-agnostic, it is difficult...

10.48550/arxiv.2203.15442 preprint EN other-oa arXiv (Cornell University) 2022-01-01