- Multimodal Machine Learning Applications
- Topic Modeling
- Natural Language Processing Techniques
- Domain Adaptation and Few-Shot Learning
- Human Pose and Action Recognition
- Video Analysis and Summarization
- Speech and Dialogue Systems
- Advanced Image and Video Retrieval Techniques
- Generative Adversarial Networks and Image Synthesis
- Speech Recognition and Synthesis
- Wireless Networks and Protocols
- Cooperative Communication and Network Coding
- Text Readability and Simplification
- Mobile Ad Hoc Networks
- Reinforcement Learning in Robotics
- Advanced Text Analysis Techniques
- Advanced Vision and Imaging
- Face and Expression Recognition
- Machine Learning and Data Classification
- Gaze Tracking and Assistive Technology
- Advanced Graph Neural Networks
- Psychology of Moral and Emotional Judgment
- Anomaly Detection Techniques and Applications
- Adversarial Robustness in Machine Learning
- Media Influence and Health
Harbin Institute of Technology
2016-2025
Tianjin University
2023-2024
University at Albany, State University of New York
2023-2024
Shanghai International Studies University
2024
University of California, Santa Cruz
2008-2023
Xidian University
2023
Xiamen University
2023
Second Hospital of Shanxi Medical University
2023
Shanxi Medical University
2023
Institute of Information Engineering
2022
Vision-language navigation (VLN) is the task of navigating an embodied agent to carry out natural language instructions inside real 3D environments. In this paper, we study how to address three critical challenges for this task: cross-modal grounding, ill-posed feedback, and generalization problems. First, we propose a novel Reinforced Cross-Modal Matching (RCM) approach that enforces cross-modal grounding both locally and globally via reinforcement learning (RL). Particularly, a matching critic is used to provide intrinsic...
This research strives for natural language moment retrieval in long, untrimmed video streams. The problem is not trivial, especially when a video contains multiple moments of interest and the query describes complex temporal dependencies, which often happens in real scenarios. We identify two crucial challenges: semantic misalignment and structural misalignment. However, existing approaches treat different moments separately and do not explicitly model moment-wise relations. In this paper, we present Moment Alignment Network...
Video captioning is the task of automatically generating a textual description of the actions in a video. Although previous work (e.g., the sequence-to-sequence model) has shown promising results in abstracting a coarse description of a short video, it is still very challenging to caption a video containing multiple fine-grained actions with a detailed description. This paper aims to address the challenge by proposing a novel hierarchical reinforcement learning framework for video captioning, where a high-level Manager module learns to design sub-goals and...
Though impressive results have been achieved in visual captioning, the task of generating abstract stories from photo streams is still a little-tapped problem. Different from captions, stories have more expressive language styles and contain many imaginary concepts that do not appear in the images. Thus it poses challenges to behavioral cloning algorithms. Furthermore, due to the limitations of automatic metrics on evaluating story quality, reinforcement learning methods with hand-crafted rewards also face difficulties...
Xin Wang, Yuan-Fang Wang, William Yang Wang. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). 2018.
Code representation learning, which aims to encode the semantics of source code into distributed vectors, plays an important role in recent deep-learning-based models for code intelligence. Recently, many pre-trained language models (e.g., CuBERT and CodeBERT) have been proposed to model the context of code and serve as a basis for downstream code intelligence tasks such as code search, clone detection, and program translation. Current approaches typically consider the code as a plain sequence of tokens, or inject structure information (e.g., AST and data-flow)...
A long-term goal of AI research is to build intelligent agents that can communicate with humans in natural language, perceive the environment, and perform real-world tasks. Vision-and-Language Navigation (VLN) is a fundamental and interdisciplinary topic towards this goal, and it receives increasing attention from the natural language processing, computer vision, robotics, and machine learning communities. In this paper, we review contemporary studies in the emerging field of VLN, covering tasks, evaluation metrics, methods, etc....
Temporal grounding in videos aims to localize one target video segment that semantically corresponds to a given query sentence. Thanks to the semantic diversity of natural language descriptions, temporal grounding allows localizing activities beyond pre-defined classes and has received increasing attention in recent years. The semantic diversity is rooted in the principle of compositionality in linguistics, where novel semantics can be systematically described by combining known words in novel ways (compositional generalization). However, current datasets do not...
In computer vision, great transfer learning performance has been achieved by adapting large-scale pretrained vision models (e.g., vision transformers) to downstream tasks. Common approaches for model adaptation either update all model parameters or leverage linear probes. In this paper, we aim to study parameter-efficient model adaptation strategies for vision transformers on the image classification task. We formulate efficient model adaptation as a subspace training problem and perform comprehensive benchmarking over different methods. We conduct an...
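To make the contrast with full fine-tuning concrete, a minimal sketch of the linear-probe baseline mentioned in the abstract (not the paper's code; the toy features and labels are hypothetical): the pretrained backbone is frozen, and only a small linear classification head is trained on its output features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: frozen-backbone features for 100 images, 2 classes.
features = rng.normal(size=(100, 16))        # outputs of a frozen pretrained model
labels = (features[:, 0] > 0).astype(int)    # toy labels tied to one feature

# Linear probe: only this weight matrix is trained (16*2 = 32 parameters,
# versus millions when updating all backbone parameters).
W = np.zeros((16, 2))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

for _ in range(200):                          # plain gradient descent on the head
    probs = softmax(features @ W)
    onehot = np.eye(2)[labels]
    grad = features.T @ (probs - onehot) / len(labels)
    W -= 0.5 * grad

accuracy = (softmax(features @ W).argmax(axis=1) == labels).mean()
```

Parameter-efficient methods studied in this line of work sit between these two extremes: they train more than a linear head but far fewer parameters than the full model.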
Existing models for extractive summarization are usually trained from scratch with a cross-entropy loss, which does not explicitly capture the global context at the document level. In this paper, we aim to improve this task by introducing three auxiliary pre-training tasks that learn document-level context in a self-supervised fashion. Experiments on the widely-used CNN/DM dataset validate the effectiveness of the proposed auxiliary tasks. Furthermore, we show that after pre-training, a clean model with simple building blocks is able to outperform...
The sequential order of utterances is often meaningful in coherent dialogues, and order changes could lead to low-quality, incoherent conversations. We consider the order information as a crucial supervised signal for dialogue learning, which, however, has been neglected by many previous dialogue systems. Therefore, in this paper, we introduce a self-supervised learning task, inconsistent order detection, to explicitly capture the flow of conversation in dialogues. Given a sampled utterance pair or triple, the task is to predict whether it is ordered or...
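The self-supervised signal described above can be generated without any human labels, since the dialogue itself provides the correct order. A sketch of one way to construct such examples (the sampling scheme here is illustrative, not necessarily the paper's exact procedure):

```python
import random

def make_order_examples(dialogue, n_negatives=1, seed=0):
    """Build (utterance_triple, label) pairs for inconsistent order detection.

    label 1 = utterances appear in their original dialogue order,
    label 0 = the triple has been shuffled into an inconsistent order.
    """
    rng = random.Random(seed)
    examples = []
    for i in range(len(dialogue) - 2):
        triple = dialogue[i:i + 3]
        examples.append((tuple(triple), 1))        # correctly ordered positive
        for _ in range(n_negatives):
            shuffled = triple[:]
            while shuffled == triple:              # make sure the order changed
                rng.shuffle(shuffled)
            examples.append((tuple(shuffled), 0))  # misordered negative
    return examples

dialogue = ["Hi!", "Hello, how can I help?", "I lost my card.", "I can block it."]
examples = make_order_examples(dialogue)
```

A classifier trained on such pairs learns to distinguish coherent from incoherent conversation flow, which is then used as a signal for dialogue learning.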
Large-scale knowledge graphs (KGs) are shown to become more important in current information systems. To expand the coverage of KGs, previous studies on knowledge graph completion need to collect adequate training instances for newly-added relations. In this paper, we consider a novel formulation, zero-shot learning, that is free of this cumbersome curation. For newly-added relations, we attempt to learn their semantic features from their text descriptions and hence recognize the facts of unseen relations with no examples being seen. For this purpose,...
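The core idea, connecting an unseen relation to seen ones purely through its text description, can be illustrated with a deliberately simple sketch (bag-of-words cosine similarity stands in for the learned semantic features; all relation names and descriptions are made up):

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'semantic feature' for a relation description."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Seen relations have training triples; the unseen relation has only text.
seen = {
    "birthplace": "the city or country where a person was born",
    "employer":   "the company or organization a person works for",
}
unseen_desc = "the place where a person was born and raised"

# Zero-shot step (sketch): relate the unseen relation to seen ones purely
# from description text, with no training examples of the unseen relation.
scores = {r: cosine(embed(unseen_desc), embed(d)) for r, d in seen.items()}
closest = max(scores, key=scores.get)
```

Real zero-shot KG completion systems replace the bag-of-words embedding with learned text encoders, but the principle is the same: the description text carries the relation semantics.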
The aim of remote sensing image captioning (RSIC) is to describe a given remote sensing image (RSI) using coherent sentences. Most existing attention-based methods model the coherence through an LSTM-based decoder, which dynamically infers a word vector from the preceding words. However, these methods are indirectly guided by confused attentive regions, as (1) the weighted average in the attention mechanism distracts the model from capturing pertinent visual regions, and (2) there are few constraints or rewards for learning long-range transitions. In this paper,...
Ming Jiang, Qiuyuan Huang, Lei Zhang, Xin Wang, Pengchuan Zhang, Zhe Gan, Jana Diesner, Jianfeng Gao. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.
Variational autoencoders (VAEs) have received much attention recently as an end-to-end architecture for text generation with latent variables. However, previous works typically focus on synthesizing relatively short sentences (up to 20 words), and the posterior collapse issue has been widely identified in text-VAEs. In this paper, we propose to leverage several multi-level structures to learn a VAE model for generating long, coherent text. In particular, a hierarchy of stochastic layers between the encoder...
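For context on the posterior collapse issue mentioned above: one widely used generic mitigation (not necessarily the mechanism proposed in this work) is KL annealing, where the weight on the KL term of the VAE objective grows from 0 to 1 during training so the decoder cannot ignore the latent variables from the start. A minimal sketch:

```python
def kl_weight(step, warmup_steps=10000):
    """Linear KL annealing: the KL weight grows from 0 to 1 over warmup_steps."""
    return min(1.0, step / warmup_steps)

def training_loss(reconstruction_loss, kl_divergence, step):
    """ELBO-style objective with an annealed KL term (toy scalar version).

    Early in training the loss is dominated by reconstruction, so the
    encoder is pushed to put useful information into the latent variables
    before the KL penalty fully kicks in.
    """
    return reconstruction_loss + kl_weight(step) * kl_divergence
```

With `warmup_steps=10000`, the weight is 0.5 at step 5000 and saturates at 1.0 thereafter; cyclical schedules are a common variant.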
Most existing video-and-language (VidL) research focuses on a single dataset, or on multiple datasets of a single task. In reality, a truly useful VidL system is expected to be easily generalizable to diverse tasks, domains, and datasets. To facilitate the evaluation of such systems, we introduce the Video-And-Language Understanding Evaluation (VALUE) benchmark, an assemblage of 11 datasets over 3 popular tasks: (i) text-to-video retrieval; (ii) video question answering; and (iii) video captioning. The VALUE benchmark aims to cover a broad range...
AI-synthesized voice technology has the potential to create realistic human voices for beneficial applications, but it can also be misused for malicious purposes. While existing detection models excel in intra-domain evaluation, they face challenges in generalizing across different domains, potentially becoming obsolete as new voice generators emerge. Current solutions use diverse data and advanced machine learning techniques (e.g., domain-invariant representation, self-supervised learning), but are limited...
Task-oriented dialog systems are becoming pervasive, and many companies heavily rely on them to complement human agents for customer service in call centers. With globalization, the need for providing cross-lingual support becomes more urgent than ever. However, it poses great challenges: it requires a large amount of additional annotated data from native speakers. In order to bypass the expensive annotation and achieve the first step towards the ultimate goal of building a universal dialog system, we set out to build a state tracking...
Jiawei Wu, Xin Wang, William Yang Wang. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.
Although promising results have been achieved in video captioning, existing models are limited to the fixed inventory of activities in the training corpus, and do not generalize to open vocabulary scenarios. Here we introduce a novel task, zero-shot video captioning, that aims at describing out-of-domain videos of unseen activities. Videos of different activities usually require different captioning strategies in many aspects, i.e., word selection, semantic construction, style expression, etc., which poses a great challenge to depict them without paired data....