Xin Wang

ORCID: 0000-0003-2605-5504
Research Areas
  • Multimodal Machine Learning Applications
  • Topic Modeling
  • Natural Language Processing Techniques
  • Domain Adaptation and Few-Shot Learning
  • Human Pose and Action Recognition
  • Video Analysis and Summarization
  • Speech and Dialogue Systems
  • Advanced Image and Video Retrieval Techniques
  • Generative Adversarial Networks and Image Synthesis
  • Speech Recognition and Synthesis
  • Wireless Networks and Protocols
  • Cooperative Communication and Network Coding
  • Text Readability and Simplification
  • Mobile Ad Hoc Networks
  • Reinforcement Learning in Robotics
  • Advanced Text Analysis Techniques
  • Advanced Vision and Imaging
  • Face and Expression Recognition
  • Machine Learning and Data Classification
  • Gaze Tracking and Assistive Technology
  • Advanced Graph Neural Networks
  • Psychology of Moral and Emotional Judgment
  • Anomaly Detection Techniques and Applications
  • Adversarial Robustness in Machine Learning
  • Media Influence and Health

Harbin Institute of Technology
2016-2025

Tianjin University
2023-2024

University at Albany, State University of New York
2023-2024

Shanghai International Studies University
2024

University of California, Santa Cruz
2008-2023

Xidian University
2023

Xiamen University
2023

Second Hospital of Shanxi Medical University
2023

Shanxi Medical University
2023

Institute of Information Engineering
2022

Vision-language navigation (VLN) is the task of navigating an embodied agent to carry out natural language instructions inside real 3D environments. In this paper, we study how to address three critical challenges for this task: the cross-modal grounding, the ill-posed feedback, and the generalization problems. First, we propose a novel Reinforced Cross-Modal Matching (RCM) approach that enforces cross-modal grounding both locally and globally via reinforcement learning (RL). Particularly, a matching critic is used to provide an intrinsic...
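The matching-critic idea described above can be sketched minimally: the agent's extrinsic navigation reward is blended with an intrinsic reward measuring how well the executed trajectory reconstructs the instruction. The function name and the mixing weight below are illustrative assumptions, not values from the paper.

```python
import math

def combined_reward(extrinsic: float, critic_prob: float, weight: float = 0.5) -> float:
    """Blend the extrinsic navigation reward with an intrinsic reward
    from a matching critic that scores how well the executed trajectory
    reconstructs the original instruction (cycle-reconstruction).
    `weight` is a hypothetical mixing coefficient, not a paper value."""
    intrinsic = math.log(max(critic_prob, 1e-12))  # log-probability as reward
    return extrinsic + weight * intrinsic
```

A trajectory the critic finds implausible (low reconstruction probability) pulls the total reward down even when the extrinsic signal is positive.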

10.1109/cvpr.2019.00679 article EN 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2019-06-01

This research strives for natural language moment retrieval in long, untrimmed video streams. The problem is not trivial, especially when a video contains multiple moments of interest and the query describes complex temporal dependencies, which often happens in real scenarios. We identify two crucial challenges: semantic misalignment and structural misalignment. However, existing approaches treat the different challenges separately and do not explicitly model moment-wise relations. In this paper, we present Moment Alignment Network...

10.1109/cvpr.2019.00134 article EN 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2019-06-01

Video captioning is the task of automatically generating a textual description of the actions in a video. Although previous work (e.g., the sequence-to-sequence model) has shown promising results in abstracting a coarse description of a short video, it is still very challenging to caption a video containing multiple fine-grained actions with a detailed description. This paper aims to address the challenge by proposing a novel hierarchical reinforcement learning framework for video captioning, where a high-level Manager module learns to design sub-goals and...

10.1109/cvpr.2018.00443 preprint EN 2018-06-01

Though impressive results have been achieved in visual captioning, the task of generating abstract stories from photo streams is still a little-tapped problem. Different from captions, stories have more expressive language styles and contain many imaginary concepts that do not appear in the images. Thus it poses challenges to behavioral cloning algorithms. Furthermore, due to the limitations of automatic metrics on evaluating story quality, reinforcement learning methods with hand-crafted rewards also face difficulties...

10.18653/v1/p18-1083 article EN cc-by Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2018-01-01

Xin Wang, Yuan-Fang Wang, William Yang Wang. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). 2018.

10.18653/v1/n18-2125 preprint EN cc-by 2018-01-01

Code representation learning, which aims to encode the semantics of source code into distributed vectors, plays an important role in recent deep-learning-based models for code intelligence. Recently, many pre-trained language models (e.g., CuBERT and CodeBERT) have been proposed to model the context of source code and serve as a basis for downstream code intelligence tasks such as code search, clone detection, and program translation. Current approaches typically consider the source code as plain sequences of tokens, or inject structure information (e.g., AST and data-flow)...
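The contrast the abstract draws, plain token sequences versus injected structure, can be made concrete with Python's standard `ast` module. This is an illustrative sketch of what "structure information" means, not the encoder used in the paper.

```python
import ast

def ast_node_types(source: str) -> list[str]:
    """Parse a Python snippet and list its AST node-type names: one
    simple way to expose structural information that a flat token
    sequence misses. An illustration only, not the paper's model."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]
```

For example, `ast_node_types("x = a + b")` includes `Assign` and `BinOp`, syntactic roles that the raw token stream `x`, `=`, `a`, `+`, `b` never states explicitly.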

10.48550/arxiv.2108.04556 preprint EN other-oa arXiv (Cornell University) 2021-01-01

A long-term goal of AI research is to build intelligent agents that can communicate with humans in natural language, perceive the environment, and perform real-world tasks. Vision-and-Language Navigation (VLN) is a fundamental and interdisciplinary research topic towards this goal, and receives increasing attention from the natural language processing, computer vision, robotics, and machine learning communities. In this paper, we review contemporary studies in the emerging field of VLN, covering tasks, evaluation metrics, methods, etc....

10.18653/v1/2022.acl-long.524 article EN cc-by Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2022-01-01

Temporal grounding in videos aims to localize one target video segment that semantically corresponds to a given query sentence. Thanks to the semantic diversity of natural language descriptions, temporal grounding allows activity grounding beyond pre-defined classes and has received increasing attention in recent years. The semantic diversity is rooted in the principle of compositionality in linguistics, where novel semantics can be systematically described by combining known words in novel ways (compositional generalization). However, current datasets do not...

10.1109/cvpr52688.2022.00304 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022-06-01

In computer vision, great transfer learning performance has been achieved by adapting large-scale pretrained vision models (e.g., vision transformers) to downstream tasks. Common approaches for model adaptation either update all model parameters or leverage linear probes. In this paper, we aim to study parameter-efficient model adaptation strategies for vision transformers on the image classification task. We formulate efficient model adaptation as a subspace training problem and perform a comprehensive benchmarking over different methods. We conduct an...
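Why parameter-efficient adaptation beats updating all parameters in cost can be seen from a simple count. The sketch below uses a generic low-rank (LoRA-style) update W + B @ A as one representative method; the paper benchmarks a family of such methods under a subspace-training view, not this exact recipe.

```python
def adapter_param_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Compare trainable parameters: full fine-tuning of a dense
    d_out x d_in weight versus a low-rank update W + B @ A with
    A: rank x d_in and B: d_out x rank. Generic illustration of why
    such adapters are parameter-efficient; dimensions are examples."""
    full = d_out * d_in                     # every weight is trainable
    low_rank = rank * d_in + d_out * rank   # only A and B are trainable
    return full, low_rank

full, low = adapter_param_counts(768, 768, rank=8)  # 589824 vs 12288
```

At a typical transformer width of 768 and rank 8, the adapter trains roughly 2% of the parameters that full fine-tuning of that weight would.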

10.1609/aaai.v37i1.25160 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2023-06-26

Existing models for extractive summarization are usually trained from scratch with a cross-entropy loss, which does not explicitly capture the global context at the document level. In this paper, we aim to improve this task by introducing three auxiliary pre-training tasks that learn to capture the document-level context in a self-supervised fashion. Experiments on the widely-used CNN/DM dataset validate the effectiveness of the proposed pre-training tasks. Furthermore, we show that after pre-training, a clean model with simple building blocks is able to outperform...

10.18653/v1/p19-1214 preprint EN cc-by 2019-01-01

The sequential order of utterances is often meaningful in coherent dialogues, and changes to the order could lead to low-quality and incoherent conversations. We consider the order information as a crucial supervised signal for dialogue learning, which, however, has been neglected by many previous dialogue systems. Therefore, in this paper, we introduce a self-supervised learning task, inconsistent order detection, to explicitly capture the flow of conversation in dialogues. Given a sampled utterance pair or triple, the task is to predict whether it is ordered or...
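The self-supervised setup described above can be sketched as a data-construction step: sample a consecutive utterance triple and, some of the time, shuffle it, labeling whether the original order was kept. The 50/50 split and triple length are illustrative choices, not the paper's exact recipe.

```python
import random

def order_detection_example(dialogue: list[str], rng: random.Random):
    """Build one training example for inconsistent order detection:
    take a consecutive utterance triple and, with probability 0.5,
    shuffle it; the binary label marks whether the original order
    was kept. A minimal sketch of the self-supervised setup."""
    i = rng.randrange(len(dialogue) - 2)
    triple = dialogue[i:i + 3]
    if rng.random() < 0.5:
        shuffled = triple[:]
        while shuffled == triple:       # ensure the order actually changed
            rng.shuffle(shuffled)
        return shuffled, 0              # label 0: misordered
    return triple, 1                    # label 1: ordered
```

No human annotation is needed: the labels come for free from the dialogue's own ordering, which is what makes the signal self-supervised.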

10.18653/v1/p19-1375 preprint EN cc-by 2019-01-01

Large-scale knowledge graphs (KGs) are shown to become more important in current information systems. To expand the coverage of KGs, previous studies on knowledge graph completion need to collect adequate training instances for newly-added relations. In this paper, we consider a novel formulation, zero-shot learning, that is free of the cumbersome curation. For newly-added relations, we attempt to learn their semantic features from their text descriptions and hence recognize the facts of unseen relations with no examples being seen. For this purpose,...
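One way to see how a text description can substitute for training triples: embed the description and compare it against the entity-pair offset of a candidate fact. This is a TransE-flavored heuristic sketch of the zero-shot idea; the paper's actual model learns relation features differently, and all vectors below are toy values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def score_unseen_fact(head, tail, relation_desc_vec):
    """Score a candidate (head, relation, tail) fact for a relation
    with no observed triples by comparing the entity-pair offset
    (tail - head) with an embedding of the relation's text
    description. Illustrative heuristic, not the paper's model."""
    offset = [t - h for h, t in zip(head, tail)]
    return cosine(offset, relation_desc_vec)
```

A fact scores highly when the geometric relationship between its entities points in the same direction as the relation described in text, even though no example of that relation was ever seen.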

10.1609/aaai.v34i05.6392 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2020-04-03

The aim of remote sensing image captioning (RSIC) is to describe a given remote sensing image (RSI) using coherent sentences. Most existing attention-based methods model the coherence through an LSTM-based decoder, which dynamically infers a word vector from the preceding ones. However, these methods are indirectly guided by the confusion of attentive regions, as (1) the weighted average in the attention mechanism distracts the model from capturing pertinent visual regions and (2) there are few constraints or rewards for learning long-range transitions. In this paper,...

10.3390/rs15030579 article EN cc-by Remote Sensing 2023-01-18

Ming Jiang, Qiuyuan Huang, Lei Zhang, Xin Wang, Pengchuan Zhang, Zhe Gan, Jana Diesner, Jianfeng Gao. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.

10.18653/v1/d19-1220 article EN cc-by 2019-01-01

Variational autoencoders (VAEs) have received much attention recently as an end-to-end architecture for text generation with latent variables. However, previous works typically focus on synthesizing relatively short sentences (up to 20 words), and the posterior collapse issue has been widely identified in text-VAEs. In this paper, we propose to leverage several multi-level structures to learn a VAE model for generating long, coherent text. In particular, a hierarchy of stochastic layers between the encoder...
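The building block behind "a hierarchy of stochastic layers" is the standard VAE reparameterization. The multi-level design sketched in the abstract stacks several such layers, with one layer's sample parameterizing the next; the sketch below shows a single scalar layer only, and the function name is an illustrative choice.

```python
import math
import random

def reparameterize(mu: float, logvar: float, rng: random.Random) -> float:
    """Standard VAE reparameterization trick: z = mu + sigma * eps
    with eps ~ N(0, 1), so sampling stays differentiable in mu and
    logvar. One scalar stochastic layer of the kind the multi-level
    model stacks."""
    eps = rng.gauss(0.0, 1.0)
    return mu + math.exp(0.5 * logvar) * eps
```

As `logvar` goes to negative infinity the sample collapses to the mean, which is why the trick preserves gradients: randomness enters only through the parameter-free noise `eps`.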

10.18653/v1/p19-1200 preprint EN cc-by 2019-01-01

Most existing video-and-language (VidL) research focuses on a single dataset, or multiple datasets of a single task. In reality, a truly useful VidL system is expected to be easily generalizable to diverse tasks, domains, and datasets. To facilitate the evaluation of such systems, we introduce the Video-And-Language Understanding Evaluation (VALUE) benchmark, an assemblage of 11 VidL datasets over 3 popular tasks: (i) text-to-video retrieval; (ii) video question answering; (iii) video captioning. The VALUE benchmark aims to cover a broad range...

10.48550/arxiv.2106.04632 preprint EN cc-by-nc-sa arXiv (Cornell University) 2021-01-01

10.1109/icassp49660.2025.10889262 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

AI-synthesized voice technology has the potential to create realistic human voices for beneficial applications, but it can also be misused for malicious purposes. While existing AI-synthesized voice detection models excel in intra-domain evaluation, they face challenges in generalizing across different domains, potentially becoming obsolete as new voice generators emerge. Current solutions use diverse data and advanced machine learning techniques (e.g., domain-invariant representation, self-supervised learning), but are limited...

10.1609/aaai.v39i19.34221 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2025-04-11

Task-oriented dialog systems are becoming pervasive, and many companies heavily rely on them to complement human agents for customer service in call centers. With globalization, the need for providing cross-lingual customer support becomes more urgent than ever. However, cross-lingual support poses great challenges: it requires a large amount of additional annotated data from native speakers. In order to bypass the expensive human annotation and achieve the first step towards the ultimate goal of building a universal dialogue system, we set out to build a cross-lingual state tracking...

10.18653/v1/d18-1038 article EN cc-by Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing 2018-01-01

Jiawei Wu, Xin Wang, William Yang Wang. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.

10.18653/v1/n19-1120 preprint EN 2019-01-01

Although promising results have been achieved in video captioning, existing models are limited to the fixed inventory of activities in the training corpus, and do not generalize to open vocabulary scenarios. Here we introduce a novel task, zero-shot video captioning, that aims at describing out-of-domain videos of unseen activities. Videos of different activities usually require different captioning strategies in many aspects, i.e., word selection, semantic construction, style expression, etc., which poses a great challenge to depict the unseen activities without paired training data....

10.1609/aaai.v33i01.33018965 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2019-07-17