- Multimodal Machine Learning Applications
- Topic Modeling
- Natural Language Processing Techniques
- Domain Adaptation and Few-Shot Learning
- Human Pose and Action Recognition
- Video Analysis and Summarization
- Speech and Dialogue Systems
- Advanced Image and Video Retrieval Techniques
- Generative Adversarial Networks and Image Synthesis
- Speech Recognition and Synthesis
- Wireless Networks and Protocols
- Cooperative Communication and Network Coding
- Text Readability and Simplification
- Mobile Ad Hoc Networks
- Reinforcement Learning in Robotics
- Advanced Text Analysis Techniques
- Advanced Vision and Imaging
- Face and Expression Recognition
- Machine Learning and Data Classification
- Gaze Tracking and Assistive Technology
- Advanced Graph Neural Networks
- Psychology of Moral and Emotional Judgment
- Anomaly Detection Techniques and Applications
- Adversarial Robustness in Machine Learning
- Media Influence and Health
Harbin Institute of Technology
2016-2025
Tianjin University
2023-2024
University at Albany, State University of New York
2023-2024
Shanghai International Studies University
2024
University of California, Santa Cruz
2008-2023
Xidian University
2023
Xiamen University
2023
Second Hospital of Shanxi Medical University
2023
Shanxi Medical University
2023
Institute of Information Engineering
2022
Vision-language navigation (VLN) is the task of navigating an embodied agent to carry out natural language instructions inside real 3D environments. In this paper, we study how to address three critical challenges for this task: cross-modal grounding, ill-posed feedback, and generalization problems. First, we propose a novel Reinforced Cross-Modal Matching (RCM) approach that enforces cross-modal grounding both locally and globally via reinforcement learning (RL). Particularly, a matching critic is used to provide intrinsic...
This research strives for natural language moment retrieval in long, untrimmed video streams. The problem is not trivial, especially when a video contains multiple moments of interest and the query describes complex temporal dependencies, which often happens in real scenarios. We identify two crucial challenges: semantic misalignment and structural misalignment. However, existing approaches treat different moments separately and do not explicitly model moment-wise relations. In this paper, we present Moment Alignment Network...
Video captioning is the task of automatically generating a textual description of the actions in a video. Although previous work (e.g., the sequence-to-sequence model) has shown promising results in abstracting a coarse description of a short video, it is still very challenging to caption a video containing multiple fine-grained actions with a detailed description. This paper aims to address the challenge by proposing a novel hierarchical reinforcement learning framework for video captioning, where a high-level Manager module learns to design sub-goals and...
Though impressive results have been achieved in visual captioning, the task of generating abstract stories from photo streams is still a little-tapped problem. Different from captions, stories have more expressive language styles and contain many imaginary concepts that do not appear in the images. Thus it poses challenges to behavioral cloning algorithms. Furthermore, due to the limitations of automatic metrics on evaluating story quality, reinforcement learning methods with hand-crafted rewards also face difficulties...
Xin Wang, Yuan-Fang Wang, William Yang Wang. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). 2018.
Code representation learning, which aims to encode the semantics of source code into distributed vectors, plays an important role in recent deep-learning-based models for code intelligence. Recently, many pre-trained language models (e.g., CuBERT and CodeBERT) have been proposed to model the context of code and serve as a basis for downstream code intelligence tasks such as code search, clone detection, and program translation. Current approaches typically consider the code as a plain sequence of tokens, or inject structure information (e.g., AST and data-flow)...
A long-term goal of AI research is to build intelligent agents that can communicate with humans in natural language, perceive the environment, and perform real-world tasks. Vision-and-Language Navigation (VLN) is a fundamental and interdisciplinary topic towards this goal, and it receives increasing attention from the natural language processing, computer vision, robotics, and machine learning communities. In this paper, we review contemporary studies in the emerging field of VLN, covering tasks, evaluation metrics, methods, etc....
Temporal grounding in videos aims to localize one target video segment that semantically corresponds to a given query sentence. Thanks to the semantic diversity of natural language descriptions, temporal grounding allows localizing activities beyond pre-defined classes and has received increasing attention in recent years. The semantic diversity is rooted in the principle of compositionality in linguistics, where novel semantics can be systematically described by combining known words in novel ways (compositional generalization). However, current datasets do not...
In computer vision, great transfer learning performance has been achieved by adapting large-scale pretrained vision models (e.g., vision transformers) to downstream tasks. Common approaches for model adaptation either update all model parameters or leverage linear probes. In this paper, we aim to study parameter-efficient model adaptation strategies for vision transformers on the image classification task. We formulate efficient model adaptation as a subspace training problem and perform comprehensive benchmarking over different methods. We conduct an...
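To make the contrast with full fine-tuning concrete, a minimal sketch of the linear-probe baseline mentioned in the abstract (not the paper's code; the toy features and labels are hypothetical): the pretrained backbone is frozen, and only a small linear classification head is trained on its output features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: frozen-backbone features for 100 images, 2 classes.
features = rng.normal(size=(100, 16))        # outputs of a frozen pretrained model
labels = (features[:, 0] > 0).astype(int)    # toy labels tied to one feature

# Linear probe: only this weight matrix is trained (16*2 = 32 parameters,
# versus millions when updating all backbone parameters).
W = np.zeros((16, 2))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

for _ in range(200):                          # plain gradient descent on the head
    probs = softmax(features @ W)
    onehot = np.eye(2)[labels]
    grad = features.T @ (probs - onehot) / len(labels)
    W -= 0.5 * grad

accuracy = (softmax(features @ W).argmax(axis=1) == labels).mean()
```

Parameter-efficient methods studied in this line of work sit between these two extremes: they train more than a linear head but far fewer parameters than the full model.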
Existing models for extractive summarization are usually trained from scratch with a cross-entropy loss, which does not explicitly capture the global context at the document level. In this paper, we aim to improve this task by introducing three auxiliary pre-training tasks that learn document-level context in a self-supervised fashion. Experiments on the widely-used CNN/DM dataset validate the effectiveness of the proposed auxiliary tasks. Furthermore, we show that after pre-training, a clean model with simple building blocks is able to outperform...
The sequential order of utterances is often meaningful in coherent dialogues, and order changes could lead to low-quality, incoherent conversations. We consider the order information as a crucial supervised signal for dialogue learning, which, however, has been neglected by many previous dialogue systems. Therefore, in this paper, we introduce a self-supervised learning task, inconsistent order detection, to explicitly capture the flow of conversation in dialogues. Given a sampled utterance pair or triple, the task is to predict whether it is ordered or...
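The self-supervised signal described above can be generated without any human labels, since the dialogue itself provides the correct order. A sketch of one way to construct such examples (the sampling scheme here is illustrative, not necessarily the paper's exact procedure):

```python
import random

def make_order_examples(dialogue, n_negatives=1, seed=0):
    """Build (utterance_triple, label) pairs for inconsistent order detection.

    label 1 = utterances appear in their original dialogue order,
    label 0 = the triple has been shuffled into an inconsistent order.
    """
    rng = random.Random(seed)
    examples = []
    for i in range(len(dialogue) - 2):
        triple = dialogue[i:i + 3]
        examples.append((tuple(triple), 1))        # correctly ordered positive
        for _ in range(n_negatives):
            shuffled = triple[:]
            while shuffled == triple:              # make sure the order changed
                rng.shuffle(shuffled)
            examples.append((tuple(shuffled), 0))  # misordered negative
    return examples

dialogue = ["Hi!", "Hello, how can I help?", "I lost my card.", "I can block it."]
examples = make_order_examples(dialogue)
```

A classifier trained on such pairs learns to distinguish coherent from incoherent conversation flow, which is then used as a signal for dialogue learning.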
Large-scale knowledge graphs (KGs) are shown to become more important in current information systems. To expand the coverage of KGs, previous studies on knowledge graph completion need to collect adequate training instances for newly-added relations. In this paper, we consider a novel formulation, zero-shot learning, that is free of this cumbersome curation. For newly-added relations, we attempt to learn their semantic features from their text descriptions and hence recognize the facts of unseen relations with no examples being seen. For this purpose,...
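The core idea, connecting an unseen relation to seen ones purely through its text description, can be illustrated with a deliberately simple sketch (bag-of-words cosine similarity stands in for the learned semantic features; all relation names and descriptions are made up):

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'semantic feature' for a relation description."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Seen relations have training triples; the unseen relation has only text.
seen = {
    "birthplace": "the city or country where a person was born",
    "employer":   "the company or organization a person works for",
}
unseen_desc = "the place where a person was born and raised"

# Zero-shot step (sketch): relate the unseen relation to seen ones purely
# from description text, with no training examples of the unseen relation.
scores = {r: cosine(embed(unseen_desc), embed(d)) for r, d in seen.items()}
closest = max(scores, key=scores.get)
```

Real zero-shot KG completion systems replace the bag-of-words embedding with learned text encoders, but the principle is the same: the description text carries the relation semantics.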
The aim of remote sensing image captioning (RSIC) is to describe a given remote sensing image (RSI) using coherent sentences. Most existing attention-based methods model the coherence through an LSTM-based decoder, which dynamically infers a word vector from the preceding words. However, these methods are indirectly guided by confused attentive regions, as (1) the weighted average in the attention mechanism distracts the model from capturing pertinent visual regions, and (2) there are few constraints or rewards for learning long-range transitions. In this paper,...
Ming Jiang, Qiuyuan Huang, Lei Zhang, Xin Wang, Pengchuan Zhang, Zhe Gan, Jana Diesner, Jianfeng Gao. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.
Variational autoencoders (VAEs) have received much attention recently as an end-to-end architecture for text generation with latent variables. However, previous works typically focus on synthesizing relatively short sentences (up to 20 words), and the posterior collapse issue has been widely identified in text-VAEs. In this paper, we propose to leverage several multi-level structures to learn a VAE model for generating long, coherent text. In particular, a hierarchy of stochastic layers between the encoder...
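For context on the posterior collapse issue mentioned above: one widely used generic mitigation (not necessarily the mechanism proposed in this work) is KL annealing, where the weight on the KL term of the VAE objective grows from 0 to 1 during training so the decoder cannot ignore the latent variables from the start. A minimal sketch:

```python
def kl_weight(step, warmup_steps=10000):
    """Linear KL annealing: the KL weight grows from 0 to 1 over warmup_steps."""
    return min(1.0, step / warmup_steps)

def training_loss(reconstruction_loss, kl_divergence, step):
    """ELBO-style objective with an annealed KL term (toy scalar version).

    Early in training the loss is dominated by reconstruction, so the
    encoder is pushed to put useful information into the latent variables
    before the KL penalty fully kicks in.
    """
    return reconstruction_loss + kl_weight(step) * kl_divergence
```

With `warmup_steps=10000`, the weight is 0.5 at step 5000 and saturates at 1.0 thereafter; cyclical schedules are a common variant.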
Most existing video-and-language (VidL) research focuses on a single dataset, or on multiple datasets of a single task. In reality, a truly useful VidL system is expected to be easily generalizable to diverse tasks, domains, and datasets. To facilitate the evaluation of such systems, we introduce the Video-And-Language Understanding Evaluation (VALUE) benchmark, an assemblage of 11 datasets over 3 popular tasks: (i) text-to-video retrieval; (ii) video question answering; and (iii) video captioning. The VALUE benchmark aims to cover a broad range...
AI-synthesized voice technology has the potential to create realistic human voices for beneficial applications, but it can also be misused for malicious purposes. While existing detection models excel in intra-domain evaluation, they face challenges in generalizing across different domains, potentially becoming obsolete as new voice generators emerge. Current solutions use diverse data and advanced machine learning techniques (e.g., domain-invariant representation, self-supervised learning), but are limited...
Task-oriented dialog systems are becoming pervasive, and many companies heavily rely on them to complement human agents for customer service in call centers. With globalization, the need for providing cross-lingual support becomes more urgent than ever. However, it poses great challenges: it requires a large amount of additional annotated data from native speakers. In order to bypass the expensive annotation and achieve the first step towards the ultimate goal of building a universal dialog system, we set out to build a state tracking...
Jiawei Wu, Xin Wang, William Yang Wang. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.
Although promising results have been achieved in video captioning, existing models are limited to the fixed inventory of activities in the training corpus, and do not generalize to open vocabulary scenarios. Here we introduce a novel task, zero-shot video captioning, that aims at describing out-of-domain videos of unseen activities. Videos of different activities usually require different captioning strategies in many aspects, i.e., word selection, semantic construction, style expression, etc., which poses a great challenge to depict them without paired data....