- Multimodal Machine Learning Applications
- Video Analysis and Summarization
- Human Pose and Action Recognition
- Advanced Image and Video Retrieval Techniques
- Music and Audio Processing
- Speech and Audio Processing
- Emotion and Mood Recognition
- Topic Modeling
- Speech Recognition and Synthesis
- Natural Language Processing Techniques
- Sentiment Analysis and Opinion Mining
- Domain Adaptation and Few-Shot Learning
- Image Retrieval and Classification Techniques
- Subtitles and Audiovisual Media
- Anomaly Detection Techniques and Applications
- Face Recognition and Analysis
- Biometric Identification and Security
- Speech and Dialogue Systems
- Text and Document Classification Technologies
- Video Surveillance and Tracking Methods
- Image and Object Detection Techniques
- Generative Adversarial Networks and Image Synthesis
- Risk and Safety Analysis
- Remote-Sensing Image Classification
- Advanced Image Processing Techniques
Renmin University of China
2016-2025
China Academy of Engineering Physics
2014-2025
Hong Kong Polytechnic University
2017-2024
Nanjing Normal University
2023
Alibaba Group (Cayman Islands)
2023
Northeastern University
2022
University of Chinese Academy of Sciences
2017-2022
Beijing Normal University
2018-2021
Guizhou University
2019
Huzhou Vocational and Technical College
2017
Large-scale pre-trained models (PTMs) such as BERT and GPT have recently achieved great success and become a milestone in the field of artificial intelligence (AI). Owing to sophisticated pre-training objectives and huge model parameters, large-scale PTMs can effectively capture knowledge from massive labeled and unlabeled data. By storing knowledge into huge parameters and fine-tuning on specific tasks, the rich knowledge implicitly encoded in huge parameters can benefit a variety of downstream tasks, which has been extensively demonstrated via experimental...
Cross-modal retrieval between videos and texts has attracted growing attention due to the rapid emergence of videos on the web. The current dominant approach is to learn a joint embedding space to measure cross-modal similarities. However, simple joint embeddings are insufficient to represent complicated visual and textual details, such as scenes, objects, actions and their compositions. To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning (HGR) model, which decomposes video-text matching into...
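The joint-embedding baseline this abstract contrasts with can be sketched in a few lines: embed videos and captions into one space, then rank captions per video by cosine similarity. All names and dimensions below are illustrative, not from the paper (the HGR model itself adds hierarchical graph reasoning on top of this step).

```python
import numpy as np

def cosine_sim(video_emb, text_emb):
    """Cosine-similarity matrix between L2-normalized embeddings."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return v @ t.T

rng = np.random.default_rng(0)
videos = rng.normal(size=(4, 128))   # 4 toy video embeddings
texts = rng.normal(size=(6, 128))    # 6 toy caption embeddings
sims = cosine_sim(videos, texts)     # (4, 6) similarity matrix
ranking = np.argsort(-sims, axis=1)  # per-video retrieval ranking
```

In a trained system the two embedding networks would be learned jointly (e.g. with a ranking loss) so that matching pairs score highest.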
Humans are able to describe image contents with coarse or fine details as they wish. However, most image captioning models are intention-agnostic and cannot generate diverse descriptions according to different user intentions initiatively. In this work, we propose the Abstract Scene Graph (ASG) structure to represent user intention at a fine-grained level and control what and how detailed the generated description should be. The ASG is a directed graph consisting of three types of abstract nodes (object, attribute,...
Jingwen Hu, Yuchen Liu, Jinming Zhao, Qin Jin. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021.
We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously, towards high-quality realistic videos. To generate joint audio-video pairs, we propose a novel Multi-Modal Diffusion model (i.e., MM-Diffusion), with two coupled denoising autoencoders. In contrast to existing single-modal diffusion models, MM-Diffusion consists of a sequential multi-modal U-Net for a joint denoising process by design. Two subnets for audio and video learn to gradually generate aligned audio-video pairs from Gaussian...
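Both subnets in a diffusion model of this kind are trained against the same closed-form forward (noising) process, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1-ᾱ_t)·ε. The sketch below shows only that standard DDPM-style forward step applied to toy audio and video tensors; the schedule, shapes and function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t)*x0 + sqrt(1-alpha_bar_t)*noise."""
    alpha_bar = np.cumprod(1.0 - betas)[t]
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)    # linear noise schedule
audio = rng.normal(size=(1, 16000))      # toy 1-second waveform
video = rng.normal(size=(8, 32, 32, 3))  # toy 8-frame clip
xa = forward_diffuse(audio, 500, betas, rng)
xv = forward_diffuse(video, 500, betas, rng)
```

The coupled denoising networks are then trained to predict the added noise for both modalities jointly, inverting this process step by step at generation time.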
In this paper we explore one of the key aspects in building an emotion recognition system: generating suitable feature representations. We generate feature representations from both acoustic and lexical levels. At the acoustic level, we first extract low-level features such as intensity, F0, jitter, shimmer and spectral contours etc. We then generate different acoustic representations based on these low-level features, including statistics over the low-level features, a new representation derived from a set of acoustic codewords, and Gaussian Supervectors. At the lexical level, we propose a new representation named emotion vector (eVector). We also use traditional...
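The Gaussian Supervectors mentioned here are commonly built by MAP-adapting the means of a universal background model (UBM) to one utterance's frames and concatenating the adapted means. A simplified unit-variance sketch, with all shapes and the relevance factor chosen for illustration only:

```python
import numpy as np

def gmm_supervector(frames, means, weights, r=16.0):
    """MAP-adapt GMM means to one utterance and concatenate them.
    frames: (T, D) low-level feature frames; means: (K, D) UBM means."""
    # soft assignment of frames to components (spherical unit-variance UBM)
    d2 = ((frames[:, None, :] - means[None, :, :]) ** 2).sum(-1)  # (T, K)
    logp = np.log(weights)[None, :] - 0.5 * d2
    post = np.exp(logp - logp.max(1, keepdims=True))
    post /= post.sum(1, keepdims=True)
    n = post.sum(0)                                # soft counts, (K,)
    f = post.T @ frames                            # first-order stats, (K, D)
    alpha = (n / (n + r))[:, None]                 # MAP relevance weighting
    adapted = alpha * (f / np.maximum(n, 1e-8)[:, None]) + (1 - alpha) * means
    return adapted.ravel()                         # (K*D,) supervector

rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 13))  # e.g. 200 MFCC-like frames
means = rng.normal(size=(8, 13))     # toy 8-component UBM
sv = gmm_supervector(frames, means, np.full(8, 1 / 8))
```

The fixed-length supervector can then feed any standard classifier, independent of utterance length.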
Automatic emotion recognition is a challenging task which can make a great impact on improving natural human-computer interactions. In this paper, we present our effort for the Affect Subtask in the Audio/Visual Emotion Challenge (AVEC) 2017, which requires participants to perform continuous prediction on three affective dimensions: Arousal, Valence and Likability, based on audiovisual signals. We highlight several aspects of our solutions: 1) we explore and fuse different hand-crafted and deep learned features from all available...
Describing videos with natural language is one of the ultimate goals of video understanding. Video records multi-modal information including image, motion, aural, speech and so on. The MSR Video to Language Challenge provides a good chance to study multi-modality fusion in the caption task. In this paper, we propose a multi-modal encoder and integrate it with a text sequence decoder into an end-to-end caption framework. Features from visual, meta and other modalities are fused together to represent the video contents. Long Short-Term Memory Recurrent Neural Networks...
Emotion recognition has been an active research area with both wide applications and big challenges. This paper presents our effort for the Audio/Visual Emotion Challenge (AVEC2015), whose goal is to explore utilizing audio, visual and physiological signals to continuously predict the value of the emotion dimensions (arousal and valence). Our system applies Recurrent Neural Networks (RNN) to model the temporal information. We explore various aspects to improve the prediction performance including: the dominant modalities for arousal and valence...
Jinming Zhao, Ruichen Li, Qin Jin. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021.
Continuous dimensional emotion prediction is a challenging task where the fusion of various modalities, such as early fusion or late fusion, usually achieves state-of-the-art performance. In this paper, we propose a novel multi-modal fusion strategy named conditional attention fusion, which can dynamically pay attention to different modalities at each time step. Long short-term memory recurrent neural networks (LSTM-RNN) are applied as the basic uni-modality model to capture long-term dependencies. The weights assigned to different modalities are automatically decided by...
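The core idea, modality weights recomputed at every time step from the current model state, can be sketched as a softmax attention over per-modality predictions. Everything below (shapes, the linear attention scorer, the toy inputs) is an illustrative assumption, not the paper's exact architecture:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def conditional_attention_fusion(preds, hidden, W, b):
    """Fuse per-modality predictions with weights conditioned on the
    current LSTM hidden state, so the weights change at every time step.
    preds: (M,) uni-modal predictions; hidden: (H,) state; W: (M, H)."""
    attn = softmax(W @ hidden + b)     # (M,) modality weights, sum to 1
    return float(attn @ preds), attn

rng = np.random.default_rng(0)
preds = np.array([0.3, -0.1, 0.5])    # e.g. audio / video / physiological
hidden = rng.normal(size=(16,))       # toy LSTM hidden state at time t
W, b = rng.normal(size=(3, 16)), np.zeros(3)
fused, attn = conditional_attention_fusion(preds, hidden, W, b)
```

Because the fused value is a convex combination of the uni-modal predictions, a modality that is unreliable at a given moment can be smoothly down-weighted rather than discarded.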
The topic diversity of open-domain videos leads to various vocabularies and linguistic expressions in describing video contents, and therefore makes the video captioning task even more challenging. In this paper, we propose a unified caption framework, M&M TGM, which mines multimodal topics in an unsupervised fashion from the data and guides the caption decoder with these topics. Compared to pre-defined topics, the mined topics are more semantically and visually coherent and can reflect the topic distribution of videos better. We formulate the topic-aware caption generation as a...
Inspired by the success of transformer-based pre-training methods on natural language tasks and further computer vision tasks, researchers have started to apply transformers to video processing. This survey aims to provide a comprehensive overview of transformer-based pre-training methods for Video-Language learning. We first briefly introduce the transformer structure as the background knowledge, including attention mechanism, position encoding etc. We then describe the typical paradigm of pre-training & fine-tuning on Video-Language processing in terms of proxy tasks, downstream tasks and commonly used datasets....
Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Guohai Xu, Chenliang Li, Junfeng Tian, Qi Qian, Ji Zhang, Qin Jin, Liang He, Xin Lin, Fei Huang. Findings of the Association for Computational Linguistics: EMNLP 2023.
Multimedia Event Detection (MED) is an annual task in the NIST TRECVID evaluation, and requires participants to build indexing and retrieval systems for locating videos in which certain predefined events are shown. Typical systems focus heavily on the use of visual data. Audio data, however, also contains rich information that can be effectively used for video retrieval, and MED could benefit from the attention of researchers in audio analysis. We present several systems for performing retrieval using only audio, and report results for each system on the 2011 development...
Dermoscopy imaging is usually used in the early diagnosis of malignant melanoma. The accuracy of diagnosis by visual inspection relies highly on the dermatologist's clinical experience. Due to the inaccuracy, subjectivity, and poor reproducibility of human judgement, an automatic recognition algorithm for dermoscopy images is desired. In this work, we present a hybrid classification framework for dermoscopy image assessment combining a deep convolutional neural network (CNN), Fisher vector (FV) encoding and a support vector machine (SVM). Specifically, the representations...
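The FV step in such a CNN+FV+SVM pipeline aggregates a variable number of local CNN descriptors into one fixed-length vector for the SVM. A simplified Fisher vector (gradients with respect to GMM means only, unit covariances; the full FV also includes variance terms) might look like this, with all shapes illustrative:

```python
import numpy as np

def fisher_vector(descs, means, weights):
    """Simplified Fisher vector over local descriptors.
    descs: (T, D) local CNN descriptors; means: (K, D) GMM means."""
    d2 = ((descs[:, None, :] - means[None, :, :]) ** 2).sum(-1)
    logp = np.log(weights)[None, :] - 0.5 * d2
    post = np.exp(logp - logp.max(1, keepdims=True))
    post /= post.sum(1, keepdims=True)             # (T, K) soft assignments
    T = descs.shape[0]
    grad = (post[:, :, None] * (descs[:, None, :] - means)).sum(0)  # (K, D)
    fv = (grad / (T * np.sqrt(weights)[:, None])).ravel()
    fv = np.sign(fv) * np.sqrt(np.abs(fv))         # power normalization
    return fv / max(np.linalg.norm(fv), 1e-12)     # L2 normalization

rng = np.random.default_rng(0)
cnn_maps = rng.normal(size=(49, 64))  # e.g. a 7x7 feature map as 49 descriptors
means = rng.normal(size=(5, 64))      # toy 5-component GMM
fv = fisher_vector(cnn_maps, means, np.full(5, 0.2))  # input to a linear SVM
```

The power and L2 normalizations are the standard tricks that make FVs work well with linear SVMs.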
Generating image descriptions in different languages is essential to satisfy users worldwide. However, it is prohibitively expensive to collect large-scale paired image-caption datasets for every target language, which is critical for training decent captioning models. Previous works tackle the unpaired cross-lingual captioning problem through a pivot language, with the help of paired pivot-language data and pivot-to-target machine translation. However, such a language-pivoted approach suffers from inaccuracy brought by the translation, including disfluency...
Automatic video description generation (a.k.a. video captioning) is one of the ultimate goals of video understanding. Despite a wide range of applications such as video indexing and retrieval etc., the captioning task remains quite challenging due to the complexity and diversity of video content. First, open-domain videos cover a broad range of topics, which results in highly variable vocabularies and expression styles to describe the video contents. Second, videos naturally contain multiple modalities including image, motion and acoustic media. The information provided...
Mispronunciation detection is an essential component of Computer-Assisted Pronunciation Training (CAPT) systems. State-of-the-art mispronunciation detection models use Deep Neural Networks (DNN) for acoustic modeling, and a Goodness of Pronunciation (GOP) based algorithm for pronunciation scoring. However, GOP based scoring models have two major limitations: i.e., (i) they depend on forced alignment, which splits the speech into phonetic segments and scores them independently, neglecting the transitions between phonemes within a segment; (ii) they only...
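A common frame-posterior form of the GOP score compares the canonical phone's posterior against the best-scoring phone on each aligned frame, GOP(p) = mean_t[log P(p|o_t) - max_q log P(q|o_t)]. A minimal sketch under that formulation (the toy posteriors below are made up for illustration):

```python
import numpy as np

def gop(log_posteriors, canonical_phone):
    """Goodness of Pronunciation for one force-aligned segment.
    log_posteriors: (T, Q) frame-level phone log-posteriors."""
    lp = log_posteriors[:, canonical_phone]
    return float(np.mean(lp - log_posteriors.max(axis=1)))

# toy segment: 5 frames over 4 phones; canonical phone index is 2
logits = np.log(np.array([
    [0.1, 0.1, 0.7, 0.1],
    [0.2, 0.1, 0.6, 0.1],
    [0.1, 0.2, 0.6, 0.1],
    [0.3, 0.1, 0.5, 0.1],
    [0.1, 0.1, 0.7, 0.1],
]))
score = gop(logits, 2)  # 0 here: the canonical phone wins on every frame
```

A mispronounced segment yields a negative score (some other phone dominates the posteriors), and a threshold on the score flags the detection. The segment-independence visible here is exactly limitation (i) that the abstract points out.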
Bayesian uncertainty analysis is a highly effective tool for estimating model uncertainty, thereby improving the prediction ability with limited data. The data quality plays an important role in the analysis. This paper presents a novel approach to assess the quality of experiment data for high explosives. By assigning varying weights to the data based on their quality, we adopt a statistical framework to quantify the uncertainties associated with the reactant equation of state. The resulting quantification not only elucidates current physical knowledge but also paves...
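Quality-weighting data in a Bayesian framework typically amounts to scaling each datum's contribution to the log-likelihood, sum_i w_i * log p(y_i | theta). A minimal sketch with a toy linear "model" standing in for the equation-of-state model (all names, the model form and the numbers are illustrative assumptions):

```python
import numpy as np

def weighted_log_likelihood(theta, data, sigmas, weights):
    """Gaussian log-likelihood with per-point quality weights w_i:
    sum_i w_i * log N(y_i | model(theta, x_i), sigma_i^2)."""
    x, y = data
    resid = y - (theta[0] * x + theta[1])  # toy linear stand-in model
    ll = -0.5 * (resid / sigmas) ** 2 - np.log(sigmas) - 0.5 * np.log(2 * np.pi)
    return float(np.sum(weights * ll))

x = np.linspace(0.0, 1.0, 5)
y = 2.0 * x + 1.0                       # synthetic "experiment" data
sigmas = np.full(5, 0.1)
good = weighted_log_likelihood(np.array([2.0, 1.0]), (x, y), sigmas, np.ones(5))
bad = weighted_log_likelihood(np.array([0.0, 0.0]), (x, y), sigmas, np.ones(5))
```

Down-weighting a suspect point (w_i < 1) flattens the likelihood in the directions that point constrains, which widens the corresponding posterior uncertainty rather than silently biasing the fit.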
This paper describes the issue of automatic speaker segmentation and clustering for natural, multi-speaker meeting conversations. Two systems were developed and evaluated in the NIST RT-04S Meeting Recognition Evaluation: the Multiple Distant Microphone (MDM) system and the Individual Headset Microphone (IHM) system. The MDM system achieved a diarization performance of 28.17%. Speaker segmentation also aims to provide speech segments and grouping information for recognition, a necessary prerequisite for subsequent audio processing. A 44.5% word error rate was...
The recent advances in image captioning stimulate the research of generating natural language descriptions for visual content, which can be widely applied in many applications such as assisting blind people. Video description generation is a more complex task than image captioning. Most works on video description focus on visual information in the video. However, audio also provides rich information for describing video contents as well. In this paper, we propose to generate video descriptions in natural sentences using both audio and visual cues. We use a unified deep neural network with convolutional...