Qin Jin

ORCID: 0000-0001-6486-6020
Research Areas
  • Multimodal Machine Learning Applications
  • Video Analysis and Summarization
  • Human Pose and Action Recognition
  • Advanced Image and Video Retrieval Techniques
  • Music and Audio Processing
  • Speech and Audio Processing
  • Emotion and Mood Recognition
  • Topic Modeling
  • Speech Recognition and Synthesis
  • Natural Language Processing Techniques
  • Sentiment Analysis and Opinion Mining
  • Domain Adaptation and Few-Shot Learning
  • Image Retrieval and Classification Techniques
  • Subtitles and Audiovisual Media
  • Anomaly Detection Techniques and Applications
  • Face Recognition and Analysis
  • Biometric Identification and Security
  • Speech and Dialogue Systems
  • Text and Document Classification Technologies
  • Video Surveillance and Tracking Methods
  • Image and Object Detection Techniques
  • Generative Adversarial Networks and Image Synthesis
  • Risk and Safety Analysis
  • Remote-Sensing Image Classification
  • Advanced Image Processing Techniques

Renmin University of China
2016-2025

China Academy of Engineering Physics
2014-2025

Hong Kong Polytechnic University
2017-2024

Nanjing Normal University
2023

Alibaba Group (Cayman Islands)
2023

Northeastern University
2022

University of Chinese Academy of Sciences
2017-2022

Beijing Normal University
2018-2021

Guizhou University
2019

Huzhou Vocational and Technical College
2017

Large-scale pre-trained models (PTMs) such as BERT and GPT have recently achieved great success and become a milestone in the field of artificial intelligence (AI). Owing to sophisticated pre-training objectives and huge model parameters, large-scale PTMs can effectively capture knowledge from massive labeled and unlabeled data. By storing knowledge into huge parameters and fine-tuning on specific tasks, the rich knowledge implicitly encoded in huge parameters can benefit a variety of downstream tasks, which has been extensively demonstrated via experimental...

10.1016/j.aiopen.2021.08.002 article EN cc-by-nc-nd AI Open 2021-01-01

Cross-modal retrieval between videos and texts has attracted growing attention due to the rapid emergence of videos on the web. The current dominant approach is to learn a joint embedding space to measure cross-modal similarities. However, simple joint embeddings are insufficient to represent complicated visual and textual details, such as scenes, objects, actions and their compositions. To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning (HGR) model, which decomposes video-text matching into...

10.1109/cvpr42600.2020.01065 article EN 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020-06-01
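
At inference time, the joint-embedding retrieval setup described in the abstract above reduces to ranking videos by cosine similarity against a text query's embedding. A minimal sketch with made-up embeddings (the vectors and dimensions below are illustrative, not from the paper):

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between every row of a and every row of b.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Hypothetical pre-computed embeddings: 3 videos and 2 text queries
# projected into a shared 4-dimensional joint space.
video_emb = np.array([[0.2, 0.9, 0.1, 0.0],
                      [0.8, 0.1, 0.4, 0.3],
                      [0.1, 0.2, 0.9, 0.5]])
text_emb = np.array([[0.25, 0.85, 0.05, 0.1],
                     [0.7, 0.2, 0.5, 0.2]])

sims = cosine_sim(text_emb, video_emb)   # (2 queries x 3 videos)
ranking = np.argsort(-sims, axis=1)      # best-matching video first
print(ranking[:, 0])                     # top-ranked video per query
```

The HGR model refines this picture by matching at several granularities, but each level still scores candidates with similarities in an embedding space like the one sketched here.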

Humans are able to describe image contents with coarse or fine details as they wish. However, most image captioning models are intention-agnostic and cannot generate diverse descriptions according to different user intentions initiatively. In this work, we propose the Abstract Scene Graph (ASG) structure to represent user intention at a fine-grained level and control what and how detailed the generated description should be. The ASG is a directed graph consisting of three types of abstract nodes (object, attribute,...

10.1109/cvpr42600.2020.00998 article EN 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020-06-01

Jingwen Hu, Yuchen Liu, Jinming Zhao, Qin Jin. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021.

10.18653/v1/2021.acl-long.440 article EN cc-by 2021-01-01

We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously, towards high-quality realistic videos. To generate joint audio-video pairs, we propose a novel Multi-Modal Diffusion model (i.e., MM-Diffusion), with two coupled denoising autoencoders. In contrast to existing single-modal diffusion models, MM-Diffusion consists of a sequential multi-modal U-Net for a joint denoising process by design. Two subnets for audio and video learn to gradually generate aligned audio-video pairs from Gaussian...

10.1109/cvpr52729.2023.00985 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01
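
As background for the denoising process the abstract above refers to, a generic diffusion model's forward (noising) step can be sketched in a few lines. The schedule length and tensor shapes below are illustrative stand-ins, not MM-Diffusion's actual configuration:

```python
import numpy as np

def forward_diffuse(x0, t, alphas_cumprod, rng):
    # q(x_t | x_0): mix the clean sample with Gaussian noise according to
    # the cumulative noise schedule; the reverse (denoising) network is
    # trained to undo exactly these steps.
    a_bar = alphas_cumprod[t]
    eps = rng.normal(size=x0.shape)
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

# Hypothetical linear beta schedule over 10 steps (real models use ~1000).
betas = np.linspace(1e-4, 0.2, 10)
alphas_cumprod = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4,))   # stand-in for an audio or video feature vector
xt = forward_diffuse(x0, t=9, alphas_cumprod=alphas_cumprod, rng=rng)
```

MM-Diffusion's contribution is on the reverse side: two coupled subnets denoise the audio and video chains jointly so the generated pair stays aligned.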

In this paper we explore one of the key aspects in building an emotion recognition system: generating suitable feature representations. We generate feature representations at both the acoustic and lexical levels. At the acoustic level, we first extract low-level features such as intensity, F0, jitter, shimmer and spectral contours etc., and then generate different acoustic feature representations based on these low-level features, including statistics over these features, a new representation derived from a set of low-level acoustic codewords, and Gaussian Supervectors. At the lexical level, we propose a new representation named emotion vector (eVector). We also use the traditional...

10.1109/icassp.2015.7178872 article EN 2015-04-01

Automatic emotion recognition is a challenging task which can make a great impact on improving natural human-computer interactions. In this paper, we present our effort for the Affect Subtask in the Audio/Visual Emotion Challenge (AVEC) 2017, which requires participants to perform continuous prediction on three affective dimensions: Arousal, Valence and Likability, based on audiovisual signals. We highlight the following aspects of our solutions: 1) we explore and fuse different hand-crafted and deep learned features from all available...

10.1145/3133944.3133949 article EN 2017-10-20

Describing videos with natural language is one of the ultimate goals of video understanding. Video records multi-modal information including image, motion, aural, speech and so on. The MSR Video to Language Challenge provides a good chance to study multi-modality fusion in the video captioning task. In this paper, we propose a multi-modal fusion encoder and integrate it with a text sequence decoder into an end-to-end captioning framework. Features from the visual, aural, speech and meta modalities are fused together to represent the video contents. Long Short-Term Memory Recurrent Neural Networks...

10.1145/2964284.2984065 article EN Proceedings of the 24th ACM International Conference on Multimedia 2016-09-29

Emotion recognition has been an active research area with both wide applications and big challenges. This paper presents our effort for the Audio/Visual Emotion Challenge (AVEC2015), whose goal is to explore utilizing audio, visual and physiological signals to continuously predict the value of the emotion dimensions (arousal and valence). Our system applies Recurrent Neural Networks (RNN) to model the temporal information. We explore various aspects to improve the prediction performance including: the dominant modalities for arousal and valence...

10.1145/2808196.2811638 article EN 2015-10-13

Jinming Zhao, Ruichen Li, Qin Jin. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021.

10.18653/v1/2021.acl-long.203 article EN cc-by 2021-01-01

Continuous dimensional emotion prediction is a challenging task where the fusion of various modalities, such as early fusion or late fusion, usually achieves state-of-the-art performance. In this paper, we propose a novel multi-modal fusion strategy named conditional attention fusion, which can dynamically pay attention to different modalities at each time step. Long short-term memory recurrent neural networks (LSTM-RNN) are applied as the basic uni-modality model to capture long-term dependencies. The weights assigned to different modalities are automatically decided by...

10.1145/2964284.2967286 article EN Proceedings of the 24th ACM International Conference on Multimedia 2016-09-29
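
The conditional attention fusion idea above, per-time-step weights over modality predictions, can be sketched as follows. The relevance scores here are supplied directly for illustration; in the paper they are produced by the model conditioned on the current LSTM states:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_fusion(preds, scores):
    # preds:  per-modality predictions at one time step, shape (M,)
    # scores: relevance score per modality (illustrative inputs here)
    w = softmax(scores)       # fusion weights, non-negative, sum to 1
    return w @ preds, w

# Made-up audio / visual / physiological arousal predictions at one step.
preds = np.array([0.6, 0.2, 0.4])
fused, w = attention_fusion(preds, np.array([2.0, 0.5, 1.0]))
print(round(fused, 3))   # ~0.498: dominated by the highest-scored modality
```

Because the weights are recomputed at every time step, the fused prediction can lean on audio in some frames and on video in others, which is what distinguishes this strategy from fixed early or late fusion.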

The topic diversity of open-domain videos leads to various vocabularies and linguistic expressions in describing video contents, and therefore makes the video captioning task even more challenging. In this paper, we propose a unified caption framework, M&M TGM, which mines multimodal topics in an unsupervised fashion from the data and guides the caption decoder with these topics. Compared to pre-defined topics, the mined topics are more semantically and visually coherent and can reflect the topic distribution of videos better. We formulate the topic-aware caption generation as a...

10.1145/3123266.3123420 article EN Proceedings of the 25th ACM International Conference on Multimedia 2017-10-20

Inspired by the success of transformer-based pre-training methods on natural language tasks and further computer vision tasks, researchers have started to apply transformers to video processing. This survey aims to provide a comprehensive overview of transformer-based pre-training methods for Video-Language learning. We first briefly introduce the transformer structure as background knowledge, including the attention mechanism, position encoding etc. We then describe the typical paradigm of pre-training & fine-tuning on Video-Language processing in terms of proxy tasks, downstream tasks and commonly used datasets....

10.1016/j.aiopen.2022.01.001 article EN cc-by-nc-nd AI Open 2022-01-01
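
As a piece of the background this survey covers, the transformer's scaled dot-product attention can be written in a few lines. This is a generic NumPy sketch with toy shapes, not code from the survey:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax (shifted for numerical stability).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy example: 2 query positions attending over 3 key/value positions.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(Q, K, V)
```

Video-language transformers build on exactly this operation, with queries from one modality attending over keys and values from another.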

Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Guohai Xu, Chenliang Li, Junfeng Tian, Qi Qian, Ji Zhang, Qin Jin, Liang He, Xin Lin, Fei Huang. Findings of the Association for Computational Linguistics: EMNLP 2023.

10.18653/v1/2023.findings-emnlp.187 article EN cc-by 2023-01-01

Multimedia Event Detection (MED) is an annual task in the NIST TRECVID evaluation, and requires participants to build indexing and retrieval systems for locating videos in which certain predefined events are shown. Typical systems focus heavily on the use of visual data. Audio data, however, also contains rich information that can be effectively used for video retrieval, and MED could benefit from the attention of researchers in audio analysis. We present several systems for performing retrieval using only audio data, and report results for each system on the 2011 development...

10.21437/interspeech.2012-556 article EN Interspeech 2012 2012-09-09

Dermoscopy imaging is usually used in the early diagnosis of malignant melanoma. The accuracy of visual inspection highly relies on the dermatologist's clinical experience. Due to the inaccuracy, subjectivity, and poor reproducibility of human judgement, an automatic recognition algorithm for dermoscopy images is desired. In this work, we present a hybrid classification framework for dermoscopy image assessment by combining a deep convolutional neural network (CNN), Fisher vector (FV) and support vector machine (SVM). Specifically, the deep representations...

10.1109/isbi.2017.7950524 article EN 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI) 2017-04-01

Generating image descriptions in different languages is essential to satisfy users worldwide. However, it is prohibitively expensive to collect large-scale paired image-caption datasets for every target language, which is critical for training decent captioning models. Previous works tackle the unpaired cross-lingual captioning problem through a pivot language, with the help of paired image-caption data in the pivot language and pivot-to-target machine translation models. However, such a language-pivoted approach suffers from inaccuracy brought by the translation, including disfluency...

10.1145/3343031.3350996 article EN Proceedings of the 27th ACM International Conference on Multimedia 2019-10-15

Automatic video description generation (a.k.a. video captioning) is one of the ultimate goals of video understanding. Despite a wide range of applications such as video indexing and retrieval etc., the video captioning task remains quite challenging due to the complexity and diversity of video content. First, open-domain videos cover a broad range of topics, which results in highly variable vocabularies and expression styles to describe the video contents. Second, videos naturally contain multiple modalities including image, motion, and acoustic media. The information provided...

10.1109/tmm.2019.2896515 article EN IEEE Transactions on Multimedia 2019-01-30

Mispronunciation detection is an essential component of Computer-Assisted Pronunciation Training (CAPT) systems. State-of-the-art mispronunciation detection models use Deep Neural Networks (DNN) for acoustic modeling, and a Goodness of Pronunciation (GOP) based algorithm for pronunciation scoring. However, GOP based scoring models have two major limitations: (i) they depend on forced alignment, which splits the speech into phonetic segments and scores them independently, neglecting the transitions between phonemes within a segment; (ii) they only...

10.21437/interspeech.2020-2953 article EN Interspeech 2020 2020-10-25
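
The GOP score criticized in the abstract above has several formulations; a minimal sketch of one common variant (the average log-ratio of the canonical phone's frame posterior to the best competing posterior) is below. The posteriors are made-up numbers, not data from the paper:

```python
import numpy as np

def gop(posteriors, canonical):
    # posteriors: (T, P) frame-level phone posteriors for one
    #             forced-aligned segment, rows summing to 1
    # canonical:  index of the phone the learner should have produced
    # Score near 0 suggests a good pronunciation; strongly negative
    # values suggest a likely mispronunciation.
    frame_scores = np.log(posteriors[:, canonical] / posteriors.max(axis=1))
    return frame_scores.mean()

# Illustrative 3-frame segment over 3 candidate phones.
good = np.array([[0.8, 0.10, 0.10],
                 [0.7, 0.20, 0.10],
                 [0.9, 0.05, 0.05]])
bad = np.array([[0.1, 0.8, 0.1],
                [0.2, 0.7, 0.1],
                [0.1, 0.8, 0.1]])
print(gop(good, 0), gop(bad, 0))   # good ~0.0, bad strongly negative
```

Note how the score is computed independently per segment after forced alignment, which is exactly the limitation (i) that the paper targets.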

Bayesian uncertainty analysis is a highly effective tool for estimating model uncertainty, thereby improving prediction ability with limited data. Data quality plays an important role in Bayesian analysis. This paper presents a novel approach to assess the quality of experimental data on high explosives. By assigning varying weights to data based on their quality, we adopt a statistical framework to quantify the uncertainties associated with the reactant equation of state. The resulting uncertainty quantification not only elucidates the current physical knowledge but also paves...

10.1063/5.0244326 article EN cc-by AIP Advances 2025-02-01

This paper describes the issue of automatic speaker segmentation and clustering for natural, multi-speaker meeting conversations. Two systems were developed and evaluated in the NIST RT-04S Meeting Recognition Evaluation: the Multiple Distant Microphone (MDM) system and the Individual Headset Microphone (IHM) system. The MDM system achieved a diarization performance of 28.17%. It also aims to provide speech segments and speaker grouping information for speech recognition, a necessary prerequisite for subsequent audio processing. A 44.5% word error rate was...

10.21437/interspeech.2004-249 article EN Interspeech 2004 2004-10-04

The recent advances in image captioning stimulate the research on generating natural language descriptions for visual content, which can be widely applied in many applications such as assisting blind people. Video description generation is a more complex task than image captioning. Most works on video description generation focus on the visual information in the video. However, audio provides rich information for describing video contents as well. In this paper, we propose to generate video descriptions in sentences using both audio and visual cues. We use a unified deep neural network with convolutional...

10.1145/2911996.2912043 article EN 2016-06-06