- Multimodal Machine Learning Applications
- Video Analysis and Summarization
- Human Pose and Action Recognition
- Advanced Image and Video Retrieval Techniques
- Music and Audio Processing
- Speech and Audio Processing
- Emotion and Mood Recognition
- Topic Modeling
- Speech Recognition and Synthesis
- Natural Language Processing Techniques
- Sentiment Analysis and Opinion Mining
- Domain Adaptation and Few-Shot Learning
- Image Retrieval and Classification Techniques
- Subtitles and Audiovisual Media
- Anomaly Detection Techniques and Applications
- Face Recognition and Analysis
- Biometric Identification and Security
- Speech and Dialogue Systems
- Text and Document Classification Technologies
- Video Surveillance and Tracking Methods
- Image and Object Detection Techniques
- Generative Adversarial Networks and Image Synthesis
- Risk and Safety Analysis
- Remote-Sensing Image Classification
- Advanced Image Processing Techniques
Renmin University of China
2016-2025
China Academy of Engineering Physics
2014-2025
Hong Kong Polytechnic University
2017-2024
Nanjing Normal University
2023
Alibaba Group (Cayman Islands)
2023
Northeastern University
2022
University of Chinese Academy of Sciences
2017-2022
Beijing Normal University
2018-2021
Guizhou University
2019
Huzhou Vocational and Technical College
2017
Large-scale pre-trained models (PTMs) such as BERT and GPT have recently achieved great success and become a milestone in the field of artificial intelligence (AI). Owing to sophisticated pre-training objectives and huge model parameters, large-scale PTMs can effectively capture knowledge from massive labeled and unlabeled data. By storing knowledge into huge parameters and fine-tuning on specific tasks, the rich knowledge implicitly encoded in huge parameters can benefit a variety of downstream tasks, which has been extensively demonstrated via experimental...
Cross-modal retrieval between videos and texts has attracted growing attention due to the rapid emergence of videos on the web. The current dominant approach is to learn a joint embedding space to measure cross-modal similarities. However, simple joint embeddings are insufficient to represent complicated visual and textual details, such as scenes, objects, actions and their compositions. To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning (HGR) model, which decomposes video-text matching into...
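The joint-embedding baseline this abstract contrasts with can be sketched in a few lines: embed videos and captions into one space, then rank captions per video by cosine similarity. All names and dimensions below are illustrative, not from the paper (the HGR model itself adds hierarchical graph reasoning on top of this step).

```python
import numpy as np

def cosine_sim(video_emb, text_emb):
    """Cosine-similarity matrix between L2-normalized embeddings."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return v @ t.T

rng = np.random.default_rng(0)
videos = rng.normal(size=(4, 128))   # 4 toy video embeddings
texts = rng.normal(size=(6, 128))    # 6 toy caption embeddings
sims = cosine_sim(videos, texts)     # (4, 6) similarity matrix
ranking = np.argsort(-sims, axis=1)  # per-video retrieval ranking
```

In a trained system the two embedding networks would be learned jointly (e.g. with a ranking loss) so that matching pairs score highest.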
Humans are able to describe image contents with coarse or fine details as they wish. However, most image captioning models are intention-agnostic and cannot generate diverse descriptions according to different user intentions initiatively. In this work, we propose the Abstract Scene Graph (ASG) structure to represent user intention at a fine-grained level and control what and how detailed the generated description should be. The ASG is a directed graph consisting of three types of abstract nodes (object, attribute,...
Jingwen Hu, Yuchen Liu, Jinming Zhao, Qin Jin. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021.
We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously, towards high-quality realistic videos. To generate joint audio-video pairs, we propose a novel Multi-Modal Diffusion model (i.e., MM-Diffusion), with two coupled denoising autoencoders. In contrast to existing single-modal diffusion models, MM-Diffusion consists of a sequential multi-modal U-Net for a joint denoising process by design. Two subnets for audio and video learn to gradually generate aligned audio-video pairs from Gaussian...
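Both subnets in a diffusion model of this kind are trained against the same closed-form forward (noising) process, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1-ᾱ_t)·ε. The sketch below shows only that standard DDPM-style forward step applied to toy audio and video tensors; the schedule, shapes and function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t)*x0 + sqrt(1-alpha_bar_t)*noise."""
    alpha_bar = np.cumprod(1.0 - betas)[t]
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)    # linear noise schedule
audio = rng.normal(size=(1, 16000))      # toy 1-second waveform
video = rng.normal(size=(8, 32, 32, 3))  # toy 8-frame clip
xa = forward_diffuse(audio, 500, betas, rng)
xv = forward_diffuse(video, 500, betas, rng)
```

The coupled denoising networks are then trained to predict the added noise for both modalities jointly, inverting this process step by step at generation time.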
In this paper we explore one of the key aspects in building an emotion recognition system: generating suitable feature representations. We generate feature representations from both acoustic and lexical levels. At the acoustic level, we first extract low-level features such as intensity, F0, jitter, shimmer and spectral contours etc. We then generate different acoustic representations based on these low-level features, including statistics over the low-level features, a new representation derived from a set of acoustic codewords, and Gaussian Supervectors. At the lexical level, we propose a new representation named emotion vector (eVector). We also use traditional...
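The Gaussian Supervectors mentioned here are commonly built by MAP-adapting the means of a universal background model (UBM) to one utterance's frames and concatenating the adapted means. A simplified unit-variance sketch, with all shapes and the relevance factor chosen for illustration only:

```python
import numpy as np

def gmm_supervector(frames, means, weights, r=16.0):
    """MAP-adapt GMM means to one utterance and concatenate them.
    frames: (T, D) low-level feature frames; means: (K, D) UBM means."""
    # soft assignment of frames to components (spherical unit-variance UBM)
    d2 = ((frames[:, None, :] - means[None, :, :]) ** 2).sum(-1)  # (T, K)
    logp = np.log(weights)[None, :] - 0.5 * d2
    post = np.exp(logp - logp.max(1, keepdims=True))
    post /= post.sum(1, keepdims=True)
    n = post.sum(0)                                # soft counts, (K,)
    f = post.T @ frames                            # first-order stats, (K, D)
    alpha = (n / (n + r))[:, None]                 # MAP relevance weighting
    adapted = alpha * (f / np.maximum(n, 1e-8)[:, None]) + (1 - alpha) * means
    return adapted.ravel()                         # (K*D,) supervector

rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 13))  # e.g. 200 MFCC-like frames
means = rng.normal(size=(8, 13))     # toy 8-component UBM
sv = gmm_supervector(frames, means, np.full(8, 1 / 8))
```

The fixed-length supervector can then feed any standard classifier, independent of utterance length.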
Automatic emotion recognition is a challenging task which can make a great impact on improving natural human-computer interactions. In this paper, we present our effort for the Affect Subtask in the Audio/Visual Emotion Challenge (AVEC) 2017, which requires participants to perform continuous prediction on three affective dimensions: Arousal, Valence and Likability, based on audiovisual signals. We highlight several aspects of our solutions: 1) we explore and fuse different hand-crafted and deep learned features from all available...
Describing videos with natural language is one of the ultimate goals of video understanding. Video records multi-modal information including image, motion, aural, speech and so on. The MSR Video to Language Challenge provides a good chance to study multi-modality fusion in the caption task. In this paper, we propose a multi-modal encoder and integrate it with a text sequence decoder into an end-to-end caption framework. Features from visual, meta and other modalities are fused together to represent the video contents. Long Short-Term Memory Recurrent Neural Networks...
Emotion recognition has been an active research area with both wide applications and big challenges. This paper presents our effort for the Audio/Visual Emotion Challenge (AVEC2015), whose goal is to explore utilizing audio, visual and physiological signals to continuously predict the value of the emotion dimensions (arousal and valence). Our system applies Recurrent Neural Networks (RNN) to model the temporal information. We explore various aspects to improve the prediction performance including: the dominant modalities for arousal and valence...
Jinming Zhao, Ruichen Li, Qin Jin. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021.
Continuous dimensional emotion prediction is a challenging task where the fusion of various modalities, such as early fusion or late fusion, usually achieves state-of-the-art performance. In this paper, we propose a novel multi-modal fusion strategy named conditional attention fusion, which can dynamically pay attention to different modalities at each time step. Long short-term memory recurrent neural networks (LSTM-RNN) are applied as the basic uni-modality model to capture long-term dependencies. The weights assigned to different modalities are automatically decided by...
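The core idea, modality weights recomputed at every time step from the current model state, can be sketched as a softmax attention over per-modality predictions. Everything below (shapes, the linear attention scorer, the toy inputs) is an illustrative assumption, not the paper's exact architecture:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def conditional_attention_fusion(preds, hidden, W, b):
    """Fuse per-modality predictions with weights conditioned on the
    current LSTM hidden state, so the weights change at every time step.
    preds: (M,) uni-modal predictions; hidden: (H,) state; W: (M, H)."""
    attn = softmax(W @ hidden + b)     # (M,) modality weights, sum to 1
    return float(attn @ preds), attn

rng = np.random.default_rng(0)
preds = np.array([0.3, -0.1, 0.5])    # e.g. audio / video / physiological
hidden = rng.normal(size=(16,))       # toy LSTM hidden state at time t
W, b = rng.normal(size=(3, 16)), np.zeros(3)
fused, attn = conditional_attention_fusion(preds, hidden, W, b)
```

Because the fused value is a convex combination of the uni-modal predictions, a modality that is unreliable at a given moment can be smoothly down-weighted rather than discarded.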
The topic diversity of open-domain videos leads to various vocabularies and linguistic expressions in describing video contents, and therefore makes the video captioning task even more challenging. In this paper, we propose a unified caption framework, M&M TGM, which mines multimodal topics in an unsupervised fashion from the data and guides the caption decoder with these topics. Compared to pre-defined topics, the mined topics are more semantically and visually coherent and can reflect the topic distribution of videos better. We formulate the topic-aware caption generation as a...
Inspired by the success of transformer-based pre-training methods on natural language tasks and further computer vision tasks, researchers have started to apply transformers to video processing. This survey aims to provide a comprehensive overview of transformer-based pre-training methods for Video-Language learning. We first briefly introduce the transformer structure as the background knowledge, including attention mechanism, position encoding etc. We then describe the typical paradigm of pre-training & fine-tuning on Video-Language processing in terms of proxy tasks, downstream tasks and commonly used datasets....
Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Guohai Xu, Chenliang Li, Junfeng Tian, Qi Qian, Ji Zhang, Qin Jin, Liang He, Xin Lin, Fei Huang. Findings of the Association for Computational Linguistics: EMNLP 2023.
Multimedia Event Detection (MED) is an annual task in the NIST TRECVID evaluation, and requires participants to build indexing and retrieval systems for locating videos in which certain predefined events are shown. Typical systems focus heavily on the use of visual data. Audio data, however, also contains rich information that can be effectively used for video retrieval, and MED could benefit from the attention of researchers in audio analysis. We present several systems for performing retrieval using only audio, and report results for each system on the 2011 development...
Dermoscopy imaging is usually used in the early diagnosis of malignant melanoma. The accuracy of diagnosis by visual inspection relies highly on the dermatologist's clinical experience. Due to the inaccuracy, subjectivity, and poor reproducibility of human judgement, an automatic recognition algorithm for dermoscopy images is desired. In this work, we present a hybrid classification framework for dermoscopy image assessment combining a deep convolutional neural network (CNN), Fisher vector (FV) encoding and a support vector machine (SVM). Specifically, the representations...
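The FV step in such a CNN+FV+SVM pipeline aggregates a variable number of local CNN descriptors into one fixed-length vector for the SVM. A simplified Fisher vector (gradients with respect to GMM means only, unit covariances; the full FV also includes variance terms) might look like this, with all shapes illustrative:

```python
import numpy as np

def fisher_vector(descs, means, weights):
    """Simplified Fisher vector over local descriptors.
    descs: (T, D) local CNN descriptors; means: (K, D) GMM means."""
    d2 = ((descs[:, None, :] - means[None, :, :]) ** 2).sum(-1)
    logp = np.log(weights)[None, :] - 0.5 * d2
    post = np.exp(logp - logp.max(1, keepdims=True))
    post /= post.sum(1, keepdims=True)             # (T, K) soft assignments
    T = descs.shape[0]
    grad = (post[:, :, None] * (descs[:, None, :] - means)).sum(0)  # (K, D)
    fv = (grad / (T * np.sqrt(weights)[:, None])).ravel()
    fv = np.sign(fv) * np.sqrt(np.abs(fv))         # power normalization
    return fv / max(np.linalg.norm(fv), 1e-12)     # L2 normalization

rng = np.random.default_rng(0)
cnn_maps = rng.normal(size=(49, 64))  # e.g. a 7x7 feature map as 49 descriptors
means = rng.normal(size=(5, 64))      # toy 5-component GMM
fv = fisher_vector(cnn_maps, means, np.full(5, 0.2))  # input to a linear SVM
```

The power and L2 normalizations are the standard tricks that make FVs work well with linear SVMs.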
Generating image descriptions in different languages is essential to satisfy users worldwide. However, it is prohibitively expensive to collect large-scale paired image-caption datasets for every target language, which is critical for training decent captioning models. Previous works tackle the unpaired cross-lingual captioning problem through a pivot language, with the help of paired pivot-language data and pivot-to-target machine translation. However, such a language-pivoted approach suffers from inaccuracy brought by the translation, including disfluency...
Automatic video description generation (a.k.a. video captioning) is one of the ultimate goals of video understanding. Despite a wide range of applications such as video indexing and retrieval etc., the captioning task remains quite challenging due to the complexity and diversity of video content. First, open-domain videos cover a broad range of topics, which results in highly variable vocabularies and expression styles to describe the video contents. Second, videos naturally contain multiple modalities including image, motion and acoustic media. The information provided...
Mispronunciation detection is an essential component of Computer-Assisted Pronunciation Training (CAPT) systems. State-of-the-art mispronunciation detection models use Deep Neural Networks (DNN) for acoustic modeling, and a Goodness of Pronunciation (GOP) based algorithm for pronunciation scoring. However, GOP based scoring models have two major limitations: i.e., (i) they depend on forced alignment, which splits the speech into phonetic segments and scores them independently, neglecting the transitions between phonemes within a segment; (ii) they only...
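A common frame-posterior form of the GOP score compares the canonical phone's posterior against the best-scoring phone on each aligned frame, GOP(p) = mean_t[log P(p|o_t) - max_q log P(q|o_t)]. A minimal sketch under that formulation (the toy posteriors below are made up for illustration):

```python
import numpy as np

def gop(log_posteriors, canonical_phone):
    """Goodness of Pronunciation for one force-aligned segment.
    log_posteriors: (T, Q) frame-level phone log-posteriors."""
    lp = log_posteriors[:, canonical_phone]
    return float(np.mean(lp - log_posteriors.max(axis=1)))

# toy segment: 5 frames over 4 phones; canonical phone index is 2
logits = np.log(np.array([
    [0.1, 0.1, 0.7, 0.1],
    [0.2, 0.1, 0.6, 0.1],
    [0.1, 0.2, 0.6, 0.1],
    [0.3, 0.1, 0.5, 0.1],
    [0.1, 0.1, 0.7, 0.1],
]))
score = gop(logits, 2)  # 0 here: the canonical phone wins on every frame
```

A mispronounced segment yields a negative score (some other phone dominates the posteriors), and a threshold on the score flags the detection. The segment-independence visible here is exactly limitation (i) that the abstract points out.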
Bayesian uncertainty analysis is a highly effective tool for estimating model uncertainty, thereby improving the prediction ability with limited data. The data quality plays an important role in the analysis. This paper presents a novel approach to assess the quality of experiment data for high explosives. By assigning varying weights to the data based on their quality, we adopt a statistical framework to quantify the uncertainties associated with the reactant equation of state. The resulting quantification not only elucidates current physical knowledge but also paves...
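Quality-weighting data in a Bayesian framework typically amounts to scaling each datum's contribution to the log-likelihood, sum_i w_i * log p(y_i | theta). A minimal sketch with a toy linear "model" standing in for the equation-of-state model (all names, the model form and the numbers are illustrative assumptions):

```python
import numpy as np

def weighted_log_likelihood(theta, data, sigmas, weights):
    """Gaussian log-likelihood with per-point quality weights w_i:
    sum_i w_i * log N(y_i | model(theta, x_i), sigma_i^2)."""
    x, y = data
    resid = y - (theta[0] * x + theta[1])  # toy linear stand-in model
    ll = -0.5 * (resid / sigmas) ** 2 - np.log(sigmas) - 0.5 * np.log(2 * np.pi)
    return float(np.sum(weights * ll))

x = np.linspace(0.0, 1.0, 5)
y = 2.0 * x + 1.0                       # synthetic "experiment" data
sigmas = np.full(5, 0.1)
good = weighted_log_likelihood(np.array([2.0, 1.0]), (x, y), sigmas, np.ones(5))
bad = weighted_log_likelihood(np.array([0.0, 0.0]), (x, y), sigmas, np.ones(5))
```

Down-weighting a suspect point (w_i < 1) flattens the likelihood in the directions that point constrains, which widens the corresponding posterior uncertainty rather than silently biasing the fit.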
This paper describes the issue of automatic speaker segmentation and clustering for natural, multi-speaker meeting conversations. Two systems were developed and evaluated in the NIST RT-04S Meeting Recognition Evaluation: the Multiple Distant Microphone (MDM) system and the Individual Headset Microphone (IHM) system. The MDM system achieved a diarization performance of 28.17%. Speaker segmentation also aims to provide speech segments and grouping information for recognition, a necessary prerequisite for subsequent audio processing. A 44.5% word error rate was...
The recent advances in image captioning stimulate the research of generating natural language descriptions for visual content, which can be widely applied in many applications such as assisting blind people. Video description generation is a more complex task than image captioning. Most works on video description focus on visual information in the video. However, audio also provides rich information for describing video contents as well. In this paper, we propose to generate video descriptions in natural sentences using both audio and visual cues. We use a unified deep neural network with convolutional...