Jianhua Tao

ORCID: 0000-0002-9344-6428
Research Areas
  • Speech Recognition and Synthesis
  • Speech and Audio Processing
  • Music and Audio Processing
  • Emotion and Mood Recognition
  • Natural Language Processing Techniques
  • Topic Modeling
  • Sentiment Analysis and Opinion Mining
  • Speech and dialogue systems
  • Digital Media Forensic Detection
  • Phonetics and Phonology Research
  • Face and Expression Recognition
  • Advanced Graph Neural Networks
  • Face recognition and analysis
  • Human Pose and Action Recognition
  • Video Analysis and Summarization
  • Advanced Text Analysis Techniques
  • Anomaly Detection Techniques and Applications
  • Blind Source Separation Techniques
  • Advanced Data Compression Techniques
  • Recommender Systems and Techniques
  • Domain Adaptation and Few-Shot Learning
  • Multimodal Machine Learning Applications
  • EEG and Brain-Computer Interfaces
  • Social Robot Interaction and HRI
  • Text and Document Classification Technologies

Tsinghua University
2003-2025

Chinese Academy of Sciences
2015-2024

University of Chinese Academy of Sciences
2016-2024

Beijing Academy of Artificial Intelligence
2018-2024

Institute of Automation
2015-2024

Center for Excellence in Brain Science and Intelligence Technology
2016-2023

Shandong Institute of Automation
2007-2023

University of Science and Technology of China
2023

University of Augsburg
2022

Harbin Engineering University
2022

Emotion recognition in conversation is a crucial topic for its widespread applications in the field of human-computer interactions. Unlike vanilla emotion recognition of individual utterances, conversational emotion recognition requires modeling both context-sensitive and speaker-sensitive dependencies. Despite the promising results of recent works, they generally do not leverage advanced fusion techniques to generate the multimodal representations of an utterance. In this way, they have limitations in modeling the intra-modal and cross-modal interactions. In order to address these...

10.1109/taslp.2021.3049898 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2021-01-01

Audio deepfake detection is an emerging topic, which was included in the ASVspoof 2021. However, the recent shared tasks have not covered many real-life and challenging scenarios. The first Audio Deep synthesis Detection challenge (ADD) was motivated to fill in the gap. ADD 2022 includes three tracks: low-quality fake audio detection (LF), partially fake audio detection (PF) and audio fake game (FG). The LF track focuses on dealing with bona fide and fully fake utterances with various real-world noises etc. The PF track aims to distinguish partially fake audio from the real. The FG track is a rivalry game, which includes two tasks:...

10.1109/icassp43922.2022.9746939 article EN ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022-04-27

With the proliferation of user-generated online videos, Multimodal Sentiment Analysis (MSA) has attracted increasing attention recently. Despite significant progress, there are still two major challenges on the way towards robust MSA: 1) inefficiency when modeling cross-modal interactions in unaligned multimodal data; and 2) vulnerability to random modality feature missing, which typically occurs in realistic settings. In this paper, we propose a generic and unified framework to address them, named...

10.1109/taffc.2023.3274829 article EN IEEE Transactions on Affective Computing 2023-05-10

Conversations have become a critical data format on social media platforms. Understanding conversations from emotion, content and other aspects also attracts increasing attention from researchers due to its widespread application in human-computer interaction. In real-world environments, we often encounter the problem of incomplete modalities, which has become a core issue of conversation understanding. To address this problem, researchers propose various methods. However, existing approaches are mainly designed for individual...

10.1109/tpami.2023.3234553 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2023-01-01

Automatic music type classification is very helpful for the management of digital music databases. In this paper, the octave-based spectral contrast feature is proposed to represent the spectral characteristics of a music clip. It represents the relative spectral distribution instead of the average spectral envelope. Experiments show that the octave-based spectral contrast feature performs well in music type classification. Another comparison experiment demonstrates that it has better discrimination among different music types than mel-frequency cepstral coefficients (MFCC), which are often used in previous music type classification systems.

10.1109/icme.2002.1035731 article EN 2003-06-25
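
The octave-based spectral contrast idea is available in standard tooling; below is a minimal sketch using librosa's spectral_contrast, where the file name and band settings are illustrative assumptions rather than the paper's exact configuration.

```python
# A minimal sketch of the octave-based spectral contrast idea using
# librosa's built-in implementation; "clip.wav" and the band settings
# are illustrative assumptions, not the paper's exact configuration.
import librosa
import numpy as np

y, sr = librosa.load("clip.wav", sr=22050)  # hypothetical input clip

# Spectral contrast: per octave-scaled band, the difference between
# spectral peaks and valleys, i.e. a relative-distribution measure
# rather than an average spectral envelope.
contrast = librosa.feature.spectral_contrast(y=y, sr=sr, n_bands=6, fmin=200.0)

# One fixed-length descriptor per clip: mean and std over time.
clip_feature = np.concatenate([contrast.mean(axis=1), contrast.std(axis=1)])
print(clip_feature.shape)  # (14,) = 2 * (n_bands + 1)
```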

Emotion is an important element in expressive speech synthesis. Unlike traditional discrete emotion simulations, this paper attempts to synthesize emotional speech by using "strong", "medium", and "weak" classifications. This paper tests different models: a linear modification model (LMM), a Gaussian mixture model (GMM), and a classification and regression tree (CART) model. The LMM makes direct modification of sentence F0 contours and syllabic durations based on the acoustic distributions of emotional speech, such as F0 topline, baseline, durations and intensities....

10.1109/tasl.2006.876113 article EN IEEE Transactions on Audio Speech and Language Processing 2006-06-21
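
The linear modification model lends itself to a compact illustration; here is a minimal sketch, assuming a toy F0 contour and hypothetical source/target topline and baseline statistics (none of these numbers come from the paper).

```python
# A minimal sketch of the linear-modification idea: rescale a neutral
# F0 contour so that its baseline/topline match target emotional
# statistics. All numbers are illustrative assumptions, not the
# paper's trained parameters.
import numpy as np

def linear_modify_f0(f0, src_base, src_top, tgt_base, tgt_top):
    """Map each F0 value linearly from the source range
    [src_base, src_top] onto the target range [tgt_base, tgt_top]."""
    scale = (tgt_top - tgt_base) / (src_top - src_base)
    return tgt_base + (f0 - src_base) * scale

neutral_f0 = np.array([110.0, 130.0, 150.0, 140.0, 120.0])  # Hz, toy contour
# Hypothetical statistics for a "strong" emotion category:
modified_f0 = linear_modify_f0(neutral_f0, src_base=100, src_top=160,
                               tgt_base=140, tgt_top=260)
print(modified_f0)
```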

Micro-expressions are brief involuntary facial expressions that reveal genuine emotions and, thus, help detect lies. Because of their many promising applications, they have attracted the attention of researchers from various fields. Recent research reveals that two perceptual color spaces (CIELab and CIELuv) provide useful information for micro-expression recognition. This paper is an extended version of our International Conference on Pattern Recognition paper, in which we propose a novel color space model,...

10.1109/tip.2015.2496314 article EN IEEE Transactions on Image Processing 2015-10-30
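
Projecting frames into those perceptual color spaces is a one-line conversion in common libraries; here is a minimal sketch with scikit-image, where the input file name is an illustrative assumption.

```python
# A minimal sketch of moving facial frames into the perceptual color
# spaces highlighted above (CIELab and CIELuv) using scikit-image;
# "face_frame.png" is an illustrative assumption.
from skimage import io, color

frame = io.imread("face_frame.png")[..., :3] / 255.0  # RGB in [0, 1]

lab = color.rgb2lab(frame)   # L* (lightness), a*, b* (opponent colors)
luv = color.rgb2luv(frame)   # L*, u*, v*

# Downstream recognition would extract dynamic features from these
# channels rather than from raw RGB.
print(lab.shape, luv.shape)
```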

This paper presents our effort for the Audio/Visual+ Emotion Challenge (AV+EC2015), whose goal is to predict the continuous values of the emotion dimensions arousal and valence from the audio, visual and physiology modalities. The state-of-the-art classifier for dimensional emotion recognition, the long short-term memory recurrent neural network (LSTM-RNN), is utilized. Besides the regular LSTM-RNN prediction architecture, two techniques are investigated for the dimensional emotion recognition problem. The first one is that the ε-insensitive loss is utilized as the loss function to optimize....

10.1145/2808196.2811634 article EN 2015-10-13
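
The ε-insensitive loss is compact enough to state directly; below is a minimal PyTorch sketch, with the ε value and toy tensors as illustrative assumptions.

```python
# A minimal sketch of the epsilon-insensitive loss mentioned above:
# prediction errors smaller than epsilon incur no penalty, which can
# make regression of continuous arousal/valence less sensitive to
# small annotation noise. The epsilon value is an illustrative assumption.
import torch

def eps_insensitive_loss(pred, target, eps=0.05):
    return torch.clamp(torch.abs(pred - target) - eps, min=0.0).mean()

pred = torch.tensor([0.30, -0.10, 0.55])
target = torch.tensor([0.32, -0.40, 0.50])
print(eps_insensitive_loss(pred, target))  # only the middle error exceeds eps
```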

Multimodal fusion increases the performance of emotion recognition because of the complementarity of different modalities. Compared with decision-level and feature-level fusion, model-level fusion makes better use of the advantages of deep neural networks. In this work, we utilize the Transformer model to fuse audio-visual modalities at the model level. Specifically, the multi-head attention produces multimodal emotional intermediate representations from a common semantic feature space after encoding the audio and visual modalities. Meanwhile, it can also learn long-term temporal...

10.1109/icassp40776.2020.9053762 article EN ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020-04-09
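
A minimal sketch of model-level audio-visual fusion via multi-head attention in PyTorch follows; the dimensions and the single cross-attention layer are illustrative assumptions, not the paper's full architecture.

```python
# A minimal sketch of model-level audio-visual fusion with multi-head
# attention, in the spirit described above; dimensions and the simple
# one-layer cross-attention layout are illustrative assumptions.
import torch
import torch.nn as nn

d_model, n_heads = 256, 4
audio = torch.randn(50, 1, d_model)   # (time, batch, dim) audio encoding
visual = torch.randn(30, 1, d_model)  # visual encoding

attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads)

# Audio queries attend over visual keys/values, producing multimodal
# intermediate representations in a shared semantic space.
fused, weights = attn(query=audio, key=visual, value=visual)
print(fused.shape)  # torch.Size([50, 1, 256])
```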

10.1007/s12652-016-0406-z article EN Journal of Ambient Intelligence and Humanized Computing 2016-09-10

Recurrent neural network transducers (RNN-T) have been successfully applied in end-to-end speech recognition. However, the recurrent structure makes it difficult for parallelization. In this paper, we propose a self-attention transducer (SA-T) in which RNNs are replaced with self-attention blocks, which are powerful to model long-term dependencies inside sequences and are able to be efficiently parallelized. Furthermore, a path-aware regularization is proposed to assist SA-T to learn alignments and improve the performance. Additionally,...

10.21437/interspeech.2019-2203 preprint EN Interspeech 2019 2019-09-13
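
The core substitution in SA-T, self-attention blocks in place of RNNs, can be sketched with stock PyTorch modules; layer sizes are illustrative assumptions, and the prediction and joint networks of the transducer are omitted.

```python
# A minimal sketch of the encoder substitution in SA-T: a stack of
# self-attention blocks processes all acoustic frames in parallel,
# unlike a recurrent encoder. Layer sizes are illustrative assumptions;
# the transducer's prediction/joint networks are omitted.
import torch
import torch.nn as nn

d_model = 256
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                   dim_feedforward=1024)
encoder = nn.TransformerEncoder(layer, num_layers=6)

frames = torch.randn(200, 8, d_model)  # (time, batch, dim) acoustic frames
encoded = encoder(frames)              # whole sequence encoded in parallel
print(encoded.shape)
```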

Early interventions in mental health conditions such as Major Depressive Disorder (MDD) are critical to improved health outcomes, as they can help reduce the burden of the disease. As the efficient diagnosis of depression severity is therefore highly desirable, the use of behavioural cues such as speech characteristics is attracting increasing interest in this field of quantitative research. However, despite the widespread use of machine learning methods in the speech analysis community, the lack of adequate labelled data has become a bottleneck preventing the broader...

10.1109/jstsp.2019.2955012 article EN IEEE Journal of Selected Topics in Signal Processing 2019-11-22

Physiological studies have shown that there are some differences in speech and facial activities between depressive and healthy individuals. Based on this fact, we propose a novel spatio-temporal attention (STA) network and a multimodal feature fusion (MAFF) strategy to obtain the multimodal representation of depression cues for predicting the individual depression level. Specifically, we first divide the amplitude spectrum/video into fixed-length segments and input these segments into the STA network, which not only integrates the spatial and temporal...

10.1109/taffc.2020.3031345 article EN IEEE Transactions on Affective Computing 2020-10-15
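
One ingredient of such a design, attention-weighted pooling over fixed-length segment embeddings, can be sketched briefly; the dimensions and this simple formulation are illustrative assumptions, not the paper's full STA/MAFF design.

```python
# A minimal sketch of attention-weighted pooling over fixed-length
# segment embeddings, one ingredient of spatio-temporal attention.
# Dimensions and this simple scoring scheme are illustrative
# assumptions, not the paper's full STA/MAFF design.
import torch
import torch.nn as nn

d = 128
segments = torch.randn(1, 12, d)        # (batch, n_segments, dim)
score = nn.Linear(d, 1)                 # learned per-segment relevance

weights = torch.softmax(score(segments), dim=1)  # temporal attention
clip_repr = (weights * segments).sum(dim=1)      # weighted summary
print(clip_repr.shape)  # torch.Size([1, 128])
```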

The joint training framework for speech enhancement and recognition methods has obtained quite good performances for robust end-to-end automatic speech recognition (ASR). However, these methods only utilize the enhanced feature as the input of the speech recognition component, which is affected by the speech distortion problem. In order to address this problem, this paper proposes a gated recurrent fusion (GRF) method with the joint training framework for robust end-to-end ASR. The GRF algorithm is used to dynamically combine the noisy and enhanced features. Therefore, the GRF can not only remove the noise signals from the enhanced features, but also learn the raw fine...

10.1109/taslp.2020.3039600 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2020-11-20
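
The gating idea is easy to sketch; here is a simplified, non-recurrent gate in PyTorch (the actual GRF uses recurrent fusion, and all dimensions here are illustrative assumptions).

```python
# A minimal sketch of the gating idea behind gated recurrent fusion:
# a learned gate decides, per frame and per dimension, how much to
# trust the enhanced feature versus the raw noisy one. This simplified
# gate is non-recurrent; dimensions are illustrative assumptions.
import torch
import torch.nn as nn

d = 80  # e.g. filterbank dimension
gate = nn.Sequential(nn.Linear(2 * d, d), nn.Sigmoid())

noisy = torch.randn(1, 100, d)      # raw noisy features
enhanced = torch.randn(1, 100, d)   # output of the enhancement front-end

g = gate(torch.cat([noisy, enhanced], dim=-1))
fused = g * enhanced + (1.0 - g) * noisy  # fed to the ASR component
print(fused.shape)
```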

Conversational emotion recognition is a crucial research topic in human-computer interactions. Due to the heavy annotation cost and inevitable label ambiguity, collecting large amounts of labeled data is challenging and expensive, which restricts the performance of current fully-supervised methods in this domain. To address this problem, researchers attempt to distill knowledge from unlabeled data via semi-supervised learning. However, most of these methods ignore the multimodal interactive information, although recent works have...

10.1109/taffc.2022.3141237 article EN IEEE Transactions on Affective Computing 2022-01-07

Speech is the fundamental mode of human communication, and its synthesis has long been a core priority in human-computer interaction research. In recent years, machines have managed to master the art of generating speech that is understandable by humans. However, the linguistic content of an utterance encompasses only a part of its meaning. Affect, or expressivity, has the capacity to turn speech into a medium capable of conveying intimate thoughts, feelings, and emotions, aspects that are essential for engaging and naturalistic interpersonal...

10.1109/jproc.2023.3250266 article EN Proceedings of the IEEE 2023-03-10

The rapid evolution of speech synthesis and voice conversion has raised substantial concerns due to the potential misuse of such technology, prompting a pressing need for effective audio deepfake detection mechanisms. Existing detection models have shown remarkable success in discriminating known deepfake audio, but struggle when encountering new attack types. To address this challenge, one of the emergent approaches is continual learning. In this paper, we propose a continual learning approach called Radian Weight Modification (RWM)...

10.1609/aaai.v38i17.29929 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2024-03-24