- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- EEG and Brain-Computer Interfaces
- Advanced Adaptive Filtering Techniques
- Infant Health and Development
- Landslides and Related Hazards
- Emotion and Mood Recognition
- Distributed and Parallel Computing Systems
- Hearing Loss and Rehabilitation
- Indoor and Outdoor Localization Technologies
- Phonetics and Phonology Research
- Cryospheric Studies and Observations
- Neural Networks and Applications
- Functional Brain Connectivity Studies
- Winter Sports Injuries and Performance
- Telecommunications and Broadcasting Technologies
Brno University of Technology
2022-2025
Chongqing University of Posts and Telecommunications
2023-2024
Shanghai Normal University
2020-2023
Tencent (China)
2021
Shandong University of Science and Technology
2018
The ConferencingSpeech 2021 challenge is proposed to stimulate research on far-field multi-channel speech enhancement for video conferencing. It consists of two separate tasks: 1) Task 1, with a single microphone array, focusing on the practical real-time requirements of real applications; and 2) Task 2, with multiple distributed microphone arrays, a non-real-time track without any constraints, so that participants can explore algorithms that obtain high speech quality. Targeting the real conferencing room application,...
In this work, we propose an error correction framework, named DiaCorrect, to refine the output of a diarization system in a simple yet effective way. This method is inspired by error correction techniques in automatic speech recognition. Our model consists of two parallel convolutional encoders and a transformer-based decoder. By exploiting the interactions between the input recording and the initial system's outputs, DiaCorrect can automatically correct the speaker activities to minimize diarization errors. Experiments on 2-speaker telephony data show...
End-to-end approaches for single-channel target speech extraction have attracted widespread attention. However, studies on the multi-channel case are still relatively limited. In this work, we propose two methods for exploiting spatial information to extract the target speech. The first one uses an adaptation layer in a parallel encoder architecture. The second designs a channel decorrelation mechanism that extracts inter-channel differential information to enhance the reference encoder representation. We compare the proposed methods with strong state-of-the-art baselines....
In recent years, a number of time-domain speech separation methods have been proposed. However, most of them are very sensitive to the acoustic environments and to wide-domain-coverage tasks. In this paper, from the time-frequency perspective, we propose a densely-connected pyramid complex convolutional network, termed DPCCN, to improve separation robustness under complicated conditions. Furthermore, we generalize DPCCN to target speech extraction (TSE) by integrating a new, specially designed speaker encoder. Moreover, we also investigate...
Target speech extraction has attracted widespread attention in recent years. In this work, we focus on investigating the dynamic interaction between different mixtures and the target speaker to exploit discriminative speaker clues. We propose a special attention mechanism, without introducing any additional parameters, in the scaling adaptation layer to better adapt the network towards extracting the target speech. Furthermore, with a mixture embedding matrix pooling method, our proposed attention-based scaling adaptation (ASA) can exploit the speaker clues in a more efficient way....
End-to-end neural diarization has evolved considerably over the past few years, but data scarcity is still a major obstacle to further improvements. Self-supervised learning methods such as WavLM have shown promising performance on several downstream tasks, but their application to speaker diarization is somehow limited. In this work, we explore using WavLM to alleviate the problem of data scarcity in diarization training. We use the same pipeline as Pyannote and improve the local end-to-end model with a Conformer. Experiments on far-field AMI, AISHELL-4, and AliMeeting...
Purpose: EEG analysis of emotions is of great significance for the diagnosis of psychological diseases and for brain-computer interface (BCI) applications. However, applications of brain neural networks to emotion classification are rarely reported, and the accuracy of emotion recognition in cross-subject tasks remains a challenge. Thus, this paper proposes to design a domain-invariant model for EEG-network-based emotion identification. Methods: A novel brain-inception-network deep learning model is proposed to extract discriminative graph features from...
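As background for the graph features mentioned above, a common generic way to build a graph from multi-channel EEG is to threshold inter-channel correlations. The sketch below is only an illustration of that idea; the paper's brain-inception-network learns its own graph features, and all names and thresholds here are assumptions.

```python
import numpy as np

def connectivity_graph(eeg, threshold=0.5):
    """Build a functional-connectivity adjacency matrix from multi-channel
    EEG by thresholding absolute Pearson correlations between channels.

    Generic illustration only; not the paper's learned graph features.
    """
    corr = np.corrcoef(eeg)                    # (channels, channels)
    adj = (np.abs(corr) >= threshold).astype(float)
    np.fill_diagonal(adj, 0.0)                 # no self-loops
    return adj

rng = np.random.default_rng(3)
base = rng.normal(size=(1, 256))
# Channels 0 and 1 share a common source; channel 2 is independent noise.
eeg = np.vstack([base + 0.1 * rng.normal(size=(1, 256)),
                 base + 0.1 * rng.normal(size=(1, 256)),
                 rng.normal(size=(1, 256))])
A = connectivity_graph(eeg)
```

Such an adjacency matrix can then feed any graph neural network layer as a fixed (non-learned) connectivity prior.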
Target speech extraction has attracted widespread attention. When microphone arrays are available, the additional spatial information can be helpful for extracting the target speech. We have recently proposed a channel decorrelation (CD) mechanism that extracts inter-channel differential information to enhance the reference encoder representation. Although it has shown promising results for extracting target speech from mixtures, its performance is still limited by the nature of the original CD theory. In this paper, we propose two methods to broaden the horizon...
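A minimal sketch of the decorrelation idea behind CD: remove from a secondary channel the component correlated with the reference channel, so only the inter-channel differential part remains. This projection-based version is a simplification for illustration; the published CD mechanism is defined differently in detail, and all shapes here are assumptions.

```python
import numpy as np

def channel_differential(ref, sec, eps=1e-8):
    """Remove from `sec` the frame-wise component correlated with `ref`,
    keeping the inter-channel differential part.

    Simplified projection-based sketch, not the exact CD formulation.
    Rows are frames, columns are encoder features.
    """
    scale = (sec * ref).sum(axis=1, keepdims=True) / (
        (ref * ref).sum(axis=1, keepdims=True) + eps)
    return sec - scale * ref                   # residual orthogonal to ref

rng = np.random.default_rng(2)
ref = rng.normal(size=(10, 16))                       # reference-channel features
sec = 0.8 * ref + 0.1 * rng.normal(size=(10, 16))     # mostly correlated channel
diff = channel_differential(ref, sec)
```

The residual `diff` carries only what the second channel adds beyond the reference, which is the kind of spatial cue a multi-channel extractor can exploit.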
Recently, supervised speech separation has made great progress. However, limited by the nature of supervised training, most existing methods require ground-truth sources and are trained on synthetic datasets. This reliance is problematic, because ground-truth signals are usually unavailable in real conditions. Moreover, in many industry scenarios, the acoustic characteristics deviate far from the simulated ones. Therefore, performance degrades significantly when applying the trained models to real applications. To address these problems,...
Speaker-attributed automatic speech recognition (ASR) in multi-speaker environments remains a significant challenge, particularly when systems conditioned on speaker embeddings fail to generalize to unseen speakers. In this work, we propose Diarization-Conditioned Whisper (DiCoW), a novel approach to target-speaker ASR that leverages diarization outputs as conditioning information. DiCoW extends the pre-trained Whisper model by integrating diarization labels directly, eliminating the reliance on speaker embeddings and reducing the need for...
SqueezeFormer has recently shown impressive performance in automatic speech recognition (ASR). However, its inference speed suffers from the quadratic complexity of softmax-attention (SA). In addition, limited by its large convolution kernel size, its local modeling ability is insufficient. In this paper, we propose a novel method, HybridFormer, to improve SqueezeFormer in a fast and efficient way. Specifically, we first incorporate linear attention (LA) into a hybrid LASA paradigm to increase the model's inference speed. Second, a neural architecture...
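The speed argument above rests on linear attention avoiding the n-by-n score matrix of softmax attention. A minimal numpy sketch of that contrast (not the HybridFormer implementation; the elu-plus-one feature map and all shapes are illustrative assumptions):

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard softmax attention: materializes an (n, n) matrix, O(n^2).
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, eps=1e-6):
    # Kernelized linear attention: phi(Q) (phi(K)^T V), O(n) in length n.
    # phi(x) = elu(x) + 1 is one common (illustrative) feature map.
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                       # (d, d_v): no n x n matrix built
    Z = Qf @ Kf.sum(axis=0) + eps       # per-query normalizer
    return (Qf @ KV) / Z[:, None]

rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out_sa = softmax_attention(Q, K, V)
out_la = linear_attention(Q, K, V)
```

The two variants produce different (not numerically equal) outputs; the point is that the linear form never allocates the quadratic attention matrix, which is what makes the hybrid LASA design faster at long sequence lengths.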
This paper presents a compartmental Gaussian mixture model (GMM) method to trace the internal layers of the ice sheet with radio-echo sounding (RES) data. Based on the compartmentalization of the RES data, the proposed method builds a GMM, which is solved using Fuzzy C-means (FCM) and expectation maximization (EM) to obtain preliminary layer detection results. The layer boundaries are then detected by analyzing the classification results of the GMM. Experimental results show that the method can trace the layers effectively.
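To make the GMM/EM step concrete, here is a toy 1-D two-component EM fit, e.g. separating "layer" echoes from background amplitudes. It is a generic textbook EM, not the paper's compartmental formulation or its FCM initialization; the synthetic data and all parameters are assumptions.

```python
import numpy as np

def gmm_em_1d(x, n_iter=50):
    """Fit a 2-component 1-D GMM by EM; returns (means, stds, weights).

    Generic illustration of the EM step; not the compartmental method.
    """
    mu = np.array([x.min(), x.max()], float)        # crude initialization
    sigma = np.array([x.std(), x.std()]) + 1e-3
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) \
               / (sigma * np.sqrt(2 * np.pi))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the responsibilities.
        Nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk) + 1e-6
        w = Nk / len(x)
    return mu, sigma, w

rng = np.random.default_rng(1)
# Synthetic amplitudes: background near 0, layer returns near 5.
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1, 100)])
mu, sigma, w = gmm_em_1d(x)
```

In the paper's setting, the preliminary per-sample classifications from such a mixture fit are what the boundary-detection stage then analyzes.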