- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- EEG and Brain-Computer Interfaces
- Advanced Adaptive Filtering Techniques
- Infant Health and Development
- Landslides and Related Hazards
- Emotion and Mood Recognition
- Distributed and Parallel Computing Systems
- Hearing Loss and Rehabilitation
- Indoor and Outdoor Localization Technologies
- Phonetics and Phonology Research
- Cryospheric Studies and Observations
- Neural Networks and Applications
- Functional Brain Connectivity Studies
- Winter Sports Injuries and Performance
- Telecommunications and Broadcasting Technologies
Brno University of Technology
2022-2025
Chongqing University of Posts and Telecommunications
2023-2024
Shanghai Normal University
2020-2023
Tencent (China)
2021
Shandong University of Science and Technology
2018
The ConferencingSpeech 2021 challenge is proposed to stimulate research on far-field multi-channel speech enhancement for video conferencing. It consists of two separate tasks: 1) Task 1, with a single microphone array, focusing on the practical real-time requirements of real applications; and 2) Task 2, with multiple distributed microphone arrays, a non-real-time track without any constraints, so that participants can explore algorithms that obtain high speech quality. Targeting the real conferencing room application,...
In this work, we propose an error correction framework, named DiaCorrect, to refine the output of a diarization system in a simple yet effective way. This method is inspired by error correction techniques in automatic speech recognition. Our model consists of two parallel convolutional encoders and a transformer-based decoder. By exploiting the interactions between the input recording and the initial system's outputs, DiaCorrect can automatically correct the speaker activities to minimize diarization errors. Experiments on 2-speaker telephony data show...
End-to-end approaches for single-channel target speech extraction have attracted widespread attention. However, studies on the multi-channel case are still relatively limited. In this work, we propose two methods for exploiting spatial information to extract the target speech. The first one uses an adaptation layer in a parallel encoder architecture. The second designs a channel decorrelation mechanism that extracts inter-channel differential information to enhance the reference encoder representation. We compare the proposed methods with strong state-of-the-art baselines....
In recent years, a number of time-domain speech separation methods have been proposed. However, most of them are very sensitive to the acoustic environments and to wide-domain-coverage tasks. In this paper, from the time-frequency perspective, we propose a densely-connected pyramid complex convolutional network, termed DPCCN, to improve separation robustness under complicated conditions. Furthermore, we generalize DPCCN to target speech extraction (TSE) by integrating a new, specially designed speaker encoder. Moreover, we also investigate...
Target speech extraction has attracted widespread attention in recent years. In this work, we focus on investigating the dynamic interaction between different mixtures and the target speaker to exploit discriminative speaker clues. We propose a special attention mechanism, without introducing any additional parameters, in the scaling adaptation layer to better adapt the network towards extracting the target speech. Furthermore, with a mixture embedding matrix pooling method, our proposed attention-based scaling adaptation (ASA) can exploit the speaker clues in a more efficient way....
End-to-end neural diarization has evolved considerably over the past few years, but data scarcity is still a major obstacle to further improvements. Self-supervised learning methods such as WavLM have shown promising performance on several downstream tasks, but their application to speaker diarization is somehow limited. In this work, we explore using WavLM to alleviate the problem of data scarcity in diarization training. We use the same pipeline as Pyannote and improve the local end-to-end model with a Conformer. Experiments on far-field AMI, AISHELL-4, and AliMeeting...
Purpose: EEG analysis of emotions is of great significance for the diagnosis of psychological diseases and for brain-computer interface (BCI) applications. However, applications of brain neural networks to emotion classification are rarely reported, and the accuracy of emotion recognition in cross-subject tasks remains a challenge. Thus, this paper proposes to design a domain-invariant model for EEG-network-based emotion identification. Methods: A novel brain-inception-network deep learning model is proposed to extract discriminative graph features from...
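As background for the graph features mentioned above, a common generic way to build a graph from multi-channel EEG is to threshold inter-channel correlations. The sketch below is only an illustration of that idea; the paper's brain-inception-network learns its own graph features, and all names and thresholds here are assumptions.

```python
import numpy as np

def connectivity_graph(eeg, threshold=0.5):
    """Build a functional-connectivity adjacency matrix from multi-channel
    EEG by thresholding absolute Pearson correlations between channels.

    Generic illustration only; not the paper's learned graph features.
    """
    corr = np.corrcoef(eeg)                    # (channels, channels)
    adj = (np.abs(corr) >= threshold).astype(float)
    np.fill_diagonal(adj, 0.0)                 # no self-loops
    return adj

rng = np.random.default_rng(3)
base = rng.normal(size=(1, 256))
# Channels 0 and 1 share a common source; channel 2 is independent noise.
eeg = np.vstack([base + 0.1 * rng.normal(size=(1, 256)),
                 base + 0.1 * rng.normal(size=(1, 256)),
                 rng.normal(size=(1, 256))])
A = connectivity_graph(eeg)
```

Such an adjacency matrix can then feed any graph neural network layer as a fixed (non-learned) connectivity prior.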
Target speech extraction has attracted widespread attention. When microphone arrays are available, the additional spatial information can be helpful for extracting the target speech. We have recently proposed a channel decorrelation (CD) mechanism that extracts inter-channel differential information to enhance the reference encoder representation. Although it has shown promising results for extracting target speech from mixtures, its performance is still limited by the nature of the original CD theory. In this paper, we propose two methods to broaden the horizon...
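A minimal sketch of the decorrelation idea behind CD: remove from a secondary channel the component correlated with the reference channel, so only the inter-channel differential part remains. This projection-based version is a simplification for illustration; the published CD mechanism is defined differently in detail, and all shapes here are assumptions.

```python
import numpy as np

def channel_differential(ref, sec, eps=1e-8):
    """Remove from `sec` the frame-wise component correlated with `ref`,
    keeping the inter-channel differential part.

    Simplified projection-based sketch, not the exact CD formulation.
    Rows are frames, columns are encoder features.
    """
    scale = (sec * ref).sum(axis=1, keepdims=True) / (
        (ref * ref).sum(axis=1, keepdims=True) + eps)
    return sec - scale * ref                   # residual orthogonal to ref

rng = np.random.default_rng(2)
ref = rng.normal(size=(10, 16))                       # reference-channel features
sec = 0.8 * ref + 0.1 * rng.normal(size=(10, 16))     # mostly correlated channel
diff = channel_differential(ref, sec)
```

The residual `diff` carries only what the second channel adds beyond the reference, which is the kind of spatial cue a multi-channel extractor can exploit.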
Recently, supervised speech separation has made great progress. However, limited by the nature of supervised training, most existing methods require ground-truth sources and are trained on synthetic datasets. This reliance is problematic, because ground-truth signals are usually unavailable in real conditions. Moreover, in many industry scenarios, the acoustic characteristics deviate far from the simulated ones. Therefore, performance degrades significantly when applying the trained models to real applications. To address these problems,...
Speaker-attributed automatic speech recognition (ASR) in multi-speaker environments remains a significant challenge, particularly when systems conditioned on speaker embeddings fail to generalize to unseen speakers. In this work, we propose Diarization-Conditioned Whisper (DiCoW), a novel approach to target-speaker ASR that leverages diarization outputs as conditioning information. DiCoW extends the pre-trained Whisper model by integrating diarization labels directly, eliminating the reliance on speaker embeddings and reducing the need for...
SqueezeFormer has recently shown impressive performance in automatic speech recognition (ASR). However, its inference speed suffers from the quadratic complexity of softmax-attention (SA). In addition, limited by its large convolution kernel size, its local modeling ability is insufficient. In this paper, we propose a novel method, HybridFormer, to improve SqueezeFormer in a fast and efficient way. Specifically, we first incorporate linear attention (LA) into a hybrid LASA paradigm to increase the model's inference speed. Second, a neural architecture...
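The speed argument above rests on linear attention avoiding the n-by-n score matrix of softmax attention. A minimal numpy sketch of that contrast (not the HybridFormer implementation; the elu-plus-one feature map and all shapes are illustrative assumptions):

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard softmax attention: materializes an (n, n) matrix, O(n^2).
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, eps=1e-6):
    # Kernelized linear attention: phi(Q) (phi(K)^T V), O(n) in length n.
    # phi(x) = elu(x) + 1 is one common (illustrative) feature map.
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                       # (d, d_v): no n x n matrix built
    Z = Qf @ Kf.sum(axis=0) + eps       # per-query normalizer
    return (Qf @ KV) / Z[:, None]

rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out_sa = softmax_attention(Q, K, V)
out_la = linear_attention(Q, K, V)
```

The two variants produce different (not numerically equal) outputs; the point is that the linear form never allocates the quadratic attention matrix, which is what makes the hybrid LASA design faster at long sequence lengths.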
This paper presents a compartmental Gaussian mixture model (GMM) method to trace the internal layers of the ice sheet with radio-echo sounding (RES) data. Based on the compartmentalization of the RES data, the proposed method builds a GMM, which is solved using Fuzzy C-means (FCM) and expectation maximization (EM) to obtain preliminary layer detection results. The layer boundaries are then detected by analyzing the classification results of the GMM. Experimental results show that the method can trace the layers effectively.
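To make the GMM/EM step concrete, here is a toy 1-D two-component EM fit, e.g. separating "layer" echoes from background amplitudes. It is a generic textbook EM, not the paper's compartmental formulation or its FCM initialization; the synthetic data and all parameters are assumptions.

```python
import numpy as np

def gmm_em_1d(x, n_iter=50):
    """Fit a 2-component 1-D GMM by EM; returns (means, stds, weights).

    Generic illustration of the EM step; not the compartmental method.
    """
    mu = np.array([x.min(), x.max()], float)        # crude initialization
    sigma = np.array([x.std(), x.std()]) + 1e-3
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) \
               / (sigma * np.sqrt(2 * np.pi))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the responsibilities.
        Nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk) + 1e-6
        w = Nk / len(x)
    return mu, sigma, w

rng = np.random.default_rng(1)
# Synthetic amplitudes: background near 0, layer returns near 5.
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1, 100)])
mu, sigma, w = gmm_em_1d(x)
```

In the paper's setting, the preliminary per-sample classifications from such a mixture fit are what the boundary-detection stage then analyzes.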