- Speech and Audio Processing
- Music and Audio Processing
- Advanced Adaptive Filtering Techniques
- Speech Recognition and Synthesis
- Blind Source Separation Techniques
- Indoor and Outdoor Localization Technologies
- Hearing Loss and Rehabilitation
- Acoustic Wave Phenomena Research
- Underwater Acoustics Research
- Robotics and Sensor-Based Localization
- Advanced Algorithms and Applications
- Infant Health and Development
- Ultrasonics and Acoustic Wave Propagation
- Advanced Image and Video Retrieval Techniques
- Radio Wave Propagation Studies
- Face recognition and analysis
- Industrial Vision Systems and Defect Detection
- Social Robot Interaction and HRI
- QR Code Applications and Technologies
- Advanced Data Compression Techniques
- Speech and dialogue systems
- Direction-of-Arrival Estimation Techniques
- Advanced Neural Network Applications
- AI and Big Data Applications
- Data Management and Algorithms
Westlake University
2019-2025
Institute for Advanced Study
2022
Institut national de recherche en informatique et en automatique
2015-2020
Centre Inria de l'Université Grenoble Alpes
2015-2020
Zhejiang Water Conservancy and Hydropower Survey and Design Institute
2020
Université Grenoble Alpes
2019
Kingston University
2018
Directorate-General for Interpretation
2017
Peking University
2010-2013
This paper proposes a full-band and sub-band fusion model, named FullSubNet, for single-channel real-time speech enhancement. Full-band and sub-band refer to models that take full-band and sub-band noisy spectral features as input and output the corresponding full-band and sub-band speech targets, respectively. The sub-band model processes each frequency independently: its input consists of one frequency together with several context frequencies, and its output is the prediction of the clean speech target at the corresponding frequency. These two types of models have distinct characteristics. The full-band model can capture the global spectral context and long-distance cross-band dependencies. However, it lacks...
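The sub-band input described above (one target frequency plus neighbouring context frequencies, with the same model shared across all frequencies) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name and the reflect-padding choice at the spectrum edges are my own assumptions.

```python
import numpy as np

def subband_inputs(noisy_mag, n_ctx=2):
    """Build per-frequency sub-band inputs from a magnitude spectrogram.

    noisy_mag: (F, T) array. For each frequency f, the input is the band
    [f - n_ctx, f + n_ctx], so the result has shape (F, 2*n_ctx + 1, T).
    Edge frequencies are handled by reflection padding (an assumption)."""
    padded = np.pad(noisy_mag, ((n_ctx, n_ctx), (0, 0)), mode="reflect")
    F = noisy_mag.shape[0]
    return np.stack([padded[f:f + 2 * n_ctx + 1] for f in range(F)])
```

Each of the F slices would then be fed to the same network, which is what keeps the sub-band model small regardless of the FFT size.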
Speaker diarization consists of assigning speech signals to people engaged in a dialogue. An audio-visual spatiotemporal diarization model is proposed. The model is well suited for challenging scenarios that consist of several participants engaged in multi-party interaction while they move around and turn their heads towards the other participants, rather than facing the cameras and microphones. Multiple-person visual tracking is combined with multiple speech-source localization in order to tackle the speech-to-person association problem. The latter is solved within...
This work proposes a neural network, named SpatialNet, to extensively exploit spatial information for multichannel joint speech separation, denoising and dereverberation. In the short-time Fourier transform (STFT) domain, the proposed network performs end-to-end speech enhancement. It is mainly composed of interleaved narrow-band and cross-band blocks that respectively exploit narrow-band and cross-band spatial information. The narrow-band blocks process frequencies independently, and use a self-attention mechanism and temporal convolutional layers to perform spatial-feature-based speaker...
Robust and precise defect detection is of great significance in the production of high-quality printed circuit boards (PCBs). However, due to the complexity of PCB environments, most previous works still utilise traditional image processing and matching algorithms to detect defects. In this work, an improved bare-PCB defect detection approach is proposed by learning deep discriminative features, which also greatly reduces the high requirement for a large dataset typical of such methods. First, the authors extend the existing data with some artificial data via affine...
Self-supervised learning (SSL) has emerged as a popular approach for learning audio representations. One goal of audio self-supervised pre-training is to transfer knowledge to downstream tasks, generally including clip-level and frame-level tasks. While frame-level tasks are important for fine-grained acoustic scene/event understanding, prior studies primarily evaluate on clip-level tasks. In order to tackle both, this paper proposes the Audio Teacher-Student Transformer (ATST), with a clip-level version (named ATST-Clip) and a frame-level version (named ATST-Frame), respectively responsible for learning clip-level and frame-level representations,...
Sound event detection (SED) often suffers from the data deficiency problem. Recent SED systems leverage large pretrained self-supervised learning (SelfSL) models to mitigate such restriction, where the pretrained models help produce more discriminative features for SED. However, the pretrained models are regarded as a frozen feature extractor in most systems, and fine-tuning of the pretrained models has been rarely studied. In this work, we study the fine-tuning method for SED. We introduce the frame-level audio teacher-student transformer model (ATST-Frame), our newly proposed SelfSL...
Keyword spotting remains a challenge when applied to real-world environments with dramatically changing noise. In recent studies, audio-visual integration methods have demonstrated their superiority, since visual speech is not influenced by acoustic noise. However, for visual speech recognition, individual utterance mannerisms can lead to confusion and false recognition. To solve this problem, a novel lip descriptor involving both geometry-based and appearance-based features is presented in this paper. Specifically, a set of features is proposed...
In multichannel speech enhancement, both spectral and spatial information are vital for discriminating between speech and noise. How to fully exploit these two types of information and their temporal dynamics remains an interesting research problem. As a solution to this problem, this paper proposes a multi-cue fusion network named McNet, which cascades four modules to respectively exploit the full-band spatial, narrow-band spatial, sub-band spectral, and full-band spectral information. Experiments show that each module in the proposed network has its unique contribution and, as...
This paper addresses the problem of multiple-speaker localization in noisy and reverberant environments, using binaural recordings of an acoustic scene. A Gaussian mixture model (GMM) is adopted, whose components correspond to all possible candidate source locations defined on a grid. After optimizing the GMM-based objective function, given the observed set of features, both the number of sources and their locations are estimated by selecting the GMM components with the largest priors. This is achieved by enforcing a sparse solution, thus favoring a small...
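The idea of a grid of candidate locations whose mixture priors reveal the active sources can be illustrated with a tiny EM loop that updates only the component priors. This is a hypothetical sketch under simplifying assumptions (fixed, precomputed per-candidate likelihoods; no sparsity prior), not the paper's algorithm.

```python
import numpy as np

def estimate_source_priors(lik, n_iter=50):
    """EM over mixture priors only.

    lik[k, n] = p(observation n | candidate location k), assumed given.
    Returns the prior vector w; its largest entries indicate the grid
    points most likely to host active sources."""
    K, N = lik.shape
    w = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        post = w[:, None] * lik                      # unnormalised responsibilities
        post /= post.sum(axis=0, keepdims=True) + 1e-12
        w = post.mean(axis=1)                        # M-step for priors
    return w
```

Counting sources then amounts to thresholding or selecting the dominant priors, which is where the sparsity-enforcing term mentioned in the abstract would come in.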
We address the problem of online localization and tracking of multiple moving speakers in reverberant environments. This paper has the following contributions. We use the direct-path relative transfer function (DP-RTF), an interchannel feature that encodes acoustic information robust against reverberation, and we propose an algorithm well suited for estimating the DP-RTFs associated with moving audio sources. Another crucial ingredient of the proposed method is its ability to properly assign DP-RTFs to audio-source directions. Toward this...
Sound source localization (SSL) is an important technique for many audio processing systems, such as speech enhancement/recognition and human-robot interaction. Although many methods have been proposed for SSL, it still remains a challenging task to achieve accurate localization under adverse acoustic scenarios. In this paper, a novel binaural SSL method based on a time-frequency convolutional neural network (TF-CNN) with multitask learning is proposed to simultaneously localize azimuth and elevation under unknown conditions. First, the...
Personal-sound-zone (PSZ) techniques deliver independent sounds to multiple zones within a room using a loudspeaker array. The target signal for each zone should be clearly audible in that zone while inaudible or non-distracting in the others, which is assured by applying pre-filters to the loudspeaker signals. The pre-filters are traditionally designed with time-domain or frequency-domain methods, which suffer from high computational complexity and large system latency, respectively. This work proposes a subband pressure-matching method based on the short-time Fourier...
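Pressure matching itself reduces, per frequency band, to a regularised least-squares problem: find loudspeaker weights whose produced pressures at the control points match a desired pressure pattern (loud in the bright zone, near-zero in the dark zone). A minimal per-frequency sketch, with my own function name and a Tikhonov regulariser as an assumption:

```python
import numpy as np

def pressure_matching_filters(G, d, reg=1e-3):
    """Per-frequency pressure-matching filters.

    G: (F, M, L) complex transfer matrices from L loudspeakers to M control
       points at F frequencies; d: (F, M) desired pressures.
    Solves (G^H G + reg*I) q = G^H d for each frequency."""
    F, M, L = G.shape
    q = np.zeros((F, L), dtype=complex)
    for f in range(F):
        A = G[f].conj().T @ G[f] + reg * np.eye(L)
        q[f] = np.linalg.solve(A, G[f].conj().T @ d[f])
    return q
```

A subband/STFT-domain design as in the abstract would apply this kind of solve per subband, trading the long time-domain filters for short per-band ones.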
Binaural sound source localization is an important technique for speech enhancement, video conferencing, human-robot interaction, etc. However, in realistic scenarios, reverberation and environmental noise degrade the precision of direction estimation, so reliable localization is essential for practical applications. To deal with these disturbances, this paper presents a novel binaural approach based on weighting and generalized parametric mapping. First, as a preprocessing stage,...
This paper addresses the problem of relative transfer function (RTF) estimation in the presence of stationary noise. We propose an RTF identification method based on segmental power spectral density (PSD) matrix subtraction. First, the multichannel microphone signals are divided into segments corresponding to speech-plus-noise activity and noise-only periods. Then, the subtraction of the two PSD matrices leads to an almost noise-free PSD matrix, by reducing the noise component while preserving the non-stationary speech component. This is used for single...
This article proposes a deep neural network (DNN)-based direct-path relative transfer function (DP-RTF) enhancement method for robust direction of arrival (DOA) estimation in noisy and reverberant environments. The DP-RTF refers to the ratio between the direct-path acoustic transfer functions of two microphone channels. First, the complex-valued DP-RTF is decomposed into the inter-channel intensity difference and a sinusoidal function of the inter-channel phase difference in the time-frequency domain. Then, features from a series of temporal context frames are utilized to train the DNN...
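The real-valued decomposition mentioned above (intensity difference plus sine/cosine of the phase difference) is easy to make concrete. This is a hedged sketch of the feature construction only, with names of my own choosing, not the paper's network input pipeline:

```python
import numpy as np

def dprtf_features(x1, x2, eps=1e-8):
    """Per-TF-bin inter-channel features from two STFT coefficients.

    Returns [log-magnitude ratio, sin(IPD), cos(IPD)] per bin, a
    real-valued encoding of the complex inter-channel ratio x2/x1."""
    ratio = x2 / (x1 + eps)
    iid = np.log(np.abs(ratio) + eps)    # inter-channel intensity difference
    ipd = np.angle(ratio)                # inter-channel phase difference
    return np.stack([iid, np.sin(ipd), np.cos(ipd)], axis=-1)
```

Encoding the phase through sine and cosine avoids the 2*pi wrap-around discontinuity that a raw angle would present to a regression network.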
In human-robot interaction (HRI), speech sound source localization (SSL) is a convenient and efficient way to obtain the relative position between the speaker and the robot. However, implementing an SSL system based on the TDOA method encounters many problems, such as the noise of real environments, the solution of nonlinear equations, and the switch between far field and near field. In this paper, a fourth-order cumulant spectrum is derived, yielding a time delay estimation (TDE) algorithm that is available for speech signals and immune to spatially correlated Gaussian...
This paper addresses the problem of multiple sound source counting and localization in adverse acoustic environments, using microphone array recordings. The proposed time-frequency (TF) wise spatial spectrum clustering based method contains two stages. First, given the received sensor signals, the correlation matrix is computed and denoised in the TF domain. The TF-wise spatial spectrum is estimated based on the signal subspace information, and further enhanced by an exponential transform, which can increase the reliability of the presence possibility...
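A minimal sketch of a TF-wise spatial spectrum from signal-subspace information, sharpened by an exponential transform: project candidate steering vectors onto the principal eigenvector of the (denoised) correlation matrix. The function name, the single-eigenvector subspace, and the specific sharpening form are my assumptions, not the paper's exact formulation.

```python
import numpy as np

def tf_spatial_spectrum(R, steer, beta=5.0):
    """Spatial spectrum at one TF bin.

    R: (M, M) Hermitian correlation matrix; steer: (K, M) candidate
    steering vectors. The normalised projection onto the principal
    eigenvector peaks at the true direction; np.exp sharpens the peaks."""
    vals, vecs = np.linalg.eigh(R)
    u = vecs[:, -1]                                   # signal-subspace direction
    proj = np.abs(steer.conj() @ u) ** 2
    spec = proj / (np.linalg.norm(steer, axis=1) ** 2 + 1e-12)
    return np.exp(beta * spec / (spec.max() + 1e-12))  # exponential sharpening
```

Clustering such per-bin spectra (or their peak directions) over many TF bins is what then provides the joint source count and locations.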
The direct-path relative transfer function (DP-RTF) refers to the ratio between the direct-path acoustic transfer functions of two microphone channels. Though the DP-RTF fully encodes the sound spatial cues and serves as a reliable localization feature, it is often erroneously estimated in the presence of noise and reverberation. This paper proposes to learn the DP-RTF with deep neural networks for robust binaural sound source localization. A learning network is designed to regress the sensor signals to a real-valued representation of the DP-RTF. It consists of a branched...
Self-supervised learning (SSL) learns knowledge from a large amount of unlabeled data, and then transfers the knowledge to a specific problem with a limited number of labeled data. SSL has achieved promising results in various domains. This work addresses segment-level general audio SSL, and proposes a new transformer-based teacher-student model, named ATST. A transformer encoder is developed on a recently emerged teacher-student baseline scheme, which largely improves the modeling capability of pre-training. In addition, a strategy for...
Estimating the noise power spectral density (PSD) is essential for single-channel speech enhancement algorithms. In this paper, we propose a noise PSD estimation approach based on regional statistics. The proposed regional statistics consist of four features representing the statistics of the past and present periodograms in a short-time period. We show that these features are efficient in characterizing the statistical difference between the noise PSD and the noisy-speech PSD, and we therefore propose to use them for estimating the speech presence probability (SPP). The noise PSD is recursively estimated by averaging past spectral values...
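The recursive SPP-gated update can be sketched as below. This is a generic MMSE-SPP-style recursion under a fixed a-priori SNR assumption; it stands in for, and does not reproduce, the paper's regional-statistics features.

```python
import numpy as np

def estimate_noise_psd(periodograms, alpha=0.8, xi=10.0):
    """Recursive noise PSD estimate over frames.

    periodograms: (T, F) noisy periodograms. Per frame, a speech presence
    probability (SPP) gates how much the current periodogram updates the
    noise estimate: updates are proportional to speech absence (1 - SPP).
    xi is an assumed fixed a-priori SNR of speech when present."""
    noise = periodograms[0].copy()
    for t in range(1, len(periodograms)):
        snr = periodograms[t] / (noise + 1e-12)
        # SPP under a Gaussian model with prior speech probability 0.5
        spp = 1.0 / (1.0 + (1.0 + xi) * np.exp(-snr * xi / (1.0 + xi)))
        target = spp * noise + (1.0 - spp) * periodograms[t]
        noise = alpha * noise + (1.0 - alpha) * target
    return noise
```

The paper's contribution sits in the SPP itself: replacing the fixed-SNR gate here with features computed over a short-time region of past and present periodograms.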
This paper addresses the problem of binaural localization of a single speech source in noisy and reverberant environments. For a given microphone setup, the response corresponding to the direct-path propagation is a function of the source direction. In practice, this response is contaminated by noise and reverberation. The direct-path relative transfer function (DP-RTF) is defined as the ratio between the direct-path acoustic transfer functions of the two channels. We propose a method to estimate the DP-RTF from the microphone signals in the short-time Fourier transform domain. First, a convolutive transfer function approximation is adopted to accurately...
This paper proposes a delayed subband LSTM network for online monaural (single-channel) speech enhancement. The proposed method is developed in the short-time Fourier transform (STFT) domain. Online processing requires frame-by-frame signal reception and processing. A paramount feature of the proposed method is that the same LSTM is used across frequencies, which drastically reduces the number of parameters, the amount of training data and the computational burden. Training is performed in a subband manner: the input consists of one frequency, together with some context...
Dynamic objects in the environment, such as people and other agents, lead to challenges for existing simultaneous localization and mapping (SLAM) approaches. To deal with dynamic environments, computer vision researchers usually apply learning-based object detectors to remove these objects. However, such detectors are computationally too expensive for mobile-robot on-board processing. In practical applications, dynamic objects output noisy sounds that can be effectively detected by sound source localization. The directional...
Sound event detection (SED), as a core module of acoustic environmental analysis, suffers from the problem of data deficiency. The integration of semi-supervised learning (SSL) largely mitigates this problem. This paper researches several modules of SSL, and introduces a random consistency training (RCT) strategy. First, a hard mixup augmentation is proposed to account for the additive property of sounds. Second, a random augmentation scheme is applied to stochastically combine different types of augmentation methods with high flexibility. Third,...
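The "additive property of sounds" behind hard mixup can be shown in a few lines: overlapping clips are summed as waveforms, while their multi-hot event labels are combined by union, since both events remain present. This is my own minimal reading of the idea (function name and the unweighted sum are assumptions), not the RCT implementation.

```python
import numpy as np

def hard_mixup(x1, y1, x2, y2):
    """Mix two audio clips for SED augmentation.

    Waveforms are added directly (sounds superpose additively) and the
    multi-hot labels take the elementwise maximum (union): an event active
    in either clip is active in the mixture."""
    return x1 + x2, np.maximum(y1, y2)
```

Contrast this with standard mixup, which interpolates both inputs and labels with a random weight; for overlapping sound events a soft label would understate that both classes are fully present.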