Xiaofei Li

ORCID: 0000-0003-0393-9905
Research Areas
  • Speech and Audio Processing
  • Music and Audio Processing
  • Advanced Adaptive Filtering Techniques
  • Speech Recognition and Synthesis
  • Blind Source Separation Techniques
  • Indoor and Outdoor Localization Technologies
  • Hearing Loss and Rehabilitation
  • Acoustic Wave Phenomena Research
  • Underwater Acoustics Research
  • Robotics and Sensor-Based Localization
  • Advanced Algorithms and Applications
  • Infant Health and Development
  • Ultrasonics and Acoustic Wave Propagation
  • Advanced Image and Video Retrieval Techniques
  • Radio Wave Propagation Studies
  • Face recognition and analysis
  • Industrial Vision Systems and Defect Detection
  • Social Robot Interaction and HRI
  • QR Code Applications and Technologies
  • Advanced Data Compression Techniques
  • Speech and dialogue systems
  • Direction-of-Arrival Estimation Techniques
  • Advanced Neural Network Applications
  • AI and Big Data Applications
  • Data Management and Algorithms

Westlake University
2019-2025

Institute for Advanced Study
2022

Institut national de recherche en informatique et en automatique
2015-2020

Centre Inria de l'Université Grenoble Alpes
2015-2020

Zhejiang Water Conservancy and Hydropower Survey and Design Institute
2020

Université Grenoble Alpes
2019

Kingston University
2018

Directorate-General for Interpretation
2017

Peking University
2010-2013

This paper proposes a full-band and sub-band fusion model, named FullSubNet, for single-channel real-time speech enhancement. Full-band and sub-band refer to models that input the full-band and sub-band noisy spectral features and output the full-band and sub-band speech targets, respectively. The sub-band model processes each frequency independently. Its input consists of one frequency and several context frequencies. Its output is the prediction of the clean speech target for the corresponding frequency. These two types of models have distinct characteristics. The full-band model can capture the global spectral context and long-distance cross-band dependencies. However, it lacks...

10.1109/icassp39728.2021.9414177 article EN ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021-05-13
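The sub-band processing described above can be illustrated with a minimal numpy sketch. The function name `subband_inputs` and the context width `n_ctx` are illustrative choices, not identifiers from the paper; a real FullSubNet implementation would feed these per-frequency sequences to a learned model.

```python
import numpy as np

def subband_inputs(mag_spec, n_ctx=2):
    """Per-frequency sub-band inputs: each frequency bin plus n_ctx
    neighboring bins on each side (edge-padded at the borders).
    mag_spec: (frames, freqs) noisy magnitude spectrogram."""
    padded = np.pad(mag_spec, ((0, 0), (n_ctx, n_ctx)), mode="edge")
    n_freq = mag_spec.shape[1]
    # one (frames, 2*n_ctx + 1) sequence per frequency bin
    return [padded[:, f:f + 2 * n_ctx + 1] for f in range(n_freq)]

spec = np.abs(np.random.randn(100, 257))  # e.g. 257 STFT bins
units = subband_inputs(spec)
```

Each list entry is the input for one frequency's sub-band model; the center column of entry `f` is frequency `f` itself.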

Speaker diarization consists of assigning speech signals to the people engaged in a dialogue. An audio-visual spatiotemporal diarization model is proposed. The model is well suited for challenging scenarios that consist of several participants in multi-party interaction, where they move around and turn their heads towards one another rather than facing the cameras and microphones. Multiple-person visual tracking is combined with multiple speech-source localization in order to tackle the speech-to-person association problem. The latter is solved within...

10.1109/tpami.2017.2648793 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2017-01-05

This work proposes a neural network, named SpatialNet, that extensively exploits spatial information for multichannel joint speech separation, denoising and dereverberation. In the short-time Fourier transform (STFT) domain, the proposed network performs end-to-end speech enhancement. It is mainly composed of interleaved narrow-band and cross-band blocks that respectively exploit narrow-band and cross-band spatial information. The narrow-band blocks process frequencies independently, and use a self-attention mechanism and temporal convolutional layers to perform spatial-feature-based speaker...

10.1109/taslp.2024.3357036 article EN cc-by IEEE/ACM Transactions on Audio Speech and Language Processing 2024-01-01

Robust and precise defect detection is of great significance in the production of high-quality printed circuit boards (PCBs). However, due to the complexity of PCB environments, most previous works still utilise traditional image processing and matching algorithms to detect defects. In this work, an improved bare-PCB defect detection approach is proposed by learning deep discriminative features, which also greatly reduces the high requirement of a large dataset for deep learning methods. First, the authors extend the existing dataset with some artificial data through affine...

10.1049/joe.2018.8275 article EN cc-by-nc-nd The Journal of Engineering 2018-08-18

Self-supervised learning (SSL) has emerged as a popular approach for learning audio representations. One goal of audio self-supervised pre-training is to transfer knowledge to downstream tasks, generally including clip-level and frame-level tasks. While frame-level tasks are important for fine-grained acoustic scene/event understanding, prior studies primarily evaluate on clip-level tasks. In order to tackle both, this paper proposes the Audio Teacher-Student Transformer (ATST), with a clip-level version (named ATST-Clip) and a frame-level version (named ATST-Frame), responsible for learning clip-level and frame-level representations,...

10.1109/taslp.2024.3352248 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2024-01-01

Sound event detection (SED) often suffers from the data deficiency problem. Recent SED systems leverage large pretrained self-supervised learning (SelfSL) models to mitigate such restriction, where the models help produce more discriminative features for SED. However, the SelfSL models are regarded as a frozen feature extractor in most systems, and fine-tuning of these models has been rarely studied. In this work, we study the fine-tuning method of SelfSL models for SED. We introduce the frame-level audio teacher-student transformer model (ATST-Frame), our newly proposed SelfSL...

10.1109/icassp48485.2024.10446159 article EN ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024-03-18

Keyword spotting remains a challenge when applied to real-world environments with dramatically changing noise. In recent studies, audio-visual integration methods have demonstrated superiority, since visual speech is not influenced by acoustic noise. However, for visual recognition, individual utterance mannerisms can lead to confusion and false recognition. To solve this problem, a novel lip descriptor is presented in this paper, involving both geometry-based and appearance-based features. Specifically, a set of proposed...

10.1109/tmm.2016.2520091 article EN IEEE Transactions on Multimedia 2016-01-21

In multichannel speech enhancement, both spectral and spatial information are vital for discriminating between speech and noise. How to fully exploit these two types of information and their temporal dynamics remains an interesting research problem. As a solution to this problem, this paper proposes a multi-cue fusion network named McNet, which cascades four modules to respectively exploit the full-band spatial, narrow-band spatial, sub-band spectral and full-band spectral information. Experiments show that each module in the proposed network has its unique contribution and, as...

10.1109/icassp49357.2023.10095509 article EN ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023-05-05

This paper addresses the problem of multiple-speaker localization in noisy and reverberant environments, using binaural recordings of an acoustic scene. A Gaussian mixture model (GMM) is adopted, whose components correspond to all the possible candidate source locations defined on a grid. After optimizing the GMM-based objective function, given the observed set of features, both the number of sources and their locations are estimated by selecting the GMM components with the largest priors. This is achieved by enforcing a sparse solution, thus favoring a small...

10.1109/taslp.2017.2740001 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2017-08-14
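The grid-based GMM idea above can be sketched in a few lines of numpy: the component means are pinned to the candidate locations and only the priors are re-estimated by EM, after which the largest-prior components indicate the number and positions of the sources. This is a 1-D toy under assumed names (`grid_gmm_priors`, a fixed `sigma`, a 0.1 prior threshold), not the paper's actual feature model or sparsity mechanism.

```python
import numpy as np

def grid_gmm_priors(feats, grid, sigma, n_iter=50):
    """EM for a 1-D GMM whose component means are pinned to the
    candidate grid; only the component priors are re-estimated."""
    priors = np.full(len(grid), 1.0 / len(grid))
    for _ in range(n_iter):
        lik = np.exp(-0.5 * ((feats[:, None] - grid[None, :]) / sigma) ** 2)
        post = priors * lik
        post /= post.sum(axis=1, keepdims=True)  # E-step: responsibilities
        priors = post.mean(axis=0)               # M-step: priors only
    return priors

grid = np.linspace(-90, 90, 37)                  # candidate azimuths (deg)
rng = np.random.default_rng(0)
feats = np.concatenate([rng.normal(-30, 2, 300), rng.normal(40, 2, 300)])
priors = grid_gmm_priors(feats, grid, sigma=5.0)
sources = grid[priors > 0.1]                     # largest-prior components
```

With two clusters of direction features, the surviving components concentrate near the two true azimuths.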

We address the problem of online localization and tracking of multiple moving speakers in reverberant environments. This paper has the following contributions. We use the direct-path relative transfer function (DP-RTF), an interchannel feature that encodes acoustic information robust against reverberation, and we propose an algorithm well suited for estimating the DP-RTFs associated with moving audio sources. Another crucial ingredient of the proposed method is its ability to properly assign the estimated DP-RTFs to audio-source directions. Toward this...

10.1109/jstsp.2019.2903472 article EN IEEE Journal of Selected Topics in Signal Processing 2019-03-01

Sound source localization (SSL) is an important technique for many audio processing systems, such as speech enhancement/recognition and human-robot interaction. Although many methods have been proposed for SSL, it still remains a challenging task to achieve accurate localization under adverse acoustic scenarios. In this paper, a novel binaural SSL method based on a time-frequency convolutional neural network (TF-CNN) with multitask learning is proposed to simultaneously localize azimuth and elevation under unknown acoustic conditions. First, the...

10.1109/access.2019.2905617 article EN cc-by-nc-nd IEEE Access 2019-01-01

Personal sound zone (PSZ) techniques deliver independent sounds to multiple zones within a room using a loudspeaker array. The target signal for each zone should be clearly audible in that zone while inaudible or non-distracting in the others, which is assured by applying pre-filters to the signals. The pre-filters are traditionally designed with time-domain or frequency-domain methods, which suffer from high computational complexity and large system latency, respectively. This work proposes a subband pressure-matching method based on the short-time Fourier...

10.1121/10.0035578 article EN The Journal of the Acoustical Society of America 2025-02-01
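The pressure-matching design behind PSZ filters can be sketched as a regularized least-squares problem at a single frequency. This is a generic textbook formulation, not the paper's subband method; the names (`pressure_matching_filter`, `reg`) and the random transfer matrix are illustrative assumptions.

```python
import numpy as np

def pressure_matching_filter(G, p_target, reg=1e-3):
    """One-frequency pressure matching: regularized least squares for
    loudspeaker weights w minimizing ||G w - p_target||^2 + reg ||w||^2.
    G: (n_control_points, n_loudspeakers) acoustic transfer matrix."""
    n_ls = G.shape[1]
    A = G.conj().T @ G + reg * np.eye(n_ls)
    return np.linalg.solve(A, G.conj().T @ p_target)

rng = np.random.default_rng(1)
G = rng.standard_normal((6, 4)) + 1j * rng.standard_normal((6, 4))
# bright-zone points want unit pressure; dark-zone points want (near) silence
p_target = np.array([1, 1, 1, 0, 0, 0], dtype=complex)
w = pressure_matching_filter(G, p_target)
```

In practice one such filter is designed per frequency bin and applied to the loudspeaker signals.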

Binaural sound source localization is an important technique for speech enhancement, video conferencing, human-robot interaction, etc. However, in realistic scenarios, reverberation and environmental noise would degrade the precision of direction estimation. Therefore, reliable localization is essential to practical applications. To deal with these disturbances, this paper presents a novel binaural localization approach based on a weighting generalized parametric mapping. First, as a preprocessing stage, ... used separately...

10.1109/taslp.2017.2703650 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2017-05-15

This paper addresses the problem of relative transfer function (RTF) estimation in the presence of stationary noise. We propose an RTF identification method based on segmental power spectral density (PSD) matrix subtraction. First, the multichannel microphone signals are divided into segments corresponding to speech-plus-noise activity and noise-only periods. Then, the subtraction of the two PSD matrices leads to an almost noise-free speech PSD matrix, by reducing the noise component and preserving the non-stationary speech component. This is used for single...

10.1109/icassp.2015.7177983 preprint EN 2015-04-01
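The PSD-matrix subtraction step can be sketched for a single frequency bin. The simulation below (two channels, a known relative transfer function `a_true`, stationary noise) is a toy setup of my own, not data from the paper; `rtf_from_psd_subtraction` is an assumed name.

```python
import numpy as np

def rtf_from_psd_subtraction(X_sn, X_n):
    """Single-frequency RTF estimate via segmental PSD-matrix subtraction.
    X_sn: (channels, frames) STFT frames with speech-plus-noise activity;
    X_n:  (channels, frames) noise-only frames. For stationary noise the
    noise PSD matrix is (nearly) equal in both, so subtracting removes it."""
    psd_sn = X_sn @ X_sn.conj().T / X_sn.shape[1]
    psd_n = X_n @ X_n.conj().T / X_n.shape[1]
    psd_s = psd_sn - psd_n              # almost noise-free speech PSD matrix
    return psd_s[1, 0] / psd_s[0, 0]    # RTF of channel 1 w.r.t. channel 0

rng = np.random.default_rng(2)
a_true = 0.8 * np.exp(1j * 0.5)         # true relative transfer function
s = rng.standard_normal(20000) + 1j * rng.standard_normal(20000)
noise = lambda: 0.3 * (rng.standard_normal((2, 20000))
                       + 1j * rng.standard_normal((2, 20000)))
X_sn = np.vstack([s, a_true * s]) + noise()
rtf = rtf_from_psd_subtraction(X_sn, noise())
```

With enough frames, the estimate converges to the true RTF despite the noise.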

This article proposes a deep neural network (DNN)-based direct-path relative transfer function (DP-RTF) enhancement method for robust direction of arrival (DOA) estimation in noisy and reverberant environments. The DP-RTF refers to the ratio between the direct-path acoustic transfer functions of two microphone channels. First, the complex-valued DP-RTF is decomposed into the inter-channel intensity difference and a sinusoidal representation of the inter-channel phase difference in the time-frequency domain. Then, the features from a series of temporal context frames are utilized to train the DNN...

10.1049/cit2.12024 article EN cc-by-nc-nd CAAI Transactions on Intelligence Technology 2021-04-14
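The decomposition of a complex RTF into intensity- and phase-difference features can be sketched directly. The function name `dprtf_features` and the small epsilon are my own choices; this shows only the generic feature split, not the paper's full enhancement pipeline.

```python
import numpy as np

def dprtf_features(rtf):
    """Decompose a complex RTF value into the inter-channel intensity
    difference (log-magnitude ratio) and a sinusoidal representation of
    the inter-channel phase difference (its sine and cosine)."""
    iid = np.log(np.abs(rtf) + 1e-12)   # intensity difference
    ipd = np.angle(rtf)                 # phase difference
    return iid, np.sin(ipd), np.cos(ipd)

iid, s_ipd, c_ipd = dprtf_features(2.0 * np.exp(1j * np.pi / 3))
```

Using sine and cosine avoids the 2π wrapping discontinuity of the raw phase.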

In human-robot interaction (HRI), speech sound source localization (SSL) is a convenient and efficient way to obtain the relative position between a speaker and a robot. However, implementing an SSL system based on the TDOA method encounters many problems, such as the noise of real environments, the solution of nonlinear equations, and the switch between far field and near field. In this paper, a fourth-order cumulant spectrum is derived, from which a time delay estimation (TDE) algorithm is obtained that is available for speech signals and immune to spatially correlated Gaussian...

10.1109/tsmcb.2012.2226443 article EN IEEE Transactions on Cybernetics 2012-11-16

This paper addresses the problem of multiple sound source counting and localization in adverse acoustic environments, using microphone array recordings. The proposed time-frequency (TF) wise spatial spectrum clustering based method contains two stages. First, given the received sensor signals, the correlation matrix is computed and denoised in the TF domain. The TF-wise spatial spectrum is estimated based on the signal subspace information, and further enhanced by an exponential transform, which can increase the reliability of the source presence possibility...

10.1109/taslp.2019.2915785 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2019-05-10

The direct-path relative transfer function (DP-RTF) refers to the ratio between the direct-path acoustic transfer functions of two microphone channels. Though the DP-RTF fully encodes the sound spatial cues and serves as a reliable localization feature, it is often erroneously estimated in the presence of noise and reverberation. This paper proposes to learn the DP-RTF with deep neural networks for robust binaural sound source localization. A learning network is designed to regress the sensor signals to a real-valued representation of the DP-RTF. It consists of a branched...

10.1109/taslp.2021.3120641 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2021-01-01

Self-supervised learning (SSL) learns knowledge from a large amount of unlabeled data, and then transfers the knowledge to a specific problem with a limited number of labeled data. SSL has achieved promising results in various domains. This work addresses segment-level general audio SSL, and proposes a new transformer-based teacher-student model, named ATST. A transformer encoder is developed on a recently emerged teacher-student baseline scheme, which largely improves the modeling capability of pre-training. In addition, a strategy for...

10.21437/interspeech.2022-10126 article EN Interspeech 2022 2022-09-16

Estimating the noise power spectral density (PSD) is essential for single-channel speech enhancement algorithms. In this paper, we propose a noise PSD estimation approach based on regional statistics. The proposed regional statistics consist of four features representing the past and present periodograms in a short-time period. We show that these features are efficient in characterizing the statistical difference between noise PSD and noisy-speech PSD, and we therefore propose to use them for estimating the speech presence probability (SPP). The noise PSD is recursively estimated by averaging values...

10.1109/icassp.2016.7471661 preprint EN 2016-03-01
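The SPP-gated recursive averaging mentioned above follows a standard pattern that can be sketched briefly. The update rule below is the generic textbook form (as in MMSE/SPP-based noise trackers), not the paper's specific estimator; `update_noise_psd` and `alpha` are assumed names.

```python
import numpy as np

def update_noise_psd(noise_psd, periodogram, spp, alpha=0.9):
    """SPP-gated recursive averaging of the noise PSD: where speech is
    likely present (spp near 1) the previous estimate is kept; where
    speech is likely absent (spp near 0) the periodogram is averaged in."""
    alpha_eff = alpha + (1.0 - alpha) * spp
    return alpha_eff * noise_psd + (1.0 - alpha_eff) * periodogram

# toy run: constant noise floor of 1.0, speech absent (spp = 0)
est = np.zeros(4)
for _ in range(200):
    est = update_noise_psd(est, np.ones(4), spp=np.zeros(4))
```

With speech absent the estimate converges to the noise floor; with `spp = 1` the previous estimate is held unchanged.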

This paper addresses the problem of binaural localization of a single speech source in noisy and reverberant environments. For a given binaural microphone setup, the response corresponding to the direct-path propagation is a function of the source direction. In practice, this response is contaminated by noise and reverberations. The direct-path relative transfer function (DP-RTF) is defined as the ratio between the direct-path acoustic transfer functions of the two channels. We propose a method to estimate the DP-RTF from the microphone signals in the short-time Fourier transform domain. First, a convolutive transfer function approximation is adopted to accurately...

10.1109/taslp.2016.2598319 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2016-08-04

This paper proposes a delayed subband LSTM network for online monaural (single-channel) speech enhancement. The proposed method is developed in the short-time Fourier transform (STFT) domain. Online processing requires frame-by-frame signal reception and processing. A paramount feature of the proposed method is that the same LSTM is used across frequencies, which drastically reduces the number of network parameters, the amount of training data and the computational burden. Training is performed in a subband manner: the input consists of one frequency, together with some context...

10.21437/interspeech.2020-2091 article EN Interspeech 2020 2020-10-25
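The weight-sharing trick above (one LSTM for all frequencies) is typically realized by folding the frequency axis into the batch axis so every sub-band sequence passes through the same model. A minimal numpy sketch of that reshape, with an assumed name `fold_freq_into_batch`:

```python
import numpy as np

def fold_freq_into_batch(x):
    """Fold the frequency axis into the batch axis so one shared,
    frequency-agnostic sequence model processes every sub-band with the
    same weights: (batch, freq, time, feat) -> (batch*freq, time, feat)."""
    b, f, t, d = x.shape
    return x.reshape(b * f, t, d)

x = np.random.randn(2, 257, 100, 5)   # 2 utterances, 257 bins, 100 frames
folded = fold_freq_into_batch(x)
```

The model size is then independent of the number of frequency bins, which is what drastically reduces the parameter count.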

Dynamic objects in the environment, such as people and other agents, lead to challenges for existing simultaneous localization and mapping (SLAM) approaches. To deal with dynamic environments, computer vision researchers usually apply some learning-based object detectors to remove these objects. However, such detectors are computationally too expensive for mobile robot on-board processing. In practical applications, dynamic objects output noisy sounds that can be effectively detected by sound source localization. The directional...

10.1109/iros51168.2021.9636585 article EN 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2021-09-27

Sound event detection (SED), as a core module of acoustic environmental analysis, suffers from the problem of data deficiency. The integration of semi-supervised learning (SSL) largely mitigates such a problem. This paper researches several modules of SSL, and introduces a random consistency training (RCT) strategy. First, a hard mixup data augmentation is proposed to account for the additive property of sounds. Second, a random augmentation scheme is applied to stochastically combine different types of augmentation methods with high flexibility. Third,...

10.21437/interspeech.2022-10037 article EN Interspeech 2022 2022-09-16
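The hard mixup idea can be sketched as follows: waveforms are mixed additively because sounds superpose, while the multi-hot event labels are unioned ("hard" targets) rather than linearly interpolated. The function name, the Beta(0.5, 0.5) mixing coefficient and the union rule here are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import numpy as np

def hard_mixup(x1, y1, x2, y2, rng):
    """Mixup adapted to sound: waveforms are mixed additively (sounds
    superpose), and multi-hot event labels are unioned instead of being
    interpolated, since a mixed event is still fully present."""
    lam = rng.beta(0.5, 0.5)
    x = lam * x1 + (1.0 - lam) * x2
    y = np.maximum(y1, y2)  # event present if it occurs in either clip
    return x, y

rng = np.random.default_rng(3)
x, y = hard_mixup(np.ones(8), np.array([1, 0, 0]),
                  -np.ones(8), np.array([0, 1, 0]), rng)
```

Unlike standard mixup, the targets stay binary, which matches the multi-label detection objective of SED.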