- Speech and Audio Processing
- Music and Audio Processing
- Advanced Adaptive Filtering Techniques
- Speech Recognition and Synthesis
- Blind Source Separation Techniques
- Indoor and Outdoor Localization Technologies
- Hearing Loss and Rehabilitation
- Acoustic Wave Phenomena Research
- Underwater Acoustics Research
- Robotics and Sensor-Based Localization
- Advanced Algorithms and Applications
- Infant Health and Development
- Ultrasonics and Acoustic Wave Propagation
- Advanced Image and Video Retrieval Techniques
- Radio Wave Propagation Studies
- Face recognition and analysis
- Industrial Vision Systems and Defect Detection
- Social Robot Interaction and HRI
- QR Code Applications and Technologies
- Advanced Data Compression Techniques
- Speech and dialogue systems
- Direction-of-Arrival Estimation Techniques
- Advanced Neural Network Applications
- AI and Big Data Applications
- Data Management and Algorithms
Westlake University
2019-2025
Institute for Advanced Study
2022
Institut national de recherche en informatique et en automatique
2015-2020
Centre Inria de l'Université Grenoble Alpes
2015-2020
Zhejiang Water Conservancy and Hydropower Survey and Design Institute
2020
Université Grenoble Alpes
2019
Kingston University
2018
Directorate-General for Interpretation
2017
Peking University
2010-2013
This paper proposes a full-band and sub-band fusion model, named FullSubNet, for single-channel real-time speech enhancement. Full-band and sub-band refer to models that take full-band and sub-band noisy spectral features as input and output the corresponding full-band and sub-band speech targets, respectively. The sub-band model processes each frequency independently: its input consists of one frequency together with several context frequencies, and its output is the prediction of the clean speech target at the corresponding frequency. These two types of models have distinct characteristics. The full-band model can capture the global spectral context and long-distance cross-band dependencies. However, it lacks...
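The sub-band input described above (one target frequency plus neighbouring context frequencies, with the same model shared across all frequencies) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name and the reflect-padding choice at the spectrum edges are my own assumptions.

```python
import numpy as np

def subband_inputs(noisy_mag, n_ctx=2):
    """Build per-frequency sub-band inputs from a magnitude spectrogram.

    noisy_mag: (F, T) array. For each frequency f, the input is the band
    [f - n_ctx, f + n_ctx], so the result has shape (F, 2*n_ctx + 1, T).
    Edge frequencies are handled by reflection padding (an assumption)."""
    padded = np.pad(noisy_mag, ((n_ctx, n_ctx), (0, 0)), mode="reflect")
    F = noisy_mag.shape[0]
    return np.stack([padded[f:f + 2 * n_ctx + 1] for f in range(F)])
```

Each of the F slices would then be fed to the same network, which is what keeps the sub-band model small regardless of the FFT size.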
Speaker diarization consists of assigning speech signals to people engaged in a dialogue. An audio-visual spatiotemporal diarization model is proposed. The model is well suited for challenging scenarios that consist of several participants engaged in multi-party interaction while they move around and turn their heads towards the other participants, rather than facing the cameras and microphones. Multiple-person visual tracking is combined with multiple speech-source localization in order to tackle the speech-to-person association problem. The latter is solved within...
This work proposes a neural network, named SpatialNet, to extensively exploit spatial information for multichannel joint speech separation, denoising and dereverberation. In the short-time Fourier transform (STFT) domain, the proposed network performs end-to-end speech enhancement. It is mainly composed of interleaved narrow-band and cross-band blocks that respectively exploit narrow-band and cross-band spatial information. The narrow-band blocks process frequencies independently, and use a self-attention mechanism and temporal convolutional layers to perform spatial-feature-based speaker...
Robust and precise defect detection is of great significance in the production of high-quality printed circuit boards (PCBs). However, due to the complexity of PCB environments, most previous works still utilise traditional image processing and matching algorithms to detect defects. In this work, an improved bare-PCB defect detection approach is proposed by learning deep discriminative features, which also greatly reduces the high requirement for a large dataset typical of such methods. First, the authors extend the existing data with some artificial data via affine...
Self-supervised learning (SSL) has emerged as a popular approach for learning audio representations. One goal of audio self-supervised pre-training is to transfer knowledge to downstream tasks, generally including clip-level and frame-level tasks. While frame-level tasks are important for fine-grained acoustic scene/event understanding, prior studies primarily evaluate on clip-level tasks. In order to tackle both, this paper proposes the Audio Teacher-Student Transformer (ATST), with a clip-level version (named ATST-Clip) and a frame-level version (named ATST-Frame), respectively responsible for learning clip-level and frame-level representations,...
Sound event detection (SED) often suffers from the data deficiency problem. Recent SED systems leverage large pretrained self-supervised learning (SelfSL) models to mitigate such restriction, where the pretrained models help produce more discriminative features for SED. However, the pretrained models are regarded as a frozen feature extractor in most systems, and fine-tuning of the pretrained models has been rarely studied. In this work, we study the fine-tuning method for SED. We introduce the frame-level audio teacher-student transformer model (ATST-Frame), our newly proposed SelfSL...
Keyword spotting remains a challenge when applied to real-world environments with dramatically changing noise. In recent studies, audio-visual integration methods have demonstrated their superiority, since visual speech is not influenced by acoustic noise. However, for visual speech recognition, individual utterance mannerisms can lead to confusion and false recognition. To solve this problem, a novel lip descriptor involving both geometry-based and appearance-based features is presented in this paper. Specifically, a set of features is proposed...
In multichannel speech enhancement, both spectral and spatial information are vital for discriminating between speech and noise. How to fully exploit these two types of information and their temporal dynamics remains an interesting research problem. As a solution to this problem, this paper proposes a multi-cue fusion network named McNet, which cascades four modules to respectively exploit the full-band spatial, narrow-band spatial, sub-band spectral, and full-band spectral information. Experiments show that each module in the proposed network has its unique contribution and, as...
This paper addresses the problem of multiple-speaker localization in noisy and reverberant environments, using binaural recordings of an acoustic scene. A Gaussian mixture model (GMM) is adopted, whose components correspond to all possible candidate source locations defined on a grid. After optimizing the GMM-based objective function, given the observed set of features, both the number of sources and their locations are estimated by selecting the GMM components with the largest priors. This is achieved by enforcing a sparse solution, thus favoring a small...
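The idea of a grid of candidate locations whose mixture priors reveal the active sources can be illustrated with a tiny EM loop that updates only the component priors. This is a hypothetical sketch under simplifying assumptions (fixed, precomputed per-candidate likelihoods; no sparsity prior), not the paper's algorithm.

```python
import numpy as np

def estimate_source_priors(lik, n_iter=50):
    """EM over mixture priors only.

    lik[k, n] = p(observation n | candidate location k), assumed given.
    Returns the prior vector w; its largest entries indicate the grid
    points most likely to host active sources."""
    K, N = lik.shape
    w = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        post = w[:, None] * lik                      # unnormalised responsibilities
        post /= post.sum(axis=0, keepdims=True) + 1e-12
        w = post.mean(axis=1)                        # M-step for priors
    return w
```

Counting sources then amounts to thresholding or selecting the dominant priors, which is where the sparsity-enforcing term mentioned in the abstract would come in.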
We address the problem of online localization and tracking of multiple moving speakers in reverberant environments. This paper has the following contributions. We use the direct-path relative transfer function (DP-RTF), an interchannel feature that encodes acoustic information robust against reverberation, and we propose an algorithm well suited for estimating the DP-RTFs associated with moving audio sources. Another crucial ingredient of the proposed method is its ability to properly assign DP-RTFs to audio-source directions. Toward this...
Sound source localization (SSL) is an important technique for many audio processing systems, such as speech enhancement/recognition and human-robot interaction. Although many methods have been proposed for SSL, it still remains a challenging task to achieve accurate localization under adverse acoustic scenarios. In this paper, a novel binaural SSL method based on a time-frequency convolutional neural network (TF-CNN) with multitask learning is proposed to simultaneously localize azimuth and elevation under unknown conditions. First, the...
Personal-sound-zone (PSZ) techniques deliver independent sounds to multiple zones within a room using a loudspeaker array. The target signal for each zone should be clearly audible in that zone while inaudible or non-distracting in the others, which is assured by applying pre-filters to the loudspeaker signals. The pre-filters are traditionally designed with time-domain or frequency-domain methods, which suffer from high computational complexity and large system latency, respectively. This work proposes a subband pressure-matching method based on the short-time Fourier...
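Pressure matching itself reduces, per frequency band, to a regularised least-squares problem: find loudspeaker weights whose produced pressures at the control points match a desired pressure pattern (loud in the bright zone, near-zero in the dark zone). A minimal per-frequency sketch, with my own function name and a Tikhonov regulariser as an assumption:

```python
import numpy as np

def pressure_matching_filters(G, d, reg=1e-3):
    """Per-frequency pressure-matching filters.

    G: (F, M, L) complex transfer matrices from L loudspeakers to M control
       points at F frequencies; d: (F, M) desired pressures.
    Solves (G^H G + reg*I) q = G^H d for each frequency."""
    F, M, L = G.shape
    q = np.zeros((F, L), dtype=complex)
    for f in range(F):
        A = G[f].conj().T @ G[f] + reg * np.eye(L)
        q[f] = np.linalg.solve(A, G[f].conj().T @ d[f])
    return q
```

A subband/STFT-domain design as in the abstract would apply this kind of solve per subband, trading the long time-domain filters for short per-band ones.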
Binaural sound source localization is an important technique for speech enhancement, video conferencing, human-robot interaction, etc. However, in realistic scenarios, reverberation and environmental noise degrade the precision of direction estimation, so reliable localization is essential for practical applications. To deal with these disturbances, this paper presents a novel binaural approach based on weighting and generalized parametric mapping. First, as a preprocessing stage,...
This paper addresses the problem of relative transfer function (RTF) estimation in the presence of stationary noise. We propose an RTF identification method based on segmental power spectral density (PSD) matrix subtraction. First, the multichannel microphone signals are divided into segments corresponding to speech-plus-noise activity and noise-only periods. Then, the subtraction of the two PSD matrices leads to an almost noise-free PSD matrix, by reducing the noise component while preserving the non-stationary speech component. This is used for single...
This article proposes a deep neural network (DNN)-based direct-path relative transfer function (DP-RTF) enhancement method for robust direction of arrival (DOA) estimation in noisy and reverberant environments. The DP-RTF refers to the ratio between the direct-path acoustic transfer functions of two microphone channels. First, the complex-valued DP-RTF is decomposed into the inter-channel intensity difference and a sinusoidal function of the inter-channel phase difference in the time-frequency domain. Then, features from a series of temporal context frames are utilized to train the DNN...
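The real-valued decomposition mentioned above (intensity difference plus sine/cosine of the phase difference) is easy to make concrete. This is a hedged sketch of the feature construction only, with names of my own choosing, not the paper's network input pipeline:

```python
import numpy as np

def dprtf_features(x1, x2, eps=1e-8):
    """Per-TF-bin inter-channel features from two STFT coefficients.

    Returns [log-magnitude ratio, sin(IPD), cos(IPD)] per bin, a
    real-valued encoding of the complex inter-channel ratio x2/x1."""
    ratio = x2 / (x1 + eps)
    iid = np.log(np.abs(ratio) + eps)    # inter-channel intensity difference
    ipd = np.angle(ratio)                # inter-channel phase difference
    return np.stack([iid, np.sin(ipd), np.cos(ipd)], axis=-1)
```

Encoding the phase through sine and cosine avoids the 2*pi wrap-around discontinuity that a raw angle would present to a regression network.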
In human-robot interaction (HRI), speech sound source localization (SSL) is a convenient and efficient way to obtain the relative position between the speaker and the robot. However, implementing an SSL system based on the TDOA method encounters many problems, such as the noise of real environments, the solution of nonlinear equations, and the switch between far field and near field. In this paper, a fourth-order cumulant spectrum is derived, yielding a time delay estimation (TDE) algorithm that is available for speech signals and immune to spatially correlated Gaussian...
This paper addresses the problem of multiple sound source counting and localization in adverse acoustic environments, using microphone array recordings. The proposed time-frequency (TF) wise spatial spectrum clustering based method contains two stages. First, given the received sensor signals, the correlation matrix is computed and denoised in the TF domain. The TF-wise spatial spectrum is estimated based on the signal subspace information, and further enhanced by an exponential transform, which can increase the reliability of the presence possibility...
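A minimal sketch of a TF-wise spatial spectrum from signal-subspace information, sharpened by an exponential transform: project candidate steering vectors onto the principal eigenvector of the (denoised) correlation matrix. The function name, the single-eigenvector subspace, and the specific sharpening form are my assumptions, not the paper's exact formulation.

```python
import numpy as np

def tf_spatial_spectrum(R, steer, beta=5.0):
    """Spatial spectrum at one TF bin.

    R: (M, M) Hermitian correlation matrix; steer: (K, M) candidate
    steering vectors. The normalised projection onto the principal
    eigenvector peaks at the true direction; np.exp sharpens the peaks."""
    vals, vecs = np.linalg.eigh(R)
    u = vecs[:, -1]                                   # signal-subspace direction
    proj = np.abs(steer.conj() @ u) ** 2
    spec = proj / (np.linalg.norm(steer, axis=1) ** 2 + 1e-12)
    return np.exp(beta * spec / (spec.max() + 1e-12))  # exponential sharpening
```

Clustering such per-bin spectra (or their peak directions) over many TF bins is what then provides the joint source count and locations.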
The direct-path relative transfer function (DP-RTF) refers to the ratio between the direct-path acoustic transfer functions of two microphone channels. Though the DP-RTF fully encodes the sound spatial cues and serves as a reliable localization feature, it is often erroneously estimated in the presence of noise and reverberation. This paper proposes to learn the DP-RTF with deep neural networks for robust binaural sound source localization. A learning network is designed to regress the sensor signals to a real-valued representation of the DP-RTF. It consists of a branched...
Self-supervised learning (SSL) learns knowledge from a large amount of unlabeled data, and then transfers the knowledge to a specific problem with a limited number of labeled data. SSL has achieved promising results in various domains. This work addresses segment-level general audio SSL, and proposes a new transformer-based teacher-student model, named ATST. A transformer encoder is developed on a recently emerged teacher-student baseline scheme, which largely improves the modeling capability of pre-training. In addition, a strategy for...
Estimating the noise power spectral density (PSD) is essential for single-channel speech enhancement algorithms. In this paper, we propose a noise PSD estimation approach based on regional statistics. The proposed regional statistics consist of four features representing the statistics of the past and present periodograms in a short-time period. We show that these features are efficient in characterizing the statistical difference between the noise PSD and the noisy-speech PSD, and we therefore propose to use them for estimating the speech presence probability (SPP). The noise PSD is recursively estimated by averaging past spectral values...
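The recursive SPP-gated update can be sketched as below. This is a generic MMSE-SPP-style recursion under a fixed a-priori SNR assumption; it stands in for, and does not reproduce, the paper's regional-statistics features.

```python
import numpy as np

def estimate_noise_psd(periodograms, alpha=0.8, xi=10.0):
    """Recursive noise PSD estimate over frames.

    periodograms: (T, F) noisy periodograms. Per frame, a speech presence
    probability (SPP) gates how much the current periodogram updates the
    noise estimate: updates are proportional to speech absence (1 - SPP).
    xi is an assumed fixed a-priori SNR of speech when present."""
    noise = periodograms[0].copy()
    for t in range(1, len(periodograms)):
        snr = periodograms[t] / (noise + 1e-12)
        # SPP under a Gaussian model with prior speech probability 0.5
        spp = 1.0 / (1.0 + (1.0 + xi) * np.exp(-snr * xi / (1.0 + xi)))
        target = spp * noise + (1.0 - spp) * periodograms[t]
        noise = alpha * noise + (1.0 - alpha) * target
    return noise
```

The paper's contribution sits in the SPP itself: replacing the fixed-SNR gate here with features computed over a short-time region of past and present periodograms.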
This paper addresses the problem of binaural localization of a single speech source in noisy and reverberant environments. For a given microphone setup, the response corresponding to the direct-path propagation is a function of the source direction. In practice, this response is contaminated by noise and reverberation. The direct-path relative transfer function (DP-RTF) is defined as the ratio between the direct-path acoustic transfer functions of the two channels. We propose a method to estimate the DP-RTF from the microphone signals in the short-time Fourier transform domain. First, a convolutive transfer function approximation is adopted to accurately...
This paper proposes a delayed subband LSTM network for online monaural (single-channel) speech enhancement. The proposed method is developed in the short-time Fourier transform (STFT) domain. Online processing requires frame-by-frame signal reception and processing. A paramount feature of the proposed method is that the same LSTM is used across frequencies, which drastically reduces the number of parameters, the amount of training data and the computational burden. Training is performed in a subband manner: the input consists of one frequency, together with some context...
Dynamic objects in the environment, such as people and other agents, lead to challenges for existing simultaneous localization and mapping (SLAM) approaches. To deal with dynamic environments, computer vision researchers usually apply learning-based object detectors to remove these objects. However, such detectors are computationally too expensive for mobile-robot on-board processing. In practical applications, dynamic objects output noisy sounds that can be effectively detected by sound source localization. The directional...
Sound event detection (SED), as a core module of acoustic environmental analysis, suffers from the problem of data deficiency. The integration of semi-supervised learning (SSL) largely mitigates this problem. This paper researches several modules of SSL, and introduces a random consistency training (RCT) strategy. First, a hard mixup augmentation is proposed to account for the additive property of sounds. Second, a random augmentation scheme is applied to stochastically combine different types of augmentation methods with high flexibility. Third,...
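The "additive property of sounds" behind hard mixup can be shown in a few lines: overlapping clips are summed as waveforms, while their multi-hot event labels are combined by union, since both events remain present. This is my own minimal reading of the idea (function name and the unweighted sum are assumptions), not the RCT implementation.

```python
import numpy as np

def hard_mixup(x1, y1, x2, y2):
    """Mix two audio clips for SED augmentation.

    Waveforms are added directly (sounds superpose additively) and the
    multi-hot labels take the elementwise maximum (union): an event active
    in either clip is active in the mixture."""
    return x1 + x2, np.maximum(y1, y2)
```

Contrast this with standard mixup, which interpolates both inputs and labels with a random weight; for overlapping sound events a soft label would understate that both classes are fully present.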