Christoph Boeddeker

ORCID: 0000-0002-8701-1567
Research Areas
  • Speech and Audio Processing
  • Speech Recognition and Synthesis
  • Music and Audio Processing
  • Advanced Adaptive Filtering Techniques
  • Speech and dialogue systems
  • Blind Source Separation Techniques
  • Natural Language Processing Techniques
  • Ultrasonics and Acoustic Wave Propagation
  • Hearing Loss and Rehabilitation
  • Advanced Data Compression Techniques
  • Algorithms and Data Compression
  • Structural Health Monitoring Techniques
  • Bayesian Methods and Mixture Models
  • Acoustic Wave Phenomena Research
  • Vehicle Noise and Vibration Control
  • Advanced Clustering Algorithms Research
  • Direction-of-Arrival Estimation Techniques
  • Data Management and Algorithms
  • Indoor and Outdoor Localization Technologies
  • Topic Modeling
  • Phonetics and Phonology Research

Paderborn University
2016-2025

Microsoft (United States)
2018-2023

Mitsubishi Electric (United States)
2023

Following the success of the 1st, 2nd, 3rd, 4th and 5th CHiME challenges, we organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6). The new challenge revisits the previous CHiME-5 challenge and further considers the problem of distant multi-microphone conversational speech diarization and recognition in everyday home environments. The speech material is the same as the CHiME-5 recordings except for accurate array synchronization. It was elicited using a dinner party scenario with efforts taken to capture data that is representative of natural conversational speech....

10.21437/chime.2020-1 preprint EN 2020-05-04

This paper presents an end-to-end training approach for a beamformer-supported multi-channel ASR system. A neural network which estimates the masks for a statistically optimum beamformer is jointly trained with the acoustic model. To update its parameters, we propagate the gradients from the acoustic model all the way through the feature extraction and the complex-valued beamforming operation. Besides avoiding a mismatch between front-end and back-end, this approach also eliminates the need for stereo data, i.e., the parallel availability of clean and noisy...

10.1109/icassp.2017.7953173 article EN 2017-03-01
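As an illustration of the kind of front-end referred to above, a mask-based MVDR beamformer can be sketched in a few lines of NumPy. This is a generic textbook formulation, not the paper's exact system; the function name, the diagonal loading constant, and the choice of the principal eigenvector of the speech PSD as steering vector are assumptions made for the sketch:

```python
import numpy as np

def mvdr_from_masks(Y, speech_mask, noise_mask):
    """Mask-based MVDR beamformer.
    Y: (F, T, D) multi-channel STFT, masks: (F, T); returns (F, T)."""
    D = Y.shape[-1]
    # mask-weighted spatial covariance (PSD) matrices, shape (F, D, D)
    psd_s = np.einsum('ft,ftd,fte->fde', speech_mask, Y, Y.conj())
    psd_n = np.einsum('ft,ftd,fte->fde', noise_mask, Y, Y.conj())
    psd_n = psd_n + 1e-6 * np.eye(D)           # diagonal loading
    # steering vector: principal eigenvector of the speech PSD
    _, vecs = np.linalg.eigh(psd_s)
    d = vecs[..., -1]                          # (F, D)
    # w = Phi_nn^{-1} d / (d^H Phi_nn^{-1} d)
    num = np.linalg.solve(psd_n, d[..., None])[..., 0]
    w = num / np.einsum('fd,fd->f', d.conj(), num)[:, None]
    return np.einsum('fd,ftd->ft', w.conj(), Y)
```

In an end-to-end setup as described in the abstract, the masks would come from a neural network and every step above (including the eigendecomposition and solve) must be differentiable so gradients can flow from the acoustic model back to the mask estimator.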

10.48550/arxiv.2004.09249 preprint EN other-oa arXiv (Cornell University) 2020-01-01

We present ESPnet-SE, which is designed for the quick development of speech enhancement and speech separation systems in a single framework, along with an optional downstream speech recognition module. ESPnet-SE is a new project which integrates rich automatic speech recognition related models and resources to support and validate the proposed front-end implementation (i.e. speech enhancement and separation). It is capable of processing both single-channel and multi-channel data, with various functionalities including dereverberation, denoising and source separation. We provide all-in-one...

10.1109/slt48900.2021.9383615 article EN 2021 IEEE Spoken Language Technology Workshop (SLT) 2021-01-19

Since diarization and source separation of meeting data are closely related tasks, we here propose an approach to perform the two objectives jointly. It builds upon the target-speaker voice activity detection (TS-VAD) diarization approach, which assumes that initial speaker embeddings are available. We replace the final combined speaker activity estimation network of TS-VAD with a network that produces speaker activity estimates at a time-frequency resolution. Those estimates act as masks for source extraction, either via masking or via beamforming. The technique can be applied both...

10.1109/taslp.2024.3350887 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2024-01-01

This work examines acoustic beamformers employing neural networks (NNs) for mask prediction as front-ends for automatic speech recognition (ASR) systems in practical scenarios like voice-enabled home devices. To test the versatility of the mask predicting network, the system is evaluated with different recording hardware, different microphone array designs, and different acoustic models of the downstream ASR system. Significant gains in recognition accuracy are obtained in all configurations despite the fact that the NN had been trained on mismatched data. Unlike...

10.1109/icassp.2018.8461669 article EN 2018-04-01

In this paper, we present Hitachi and Paderborn University's joint effort for automatic speech recognition (ASR) in a dinner party scenario. The main challenges of ASR for dinner party recordings obtained by multiple microphone arrays are (1) heavy speech overlaps, (2) severe noise and reverberation, (3) very natural conversational content, and possibly (4) insufficient training data. As an example scenario, we have chosen the data presented during the CHiME-5 challenge, where the baseline ASR system had a 73.3% word error rate (WER), even...

10.21437/interspeech.2019-1167 article EN Interspeech 2019 2019-09-13

We present a multi-channel database of overlapping speech for training, evaluation, and detailed analysis of source separation and extraction algorithms: SMS-WSJ -- Spatialized Multi-Speaker Wall Street Journal. It consists of artificially mixed speech taken from the WSJ database, but unlike earlier databases we consider all WSJ0+1 utterances and take care to strictly separate the speaker sets of the training, validation and test sets. When spatializing the data we ensure a high degree of randomness w.r.t. room size, array center position and rotation, as...

10.48550/arxiv.1910.13934 preprint EN other-oa arXiv (Cornell University) 2019-01-01

In recent years time domain speech separation has excelled over frequency domain separation in single channel scenarios and noise-free environments. In this paper we dissect the gains of the time-domain audio separation network (TasNet) approach by gradually replacing components of an utterance-level permutation invariant training (u-PIT) based separation system until the TasNet system is reached, thus blending components of frequency domain approaches with those of time domain approaches. Some of the intermediate variants achieve a comparable signal-to-distortion ratio (SDR) to TasNet, but retain...

10.1109/icassp40776.2020.9052981 article EN ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020-04-09
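The utterance-level permutation invariant training criterion mentioned above can be illustrated with a minimal sketch: the loss is evaluated for every speaker permutation and the minimum is kept. This brute-force version (hypothetical `pit_mse` helper, MSE as the inner loss) is exponential in the number of speakers and is meant only to show the idea:

```python
import itertools
import numpy as np

def pit_mse(estimates, targets):
    """Utterance-level permutation invariant MSE.
    estimates, targets: (num_speakers, num_samples) arrays."""
    n_spk = estimates.shape[0]
    # evaluate the loss under every speaker permutation, keep the minimum
    return min(
        float(np.mean((estimates[list(perm)] - targets) ** 2))
        for perm in itertools.permutations(range(n_spk))
    )
```

Because the minimum is taken over permutations, the network is never penalized for emitting the speakers in a different output order than the reference.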

We propose a general framework to compute the word error rate (WER) of ASR systems that process recordings containing multiple speakers at their input and produce multiple output sequences (MIMO). Such systems are typically required, e.g., for meeting transcription. We provide an efficient implementation based on a dynamic programming search in a multi-dimensional Levenshtein distance tensor under the constraint that a reference utterance must be matched consistently with one hypothesis output. This also results in an efficient implementation of the ORC WER which...

10.1109/icassp49357.2023.10094784 article EN ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023-05-05
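The ORC WER mentioned above can be defined by a brute-force reference implementation: every assignment of reference utterances to output channels is tried, and the cheapest one is kept. The sketch below (hypothetical function names, words as Python lists) is exponential in the number of utterances, which is exactly the cost the dynamic programming search in the paper avoids:

```python
import itertools

def edit_distance(ref, hyp):
    # standard Levenshtein distance on word lists, rolling-array form
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def orc_wer(ref_utterances, hyp_channels):
    """Brute-force ORC WER: each reference utterance is assigned as a
    whole to exactly one hypothesis channel; the assignment minimizing
    the total edit distance defines the error count."""
    n_words = sum(len(u) for u in ref_utterances)
    best = None
    for assign in itertools.product(range(len(hyp_channels)),
                                    repeat=len(ref_utterances)):
        streams = [[] for _ in hyp_channels]
        for utt, c in zip(ref_utterances, assign):
            streams[c].extend(utt)
        cost = sum(edit_distance(s, h)
                   for s, h in zip(streams, hyp_channels))
        best = cost if best is None else min(best, cost)
    return best / n_words
```

Note how the "one utterance, one channel" constraint is enforced: an utterance's words are never split across channels, but multiple utterances may share a channel.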

This article proposes methods that can optimize a Convolutional BeamFormer (CBF) for jointly performing denoising, dereverberation, and source separation (DN+DR+SS) in a computationally efficient way. Conventionally, a cascade configuration, composed of a Weighted Prediction Error minimization (WPE) dereverberation filter followed by a Minimum Variance Distortionless Response (MVDR) beamformer, has been used as the state-of-the-art frontend for far-field speech recognition, even though this approach's...

10.1109/taslp.2020.3013118 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2020-01-01

Most approaches to multi-talker overlapped speech separation and recognition assume that the number of simultaneously active speakers is given, but in realistic situations it is typically unknown. To cope with this, we extend an iterative speech extraction system with mechanisms to count the sources and combine it with a single-talker speech recognizer to form the first end-to-end automatic speech recognition system for an unknown number of speakers. Our experiments show very promising performance in counting accuracy, source separation and speech recognition on simulated clean mixtures from WSJ0-2mix...

10.21437/interspeech.2020-2519 article EN Interspeech 2020 2020-10-25

10.1109/icassp49660.2025.10888445 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

The weighted prediction error (WPE) algorithm has proven to be a very successful dereverberation method for the REVERB challenge. Likewise, neural network based mask estimation for beamforming demonstrated good noise suppression in the CHiME 3 and CHiME 4 challenges. Recently, it has been shown that this estimator can also be trained to perform dereverberation and denoising jointly. However, up to now a comparison of such a beamformer and WPE is still missing, as is an investigation into a combination of the two. Therefore, we here provide an extensive evaluation...

10.21437/interspeech.2018-2196 article EN Interspeech 2018 2018-08-28
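A minimal iterative WPE sketch for a single frequency bin, assuming the common formulation with a prediction delay, stacked filter taps, and power-weighted correlation statistics (the function name and default parameters are illustrative, not taken from the paper):

```python
import numpy as np

def wpe_single_freq(Y, taps=5, delay=2, iterations=3, eps=1e-10):
    """Iterative WPE for one frequency bin.
    Y: (D, T) multi-channel STFT; returns the dereverberated (D, T)."""
    D, T = Y.shape
    # stacked, delayed observations (D * taps, T)
    Y_tilde = np.zeros((D * taps, T), dtype=Y.dtype)
    for k in range(taps):
        shift = delay + k
        Y_tilde[k * D:(k + 1) * D, shift:] = Y[:, :T - shift]
    X = Y.copy()
    for _ in range(iterations):
        # time-varying power estimate of the desired signal
        power = np.maximum(np.mean(np.abs(X) ** 2, axis=0), eps)
        R = (Y_tilde / power) @ Y_tilde.conj().T       # weighted correlation
        P = (Y_tilde / power) @ Y.conj().T
        G = np.linalg.solve(R + eps * np.eye(D * taps), P)  # (D*taps, D)
        X = Y - G.conj().T @ Y_tilde                   # subtract late reverb
    return X
```

The delay keeps the early reflections (and thus the direct signal) untouched; only late reverberation, predictable from delayed past frames, is subtracted.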

The rising interest in single-channel multi-speaker speech separation sparked the development of End-to-End (E2E) approaches to multi-speaker speech recognition. However, up until now, state-of-the-art neural network-based time domain source separation has not yet been combined with E2E speech recognition. We here demonstrate how to combine a separation module based on a Convolutional Time domain Audio Separation Network (Conv-TasNet) with an E2E speech recognizer and how to train such a model jointly by distributing it over multiple GPUs or by approximating truncated back-propagation for the...

10.1109/icassp40776.2020.9053461 article EN ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020-04-09

Despite the strong modeling power of neural network acoustic models, speech enhancement has been shown to deliver additional word error rate improvements if multi-channel data is available. However, there has been a longstanding debate whether enhancement should also be carried out on the ASR training data. In an extensive experimental evaluation on the acoustically very challenging CHiME-5 dinner party data we show that: (i) cleaning up the training data can lead to substantial error rate reductions, and (ii) enhancement in training is advisable as long as enhancement in test is at least as strong as in training. This...

10.1109/asru46091.2019.9003785 article EN 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2019-12-01

We previously proposed an optimal (in the maximum likelihood sense) convolutional beamformer that can perform simultaneous denoising and dereverberation, and showed its superiority over the widely used cascade of a Weighted Prediction Error (WPE) dereverberation filter and a conventional Minimum-Power Distortionless Response (MPDR) beamformer. However, it has not been fully investigated which components in it yield such superiority. To this end, this paper presents a new derivation that allows us to factorize it into a WPE...

10.1109/icassp40776.2020.9054393 article EN ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020-04-09

Recently, the end-to-end approach has been successfully applied to multi-speaker speech separation and recognition in both single-channel and multichannel conditions. However, severe performance degradation is still observed in reverberant and noisy scenarios, and there remains a large gap between anechoic and reverberant conditions. In this work, we focus on the reverberant condition and propose to extend our previous framework for end-to-end dereverberation, beamforming, and speech recognition with improved numerical stability and advanced frontend subnetworks including voice activity detection...

10.1109/icassp39728.2021.9414464 article EN ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021-05-13

Impressive progress in neural network-based single-channel speech source separation has been made in recent years. But those improvements have mostly been reported on anechoic data, a situation that is hardly met in practice. Taking the SepFormer as a starting point, which achieves state-of-the-art performance on anechoic mixtures, we gradually modify it to optimize its performance on reverberant mixtures. Although this leads to a word error rate improvement by 7 percentage points compared to the standard implementation, the system ends up with...

10.1109/iwaenc53105.2022.9914794 preprint EN 2022-09-05

Many state-of-the-art neural network-based source separation systems use the averaged Signal-to-Distortion Ratio (SDR) as a training objective function. The basic SDR is, however, undefined if the network reconstructs the reference signal perfectly or if the reference contains silence, e.g., when a two-output separator processes a single-speaker recording. Many modifications to the plain SDR have been proposed that trade off between making the loss more robust and distorting its value. We propose to switch from the mean over the SDRs of each...

10.1109/icassp43922.2022.9746757 article EN ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022-04-27
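The degenerate behavior of the plain SDR, and one way to avoid it by aggregating signal and error power over all sources before taking the ratio, can be sketched as follows (the function names are mine; this illustrates the general idea rather than the paper's exact proposal):

```python
import numpy as np

def mean_sdr(references, estimates):
    """Mean over per-source SDRs; becomes infinite when any source is
    reconstructed perfectly or a reference is all zeros (silence)."""
    return np.mean([
        10 * np.log10(np.sum(r ** 2) / np.sum((e - r) ** 2))
        for r, e in zip(references, estimates)
    ])

def aggregated_sdr(references, estimates):
    """Aggregate signal and error power over all sources before the
    ratio, so a single silent or perfectly reconstructed source no
    longer makes the loss degenerate."""
    num = sum(np.sum(r ** 2) for r in references)
    den = sum(np.sum((e - r) ** 2) for r, e in zip(references, estimates))
    return 10 * np.log10(num / den)
```

With a silent reference among the sources, `mean_sdr` diverges while `aggregated_sdr` stays finite as long as the overall reconstruction is imperfect.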

Time-domain training criteria have proven to be very effective for the separation of single-channel non-reverberant speech mixtures. Likewise, mask-based beamforming has shown impressive performance in multi-channel reverberant speech enhancement and source separation. Here, we propose to combine neural network supported multi-channel source separation with a time-domain objective function. For this we use a convolutive transfer function invariant Signal-to-Distortion Ratio (CI-SDR) based loss. While this is a well-known evaluation metric (BSS...

10.1109/icassp39728.2021.9414661 article EN ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021-05-13

Continuous speech separation (CSS) is an arising task which aims at separating overlap-free targets from a long, partially overlapped recording. A straightforward extension of previously proposed sentence-level separation models to this task is to segment the long recording into fixed-length blocks and perform separation on them independently. However, such a simple extension does not fully address the cross-block dependencies and the separation performance may not be satisfactory. In this paper, we focus on how the block-level separation performance can be improved by exploring methods to utilize...

10.1109/slt48900.2021.9383514 article EN 2021 IEEE Spoken Language Technology Workshop (SLT) 2021-01-19

Using a Teacher-Student training approach we developed a speaker embedding extraction system that outputs embeddings at frame rate. Given this high temporal resolution and the fact that the student produces sensible speaker embeddings even for segments with speech overlap, the frame-wise embeddings serve as an appropriate representation of the input speech signal for an end-to-end neural meeting diarization (EEND) system. We show in experiments that this representation helps mitigate a well-known problem of EEND systems: when increasing the number of speakers, the performance drop is...

10.1109/icassp49357.2023.10095370 article EN ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023-05-05

In this paper we show how a neural network for spectral mask estimation for an acoustic beamformer can be optimized by algorithmic differentiation. Using the beamformer output SNR as the objective function to maximize, the gradient is propagated through the beamformer all the way to the network which provides the clean speech and noise masks, from which the beamformer coefficients are estimated by an eigenvalue decomposition. A key theoretical result is the derivative of an eigenvalue problem involving complex-valued eigenvectors. Experimental results on the CHiME-3 challenge database demonstrate...

10.1109/icassp.2017.7952140 article EN 2017-03-01
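The beamformer in question maximizes the output SNR, which leads to a generalized eigenvalue problem; a sketch for one frequency bin, assuming given speech and noise PSD matrices (the function name and diagonal loading are illustrative):

```python
import numpy as np

def gev_weights(psd_speech, psd_noise, loading=1e-6):
    """GEV (max-SNR) beamformer for one frequency bin: the principal
    generalized eigenvector of (Phi_ss, Phi_nn) maximizes the output
    SNR  w^H Phi_ss w / w^H Phi_nn w.  Solved here via a standard
    eigenproblem on Phi_nn^{-1} Phi_ss; the paper's contribution is
    the derivative of such eigenvalue problems, enabling end-to-end
    training of the mask estimator that produces the PSDs."""
    D = psd_speech.shape[-1]
    mat = np.linalg.solve(psd_noise + loading * np.eye(D), psd_speech)
    vals, vecs = np.linalg.eig(mat)
    return vecs[:, np.argmax(vals.real)]   # principal eigenvector
```

For a rank-one speech PSD and white noise, the resulting weight vector aligns (up to a phase) with the steering vector of the source.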