Goeric Huybrechts

ORCID: 0000-0003-0222-3008
Research Areas
  • Speech Recognition and Synthesis
  • Speech and Audio Processing
  • Music and Audio Processing
  • Topic Modeling
  • Natural Language Processing Techniques
  • Voice and Speech Disorders
  • Speech and Dialogue Systems
  • Advanced Data Compression Techniques
  • Model Reduction and Neural Networks
  • Adversarial Robustness in Machine Learning
  • Parallel Computing and Optimization Techniques
  • Distributed and Parallel Computing Systems
  • Medical Imaging Techniques and Applications
  • Human Pose and Action Recognition
  • Phonetics and Phonology Research
  • Multimodal Machine Learning Applications
  • Reinforcement Learning in Robotics

Affiliations

Amazon (United States)
2022

Amazon (United Kingdom)
2019-2021

Amazon (Germany)
2018-2021

Publications

While recent neural text-to-speech (TTS) systems perform remarkably well, they typically require a substantial amount of recordings from the target speaker reading in the desired speaking style. In this work, we present a novel 3-step methodology to circumvent the costly operation of recording large amounts of data in order to build expressive style voices with as little as 15 minutes of such recordings. First, we augment data via voice conversion by leveraging recordings in the desired speaking style from other speakers. Next, we use that synthetic data on top of the available recordings to train...

10.1109/icassp39728.2021.9413466 article EN ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021-05-13
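
The data flow of the 3-step recipe can be sketched in a few lines. Everything below is a hypothetical stand-in (the paper's actual components are neural VC and TTS models); the point is only the shape of the pipeline: convert other speakers' expressive speech into the target voice, then pool real and synthetic data for training.

```python
# Illustrative sketch of the 3-step low-resource expressive TTS recipe.
# All functions are toy placeholders, not the paper's implementation.

def train_vc(expressive_other, target_samples):
    # Step 1: learn a voice conversion model that maps supporting speakers'
    # expressive speech into the target speaker's voice (stubbed out here).
    return lambda utt: {"audio": utt["audio"], "speaker": "target"}

def build_voice(target_samples, expressive_other, train_tts):
    vc = train_vc(expressive_other, target_samples)        # step 1
    synthetic = [vc(u) for u in expressive_other]          # step 2: augment
    return train_tts(target_samples + synthetic)           # step 3: train TTS

voice = build_voice(
    target_samples=[{"audio": "tgt_001.wav", "speaker": "target"}],
    expressive_other=[{"audio": "sup_001.wav", "speaker": "other"}],
    train_tts=lambda data: f"TTS trained on {len(data)} utterances",
)
```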

We present an approach to synthesize whisper by applying a handcrafted signal processing recipe and Voice Conversion (VC) techniques to convert normally phonated speech to whispered speech. We investigate using Gaussian Mixture Models (GMM) and Deep Neural Networks (DNN) to model the mapping between the acoustic features of normal speech and those of whispered speech. We evaluate naturalness and speaker similarity of the converted whispers on an internal corpus and on the publicly available wTIMIT corpus. We show that VC is significantly better than rule-based methods and that it achieves...

10.1109/lsp.2019.2961213 article EN IEEE Signal Processing Letters 2019-12-24
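
The GMM variant of such a mapping is classically implemented as a joint-density GMM over aligned source/target frames, converted via posterior-weighted conditional means. The sketch below shows that math on random stand-in data (real features would come from DTW-aligned parallel recordings); the paper's exact features and model may differ.

```python
# Minimal joint-density GMM feature-mapping sketch for voice conversion.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

D = 24                         # spectral feature dimensionality (assumed)
rng = np.random.default_rng(0)

# Placeholder time-aligned frames: normal speech X, whispered speech Y.
X = rng.normal(size=(5000, D))
Y = X * 0.8 + rng.normal(scale=0.3, size=(5000, D))

# Fit a GMM on the joint features z = [x; y].
gmm = GaussianMixture(n_components=8, covariance_type="full", random_state=0)
gmm.fit(np.hstack([X, Y]))

def convert(x):
    """Map a normal-speech frame x via posterior-weighted E[y | x, m]."""
    mu_x, mu_y = gmm.means_[:, :D], gmm.means_[:, D:]
    Sxx, Syx = gmm.covariances_[:, :D, :D], gmm.covariances_[:, D:, :D]
    # Posterior p(m | x) under the marginal GMM over x.
    lik = np.array([multivariate_normal.pdf(x, mu_x[m], Sxx[m])
                    for m in range(gmm.n_components)])
    post = gmm.weights_ * lik
    post /= post.sum()
    # Per-component conditional means, then posterior-weighted average.
    y_m = np.array([mu_y[m] + Syx[m] @ np.linalg.solve(Sxx[m], x - mu_x[m])
                    for m in range(gmm.n_components)])
    return post @ y_m

converted_frame = convert(X[0])
```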

Whilst recent neural text-to-speech (TTS) approaches produce high-quality speech, they typically require a large amount of recordings from the target speaker. In previous work, a 3-step method was proposed to generate high-quality TTS while greatly reducing the amount of data required for training. However, we have observed a ceiling effect in the level of naturalness achievable for highly expressive voices when using this approach. In this paper, we present a method for building highly expressive TTS voices with as little as 15 minutes of speech data from the target speaker. Compared to the current state-of-the-art approach,...

10.21437/ssw.2021-17 article EN 2021-08-24

We address the problem of cross-speaker style transfer for text-to-speech (TTS) using data augmentation via voice conversion. We assume to have a corpus of neutral non-expressive data from a target speaker and supporting conversational expressive data from different speakers. Our goal is to build a TTS system that is expressive, while retaining the target speaker's identity. The proposed approach relies on voice conversion to first generate a high-quality set of converted data, which is then pooled with the natural data of the target speaker and used to train a single-speaker multi-style TTS system...

10.1109/icassp43922.2022.9746179 article EN ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022-04-27
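
The pooling step amounts to merging converted and natural utterances under one speaker identity while keeping a style tag for conditioning. A minimal sketch, with illustrative names and paths that are not the paper's code:

```python
# Pooling VC-converted expressive data with natural neutral data, tagged by
# style, to train a single-speaker multi-style TTS model (illustrative only).
from dataclasses import dataclass

@dataclass
class Utterance:
    audio_path: str
    text: str
    style: str          # e.g. "neutral" or "conversational"

natural = [Utterance("tgt/0001.wav", "hello there", "neutral")]
converted = [Utterance("vc/0001.wav", "no way, really?", "conversational")]

# One pooled corpus, one speaker identity, multiple styles: the TTS model
# conditions on the style ID at training and inference time.
pooled = natural + converted
style_ids = {s: i for i, s in enumerate(sorted({u.style for u in pooled}))}
training_examples = [(u.audio_path, u.text, style_ids[u.style]) for u in pooled]
```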

State-of-the-art text-to-speech (TTS) systems require several hours of recorded speech data to generate high-quality synthetic speech. When using reduced amounts of training data, standard TTS models suffer from quality and intelligibility degradations, making low-resource TTS problematic. In this paper, we propose a novel extremely low-resource TTS method called Voice Filter that uses as little as one minute of speech from a target speaker. It uses voice conversion (VC) as a post-processing module appended to a pre-existing high-quality TTS system and marks a conceptual...

10.1109/icassp43922.2022.9747239 article EN ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022-04-27
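
Conceptually, the idea is a VC module bolted onto a frozen TTS system, re-rendering its output in the target voice. The toy mel-to-mel network below only illustrates that interface; the actual Voice Filter architecture differs.

```python
# Sketch of "VC as post-processing": a speaker-conditioned module rewrites
# the mel output of a pre-existing TTS system. Dimensions are assumptions.
import torch
import torch.nn as nn

class VoiceFilterSketch(nn.Module):
    def __init__(self, n_mels=80, spk_dim=192, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels + spk_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_mels),
        )

    def forward(self, mel, spk_emb):
        # mel: (T, n_mels) from the frozen TTS; spk_emb: (spk_dim,) derived
        # from ~1 minute of target-speaker speech.
        cond = spk_emb.expand(mel.size(0), -1)
        return self.net(torch.cat([mel, cond], dim=-1))

tts_mel = torch.randn(120, 80)        # stand-in for TTS output frames
target_emb = torch.randn(192)         # stand-in target speaker embedding
converted_mel = VoiceFilterSketch()(tts_mel, target_emb)
```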

Pitch detection is a fundamental problem in speech processing, as F0 is used in a large number of applications. Recent articles have proposed deep learning for robust pitch tracking. In this paper, we consider voicing classification and contour estimation as a regression problem. For both tasks, acoustic features from multiple domains and traditional machine learning methods are used. The discrimination power of existing features is assessed through mutual information. Multiple supervised and unsupervised approaches are compared. A significant...

10.1109/lsp.2018.2874155 article EN IEEE Signal Processing Letters 2018-10-04
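
The mutual-information feature assessment can be reproduced in miniature with scikit-learn. The sketch uses synthetic stand-in data; real per-frame features would be quantities such as energy, zero-crossing rate, or autocorrelation peaks.

```python
# Mutual information between candidate acoustic features and voicing labels.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n_frames = 2000
voiced = rng.integers(0, 2, n_frames)            # 1 = voiced frame

features = np.column_stack([
    voiced * 2.0 + rng.normal(size=n_frames),    # informative feature
    rng.normal(size=n_frames),                   # uninformative feature
])

mi = mutual_info_classif(features, voiced, random_state=0)
print(dict(zip(["energy_like", "noise_like"], mi)))  # higher MI = more useful
```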

Emotional voice conversion models adapt the emotion in speech without changing the speaker identity or linguistic content. They are less data-hungry than text-to-speech models and allow the generation of large amounts of emotional data for downstream tasks. In this work we propose EmoCat, a language-agnostic emotional voice conversion model. It achieves high-quality emotion conversion in German with only 45 minutes of German emotional recordings by exploiting large amounts of emotional data in US English. EmoCat is an encoder-decoder model based on CopyCat, a voice conversion system which transfers prosody. We use adversarial training to remove...

10.21437/ssw.2021-13 article EN 2021-08-24
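
Adversarial training of this kind is commonly implemented with a gradient reversal layer: an auxiliary classifier tries to predict the nuisance attribute from the encoder output, while reversed gradients push the encoder to discard that information. This is a generic sketch of the mechanism, not EmoCat's exact setup.

```python
# Gradient reversal layer for adversarial removal of nuisance information.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Pass the gradient through with flipped sign (scaled by lam).
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Usage: encoder features -> grad_reverse -> nuisance classifier. The
# classifier minimizes its loss, while the encoder is pushed to maximize it.
feats = torch.randn(8, 128, requires_grad=True)
clf = torch.nn.Linear(128, 2)                    # e.g. a speaker-ID head
logits = clf(grad_reverse(feats))
loss = torch.nn.functional.cross_entropy(logits, torch.randint(0, 2, (8,)))
loss.backward()                                  # feats.grad is reversed
```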

Recently, there has been an increasing interest in unifying streaming and non-streaming speech recognition models to reduce development, training and deployment cost. The best-known approaches rely on either a window-based or dynamic chunk-based attention strategy and causal convolutions to minimize the degradation due to streaming. However, a performance gap still remains relatively large between a unified streaming model and a full-contextual model trained independently. To address this, we propose a dynamic chunk-based convolution replacing the causal convolution in a hybrid...

10.1109/icassp49357.2023.10097062 article EN ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023-05-05
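
The chunk-based attention strategy mentioned above boils down to a mask: frames may attend within their own chunk and a limited number of past chunks, so one model covers both streaming (small chunks) and non-streaming (chunk = full utterance) modes. A minimal illustrative version:

```python
# Dynamic chunk-based attention mask (True = attention allowed).
import torch

def chunk_attention_mask(seq_len, chunk_size, num_left_chunks=-1):
    """num_left_chunks=-1 keeps all past chunks (non-streaming behaviour)."""
    idx = torch.arange(seq_len)
    chunk = idx // chunk_size                          # chunk index per frame
    same_or_past = chunk.unsqueeze(0) <= chunk.unsqueeze(1)
    if num_left_chunks < 0:
        return same_or_past
    near = chunk.unsqueeze(1) - chunk.unsqueeze(0) <= num_left_chunks
    return same_or_past & near

mask = chunk_attention_mask(seq_len=8, chunk_size=2, num_left_chunks=1)
print(mask.int())    # rows = query frames, cols = key frames
```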

Integrated Speech and Large Language Models (SLMs) that can follow speech instructions and generate relevant text responses have gained popularity lately. However, the safety and robustness of these models remains largely unclear. In this work, we investigate the potential vulnerabilities of such instruction-following speech-language models to adversarial attacks and jailbreaking. Specifically, we design algorithms that can generate adversarial examples to jailbreak SLMs in both white-box and black-box attack settings without human involvement...

10.48550/arxiv.2405.08317 preprint EN arXiv (Cornell University) 2024-05-14

Convolution-augmented Transformer architectures have dominated the field of automatic speech recognition by showing better WER results when models are trained on relatively smaller training data. In this work, we revisit the necessity of convolution modules in the ASR encoder architecture, given that the inductive bias they bring may only boost performance in a low data regime. We show that, with architectural improvements to the Transformer block, a convolution-free architecture (namely, Transformer++) can catch up with the best Conformer as...

10.21437/interspeech.2024-588 article EN Interspeech 2024 2024-09-01

Understanding long-form video content presents significant challenges due to its temporal complexity and the substantial computational resources required. In this work, we propose an agent-based approach to enhance both the efficiency and effectiveness of long-form video understanding by utilizing large language models (LLMs) and their tool-harnessing ability. A key aspect of our method is query-adaptive frame sampling, which leverages the reasoning capabilities of LLMs to process only the most relevant frames in real-time, and addresses...

10.48550/arxiv.2410.20252 preprint EN arXiv (Cornell University) 2024-10-26
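
Query-adaptive frame sampling can be pictured as scoring candidate frames against the text query and keeping only the top-k for the LLM to reason over. In the sketch below the embeddings are random stand-ins; in practice they would come from a vision-language encoder (e.g. a CLIP-style model), and the selection loop may be driven by the LLM itself.

```python
# Toy query-adaptive frame sampling: keep the k frames most similar to the query.
import torch
import torch.nn.functional as F

def sample_relevant_frames(frame_embs, query_emb, k=8):
    """frame_embs: (N, D), one embedding per candidate frame; query_emb: (D,).
    Returns the indices of the k most query-relevant frames, in temporal order."""
    sims = F.cosine_similarity(frame_embs, query_emb.unsqueeze(0), dim=-1)
    return torch.topk(sims, k=min(k, frame_embs.size(0))).indices.sort().values

frames = torch.randn(300, 512)   # stand-in embeddings for 300 sampled frames
query = torch.randn(512)         # stand-in embedding of the user's question
keep = sample_relevant_frames(frames, query, k=8)
print(keep)                      # only these frames are passed to the LLM
```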

Despite recent advancements in speech processing, zero-resource speech translation (ST) and automatic speech recognition (ASR) remain challenging problems. In this work, we propose to leverage a multilingual Large Language Model (LLM) to perform ST and ASR in languages for which the model has never seen paired audio-text data. We achieve this by using a pre-trained multilingual speech encoder, a multilingual LLM, and a lightweight adaptation module that maps the audio representations to the token embedding space of the LLM. We perform several experiments on both tasks to understand how to best train...

10.48550/arxiv.2412.18566 preprint EN arXiv (Cornell University) 2024-12-24
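
A lightweight adaptation module of this kind typically downsamples the speech encoder's frame outputs and projects them into the LLM's token embedding space, so they can be prepended to the text prompt as soft tokens. Dimensions and the stacking factor below are illustrative assumptions, not the paper's configuration.

```python
# Sketch of an audio-to-LLM adapter: frame stacking + MLP projection.
import torch
import torch.nn as nn

class AudioAdapter(nn.Module):
    def __init__(self, enc_dim=1024, llm_dim=4096, stack=4):
        super().__init__()
        self.stack = stack
        self.proj = nn.Sequential(
            nn.Linear(enc_dim * stack, llm_dim), nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, enc_out):
        # enc_out: (B, T, enc_dim) from a frozen pre-trained speech encoder.
        B, T, D = enc_out.shape
        T = (T // self.stack) * self.stack            # drop leftover frames
        x = enc_out[:, :T].reshape(B, T // self.stack, D * self.stack)
        return self.proj(x)                           # (B, T/stack, llm_dim)

audio_feats = torch.randn(2, 101, 1024)               # stand-in encoder output
soft_tokens = AudioAdapter()(audio_feats)              # prepend to text embeddings
```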

We address the problem of cross-speaker style transfer for text-to-speech (TTS) using data augmentation via voice conversion. We assume to have a corpus of neutral non-expressive data from a target speaker and supporting conversational expressive data from different speakers. Our goal is to build a TTS system that is expressive, while retaining the target speaker's identity. The proposed approach relies on voice conversion to first generate a high-quality set of converted data, which is then pooled with the natural data of the target speaker and used to train a single-speaker multi-style TTS system...

10.48550/arxiv.2202.05083 preprint EN other-oa arXiv (Cornell University) 2022-01-01

The availability of data in expressive styles across languages is limited, and recording sessions are costly and time consuming. To overcome these issues, we demonstrate how to build low-resource, neural text-to-speech (TTS) voices with only 1 hour of conversational speech, when no other data are available in the same language. Assuming the availability of non-expressive speech data in that language, we propose a 3-step technology: 1) we train an F0-conditioned voice conversion (VC) model as a data augmentation technique; 2) we train an F0 predictor to control...

10.21437/interspeech.2022-10338 article EN Interspeech 2022 2022-09-16
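
The F0 conditioning in step 1 presupposes a per-frame F0 contour extracted from the source audio. A minimal sketch using librosa's pyin tracker, with an interpolated log-F0 contour as a common conditioning format; the path and parameters are illustrative, and the paper's exact front-end may differ.

```python
# Extract and smooth an F0 contour to condition a voice conversion model on.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)        # placeholder path
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr)

# Interpolate over unvoiced gaps, then take log-F0 per frame.
f0 = np.where(voiced_flag, f0, np.nan)
idx = np.arange(len(f0))
f0 = np.interp(idx, idx[~np.isnan(f0)], f0[~np.isnan(f0)])
log_f0 = np.log(f0)                                    # conditioning signal
```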

Whilst recent neural text-to-speech (TTS) approaches produce high-quality speech, they typically require a large amount of recordings from the target speaker. In previous work, a 3-step method was proposed to generate high-quality TTS while greatly reducing the amount of data required for training. However, we have observed a ceiling effect in the level of naturalness achievable for highly expressive voices when using this approach. In this paper, we present a method for building highly expressive TTS voices with as little as 15 minutes of speech data from the target speaker. Compared to the current state-of-the-art approach,...

10.48550/arxiv.2106.12896 preprint EN other-oa arXiv (Cornell University) 2021-01-01

Recently, there has been an increasing interest in unifying streaming and non-streaming speech recognition models to reduce development, training and deployment cost. The best-known approaches rely on either a window-based or dynamic chunk-based attention strategy and causal convolutions to minimize the degradation due to streaming. However, a performance gap still remains relatively large between a unified streaming model and a full-contextual model trained independently. To address this, we propose a dynamic chunk-based convolution replacing the causal convolution in a hybrid...

10.48550/arxiv.2304.09325 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Conformer-based end-to-end models have become ubiquitous these days and are commonly used in both streaming and non-streaming automatic speech recognition (ASR). Techniques like dual-mode and dynamic chunk training have helped unify streaming and non-streaming systems. However, there remains a performance gap between models trained with full context and those with limited past context. To address this issue, we propose the integration of a novel dynamic contextual carry-over mechanism in a state-of-the-art (SOTA) unified ASR system. Our proposed dynamic context Conformer (DCTX-Conformer)...

10.48550/arxiv.2306.08175 preprint EN other-oa arXiv (Cornell University) 2023-01-01
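
A contextual carry-over mechanism can be pictured as each attention layer keeping a few context embeddings that summarize previous chunks and prepending them to the keys and values of the current chunk, so a streaming model sees beyond its chunk boundary. The sketch below is a heavily simplified illustration of that idea, not the DCTX-Conformer architecture.

```python
# Toy chunked attention with carried-over context embeddings between calls.
import torch
import torch.nn as nn

class CarryOverAttention(nn.Module):
    def __init__(self, dim=256, n_heads=4, n_ctx=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.n_ctx = n_ctx

    def forward(self, chunk, ctx=None):
        # chunk: (B, T, dim); ctx: (B, n_ctx, dim) carried from the last call.
        kv = chunk if ctx is None else torch.cat([ctx, chunk], dim=1)
        out, _ = self.attn(chunk, kv, kv)
        new_ctx = out[:, -self.n_ctx:].detach()   # summary for the next chunk
        return out, new_ctx

layer = CarryOverAttention()
ctx = None
for chunk in torch.randn(5, 2, 16, 256):          # 5 chunks of 16 frames each
    out, ctx = layer(chunk, ctx)
```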

While recent neural text-to-speech (TTS) systems perform remarkably well, they typically require a substantial amount of recordings from the target speaker reading in the desired speaking style. In this work, we present a novel 3-step methodology to circumvent the costly operation of recording large amounts of data in order to build expressive style voices with as little as 15 minutes of such recordings. First, we augment data via voice conversion by leveraging recordings in the desired speaking style from other speakers. Next, we use that synthetic data on top of the available recordings to train...

10.48550/arxiv.2011.05707 preprint EN other-oa arXiv (Cornell University) 2020-01-01

State-of-the-art text-to-speech (TTS) systems require several hours of recorded speech data to generate high-quality synthetic speech. When using reduced amounts of training data, standard TTS models suffer from quality and intelligibility degradations, making low-resource TTS problematic. In this paper, we propose a novel extremely low-resource TTS method called Voice Filter that uses as little as one minute of speech from a target speaker. It uses voice conversion (VC) as a post-processing module appended to a pre-existing high-quality TTS system and marks a conceptual...

10.48550/arxiv.2202.08164 preprint EN other-oa arXiv (Cornell University) 2022-01-01

The availability of data in expressive styles across languages is limited, and recording sessions are costly and time consuming. To overcome these issues, we demonstrate how to build low-resource, neural text-to-speech (TTS) voices with only 1 hour of conversational speech, when no other data are available in the same language. Assuming the availability of non-expressive speech data in that language, we propose a 3-step technology: 1) we train an F0-conditioned voice conversion (VC) model as a data augmentation technique; 2) we train an F0 predictor to control...

10.48550/arxiv.2207.14607 preprint EN other-oa arXiv (Cornell University) 2022-01-01