- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- Speech and Dialogue Systems
- Voice and Speech Disorders
- Natural Language Processing Techniques
- Phonetics and Phonology Research
- Topic Modeling
- Advanced Data Compression Techniques
- Neural Networks and Applications
- Multimodal Machine Learning Applications
- Intelligent Tutoring Systems and Adaptive Learning
- Algorithms and Data Compression
- Advanced Adaptive Filtering Techniques
- Emotion and Mood Recognition
- Subtitles and Audiovisual Media
- Language, Metaphor, and Cognition
- ICT in Developing Communities
- Sentiment Analysis and Opinion Mining
- Multi-Agent Systems and Negotiation
- Social Robot Interaction and HRI
- Innovative Teaching and Learning Methods
Amazon (United States)
2022
Amazon (United Kingdom)
2021
Aalto University
2016-2019
KTH Royal Institute of Technology
2013-2014
International Institute of Information Technology, Hyderabad
2012
This paper proposes a method for generating speech from filterbank mel-frequency cepstral coefficients (MFCCs), which are widely used in applications such as ASR but are generally considered unusable for speech synthesis. First, we predict fundamental frequency and voicing information from MFCCs with an autoregressive recurrent neural net. Second, the spectral envelope information contained in the MFCCs is converted to all-pole filters, and a pitch-synchronous excitation model matched to these filters is trained. Finally, we introduce a generative adversarial...
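As background for the MFCC parameterization discussed above, the analysis chain (frame, window, power spectrum, mel filterbank, log, DCT) can be sketched in NumPy. This is a minimal illustrative implementation with assumed default parameters (16 kHz audio, 512-point FFT, 24 mel bands, 13 cepstra), not the configuration used in the paper:

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=24, n_ceps=13):
    """Minimal MFCC analysis: frame -> Hann window -> power spectrum
    -> triangular mel filterbank -> log -> DCT-II. Illustrative only."""
    # Slice the signal into overlapping windowed frames
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] for i in range(n_frames)])
    frames = frames * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # Triangular filters spaced uniformly on the mel scale
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[m - 1, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)

    # DCT-II over the mel axis; keep the first n_ceps coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (n[None, :] + 0.5) * np.arange(n_ceps)[:, None])
    return log_mel @ dct.T
```

One second of 16 kHz audio yields a (97, 13) feature matrix with these settings; the synthesis task in the paper is the inverse problem of recovering a waveform from such features.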
Recently, generative neural network models which operate directly on raw audio, such as WaveNet, have improved the state of the art in text-to-speech (TTS) synthesis. Moreover, there is increasing interest in using these models as statistical vocoders, i.e., for generating speech waveforms from various acoustic features. However, there is also a need to reduce model complexity without compromising synthesis quality. Previously, glottal pulseforms (i.e., time-domain waveforms corresponding to the source of the human voice production mechanism) have been...
Recent studies have shown that text-to-speech synthesis quality can be improved by using glottal vocoding. This refers to vocoders that parameterize speech into two parts, the excitation and the vocal tract, that occur in the human speech production apparatus. Current glottal vocoders generate the excitation waveform with deep neural networks (DNNs). However, squared error-based training of the present models is limited to generating conditional average waveforms, which fails to capture the stochastic variation of the waveforms. As a result, shaped noise is added as...
A vocoder is used to express a speech waveform with a controllable parametric representation that can be converted back into a speech waveform. Vocoders representing the main vocoder categories (mixed excitation, glottal, and sinusoidal vocoders) were compared in this study with formal crowd-sourced listening tests. The quality of each vocoder was measured within the context of analysis-synthesis as well as text-to-speech (TTS) synthesis in a modern statistical framework. Furthermore, the TTS experiments were divided into vocoder-specific features...
Recent speech technology research has seen a growing interest in using WaveNets as statistical vocoders, i.e., for generating speech waveforms from acoustic features. These models have been shown to improve the generated speech quality over classical vocoders in many tasks, such as text-to-speech synthesis and voice conversion. Furthermore, conditioning WaveNets with acoustic features allows sharing the waveform generator model across multiple speakers without additional speaker codes. However, multi-speaker WaveNet models require large amounts...
GlottHMM is a previously developed vocoder that has been successfully used in HMM-based speech synthesis by parameterizing speech into two parts (the glottal flow and the vocal tract) according to the functioning of the real human voice production mechanism. In this study, a new glottal vocoding method, GlottDNN, is proposed. The GlottDNN vocoder is built on the principles of its predecessor, GlottHMM, but introduces three main improvements: (1) it takes advantage of a new, more accurate glottal inverse filtering method, and (2) it uses a method based on a deep neural network...
for agreeing to read the book in detail, as well as for all the past discussions over the years and the many to come. I do not understand how we have so far managed to avoid writing a paper together. Thanks to friends and colleagues at the various Aalto Speech groups, the Acoustics laboratory, and the National Institute of Informatics in Tokyo. Working in such a diverse and nourishing environment has resulted in some very fruitful cross-pollination of ideas across our research disciplines.
Achieving high quality and naturalness in statistical parametric synthesis of female voices remains difficult despite recent advances in the study area. Vocoding is one key element of all statistical speech synthesizers that is known to affect naturalness. The present study focuses on a special type of vocoding, glottal vocoders, which aim to parameterize speech based on modelling the real excitation of (voiced) speech, the glottal flow. More specifically, we compare three different glottal vocoders with the aim of improved synthesis of female voices. Two of the vocoders are previously...
The state of the art in text-to-speech (TTS) synthesis has recently improved considerably due to novel neural waveform generation methods, such as WaveNet. However, these methods suffer from their slow sequential inference process, while their parallel versions are difficult to train and even more computationally expensive. Meanwhile, generative adversarial networks (GANs) have achieved impressive results in image generation and are making their way into audio applications; parallel inference is among their lucrative properties. By adopting recent...
CSMAPLR Constrained structured maximum a posteriori linear regression CSS
In statistical parametric speech synthesis (SPSS), a few studies have investigated the Lombard effect, specifically by using hidden Markov model (HMM)-based systems. Recently, artificial neural networks have demonstrated promising results in SPSS, in particular long short-term memory recurrent neural networks (LSTMs). The Lombard effect, however, has not been studied in LSTM-based synthesis. In this study, we propose three methods for Lombard speech adaptation in LSTM-based speech synthesis. In particular, we (1) augment Lombard-specific information with the linguistic features as input, (2) scale the activations...
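Adaptation method (2), scaling hidden activations, can be illustrated with an LHUC-style per-unit re-scaling, where each hidden unit is multiplied by a learned factor squashed into (0, 2). This is a generic sketch of the idea, not necessarily the exact formulation used in the paper:

```python
import numpy as np

def scale_activations(hidden, scalers):
    """Per-unit activation scaling in the spirit of LHUC: each hidden unit's
    output is multiplied by 2*sigmoid(scaler), keeping the factor in (0, 2).
    Zero scalers leave the activations unchanged (factor exactly 1)."""
    return hidden * (2.0 / (1.0 + np.exp(-scalers)))

h = np.array([1.0, -0.5, 2.0])
unadapted = scale_activations(h, np.zeros(3))  # identity when scalers are zero
```

At adaptation time only the scaler vector would be updated on style-specific (here, Lombard) data, leaving the base network weights frozen.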
The objective of this paper is to find the fundamental difference between breathy and modal voices based on differences in speech production as reflected in the speech signal. We propose signal processing methods for analyzing the phonation type of a voice. These include the technique of zero-frequency filtering, loudness measurement, computation of the periodic-to-aperiodic energy ratio, and extraction of formants and their amplitudes using the group-delay technique. Parameters derived from these capture the excitation source characteristics, which play a...
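Zero-frequency filtering, the first of the techniques listed, follows a well-known recipe (difference the signal, pass it twice through a resonator at 0 Hz, then repeatedly subtract a local mean to remove the polynomial trend). A sketch under assumed parameters (16 kHz audio, 10 ms half-window), not the paper's exact settings:

```python
import numpy as np

def zff(x, sr=16000, window_ms=10.0):
    """Zero-frequency filtering sketch: positive zero crossings of the
    output approximate the instants of significant excitation."""
    s = np.diff(x, prepend=0.0)                  # emphasize discontinuities
    y = s.astype(np.float64)
    for _ in range(2):                           # cascade of two 0-Hz resonators
        out = np.zeros_like(y)
        for n in range(len(y)):
            out[n] = y[n]
            if n >= 1:
                out[n] += 2.0 * out[n - 1]
            if n >= 2:
                out[n] -= out[n - 2]
        y = out
    # Resonator output grows polynomially; remove the trend by repeated
    # local-mean subtraction over roughly 1-2 pitch periods
    half = int(sr * window_ms / 1000.0)
    kernel = np.ones(2 * half + 1) / (2 * half + 1)
    for _ in range(3):
        y = y - np.convolve(y, kernel, mode="same")
    return y
```

With a synthetic impulse train as input, the filtered output oscillates at the impulse rate with near-zero local mean, which is what makes its zero crossings usable as epoch markers.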
Text-to-speech synthesis in Indian languages has seen a lot of progress over the decade, partly due to the annual Blizzard challenges. These systems assume the text to be written in Devanagari or Dravidian scripts, which are nearly phonemic orthography scripts. However, the most common form of computer interaction among Indians is ASCII transliterated text. Such text is generally noisy, with many variations in spelling for the same word. In this paper we evaluate three approaches to synthesize speech from such text: a naive Uni-Grapheme...
In this paper, we propose modeling a noisy channel for the task of voice conversion (VC). We have used artificial neural networks (ANNs) to capture the speaker-specific characteristics of a target speaker, which avoids the need for any training utterance from the source speaker. We use articulatory features (AFs) as a canonical form, or speaker-independent representation, of the speech signal. Our studies show that AFs contain a significant amount of speaker-specific information in their trajectories. Suitable techniques are proposed to normalize the AF...
This paper presents a novel data augmentation technique for text-to-speech (TTS) that allows generating new (text, audio) training examples without requiring any additional data. Our goal is to increase the diversity of text conditionings available during training, which helps reduce overfitting, especially in low-resource settings. Our method relies on substituting text and audio fragments in a way that preserves syntactical correctness. We take measures to ensure that the synthesized speech does not contain artifacts caused by...
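The substitution idea can be sketched as swapping aligned fragments between two utterances at a matching syntactic slot, keeping each (word, audio) pair together so text and audio stay consistent. The data layout and the notion of a "slot" here are hypothetical simplifications, not the paper's actual algorithm:

```python
from copy import deepcopy

def substitute_fragments(utt_a, utt_b, slot):
    """Swap the first fragment tagged with `slot` between two utterances.
    Each fragment carries its word and audio span, so the swap produces
    two new, internally consistent (text, audio) training examples."""
    ia = next(i for i, f in enumerate(utt_a) if f["slot"] == slot)
    ib = next(i for i, f in enumerate(utt_b) if f["slot"] == slot)
    new_a, new_b = deepcopy(utt_a), deepcopy(utt_b)
    new_a[ia], new_b[ib] = deepcopy(utt_b[ib]), deepcopy(utt_a[ia])
    return new_a, new_b

# Toy aligned utterances (audio spans are placeholder sample lists)
utt_a = [{"word": "the", "slot": "DET", "audio": [0.1, 0.2]},
         {"word": "cat", "slot": "NOUN", "audio": [0.3, 0.4]},
         {"word": "sleeps", "slot": "VERB", "audio": [0.5]}]
utt_b = [{"word": "a", "slot": "DET", "audio": [0.6]},
         {"word": "dog", "slot": "NOUN", "audio": [0.7, 0.8]},
         {"word": "barks", "slot": "VERB", "audio": [0.9]}]
aug_a, aug_b = substitute_fragments(utt_a, utt_b, "NOUN")
```

Swapping at matching slots is what preserves syntactic correctness: "the dog sleeps" and "a cat barks" are both grammatical, while the originals remain untouched for reuse.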
Linear prediction (LP) is a prevalent method for the source-filter separation of speech. One of the drawbacks of conventional LP-based approaches is the biasing of the estimated formants by harmonic peaks. Methods such as discrete all-pole modeling and weighted LP have been proposed to overcome this problem, but they all use a linear frequency scale. This study proposes a new technique, frequency-warped time-weighted linear prediction (WWLP), to provide spectral envelope estimates robust to harmonic peaks that work on a warped frequency scale that approximates...
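For reference, the conventional LP baseline that WWLP builds on can be sketched with the autocorrelation method solved by the Levinson-Durbin recursion. This is plain, unwarped, unweighted LP, shown only to make the starting point concrete:

```python
import numpy as np

def lp_coefficients(x, order):
    """Autocorrelation-method linear prediction via Levinson-Durbin.
    Returns a = [1, a1, ..., ap] such that
    x[n] is predicted as -(a1*x[n-1] + ... + ap*x[n-p])."""
    # Autocorrelation at lags 0..order
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k                  # remaining prediction error
    return a

# Demo: recover the coefficients of a known 2nd-order all-pole process
rng = np.random.default_rng(0)
x = np.zeros(20000)
e = rng.standard_normal(20000)
for n in range(2, len(x)):
    x[n] = 0.75 * x[n - 1] - 0.5 * x[n - 2] + e[n]
a = lp_coefficients(x, 2)                   # a[1] near -0.75, a[2] near 0.5
```

The warped and weighted variants discussed in the abstract modify this same normal-equation setup: frequency warping changes the delay elements, and time weighting changes how each sample contributes to the (auto)correlation terms.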
This paper presents an experimental comparison of various leading vocoders for the application of HMM-based laughter synthesis. Four vocoders, commonly used in speech synthesis, are compared in copy-synthesis and HMM-based synthesis of both male and female laughter. Subjective evaluations are conducted to assess the performance of the vocoders. The results show that all vocoders perform relatively well in copy-synthesis. In HMM-based synthesis using the original phonetic transcriptions, the synthesized voices were of significantly lower quality than copy-synthesis, indicating a...
In this paper, we describe a project that explores a novel experimental setup towards building a spoken, multi-modally rich, and human-like multiparty tutoring robot. A human-robot interaction setup is designed, and a human-human dialogue corpus is collected. The project targets the development of a system platform to study verbal and nonverbal tutoring strategies in spoken interactions with robots which are capable of dialogue. The task is centered on two participants involved in aiming to solve a card-ordering game. Alongside them sits a tutor (robot) that helps...
Neural network-based models that generate glottal excitation waveforms from acoustic features have been found to give improved quality in statistical parametric speech synthesis. Until now, however, these models have been trained separately from the acoustic model. This creates a mismatch between training and synthesis, as the synthesized acoustic features used for the excitation model input differ from the original inputs with which the excitation model was trained. Furthermore, due to errors in predicting the vocal tract filter, the models do not provide a perfect reconstruction of the waveform even if the excitation is predicted without...