Bajibabu Bollepalli

ORCID: 0000-0003-1268-0579
Research Areas
  • Speech Recognition and Synthesis
  • Speech and Audio Processing
  • Music and Audio Processing
  • Speech and Dialogue Systems
  • Voice and Speech Disorders
  • Natural Language Processing Techniques
  • Phonetics and Phonology Research
  • Topic Modeling
  • Advanced Data Compression Techniques
  • Neural Networks and Applications
  • Multimodal Machine Learning Applications
  • Intelligent Tutoring Systems and Adaptive Learning
  • Algorithms and Data Compression
  • Advanced Adaptive Filtering Techniques
  • Emotion and Mood Recognition
  • Subtitles and Audiovisual Media
  • Language, Metaphor, and Cognition
  • ICT in Developing Communities
  • Sentiment Analysis and Opinion Mining
  • Multi-Agent Systems and Negotiation
  • Social Robot Interaction and HRI
  • Innovative Teaching and Learning Methods

Amazon (United States)
2022

Amazon (United Kingdom)
2021

Aalto University
2016-2019

KTH Royal Institute of Technology
2013-2014

International Institute of Information Technology, Hyderabad
2012

This paper proposes a method for generating speech from filterbank mel frequency cepstral coefficients (MFCC), which are widely used in speech applications, such as ASR, but are generally considered unusable for speech synthesis. First, we predict fundamental frequency and voicing information from MFCCs with an autoregressive recurrent neural net. Second, the spectral envelope information contained in the MFCCs is converted to all-pole filters, and a pitch-synchronous excitation model matched to these filters is trained. Finally, we introduce a generative adversarial...

10.1109/icassp.2018.8461852 article EN 2018-04-01

Recently, generative neural network models which operate directly on raw audio, such as WaveNet, have improved the state of the art in text-to-speech synthesis (TTS). Moreover, there is increasing interest in using these models as statistical vocoders, i.e., for generating speech waveforms from various acoustic features. However, there is also a need to reduce model complexity without compromising the synthesis quality. Previously, glottal pulseforms (i.e., time-domain waveforms corresponding to the source of the human voice production mechanism) have been...

10.1109/taslp.2019.2906484 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2019-03-27

Recent studies have shown that text-to-speech synthesis quality can be improved by using glottal vocoding. This refers to vocoders that parameterize speech into two parts, the glottal excitation and the vocal tract, as they occur in the human speech production apparatus. Current glottal vocoders generate the glottal flow waveform with deep neural networks (DNNs). However, the squared error-based training of the present models is limited to generating conditional average waveforms, which fails to capture the stochastic variation of the waveforms. As a result, shaped noise is added as...

10.21437/interspeech.2017-1288 preprint EN Interspeech 2017 2017-08-16

A vocoder is used to express a speech waveform with a controllable parametric representation that can be converted back into a waveform. Vocoders representing the main categories (mixed excitation, glottal, and sinusoidal vocoders) were compared in this study with formal and crowd-sourced listening tests. The quality of the vocoders was measured within the context of analysis-synthesis as well as text-to-speech (TTS) synthesis in a modern statistical framework. Furthermore, the TTS experiments were divided by vocoder-specific features...

10.1109/taslp.2018.2835720 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2018-05-18

Recent speech technology research has seen a growing interest in using WaveNets as statistical vocoders, i.e., generating speech waveforms from acoustic features. These models have been shown to improve the generated speech quality over classical vocoders in many tasks, such as text-to-speech synthesis and voice conversion. Furthermore, conditioning with acoustic features allows sharing the waveform generator model across multiple speakers without additional speaker codes. However, multi-speaker WaveNet models require large amounts...

10.21437/interspeech.2018-1635 article EN Interspeech 2018 2018-08-28

GlottHMM is a previously developed vocoder that has been successfully used in HMM-based synthesis by parameterizing speech into two parts (glottal flow and vocal tract) according to the functioning of the real human voice production mechanism. In this study, a new glottal vocoding method, GlottDNN, is proposed. The GlottDNN vocoder is built on the principles of its predecessor, GlottHMM, but introduces three main improvements: it (1) takes advantage of a new, more accurate glottal inverse filtering method, and (2) uses a new method of deep neural network...

10.21437/interspeech.2016-342 article EN Interspeech 2016 2016-08-29

for agreeing to read the book in detail, as well as for all the past discussions over the years and the many to come. I do not understand how we have so far managed to avoid writing a paper together. Thanks to friends and colleagues at the various Aalto speech groups, the Acoustics laboratory, and the National Institute of Informatics in Tokyo. Working in such a diverse and nourishing environment has resulted in some very fruitful cross-pollination of ideas across our research disciplines.

10.21437/interspeech.2019-2008 article EN Interspeech 2019 2019-09-13

Achieving high quality and naturalness in statistical parametric synthesis of female voices remains difficult despite recent advances in the study area. Vocoding is one such key element of all statistical speech synthesizers that is known to affect naturalness. The present study focuses on a special type of vocoding, glottal vocoders, which aim to parameterize speech based on modelling the real excitation of (voiced) speech, the glottal flow. More specifically, we compare three different glottal vocoders by aiming at improved synthesis of female voices. Two of the vocoders are previously...

10.1109/icassp.2016.7472653 article EN 2016-03-01

The state-of-the-art in text-to-speech (TTS) synthesis has recently improved considerably due to novel neural waveform generation methods, such as WaveNet. However, these methods suffer from their slow sequential inference process, while their parallel versions are difficult to train and even more computationally expensive. Meanwhile, generative adversarial networks (GANs) have achieved impressive results in image generation and are making their way into audio applications; parallel inference is among their lucrative properties. By adopting recent...

10.1109/icassp.2019.8683271 article EN ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019-04-17

CSMAPLR: Constrained structured maximum a posteriori linear regression; CSS

10.21437/interspeech.2019-1333 article EN Interspeech 2019 2019-09-13

In statistical parametric speech synthesis (SPSS), a few studies have investigated the Lombard effect, specifically by using hidden Markov model (HMM)-based systems. Recently, artificial neural networks have demonstrated promising results in SPSS, in particular long short-term memory recurrent neural networks (LSTMs). The Lombard effect, however, has not been studied in LSTM-based speech synthesis. In this study, we propose three methods for Lombard speech adaptation in LSTM-based speech synthesis. In particular, we (1) augment Lombard-specific information with the linguistic features as input, (2) scale the hidden activations...

10.1109/icassp.2017.7953209 article EN 2017-03-01

The objective of this paper is to find the fundamental difference between breathy and modal voices based on differences in speech production as reflected in the signal. We propose signal processing methods for analyzing the phonation type of a voice. These include the technique of zero-frequency filtering, loudness measurement, computation of the periodic to aperiodic energy ratio, and extraction of formants and their amplitudes using the group-delay technique. Parameters derived from these methods capture the excitation source characteristics, which play a...

10.1109/spcom.2012.6290015 article EN 2012-07-01

Text-to-Speech synthesis in Indian languages has seen a lot of progress over the decade, partly due to the annual Blizzard challenges. These systems assume the text to be written in Devanagari or Dravidian scripts, which are nearly phonemic orthography scripts. However, the most common form of computer interaction among Indians is ASCII transliterated text. Such text is generally noisy, with many variations in spelling for the same word. In this paper we evaluate three approaches to synthesize speech from such text: a naive Uni-Grapheme...

10.21437/ssw.2016-12 article EN 2016-09-13

In this paper, we propose modeling a noisy-channel for the task of voice conversion (VC). We have used artificial neural networks (ANNs) to capture the speaker-specific characteristics of the target speaker, which avoids the need for any training utterances from the source speaker. We use articulatory features (AFs) as a canonical form, or speaker-independent representation, of the speech signal. Our studies show that AFs contain a significant amount of information in their trajectories. Suitable techniques are proposed to normalize the AF...

10.21437/interspeech.2012-587 article EN Interspeech 2012 2012-09-09

This paper presents a novel data augmentation technique for text-to-speech (TTS) that allows generating new (text, audio) training examples without requiring any additional data. Our goal is to increase the diversity of text conditionings available during training. This helps reduce overfitting, especially in low-resource settings. Our method relies on substituting text and audio fragments in a way that preserves syntactical correctness. We take measures to ensure that the synthesized speech does not contain artifacts caused by...

10.1109/icassp43922.2022.9746291 article EN ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022-04-27

Linear prediction (LP) is a prevalent source-filter separation method in the analysis of speech production. One of the drawbacks of conventional LP-based approaches is the biasing of the estimated formants by harmonic peaks. Methods such as discrete all-pole modeling and weighted LP have been proposed to overcome this problem, but they all use a linear frequency scale. This study proposes a new technique, frequency-warped time-weighted linear prediction (WWLP), to provide spectral envelope estimates that are robust to harmonic peaks and that work on a warped frequency scale that approximates...

10.1109/lsp.2017.2665687 article EN IEEE Signal Processing Letters 2017-02-08

This paper presents an experimental comparison of various leading vocoders for the application of HMM-based laughter synthesis. Four vocoders, commonly used in speech synthesis, are evaluated in copy-synthesis and HMM-based synthesis of both male and female laughter. Subjective evaluations are conducted to assess the performance of the vocoders. The results show that all vocoders perform relatively well in copy-synthesis. In HMM-based synthesis using original phonetic transcriptions, the synthesized voices were of significantly lower quality than copy-synthesis, indicating a...

10.1109/icassp.2014.6853597 article EN 2014-05-01

In this paper, we describe a project that explores a novel experimental setup towards building a spoken, multi-modally rich, and human-like multiparty tutoring robot. A human-robot interaction setup is designed, and a human-human dialogue corpus is collected. The project targets the development of a system platform to study verbal and nonverbal tutoring strategies in spoken interactions with robots which are capable of dialogue. The task is centered on two participants involved in a dialogue aiming to solve a card-ordering game. Alongside them sits a tutor (robot) that helps...

10.1145/2559636.2563681 article EN 2014-03-03

Neural network-based models that generate glottal excitation waveforms from acoustic features have been found to give improved quality in statistical parametric speech synthesis. Until now, however, these models have been trained separately from the synthesis model. This creates a mismatch between training and synthesis, as the synthesized acoustic features used for the excitation model input differ from the original inputs on which the excitation model was trained. Furthermore, due to errors in predicting the vocal tract filter, the excitation models do not provide a perfect reconstruction of the speech waveform even if predicted without...

10.21437/interspeech.2017-848 article EN Interspeech 2017 2017-08-16