Edresson Casanova

ORCID: 0000-0003-0160-7173
Research Areas
  • Speech Recognition and Synthesis
  • Natural Language Processing Techniques
  • Music and Audio Processing
  • Speech and Audio Processing
  • Speech and dialogue systems
  • Topic Modeling
  • Voice and Speech Disorders
  • COVID-19 diagnosis using AI
  • Advanced Data Compression Techniques
  • Phonetics and Phonology Research
  • Text Readability and Simplification
  • Hate Speech and Cyberbullying Detection
  • Digital Radiography and Breast Imaging
  • Advanced Text Analysis Techniques
  • Anomaly Detection Techniques and Applications
  • Medical Image Segmentation Techniques
  • Multimodal Machine Learning Applications
  • Medical Imaging Techniques and Applications
  • Advanced X-ray and CT Imaging
  • Semantic Web and Ontologies
  • Neurobiology of Language and Bilingualism
  • Digital Media Forensic Detection

Affiliations

Nvidia (United States)
2025

Universidade Federal de Mato Grosso
2025

Universidade Tecnológica Federal do Paraná
2021-2022

Brazilian Society of Computational and Applied Mathematics
2020-2022

Universidade de São Paulo
2020-2022

Hospital Universitário da Universidade de São Paulo
2022

British Society of Periodontology
2022

Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
2022

Publications

In this paper, we propose SC-GlowTTS: an efficient zero-shot multi-speaker text-to-speech model that improves similarity for speakers unseen during training. We propose a speaker-conditional architecture that explores a flow-based decoder that works in a zero-shot scenario. As text encoders, we explore a dilated residual convolutional-based encoder, a gated convolutional-based encoder, and a transformer-based encoder. Additionally, we have shown that adjusting a GAN-based vocoder for the spectrograms predicted by the TTS model on the training dataset can significantly improve...

10.21437/interspeech.2021-1774 article EN Interspeech 2021 2021-08-27
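
A hedged sketch of the vocoder-adjustment idea from the abstract: fine-tune a GAN vocoder on spectrograms predicted by a frozen TTS model rather than on ground-truth spectrograms, so the vocoder adapts to the TTS model's output distribution. `tts_model`, `vocoder`, and `loader` are hypothetical stand-ins, not the paper's actual code.

```python
# Minimal sketch (not the paper's code): fine-tuning a GAN vocoder on
# spectrograms *predicted* by a frozen TTS model, instead of ground-truth
# spectrograms. All objects passed in are hypothetical stand-ins.
import torch

def finetune_vocoder(vocoder, tts_model, loader, steps=1000, lr=1e-4):
    tts_model.eval()                      # TTS weights stay frozen
    opt = torch.optim.Adam(vocoder.parameters(), lr=lr)
    for step, (text, real_audio) in enumerate(loader):
        if step >= steps:
            break
        with torch.no_grad():
            mel_pred = tts_model(text)    # predicted (not ground-truth) mels
        audio_fake = vocoder(mel_pred)
        # Stand-in reconstruction loss; a real recipe would also keep the
        # GAN discriminator losses used to train the vocoder originally.
        loss = torch.nn.functional.l1_loss(audio_fake, real_audio)
        opt.zero_grad(); loss.backward(); opt.step()
```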

YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS. Our method builds upon the VITS model and adds several novel modifications for zero-shot multi-speaker and multilingual training. We achieved state-of-the-art (SOTA) results in zero-shot multi-speaker TTS and results comparable to SOTA in zero-shot voice conversion on the VCTK dataset. Additionally, our approach achieves promising results in a target language with a single-speaker dataset, opening possibilities for such systems in low-resource languages. Finally, it is possible to fine-tune the model with less than 1 minute of speech and achieve voice similarity...

10.48550/arxiv.2112.02418 preprint EN other-oa arXiv (Cornell University) 2021-01-01
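
YourTTS ships as a released checkpoint in the open-source Coqui TTS package; a minimal zero-shot inference sketch, assuming `pip install TTS` and a short hypothetical reference clip `speaker.wav` of the target voice:

```python
# Zero-shot multi-speaker synthesis with the released YourTTS checkpoint
# via Coqui TTS. "speaker.wav" is a hypothetical reference recording of
# the target (unseen) speaker.
from TTS.api import TTS

tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")
tts.tts_to_file(
    text="Zero-shot synthesis conditioned on a short reference clip.",
    speaker_wav="speaker.wav",   # conditions the model on the target voice
    language="en",               # YourTTS also supports "fr-fr" and "pt-br"
    file_path="output.wav",
)
```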

Most Zero-shot Multi-speaker TTS (ZS-TTS) systems support only a single language. Although models like YourTTS, VALL-E X, Mega-TTS 2, and Voicebox have explored multilingual ZS-TTS, they are limited to just a few high/medium-resource languages, limiting the applications of these models in most low/medium-resource languages. In this paper, we aim to alleviate this issue by proposing and making publicly available the XTTS system. Our method builds upon the Tortoise model and adds several novel modifications to enable multilingual training,...

10.21437/interspeech.2024-2016 article EN Interspeech 2024 2024-09-01
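
XTTS is likewise available through Coqui TTS; a minimal sketch of the cross-lingual case, assuming a hypothetical English reference clip `reference.wav` while synthesizing Spanish text:

```python
# Cross-lingual zero-shot cloning with the public XTTS v2 checkpoint:
# the voice comes from an (assumed) English reference clip, while the
# synthesized text is Spanish.
from TTS.api import TTS

tts = TTS(model_name="tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="La síntesis multilingüe funciona con una sola muestra de voz.",
    speaker_wav="reference.wav",
    language="es",
    file_path="cloned_es.wav",
)
```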

This work presents FreeSVC, a promising multilingual singing voice conversion approach that leverages an enhanced VITS model with Speaker-invariant Clustering (SPIN) for better content representation and the state-of-the-art (SOTA) speaker encoder ECAPA2. FreeSVC incorporates trainable language embeddings to handle multiple languages and employs an advanced speaker encoder to disentangle speaker characteristics from linguistic content. Designed for zero-shot learning, it enables cross-lingual conversion without extensive language-specific...

10.48550/arxiv.2501.05586 preprint EN arXiv (Cornell University) 2025-01-09
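
A structural sketch, with invented module names rather than FreeSVC's actual code, of how the pieces in the abstract fit together: content features, a speaker embedding, and a trainable language embedding are combined by a decoder, so speaker identity can be swapped independently of content.

```python
# Structural sketch only: how content, speaker, and language
# representations can be combined for conversion. All modules and
# dimensions are toy stand-ins, not the FreeSVC implementation.
import torch
import torch.nn as nn

class ToySVC(nn.Module):
    def __init__(self, content_dim=256, spk_dim=192, n_langs=6, lang_dim=32):
        super().__init__()
        self.lang_emb = nn.Embedding(n_langs, lang_dim)  # trainable language embedding
        self.decoder = nn.Linear(content_dim + spk_dim + lang_dim, 80)  # -> mel bins

    def forward(self, content, spk_emb, lang_id):
        # content: (T, content_dim), e.g. SPIN-style features of the source singing
        # spk_emb: (spk_dim,), e.g. an ECAPA2-style embedding of the target voice
        lang = self.lang_emb(lang_id)                    # (lang_dim,)
        T = content.size(0)
        cond = torch.cat([spk_emb, lang]).expand(T, -1)  # broadcast over time
        return self.decoder(torch.cat([content, cond], dim=-1))

model = ToySVC()
mel = model(torch.randn(100, 256), torch.randn(192), torch.tensor(0))
print(mel.shape)  # torch.Size([100, 80])
```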

While autoregressive speech token generation models produce speech with remarkable variety and naturalness, their inherent lack of controllability often results in issues such as hallucinations and undesired vocalizations that do not conform to conditioning inputs. We introduce Koel-TTS, a suite of enhanced encoder-decoder Transformer TTS models that address these challenges by incorporating preference alignment techniques guided by automatic speech recognition and speaker verification models. Additionally, we incorporate...

10.48550/arxiv.2502.05236 preprint EN arXiv (Cornell University) 2025-02-07
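
One way to read the preference-alignment recipe: candidate generations are ranked by how well an ASR transcript agrees with the conditioning text and by speaker-embedding similarity to the target voice. A self-contained toy scorer under those assumptions; the metrics and weighting are illustrative, not the paper's exact objective.

```python
# Toy preference scorer in the spirit of ASR- and speaker-verification-
# guided ranking. Inputs are assumed precomputed: an ASR transcript of the
# generated audio, and speaker embeddings for target and generated audio.
import difflib
import numpy as np

def transcript_agreement(ref_text: str, asr_hyp: str) -> float:
    """Crude 0..1 agreement between conditioning text and ASR output."""
    return difflib.SequenceMatcher(None, ref_text.lower(), asr_hyp.lower()).ratio()

def speaker_similarity(emb_target: np.ndarray, emb_generated: np.ndarray) -> float:
    """Cosine similarity between speaker embeddings."""
    return float(emb_target @ emb_generated /
                 (np.linalg.norm(emb_target) * np.linalg.norm(emb_generated)))

def preference_score(ref_text, asr_hyp, emb_target, emb_generated, w_asr=0.5):
    # Illustrative linear combination; real preference data would come from
    # ranking candidate generations against each other.
    return (w_asr * transcript_agreement(ref_text, asr_hyp)
            + (1 - w_asr) * speaker_similarity(emb_target, emb_generated))
```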

10.1109/icassp49660.2025.10888202 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

Speech provides a natural way for human-computer interaction. In particular, speech synthesis systems are popular in different applications, such as personal assistants, GPS applications, screen readers and accessibility tools. However, not all languages are on the same level when it comes to resources and systems for speech synthesis. This work consists of creating publicly available resources for Brazilian Portuguese in the form of a novel dataset along with deep learning models for end-to-end speech synthesis. The dataset has 10.5 hours of speech from a single speaker, from which a Tacotron 2 model with the RTISI-LA...

10.1007/s10579-021-09570-4 article EN cc-by Language Resources and Evaluation 2022-01-12

In this work, we propose several techniques to address data scarceness in the ComParE 2021 COVID-19 identification tasks for the application of deep models such as Convolutional Neural Networks. Data is initially preprocessed into spectrogram or MFCC-gram formats. After preprocessing, we combine three different data augmentation techniques to be applied in model training. Then we employ transfer learning from pretrained audio neural networks. Those techniques are applied to distinct neural architectures. For speech segments, we obtained competitive results. On...

10.21437/interspeech.2021-1798 article EN Interspeech 2021 2021-08-27
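
A minimal sketch of the preprocessing step plus one common augmentation family: librosa for spectrogram/MFCC-gram extraction and a SpecAugment-style time/frequency mask. The paper combines three augmentation techniques; this shows only the masking idea, and the mask sizes are illustrative.

```python
# Preprocess an audio file into mel-spectrogram or MFCC features, then
# apply a SpecAugment-style mask as one example of training-time
# augmentation. Requires: pip install librosa numpy
import librosa
import numpy as np

def extract_features(path, kind="mel", n_mfcc=40):
    y, sr = librosa.load(path, sr=16000)
    if kind == "mel":
        feats = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr))
    else:  # MFCC-gram
        feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return feats  # shape: (bins, frames)

def spec_mask(feats, max_f=8, max_t=20, rng=np.random.default_rng()):
    out = feats.copy()
    f0 = rng.integers(0, out.shape[0] - max_f)   # frequency mask start
    out[f0:f0 + rng.integers(1, max_f)] = out.mean()
    t0 = rng.integers(0, out.shape[1] - max_t)   # time mask start
    out[:, t0:t0 + rng.integers(1, max_t)] = out.mean()
    return out
```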

Text-to-Speech (TTS) technology brings significant advantages, such as giving a voice to those with speech impairments, but it also enables audio deepfakes and spoofs. The former mislead individuals and may propagate misinformation, while the latter undermine voice biometric security systems. AI-based detection can help address these challenges by automatically differentiating between genuine and fabricated voice recordings. However, these models are only as good as their training data, which currently is severely limited due...

10.48550/arxiv.2401.09512 preprint EN other-oa arXiv (Cornell University) 2024-01-01

Most Zero-shot Multi-speaker TTS (ZS-TTS) systems support only a single language. Although models like YourTTS, VALL-E X, Mega-TTS 2, and Voicebox have explored multilingual ZS-TTS, they are limited to just a few high/medium-resource languages, limiting the applications of these models in most low/medium-resource languages. In this paper, we aim to alleviate this issue by proposing and making publicly available the XTTS system. Our method builds upon the Tortoise model and adds several novel modifications to enable multilingual training,...

10.48550/arxiv.2406.04904 preprint EN arXiv (Cornell University) 2024-06-07

Automatic Speech Recognition (ASR) is a complex and challenging task. In recent years, there have been significant advances in the area. In particular, for the Brazilian Portuguese (BP) language, there were around 376 h of publicly available data for the ASR task until the second half of 2020. With the release of new datasets in early 2021, this number increased to 574 h. The existing resources, however, are composed of audios containing only read and prepared speech. There is a lack of datasets including spontaneous speech, which is essential for several...

10.1007/s10579-022-09621-4 article EN cc-by Language Resources and Evaluation 2022-11-21

Automatic Speech Recognition (ASR) is a complex and challenging task. In recent years, there have been significant advances in the area. In particular, for the Brazilian Portuguese (BP) language, there were about 376 hours of publicly available data for the ASR task until the second half of 2020. With the release of new datasets in early 2021, this number increased to 574 hours. The existing resources, however, are composed of audios containing only read and prepared speech. There is a lack of datasets including spontaneous speech, which is essential for different...

10.48550/arxiv.2110.15731 preprint EN other-oa arXiv (Cornell University) 2021-01-01

BibleTTS is a large, high-quality, open speech dataset for ten languages spoken in Sub-Saharan Africa. The corpus contains up to 86 hours of aligned, studio-quality 48kHz single-speaker recordings per language, enabling the development of high-quality text-to-speech models. The languages represented are: Akuapem Twi, Asante Twi, Chichewa, Ewe, Hausa, Kikuyu, Lingala, Luganda, Luo, and Yoruba. This is a derivative work of Bible recordings made and released by the Open.Bible project from Biblica. We have cleaned and filtered the original...

10.21437/interspeech.2022-10850 article EN Interspeech 2022 2022-09-16

Edresson Casanova, Lucas Gris, Augusto Camargo, Daniel da Silva, Murilo Gazzola, Ester Sabino, Anna Levin, Arnaldo Candido Jr, Sandra Aluisio, Marcelo Finger. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021.

10.18653/v1/2021.findings-acl.55 article EN cc-by 2021-01-01

During the coronavirus disease 2019 (COVID-19) pandemic, various research disciplines collaborated to address the impacts of severe acute respiratory syndrome coronavirus-2 infections. This paper presents an interpretability analysis of a convolutional neural network-based model designed for COVID-19 detection using audio data. We explore the input features that play a crucial role in the model's decision-making process, including spectrograms, fundamental frequency (F0), F0 standard deviation,...

10.36922/aih.2992 article EN cc-by Artificial Intelligence in Health 2024-07-30
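
For readers wanting to reproduce the F0-based inputs mentioned above, a small sketch using librosa's pYIN pitch tracker; the extractor choice and the file name are ours, not necessarily the paper's pipeline.

```python
# Extract F0 and its standard deviation from an audio file with pYIN.
# The extractor (librosa.pyin) and "sample.wav" are our assumptions.
import librosa
import numpy as np

y, sr = librosa.load("sample.wav", sr=16000)  # hypothetical input file
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)
f0_voiced = f0[~np.isnan(f0)]                 # keep voiced frames only
print("mean F0:", f0_voiced.mean(), "F0 std:", f0_voiced.std())
```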

Large language models (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling the application of language modeling techniques to audio data. However, audio codecs often operate at high frame rates, resulting in slow training and inference, especially for autoregressive models. To address this challenge, we present the Low Frame-rate Speech Codec (LFSC): a neural audio codec that leverages finite scalar quantization and adversarial training with large speech language models to achieve high-quality audio compression...

10.48550/arxiv.2409.12117 preprint EN arXiv (Cornell University) 2024-09-18
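
The frame-rate argument is easy to quantify: the autoregressive sequence length scales linearly with the codec's frame rate. A back-of-the-envelope comparison with illustrative numbers, not figures taken from the paper:

```python
# Illustrative arithmetic only: how codec frame rate drives the number of
# autoregressive steps a model must take per second of audio. The rates
# below are examples, not figures from the LFSC paper.
def tokens_per_second(frame_rate_hz: float, codebooks: int = 1) -> float:
    return frame_rate_hz * codebooks

high_rate = tokens_per_second(75.0)   # e.g., a conventional codec
low_rate = tokens_per_second(21.5)    # e.g., a low frame-rate codec
print(f"75 Hz codec:   {high_rate:.0f} tokens/s")
print(f"21.5 Hz codec: {low_rate:.1f} tokens/s "
      f"({high_rate / low_rate:.1f}x fewer autoregressive steps)")
```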