Theodoros Kouzelis

ORCID: 0000-0002-1938-9250
Research Areas
  • Speech Recognition and Synthesis
  • Speech and Audio Processing
  • Music and Audio Processing
  • Music Technology and Sound Studies
  • Neuroscience and Music Perception
  • Model Reduction and Neural Networks
  • Machine Learning in Materials Science
  • Medical Image Segmentation Techniques
  • Topic Modeling
  • Voice and Speech Disorders
  • AI in cancer detection
  • Generative Adversarial Networks and Image Synthesis
  • Image Retrieval and Classification Techniques

Institute for Language and Speech Processing
2023-2024

National Technical University of Athens
2023

Latent generative models have emerged as a leading approach for high-quality image synthesis. These models rely on an autoencoder to compress images into a latent space, followed by a generative model that learns the latent distribution. We identify that existing autoencoders lack equivariance to semantic-preserving transformations like scaling and rotation, resulting in complex latent spaces that hinder generative performance. To address this, we propose EQ-VAE, a simple regularization approach that enforces equivariance in the latent space, reducing its complexity without degrading reconstruction quality...

10.48550/arxiv.2502.09509 preprint EN arXiv (Cornell University) 2025-02-13
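
The equivariance constraint described in this abstract can be sketched in a few lines: decoding a spatially transformed latent should match the same transform applied to the image. The toy PyTorch autoencoder below is a hypothetical stand-in, not the EQ-VAE release; rotation serves as one example semantic-preserving transform, and the 0.5 weight is an assumed hyperparameter.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def eq_regularizer(encoder, decoder, x):
        """Penalty: decoding a rotated latent should reproduce the rotated input."""
        z_rot = torch.rot90(encoder(x), k=1, dims=(2, 3))  # transform in latent space (NCHW)
        x_rot = torch.rot90(x, k=1, dims=(2, 3))           # same transform in pixel space
        return F.mse_loss(decoder(z_rot), x_rot)

    # Tiny fully convolutional autoencoder, only to make the sketch executable.
    enc = nn.Conv2d(3, 8, 3, stride=2, padding=1)
    dec = nn.ConvTranspose2d(8, 3, 4, stride=2, padding=1)
    x = torch.randn(2, 3, 32, 32)
    loss = F.mse_loss(dec(enc(x)), x) + 0.5 * eq_regularizer(enc, dec, x)
    loss.backward()

The paper's actual loss and transform set may differ; the point is only that the regularizer ties latent-space and pixel-space transformations together, which is what simplifies the latent space.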

In this work, we investigate the personalization of text-to-music diffusion models in a few-shot setting. Motivated by recent advances in the computer vision domain, we are the first to explore the combination of pre-trained text-to-audio diffusers with two established personalization methods. We experiment with the effect of audio-specific data augmentation on overall system performance and assess different training strategies. For evaluation, we construct a novel dataset of prompts and music clips. We consider both embedding-based and music-specific...

10.1109/icassp48485.2024.10446869 article EN ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024-03-18
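
In the vision literature, the "two established methods" are typically textual inversion and DreamBooth-style finetuning (the abstract does not name them, so this is an assumption). Textual inversion can be illustrated independently of any particular text-to-audio backbone: freeze the diffuser and optimize only a new pseudo-token embedding against the denoising objective. Everything below (the toy denoiser, dimensions, learning rate, step count) is an illustrative placeholder.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToyDenoiser(nn.Module):
        """Stand-in for a large pre-trained text-to-audio diffusion backbone."""
        def __init__(self, dim=64):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim * 2, 128), nn.SiLU(),
                                     nn.Linear(128, dim))
        def forward(self, noisy, cond):
            return self.net(torch.cat([noisy, cond], dim=-1))

    denoiser = ToyDenoiser()
    for p in denoiser.parameters():
        p.requires_grad_(False)                 # personalization never updates the backbone

    pseudo_token = nn.Parameter(torch.randn(1, 64))  # the only trainable tensor
    opt = torch.optim.AdamW([pseudo_token], lr=5e-3)

    clips = torch.randn(16, 64)                 # stand-in latents for the few-shot music clips
    for _ in range(100):
        noise = torch.randn_like(clips)
        pred = denoiser(clips + noise, pseudo_token.expand(len(clips), -1))
        loss = F.mse_loss(pred, noise)          # predict the noise, conditioned on the token
        opt.zero_grad(); loss.backward(); opt.step()

DreamBooth-style personalization would instead finetune the backbone itself, usually with a prior-preservation term to avoid forgetting.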

Modern speech recognition systems exhibit rapid performance degradation under domain shift. This issue is especially prevalent in data-scarce settings, such as low-resource languages, where the diversity of the training data is limited. In this work we propose M2DS2, a simple and sample-efficient finetuning strategy for large pretrained models, based on mixed source and target domain self-supervision. We find that including source domain self-supervision stabilizes training and avoids mode collapse of the latent representations. For...

10.48550/arxiv.2301.00304 preprint EN cc-by arXiv (Cornell University) 2023-01-01
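
In loss terms, the mixed-self-supervision recipe reads as a supervised source loss plus self-supervised terms on both source and target audio; the abstract credits the source-side term with stabilizing training. The toy below substitutes masked-frame reconstruction for a real contrastive objective and a linear layer for the pretrained encoder, so treat every component and both 0.1 weights as placeholders.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToyEncoder(nn.Module):
        """Stand-in for a large pre-trained speech encoder (wav2vec-style)."""
        def __init__(self, dim=32, symbols=10):
            super().__init__()
            self.body = nn.Linear(dim, dim)
            self.head = nn.Linear(dim, symbols)
        def ssl_loss(self, x):
            # Toy self-supervision: reconstruct masked frames.
            mask = torch.rand_like(x) < 0.3
            return F.mse_loss(self.body(x.masked_fill(mask, 0.0))[mask], x[mask])
        def sup_loss(self, x, y):
            return F.cross_entropy(self.head(self.body(x)), y)

    model = ToyEncoder()
    src, src_y = torch.randn(8, 32), torch.randint(0, 10, (8,))
    tgt = torch.randn(8, 32)                    # unlabeled target-domain audio
    # Mixed objective: supervised source loss + SSL on BOTH domains. Keeping the
    # source SSL term is what the abstract credits with avoiding mode collapse.
    loss = (model.sup_loss(src, src_y)
            + 0.1 * model.ssl_loss(src)
            + 0.1 * model.ssl_loss(tgt))
    loss.backward()

Dropping the source SSL term would reduce this to plain continued pretraining on target audio, which the abstract reports is less stable.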

Recent advances in Diffusion Models (DMs) have led to significant progress in visual synthesis and editing tasks, establishing them as a strong competitor to Generative Adversarial Networks (GANs). However, the latent space of DMs is not as well understood as that of GANs. Recent research has focused on unsupervised semantic discovery by leveraging the bottleneck layer of the denoising network, which has been shown to exhibit properties of a semantic latent space. However, these approaches are limited to discovering global attributes. In this paper we address...

10.48550/arxiv.2408.16845 preprint EN arXiv (Cornell University) 2024-08-29
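
For context on the "bottleneck layer" line of work this abstract builds on: a common recipe in prior papers (not necessarily this one) collects bottleneck activations of the denoising U-Net, often called the h-space, and takes their principal components as candidate semantic edit directions. A sketch with stand-in activations:

    import torch

    def hspace_directions(h_acts, k=5):
        """h_acts: (N, C) pooled bottleneck activations; top-k principal axes."""
        _, _, v = torch.pca_lowrank(h_acts, q=k)  # centers the data internally
        return v.T                                # (k, C) candidate edit directions

    acts = torch.randn(1000, 512)                 # stand-in pooled h-space activations
    dirs = hspace_directions(acts, k=5)
    print(dirs.shape)                             # torch.Size([5, 512])

Editing then amounts to shifting the bottleneck during sampling (h + scale * direction); the limitation the abstract targets is precisely that such directions act globally on the image rather than locally.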

In this work, we investigate the personalization of text-to-music diffusion models in a few-shot setting. Motivated by recent advances in the computer vision domain, we are the first to explore the combination of pre-trained text-to-audio diffusers with two established personalization methods. We experiment with the effect of audio-specific data augmentation on overall system performance and assess different training strategies. For evaluation, we construct a novel dataset of prompts and music clips. We consider both embedding-based and music-specific...

10.48550/arxiv.2309.11140 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Modern speech recognition systems exhibit rapid performance degradation under domain shift. This issue is especially prevalent in data-scarce settings, such as low-resource languages, where the diversity of the training data is limited. In this work, we propose M2DS2, a simple and sample-efficient fine-tuning strategy for large pre-trained models, based on mixed source and target domain self-supervision. We find that including source domain self-supervision stabilizes training and avoids mode collapse of the latent representations. For...

10.1109/taslp.2023.3328280 article EN IEEE/ACM Transactions on Audio, Speech, and Language Processing 2023-10-30

Modern speech recognition systems exhibit rapid performance degradation under domain shift. This issue is especially prevalent in data-scarce settings, such as low-resource languages, where the diversity of the training data is limited. In this work we propose M2DS2, a simple and sample-efficient finetuning strategy for large pretrained models, based on mixed source and target domain self-supervision. We find that including source domain self-supervision stabilizes training and avoids mode collapse of the latent...

10.36227/techrxiv.21792920.v1 preprint EN cc-by 2023-01-09

The study of speech disorders can benefit greatly from time-aligned data. However, audio-text mismatches in disfluent speech cause rapid performance degradation for modern speech aligners, hindering the use of automatic approaches. In this work, we propose a simple and effective modification of the alignment graph construction of CTC-based models using Weighted Finite State Transducers. The proposed weakly-supervised approach alleviates the need for verbatim transcription of disfluencies for forced alignment. During graph construction, we allow...

10.48550/arxiv.2306.00996 preprint EN cc-by arXiv (Cornell University) 2023-01-01
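
The graph modification can be pictured without a WFST library: a standard forced-alignment graph is a linear chain over the transcript, and adding a weighted bypass arc per word lets the aligner skip or absorb disfluent regions without a verbatim transcript. The arc encoding and the penalty value below are illustrative only; the paper works with real Weighted Finite State Transducers over CTC outputs.

    # Toy alignment graph: (src_state, dst_state, label, weight) arcs.
    def alignment_arcs(words, skip_penalty=2.5):
        """Linear transcript chain plus a weighted bypass arc per word."""
        arcs = []
        for i, w in enumerate(words):
            arcs.append((i, i + 1, w, 0.0))                  # normal path: emit the word
            arcs.append((i, i + 1, "<skip>", skip_penalty))  # bypass disfluent region
        return arcs

    for arc in alignment_arcs(["the", "quick", "fox"]):
        print(arc)

In the WFST setting such bypass arcs compose with the CTC emission lattice, so the penalty discourages skipping unless the acoustics genuinely mismatch the transcript.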

In recent years, datasets of paired audio and captions have enabled remarkable success in automatically generating descriptions for audio clips, namely Automated Audio Captioning (AAC). However, it is labor-intensive and time-consuming to collect a sufficient number of audio-caption pairs. Motivated by the recent advances in Contrastive Language-Audio Pretraining (CLAP), we propose a weakly-supervised approach to train an AAC model assuming only text data and a pre-trained CLAP model, alleviating the need for paired target data. Our approach leverages...

10.48550/arxiv.2309.12242 preprint EN cc-by arXiv (Cornell University) 2023-01-01
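
The trick the abstract hints at can be sketched as prefix-style captioning: because CLAP embeds audio and text in a shared space, a decoder trained to reconstruct captions from their own CLAP text embeddings can, at inference, be conditioned on CLAP audio embeddings instead. The decoder architecture below and both embedding stand-ins are hypothetical, not the paper's model.

    import torch
    import torch.nn as nn

    class PrefixDecoder(nn.Module):
        """Caption decoder conditioned on a single CLAP-sized embedding."""
        def __init__(self, clap_dim=512, vocab=1000, hid=256):
            super().__init__()
            self.proj = nn.Linear(clap_dim, hid)   # embedding -> prefix vector
            self.emb = nn.Embedding(vocab, hid)
            self.rnn = nn.GRU(hid, hid, batch_first=True)
            self.out = nn.Linear(hid, vocab)
        def forward(self, clap_emb, tokens):
            prefix = self.proj(clap_emb).unsqueeze(1)       # (B, 1, hid)
            x = torch.cat([prefix, self.emb(tokens)], dim=1)
            h, _ = self.rnn(x)
            return self.out(h[:, :-1])                      # next-token logits

    dec = PrefixDecoder()
    # Training: condition on the caption's own CLAP *text* embedding (no audio needed).
    text_emb = torch.randn(4, 512)           # stand-in for clap_text(captions)
    tokens = torch.randint(0, 1000, (4, 12))
    logits = dec(text_emb, tokens)           # (4, 12, 1000)
    loss = nn.functional.cross_entropy(logits.reshape(-1, 1000), tokens.reshape(-1))
    loss.backward()
    # Inference would swap in the CLAP *audio* embedding of a clip for text_emb.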