- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- Music Technology and Sound Studies
- Neuroscience and Music Perception
- Model Reduction and Neural Networks
- Machine Learning in Materials Science
- Medical Image Segmentation Techniques
- Topic Modeling
- Voice and Speech Disorders
- AI in Cancer Detection
- Generative Adversarial Networks and Image Synthesis
- Image Retrieval and Classification Techniques
Institute for Language and Speech Processing
2023-2024
National Technical University of Athens
2023
Latent generative models have emerged as a leading approach for high-quality image synthesis. These rely on an autoencoder to compress images into a latent space, followed by a generative model that learns the latent distribution. We identify that existing autoencoders lack equivariance to semantic-preserving transformations like scaling and rotation, resulting in complex latent spaces that hinder generative performance. To address this, we propose EQ-VAE, a simple regularization approach that enforces equivariance in the latent space, reducing its complexity without degrading reconstruction...
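As a rough illustration only (the truncated abstract does not spell out the exact objective), an equivariance penalty in this spirit can be sketched in PyTorch: decoding a rotated latent should reproduce the identically rotated image. Here `encoder` and `decoder` are placeholder modules, and restricting to 90-degree multiples is a simplifying assumption to avoid interpolation artifacts.

```python
import torch
import torch.nn.functional as F
from torchvision.transforms.functional import rotate

def eq_penalty(encoder, decoder, x):
    # Encode, rotate the latent map, decode, and compare against the
    # identically rotated input image.
    z = encoder(x)                                  # (B, C, h, w) latent map
    angle = 90.0 * int(torch.randint(1, 4, (1,)))   # 90 / 180 / 270 degrees
    recon = decoder(rotate(z, angle))               # decode the rotated latent
    return F.mse_loss(recon, rotate(x, angle))      # equivariance violation
```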
In this work, we investigate the personalization of text-to-music diffusion models in a few-shot setting. Motivated by recent advances in the computer vision domain, we are the first to explore the combination of pre-trained text-to-audio diffusers with two established personalization methods. We experiment with the effect of audio-specific data augmentation on overall system performance and assess different training strategies. For evaluation, we construct a novel dataset of prompts and music clips. We consider both embedding-based and music-specific...
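The abstract does not name its two personalization methods, so as one hedged example of this family of techniques, here is a textual-inversion-style sketch: only a new concept embedding is optimized against the standard denoising objective while the pipeline stays frozen. All names (`unet`, `noise_sched`, `few_shot_loader`, `insert_token`) are hypothetical stand-ins.

```python
import torch
import torch.nn.functional as F

new_token = torch.nn.Parameter(torch.randn(768) * 0.01)  # learnable concept embedding
opt = torch.optim.AdamW([new_token], lr=5e-3)

for latents, prompt_embeds in few_shot_loader:            # a handful of target clips
    t = torch.randint(0, noise_sched.num_steps, (latents.size(0),))
    noise = torch.randn_like(latents)
    noisy = noise_sched.add_noise(latents, noise, t)      # forward diffusion
    cond = insert_token(prompt_embeds, new_token)         # splice new embedding into prompt
    loss = F.mse_loss(unet(noisy, t, cond), noise)        # standard denoising objective
    opt.zero_grad(); loss.backward(); opt.step()
```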
Modern speech recognition systems exhibit rapid performance degradation under domain shift. This issue is especially prevalent in data-scarce settings, such as low-resource languages, where the diversity of training data is limited. In this work, we propose M2DS2, a simple and sample-efficient finetuning strategy for large pretrained models, based on mixed source and target self-supervision. We find that including self-supervision stabilizes training and avoids mode collapse of the latent representations. For...
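Read literally, the abstract describes a finetuning objective that mixes a supervised loss on labeled source speech with self-supervised losses on both source and target audio. A minimal sketch of such a step follows, assuming hypothetical `ctc_loss` and `ssl_loss` wrappers around a wav2vec2-style pretrained model; the weights are illustrative.

```python
def m2ds2_step(model, src_batch, tgt_batch, alpha=0.5, beta=0.5):
    # Supervised loss on labeled source speech.
    sup = model.ctc_loss(src_batch.audio, src_batch.text)
    # Self-supervised losses on unlabeled source and target audio.
    ssl_src = model.ssl_loss(src_batch.audio)
    ssl_tgt = model.ssl_loss(tgt_batch.audio)
    return sup + alpha * ssl_src + beta * ssl_tgt
```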
Recent advances in Diffusion Models (DMs) have led to significant progress in visual synthesis and editing tasks, establishing them as a strong competitor to Generative Adversarial Networks (GANs). However, the latent space of DMs is not as well understood as that of GANs. Recent research has focused on unsupervised semantic discovery by leveraging the bottleneck layer of the denoising network, which has been shown to exhibit properties of a semantic latent space. However, these approaches are limited to discovering global attributes. In this paper, we address...
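For context, one common form of the unsupervised discovery the abstract refers to is taking principal components of the denoiser's bottleneck ("h-space") activations as candidate semantic directions. The sketch below illustrates that prior, global-attribute approach, not this paper's method; `get_hspace` is a hypothetical hook returning the U-Net bottleneck activation as a flat vector.

```python
import torch

feats = torch.stack([get_hspace(unet, x_t, t) for x_t, t in samples])  # (N, D)
feats = feats - feats.mean(dim=0)                 # center the activations
_, _, V = torch.linalg.svd(feats, full_matrices=False)
directions = V[:10]                               # top principal directions

x_t, t = samples[0]
h_edit = get_hspace(unet, x_t, t) + 2.0 * directions[0]  # shift along one direction
```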
The study of speech disorders can benefit greatly from time-aligned data. However, audio-text mismatches in disfluent speech cause rapid performance degradation for modern aligners, hindering the use of automatic approaches. In this work, we propose a simple and effective modification to the alignment graph construction of CTC-based models using Weighted Finite State Transducers. The proposed weakly-supervised approach alleviates the need for verbatim transcription of disfluencies for forced alignment. During the graph construction, we allow...
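To make the idea concrete, here is a toy alignment-graph construction in the spirit of the abstract: a linear CTC-style graph over the reference tokens, augmented with weighted detour arcs so that non-verbatim material (repetitions, fillers, unspoken words) need not be transcribed. The arc format and costs are illustrative, not the paper's actual WFST construction.

```python
def build_alignment_graph(tokens, blank=0, skip_cost=2.5):
    arcs = []                                      # (src_state, dst_state, label, weight)
    for i, tok in enumerate(tokens):
        arcs.append((i, i, blank, 0.0))            # emit blank, stay in place
        arcs.append((i, i, tok, 0.0))              # repeat the token (CTC self-loop)
        arcs.append((i, i + 1, tok, 0.0))          # advance to the next token
        arcs.append((i, i + 1, blank, skip_cost))  # penalized skip: token unspoken
    arcs.append((len(tokens), len(tokens), blank, 0.0))  # trailing blanks at final state
    return arcs
```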
In recent years, datasets of paired audio and captions have enabled remarkable success in automatically generating descriptions for audio clips, namely Automated Audio Captioning (AAC). However, it is labor-intensive and time-consuming to collect a sufficient number of audio-caption pairs. Motivated by the recent advances in Contrastive Language-Audio Pretraining (CLAP), we propose a weakly-supervised approach to train an AAC model assuming only text data and a pre-trained CLAP model, alleviating the need for paired target data. Our approach leverages...
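A hedged sketch of how such text-only training can work: because CLAP maps audio and text into a shared embedding space, a caption decoder can be trained to reconstruct a caption from the CLAP text embedding of that caption, then be conditioned on CLAP audio embeddings at test time. `clap`, `decoder`, and `tokenizer` are hypothetical components, and this may differ from the paper's exact pipeline.

```python
import torch
import torch.nn.functional as F

opt = torch.optim.AdamW(decoder.parameters(), lr=1e-4)
for captions in text_only_loader:                    # no audio during training
    with torch.no_grad():
        emb = clap.encode_text(captions)             # shared-space embedding
    tokens = tokenizer(captions)
    logits = decoder(prefix=emb, tokens=tokens.input_ids)   # (B, T, vocab)
    loss = F.cross_entropy(logits.transpose(1, 2), tokens.labels)
    opt.zero_grad(); loss.backward(); opt.step()

# Inference: swap in the audio side of the shared space.
caption = decoder.generate(prefix=clap.encode_audio(waveform))
```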