- Speech Recognition and Synthesis
- Music and Audio Processing
- Music Technology and Sound Studies
- Speech and Dialogue Systems
- Natural Language Processing Techniques
- Topic Modeling
- Multimodal Machine Learning Applications
- Phonetics and Phonology Research
- Domain Adaptation and Few-Shot Learning
- Stuttering Research and Treatment
- Speech and Audio Processing
- Neuroscience and Music Perception
- Text Readability and Simplification
- Text and Document Classification Technologies
- Hearing Loss and Rehabilitation
- Advanced Image and Video Retrieval Techniques
- Digital Media Forensic Detection
- Intelligent Tutoring Systems and Adaptive Learning
The University of Texas at Austin
2024
National Taiwan University
2022-2024
Attention-based Transformer models have been increasingly employed for automatic music generation. To condition the generation process of such a model with a user-specified sequence, a popular approach is to take that conditioning sequence as a priming sequence and ask a Transformer decoder to generate a continuation. However, this *prompt-based conditioning* cannot guarantee that the conditioning sequence would develop, or even simply repeat itself, in the generated continuation. In this paper, we propose an alternative approach, called theme-based conditioning, that explicitly trains the model to treat...
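To make the limitation concrete, the sketch below shows the prompt-based baseline the abstract contrasts against: the conditioning sequence only primes the decoder, and nothing in the sampling loop forces it to reappear later. The decoder interface and sampling loop are illustrative assumptions, not the paper's released code.

```python
# Minimal sketch of prompt-based conditioning, assuming an autoregressive
# Transformer decoder over discrete music tokens mapping [1, T] token ids
# to [1, T, vocab_size] logits. All names here are hypothetical.
import torch
import torch.nn as nn


@torch.no_grad()
def generate_continuation(model: nn.Module, prompt: torch.Tensor,
                          max_new_tokens: int = 256) -> torch.Tensor:
    """Prime the decoder with `prompt` (shape [1, T]) and sample a continuation.

    Nothing constrains the sampled tokens to restate the prompt, which is
    the gap theme-based conditioning is designed to close.
    """
    tokens = prompt.clone()
    for _ in range(max_new_tokens):
        logits = model(tokens)                                # [1, T, vocab]
        probs = torch.softmax(logits[:, -1, :], dim=-1)       # last position
        next_token = torch.multinomial(probs, num_samples=1)  # [1, 1]
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens[:, prompt.size(1):]    # return only the continuation
```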
Audio-visual representation learning aims to develop systems with human-like perception by utilizing the correlation between auditory and visual information. However, current models often focus on a limited set of tasks, and the generalization abilities of learned representations are unclear. To this end, we propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations on 7 datasets covering 5 audio-visual tasks in speech and audio processing. We evaluate recent...
Data-driven speech processing models usually perform well with a large amount of text supervision, but collecting transcribed data is costly. Therefore, we propose SpeechCLIP, a novel framework bridging speech and text through images to enhance speech models without transcriptions. We leverage state-of-the-art pre-trained HuBERT and CLIP, aligning them via paired images and spoken captions with minimal fine-tuning. SpeechCLIP outperforms prior work on image-speech retrieval and performs zero-shot speech-text retrieval without direct supervision from transcriptions. Moreover, SpeechCLIP can...
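The alignment step can be pictured as standard CLIP-style contrastive training over paired embeddings. Below is a minimal sketch of such a symmetric InfoNCE loss, assuming pooled speech embeddings (e.g., from a frozen HuBERT plus a small trainable head) and CLIP image embeddings of matching dimension; the function name and shapes are assumptions, not SpeechCLIP's actual code.

```python
# CLIP-style contrastive alignment between speech and image embeddings.
import torch
import torch.nn.functional as F


def clip_style_loss(speech_emb: torch.Tensor, image_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """speech_emb, image_emb: [batch, dim]; row i of each is a matched pair."""
    speech = F.normalize(speech_emb, dim=-1)
    image = F.normalize(image_emb, dim=-1)
    logits = speech @ image.t() / temperature   # [batch, batch] similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric InfoNCE: retrieve the paired image for each spoken caption,
    # and the paired caption for each image.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```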
This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval. For non-English image-speech retrieval, we outperform the current state-of-the-art by a wide margin both when training separate models for each language and with a single model which processes speech in all three languages. We identify key differences in model behavior between English and non-English settings, attributable to the English-only pre-training of CLIP and HuBERT, and investigate how fine-tuning the pre-trained models impacts these...
Self-supervised speech (SSL) models have recently become widely adopted for many downstream speech processing tasks. The general usage pattern is to employ SSL models as feature extractors, and then train a prediction head to solve a specific task. However, different layers of SSL models have been shown to capture different types of information, and the methods of combining them are not well studied. To this end, we extend the general framework of SSL model utilization by proposing the interface that connects the upstream and downstream. Under this view, the dominant technique of combining features...
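The dominant technique referenced here is the layerwise weighted sum popularized by SUPERB-style benchmarks, which under the paper's view is just one possible interface. A minimal sketch, assuming the upstream exposes per-layer hidden states as a list of tensors (the class name is hypothetical):

```python
# Layerwise weighted-sum interface: downstream features are a learned
# convex combination of all upstream layers.
import torch
import torch.nn as nn


class WeightedSumInterface(nn.Module):
    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states: list[torch.Tensor]) -> torch.Tensor:
        """hidden_states: list of [batch, time, dim] tensors, one per layer."""
        stacked = torch.stack(hidden_states, dim=0)   # [layers, B, T, D]
        w = torch.softmax(self.weights, dim=0)        # convex combination
        return (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)
```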
Recent advances in self-supervised speech models have shown significant improvement on many downstream tasks. However, these models have predominantly centered on frame-level training objectives, which can fall short on spoken language understanding tasks that require semantic comprehension. Existing works often rely on additional speech-text data as intermediate targets, which is costly in real-world settings. To address this challenge, we propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo...
The recently proposed visually grounded speech model SpeechCLIP is an innovative framework that bridges speech and text through images via CLIP without relying on transcription. On this basis, this paper introduces two extensions to SpeechCLIP. First, we apply the Continuous Integrate-and-Fire (CIF) module to replace a fixed number of CLS tokens in the cascaded architecture. Second, we propose a new hybrid architecture that merges the cascaded and parallel architectures into a multi-task learning framework. Our experimental evaluation...
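Continuous Integrate-and-Fire aggregates frame-level features into a variable number of segment embeddings by accumulating per-frame weights and "firing" at each threshold crossing. Below is a simplified single-utterance sketch under the usual assumptions (per-frame weights below the threshold, threshold of 1); it illustrates the mechanism, not SpeechCLIP+'s implementation.

```python
# Simplified Continuous Integrate-and-Fire (CIF) aggregation.
import torch


def cif(frames: torch.Tensor, alphas: torch.Tensor,
        threshold: float = 1.0) -> torch.Tensor:
    """frames: [T, D] frame features; alphas: [T] non-negative weights.

    Returns [N, D] segment embeddings, one per threshold crossing.
    """
    fired = []
    accum = 0.0
    weighted = torch.zeros_like(frames[0])
    for h_t, a_t in zip(frames, alphas):
        a = float(a_t)
        if accum + a < threshold:           # keep integrating this frame
            accum += a
            weighted = weighted + a * h_t
        else:                               # fire: split this frame's weight
            used = threshold - accum
            fired.append(weighted + used * h_t)
            accum = a - used                # leftover opens the next segment
            weighted = (a - used) * h_t
    return torch.stack(fired) if fired else frames.new_zeros((0, frames.size(1)))
```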
Clinical diagnosis of stuttering requires an assessment by a licensed speech-language pathologist. However, this process is time-consuming and requires clinicians with training and experience in fluency disorders. Unfortunately, only a small percentage of speech-language pathologists report being comfortable working with individuals who stutter, which is inadequate to accommodate the 80 million individuals who stutter worldwide. Developing machine learning models for detecting stuttered speech would enable universal, automated screening for stuttering,...
Audio-visual pre-trained models have gained substantial attention recently and demonstrated superior performance on various audio-visual tasks. This study investigates whether such models demonstrate non-arbitrary associations between sounds and visual representations (known as sound symbolism), which is also observed in humans. We developed a specialized dataset with synthesized images and audio samples and assessed these models using a non-parametric approach in a zero-shot setting. Our findings...
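A zero-shot probe of this kind reduces to embedding an audio sample and candidate images in the shared space and checking which image the audio is closer to. A minimal sketch, where the encoders producing the embeddings are assumed to come from some pre-trained audio-visual model (not the paper's specific setup):

```python
# Zero-shot audio-to-image matching via cosine similarity.
import torch
import torch.nn.functional as F


def zero_shot_match(audio_emb: torch.Tensor, image_embs: torch.Tensor) -> int:
    """audio_emb: [dim]; image_embs: [num_candidates, dim].

    Returns the index of the most similar image. Aggregated over a dataset,
    the match rate against the sound-symbolically congruent image can then
    be compared to chance with a non-parametric test.
    """
    sims = F.cosine_similarity(audio_emb.unsqueeze(0), image_embs, dim=-1)
    return int(sims.argmax())
```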
Multimodal foundation models, such as Gemini and ChatGPT, have revolutionized human-machine interactions by seamlessly integrating various forms of data. Developing a universal spoken language model that comprehends a wide range of natural language instructions is critical for bridging communication gaps and facilitating more intuitive interactions. However, the absence of a comprehensive evaluation benchmark poses a significant challenge. We present Dynamic-SUPERB Phase-2, an open and evolving instruction-based speech...