- Speech Recognition and Synthesis
- Music and Audio Processing
- Speech and Audio Processing
- Natural Language Processing Techniques
- Topic Modeling
- Speech and dialogue systems
- Multimodal Machine Learning Applications
- Algorithms and Data Compression
- Phonetics and Phonology Research
- Video Surveillance and Tracking Methods
- Human Pose and Action Recognition
- Face recognition and analysis
- Digital Games and Media
- Electronic Health Records Systems
- Voice and Speech Disorders
- Participatory Visual Research Methods
- Hate Speech and Cyberbullying Detection
- Health Literacy and Information Accessibility
- Spectroscopy and Chemometric Analyses
- Industrial Vision Systems and Defect Detection
- Domain Adaptation and Few-Shot Learning
- Web Data Mining and Analysis
- Blind Source Separation Techniques
- Advanced Data Compression Techniques
- Molecular Spectroscopy and Structure
Menlo School
2024
META Health
2022-2024
University of Illinois Urbana-Champaign
2018-2021
Meta (United States)
2021
Mitsubishi Electric (United States)
2020
Boğaziçi University
2002-2016
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations in 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards, with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume...
Large-scale generative models such as GPT and DALL-E have revolutionized the research community. These models not only generate high-fidelity outputs, but are also generalists which can solve tasks not explicitly taught. In contrast, speech generative models are still primitive in terms of scale and task generalization. In this paper, we present Voicebox, the most versatile text-guided generative model for speech at scale. Voicebox is a non-autoregressive flow-matching model trained to infill speech, given audio context and text, on over 50K hours of speech that are filtered...
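The flow-matching objective behind such non-autoregressive models can be illustrated on toy data. This is a minimal sketch of the training target only (linear interpolation path and its velocity field), not Voicebox's actual architecture or data; all names and shapes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

x0 = rng.normal(size=(1000, 2))           # noise samples
x1 = rng.normal(loc=3.0, size=(1000, 2))  # "data" samples (stand-in for speech features)
t = rng.uniform(size=(1000, 1))           # random interpolation times in [0, 1]

# Linear interpolation path and its target velocity:
# x_t = (1 - t) * x0 + t * x1, u_t = x1 - x0
x_t = (1.0 - t) * x0 + t * x1
u_t = x1 - x0

# A network v(x_t, t) would be regressed onto u_t with an L2 loss;
# here we just evaluate that loss for a trivial constant predictor.
v_pred = np.full_like(u_t, 3.0)  # guess the mean displacement
fm_loss = np.mean((v_pred - u_t) ** 2)
print(x_t.shape, fm_loss >= 0.0)
```

At training time the model sees only `(x_t, t)` and the conditioning (audio context and text, in Voicebox's case) and must predict `u_t`; sampling then integrates the learned velocity field from noise to speech features.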
Although speaker verification has conventionally been an audio-only task, some practical applications provide both audio and visual streams of input. In these cases, the visual stream provides complementary information and can often be leveraged in conjunction with the acoustics of speech to improve verification performance. In this study, we explore audio-visual approaches to verification, starting with standard fusion techniques to learn joint audio-visual (AV) embeddings, and then propose a novel approach to handle cross-modal verification at test time....
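The simplest fusion baseline mentioned above can be sketched as concatenating per-modality embeddings into a joint AV embedding and scoring by cosine similarity. This is a generic illustration under assumed shapes, not the paper's specific fusion network.

```python
import numpy as np

def l2norm(v):
    return v / np.linalg.norm(v)

def fuse(audio_emb, visual_emb):
    # Concatenation fusion into a single joint AV embedding
    return l2norm(np.concatenate([l2norm(audio_emb), l2norm(visual_emb)]))

def score(e1, e2):
    return float(np.dot(e1, e2))  # cosine similarity of unit vectors

rng = np.random.default_rng(1)
a1, v1 = rng.normal(size=64), rng.normal(size=64)
enroll = fuse(a1, v1)

# Same identity with small perturbations vs. a different identity
test_same = fuse(a1 + 0.1 * rng.normal(size=64), v1 + 0.1 * rng.normal(size=64))
test_diff = fuse(rng.normal(size=64), rng.normal(size=64))
print(score(enroll, test_same) > score(enroll, test_diff))  # True
```

A verification decision then thresholds the score; the cross-modal setting the abstract alludes to (e.g. enrolling on one modality and testing on another) requires aligning the two embedding spaces rather than simple concatenation.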
A singing voice conversion model converts a song in the voice of an arbitrary source singer to the voice of a target singer. Recently, methods that leverage self-supervised audio representations such as HuBERT and Wav2Vec 2.0 have helped further the state-of-the-art. Though these methods produce more natural and melodic outputs, they often rely on confusion and disentanglement losses to render their representations speaker- and pitch-invariant. In this paper, we circumvent such training and propose a new approach that leverages ASR fine-tuned representations as inputs to a HiFi-GAN neural vocoder for...
We propose an unsupervised speaker adaptation method inspired by the neural Turing machine for end-to-end (E2E) automatic speech recognition (ASR). The proposed model contains a memory block that holds speaker i-vectors extracted from the training data and reads relevant i-vectors through an attention mechanism. The resulting memory vector (M-vector) is concatenated to the acoustic features or to the hidden layer activations of the E2E network model. The ASR system based on joint connectionist temporal classification and attention-based...
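The memory read described above can be sketched as a softmax attention over a bank of stored i-vectors, producing an M-vector that is concatenated to the input features. This is a minimal sketch assuming a dot-product attention key derived from the acoustics; dimensions and the query construction are illustrative, not the paper's exact design.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def m_vector(query, memory):
    """Attention read over a memory of training-speaker i-vectors.

    query:  (d,) acoustic summary used as the attention key (assumption)
    memory: (n, d) i-vectors extracted from the training data
    """
    weights = softmax(memory @ query)  # relevance of each stored i-vector
    return weights @ memory            # convex combination = M-vector

rng = np.random.default_rng(0)
memory = rng.normal(size=(10, 8))    # 10 training-speaker i-vectors, dim 8
frame_summary = rng.normal(size=8)   # hypothetical per-utterance acoustic summary

m = m_vector(frame_summary, memory)
adapted_input = np.concatenate([frame_summary, m])  # concatenate M-vector to features
print(adapted_input.shape)  # (16,)
```

Because the read is driven purely by the test audio, no enrollment data or speaker labels are needed at test time, which is what makes the adaptation unsupervised.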
End-to-end spoken language understanding (SLU) systems are typically trained on large amounts of data. In many practical scenarios, the amount of labeled speech is often limited, as opposed to text. In this study, we investigate the use of non-parallel speech and text to improve the performance of dialog act recognition as an example SLU task. We propose a multiview architecture that can handle each modality separately. To effectively train on such data, the model enforces the internal encodings to be similar using a shared classifier. On...
The problem of machine learning systems demonstrating bias towards specific groups of individuals has been studied extensively, particularly in the Facial Recognition area, but much less so in Automatic Speech Recognition (ASR). This paper presents initial results on "Casual Conversations" – a publicly released 846-hour corpus designed to help researchers evaluate their computer vision and audio models for accuracy across a diverse set of metadata, including age, gender, and skin tone. The corpus is entirely manually transcribed,...
MP2/6-31G**//MP2/6-31G**, PMP2/6-31G**//MP2/6-31G**, MP4/6-311G(3df,2p)//MP2/6-31G**, PMP4/6-311G(3df,2p)//MP2/6-31G** and CCSD(T)/6-311++G**//MP2/6-31G** calculations have been used to investigate the H-abstraction reaction from CH3OCH3 (DME), whereas the MP2/6-31G**//MP2/6-31G** and PMP2/6-31G**//MP2/6-31G** levels were used to model that from (CH3)3COCH3 (MTBE), by ˙OH. The methodology has proved to be adequate to reproduce the experimental geometrical parameters for the reactants and the C–H bond energies. The rate constants for DME, calculated...
In this work, we investigate pre-training of neural network based speaker embeddings for low-latency speaker change detection. Our proposed system takes two speech segments, generates embeddings using shared Siamese layers, and then classifies the concatenated embeddings depending on whether they are spoken by the same speaker. We investigate gender classification, contrastive loss and triplet loss based embedding pre-training, as well as joint training along with a same/different classifier. Training is performed on 2-second single-speaker segments using ground truth segmentation...
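The Siamese same/different setup can be sketched as passing both segments through the same embedding weights and thresholding the embedding distance. This is a toy stand-in (random linear layer, distance threshold instead of a trained classifier) under assumed feature dimensions, not the paper's trained system.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(40, 16))  # shared (Siamese) embedding weights

def embed(segment):
    # Both segments pass through the SAME layer -> shared Siamese embedding;
    # mean-pooling over frames gives a fixed-size segment embedding.
    return np.tanh(segment @ W).mean(axis=0)

def same_speaker(seg_a, seg_b, threshold=0.5):
    # Distance-based same/different decision (stand-in for the classifier)
    return np.linalg.norm(embed(seg_a) - embed(seg_b)) < threshold

# Synthetic 2-second segments: 200 frames of 40-dim features per speaker
spk1 = rng.normal(loc=1.0, size=(200, 40))
spk1b = rng.normal(loc=1.0, size=(200, 40))   # second segment, same speaker
spk2 = rng.normal(loc=-1.0, size=(200, 40))   # segment from a different speaker
print(same_speaker(spk1, spk1b), same_speaker(spk1, spk2))
```

For low-latency change detection the classifier is slid over adjacent short windows, flagging a change whenever consecutive windows are judged "different".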
Purpose: The Speech Accessibility Project (SAP) intends to facilitate research and development in automatic speech recognition (ASR) and other machine learning tasks for people with disabilities. The purpose of this article is to introduce the project as a resource for researchers, including a baseline analysis of the first released data package. Method: The project aims to improve ASR by collecting, curating, and distributing transcribed U.S. English speech from people with speech and/or language disabilities. Participants record from their place of residence by connecting their personal...
In this paper, we address the problem of defect detection in textile images, and present a novel hybrid method in which independent vector analysis, a statistical method, is combined with the wavelet transform, a spectral method. Independent vector analysis, a generalization of independent component analysis, uses vectorized signals and thus enables exploiting multiple datasets and offers a fully multivariate analysis. In this study, the datasets are generated by transforming texture image blocks of predetermined size; consequently, the sub-bands provide dependent sources which...
Speaker adaptation and speaker change detection have both been studied extensively to improve automatic speech recognition (ASR). In many cases, these two problems are investigated separately: speaker change detection is implemented first to obtain single-speaker regions, and adaptation is then performed using the derived segments for improved ASR. However, in an online setting, we want to achieve both goals in a single pass. In this study, we propose a neural network architecture that learns a speaker embedding from which it can perform both ASR and speaker change detection. The proposed...
Concerns have been raised regarding performance disparity in automatic speech recognition (ASR) systems as they provide unequal transcription accuracy for different user groups defined by attributes that include gender, dialect, and race. In this paper, we propose the "equal accuracy ratio", a novel inclusiveness measure for ASR which can be seamlessly integrated into the standard connectionist temporal classification (CTC) training pipeline of an end-to-end neural recognizer to increase the recognizer's...
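A measure of accuracy parity across groups can be sketched as the ratio between the worst and best per-group accuracies, so that 1.0 means perfectly equal treatment. The min/max formulation below is an illustrative reading of such a measure, not necessarily the paper's exact definition, and the group statistics are fabricated for the example.

```python
def group_accuracy(errors, total):
    # Word-level accuracy = 1 - WER for one user group
    return 1.0 - errors / total

def equal_accuracy_ratio(group_stats):
    """Min/max ratio of per-group accuracies; 1.0 = perfectly equal.

    group_stats maps group name -> (word errors, total words).
    """
    accs = [group_accuracy(e, n) for e, n in group_stats.values()]
    return min(accs) / max(accs)

stats = {
    "group_a": (120, 1000),  # 12% WER -> accuracy 0.88
    "group_b": (200, 1000),  # 20% WER -> accuracy 0.80
}
print(round(equal_accuracy_ratio(stats), 4))  # 0.9091
```

Used as a training-time regularizer alongside the CTC loss, such a ratio pushes the model to close the gap between the best- and worst-served groups rather than only minimizing average error.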
The awareness of biased ASR datasets or models has increased notably in recent years. Even for English, despite a vast amount of available training data, systems perform worse for non-native speakers. In this work, we improve an accent-conversion model (ACM) which transforms native US-English speech into accented pronunciation. We include phonetic knowledge in the ACM training to provide accurate feedback about how well certain pronunciation patterns were recovered in the synthesized waveform. Furthermore,...
In this work, a template-based search approach is adopted for the Keyword Search (KWS) problem on two low-resource languages (Turkish and Swahili). For these languages, the use of Large Vocabulary Continuous Speech Recognition (LVCSR) systems in KWS tasks may perform poorly, especially for out-of-vocabulary words. In the proposed method, keywords are artificially modeled to be in the same form as the audio document, and the performance of the baseline system is improved.
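Template-based keyword search classically rests on dynamic time warping (DTW): a keyword template is aligned against audio-document features, and low alignment cost signals a hit. The sketch below shows plain DTW on synthetic feature sequences; it is a generic illustration of the template-matching backbone, not the paper's full pipeline.

```python
import numpy as np

def dtw_cost(query, doc):
    """DTW distance between a keyword template and document features.

    query: (n, d) template feature frames; doc: (m, d) document frames.
    """
    n, m = len(query), len(doc)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(query[i - 1] - doc[j - 1])  # local frame distance
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

rng = np.random.default_rng(0)
template = rng.normal(size=(20, 13))     # synthetic keyword template (e.g. MFCCs)
match = np.repeat(template, 2, axis=0)   # time-stretched occurrence of the keyword
non_match = rng.normal(size=(40, 13))    # unrelated audio of the same length
print(dtw_cost(template, match) < dtw_cost(template, non_match))  # True
```

Because alignment tolerates time stretching, DTW-based search does not need an LVCSR vocabulary, which is what makes it attractive for out-of-vocabulary keywords in low-resource languages.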
Kiran Ramnath, Leda Sari, Mark Hasegawa-Johnson, Chang Yoo. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021.
Neural network pruning offers an effective method for compressing a multilingual automatic speech recognition (ASR) model with minimal performance loss. However, it entails several rounds of pruning and re-training that need to be run for each language. In this work, we propose the use of an adaptive masking approach in two scenarios for pruning a multilingual ASR model efficiently, resulting in sparse monolingual models or a sparse multilingual model (named as Dynamic ASR Pathways). Our approach dynamically adapts the sub-network, avoiding premature decisions about a fixed sub-network...
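The per-language masking idea can be sketched with simple magnitude pruning: each language keeps its own binary mask over a shared weight matrix, and re-estimating the mask during training (rather than fixing it up front) is the "dynamic" part. This is a minimal illustration of mask construction only; layer shape, sparsity level, and the single-layer setting are assumptions, not the paper's configuration.

```python
import numpy as np

def magnitude_mask(weights, sparsity):
    """Binary mask keeping the largest-magnitude weights (one 'pathway')."""
    k = int(weights.size * sparsity)               # number of weights to drop
    threshold = np.sort(np.abs(weights).ravel())[k]
    return (np.abs(weights) >= threshold).astype(weights.dtype)

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))  # a shared multilingual layer (illustrative)

# Language-adaptive masking: each language gets its own sub-network of W.
mask_en = magnitude_mask(W, sparsity=0.7)
W_en = W * mask_en             # sparse monolingual pathway
print(round(1.0 - mask_en.mean(), 2))  # 0.7 -> achieved sparsity
```

In a multilingual model, pathways for different languages can overlap in the surviving weights, so shared structure is kept while each language's unused capacity is pruned away.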
As speech becomes an increasingly common modality for interacting with large language models (LLMs), it is becoming desirable to develop systems where LLMs can take into account users' emotions or speaking styles when providing their responses. In this work, we study the potential of an LLM to understand these aspects of speech without fine-tuning its weights. To do this, we utilize an end-to-end system with a speech encoder; the encoder is trained to produce token embeddings such that the LLM's response to an expressive speech prompt is aligned...