- Speech Recognition and Synthesis
- Natural Language Processing Techniques
- Speech and dialogue systems
- Topic Modeling
- Phonetics and Phonology Research
- Speech and Audio Processing
- Music and Audio Processing
- Deception detection and forensic psychology
- Sentiment Analysis and Opinion Mining
- User Authentication and Security Systems
- Evolutionary Algorithms and Applications
- Text and Document Classification Technologies
- Linguistic Variation and Morphology
- Language, Metaphor, and Cognition
- Metaheuristic Optimization Algorithms Research
- Advanced Text Analysis Techniques
- Digital Communication and Language
- Video Analysis and Summarization
- Cardiac, Anesthesia and Surgical Outcomes
- Humor Studies and Applications
- Meta-analysis and systematic reviews
- Algorithms and Data Compression
- Advanced Multi-Objective Optimization Algorithms
- Complex Network Analysis Techniques
- Handwritten Text Recognition Techniques
Google (United States)
2019-2025
NYU Langone Health
2024
IT University of Copenhagen
2023
Tokyo Institute of Technology
2023
Administration for Community Living
2023
American Jewish Committee
2023
University of Michigan
2004-2019
New York University
2019
The Graduate Center, CUNY
2011-2018
IBM (United States)
2016-2018
We present a multispeaker, multilingual text-to-speech (TTS) synthesis model based on Tacotron that is able to produce high quality speech in multiple languages. Moreover, the model is able to transfer voices across languages, e.g. synthesize fluent Spanish speech using an English speaker's voice, without training on any bilingual or parallel examples. Such transfer works across distantly related languages, e.g. English and Mandarin. Critical to achieving this result are: 1. using a phonemic input representation to encourage sharing of model capacity across languages, and 2. incorporating...
This paper describes the AuToBI tool for automatic generation of hypothesized ToBI labels. While research on automatic prosodic annotation has been conducted for many years, AuToBI represents the first publicly available tool to automatically detect and classify the breaks and tones that make up the ToBI standard. This paper describes the feature extraction routines as well as the classifiers used to detect and classify the prosodic events of the ToBI standard. Additionally, we report performance evaluating AuToBI models trained on the Boston Directions Corpus on the Columbia Games Corpus. By evaluating on distinct speakers, domains and recording conditions, this...
Recent success of the Tacotron speech synthesis architecture and its variants in producing natural sounding multi-speaker synthesized speech has raised the exciting possibility of replacing expensive, manually transcribed, domain-specific, human speech that is used to train speech recognizers. The multi-speaker speech synthesis architecture can learn latent embedding spaces of prosody, speaker and style variations derived from input acoustic representations, thereby allowing for manipulation of the synthesized speech. In this paper, we evaluate the feasibility of enhancing speech recognition...
End-to-end (E2E) systems have achieved competitive results compared to conventional hybrid hidden Markov model (HMM)-deep neural network based automatic speech recognition (ASR) systems. Such E2E systems are attractive due to the lack of dependence on alignments between input acoustic and output grapheme or HMM state sequence during training. This paper explores the design of an ASR-free end-to-end system for text query-based keyword search (KWS) from speech trained with minimal supervision. Our E2E KWS system consists of three...
Recent neural text-to-speech (TTS) models with fine-grained latent features enable precise control of the prosody of synthesized speech. Such models typically incorporate a fine-grained variational autoencoder (VAE) structure, extracting latent features at each input token (e.g., phonemes). However, generating samples with the standard VAE prior often results in unnatural and discontinuous speech, with dramatic prosodic variation between tokens. This paper proposes a sequential prior in a discrete latent space which can generate more naturally sounding samples....
The performance of automatic speech recognition systems degrades with increasing mismatch between the training and testing scenarios. Differences in speaker accents are a significant source of such mismatch. The traditional approach to deal with multiple accents involves pooling data from several accents during training and building a single model in multi-task fashion, where tasks correspond to individual accents. In this paper, we explore an alternate model where we jointly learn an accent classifier and a multi-task acoustic model. Experiments on American English Wall...
Speech synthesis has advanced to the point of being close to indistinguishable from human speech. However, efforts to train speech recognition systems on synthesized utterances have not been able to show that synthesized data can be effectively used to augment or replace human speech. In this work, we demonstrate that promoting consistent predictions in response to real and synthesized speech enables significantly improved speech recognition performance. We also find that training on 460 hours of LibriSpeech augmented with 500 hours of transcripts (without audio) performance is within 0.2% WER...
Charisma, the ability to command authority on the basis of personal qualities, is more difficult to define than to identify. How do charismatic leaders such as Fidel Castro or Pope John Paul II attract and retain their followers? We present results of an analysis of subjective ratings of charisma from a corpus of American political speech. We identify the associations between charisma ratings and ratings of other personal attributes. We also examine acoustic/prosodic and lexical features of this speech and correlate these with charisma ratings.
Promoted in part by its use in the Interspeech Challenges 2009-2012, Average Recall has emerged as an attractive evaluation measure of classifier performance where the data has a skewed class distribution. In this paper, we show that importance weighting can be used to optimize Average Recall directly. We compare this approach to sampling techniques that have been previously used to classify skewed data. We demonstrate this approach on the Interspeech 2009 Emotion Challenge tasks, and prosodic analysis tasks.
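As a rough illustration of the measure named in the abstract above, the sketch below computes Average Recall (the unweighted mean of per-class recall) and shows one common form of importance weighting, with weights inversely proportional to class frequency. The weighting scheme, the function names (`average_recall`, `importance_weights`, `weighted_accuracy`), and the toy labels are assumptions made for illustration; the paper's own formulation may differ.

```python
from collections import Counter


def average_recall(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in y_true."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


def importance_weights(y_true):
    """Assumed scheme: weight each example by n / (n_classes * class_count)."""
    counts = Counter(y_true)
    n = len(y_true)
    k = len(counts)
    return [n / (k * counts[t]) for t in y_true]


def weighted_accuracy(y_true, y_pred, weights):
    """Accuracy in which each example counts in proportion to its weight."""
    correct = sum(w for t, p, w in zip(y_true, y_pred, weights) if t == p)
    return correct / sum(weights)


# Toy skewed data (hypothetical): a classifier that always predicts the majority class.
y_true = ["neg"] * 8 + ["pos"] * 2
y_pred = ["neg"] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
ar = average_recall(y_true, y_pred)
wa = weighted_accuracy(y_true, y_pred, importance_weights(y_true))
print(plain_accuracy, ar, wa)
```

On this toy data, plain accuracy rewards the majority-class predictor, while Average Recall does not. With these particular weights, weighted accuracy coincides with Average Recall, which is one way to see why training with such weights targets that measure.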
In recent years, so-called "end-to-end" speech recognition systems have emerged as viable alternatives to traditional ASR frameworks. Keyword search, localizing an orthographic query in a speech corpus, is typically performed by using automatic speech recognition (ASR) to generate an index. Previous work has evaluated the use of end-to-end systems for ASR on well known corpora (WSJ, Switchboard, TIMIT, etc.) in high-resource languages like English and Mandarin. In this work, we investigate the use of Connectionist Temporal Classification (CTC)...
Detecting deception from different dimensions of human behavior has been a major goal of research in psychology and computational linguistics for some years and is currently of considerable interest to military and law enforcement agencies. However, relatively little work has been done to develop automatic methods to detect deception from spoken language or to compare deception detection and production between cultures. We present results of experiments on a new corpus of deceptive and non-deceptive speech, collected from native speakers of Standard American English...
This paper investigates the effectiveness of knowledge distillation in the context of multilingual models. We show that with knowledge distillation, Long Short-Term Memory (LSTM) models can be used to train standard feed-forward Deep Neural Network (DNN) models for a variety of low-resource languages. We then examine how the agreement between the teacher's best labels and the original labels affects the student model's performance. Next, we show that knowledge distillation can be easily applied to semi-supervised learning to improve model performance. We also propose a promising data selection method to filter...
Collecting high-quality studio recordings of audio is challenging, which limits the language coverage of text-to-speech (TTS) systems. This paper proposes a framework for scaling a multilingual TTS model to 100+ languages using found data without supervision. The proposed framework combines speech-text encoder pretraining with unsupervised training using untranscribed speech and unspoken text data sources, thereby leveraging massively multilingual joint speech and text representation learning. Without any transcribed speech in a new language, this TTS model can...
External anatomic landmarks have traditionally been used to approximate the location of neck blood vessels to optimize central venous cannulation of the internal jugular vein (IJV) while avoiding the common carotid artery (CCA). Head rotation affects vessel orientation, but most landmark techniques do not specify its optimal degree. We simulated catheter insertion via both an anterior and central approach to the right IJV using an ultrasound probe held in the manner of a syringe and needle in 49 volunteers. Increased head rotation from...
Progress in both speech and language processing has spurred efforts to support applications that rely on spoken rather than written language input. A key challenge in moving from text-based documents to such spoken documents is that spoken language lacks explicit punctuation and formatting, which can be crucial for good performance. This article describes different levels of speech segmentation, approaches to automatically recovering segment boundary locations, and experimental results demonstrating impact on several language processing tasks. The results also show a need for optimizing...
The automatic identification of prosodic events such as pitch accent in English has long been a topic of interest to speech researchers, with applications to a variety of spoken language processing tasks. However, much remains to be understood about the best methods for obtaining high accuracy detection of such events. We describe experiments examining the optimal domain for accent analysis. Specifically, we compare pitch accent identification at the syllable, vowel or word level as domains for analysis of acoustic indicators of accent. Our results indicate that a word-based...
Article citation counts are catalogued and help identify landmark papers. This study provides the citation classics of the anesthesiology literature, using the framework of subspecialties to provide a review of well-developed areas of research in anesthesiology. A comprehensive list of the most-cited articles in anesthesia was compiled using a bibliometric database and general search terms such as "anesthesia" as well as subspecialty-specific terms. Queries were reviewed for relevance to anesthesia practice, categorized by subspecialty, and ranked according...