- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- Natural Language Processing Techniques
- Topic Modeling
- Speech and dialogue systems
- Phonetics and Phonology Research
- Music Technology and Sound Studies
- Autism Spectrum Disorder Research
- Text Readability and Simplification
- Voice and Speech Disorders
- Neurobiology of Language and Bilingualism
- Semantic Web and Ontologies
- Language Development and Disorders
- Neural and Behavioral Psychology Studies
- Second Language Acquisition and Learning
- Linguistic Variation and Morphology
- Lexicography and Language Studies
- Reading and Literacy Development
- Authorship Attribution and Profiling
- Virology and Viral Diseases
- Language, Metaphor, and Cognition
- Action Observation and Synchronization
- Sentiment Analysis and Opinion Mining
- Legal Education and Practice Innovations
University of Pennsylvania
2012-2022
Pennsylvania Academic Library Consortium
2013-2022
Authorised Association Consortium
2020
University of Cambridge
2006
Following the success of the 1st, 2nd, 3rd, 4th, and 5th CHiME challenges, we organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6). The new challenge revisits the previous CHiME-5 challenge and further considers the problem of distant multi-microphone conversational speech diarization and recognition in everyday home environments. Speech material is the same as the CHiME-5 recordings except for accurate array synchronization. It was elicited using a dinner party scenario with efforts taken to capture data that is representative of natural conversational speech....
This paper introduces the second DIHARD challenge, the second in a series of speaker diarization challenges intended to improve the robustness of diarization systems to variation in recording equipment, noise conditions, and conversational domain. The challenge comprises four tracks evaluating diarization performance under two input conditions (single channel vs. multi-channel) and two segmentation conditions (diarization from a reference speech segmentation vs. diarization from scratch). In order to prevent participants from overtuning to a particular combination of conditions and domain, recordings are drawn from a variety...
Zhiyi Song, Ann Bies, Stephanie Strassel, Tom Riese, Justin Mott, Joe Ellis, Jonathan Wright, Seth Kulick, Neville Ryant, Xiaoyi Ma. Proceedings of the 3rd Workshop on EVENTS: Definition, Detection, Coreference, and Representation. 2015.
Speech activity detection (SAD) is an important first step in speech processing. Commonly used methods (e.g., frame-level classification using Gaussian mixture models (GMMs)) work well under stationary noise conditions, but do not generalize to domains such as YouTube, where videos may exhibit a diverse range of environmental conditions. One solution is to augment the conventional cepstral features with additional, hand-engineered features (e.g., spectral flux, spectral centroid, multiband entropies) which are robust...
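The frame-level GMM classification described in this abstract can be illustrated with a minimal sketch. Everything below is illustrative rather than the paper's pipeline: the features are synthetic 1-D log-energy values, and each class is modeled by a single Gaussian (a one-component "GMM") compared by per-frame log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic frame-level log-energy features (toy stand-in for cepstral
# features): nonspeech frames cluster low, speech frames cluster high.
nonspeech = rng.normal(loc=-2.0, scale=1.0, size=500)
speech = rng.normal(loc=2.0, scale=1.0, size=500)

def fit_gaussian(x):
    # One-component "GMM": just a per-class mean and variance.
    return x.mean(), x.var()

def log_likelihood(x, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

mu_s, var_s = fit_gaussian(speech)
mu_n, var_n = fit_gaussian(nonspeech)

def detect(frames):
    # A frame is labeled speech when the speech model's likelihood wins.
    return log_likelihood(frames, mu_s, var_s) > log_likelihood(frames, mu_n, var_n)

test_frames = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])
predicted = detect(test_frames)
truth = np.repeat([False, True], 100)
accuracy = float(np.mean(predicted == truth))
```

Under stationary conditions this likelihood-ratio rule separates the two clusters almost perfectly; the abstract's point is that such models degrade once the noise statistics drift away from the training conditions.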
DIHARD III was the third in a series of speaker diarization challenges intended to improve the robustness of diarization systems to variability in recording equipment, noise conditions, and conversational domain. Speaker diarization was evaluated under two speech activity conditions (diarization from a reference segmentation vs. from scratch) on 11 diverse domains. The domains span a range of interaction types, including read audio-books, meeting speech, clinical interviews, web videos, and, for the first time, telephone speech. A total of 30 organizations (forming 21...
This study attempts to improve automatic phonetic segmentation within the HMM framework. Experiments were conducted to investigate the use of phone boundary models, the use of precise phonetic segmentation for training HMMs, and the difference between context-dependent and context-independent models in terms of forced alignment performance. Results show that the combination of special one-state phone boundary models and monophone HMMs can significantly improve segmentation accuracy. HMM-based systems can also benefit from using precisely segmented data for training HMMs. Context-dependent models are not better than context-independent models when...
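The forced alignment underlying these experiments, constraining decoding to a known phone sequence and letting dynamic programming choose the boundary times, can be sketched with a toy Viterbi pass. The per-frame log-likelihoods below are hand-made, not produced by trained HMMs:

```python
import numpy as np

# Toy forced alignment: given the known phone sequence and per-frame
# log-likelihoods for each phone, find the monotonic left-to-right
# segmentation (i.e., the phone boundaries) maximizing total likelihood.
def force_align(loglik):
    # loglik[p, t]: log-likelihood of frame t under phone p (phones in order).
    P, T = loglik.shape
    score = np.full((P, T), -np.inf)
    back = np.zeros((P, T), dtype=int)  # 0 = stay in phone, 1 = advance
    score[0, 0] = loglik[0, 0]
    for t in range(1, T):
        for p in range(P):
            stay = score[p, t - 1]
            enter = score[p - 1, t - 1] if p > 0 else -np.inf
            if enter > stay:
                score[p, t], back[p, t] = enter + loglik[p, t], 1
            else:
                score[p, t], back[p, t] = stay + loglik[p, t], 0
    # Trace back from the final phone to recover each frame's phone.
    path = np.zeros(T, dtype=int)
    p = P - 1
    for t in range(T - 1, -1, -1):
        path[t] = p
        p -= back[p, t]
    return path

# Three phones over nine frames; each phone fits its own three frames best.
ll = np.full((3, 9), -5.0)
ll[0, 0:3] = ll[1, 3:6] = ll[2, 6:9] = -1.0
path = force_align(ll)  # -> [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Real alignment systems score frames with HMM state likelihoods and allow per-phone state topologies; the boundary models studied in the paper add dedicated one-state models at phone transitions on top of this same dynamic-programming skeleton.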
Current speech encoding pipelines often rely on separate processing of text and audio, not fully leveraging the inherent overlap between these modalities for understanding human communication. Language models excel at capturing semantic meaning from text that can complement the additional prosodic, emotional, and acoustic cues in speech. This work bridges the gap by proposing WhiSPA (Whisper with Semantic-Psychological Alignment), a novel audio encoder trained with a contrastive student-teacher learning objective. Using...
Accurate phone-level segmentation of speech remains an important task for many subfields of speech research. We investigate techniques for boosting the accuracy of automatic phonetic segmentation based on HMM acoustic-phonetic models. In prior work [25] we were able to improve on state-of-the-art alignment accuracy by employing special phone boundary models, trained on phonetically segmented training data, in conjunction with a simple boundary-time correction model. Here we present further improved results using more powerful statistical...
Julia Parish-Morris, Mark Liberman, Neville Ryant, Christopher Cieri, Leila Bateman, Emily Ferguson, Robert Schultz. Proceedings of the Third Workshop on Computational Linguistics and Clinical Psychology. 2016.
Pre-trained acoustic representations such as wav2vec and DeCoAR have attained impressive word error rates (WER) on speech recognition benchmarks, particularly when labeled data is limited. But little is known about what phonetic properties these various representations acquire, or how well they encode transferable features of speech. We compare representations from two conventional and four pre-trained systems on some simple frame-level classification tasks, with classifiers trained on one version of the TIMIT dataset and tested on another....
A deep neural network (DNN) based classifier achieved 27.38% frame error rate (FER) and 15.62% segment error rate (SER) in recognizing five tonal categories of Mandarin Chinese broadcast news, using 40 mel-frequency cepstral coefficients (MFCCs). The same architecture scored substantially lower when trained and tested with F0 and amplitude parameters alone: 40.05% FER and 22.66% SER. These results are better than the...
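The two error metrics reported here differ in granularity: FER counts individual misclassified frames, while SER scores each tonal segment as a whole, conventionally by the majority vote of its frames. A small sketch with hypothetical tone labels (the labels and segment boundaries below are made up for illustration):

```python
import numpy as np

def frame_error_rate(ref, hyp):
    # Fraction of frames whose predicted tone label differs from the reference.
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    return float(np.mean(ref != hyp))

def segment_error_rate(ref, hyp, segments):
    # Each segment gets the majority-vote label over its frames; the segment
    # is wrong when that vote disagrees with its (uniform) reference label.
    errors = 0
    for start, end in segments:
        votes = np.bincount(hyp[start:end])
        if votes.argmax() != ref[start]:
            errors += 1
    return errors / len(segments)

# Two 4-frame tone segments: reference tones 1 then 3 (toy labels).
ref = np.array([1, 1, 1, 1, 3, 3, 3, 3])
hyp = np.array([1, 2, 1, 1, 3, 3, 1, 2])  # scattered frame errors
fer = frame_error_rate(ref, hyp)                      # 3/8 = 0.375
ser = segment_error_rate(ref, hyp, [(0, 4), (4, 8)])  # 0.0: votes survive
```

This is why SER is consistently lower than FER in the reported numbers: scattered frame errors are absorbed by the per-segment vote.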
In this study, we investigate the effects of deep learning based speech enhancement as a preprocessor to speaker diarization in quite challenging realistic environments involving background noises, reverberation, and overlapping speech. To improve generalization capability, an advanced long short-term memory (LSTM) architecture with a novel design of hidden layers via densely connected progressive learning and an output layer with multiple-target learning is proposed for the preprocessing. We build the model using synthesized...
This paper introduces the third DIHARD challenge, the third in a series of speaker diarization challenges intended to improve the robustness of diarization systems to variation in recording equipment, noise conditions, and conversational domain. The challenge comprises two tracks evaluating performance when starting from a reference speech segmentation (track 1) or from raw audio from scratch (track 2). We describe the task, metrics, datasets, and evaluation protocol.
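Challenges in this series score systems with diarization error rate (DER): the sum of missed speech, false alarm, and speaker confusion time over total reference speech time. A simplified frame-level sketch, assuming non-overlapping speech and hypothesis speaker labels already mapped onto reference labels (real scoring finds the optimal speaker mapping first):

```python
import numpy as np

def der(ref, hyp):
    # Simplified frame-level diarization error rate for non-overlapping
    # speech: ref/hyp hold one speaker id per frame, or -1 for nonspeech.
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    miss = np.sum((ref != -1) & (hyp == -1))                  # missed speech
    fa = np.sum((ref == -1) & (hyp != -1))                    # false alarm
    conf = np.sum((ref != -1) & (hyp != -1) & (ref != hyp))   # wrong speaker
    total_speech = np.sum(ref != -1)
    return float((miss + fa + conf) / total_speech)

# Toy 8-frame example with two speakers (0 and 1) and one nonspeech frame.
ref = np.array([0, 0, 0, -1, 1, 1, 1, 1])
hyp = np.array([0, 0, 1, -1, 1, 1, -1, 0])
rate = der(ref, hyp)  # (1 miss + 0 FA + 2 confusion) / 7 speech frames
```

The track distinction in the abstract maps onto this metric directly: starting from a reference segmentation (track 1) zeroes out miss and false alarm by construction, leaving only speaker confusion to score.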
Purpose: This study examines the effect of age on language use with an automated analysis of digitized speech obtained from semistructured, narrative samples. Method: We examined Cookie Theft picture descriptions produced by 37 older and 76 young healthy participants. Using modern natural language processing and automatic speech recognition tools, we automatically annotated part-of-speech categories of all tokens, calculated the number of tense-inflected verbs, mean length of clause, and vocabulary diversity, and rated nouns and verbs for...
We conducted experiments on forced alignment in Mandarin Chinese. A corpus of 7,849 utterances was created for the purpose of this study. Systems differing in their use of explicit phone boundary models, glottal features, and tone information were trained and evaluated on the corpus. Results showed that employing special one-state phone boundary models significantly improved alignment accuracy, even when no manual phonetic segmentation was available for training. Spectral features extracted from glottal waveforms (by performing inverse filtering of the speech...
The detection of overlapping speech segments is of key importance in applications involving the analysis of multi-party conversations. The problem is challenging because overlapping segments are typically captured as short utterances in far-field microphone recordings. In this paper, we propose overlap detection using a neural network architecture consisting of long short-term memory (LSTM) models. The network learns the presence of overlap by identifying the spectrotemporal structure of overlapping segments. In order to evaluate model performance, we perform experiments on simulated...