- Speech Recognition and Synthesis
- Natural Language Processing Techniques
- Algorithms and Data Compression
- Speech and dialogue systems
- Semigroups and automata theory
- Topic Modeling
- Music and Audio Processing
- Speech and Audio Processing
- Machine Learning and Algorithms
- Network Packet Processing and Optimization
- Formal Methods in Verification
- DNA and Biological Computing
- Logic, programming, and type systems
- Neural Networks and Applications
- Privacy-Preserving Technologies in Data
- Security and Verification in Computing
- Millimeter-Wave Propagation and Modeling
- Digital Communication and Language
- Wireless Body Area Networks
- Sensor Technology and Measurement Systems
- Distributed systems and fault tolerance
- Blind Source Separation Techniques
- Music Technology and Sound Studies
- Landfill Environmental Impact Studies
- Radiation Effects in Electronics
Google (United States)
2014-2023
Oracle (United Kingdom)
2023
University of Warwick
2023
University of Utah
2019
Carnegie Mellon University
2017-2018
Alphabet (United States)
2016
University of Cincinnati
2009-2010
New York University
2007
AT&T (United States)
1996-2005
Massachusetts Institute of Technology
2005
This paper proposes and evaluates the hybrid autoregressive transducer (HAT) model, a time-synchronous encoder-decoder model that preserves the modularity of conventional automatic speech recognition systems. The HAT model provides a way to measure the quality of the internal language model, which can be used to decide whether inference with an external language model is beneficial or not. We evaluate our proposed model on a large-scale voice search task. Our experiments show significant improvements in WER compared to the state-of-the-art approaches...
Finite-state automata are a very effective tool in natural language processing. However, in a variety of applications, and especially in speech processing, it is necessary to consider more general machines in which arcs are assigned weights or costs. We briefly describe some of the main theoretical and algorithmic aspects of these machines. In particular, we describe an efficient composition algorithm for weighted transducers, and give examples illustrating the value of determinization and minimization algorithms for weighted automata.
Most current spoken-dialog systems only extract sequences of words from a speaker's voice. This largely ignores other useful information that can be inferred from speech, such as gender, age, dialect, or emotion. These characteristics of a speaker's voice, voice signatures, whether static or dynamic, can be useful for speech mining applications or for the design of a natural spoken-dialog system. This paper explores the problem of extracting voice signatures automatically and accurately from a speaker's voice. We investigate two approaches for extracting speaker traits: the first focuses on general acoustic and prosodic...
Mingqing Chen, Ananda Theertha Suresh, Rajiv Mathews, Adeline Wong, Cyril Allauzen, Françoise Beaufays, Michael Riley. Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL). 2019.
We introduce a technique for dynamically applying contextually-derived language models to a state-of-the-art speech recognition system. These generally small-footprint models can be seen as a generalization of cache-based models [1], whereby contextually salient n-grams are derived from relevant sources (not just user-generated language) to produce a model intended for combination with the baseline language model. The models are applied during first-pass decoding as a form of on-the-fly composition between the decoder search graph and a set of weighted...
Several applications of statistical tree-based modelling to problems in speech and language are described here. Classification and regression trees are well suited to many of the pattern recognition problems encountered in this area since they (1) statistically select the most significant features involved, (2) provide "honest" estimates of their performance, (3) permit both categorical and continuous features to be considered, and (4) allow human interpretation and exploration of their result. First the method is summarized, then its application to automatic stop...
This paper explores various static interpolation methods for approximating a single dynamically-interpolated language model used for a variety of recognition tasks on the Google Android platform. The goal is to find the statically-interpolated first-pass LM that best reduces search errors in a two-pass system or that even allows eliminating the more complex dynamic second pass entirely. Static interpolation weights that are uniform, prior-weighted, and the maximum likelihood, maximum a posteriori, and Bayesian solutions are considered. Analysis argues...
We present the concepts of weighted language, transduction and automaton from algebraic automata theory as a general framework for describing and implementing decoding cascades in speech and language processing. This generality allows us to represent uniformly such information sources as pronunciation dictionaries, language models and lattices, and to use uniform algorithms for building decoding stages and for optimizing and combining them. In particular, a single join algorithm can be used either to combine a pronunciation dictionary and a context-dependency model during...
The authors investigate an automatic approach to the segmentation of labeled speech, and to the labeling of speech when only the orthographic transcription is available. The technique is based on a phone recognition system with a trigram phonotactic model, gamma distribution phone duration models, and a spectral model with five different structures for phone models with varying contextual dependencies. The alignment of speech with a given phone sequence is performed as a very constrained recognition task with the phone sequence given. When only the orthographic transcription is provided, classification-tree-based prediction of the most likely phone realizations...
We combine our earlier approach to context-dependent network representation with our algorithm for determinizing weighted networks to build optimized networks for large-vocabulary speech recognition, combining an n-gram language model, a pronunciation dictionary and context-dependency modeling. While fully-expanded networks have been used before in restrictive settings (medium vocabulary or no cross-word contexts), we demonstrate that our network determinization method makes it practical to use fully-expanded networks also in large-vocabulary recognition with full cross-word context modeling. For the DARPA North...
We showed in previous work that weighted finite-state transducers provide a common representation for many components of a speech recognition system, and described general algorithms for combining these representations to build a single optimized and compact transducer integrating all these components, directly mapping from HMM states to words. This approach works well for certain well-controlled input transducers, but presents some problems related to the efficiency of composition and the applicability of determinization...
Methods to predict detailed phonetic pronunciations from a coarse phonemic transcription are described. The phonemic base forms, obtainable from orthographic text by dictionary lookup and other means, do not specify fine phonetic detail such as flapping, glottal stop insertion, or the formation of syllabic nasals and liquids. These phenomena depend on phonemic context (often spanning word boundaries), stress environment, speaking rate, and dialect. A procedure is presented that builds decision trees, trained on the TIMIT database, using...