- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- Speech and dialogue systems
- Natural Language Processing Techniques
- Topic Modeling
- Voice and Speech Disorders
- Phonetics and Phonology Research
- Neural Networks and Applications
- Advanced Data Compression Techniques
- Robotics and Sensor-Based Localization
- Advanced Image and Video Retrieval Techniques
- Geographic Information Systems Studies
- Gene expression and cancer classification
- Industrial Vision Systems and Defect Detection
- Robotics and Automated Systems
- Welding Techniques and Residual Stresses
- Emotion and Mood Recognition
- Algorithms and Data Compression
- Multimodal Machine Learning Applications
- Digital Communication and Language
- Time Series Analysis and Forecasting
- Advanced Text Analysis Techniques
- Face recognition and analysis
- Sensor Technology and Measurement Systems
Intel (Germany): 2021
Intel (United States): 2015-2018
Intel (United Kingdom): 2017
Friedrich-Alexander-Universität Erlangen-Nürnberg: 1999-2014
Siemens (Germany): 2006-2010
Istituto Centrale per la Ricerca Scientifica e Tecnologica Applicata al Mare: 2006
70% to 90% of patients with Parkinson's disease (PD) show an affected voice. Various studies have revealed that voice and prosody are among the earliest indicators of PD. The issue of this study is to automatically detect whether the speech/voice of a person is affected by PD. We employ acoustic features, prosodic features and features derived from a two-mass model of the vocal folds on different kinds of speech tests: sustained phonations, syllable repetitions, read texts and monologues. Classification is performed in either case by SVMs. A correlation-based feature...
Adaptive training aims at reducing the influence of speaker, channel and environment variability on acoustic models. We describe an acoustic normalization approach to adaptive training. Phonetically irrelevant acoustic variability is reduced at the beginning of the training procedure w.r.t. a set of target models. The target models can be HMMs or a Gaussian mixture model (GMM). CMLLR is applied to normalize the acoustic features. The normalized data contains less unwanted variability and is used to generate and train the recognition models. Employing a GMM as target model leads to a text-independent training procedure that can be embedded into the acoustic front-end. On broadcast...
We describe a general-purpose end-to-end audio embeddings generator that can be easily adapted to various acoustic scene and event classification applications. In contrast to many other models for audio classification, this generator does not require a separate feature extraction step, but processes audio samples directly, which simplifies its porting into hardware platforms. Our approach learns a generic embedding representation and is pre-trained on a large dataset. It is then fine-tuned via transfer learning with limited data...
Young speakers are not represented adequately in current speech recognizers. In this paper we focus on the problem of adapting the acoustic frontend of a speech recognizer which has been trained on adults’ speech to achieve better performance on speech from children. We introduce and evaluate a method to perform non-linear VTLN by an unconstrained data-driven optimization of the filterbank. A second approach normalizes the speaking rate of the young speakers with the PSOLA algorithm. Significant reductions in word error rate have been achieved.
This paper focuses on the automatic recognition of a person’s age and gender based only on his or her voice. Up to five different systems are compared and combined in different configurations: three systems model the speaker’s characteristics in different feature spaces, i.e., MFCC, PLP, TRAPS, by Gaussian mixture models. The features of these systems are the concatenated mean vectors. System number 4 uses a physical two-mass vocal model and estimates in a data-driven optimization procedure 9 glottal features from voiced speech sections. For each utterance the minimum, maximum...
The paper deals with the development of acoustic models of foreign words for a German speech recognizer. The recognition quality of foreign words is crucial for the overall performance of a system in application fields like spoken dialogue systems, when foreign words occur as proper names. One of the main problems of the acoustic modeling of foreign words is the limitation of the training data, which must contain samples of the non-native pronunciation of foreign sounds. In order to obtain robust models, which are still precise enough, we compare several methods to map or merge phonemes, which are pronounced in a similar way by...
We develop an acoustic feature set for the estimation of a person’s age from a recorded speech signal. The baseline features are Mel-frequency cepstral coefficients (MFCCs), which are extended by various prosodic features, pitch and formant frequencies. From experiments on the University of Florida Vocal Aging Database we can draw different conclusions. On the one hand, adding prosodic, pitch and formant features to the MFCC baseline leads to relative reductions of the mean absolute error between 4% and 20%. Improvements are even larger when perceptual age labels are taken...
Considering the dereverberation problem using multichannel processing, two main paradigms exist. The first paradigm utilizes the long-term correlation of the reverberant component for reducing it, e.g. Weighted Prediction Error (WPE) [1]. The second paradigm treats the reverberation as a diffuse noise field, statistically independent of the direct speech component, and aims to reduce it using a superdirective beamformer, e.g. [2]. Here we propose to combine the two paradigms in a two-stage algorithm. The first stage comprises the WPE method, and the second stage comprises the Minimum Variance...
We introduce a new technique to improve the recognition of non-native speech. The underlying assumption is that for each non-native pronunciation of a speech sound, there is at least one sound in the target language that has a similar native pronunciation. The adaptation is performed by HMM interpolation between adequate native acoustic models. The interpolation partners are determined automatically in a data-driven manner. Our experiments show that this technique is suitable both for the offline adaptation to a whole group of speakers as well as for the unsupervised online adaptation to a single speaker. Results are given...
For many aspects of speech therapy an objective evaluation of the intelligibility of a patient's speech is needed. We investigate the evaluation of the intelligibility by means of automatic speech recognition. Previous studies have shown that measures like word accuracy are consistent with human experts' ratings. To ease the patient's burden, it is highly desirable to conduct the assessment via phone. However, the telephone channel influences the quality of the speech signal, which negatively affects the results. To reduce inaccuracies, we propose a combination of two speech recognizers. Experiments on sets...
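The word accuracy measure mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the common definition, one minus the number of substitutions, deletions and insertions divided by the number of reference words; the abstract does not state the exact formula, and the function name `word_accuracy` is hypothetical.

```python
def word_accuracy(reference: str, hypothesis: str) -> float:
    """Word accuracy = 1 - (substitutions + deletions + insertions) / N.

    Assumption: errors are counted with a word-level Levenshtein
    alignment, where N is the number of reference words.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]  # aligning ref[:i] with an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                         # deletion
                cur[j - 1] + 1,                      # insertion
                prev[j - 1] + (ref_word != hyp_word) # match or substitution
            ))
        prev = cur
    return 1.0 - prev[-1] / len(ref)
```

Because insertions count as errors, the value can become negative when the recognizer outputs many more words than the reference contains.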
In this work we explore the application of AI to robotic welding. Robotic welding is a widely used technology in many industries, but robots currently do not have the capability to detect welding defects which get introduced due to various reasons in the welding process. We describe how deep-learning methods can be applied to detect weld defects in real-time by recording the welding process with microphones and a camera. Our findings are based on a large database of more than 4000 welding samples we collected, which covers different weld types, materials and defect categories. All deep...
The paper investigates the integration of heteroscedastic linear discriminant analysis (HLDA) into adaptively trained speech recognizers. Two different approaches are compared: the first is a variant of CMLLR-SAT, the second is based on our previously introduced method of constrained maximum-likelihood speaker normalization (CMLSN). For the latter, both the HLDA projection and the speaker-specific transformations for CMLSN are estimated w.r.t. a set of simple target-models. It is investigated if additional robustness can be achieved by...
The degree of sleepiness in the Sleepy Language Corpus from the Interspeech 2011 Speaker State Challenge is predicted with regression and a very large feature vector. Most notable is the great gender difference, which can mainly be attributed to females showing their sleepiness less than males do.
Most speech recognition systems are based on Mel-frequency cepstral coefficients and their first- and second-order derivatives. The derivatives are normally approximated by fitting a linear regression line to a fixed-length segment of consecutive frames. The time resolution and smoothness of the estimated derivative depend on the length of the segment. We present an approach to improve the representation of speech dynamics, which is based on the combination of multiple time resolutions. The resulting feature vector is transformed to reduce its dimension and the correlation...
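The regression-based derivative described in the abstract above can be sketched as follows. This is a minimal illustration under assumptions, not the paper's implementation: it uses the standard least-squares slope over a window of 2N+1 frames, repeats the edge frames as padding, and simply concatenates the derivatives from several window lengths. The function names and the half-widths (1, 2, 4) are hypothetical, and the subsequent dimension-reducing transform is not shown.

```python
import numpy as np

def delta(features: np.ndarray, half_width: int) -> np.ndarray:
    """Slope of a regression line fitted over 2*half_width + 1 frames.

    features: array of shape (num_frames, num_coeffs).
    Edge frames are repeated so the output keeps the input shape.
    """
    num_frames = features.shape[0]
    padded = np.pad(features, ((half_width, half_width), (0, 0)), mode="edge")
    offsets = np.arange(1, half_width + 1)
    denom = 2.0 * np.sum(offsets ** 2)

    out = np.zeros(features.shape, dtype=float)
    for n in offsets:
        ahead = padded[half_width + n : half_width + n + num_frames]
        behind = padded[half_width - n : half_width - n + num_frames]
        out += n * (ahead - behind)
    return out / denom

def multi_resolution_deltas(features: np.ndarray,
                            half_widths=(1, 2, 4)) -> np.ndarray:
    """Stack static features with derivatives at several time resolutions."""
    return np.hstack([features] + [delta(features, w) for w in half_widths])
```

A short window follows fast changes but is noisy, while a long window gives a smoother estimate with coarser time resolution, which is the trade-off the abstract describes.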
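The problem of the effect of accent on the performance of Automatic Speech Recognition (ASR) systems is well known. In this paper, we study the effect of accent variability on the performance of the Indian English ASR task. We evaluate the test vocabularies on HMMs trained on (a) accent-specific training data, (b) accent-pooled training data, which combines all the accent-specific training data, and (c) accent-pooled training data of reduced size, matching the size of the accent-specific training data. We demonstrate that the accent-pooled training set performs best on phonetically rich isolated word recognition. But the accent-specific HMMs perform better than the reduced-size accent-pooled HMMs, indicating a possible approach of using a first-stage accent identification to choose the correct...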