- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- Natural Language Processing Techniques
- Neural Networks and Applications
- Gaussian Processes and Bayesian Inference
- Speech and dialogue systems
- Phonetics and Phonology Research
- Time Series Analysis and Forecasting
- Topic Modeling
- Advanced Data Compression Techniques
- Personality Traits and Psychology
- Voice and Speech Disorders
- Emotion and Mood Recognition
- Sentiment Analysis and Opinion Mining
- Hearing Loss and Rehabilitation
- Neuroscience and Music Perception
- Blind Source Separation Techniques
- Advanced Text Analysis Techniques
- Fault Detection and Control Systems
- Mental Health via Writing
- Bayesian Modeling and Causal Inference
- Handwritten Text Recognition Techniques
- Cognitive Science and Education Research
- Software Engineering Research
University of Aizu
2013-2024
The Institute of Statistical Mathematics
2016
Data61
2009
National Institute of Information and Communications Technology
2007-2008
Advanced Telecommunications Research Institute International
2004-2008
KRI
2008
Language Science (South Korea)
2006
Toyohashi University of Technology
1996-2002
In this paper, we describe the ATR multilingual speech-to-speech translation (S2ST) system, which is mainly focused on translation between English and Asian languages (Japanese and Chinese). There are three main modules in our S2ST system: large-vocabulary continuous speech recognition, machine text-to-text (T2T) translation, and text-to-speech synthesis. All of them were designed using state-of-the-art technologies developed at ATR. A corpus-based statistical learning framework forms the basis of the system design. We use a...
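The abstract describes a three-stage cascade (speech recognition, text-to-text translation, speech synthesis). The sketch below is only a minimal illustration of wiring such a cascade together; the class and function names are hypothetical placeholders, not the ATR implementation.

```python
from dataclasses import dataclass

# Hypothetical interfaces for a cascaded S2ST system
# (ASR -> text-to-text MT -> TTS). Illustrative only.

@dataclass
class S2STPipeline:
    asr: callable         # waveform -> source-language text
    translator: callable  # source text -> target text
    tts: callable         # target text -> waveform

    def translate_speech(self, waveform, src="ja", tgt="en"):
        source_text = self.asr(waveform)
        target_text = self.translator(source_text, src=src, tgt=tgt)
        return self.tts(target_text)

# Toy stand-ins so the sketch runs end to end.
pipeline = S2STPipeline(
    asr=lambda wav: "konnichiwa",
    translator=lambda text, src, tgt: "hello",
    tts=lambda text: [0.0] * 16000,  # 1 s of silence as a dummy waveform
)

audio_out = pipeline.translate_speech(waveform=[0.0] * 16000)
print(len(audio_out))
```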
Gaussian Processes (GPs) are Bayesian nonparametric models that are becoming more and more popular for their superior ability to capture highly nonlinear data relationships in various tasks, such as dimensionality reduction, time series analysis, and novelty detection, as well as classical regression and classification tasks. In this paper, we investigate the feasibility and applicability of GP models for music genre classification and emotion estimation. These are two of the main tasks in the music information retrieval (MIR) field. So far, the support vector machine...
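As a minimal sketch of GP-based emotion estimation, the example below fits a GP regressor with an RBF kernel to predict a continuous emotion value (e.g. arousal) from audio features; the features and labels are random stand-ins, not the paper's data.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Synthetic stand-ins for song-level audio features and arousal labels.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 20))        # 100 songs x 20 audio features
y_train = rng.uniform(-1.0, 1.0, size=100)  # arousal labels in [-1, 1]

# Kernel hyperparameters are optimized during fitting.
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(X_train, y_train)

X_test = rng.normal(size=(5, 20))
mean, std = gp.predict(X_test, return_std=True)  # predictive mean and uncertainty
print(mean, std)
```

The predictive standard deviation is the main practical difference from an SVM-style regressor: each emotion estimate comes with an uncertainty measure.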
Many approaches have been proposed to automatically infer users' personality from their social network activities. However, the performance of these approaches depends heavily on the data representation. In this work, we apply deep learning methods to learn a suitable data representation for the personality recognition task. In our experiments, we used Facebook status update data. We investigated several neural network architectures, such as fully-connected (FC), convolutional (CNN) and recurrent (RNN) networks, on the myPersonality shared task...
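A minimal sketch of one of the architectures mentioned (a CNN text classifier predicting Big Five traits as a multi-label problem) is shown below; the vocabulary size, sequence length and data are invented, and the layer sizes are not those of the paper.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Synthetic tokenized status updates and one binary label per Big Five trait.
vocab_size, max_len, n_traits = 5000, 100, 5
X = np.random.randint(1, vocab_size, size=(200, max_len))
y = np.random.randint(0, 2, size=(200, n_traits))

model = tf.keras.Sequential([
    layers.Embedding(vocab_size, 64),              # learned word representation
    layers.Conv1D(128, 5, activation="relu"),      # convolution over word windows
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(n_traits, activation="sigmoid"),  # multi-label trait output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=2, batch_size=32, verbose=0)
```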
Music as a form of art is intentionally composed to be emotionally expressive. The emotional features of music are invaluable for indexing and recommendation. In this paper we present a cross-comparison of automatic emotion analysis of music. We created a public dataset of Creative Commons licensed songs. Using the valence-arousal model, the songs were annotated both in terms of the emotions expressed by the whole excerpt and dynamically with 1 Hz temporal resolution. Each song received 10 annotations on Amazon Mechanical Turk...
In this paper, we describe a new high-performance on-line speaker diarization system which works faster than real-time and has very low latency. It consists of several modules including voice activity detection, novelty detection, and gender and speaker identity classification. All modules share a set of Gaussian mixture models (GMMs) representing pause, male and female speakers, and each individual speaker. Initially, there are only three GMMs, for pause and the two genders, trained in advance from some data. During the diarization process, each speech segment...
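The shared-GMM idea can be illustrated with a small sketch: pre-trained models for pause and the two genders, and each incoming segment labelled by the best-scoring model. The feature frames below are random stand-ins for e.g. MFCCs, and the model sizes are arbitrary.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

def train_gmm(frames, n_components=8):
    return GaussianMixture(n_components=n_components, covariance_type="diag").fit(frames)

# Pre-trained background models; speaker models would be added on the fly.
models = {
    "pause": train_gmm(rng.normal(0.0, 1.0, size=(500, 13))),
    "male": train_gmm(rng.normal(1.0, 1.0, size=(500, 13))),
    "female": train_gmm(rng.normal(-1.0, 1.0, size=(500, 13))),
}

def label_segment(segment_frames):
    # Average per-frame log-likelihood under each GMM; pick the best model.
    scores = {name: gmm.score(segment_frames) for name, gmm in models.items()}
    return max(scores, key=scores.get)

print(label_segment(rng.normal(1.0, 1.0, size=(50, 13))))
```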
Many studies have shown that articulatory features can significantly improve the performance of automatic speech recognition systems. Unfortunately, such features are not available at recognition time. There are two main approaches to solve this problem: a feature-based approach, the most popular example of which is acoustic-to-articulatory inversion, where the missing features are generated from the acoustic signal, and a model-based approach, where articulatory information is embedded in the model structure and parameters in a way that allows using only acoustic features. In this paper, we propose a new...
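The feature-based approach can be pictured as a plain regression problem: learn a mapping from acoustic frames to articulatory parameters, then use it at recognition time when only acoustics are available. The sketch below uses a generic MLP regressor and fully synthetic data; it is not the method proposed in the paper.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
acoustic = rng.normal(size=(1000, 39))      # e.g. MFCC + deltas per frame (synthetic)
articulatory = rng.normal(size=(1000, 14))  # e.g. articulator positions per frame (synthetic)

# Acoustic-to-articulatory inversion as multi-output regression.
inverter = MLPRegressor(hidden_layer_sizes=(256, 256), max_iter=50)
inverter.fit(acoustic, articulatory)

# At recognition time only acoustic features are observed;
# the inverter supplies the "missing" articulatory stream.
predicted_articulatory = inverter.predict(rng.normal(size=(10, 39)))
print(predicted_articulatory.shape)
```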
Detection of speakers which have not been seen before is an essential part of every online speaker diarization system. New speaker detection accuracy has a direct impact on the overall system performance. In our previous system, for novelty detection we used a global GMM likelihood ratio (LR) threshold. However, as system analysis showed, the optimal threshold depends on the gender as well as the number of registered speakers. In this paper, we present the results of this analysis and the approach taken to solve the problem. First, we use different thresholds for male and female speakers,...
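The decision rule itself is simple enough to sketch: compare a log-likelihood ratio against a threshold that depends on gender and on how many speakers are already registered. The threshold values and the adaptation rule below are illustrative placeholders, not the paper's tuned settings.

```python
# Illustrative gender-specific base thresholds (not values from the paper).
BASE_THRESHOLD = {"male": 0.5, "female": 0.7}

def is_new_speaker(best_speaker_llh, gender_llh, gender, n_registered):
    """Declare a new speaker when the best registered-speaker model does not
    beat the gender background model by a sufficient margin. The margin grows
    with the number of registered speakers (illustrative adaptation)."""
    lr = best_speaker_llh - gender_llh                  # log-likelihood ratio
    threshold = BASE_THRESHOLD[gender] + 0.05 * n_registered
    return lr < threshold

print(is_new_speaker(best_speaker_llh=-41.2, gender_llh=-41.5,
                     gender="male", n_registered=3))
```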
Despite the progress of deep neural networks over the last decade, state-of-the-art speech recognizers in noisy environment conditions are still far from reaching satisfactory performance. Methods to improve noise robustness usually include adding components to the recognition system that often need further optimization. For this reason, data augmentation of input features derived from the Short-Time Fourier Transform (STFT) has become a popular approach. However, for many signal processing tasks, there is evidence...
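A minimal example of augmenting STFT-derived features is given below: additive noise on the log-magnitude spectrogram plus a frequency mask. The waveform is synthetic and the augmentation parameters are arbitrary; this is only meant to illustrate the general recipe, not the paper's specific scheme.

```python
import numpy as np
import librosa

rng = np.random.default_rng(0)
y = rng.normal(scale=0.1, size=16000).astype(np.float32)  # stand-in for 1 s of speech

spec = np.abs(librosa.stft(y, n_fft=512, hop_length=160))  # magnitude spectrogram
log_spec = np.log1p(spec)

def augment(feat, noise_std=0.05, mask_width=20):
    out = feat + rng.normal(scale=noise_std, size=feat.shape)  # additive noise
    f0 = rng.integers(0, out.shape[0] - mask_width)
    out[f0:f0 + mask_width, :] = 0.0                           # frequency mask
    return out

augmented = augment(log_spec.copy())
print(log_spec.shape, augmented.shape)
```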
The availability of large amounts of raw unlabeled data has sparked the recent surge in semi-supervised learning research. In most works, however, it is assumed that the labeled and unlabeled data come from the same distribution. This restriction is removed in the self-taught learning approach, where the distributions can be different but nevertheless have similar structure. First, a representation is learned from the unlabeled data via sparse coding and is then applied to the labeled data used for classification. In this work, we implemented this method for the music genre classification task using two different...
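The self-taught recipe described here maps directly onto standard tools: learn a sparse-coding dictionary on unlabeled data, re-encode the labeled set with it, and train an ordinary classifier on the codes. The sketch below uses random data in place of audio features and arbitrary dictionary sizes.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_unlabeled = rng.normal(size=(300, 40))  # unlabeled audio features (synthetic)
X_labeled = rng.normal(size=(100, 40))    # labeled genre examples (synthetic)
y_labeled = rng.integers(0, 4, size=100)  # 4 genre classes

# Step 1: learn the representation (dictionary) from unlabeled data only.
coder = DictionaryLearning(n_components=64, alpha=1.0, max_iter=20)
coder.fit(X_unlabeled)

# Step 2: encode labeled data as sparse activations and classify.
Z_labeled = coder.transform(X_labeled)
clf = LinearSVC().fit(Z_labeled, y_labeled)
print(clf.score(Z_labeled, y_labeled))
```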
In this paper, we describe a method for phoneme set selection based on a combination of phonological and statistical information and its application to Russian speech recognition. For this language, the currently used phoneme sets are mostly rule-based or heuristically derived from the standard SAMPA or IPA phonetic alphabets. However, for some other languages, statistical methods have been found useful for phoneme set optimization. In Russian, almost all phonemes come in pairs: consonants can be hard or soft, and vowels stressed or unstressed. First, we start with a big phoneme set and then...
The availability of large amounts of raw unlabeled data has sparked the recent surge in semi-supervised learning research. In most works, however, it is assumed that the labeled and unlabeled data come from the same distribution. This restriction is removed in the self-taught learning algorithm, where the distributions can be different but nevertheless have similar structure. First, a representation is learned from the unlabeled samples by decomposing their feature matrix into two matrices called bases and activations, respectively. This procedure is justified by the assumption that each sample is a linear...
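The bases/activations factorization described here is, in one common form, a non-negative matrix factorization: each sample is modelled as a linear combination of shared bases with non-negative activations. The sketch below uses NMF on a synthetic non-negative feature matrix as a stand-in for the decomposition step.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X_unlabeled = rng.random((300, 40))  # non-negative feature matrix (rows = samples)

# Decompose into activations (per sample) and bases (shared).
nmf = NMF(n_components=32, init="nndsvda", max_iter=500)
activations_unlabeled = nmf.fit_transform(X_unlabeled)
bases = nmf.components_

# Labeled samples are then encoded against the same bases and fed
# to an ordinary classifier.
X_labeled = rng.random((100, 40))
activations_labeled = nmf.transform(X_labeled)
print(bases.shape, activations_labeled.shape)
```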
In speaker recognition, when cepstral coefficients are calculated from LPC analysis parameters, the prediction error, or residual signal, is usually ignored. However, there is evidence that it contains speaker-specific information. The fundamental frequency of the speech signal, or pitch, which can be extracted from the residual, has been used for speaker recognition purposes, but because of its high intraspeaker variability pitch is also often ignored. This paper describes our approach to integrating the LPC-residual with the LPC-cepstrum in a Gaussian...
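For reference, the two information sources discussed here, the LPC coefficients (spectral envelope) and the LPC residual (prediction error), can be extracted as sketched below. The frame is synthetic noise rather than real speech, and the analysis order is arbitrary.

```python
import numpy as np
import librosa
import scipy.signal

rng = np.random.default_rng(0)
frame = rng.normal(scale=0.1, size=400).astype(np.float64)  # 25 ms at 16 kHz (synthetic)

order = 12
a = librosa.lpc(frame, order=order)               # LPC coefficients [1, a1, ..., a_order]
residual = scipy.signal.lfilter(a, [1.0], frame)  # prediction-error (residual) signal

# Residual-derived features (energy, pitch, etc.) can then be combined with
# LPC-cepstral features in a GMM-based speaker model.
print(a.shape, float(np.sum(residual ** 2)))
```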
In this paper, we present a review of the latest developments in Russian speech recognition research. Although the underlying technology is mostly language-independent, differences between languages with respect to their structure and grammar have a substantial effect on recognition system performance. The Russian language has a complicated word formation system, which is characterized by a high degree of inflection and unrigidness of the word order. This greatly reduces the predictive power of conventional language models and consequently increases the error rate....
Automatic emotion recognition from speech has focused mainly on identifying categorical or static affect states, but the spectrum of human emotion is continuous and time-varying. In this paper, we present a system for dynamic emotion recognition based on state-space models (SSMs). The prediction of the unknown emotion trajectory in the space spanned by the Arousal, Valence and Dominance (A-V-D) descriptors is cast as a time series filtering task. The state-space models we investigated include a standard linear model (Kalman filter) as well as a novel non-linear, non-parametric...
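The linear baseline mentioned above (a Kalman filter over the 3-D A-V-D trajectory) is easy to sketch directly: treat noisy per-frame emotion estimates as observations of a random-walk state. The noise covariances and the observation sequence below are synthetic, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
T, dim = 100, 3  # frames x (Arousal, Valence, Dominance)
observations = (np.cumsum(rng.normal(scale=0.05, size=(T, dim)), axis=0)
                + rng.normal(scale=0.2, size=(T, dim)))  # noisy per-frame estimates

F = np.eye(dim)         # state transition (random walk)
H = np.eye(dim)         # observation model
Q = 0.01 * np.eye(dim)  # process noise covariance
R = 0.04 * np.eye(dim)  # observation noise covariance

x, P = np.zeros(dim), np.eye(dim)  # initial state estimate and covariance
filtered = []
for z in observations:
    # Predict step
    x = F @ x
    P = F @ P @ F.T + Q
    # Update step
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
    x = x + K @ (z - H @ x)
    P = (np.eye(dim) - K @ H) @ P
    filtered.append(x.copy())

print(np.array(filtered).shape)
```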
It is difficult to recognize speech distorted by various factors, especially when an ASR system contains only a single acoustic model. One solution is to use multiple acoustic models, one model for each different condition. In this paper, we discuss a parallel decoding-based ASR system that is robust to the noise type, SNR, speaker gender and speaking style. Our system consists of two recognition channels based on MFCC and Differential MFCC (DMFCC) features. Each channel has several acoustic models depending on the speaking style, adapted by fast adaptation. From...
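As a small illustration of the two feature streams, the sketch below extracts MFCCs and their time derivatives (used here as a stand-in for the paper's DMFCC features) from a synthetic waveform; each stream would then feed its own decoding channel with condition-dependent acoustic models.

```python
import numpy as np
import librosa

sr = 16000
y = np.random.default_rng(0).normal(scale=0.1, size=sr).astype(np.float32)  # synthetic audio

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # channel 1 features
dmfcc = librosa.feature.delta(mfcc)                 # channel 2 features (delta stand-in)

print(mfcc.shape, dmfcc.shape)
```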
In this paper we introduce Gaussian Process (GP) models for music genre classification. Gaussian Processes are widely used in various regression and classification tasks, but there are relatively few studies where GPs are applied in audio signal processing systems. GP models are non-parametric discriminative classifiers similar to the well known SVMs in terms of usage. In contrast to SVMs, however, they produce a truly probabilistic output and allow the kernel function parameters to be learned from the training data. In this work we compare the performance as...
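A minimal sketch of GP classification for genre labels is shown below; the features and labels are random placeholders, but it illustrates the two properties highlighted above: a probabilistic output and kernel hyperparameters learned from the training data.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X_train = rng.normal(size=(120, 20))    # 120 tracks x 20 audio features (synthetic)
y_train = rng.integers(0, 4, size=120)  # 4 genre classes (synthetic)

# The RBF length scale is optimized during fitting.
gpc = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0))
gpc.fit(X_train, y_train)

proba = gpc.predict_proba(rng.normal(size=(3, 20)))  # per-class probabilities
print(proba)
```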