- Speech and Audio Processing
- Music and Audio Processing
- Speech Recognition and Synthesis
- Blind Source Separation Techniques
- Music Technology and Sound Studies
- Allergic Rhinitis and Sensitization
- Advanced Adaptive Filtering Techniques
- Diverse Musicological Studies
- Food Allergy and Anaphylaxis Research
- Asthma and respiratory diseases
- Underwater Acoustics Research
- Contact Dermatitis and Allergies
- Hearing Loss and Rehabilitation
- Acoustic Wave Phenomena Research
- Video Analysis and Summarization
- Animal Vocal Communication and Behavior
- Noise Effects and Management
- Occupational exposure and asthma
- Natural Language Processing Techniques
- Geophysical Methods and Applications
- Arctic and Antarctic ice dynamics
- Advanced Data Compression Techniques
- Structural Health Monitoring Techniques
- Monoclonal and Polyclonal Antibodies Research
- Anomaly Detection Techniques and Applications
- Tampere University, 2016-2025
- Nokia (Finland), 2025
- University of Surrey, 2023
- University of Eastern Finland, 2010-2022
- Signal Processing (United States), 2015-2021
- Institute of Electrical and Electronics Engineers, 2021
- Tampere University of Applied Sciences, 2007-2018
- Tampere University, 2008-2018
- Shenyang Institute of Automation, 2016
- Chinese Academy of Sciences, 2016
An unsupervised learning algorithm for the separation of sound sources in one-channel music signals is presented. The algorithm is based on factorizing the magnitude spectrogram of an input signal into a sum of components, each of which has a fixed spectrum and a time-varying gain. Each sound source, in turn, is modeled as a sum of one or more components. The parameters of the components are estimated by minimizing the reconstruction error between the input spectrogram and the model, while restricting the component spectrograms to be nonnegative and favoring components whose gains are slowly varying and sparse...
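The factorization described above can be sketched as plain non-negative matrix factorization with multiplicative updates. This is a minimal illustration only: it omits the temporal-continuity and sparseness costs the abstract mentions, and all function names are ours.

```python
import random

def matmul(A, B):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def nmf(V, k, iters=500, eps=1e-9):
    """Factorize a nonnegative F x T magnitude spectrogram V into
    W (F x k fixed spectra) and H (k x T time-varying gains) by
    minimizing the Euclidean reconstruction error (Lee-Seung updates)."""
    F, T = len(V), len(V[0])
    rng = random.Random(0)
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(F)]
    H = [[rng.random() + 0.1 for _ in range(T)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H); multiplicative update keeps H nonnegative
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(T)] for i in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(F)]
    return W, H
```

With a spectrogram that is exactly a sum of k components, the reconstruction error converges close to zero; on real spectrograms the extra continuity and sparseness terms steer the components toward meaningful sources.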
Given the recent surge in developments of deep learning, this paper provides a review of the state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side-by-side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential areas of cross-fertilization. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including...
Sound events often occur in unstructured environments where they exhibit wide variations in their frequency content and temporal structure. Convolutional neural networks (CNNs) are able to extract higher-level features that are invariant to local spectral variations. Recurrent neural networks (RNNs) are powerful in learning the longer-term context of audio signals. CNNs and RNNs as classifiers have recently shown improved performance over established methods in various sound recognition tasks. We combine these two approaches in a convolutional recurrent neural...
This paper presents and discusses various metrics proposed for the evaluation of polyphonic sound event detection systems used in realistic situations where there are typically multiple sound sources active simultaneously. The system output in this case contains overlapping events, marked as multiple sounds detected as being active at the same time. This requires a suitable procedure for evaluating the output against a reference. Metrics from neighboring fields such as speech recognition and speaker diarization can be used, but they need to be partially redefined...
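As a concrete instance of such metrics, the sketch below computes a segment-based F-score and error rate over multilabel output, counting substitutions, deletions, and insertions per segment in the way commonly used for polyphonic detection. The function name is ours, and this is an illustration rather than the paper's exact toolchain.

```python
def segment_metrics(ref, est):
    """ref, est: per-segment sets of active class labels.
    Returns (F-score, error rate) computed segment by segment."""
    TP = FP = FN = 0
    subs = dels = ins = n_ref = 0
    for r, e in zip(ref, est):
        tp, fp, fn = len(r & e), len(e - r), len(r - e)
        TP += tp; FP += fp; FN += fn
        subs += min(fp, fn)          # one FP paired with one FN = substitution
        dels += max(0, fn - fp)      # unmatched misses = deletions
        ins += max(0, fp - fn)       # unmatched false alarms = insertions
        n_ref += len(r)
    f = 2 * TP / (2 * TP + FP + FN) if (2 * TP + FP + FN) else 1.0
    er = (subs + dels + ins) / n_ref if n_ref else 0.0
    return f, er
```

Note that, unlike accuracy, the error rate can exceed 1.0 when a system inserts many events absent from the reference.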
We introduce the TUT Acoustic Scenes 2016 database for environmental sound research, consisting of binaural recordings from 15 different acoustic environments. A subset of this database, called TUT Sound Events 2016, contains annotations of individual sound events, specifically created for sound event detection. It consists of residential area and home environments, and is manually annotated to mark the onset, offset and label of sound events. In this paper we present the recording and annotation procedure, the database content, and a recommended cross-validation setup...
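A recommended cross-validation setup for such data typically keeps all recordings from one location in the same fold, so that training and test material never come from the same physical place. The greedy sketch below is our own illustration of that idea (the names are hypothetical, not the dataset's tooling).

```python
import collections

def grouped_folds(recordings, n_folds):
    """recordings: list of (recording_id, location_id) pairs.
    Assigns whole locations to the currently smallest fold, so recordings
    from the same location never end up in different folds."""
    by_loc = collections.defaultdict(list)
    for rec, loc in recordings:
        by_loc[loc].append(rec)
    folds = [[] for _ in range(n_folds)]
    # place the largest location groups first to keep folds balanced
    for loc, recs in sorted(by_loc.items(), key=lambda kv: -len(kv[1])):
        min(folds, key=len).extend(recs)
    return folds
```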
Type I allergy is an immunoglobulin E (IgE)-mediated hypersensitivity disease affecting more than 25% of the population. Currently, diagnosis is performed by provocation testing and IgE serology using allergen extracts. This process defines allergen-containing sources but cannot identify the disease-eliciting allergenic molecules. We have applied microarray technology to develop a miniaturized allergy test containing 94 purified allergen molecules that represent the most common allergen sources. The microarray allows the determination...
In this paper, we propose a convolutional recurrent neural network for joint sound event localization and detection (SELD) of multiple overlapping sound events in three-dimensional (3D) space. The proposed network takes a sequence of consecutive spectrogram time-frames as input and maps it to two outputs in parallel. As the first output, sound event detection (SED) is performed as a multi-label classification task on each time-frame, producing the temporal activity of all sound event classes. As the second output, localization is performed by estimating the 3D Cartesian coordinates of the direction-of-arrival...
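For reference, a direction-of-arrival given as azimuth and elevation maps to 3D Cartesian coordinates on the unit sphere. The axis convention below (x toward the front, y to the left, z up) is a common one and an assumption on our part, not necessarily the paper's.

```python
import math

def doa_to_cartesian(azimuth_deg, elevation_deg):
    """Convert a DOA in degrees to a unit vector (x, y, z)."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    return (math.cos(el) * math.cos(az),   # x: front
            math.cos(el) * math.sin(az),   # y: left
            math.sin(el))                  # z: up
```

Regressing Cartesian coordinates rather than raw angles avoids the 360°/0° wrap-around discontinuity in the loss.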
This paper proposes to use exemplar-based sparse representations for noise-robust automatic speech recognition. First, we describe how speech can be modeled as a linear combination of a small number of exemplars from a large exemplar dictionary. The exemplars are time-frequency patches of real speech, each spanning multiple time frames. We then propose to model speech corrupted by additive noise as a linear combination of speech and noise exemplars, and derive an algorithm for recovering this combination from the observed noisy speech. The framework can be used for doing hybrid exemplar-based/HMM recognition...
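Recovering non-negative activations of a fixed exemplar dictionary can be sketched with multiplicative updates on a single observed magnitude vector. This is a minimal illustration (Euclidean cost with an L1 sparsity penalty), not the paper's exact algorithm, and the names are ours.

```python
def sparse_activations(A, y, lam=0.01, iters=300, eps=1e-9):
    """A: list of exemplar columns (each a list of length F);
    y: observed magnitude vector of length F.
    Multiplicative updates for min ||y - A x||^2 + lam * sum(x), x >= 0."""
    F, N = len(y), len(A)
    x = [1.0] * N
    for _ in range(iters):
        # current reconstruction A x
        Ax = [sum(A[j][f] * x[j] for j in range(N)) for f in range(F)]
        new_x = []
        for j in range(N):
            num = sum(A[j][f] * y[f] for f in range(F))
            den = sum(A[j][f] * Ax[f] for f in range(F)) + lam + eps
            new_x.append(x[j] * num / den)   # stays nonnegative by construction
        x = new_x
    return x
```

Once the activations are found, summing only the speech-exemplar part of the reconstruction yields a denoised estimate of the speech.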
In this paper we present an approach to polyphonic sound event detection in real-life recordings based on bi-directional long short-term memory (BLSTM) recurrent neural networks (RNNs). A single multilabel BLSTM RNN is trained to map the acoustic features of a mixture signal, consisting of sounds from multiple classes, to binary activity indicators of each event class. Our method is tested on a large database of real-life recordings, with 61 classes (e.g. music, car, speech) from 10 different everyday contexts. The proposed...
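The binary activity indicators used as training targets can be derived from event annotations (onset, offset, label) by marking, for each analysis frame, which classes are active. A minimal sketch with hypothetical names:

```python
def activity_matrix(events, classes, n_frames, hop=0.1):
    """events: list of (onset_s, offset_s, label) annotations.
    Returns one binary indicator vector per analysis frame;
    overlapping events simply set several indicators in the same frame."""
    idx = {c: i for i, c in enumerate(classes)}
    Y = [[0] * len(classes) for _ in range(n_frames)]
    for t in range(n_frames):
        frame_time = t * hop
        for onset, offset, label in events:
            if onset <= frame_time < offset:
                Y[t][idx[label]] = 1
    return Y
```

A multilabel network then predicts each indicator independently (e.g. one sigmoid per class), rather than choosing a single class per frame.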
Public evaluation campaigns and datasets promote active development in target research areas, allowing direct comparison of algorithms. The second edition of the challenge on detection and classification of acoustic scenes and events (DCASE 2016) has offered such an opportunity for development of state-of-the-art methods, and succeeded in drawing together a large number of participants from academic and industrial backgrounds. In this paper, we report on the tasks and outcomes of the DCASE 2016 challenge. The challenge comprised four tasks: acoustic scene classification,...
In this paper, the use of multilabel neural networks is proposed for the detection of temporally overlapping sound events in realistic environments. Real-life recordings typically have many overlapping sound events, making it hard to recognize each event with standard methods. Frame-wise spectral-domain features are used as inputs to train a deep neural network for the multilabel classification task in this work. The model is evaluated on recordings from everyday environments and obtains an overall accuracy of 63.8%. The method is compared against a state-of-the-art approach using non-negative...
This paper proposes a deep neural network for estimating the directions of arrival (DOA) of multiple sound sources. The proposed stacked convolutional and recurrent neural network (DOAnet) generates a spatial pseudo-spectrum (SPS) along with DOA estimates in both azimuth and elevation. We avoid any explicit feature extraction step by using the magnitudes and phases of the spectrograms of all channels as input to the network. DOAnet is evaluated by estimating the DOAs of concurrently present sources in anechoic, matched and unmatched reverberant conditions. The results...
The work presented in this article studies how context information can be used in the automatic sound event detection process, and how the detection system can benefit from such information. Humans use context information to make more accurate predictions about sound events, ruling out events that are unlikely in a given context. We propose a similar utilization of context information in the detection process. The proposed approach is composed of two stages: a context recognition stage and a sound event detection stage. Contexts are modeled with Gaussian mixture models, and sound events with three-state left-to-right hidden Markov models. In the first stage, the audio is tested...
This paper introduces the acoustic scene classification task of the DCASE 2018 Challenge and the TUT Urban Acoustic Scenes dataset provided for the task, and evaluates the performance of a baseline system on the task. As in previous years of the challenge, the task is defined as the classification of short audio samples into one of a set of predefined classes, using a supervised, closed-set setup. The newly recorded dataset consists of ten different acoustic scenes and was recorded in six large European cities; it therefore has higher variability than the datasets used previously. In addition to high-quality binaural...
Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts an audio signal as input and outputs a textual description (i.e. a caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of eight to 20 words length, together with a baseline method to provide initial results. Clotho is built with a focus on caption diversity, and the splits of the data are such that they do not hamper the training or...
Our aim was to investigate the production of antioxidant activity during fermentation with commonly used dairy starter cultures and, moreover, to study its development during fermentation and its connection to proteolysis and bacterial growth. Antioxidant activity was measured by analysing radical scavenging using a spectrophotometric decolorization assay, and lipid peroxidation inhibition was assayed in a liposomal model system with a fluorescence method. Milk was fermented with 25 lactic acid bacteria (LAB) strains, and from these, the six strains exhibiting the highest activity were selected for further...
Voice conversion can be formulated as finding a mapping function which transforms the features of a source speaker to those of a target speaker. Gaussian mixture model (GMM)-based conversion is commonly used, but it is subject to overfitting. In this paper, we propose the use of partial least squares (PLS)-based conversion. To prevent overfitting, the degrees of freedom are controlled by choosing a suitable number of components. We present a technique to combine PLS with GMMs, enabling multiple local linear mappings. To further improve...
We propose a nonparametric framework for voice conversion, that is, exemplar-based sparse representation with residual compensation. In this framework, a spectrogram is reconstructed as a weighted linear combination of speech segments, called exemplars, which span multiple consecutive frames. The weights are constrained to be sparse to avoid over-smoothing, and high-resolution spectra are employed directly in the exemplars, without dimensionality reduction, to maintain spectral details. In addition, a compression...
This paper proposes to use low-level spatial features extracted from multichannel audio for sound event detection. We extend the convolutional recurrent neural network to handle more than one type of these features by learning each of them separately in the initial stages of the network. We show that, instead of concatenating the features of each channel into a single feature vector, the network learns sound events better when they are presented as separate layers of a volume. Using the proposed features over monaural features on the same network gives an absolute F-score improvement of 6.1% on a publicly available...
A drawback of many voice conversion algorithms is that they rely on linear models and/or require a lot of tuning. In addition, many of them ignore the inherent time-dependency between speech features. To address these issues, we propose to use the dynamic kernel partial least squares (DKPLS) technique to model nonlinearities as well as to capture dynamics in the data. The method is based on a kernel transformation of the source features to allow non-linear modeling, and on concatenation of the previous and next frames to model the dynamics. Partial least squares regression is used to find...