Tuomas Virtanen

ORCID: 0000-0002-4604-9729
Research Areas
  • Speech and Audio Processing
  • Music and Audio Processing
  • Speech Recognition and Synthesis
  • Blind Source Separation Techniques
  • Music Technology and Sound Studies
  • Allergic Rhinitis and Sensitization
  • Advanced Adaptive Filtering Techniques
  • Diverse Musicological Studies
  • Food Allergy and Anaphylaxis Research
  • Asthma and respiratory diseases
  • Underwater Acoustics Research
  • Contact Dermatitis and Allergies
  • Hearing Loss and Rehabilitation
  • Acoustic Wave Phenomena Research
  • Video Analysis and Summarization
  • Animal Vocal Communication and Behavior
  • Noise Effects and Management
  • Occupational exposure and asthma
  • Natural Language Processing Techniques
  • Geophysical Methods and Applications
  • Arctic and Antarctic ice dynamics
  • Advanced Data Compression Techniques
  • Structural Health Monitoring Techniques
  • Monoclonal and Polyclonal Antibodies Research
  • Anomaly Detection Techniques and Applications

Tampere University
2016-2025

Nokia (Finland)
2025

University of Surrey
2023

University of Eastern Finland
2010-2022

Signal Processing (United States)
2015-2021

Institute of Electrical and Electronics Engineers
2021

Tampere University of Applied Sciences
2007-2018

Tampere University
2008-2018

Shenyang Institute of Automation
2016

Chinese Academy of Sciences
2016

An unsupervised learning algorithm for the separation of sound sources in one-channel music signals is presented. The algorithm is based on factorizing the magnitude spectrogram of an input signal into a sum of components, each of which has a fixed spectrum and a time-varying gain. Each sound source, in turn, is modeled as a sum of one or more components. The parameters of the components are estimated by minimizing the reconstruction error between the input spectrogram and the model, while restricting the component spectrograms to be nonnegative and favoring components whose gains are slowly varying and sparse...

10.1109/tasl.2006.885253 article EN IEEE Transactions on Audio Speech and Language Processing 2007-03-01
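The factorization this abstract describes can be sketched in a few lines of NumPy: multiplicative updates keep the spectra and gains nonnegative, and an L1 penalty on the gains stands in for the paper's sparseness criterion. The temporal-continuity term is omitted here, and `alpha` is an illustrative weight, not a value from the paper.

```python
import numpy as np

def nmf_separate(V, n_components=20, n_iter=200, alpha=0.1):
    """Factorize a magnitude spectrogram V (freq x time) as V ~ B @ G,
    with nonnegative fixed spectra B and time-varying gains G."""
    rng = np.random.default_rng(0)
    F, T = V.shape
    B = rng.random((F, n_components)) + 1e-9   # component spectra
    G = rng.random((n_components, T)) + 1e-9   # time-varying gains
    for _ in range(n_iter):
        # multiplicative updates for the Euclidean reconstruction error;
        # alpha penalizes the L1 norm of the gains (sparseness)
        G *= (B.T @ V) / (B.T @ (B @ G) + alpha + 1e-9)
        B *= (V @ G.T) / (B @ (G @ G.T) + 1e-9)
    return B, G
```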

Given the recent surge in developments of deep learning, this paper provides a review of state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side-by-side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential areas for cross fertilization. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including...

10.1109/jstsp.2019.2908700 article EN IEEE Journal of Selected Topics in Signal Processing 2019-04-01
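As a concrete example of the dominant input representation the review discusses, a log-mel spectrogram can be computed with librosa; the sample rate, FFT size, hop length and mel-band count below are typical choices, not values prescribed by the paper.

```python
import numpy as np
import librosa

def logmel(path, sr=16000, n_fft=1024, hop=512, n_mels=64):
    """Compute a log-mel spectrogram from an audio file."""
    y, sr = librosa.load(path, sr=sr)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                       hop_length=hop, n_mels=n_mels)
    return np.log(S + 1e-10)  # shape: (n_mels, frames)
```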

Sound events often occur in unstructured environments where they exhibit wide variations in their frequency content and temporal structure. Convolutional neural networks (CNNs) are able to extract higher level features that are invariant to local spectral variations. Recurrent neural networks (RNNs) are powerful in learning the longer term context of audio signals. CNNs and RNNs as classifiers have recently shown improved performances over established methods in various sound recognition tasks. We combine these two approaches in a convolutional recurrent neural...

10.1109/taslp.2017.2690575 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2017-05-23
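A minimal PyTorch sketch of such a convolutional recurrent architecture: the convolutional layers pool only along frequency so the frame rate is preserved, a bidirectional GRU models longer-term context, and frame-wise sigmoids give multi-label activities. Layer sizes here are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_mels=40, n_classes=6, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),            # pool frequency, keep time
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.rnn = nn.GRU(32 * (n_mels // 4), hidden,
                          batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                    # x: (batch, 1, n_mels, frames)
        z = self.conv(x)                     # (batch, 32, n_mels//4, frames)
        z = z.permute(0, 3, 1, 2).flatten(2) # (batch, frames, features)
        z, _ = self.rnn(z)
        return torch.sigmoid(self.out(z))    # per-frame class activities
```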

This paper presents and discusses various metrics proposed for evaluation of polyphonic sound event detection systems used in realistic situations where there are typically multiple sound sources active simultaneously. The system output in this case contains overlapping events, marked as multiple sounds detected as being active at the same time. This requires a suitable procedure for comparison against a reference. Metrics from neighboring fields such as speech recognition and speaker diarization can be used, but they need to be partially redefined...

10.3390/app6060162 article EN cc-by Applied Sciences 2016-05-25
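A sketch of segment-based evaluation in the spirit of these metrics: the error rate decomposes per-segment mismatches into substitutions, deletions and insertions, alongside a micro-averaged F-score. This is a simplified reading of the definitions, not a replacement for the reference sed_eval implementation.

```python
import numpy as np

def segment_metrics(ref, est):
    """Segment-based error rate and F-score for polyphonic SED.
    ref, est: binary (n_segments x n_classes) activity matrices."""
    ref = ref.astype(bool); est = est.astype(bool)
    tp = (ref & est).sum()
    fp = (~ref & est).sum()
    fn = (ref & ~est).sum()
    # per-segment decomposition into substitutions/deletions/insertions
    fn_seg = (ref & ~est).sum(axis=1)
    fp_seg = (~ref & est).sum(axis=1)
    S = np.minimum(fn_seg, fp_seg).sum()
    D = (fn_seg - np.minimum(fn_seg, fp_seg)).sum()
    I = (fp_seg - np.minimum(fn_seg, fp_seg)).sum()
    er = (S + D + I) / max(ref.sum(), 1)     # error rate
    f1 = 2 * tp / max(2 * tp + fp + fn, 1)   # micro-averaged F-score
    return er, f1
```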

We introduce the TUT Acoustic Scenes 2016 database for environmental sound research, consisting of binaural recordings from 15 different acoustic environments. A subset of this database, called TUT Sound Events 2016, contains annotations of individual sound events, specifically created for sound event detection. It consists of residential area and home environments, and is manually annotated to mark the onset, offset and label of sound events. In this paper we present the recording and annotation procedure, the database content, and a recommended cross-validation setup...

10.1109/eusipco.2016.7760424 article EN 2016 24th European Signal Processing Conference (EUSIPCO) 2016-08-01
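Annotations of the kind described (onset, offset and label per event) are convenient to rasterize into a binary activity matrix for training or evaluation. A small sketch, using a hypothetical in-memory annotation format rather than the dataset's actual file layout:

```python
import numpy as np

def annotations_to_roll(events, labels, duration, hop=1.0):
    """Convert (onset, offset, label) annotations into a binary
    segment-by-class activity matrix with segment length `hop` seconds."""
    n_seg = int(np.ceil(duration / hop))
    roll = np.zeros((n_seg, len(labels)), dtype=int)
    for onset, offset, label in events:
        a = int(onset // hop)
        b = int(np.ceil(offset / hop))
        roll[a:b, labels.index(label)] = 1
    return roll

# hypothetical usage:
# roll = annotations_to_roll([(0.5, 2.3, "speech")], ["speech", "car"], 30.0)
```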

Type I allergy is an immunoglobulin E (IgE)-mediated hypersensitivity disease affecting more than 25% of the population. Currently, diagnosis is performed by provocation testing and IgE serology using allergen extracts. This process defines allergen-containing sources but cannot identify the disease-eliciting allergenic molecules. We have applied microarray technology to develop a miniaturized allergy test containing 94 purified allergen molecules that represent the most common allergen sources. The microarray allows the determination...

10.1096/fj.01-0711fje article EN The FASEB Journal 2002-01-14

In this paper, we propose a convolutional recurrent neural network for joint sound event localization and detection (SELD) of multiple overlapping sound events in three-dimensional (3D) space. The proposed network takes a sequence of consecutive spectrogram time-frames as input and maps it to two outputs in parallel. As the first output, sound event detection (SED) is performed as a multi-label classification task on each time-frame, producing the temporal activity of all sound event classes. The second output is obtained by estimating the 3D Cartesian coordinates of the direction-of-arrival...

10.1109/jstsp.2018.2885636 article EN IEEE Journal of Selected Topics in Signal Processing 2018-12-07
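The two parallel outputs can be sketched as two linear heads on a shared per-frame feature sequence: sigmoid activities for SED and tanh-bounded Cartesian DOA coordinates per class. The dimensions are illustrative, and the real network also includes the convolutional recurrent trunk that produces the shared features.

```python
import torch
import torch.nn as nn

class SELDHeads(nn.Module):
    """Parallel SED and DOA outputs on shared frame features."""
    def __init__(self, feat=128, n_classes=11):
        super().__init__()
        self.sed = nn.Linear(feat, n_classes)        # event activity
        self.doa = nn.Linear(feat, 3 * n_classes)    # x, y, z per class

    def forward(self, z):                # z: (batch, frames, feat)
        act = torch.sigmoid(self.sed(z))
        xyz = torch.tanh(self.doa(z)).view(*z.shape[:2], -1, 3)
        return act, xyz                  # (B, T, C) and (B, T, C, 3)
```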

This paper proposes to use exemplar-based sparse representations for noise robust automatic speech recognition. First, we describe how speech can be modeled as a linear combination of a small number of exemplars from a large exemplar dictionary. The exemplars are time-frequency patches of real speech, each spanning multiple time frames. We then propose to model speech corrupted by additive noise as a linear combination of speech and noise exemplars, and derive an algorithm for recovering this sparse representation from the observed noisy speech. The framework can be used for doing hybrid exemplar-based/HMM recognition...

10.1109/tasl.2011.2112350 article EN IEEE Transactions on Audio Speech and Language Processing 2011-02-09
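The recovery step can be illustrated with nonnegative sparse coding over a dictionary that stacks speech and noise exemplars column-wise. The multiplicative update below minimizes an L1-penalized Euclidean cost; it is a stand-in for the paper's actual solver, which operates on mel-spectrogram patches.

```python
import numpy as np

def sparse_activations(y, A, n_iter=300, lam=0.1):
    """Recover nonnegative, sparse activations x with y ~ A @ x.
    A stacks speech exemplars and noise exemplars as columns."""
    x = np.full(A.shape[1], 1e-3)
    for _ in range(n_iter):
        # multiplicative update; lam penalizes the L1 norm of x
        x *= (A.T @ y) / (A.T @ (A @ x) + lam + 1e-12)
    return x

# the speech estimate then uses only the speech part of the dictionary:
# s_hat = A[:, :n_speech] @ x[:n_speech]
```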

In this paper we present an approach to polyphonic sound event detection in real life recordings based on bi-directional long short term memory (BLSTM) recurrent neural networks (RNNs). A single multilabel BLSTM RNN is trained to map acoustic features of a mixture signal consisting of sounds from multiple classes to binary activity indicators of each event class. Our method is tested on a large database of real-life recordings, with 61 classes (e.g. music, car, speech) from 10 different everyday contexts. The proposed...

10.1109/icassp.2016.7472917 preprint EN 2016-03-01
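A minimal sketch of the single multilabel BLSTM the abstract describes, assuming 40-dimensional acoustic features; binary cross-entropy over per-frame, per-class logits matches the binary activity-indicator targets. Sizes are illustrative.

```python
import torch
import torch.nn as nn

class MultilabelBLSTM(nn.Module):
    """Map acoustic features of a mixture to per-class activity logits."""
    def __init__(self, n_feat=40, hidden=128, n_classes=61):
        super().__init__()
        self.lstm = nn.LSTM(n_feat, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):               # x: (batch, frames, n_feat)
        h, _ = self.lstm(x)
        return self.out(h)              # per-frame, per-class logits

model = MultilabelBLSTM()
loss = nn.BCEWithLogitsLoss()(model(torch.randn(2, 100, 40)),
                              torch.randint(0, 2, (2, 100, 61)).float())
loss.backward()
```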

Public evaluation campaigns and datasets promote active development in target research areas, allowing direct comparison of algorithms. The second edition of the challenge on detection and classification of acoustic scenes and events (DCASE 2016) has offered such an opportunity for development of state-of-the-art methods, and succeeded in drawing together a large number of participants from academic and industrial backgrounds. In this paper, we report on the tasks and outcomes of the DCASE 2016 challenge. The challenge comprised four tasks: acoustic scene classification,...

10.1109/taslp.2017.2778423 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2017-11-30

In this paper, the use of multi label neural networks is proposed for detection of temporally overlapping sound events in realistic environments. Real-life recordings typically have many overlapping sound events, making it hard to recognize each event with standard methods. Frame-wise spectral-domain features are used as inputs to train a deep neural network for multi label classification in this work. The model is evaluated with recordings from realistic everyday environments, and the obtained overall accuracy is 63.8%. The method is compared against a state-of-the-art method using non-negative...

10.1109/ijcnn.2015.7280624 article EN 2015 International Joint Conference on Neural Networks (IJCNN) 2015-07-01

This paper proposes a deep neural network for estimating the directions of arrival (DOA) of multiple sound sources. The proposed stacked convolutional and recurrent neural network (DOAnet) generates a spatial pseudo-spectrum (SPS) along with the DOA estimates in both azimuth and elevation. We avoid any explicit feature extraction step by using the magnitudes and phases of the spectrograms of all the channels as input to the network. The DOAnet is evaluated by estimating the DOAs of multiple concurrently present sources in anechoic, matched and unmatched reverberant conditions. The results...

10.23919/eusipco.2018.8553182 article EN 2018 26th European Signal Processing Conference (EUSIPCO) 2018-09-01

The work presented in this article studies how context information can be used in the automatic sound event detection process, and how the detection system can benefit from such information. Humans use context information to make more accurate predictions about sound events and to rule out unlikely events given the context. We propose a similar utilization of context information in the detection process. The proposed approach is composed of two stages: a context recognition stage and a sound event detection stage. Contexts are modeled using Gaussian mixture models and sound events using three-state left-to-right hidden Markov models. In the first stage, the context of the tested audio...

10.1186/1687-4722-2013-1 article EN cc-by EURASIP Journal on Audio Speech and Music Processing 2013-01-09
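The first stage, context recognition, can be sketched with one Gaussian mixture model per context and classification by maximum average log-likelihood; scikit-learn is used here for brevity, and the component count is illustrative rather than the paper's setting.

```python
from sklearn.mixture import GaussianMixture

def train_context_models(features_by_context, n_components=8):
    """Fit one GMM per context from (context -> frame-feature array)."""
    return {c: GaussianMixture(n_components=n_components).fit(f)
            for c, f in features_by_context.items()}

def recognize_context(models, features):
    """Pick the context whose GMM gives the highest mean log-likelihood."""
    return max(models, key=lambda c: models[c].score(features))
```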

This paper introduces the acoustic scene classification task of the DCASE 2018 Challenge and the TUT Urban Acoustic Scenes dataset provided for the task, and evaluates the performance of a baseline system in the task. As in previous years of the challenge, the task is defined as classification of short audio samples into one of predefined acoustic scene classes, using a supervised, closed-set setup. The newly recorded dataset consists of ten different acoustic scenes and was recorded in six large European cities, therefore it has higher acoustic variability than the datasets used for this task in previous years. In addition to the high-quality binaural...

10.48550/arxiv.1807.09840 preprint EN other-oa arXiv (Cornell University) 2018-01-01

Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts an audio signal as input and outputs a textual description (i.e. a caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of eight to 20 words length, and a baseline method to provide initial results. Clotho is built with focus on caption diversity, and the splits of the data are such that they do not hamper the training or...

10.1109/icassp40776.2020.9052990 article EN ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020-04-09

To investigate the production of antioxidant activity during fermentation with commonly used dairy starter cultures, and, moreover, to study the development of antioxidant activity during fermentation and its connection to proteolysis and bacterial growth. Antioxidant activity was measured by analysing radical scavenging activity using a spectrophotometric decolorization assay, and lipid peroxidation inhibition was assayed in a liposomal model system using a fluorescence method. Milk was fermented with 25 lactic acid bacteria (LAB) strains, and from these, the six exhibiting the highest activity were selected for further...

10.1111/j.1365-2672.2006.03072.x article EN Journal of Applied Microbiology 2006-08-01

Voice conversion can be formulated as finding a mapping function which transforms the features of the source speaker to those of the target speaker. Gaussian mixture model (GMM)-based conversion is commonly used, but it is subject to overfitting. In this paper, we propose to use partial least squares (PLS)-based transformation in voice conversion. To prevent overfitting, the degrees of freedom in the mapping are controlled by choosing a suitable number of components. We propose a technique to combine PLS with GMMs, enabling multiple local linear mappings. To further improve...

10.1109/tasl.2010.2041699 article EN IEEE Transactions on Audio Speech and Language Processing 2010-04-09
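The PLS mapping itself is straightforward to sketch with scikit-learn on time-aligned source/target feature frames. The data below are synthetic and the component count is illustrative, since choosing it is precisely how the method controls overfitting.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 24))    # dummy aligned source-speaker frames
Y = X @ rng.standard_normal((24, 24)) \
    + 0.1 * rng.standard_normal((500, 24))  # dummy target-speaker frames

# fewer components -> fewer degrees of freedom -> less overfitting
pls = PLSRegression(n_components=8).fit(X, Y)
Y_hat = pls.predict(X)                # converted features
```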

We propose a nonparametric framework for voice conversion, that is, exemplar-based sparse representation with residual compensation. In this framework, a spectrogram is reconstructed as a weighted linear combination of speech segments, called exemplars, which span multiple consecutive frames. The weights are constrained to be sparse to avoid over-smoothing, and high-resolution spectra are employed in the exemplars directly, without dimensionality reduction, to maintain spectral details. In addition, a compression...

10.1109/taslp.2014.2333242 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2014-06-25

This paper proposes to use low-level spatial features extracted from multichannel audio for sound event detection. We extend the convolutional recurrent neural network to handle more than one type of these features by learning from each of them separately in the initial stages. We show that, instead of concatenating the features of each channel into a single feature vector, the network learns sound events better when they are presented as separate layers of a volume. Using the proposed features over monaural features on the same network gives an absolute F-score improvement of 6.1% on a publicly available...

10.1109/icassp.2017.7952260 article EN 2017-03-01

A drawback of many voice conversion algorithms is that they rely on linear models and/or require a lot of tuning. In addition, many of them ignore the inherent time-dependency between speech features. To address these issues, we propose to use the dynamic kernel partial least squares (DKPLS) technique to model nonlinearities as well as to capture the dynamics in the data. The method is based on a kernel transformation of the source features to allow non-linear modeling, and concatenation of previous and next frames to model the dynamics. Partial least squares regression is used to find...

10.1109/tasl.2011.2165944 article EN IEEE Transactions on Audio Speech and Language Processing 2011-08-25
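The feature construction behind DKPLS can be sketched as a kernel transformation of the source frames followed by concatenation of the neighbouring frames. The RBF kernel, its width, and the reference centers below are illustrative choices, not the paper's exact recipe.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics.pairwise import rbf_kernel

def dkpls_features(X, centers, gamma=0.1):
    """Kernelize source frames X against reference centers, then stack
    previous/next-frame context to capture the dynamics."""
    K = rbf_kernel(X, centers, gamma=gamma)   # non-linear transformation
    prev = np.vstack([K[:1], K[:-1]])         # frame t-1 (edge repeated)
    nxt = np.vstack([K[1:], K[-1:]])          # frame t+1 (edge repeated)
    return np.hstack([prev, K, nxt])

# regression then proceeds with ordinary PLS on the kernel features:
# pls = PLSRegression(n_components=50).fit(dkpls_features(X, C), Y)
```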