- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- Advanced Data Compression Techniques
- Image and Signal Denoising Methods
- Digital Filter Design and Implementation
- Natural Language Processing Techniques
- Phonocardiography and Auscultation Techniques
- Infant Health and Development
- Generative Adversarial Networks and Image Synthesis
- Speech and Dialogue Systems
Universitat Politècnica de Catalunya
2014-2021
European Media Laboratory (Germany)
2019-2021
Shahid Beheshti University
2009-2011
The use of Deep Belief Networks (DBNs) is proposed in this paper to discriminatively model target and impostor i-vectors in a speaker verification task. The authors propose to adapt the network parameters of each speaker from a background model, which will be referred to as the Universal DBN (UDBN). It is also suggested to backpropagate class errors up to only one layer for a few iterations before training the network. Additionally, an impostor selection method is introduced which helps to outperform the cosine distance classifier. The evaluation is performed on the core...
The promising performance of Deep Learning (DL) in speech recognition has motivated the use of DL in other speech technology applications such as speaker recognition. Given i-vectors as inputs, the authors proposed an impostor selection algorithm and a universal model adaptation process in a hybrid system based on Deep Belief Networks (DBN) and Deep Neural Networks (DNN) to discriminatively model each target speaker. In order to have more insight into the behavior of DL techniques in both single and multi-session speaker enrollment tasks, some experiments have been carried out...
Over the last few years, i-vectors have been the state-of-the-art technique in speaker recognition. Recent advances in Deep Learning (DL) technology have improved the quality of i-vectors, but the DL techniques in use are computationally expensive and need phonetically labeled background data. The aim of this work is to develop an efficient alternative vector representation of speech by keeping the computational cost as low as possible and avoiding phonetic labels, which are not always accessible. The proposed vectors will be based on both Gaussian...
In this paper we propose an impostor selection method for a Deep Belief Network (DBN) based system which models i-vectors in a multi-session speaker verification task. In the proposed method, instead of choosing a fixed number of the most informative impostors, a threshold is defined according to the frequencies of impostors. The selected impostors are then clustered and the centroids are considered as the final impostors for the target speakers. The system first trains each target speaker model in an unsupervised way by an adaptation method and then models it discriminatively using...
Restricted Boltzmann Machines (RBMs) have shown success in different stages of speaker recognition systems. In this paper, we propose a novel framework to produce a vector-based representation for each speaker, which will be referred to as the RBM-vector. This new approach maps the speaker spectral features to a single fixed-dimensional vector carrying speaker-specific information. In this work, a global model, referred to as the Universal RBM (URBM), is trained taking advantage of RBM unsupervised learning capabilities. Then, the URBM is adapted to the data...
In this paper, we propose to discriminatively model target and impostor spectral features using Deep Belief Networks (DBNs) for speaker recognition. At the feature level, the number of impostor samples is considerably large compared to previous works based on i-vectors. Therefore, those i-vector based impostor selection algorithms are not computationally practical. On the other hand, the number of samples differs from one target speaker to another, which makes the training process more difficult. In this work, we take advantage of DBN unsupervised learning to train a global model,...
The use of Restricted Boltzmann Machines (RBM) is proposed in this paper as a non-linear transformation of GMM supervectors for speaker recognition. It will be shown that the RBM transformation increases the discrimination power of raw GMM supervectors. The experimental results on the core test condition of the NIST SRE 2006 corpus show that RBM supervectors achieve performance comparable to i-vectors. Furthermore, the combination of RBM supervectors and i-vectors at the score level improves the i-vector approach by more than 10% in terms of EER.
The acoustic environment of a typical neonatal intensive care unit (NICU) is very rich and may contain a large number of different sounds, which come either from the equipment or from the human activities taking place in it. There exists a medical concern about the effect of that acoustical environment on preterm infants, since loud sounds in particular may be harmful for their further neurological development. In this work, first of all, an initial description of the acoustic characteristics of the NICU has been carried out using a set of diverse recordings...
This paper reports on the results of four re-encoding schemes for perceptually quantized wavelet packet transform (WPT) coefficients of audio and high quality speech. These schemes comprise: 1- The embedded zero-tree wavelet (EZW), 2- The set partitioning in hierarchical trees (SPIHT), 3- JPEG-based entropy/run-length Huffman, and 4- JPEG-type coding algorithms. Since EZW and SPIHT are designed for image compression, some new modifications have been implemented on these algorithms for their better matching with audio and speech signals. The performances of these re-encoders...
This paper evaluates the problems of implementing two well-known zero-tree-based re-encoding schemes, Embedded Zero-tree Wavelet (EZW) and set partitioning in hierarchical trees (SPIHT), for perceptual audio and high quality speech coding. Since the original EZW and SPIHT algorithms are designed for image compression, some new modifications have been implemented on these algorithms for their better matching with audio and speech signals. The performances of these re-encoders are compared in terms of average output bit rate and computation time for the same codec. It is...
This paper is focused on the application of Language Identification (LID) technology for intelligent vehicles. We cope with short sentences or words spoken in moving cars in four languages: English, Spanish, German, and Finnish. As the response time of the LID system is crucial for user acceptance in this particular task, speech signals of different durations with a total average of 3.8 s are analyzed. In this paper, the authors propose the use of Deep Neural Networks (DNN) to effectively model the i-vector space of languages. Both raw i-vectors...
In this paper an efficient and low complexity perceptual method is proposed for quantizing the wavelet packet coefficients of high quality speech signals. The performance of the proposed method is compared, using the same codec, with the case where all coefficients are quantized with a fixed number of bits. The results on 500 TIMIT files show that the proposed method, based on some basic perceptual considerations, achieves about 15-35% reduction in average bit-rates with almost the same or even better qualities.
A fast, efficient and scalable algorithm, called "adaptive variable degree-k zero-trees" (AVDZ), is proposed in this paper for re-encoding of perceptually quantized wavelet-packet transform (WPT) coefficients of audio and high quality speech. The quantization process is carried out by taking into account some basic perceptual considerations, and achieves good subjective quality with low complexity. The performance of the proposed AVDZ algorithm is compared with two other zero-tree-based schemes comprising: 1- Embedded Zero-tree Wavelet...
In this paper an adaptive variable degree-k zero-tree (AVDZ) algorithm is proposed for re-encoding of perceptually quantized wavelet packet transform coefficients of high quality wideband speech. Its performance is compared with two well-known zero-tree-based schemes comprising: 1- Embedded Zero-tree Wavelet (EZW) and 2- The set partitioning in hierarchical trees (SPIHT). This comparison is carried out using the speech-modified versions of these schemes. It is shown that AVDZ outperforms these methods by about 6-10% in coding...
This technical report describes the EML submission to the first VoxCeleb speaker diarization challenge. Although the aim of the challenge has been the offline processing of the signals, the submitted system is basically an online algorithm which decides about the speaker labels in runtime approximately every 1.2 sec. For the first phase of the challenge, only the VoxCeleb2 dev dataset was used for training. The results on the provided VoxConverse set show much better accuracy in terms of both DER and JER compared to the baseline. The real-time factor of the whole process is 0.01 using...
Speech Activity Detection (SAD), locating speech segments within an audio recording, is a main part of most speech technology applications. Robust SAD is usually more difficult in noisy conditions with varying signal-to-noise ratios (SNR). The Fearless Steps challenge has recently provided such data from the NASA Apollo-11 mission for different speech processing tasks including SAD. Most audio recordings are degraded by different kinds and levels of noise that vary between channels. This paper describes the EML online algorithm for the recent phase...