Javier Hernando

ORCID: 0000-0002-1730-8154
Research Areas
  • Speech and Audio Processing
  • Speech Recognition and Synthesis
  • Music and Audio Processing
  • Advanced Data Compression Techniques
  • Natural Language Processing Techniques
  • Speech and Dialogue Systems
  • Video Surveillance and Tracking Methods
  • Advanced Adaptive Filtering Techniques
  • Indoor and Outdoor Localization Technologies
  • Face and Expression Recognition
  • Face Recognition and Analysis
  • Biometric Identification and Security
  • Historical Art and Architecture Studies
  • User Authentication and Security Systems
  • Linguistic Studies and Language Acquisition
  • Gait Recognition and Analysis
  • Social Sciences and Policies
  • Photographic and Visual Arts
  • Spanish Linguistics and Language Studies
  • Multi-Agent Systems and Negotiation
  • Emotion and Mood Recognition
  • Phonetics and Phonology Research
  • Advanced Image and Video Retrieval Techniques
  • Time Series Analysis and Forecasting
  • Anomaly Detection Techniques and Applications

Universitat Politècnica de Catalunya
2015-2024

Universitat Ramon Llull
2024

Barcelona Supercomputing Center
2024

National Student Clearinghouse Research Center
2020

Louisiana State University
2018

University of Surrey
2003

Universidad de Zaragoza
1994

When performing speaker diarization on recordings from meetings, multiple microphones of different qualities are usually available, distributed around the meeting room. Although several approaches have been proposed in recent years to take advantage of multiple microphones, they are either too computationally expensive and not easily scalable, or they cannot outperform the simpler case of using the best single microphone. In this paper, the use of classic acoustic beamforming techniques is proposed together with several novel algorithms to create a...

10.1109/tasl.2007.902460 article EN IEEE Transactions on Audio, Speech, and Language Processing 2007-08-22

Paper presented at the 8th Annual Conference of the International Speech Communication Association, held in Antwerp (Belgium), 27-31 August 2007.

10.21437/interspeech.2007-147 article EN Interspeech 2007-08-27

Most state-of-the-art Deep Learning (DL) approaches for speaker recognition work at a short utterance level. Given the speech signal, these algorithms extract a sequence of speaker embeddings from short segments, and those are averaged to obtain an utterance-level representation. In this work we propose the use of an attention mechanism to obtain a discriminative speaker embedding given non-fixed-length speech utterances. Our system is based on a Convolutional Neural Network (CNN) that encodes short-term speaker features from the spectrogram and a self multi-head attention model that maps these representations...

10.21437/interspeech.2019-2616 article EN Interspeech 2019-09-13
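The attention-based pooling described above can be sketched with plain NumPy. The query matrix `U`, the number of heads, and all shapes below are illustrative assumptions, not the paper's actual configuration: each head scores every frame, takes a softmax over time, and pools its slice of the embedding dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention_pooling(H, U):
    """Pool a variable-length sequence of frame embeddings H (T x D)
    into a single utterance-level vector via multi-head attention.
    U (D x n_heads) is a hypothetical learnable query matrix."""
    T, D = H.shape
    n_heads = U.shape[1]
    d_head = D // n_heads
    scores = H @ U                          # (T, n_heads) alignment scores
    weights = np.exp(scores - scores.max(axis=0))
    weights /= weights.sum(axis=0)          # softmax over time, per head
    # each head pools its own slice of the embedding dimensions
    pooled = [weights[:, h] @ H[:, h * d_head:(h + 1) * d_head]
              for h in range(n_heads)]
    return np.concatenate(pooled)           # (D,) utterance embedding

H = rng.standard_normal((37, 8))            # 37 frames, 8-dim features
U = rng.standard_normal((8, 2))             # 2 attention heads
e = self_attention_pooling(H, U)
print(e.shape)                              # (8,)
```

In a trained system `H` would come from the CNN encoder and `U` would be learned jointly with it; here both are random, so only shapes are meaningful.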

The use of Deep Belief Networks (DBNs) is proposed in this paper to discriminatively model target and impostor i-vectors in a speaker verification task. The authors propose to adapt the network parameters of each speaker from a background model, which will be referred to as the Universal DBN (UDBN). It is also suggested to backpropagate class errors up to only one layer for a few iterations before training the whole network. Additionally, an impostor selection method is introduced that helps the system outperform the cosine distance classifier. The evaluation is performed on the core...

10.1109/icassp.2014.6853888 article EN 2014-05-01

The computing power of mobile devices limits end-user applications in terms of storage size, processing, memory and energy consumption. These limitations motivate researchers to design more efficient deep models. On the other hand, self-attention networks based on the Transformer architecture have attracted remarkable interest due to their high parallelization capabilities and strong performance on a variety of Natural Language Processing (NLP) applications. Inspired by the Transformer, we propose a tandem...

10.21437/interspeech.2020-1446 article EN Interspeech 2020-10-25

Jitter and shimmer are measures of the cycle-to-cycle variations of the fundamental frequency and the amplitude, respectively. Both features have been largely used for the description of pathological voices and, since they characterise some aspects concerning particular voices, they are expected to convey a certain degree of speaker specificity. In the current work, jitter and shimmer are successfully used in a speaker verification experiment. Moreover, both features are combined with spectral and prosodic features using several types of normalisation and fusion techniques in order to obtain better results. The...

10.1049/iet-spr.2008.0147 article EN IET Signal Processing 2009-06-24
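The two voice-quality measures named above have simple relative (local) forms once the pitch periods and peak amplitudes of consecutive glottal cycles have been extracted; this sketch assumes that extraction has already been done, and the example values are purely illustrative.

```python
import numpy as np

def local_jitter(periods):
    """Relative jitter: mean absolute difference between consecutive
    pitch periods, divided by the mean period."""
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def local_shimmer(amplitudes):
    """Relative shimmer: the same measure applied to the peak
    amplitude of each cycle instead of its duration."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

periods = [0.0100, 0.0102, 0.0099, 0.0101]   # seconds per glottal cycle
amps = [0.80, 0.78, 0.82, 0.79]              # peak amplitude per cycle
print(round(local_jitter(periods), 4))       # 0.0232
print(round(local_shimmer(amps), 4))         # 0.0376
```

Other variants (e.g. averaging over more than two neighbouring cycles) exist; the relative forms shown here are the most common starting point.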

The promising performance of Deep Learning (DL) in speech recognition has motivated the use of DL in other speech technology applications such as speaker recognition. Given i-vectors as inputs, the authors proposed an impostor selection algorithm and a universal model adaptation process in a hybrid system based on Deep Belief Networks (DBNs) and Deep Neural Networks (DNNs) to discriminatively model each target speaker. In order to gain more insight into the behavior of these techniques in both single- and multi-session enrollment tasks, some experiments have been carried out...

10.1109/taslp.2017.2661705 article EN IEEE/ACM Transactions on Audio, Speech, and Language Processing 2017-02-08

The aim of this paper is to investigate the benefit of combining both language and acoustic modelling for speaker diarization. Although conventional systems only use acoustic features, in some scenarios linguistic data contain highly discriminative speaker information, even more reliable than the acoustic ones. In this study we analyze how an appropriate fusion of both kinds of features is able to obtain good results in these cases. The proposed system is based on an iterative algorithm where an LSTM network is used as a speaker classifier. The network is fed with character-level word...

10.48550/arxiv.2501.17893 preprint EN arXiv (Cornell University) 2025-01-28

Cepstral coefficients are widely used in speech recognition. In this paper, we claim that they are not the best way of representing the spectral envelope, at least for some usual recognition systems. In fact, the cepstrum has several disadvantages: poor physical meaning, the need for a transformation, and low capacity for adaptation. We propose a new representation that significantly outperforms both the mel-cepstrum and LPC-cepstrum techniques in recognition rate and computational cost. It consists of filtering the frequency sequence of filter-bank energies...

10.21437/eurospeech.1995-220 article EN 1995-09-18
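The core operation just described, filtering the log filter-bank energies along the frequency index rather than transforming them with a DCT, can be sketched as follows. The specific filter `H(z) = z - z^-1` (difference between the upper and lower neighbouring bands) and the zero-padding at the band edges are illustrative choices, not necessarily the exact configuration of the paper.

```python
import numpy as np

def frequency_filtering(log_fbe):
    """Apply the first-order filter H(z) = z - z^-1 along the
    frequency axis of log filter-bank energies: each output band is
    the difference between its upper and lower neighbours, with the
    edge bands padded by zeros. This decorrelates the energies while
    keeping them in the frequency domain, unlike the cepstrum."""
    padded = np.pad(log_fbe, ((0, 0), (1, 1)))   # (frames, bands + 2)
    return padded[:, 2:] - padded[:, :-2]        # E[k+1] - E[k-1]

# toy input: 2 frames x 6 mel bands of log energies
frames = np.log(np.arange(1, 13, dtype=float)).reshape(2, 6)
ff = frequency_filtering(frames)
print(ff.shape)                                  # (2, 6)
```

Because the output stays tied to frequency bands, per-band noise compensation and adaptation remain straightforward, which is part of the motivation given in the abstract.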

Over the last few years, i-vectors have been the state-of-the-art technique in speaker recognition. Recent advances in Deep Learning (DL) technology have improved the quality of i-vectors, but the DL techniques in use are computationally expensive and need phonetically labeled background data. The aim of this work is to develop an efficient alternative vector representation of speech by keeping the computational cost as low as possible and avoiding phonetic labels, which are not always accessible. The proposed vectors will be based on both Gaussian...

10.1016/j.csl.2017.06.007 article EN cc-by-nc-nd Computer Speech & Language 2017-06-28

The article presents a robust representation of speech based on AR modeling of the causal part of the autocorrelation sequence. In noisy speech recognition, this new representation achieves better results than several other related techniques.

10.1109/89.554273 article EN IEEE Transactions on Speech and Audio Processing 1997-01-01
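A minimal sketch of the idea above: instead of fitting the AR model to the waveform, fit it to the one-sided (causal) part of the autocorrelation sequence, which is more robust to additive noise. The estimator details, model order, and lag count below are illustrative assumptions.

```python
import numpy as np

def autocorr(x, lags):
    """Biased autocorrelation estimate for lags 0..lags."""
    x = np.asarray(x, dtype=float)
    return np.array([x[:len(x) - k] @ x[k:] for k in range(lags + 1)]) / len(x)

def levinson(r, p):
    """Levinson-Durbin recursion: order-p AR coefficients from r[0..p]."""
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):
        acc = r[i] + a[1:i] @ r[i - 1:0:-1]
        k = -acc / err                     # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)               # prediction error update
    return a, err

def causal_autocorr_ar(x, p, lags=32):
    """AR model fitted to the causal (one-sided) autocorrelation
    sequence of the signal rather than to the signal itself."""
    r_signal = autocorr(x, lags)           # causal part: lags 0..lags
    r = autocorr(r_signal, p)              # autocorrelation of that sequence
    return levinson(r, p)

rng = np.random.default_rng(0)
x = rng.standard_normal(400)               # stand-in for a speech frame
a, err = causal_autocorr_ar(x, p=8)
print(len(a), a[0])                        # 9 1.0
```

In a recognition front end the resulting AR coefficients would then be converted to cepstral or other features frame by frame.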

One of the sub-tasks in the Spring 2004 and 2005 NIST Meetings evaluations requires segmenting multi-party meetings into speaker-homogeneous regions using data from multiple distant microphones (the "MDM" sub-task). One approach to this task is to run a speaker segmentation system on each of the microphone channels separately and then merge the results. This can be thought of as a many-to-one post-processing approach. In this paper we propose an alternative in which we use delay-and-sum beamforming techniques to fuse the signals into a single...

10.1109/asru.2005.1566478 article EN 2005-01-01
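A common way to implement this kind of channel fusion is to estimate each channel's delay against a reference channel with GCC-PHAT and then delay-and-sum. The sketch below assumes one global delay per channel for simplicity; meeting systems typically re-estimate the delays over short analysis windows.

```python
import numpy as np

def gcc_phat_delay(x, ref):
    """Estimate the delay (in samples) of channel x relative to a
    reference channel with GCC-PHAT: the cross-power spectrum is
    whitened to unit magnitude, and the peak of its inverse FFT
    gives the time difference of arrival."""
    n = len(x) + len(ref)
    X = np.fft.rfft(x, n)
    R = np.fft.rfft(ref, n)
    cross = X * np.conj(R)
    cross /= np.maximum(np.abs(cross), 1e-12)   # PHAT weighting
    cc = np.fft.irfft(cross, n)
    # reorder so lags run from -(len(ref)-1) to len(x)-1
    lags = np.concatenate((cc[-(len(ref) - 1):], cc[:len(x)]))
    return int(np.argmax(lags)) - (len(ref) - 1)

def delay_and_sum(channels, ref_idx=0):
    """Align every channel to the reference channel and average."""
    ref = channels[ref_idx]
    out = np.zeros(len(ref))
    for ch in channels:
        d = gcc_phat_delay(ch, ref)
        out += np.roll(ch, -d)                  # advance a delayed channel
    return out / len(channels)
```

Averaging the aligned channels reinforces the coherent speech component while uncorrelated noise partially cancels, which is why the fused signal can beat the best single microphone.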

Modern GPUs have evolved into fully programmable parallel stream multiprocessors. Due to the nature of their graphic workloads, computer vision algorithms are in a good position to leverage the computing power of these devices. An interesting problem that greatly benefits from parallelism is face detection. This paper presents a highly optimized Haar-based detector that works in real time over high-definition videos. The proposed kernel operations exploit both coarse- and fine-grain parallelism for performing integral image...

10.1109/iccvw.2011.6130288 article EN 2011-11-01
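The integral image at the heart of Haar-feature evaluation is simple to express sequentially (the GPU kernels in the paper parallelize the same row/column prefix sums):

```python
import numpy as np

def integral_image(img):
    """Summed-area table: ii[y, x] holds the sum of img[:y, :x].
    Built with two cumulative sums; any rectangular sum afterwards
    costs only four lookups."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] in O(1) via the integral image."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

img = np.arange(12).reshape(3, 4)        # toy 3x4 "image"
ii = integral_image(img)
print(rect_sum(ii, 1, 1, 3, 3))          # 5 + 6 + 9 + 10 = 30
```

Haar features are differences of such rectangle sums, so once the table is built every feature evaluates in constant time regardless of window size.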

Voices can be deliberately disguised by means of human imitation or voice conversion. The question arises as to what extent they can be modified using either of both methods. In the current paper, a set of speaker identification experiments is conducted: first, analysing some prosodic features extracted from the voices of professional impersonators attempting to mimic a target voice and, second, evaluating intragender and crossgender converted voices in a spectral-based speaker recognition system. The results obtained show that the error rate increases...

10.1558/ijsll.v17i1.119 article EN International Journal of Speech Language and the Law 2010-06-15

Simultaneous speech poses a challenging problem for conventional speaker diarization systems. In meeting data, a substantial amount of the missed speech error is due to overlaps, since usually only one speaker label per segment is assigned. Furthermore, simultaneous speech included in the training data can lead to corrupt speaker models and thus worse segmentation performance. In this paper, we propose the use of three spatial cross-correlation-based features together with spectral information for overlap detection on distant microphones...

10.1109/tasl.2011.2160167 article EN IEEE Transactions on Audio, Speech, and Language Processing 2012-01-31

Overlapping speech is responsible for a certain amount of errors produced by standard speaker diarization systems in a meeting environment. We are investigating a set of prosody-based long-term features as a potential complement to our overlap detection system relying on short-term spectral parameters. The most relevant features are selected in a two-step process. They are firstly evaluated and sorted according to the mRMR criterion, and then the optimal number of features is determined in an iterative wrapper approach. We show that the addition...

10.21437/interspeech.2011-389 article EN Interspeech 2011-08-27

In this article, we present the evaluation results for the task of speaker diarization of broadcast news, which was part of the Albayzin 2010 evaluation campaign of language and speech technologies. The evaluation data consist of a subset of a Catalan broadcast news database recorded from the 3/24 TV channel. A description of the five systems submitted by different research labs is given, marking common as well as distinctive system features. Their performance is analyzed in the context of the diarization error rate, the number of detected speakers and also the acoustic background conditions. An effort is made to...

10.1186/1687-4722-2012-19 article EN cc-by EURASIP Journal on Audio, Speech, and Music Processing 2012-07-31

In this paper we propose an impostor selection method for a Deep Belief Network (DBN) based system which models i-vectors in a multi-session speaker verification task. In the proposed method, instead of choosing a fixed number of the most informative impostors, a threshold is defined according to the frequencies of the impostors. The selected impostors are then clustered, and the centroids are considered as the final impostors for the target speakers. The system first trains each model unsupervisedly by an adaptation method and then models it discriminatively using...

10.21437/odyssey.2014-46 article EN 2014-06-16

With the emergence of GPU computing, deep neural networks have become a widely used technique for advancing research in the fields of image and speech processing. In the context of object and event detection, sliding-window classifiers require choosing the best among all positively discriminated candidate windows. In this paper, we introduce the first GPU-based non-maximum suppression (NMS) algorithm for embedded architectures. The obtained results show that the proposed parallel algorithm reduces the NMS latency by a wide margin when compared...

10.1109/icassp.2016.7471831 article EN 2016-03-01
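The sequential greedy NMS that the paper parallelizes can be sketched as follows; the `[x1, y1, x2, y2]` box format and the 0.5 IoU threshold are conventional assumptions, not taken from the paper.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop every remaining box whose IoU with it exceeds the threshold,
    and repeat. boxes: (N, 4) array of [x1, y1, x2, y2]."""
    order = np.argsort(scores)[::-1]        # indices by descending score
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # intersection of the kept box with all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]     # suppress heavy overlaps
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))                   # [0, 2]
```

The data dependency in the `while` loop (each kept box decides which candidates survive) is exactly what makes a naive GPU port hard and motivates the map/reduce reformulation described in the abstract.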

This article presents a multimodal approach to head pose estimation of individuals in environments equipped with multiple cameras and microphones, such as SmartRooms or automatic video conferencing setups. Determining the head orientation is the basis for many forms of more sophisticated interaction between humans and technical devices and can also be used for sensor selection (camera, microphone) in communications and surveillance systems. The use of particle filters as a unified framework for both the monomodal and multimodal cases is proposed. In video, we...

10.1155/2008/276846 article EN cc-by EURASIP Journal on Advances in Signal Processing 2007-06-12
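A bootstrap particle filter of the kind the article builds on can be sketched for a single scalar state (a head-pan angle in degrees); the random-walk motion model, Gaussian likelihood, noise levels, and particle count are all illustrative assumptions, and a real system would fuse separate audio and video likelihoods.

```python
import numpy as np

rng = np.random.default_rng(2)

def particle_filter_step(particles, weights, observation,
                         motion_std=5.0, obs_std=10.0):
    """One bootstrap particle filter update for a scalar pan angle:
    propagate with random-walk dynamics, reweight each particle by a
    Gaussian likelihood of the observation, then resample."""
    # predict: random-walk motion model
    particles = particles + rng.normal(0.0, motion_std, size=len(particles))
    # update: Gaussian observation likelihood
    weights = weights * np.exp(-0.5 * ((observation - particles) / obs_std) ** 2)
    weights /= weights.sum()
    # resample (multinomial) to avoid weight degeneracy
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    particles = particles[idx]
    weights = np.full(len(particles), 1.0 / len(particles))
    return particles, weights

particles = rng.uniform(-90.0, 90.0, 500)   # initial pan hypotheses
weights = np.full(500, 1.0 / 500)
for obs in [10.0, 12.0, 11.0, 13.0]:        # noisy pan measurements
    particles, weights = particle_filter_step(particles, weights, obs)
estimate = particles.mean()                 # posterior mean estimate
```

With multiple sensors, the update step would multiply one likelihood per modality before normalising, which is what makes the particle filter a convenient unified fusion framework.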

Acoustic event detection (AED) aims at determining the identity of sounds and their temporal position in audio signals. When applied to spontaneously generated acoustic events, AED based only on audio information shows a large amount of errors, which are mostly due to temporal overlaps. Actually, overlaps accounted for more than 70% of the errors in the real-world interactive seminar recordings used in the CLEAR 2007 evaluations. In this paper, we improve the recognition rate of acoustic events using information from both audio and video modalities. First, the data...

10.1155/2011/485738 article EN cc-by EURASIP Journal on Advances in Signal Processing 2011-02-13