Bolaji Yusuf

ORCID: 0000-0001-9852-3456
Research Areas
  • Speech Recognition and Synthesis
  • Natural Language Processing Techniques
  • Music and Audio Processing
  • Speech and Audio Processing
  • Topic Modeling
  • Speech and dialogue systems
  • Advanced Text Analysis Techniques
  • Fault Detection and Control Systems
  • Text Readability and Simplification
  • Multimodal Machine Learning Applications
  • Web Data Mining and Analysis
  • Distributed and Parallel Computing Systems
  • Target Tracking and Data Fusion in Sensor Networks
  • Distributed Sensor Networks and Detection Algorithms
  • Parallel Computing and Optimization Techniques
  • Text and Document Classification Technologies
  • Particle Detector Development and Performance

Boğaziçi University
2017-2023

Brno University of Technology
2020-2023

Amazon (United States)
2022-2023

The transfer of acoustic data across languages has been shown to improve keyword search (KWS) performance in data-scarce settings. In this paper, we propose a way of performing that transfer which reduces the impact of the prevalence of out-of-vocabulary (OOV) terms on KWS in such settings. We investigate a novel usage of multilingual features for KWS with very little training data in the target languages. The crux of our approach is to use synthetic phone exemplars to convert KWS into a query-by-example task, which we solve with a dynamic time warping algorithm. Using...

10.1109/taslp.2019.2911164 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2019-05-09

In this paper, we propose a novel approach to keyword search (KWS) in low-resource languages, which provides an alternative method for retrieving the terms of interest, especially out-of-vocabulary (OOV) ones. Our system incorporates techniques from query-by-example retrieval tasks into KWS and conducts the search by means of the subsequence dynamic time warping (sDTW) algorithm. For this, text queries are modeled as sequences of feature vectors and used as templates in the search. A Siamese neural network-based model is trained...

10.1109/jstsp.2017.2762080 article EN IEEE Journal of Selected Topics in Signal Processing 2017-10-11
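The subsequence DTW matching at the heart of this line of work can be illustrated with a minimal sketch (an illustrative NumPy implementation, not the paper's code; the cosine frame distance and the unconstrained start position are assumptions):

```python
import numpy as np

def subsequence_dtw(query, doc):
    """Subsequence DTW: find the best-matching span of `doc` for `query`.

    query: (N, d) template of feature vectors (e.g. phone-based embeddings)
    doc:   (M, d) speech feature sequence
    Returns (cost, end) of the lowest-cost matching subsequence.
    """
    N, M = len(query), len(doc)
    # Pairwise cosine distances between query frames and document frames.
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    x = doc / np.linalg.norm(doc, axis=1, keepdims=True)
    dist = 1.0 - q @ x.T                      # (N, M)

    # Accumulated cost; the first query frame may start anywhere in the
    # document (this is what makes it *subsequence* DTW).
    D = np.full((N, M), np.inf)
    D[0] = dist[0]
    for i in range(1, N):
        for j in range(M):
            best_prev = D[i - 1, j]           # vertical step
            if j > 0:
                best_prev = min(best_prev, D[i, j - 1], D[i - 1, j - 1])
            D[i, j] = dist[i, j] + best_prev

    # The match may likewise end anywhere in the document.
    end = int(np.argmin(D[-1]))
    return float(D[-1, end]), end
```

A full system would also backtrace the warping path to recover the start time; this sketch only returns the match cost and end frame.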

The modeling of text queries as sequences of embeddings, for conducting similarity-matching-based search within speech features, has recently been shown to improve keyword search (KWS) performance, especially for out-of-vocabulary (OOV) terms. This technique uses a dynamic time warping methodology, converting the KWS problem into a pattern-matching one by artificially generating pronunciation-based embedding sequences. The query modeling is done by concatenating and repeating frame representations of each phoneme in the keyword's pronunciation. In this...

10.1109/lsp.2018.2881610 article EN IEEE Signal Processing Letters 2018-11-15
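The pronunciation-based query construction described above (repeating a frame representation for each phoneme to mimic a pseudo-acoustic sequence) could be sketched as follows; the embedding table and the fixed repeat count are illustrative assumptions, not the paper's learned representations:

```python
import numpy as np

def build_query_template(pronunciation, phone_embeddings, repeats=3):
    """Build a pseudo-acoustic template for a text query.

    pronunciation:    list of phone symbols, e.g. ["k", "ae", "t"]
    phone_embeddings: dict mapping phone symbol -> (d,) embedding vector
    repeats:          frames per phone, standing in for an average
                      per-phone duration (an assumed constant here)
    Returns an (len(pronunciation) * repeats, d) array usable as a
    template in a DTW-based search.
    """
    frames = [phone_embeddings[p] for p in pronunciation for _ in range(repeats)]
    return np.stack(frames)
```

The resulting array plays the role of the "query side" sequence that is then matched against speech features.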

Improving end-to-end (E2E) speech recognition by incorporating external text data has been a longstanding research topic. There has been a recent focus on training E2E ASR models that get the performance benefits of external text without incurring the extra cost of evaluating an external language model at inference time. In this work, we propose training the ASR model jointly with a set of text-to-text auxiliary tasks with which it shares a decoder and parts of the encoder. When we train with masked language modeling on the 960-hour Librispeech and Opensubtitles corpora respectively, we observe WER reductions of 16% and 20%...

10.1109/icassp43922.2022.9746554 article EN ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022-04-27

Conventional keyword search systems operate on automatic speech recognition (ASR) outputs, which causes them to have a complex indexing and search pipeline. This has led to interest in ASR-free approaches that simplify the procedure. We recently proposed a neural model that achieves competitive performance while maintaining an efficient and simplified pipeline, where queries and documents are encoded with a pair of recurrent neural network encoders and the encodings are combined with a dot-product. In this article, we extend that work to the multilingual...

10.1109/taslp.2023.3301239 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2023-01-01
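The dual-encoder design described in this abstract reduces search to a dot product against a pre-computed index of document encodings. A minimal sketch (the encodings below are placeholder vectors, not outputs of the actual recurrent encoders):

```python
import numpy as np

def score(query_enc, doc_encs):
    """Dot-product relevance scores between one query encoding and
    pre-computed per-document encodings: once documents are indexed,
    each query costs a single matrix-vector product."""
    return doc_encs @ query_enc

# Hypothetical pre-computed index of 4 document encodings (dim 3).
index = np.array([[0.1, 0.9, 0.0],
                  [0.8, 0.1, 0.1],
                  [0.0, 0.2, 0.9],
                  [0.7, 0.2, 0.1]])

q = np.array([1.0, 0.0, 0.0])            # query encoding
best = int(np.argmax(score(q, index)))   # index of the highest-scoring document
```

The practical appeal is that document encodings are computed once at indexing time, so query-time work does not grow with encoder depth.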

In this work, we propose a hierarchical subspace model for acoustic unit discovery. In this approach, we frame the task as one of learning embeddings on a low-dimensional phonetic subspace, and simultaneously specify the subspace itself as an embedding on a hyper-subspace. We train the hyper-subspace on a set of transcribed languages and transfer it to the target language. In the target language, we infer both the language and unit embeddings in an unsupervised manner, and in so doing, learn the units specific to that language that dwell on it. We conduct experiments on TIMIT and two low-resource languages: Mboshi and Yoruba....

10.1109/icassp39728.2021.9414899 article EN ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021-05-13

Recently, neural approaches to spoken content retrieval have become popular. However, they tend to be restricted in their vocabulary or in their ability to deal with imbalanced test settings. These restrictions limit their applicability to keyword search, where the set of queries is not known beforehand, and the system should return not just whether an utterance contains a query but the exact location of any such occurrences. In this work, we propose a model directly optimized for keyword search. The model takes a query and an utterance as input and returns a sequence...

10.21437/interspeech.2021-1399 article EN Interspeech 2021 2021-08-27

10.1109/taslp.2024.3407476 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2024-01-01

End-to-end (E2E) approaches to keyword search (KWS) are considerably simpler in terms of training and indexing complexity when compared to approaches which use the output of automatic speech recognition (ASR) systems. This simplification, however, has drawbacks due to the loss of modularity. In particular, where ASR-based KWS systems can benefit from external unpaired text via a language model, current formulations of E2E KWS have no such mechanism. Therefore, in this paper, we propose a multitask training objective which allows unpaired text to be integrated...

10.48550/arxiv.2308.08027 preprint EN arXiv (Cornell University) 2024-07-05

This paper explores speculative speech recognition (SSR), where we empower conventional automatic speech recognition (ASR) with speculation capabilities, allowing the recognizer to run ahead of the audio. We introduce a metric for measuring SSR performance and propose a model which does SSR by combining an RNN-Transducer-based ASR system with an audio-prefixed language model (LM). The ASR system transcribes the ongoing audio and feeds the resulting transcripts, along with an audio-dependent prefix, to the LM, which speculates likely completions of the transcriptions. We experiment...

10.48550/arxiv.2407.04641 preprint EN arXiv (Cornell University) 2024-07-05

End-to-end (E2E) keyword search (KWS) has emerged as an alternative and complementary approach to conventional KWS, which depends on the output of automatic speech recognition (ASR) systems. While E2E methods greatly simplify the KWS pipeline, they generally have worse performance than their ASR-based counterparts, which can benefit from pretraining with untranscribed data. In this work, we propose a method for pretraining E2E KWS systems with untranscribed data, which involves using acoustic unit discovery (AUD) to obtain discrete units for the data and then...

10.48550/arxiv.2407.04652 preprint EN arXiv (Cornell University) 2024-07-05

End-to-end (E2E) keyword search (KWS) has emerged as an alternative and complementary approach to conventional KWS, which depends on the output of automatic speech recognition (ASR) systems. While E2E methods greatly simplify the KWS pipeline, they generally have worse performance than their ASR-based counterparts, which can benefit from pretraining with untranscribed data. In this work, we propose a method for pretraining E2E KWS systems with untranscribed data, which involves using acoustic unit discovery (AUD) to obtain discrete units for the data and then...

10.21437/interspeech.2024-1713 article EN Interspeech 2024 2024-09-01

This paper explores speculative speech recognition (SSR), where we empower conventional automatic speech recognition (ASR) with speculation capabilities, allowing the recognizer to run ahead of the audio. We introduce a metric for measuring SSR performance and propose a model which does SSR by combining an RNN-Transducer-based ASR system with an audio-prefixed language model (LM). The ASR system transcribes the ongoing audio and feeds the resulting transcripts, along with an audio-dependent prefix, to the LM, which speculates likely completions of the transcriptions. We experiment...

10.21437/interspeech.2024-298 article EN Interspeech 2024 2024-09-01

Speaker-attributed automatic speech recognition (ASR) in multi-speaker environments remains a significant challenge, particularly when systems conditioned on speaker embeddings fail to generalize to unseen speakers. In this work, we propose Diarization-Conditioned Whisper (DiCoW), a novel approach to target-speaker ASR that leverages speaker diarization outputs as conditioning information. DiCoW extends the pre-trained Whisper model by integrating diarization labels directly, eliminating reliance on speaker embeddings and reducing the need for...

10.48550/arxiv.2501.00114 preprint EN arXiv (Cornell University) 2024-12-30

End-to-end speech recognition models are improved by incorporating external text sources, typically via fusion with an external language model. Such models have to be retrained whenever the text corpus of interest changes. Furthermore, since they store the entire corpus in their parameters, rare words can be challenging to recall. In this work, we propose augmenting a transducer-based ASR model with a retrieval model, which directly retrieves from the corpus plausible completions for a partial ASR hypothesis. These completions are then integrated into subsequent...

10.1109/icassp49357.2023.10095857 article EN ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023-05-05

Template matching approaches have been proposed as an alternative to large vocabulary continuous speech recognition (LVCSR) based systems for keyword search. These approaches have shown no performance discrepancy between terms seen in the training data and out-of-vocabulary (OOV) terms. Those methods often relied on the use of posteriorgram features. In this paper, we propose using bottleneck features instead, because of their potential for cross-lingual transfer learning. We show the feasibility of template-matching-based keyword search using bottleneck features by learning different...

10.1109/siu.2018.8404534 article EN 2018 26th Signal Processing and Communications Applications Conference (SIU) 2018-05-01

Documenting languages helps to prevent the extinction of endangered dialects, many of which are otherwise expected to disappear by the end of the century. When documenting oral languages, unsupervised word segmentation (UWS) from speech is a useful, yet challenging, task. It consists in producing time-stamps for slicing utterances into smaller segments corresponding to words, and is performed on phonetic transcriptions or, in the absence of these, on the output of discretization models. These models are trained using raw speech only, producing discrete...

10.48550/arxiv.2106.04298 preprint EN cc-by arXiv (Cornell University) 2021-01-01

End-to-end speech recognition models are improved by incorporating external text sources, typically via fusion with an external language model. Such models have to be retrained whenever the text corpus of interest changes. Furthermore, since they store the entire corpus in their parameters, rare words can be challenging to recall. In this work, we propose augmenting a transducer-based ASR model with a retrieval model, which directly retrieves from the corpus plausible completions for a partial ASR hypothesis. These completions are then integrated into subsequent...

10.48550/arxiv.2303.10942 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Fusion of data obtained from different sensors, or of the systems that process them, is a widely applied method for increasing detection accuracy. In event detection problems, however, time-alignment and fusion of the proposed hypotheses becomes a computationally expensive task as the number of systems increases. In this study, a new methodology is proposed which facilitates fast alignment of the hypotheses provided by keyword search systems, and it is shown that the fused system performs better than the best individual system.

10.1109/siu.2018.8404627 article EN 2018 26th Signal Processing and Communications Applications Conference (SIU) 2018-05-01
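The time-alignment-and-fusion idea can be illustrated with a toy merge of hypothesis lists from two KWS systems. The overlap criterion and mean-score combination below are assumptions for illustration, not the method proposed in the paper:

```python
def iou_short(a, b):
    """Overlap between two (start, end) spans, relative to the shorter span."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    shorter = min(a[1] - a[0], b[1] - b[0])
    return inter / shorter if shorter > 0 else 0.0

def fuse(hyps_a, hyps_b, thresh=0.5):
    """Merge keyword hypotheses (start, end, score) from two systems.

    Hypotheses whose time spans overlap by at least `thresh` of the
    shorter span are merged (union of spans, mean of scores); the rest
    are passed through unchanged.
    """
    fused, used_b = [], set()
    for s_a, e_a, sc_a in hyps_a:
        merged = False
        for i, (s_b, e_b, sc_b) in enumerate(hyps_b):
            if i not in used_b and iou_short((s_a, e_a), (s_b, e_b)) >= thresh:
                fused.append((min(s_a, s_b), max(e_a, e_b), (sc_a + sc_b) / 2))
                used_b.add(i)
                merged = True
                break
        if not merged:
            fused.append((s_a, e_a, sc_a))
    # Keep unmatched hypotheses from the second system as well.
    fused += [h for i, h in enumerate(hyps_b) if i not in used_b]
    return fused
```

A real fusion system would also calibrate the per-system scores before combining them; this sketch only shows the alignment-and-merge step.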