- Speech Recognition and Synthesis
- Natural Language Processing Techniques
- Music and Audio Processing
- Speech and Audio Processing
- Topic Modeling
- Speech and Dialogue Systems
- Text Readability and Simplification
- Algorithms and Data Compression
- Stochastic Gradient Optimization Techniques
- Neural Networks and Applications
- Advanced Data Compression Techniques
- Machine Learning and ELM
- Digital Filter Design and Implementation
- Security, Politics, and Digital Transformation
- Advanced Neural Network Applications
- Domain Adaptation and Few-Shot Learning
- Embedded Systems Design Techniques
- Phonetics and Phonology Research
- Cybersecurity and Information Systems
Nvidia (United States)
2018-2025
Nvidia (United Kingdom)
2020-2021
We propose a new end-to-end neural acoustic model for automatic speech recognition. The model is composed of multiple blocks with residual connections between them. Each block consists of one or more modules with 1D time-channel separable convolutional layers, batch normalization, and ReLU layers. It is trained with CTC loss. The proposed network achieves near state-of-the-art accuracy on LibriSpeech and Wall Street Journal, while having fewer parameters than all competing models. We also demonstrate that this model can be...
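As a rough illustration of the block structure described above, here is a minimal PyTorch sketch of a time-channel separable convolution module and a residual block built from it; channel counts, kernel sizes, and the repeat factor are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TimeChannelSeparableConv(nn.Module):
    """1D time-channel separable convolution: a depthwise conv over time
    followed by a pointwise (1x1) conv that mixes channels."""
    def __init__(self, channels, kernel_size):
        super().__init__()
        # padding = k // 2 preserves sequence length for odd kernel sizes
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class SeparableConvBlock(nn.Module):
    """A block of `repeat` separable-conv modules with BN and ReLU, plus a
    residual connection around the whole block (sizes are illustrative)."""
    def __init__(self, channels, kernel_size=33, repeat=3):
        super().__init__()
        mods = []
        for i in range(repeat):
            mods += [TimeChannelSeparableConv(channels, kernel_size),
                     nn.BatchNorm1d(channels)]
            if i < repeat - 1:  # final activation comes after the residual add
                mods.append(nn.ReLU())
        self.body = nn.Sequential(*mods)
        self.act = nn.ReLU()

    def forward(self, x):  # x: (batch, channels, time)
        return self.act(self.body(x) + x)
```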
In this paper, we report state-of-the-art results on LibriSpeech among end-to-end speech recognition models without any external training data. Our model, Jasper, uses only 1D convolutions, batch normalization, ReLU, dropout, and residual connections. To improve training, we further introduce a new layer-wise optimizer called NovoGrad. Through experiments, we demonstrate that the proposed deep architecture performs as well as or better than more complex choices. Our deepest Jasper variant uses 54 convolutional...
NeMo (Neural Modules) is a Python framework-agnostic toolkit for creating AI applications through re-usability, abstraction, and composition. NeMo is built around neural modules, conceptual blocks of neural networks that take typed inputs and produce typed outputs. Such modules typically represent data layers, encoders, decoders, language models, loss functions, or methods of combining activations. NeMo makes it easy to combine and re-use these building blocks while providing a level of semantic correctness checking via its neural type system...
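To illustrate the typed-module idea, here is a toy sketch of the concept, not NeMo's actual API: each module declares input and output types, and a composition helper checks that adjacent modules are compatible.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NeuralType:
    """A named semantic type for module inputs/outputs, e.g. 'AudioSignal'."""
    name: str

class Module:
    """A conceptual neural module with declared input and output types."""
    input_type: NeuralType
    output_type: NeuralType

    def __call__(self, x):
        raise NotImplementedError

def chain(*modules):
    """Compose modules, checking that each output type feeds the next input."""
    for left, right in zip(modules, modules[1:]):
        if left.output_type != right.input_type:
            raise TypeError(
                f"{left.output_type.name} cannot feed {right.input_type.name}")
    def pipeline(x):
        for m in modules:
            x = m(x)
        return x
    return pipeline
```

Composition failures are then caught when the pipeline is built, before any data flows through it.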
We propose NovoGrad, an adaptive stochastic gradient descent method with layer-wise gradient normalization and decoupled weight decay. In our experiments on neural networks for image classification, speech recognition, machine translation, and language modeling, it performs on par with or better than well-tuned SGD with momentum, Adam, and AdamW. Additionally, NovoGrad (1) is robust to the choice of learning rate and weight initialization, (2) works well in a large batch setting, and (3) has a two times smaller memory footprint than Adam.
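A minimal sketch of the per-layer update implied by this description, with a scalar second moment per layer (this is where the memory saving comes from) and decoupled weight decay; hyperparameter defaults and variable names are ours, not necessarily the paper's.

```python
import torch

@torch.no_grad()
def novograd_step(params, moments, lr=0.01, beta1=0.95, beta2=0.98,
                  weight_decay=0.001, eps=1e-8):
    """One NovoGrad-style step. `params` is a list of per-layer tensors with
    .grad populated; `moments` maps each param to [m (tensor), v (float)]."""
    for p in params:
        g = p.grad
        m, v = moments.setdefault(p, [torch.zeros_like(p), 0.0])
        # Layer-wise second moment: one scalar per layer, not per weight.
        v = beta2 * v + (1 - beta2) * float(g.norm() ** 2)
        # Gradient normalized by the layer norm, plus decoupled weight decay.
        update = g / (v ** 0.5 + eps) + weight_decay * p
        m.mul_(beta1).add_(update)
        p.add_(m, alpha=-lr)
        moments[p] = [m, v]
```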
This paper introduces a new multi-speaker English dataset for training text-to-speech models. The dataset is based on LibriVox audiobooks and Project Gutenberg texts, both in the public domain. The dataset contains about 292 hours of speech from 10 speakers, with at least 17 hours per speaker, sampled at 44.1 kHz. To select samples of high quality, we considered only audio recordings with a signal bandwidth of at least 13 kHz and a signal-to-noise ratio (SNR) of at least 32 dB. The dataset is publicly released at http://www.openslr.org/109/.
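A rough sketch of how such a selection rule could be applied, assuming a Welch PSD for bandwidth estimation and a crude energy-percentile proxy for SNR; the paper's actual measurement pipeline may differ.

```python
import numpy as np
from scipy.signal import welch

def passes_quality_criteria(audio, sr, min_bandwidth_hz=13_000, min_snr_db=32.0):
    """Crude proxy for the selection rule above (bandwidth >= 13 kHz,
    SNR >= 32 dB). Thresholding choices here are ours, for illustration."""
    freqs, psd = welch(audio, fs=sr, nperseg=2048)
    # Bandwidth: highest frequency whose power is within 50 dB of the peak.
    above = psd > psd.max() * 10 ** (-50 / 10)
    bandwidth = freqs[above].max() if above.any() else 0.0
    # SNR proxy: loud-frame energy (speech) over quiet-frame energy (noise).
    frames = audio[: len(audio) // 1024 * 1024].reshape(-1, 1024)
    energy = (frames ** 2).mean(axis=1) + 1e-12
    snr_db = 10 * np.log10(np.percentile(energy, 90) / np.percentile(energy, 10))
    return bandwidth >= min_bandwidth_hz and snr_db >= min_snr_db
```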
We propose Citrinet - a new end-to-end convolutional Connectionist Temporal Classification (CTC) based automatic speech recognition (ASR) model. Citrinet is a deep residual neural model which uses 1D time-channel separable convolutions combined with sub-word encoding and squeeze-and-excitation. The resulting architecture significantly reduces the gap between non-autoregressive models and sequence-to-sequence and transducer models. We evaluate Citrinet on LibriSpeech, TED-LIUM2, AISHELL-1, and Multilingual LibriSpeech (MLS) English...
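A minimal PyTorch sketch of the squeeze-and-excitation component mentioned above, applied to 1D feature maps; the reduction ratio is illustrative.

```python
import torch.nn as nn

class SqueezeExcite1d(nn.Module):
    """Channel-wise squeeze-and-excitation for 1D feature maps: pool over
    time ('squeeze'), predict per-channel gates, and re-weight channels."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):  # x: (batch, channels, time)
        scale = self.fc(x.mean(dim=-1))   # squeeze over the time axis
        return x * scale.unsqueeze(-1)    # excite: re-weight each channel
```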
We present OpenSeq2Seq - a TensorFlow-based toolkit for training sequence-to-sequence models that features distributed and mixed-precision training. Benchmarks on machine translation and speech recognition tasks show that models built using OpenSeq2Seq give state-of-the-art performance at 1.5-3x less training time. OpenSeq2Seq currently provides building blocks to solve a wide range of tasks, including neural machine translation, automatic speech recognition, and speech synthesis.
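OpenSeq2Seq itself is TensorFlow-based; as a generic, framework-level illustration of what mixed-precision training with dynamic loss scaling involves (not the toolkit's API), a PyTorch-style training step might look like this.

```python
import torch

def train_step(model, batch, optimizer, scaler):
    """Generic mixed-precision step with dynamic loss scaling; illustrates
    the idea such toolkits automate, not OpenSeq2Seq's actual interface."""
    inputs, targets = batch
    optimizer.zero_grad()
    # Forward pass runs in float16 where safe, float32 where needed.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(inputs, targets)  # assume the model returns its loss
    scaler.scale(loss).backward()      # scale loss so fp16 grads stay finite
    scaler.step(optimizer)             # unscale, skip step if grads overflowed
    scaler.update()                    # adjust the scale factor dynamically
    return loss.item()

# Usage: scaler = torch.cuda.amp.GradScaler(), then call train_step per batch.
```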
Building an accurate automatic speech recognition (ASR) system requires a large dataset that contains many hours of labeled speech samples produced by a diverse set of speakers. The lack of such open, free datasets is one of the main issues preventing advancements in ASR research. To address this problem, we propose to augment natural speech with synthetic speech. We train very large end-to-end neural models using LibriSpeech augmented with synthetic speech. These new models achieve state-of-the-art Word Error Rate (WER) for character-level based models without...
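A minimal sketch of the augmentation idea, drawing each training example from either the natural or the synthetic pool; the mixing fraction is a hypothetical knob, not the paper's setting.

```python
import random

def mixed_batches(natural, synthetic, synth_fraction=0.3, batch_size=32):
    """Yield training batches mixing natural and synthetic utterances.
    Each slot is filled from the synthetic pool with prob. `synth_fraction`."""
    while True:
        batch = []
        for _ in range(batch_size):
            pool = synthetic if random.random() < synth_fraction else natural
            batch.append(random.choice(pool))
        yield batch
```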
In the English speech-to-text (STT) machine learning task, acoustic models are conventionally trained on uncased Latin characters, and any necessary orthography (such as capitalization, punctuation, and denormalization of non-standard words) is imputed by separate post-processing models. This adds complexity and limits performance, as many formatting tasks benefit from semantic information present in the acoustic signal but absent in transcription. Here we propose a new STT task: end-to-end neural transcription with...
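One concrete consequence of this task formulation is that the model's label set must include case and punctuation symbols directly. A small sketch of such a character vocabulary for CTC training follows; the exact symbol inventory here is ours, not necessarily the paper's.

```python
import string

# Character vocabulary for end-to-end cased, punctuated transcription.
BLANK = "<blank>"  # CTC blank symbol at index 0
chars = (list(string.ascii_lowercase) + list(string.ascii_uppercase)
         + list(string.digits) + list(" .,?!'-"))
vocab = [BLANK] + chars
char2id = {c: i for i, c in enumerate(vocab)}

def encode(text):
    """Map a cased, punctuated transcript to CTC label ids (unknowns dropped)."""
    return [char2id[c] for c in text if c in char2id]

# encode("Hello, world!") keeps the capital H and the punctuation as labels.
```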
We propose SpeakerNet - a new neural architecture for speaker recognition and verification tasks. It is composed of residual blocks with 1D depth-wise separable convolutions, batch normalization, and ReLU layers. This architecture uses an x-vector based statistics pooling layer to map variable-length utterances to a fixed-length embedding (q-vector). SpeakerNet-M is a simple, lightweight model with just 5M parameters. It doesn't use voice activity detection (VAD) and achieves close to state-of-the-art performance, scoring an Equal Error...
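A minimal PyTorch sketch of the x-vector style statistics pooling step described above, which is what makes the output independent of utterance length.

```python
import torch
import torch.nn as nn

class StatsPooling(nn.Module):
    """Statistics pooling: concatenate per-channel mean and standard
    deviation over time, turning a variable-length feature sequence into
    a fixed-length vector suitable for a speaker embedding."""
    def forward(self, x):  # x: (batch, channels, time), any time length
        mean = x.mean(dim=-1)
        std = x.std(dim=-1)
        return torch.cat([mean, std], dim=-1)  # (batch, 2 * channels)
```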
This study explores methods to increase data volume for low-resource languages using techniques such as crowdsourcing, pseudo-labeling, advanced data preprocessing, and various permissive data sources such as audiobooks, Common Voice, and YouTube. While these methods are well-explored for high-resource languages, their application to low-resource languages remains underexplored. Using Armenian and Georgian as case studies, we demonstrate how linguistic and resource-specific characteristics influence the success of these methods. This work provides practical guidance...
In this paper, we demonstrate the efficacy of transfer learning and continuous learning for various automatic speech recognition (ASR) tasks. We start with a pre-trained English ASR model and show that transfer learning can be effectively and easily performed on: (1) different English accents, (2) different languages (German, Spanish, and Russian), and (3) application-specific domains. Our experiments demonstrate that, in all three cases, transfer learning from a good base model has higher accuracy than a model trained from scratch. It is preferred to fine-tune large models rather than small models, even if the dataset...
We propose an end-to-end Automatic Speech Recognition (ASR) system that can be trained on transcribed speech data, text-only data, or a mixture of both. The proposed model uses an integrated auxiliary block for text-based training. This block combines a non-autoregressive multi-speaker text-to-mel-spectrogram generator with a GAN-based enhancer to improve the spectrogram quality. The system can generate a mel-spectrogram dynamically during training. It can be used to adapt the ASR model to a new domain by using text-only data from this domain. We demonstrate...
We present OpenSeq2Seq – an open-source toolkit for training sequence-to-sequence models. The main goal of our toolkit is to allow researchers to most effectively explore different architectures. Efficiency is achieved by fully supporting distributed and mixed-precision training. OpenSeq2Seq provides building blocks for encoder-decoder models for neural machine translation and automatic speech recognition. We plan to extend it with other modalities in the future.
In this paper, we demonstrate the efficacy of transfer learning and continuous learning for various automatic speech recognition (ASR) tasks using end-to-end models trained with CTC loss. We start with a large pre-trained English ASR model and show that transfer learning can be effectively and easily performed on: (1) different English accents, (2) different languages (from English to German, Spanish, or Russian, or from Mandarin to Cantonese), and (3) application-specific domains. Our extensive set of experiments demonstrates that, in all three cases, transfer learning from a good base model has higher accuracy than...
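A generic sketch of this kind of cross-language transfer for a CTC model: keep the pre-trained encoder weights, re-initialize the output layer for the new alphabet, and fine-tune everything with a small learning rate. The attribute name `model.decoder` is hypothetical, not a specific toolkit's API.

```python
import torch
import torch.nn as nn

def prepare_for_transfer(model, new_vocab_size, lr=1e-4):
    """Adapt a pre-trained CTC model to a new language/domain (sketch).
    Assumes `model.decoder` is the final linear projection onto the old
    vocabulary; that name is an assumption for illustration."""
    in_features = model.decoder.in_features
    # Replace the output projection: the new language has a new alphabet.
    model.decoder = nn.Linear(in_features, new_vocab_size)
    # Fine-tune all weights with a small learning rate to preserve the
    # acoustic representations learned on the source language.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    return model, optimizer
```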
We propose CONF-TSASR, a non-autoregressive end-to-end time-frequency domain architecture for single-channel target-speaker automatic speech recognition (TS-ASR). The model consists of a TitaNet-based speaker embedding module, a Conformer-based masking network, as well as ASR modules. These modules are jointly optimized to transcribe the target speaker while ignoring speech from other speakers. For training we use Connectionist Temporal Classification (CTC) loss and introduce a scale-invariant spectrogram reconstruction...
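A plausible form of a scale-invariant spectrogram reconstruction loss, written in the spirit of SI-SDR (project the estimate onto the reference, then penalize the residual); the paper's exact formulation may differ.

```python
import torch

def si_spectrogram_loss(est, ref, eps=1e-8):
    """Scale-invariant reconstruction loss between estimated and reference
    spectrograms: invariant to a global rescaling of the estimate."""
    est = est.flatten(1)  # (batch, freq * time)
    ref = ref.flatten(1)
    dot = (est * ref).sum(dim=1, keepdim=True)
    # Projection of the estimate onto the reference (the "signal" part).
    proj = dot / (ref.pow(2).sum(dim=1, keepdim=True) + eps) * ref
    noise = est - proj
    ratio = proj.pow(2).sum(dim=1) / (noise.pow(2).sum(dim=1) + eps)
    return (-10 * torch.log10(ratio + eps)).mean()  # negative SI-ratio in dB
```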
Recent advances in speech recognition and translation rely on hundreds of thousands of hours of Internet speech data. We argue that state-of-the-art accuracy can be reached without relying on web-scale data. Canary, a multilingual ASR and speech translation model, outperforms current state-of-the-art models (Whisper, OWSM, and Seamless-M4T) on English, French, Spanish, and German, while being trained on an order of magnitude less data than these models. Three key factors enable such a data-efficient model: (1) a FastConformer-based attention...
Automatic speech recognition models are often adapted to improve their accuracy in a new domain. A potential drawback of model adaptation to new domains is catastrophic forgetting, where the Word Error Rate on the original domain is significantly degraded. This paper addresses the situation when we want to simultaneously adapt automatic speech recognition models to a new domain and limit the degradation on the original domain without access to the original training dataset. We propose several techniques, such as a limited training strategy and regularized adapter modules, for the Transducer encoder, prediction,...
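A minimal sketch of a residual bottleneck adapter with an L2 penalty that pulls it toward the identity mapping, one way to realize "regularized adapter modules"; the sizes and the penalty form are illustrative, not necessarily the paper's.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck adapter inserted into a frozen encoder layer.
    Regularizing its weights toward zero keeps the adapted model close to
    the original one, which limits forgetting on the original domain."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # adapter starts as an identity map
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

    def l2_penalty(self):
        # Added to the adaptation loss: pulls the adapter back toward identity.
        return sum(p.pow(2).sum() for p in self.parameters())
```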