- Speech Recognition and Synthesis
- Natural Language Processing Techniques
- Music and Audio Processing
- Speech and Audio Processing
- Topic Modeling
- Speech and Dialogue Systems
- Text Readability and Simplification
- Algorithms and Data Compression
- Stochastic Gradient Optimization Techniques
- Neural Networks and Applications
- Advanced Data Compression Techniques
- Machine Learning and ELM
- Digital Filter Design and Implementation
- Security, Politics, and Digital Transformation
- Advanced Neural Network Applications
- Domain Adaptation and Few-Shot Learning
- Embedded Systems Design Techniques
- Phonetics and Phonology Research
- Cybersecurity and Information Systems
Nvidia (United States)
2018-2025
Nvidia (United Kingdom)
2020-2021
We propose a new end-to-end neural acoustic model for automatic speech recognition. The model is composed of multiple blocks with residual connections between them. Each block consists of one or more modules with 1D time-channel separable convolutional layers, batch normalization, and ReLU layers. It is trained with CTC loss. The proposed network achieves near state-of-the-art accuracy on LibriSpeech and Wall Street Journal, while having fewer parameters than all competing models. We also demonstrate that this model can be...
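As a rough illustration of the block structure described above, here is a minimal PyTorch sketch of a time-channel separable convolution module and a residual block built from it; channel counts, kernel sizes, and the repeat factor are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TimeChannelSeparableConv(nn.Module):
    """1D time-channel separable convolution: a depthwise conv over time
    followed by a pointwise (1x1) conv that mixes channels."""
    def __init__(self, channels, kernel_size):
        super().__init__()
        # padding = k // 2 preserves sequence length for odd kernel sizes
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class SeparableConvBlock(nn.Module):
    """A block of `repeat` separable-conv modules with BN and ReLU, plus a
    residual connection around the whole block (sizes are illustrative)."""
    def __init__(self, channels, kernel_size=33, repeat=3):
        super().__init__()
        mods = []
        for i in range(repeat):
            mods += [TimeChannelSeparableConv(channels, kernel_size),
                     nn.BatchNorm1d(channels)]
            if i < repeat - 1:  # final activation comes after the residual add
                mods.append(nn.ReLU())
        self.body = nn.Sequential(*mods)
        self.act = nn.ReLU()

    def forward(self, x):  # x: (batch, channels, time)
        return self.act(self.body(x) + x)
```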
In this paper, we report state-of-the-art results on LibriSpeech among end-to-end speech recognition models without any external training data. Our model, Jasper, uses only 1D convolutions, batch normalization, ReLU, dropout, and residual connections. To improve training, we further introduce a new layer-wise optimizer called NovoGrad. Through experiments, we demonstrate that the proposed deep architecture performs as well as or better than more complex choices. Our deepest Jasper variant uses 54 convolutional...
NeMo (Neural Modules) is a Python framework-agnostic toolkit for creating AI applications through re-usability, abstraction, and composition. NeMo is built around neural modules, conceptual blocks of neural networks that take typed inputs and produce typed outputs. Such modules typically represent data layers, encoders, decoders, language models, loss functions, or methods of combining activations. NeMo makes it easy to combine and re-use these building blocks while providing a level of semantic correctness checking via its neural type system...
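To illustrate the typed-module idea, here is a toy sketch of the concept, not NeMo's actual API: each module declares input and output types, and a composition helper checks that adjacent modules are compatible.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NeuralType:
    """A named semantic type for module inputs/outputs, e.g. 'AudioSignal'."""
    name: str

class Module:
    """A conceptual neural module with declared input and output types."""
    input_type: NeuralType
    output_type: NeuralType

    def __call__(self, x):
        raise NotImplementedError

def chain(*modules):
    """Compose modules, checking that each output type feeds the next input."""
    for left, right in zip(modules, modules[1:]):
        if left.output_type != right.input_type:
            raise TypeError(
                f"{left.output_type.name} cannot feed {right.input_type.name}")
    def pipeline(x):
        for m in modules:
            x = m(x)
        return x
    return pipeline
```

Composition failures are then caught when the pipeline is built, before any data flows through it.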
We propose NovoGrad, an adaptive stochastic gradient descent method with layer-wise gradient normalization and decoupled weight decay. In our experiments on neural networks for image classification, speech recognition, machine translation, and language modeling, it performs on par with or better than well-tuned SGD with momentum, Adam, and AdamW. Additionally, NovoGrad (1) is robust to the choice of learning rate and weight initialization, (2) works well in a large batch setting, and (3) has a two times smaller memory footprint than Adam.
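A minimal sketch of the per-layer update implied by this description, with a scalar second moment per layer (this is where the memory saving comes from) and decoupled weight decay; hyperparameter defaults and variable names are ours, not necessarily the paper's.

```python
import torch

@torch.no_grad()
def novograd_step(params, moments, lr=0.01, beta1=0.95, beta2=0.98,
                  weight_decay=0.001, eps=1e-8):
    """One NovoGrad-style step. `params` is a list of per-layer tensors with
    .grad populated; `moments` maps each param to [m (tensor), v (float)]."""
    for p in params:
        g = p.grad
        m, v = moments.setdefault(p, [torch.zeros_like(p), 0.0])
        # Layer-wise second moment: one scalar per layer, not per weight.
        v = beta2 * v + (1 - beta2) * float(g.norm() ** 2)
        # Gradient normalized by the layer norm, plus decoupled weight decay.
        update = g / (v ** 0.5 + eps) + weight_decay * p
        m.mul_(beta1).add_(update)
        p.add_(m, alpha=-lr)
        moments[p] = [m, v]
```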
This paper introduces a new multi-speaker English dataset for training text-to-speech models. The dataset is based on LibriVox audiobooks and Project Gutenberg texts, both in the public domain. The dataset contains about 292 hours of speech from 10 speakers, with at least 17 hours per speaker, sampled at 44.1 kHz. To select samples of high quality, we considered only audio recordings with a signal bandwidth of at least 13 kHz and a signal-to-noise ratio (SNR) of at least 32 dB. The dataset is publicly released at http://www.openslr.org/109/.
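A rough sketch of how such a selection rule could be applied, assuming a Welch PSD for bandwidth estimation and a crude energy-percentile proxy for SNR; the paper's actual measurement pipeline may differ.

```python
import numpy as np
from scipy.signal import welch

def passes_quality_criteria(audio, sr, min_bandwidth_hz=13_000, min_snr_db=32.0):
    """Crude proxy for the selection rule above (bandwidth >= 13 kHz,
    SNR >= 32 dB). Thresholding choices here are ours, for illustration."""
    freqs, psd = welch(audio, fs=sr, nperseg=2048)
    # Bandwidth: highest frequency whose power is within 50 dB of the peak.
    above = psd > psd.max() * 10 ** (-50 / 10)
    bandwidth = freqs[above].max() if above.any() else 0.0
    # SNR proxy: loud-frame energy (speech) over quiet-frame energy (noise).
    frames = audio[: len(audio) // 1024 * 1024].reshape(-1, 1024)
    energy = (frames ** 2).mean(axis=1) + 1e-12
    snr_db = 10 * np.log10(np.percentile(energy, 90) / np.percentile(energy, 10))
    return bandwidth >= min_bandwidth_hz and snr_db >= min_snr_db
```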
We propose Citrinet - a new end-to-end convolutional Connectionist Temporal Classification (CTC) based automatic speech recognition (ASR) model. Citrinet is a deep residual neural model which uses 1D time-channel separable convolutions combined with sub-word encoding and squeeze-and-excitation. The resulting architecture significantly reduces the gap between non-autoregressive models and sequence-to-sequence and transducer models. We evaluate Citrinet on LibriSpeech, TED-LIUM2, AISHELL-1, and Multilingual LibriSpeech (MLS) English...
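A minimal PyTorch sketch of the squeeze-and-excitation component mentioned above, applied to 1D feature maps; the reduction ratio is illustrative.

```python
import torch.nn as nn

class SqueezeExcite1d(nn.Module):
    """Channel-wise squeeze-and-excitation for 1D feature maps: pool over
    time ('squeeze'), predict per-channel gates, and re-weight channels."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):  # x: (batch, channels, time)
        scale = self.fc(x.mean(dim=-1))   # squeeze over the time axis
        return x * scale.unsqueeze(-1)    # excite: re-weight each channel
```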
We present OpenSeq2Seq - a TensorFlow-based toolkit for training sequence-to-sequence models that features distributed and mixed-precision training. Benchmarks on machine translation and speech recognition tasks show that models built using OpenSeq2Seq give state-of-the-art performance at 1.5-3x less training time. OpenSeq2Seq currently provides building blocks to solve a wide range of tasks, including neural machine translation, automatic speech recognition, and speech synthesis.
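OpenSeq2Seq itself is TensorFlow-based; as a generic, framework-level illustration of what mixed-precision training with dynamic loss scaling involves (not the toolkit's API), a PyTorch-style training step might look like this.

```python
import torch

def train_step(model, batch, optimizer, scaler):
    """Generic mixed-precision step with dynamic loss scaling; illustrates
    the idea such toolkits automate, not OpenSeq2Seq's actual interface."""
    inputs, targets = batch
    optimizer.zero_grad()
    # Forward pass runs in float16 where safe, float32 where needed.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(inputs, targets)  # assume the model returns its loss
    scaler.scale(loss).backward()      # scale loss so fp16 grads stay finite
    scaler.step(optimizer)             # unscale, skip step if grads overflowed
    scaler.update()                    # adjust the scale factor dynamically
    return loss.item()

# Usage: scaler = torch.cuda.amp.GradScaler(), then call train_step per batch.
```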
Building an accurate automatic speech recognition (ASR) system requires a large dataset that contains many hours of labeled speech samples produced by a diverse set of speakers. The lack of such open, free datasets is one of the main issues preventing advancements in ASR research. To address this problem, we propose to augment natural speech with synthetic speech. We train very large end-to-end neural models using LibriSpeech augmented with synthetic speech. These new models achieve state-of-the-art Word Error Rate (WER) for character-level based models without...
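A minimal sketch of the augmentation idea, drawing each training example from either the natural or the synthetic pool; the mixing fraction is a hypothetical knob, not the paper's setting.

```python
import random

def mixed_batches(natural, synthetic, synth_fraction=0.3, batch_size=32):
    """Yield training batches mixing natural and synthetic utterances.
    Each slot is filled from the synthetic pool with prob. `synth_fraction`."""
    while True:
        batch = []
        for _ in range(batch_size):
            pool = synthetic if random.random() < synth_fraction else natural
            batch.append(random.choice(pool))
        yield batch
```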
In the English speech-to-text (STT) machine learning task, acoustic models are conventionally trained on uncased Latin characters, and any necessary orthography (such as capitalization, punctuation, and denormalization of non-standard words) is imputed by separate post-processing models. This adds complexity and limits performance, as many formatting tasks benefit from semantic information present in the acoustic signal but absent in transcription. Here we propose a new STT task: end-to-end neural transcription with...
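One concrete consequence of this task formulation is that the model's label set must include case and punctuation symbols directly. A small sketch of such a character vocabulary for CTC training follows; the exact symbol inventory here is ours, not necessarily the paper's.

```python
import string

# Character vocabulary for end-to-end cased, punctuated transcription.
BLANK = "<blank>"  # CTC blank symbol at index 0
chars = (list(string.ascii_lowercase) + list(string.ascii_uppercase)
         + list(string.digits) + list(" .,?!'-"))
vocab = [BLANK] + chars
char2id = {c: i for i, c in enumerate(vocab)}

def encode(text):
    """Map a cased, punctuated transcript to CTC label ids (unknowns dropped)."""
    return [char2id[c] for c in text if c in char2id]

# encode("Hello, world!") keeps the capital H and the punctuation as labels.
```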
We propose SpeakerNet - a new neural architecture for speaker recognition and verification tasks. It is composed of residual blocks with 1D depth-wise separable convolutions, batch normalization, and ReLU layers. This architecture uses an x-vector based statistics pooling layer to map variable-length utterances to a fixed-length embedding (q-vector). SpeakerNet-M is a simple, lightweight model with just 5M parameters. It doesn't use voice activity detection (VAD) and achieves close to state-of-the-art performance, scoring an Equal Error...
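A minimal PyTorch sketch of the x-vector style statistics pooling step described above, which is what makes the output independent of utterance length.

```python
import torch
import torch.nn as nn

class StatsPooling(nn.Module):
    """Statistics pooling: concatenate per-channel mean and standard
    deviation over time, turning a variable-length feature sequence into
    a fixed-length vector suitable for a speaker embedding."""
    def forward(self, x):  # x: (batch, channels, time), any time length
        mean = x.mean(dim=-1)
        std = x.std(dim=-1)
        return torch.cat([mean, std], dim=-1)  # (batch, 2 * channels)
```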
This study explores methods to increase data volume for low-resource languages using techniques such as crowdsourcing, pseudo-labeling, advanced data preprocessing, and various permissive data sources such as audiobooks, Common Voice, and YouTube. While these methods are well-explored for high-resource languages, their application to low-resource languages remains underexplored. Using Armenian and Georgian as case studies, we demonstrate how linguistic and resource-specific characteristics influence the success of these methods. This work provides practical guidance...
In this paper, we demonstrate the efficacy of transfer learning and continuous learning for various automatic speech recognition (ASR) tasks. We start with a pre-trained English ASR model and show that transfer learning can be effectively and easily performed on: (1) different English accents, (2) different languages (German, Spanish, and Russian), and (3) application-specific domains. Our experiments demonstrate that, in all three cases, transfer learning from a good base model has higher accuracy than a model trained from scratch. It is preferred to fine-tune large models rather than small models, even if the dataset...
We propose an end-to-end Automatic Speech Recognition (ASR) system that can be trained on transcribed speech data, text-only data, or a mixture of both. The proposed model uses an integrated auxiliary block for text-based training. This block combines a non-autoregressive multi-speaker text-to-mel-spectrogram generator with a GAN-based enhancer to improve the spectrogram quality. The system can generate a mel-spectrogram dynamically during training. It can be used to adapt the ASR model to a new domain by using text-only data from this domain. We demonstrate...
We present OpenSeq2Seq – an open-source toolkit for training sequence-to-sequence models. The main goal of our toolkit is to allow researchers to most effectively explore different architectures. Efficiency is achieved by fully supporting distributed and mixed-precision training. OpenSeq2Seq provides building blocks for encoder-decoder models for neural machine translation and automatic speech recognition. We plan to extend it with other modalities in the future.
In this paper, we demonstrate the efficacy of transfer learning and continuous learning for various automatic speech recognition (ASR) tasks using end-to-end models trained with CTC loss. We start with a large pre-trained English ASR model and show that transfer learning can be effectively and easily performed on: (1) different English accents, (2) different languages (from English to German, Spanish, or Russian, or from Mandarin to Cantonese), and (3) application-specific domains. Our extensive set of experiments demonstrates that, in all three cases, transfer learning from a good base model has higher accuracy than...
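A generic sketch of this kind of cross-language transfer for a CTC model: keep the pre-trained encoder weights, re-initialize the output layer for the new alphabet, and fine-tune everything with a small learning rate. The attribute name `model.decoder` is hypothetical, not a specific toolkit's API.

```python
import torch
import torch.nn as nn

def prepare_for_transfer(model, new_vocab_size, lr=1e-4):
    """Adapt a pre-trained CTC model to a new language/domain (sketch).
    Assumes `model.decoder` is the final linear projection onto the old
    vocabulary; that name is an assumption for illustration."""
    in_features = model.decoder.in_features
    # Replace the output projection: the new language has a new alphabet.
    model.decoder = nn.Linear(in_features, new_vocab_size)
    # Fine-tune all weights with a small learning rate to preserve the
    # acoustic representations learned on the source language.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    return model, optimizer
```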
We propose CONF-TSASR, a non-autoregressive end-to-end time-frequency domain architecture for single-channel target-speaker automatic speech recognition (TS-ASR). The model consists of a TitaNet-based speaker embedding module, a Conformer-based masking network, as well as ASR modules. These modules are jointly optimized to transcribe the target speaker while ignoring speech from other speakers. For training we use Connectionist Temporal Classification (CTC) loss and introduce a scale-invariant spectrogram reconstruction...
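A plausible form of a scale-invariant spectrogram reconstruction loss, written in the spirit of SI-SDR (project the estimate onto the reference, then penalize the residual); the paper's exact formulation may differ.

```python
import torch

def si_spectrogram_loss(est, ref, eps=1e-8):
    """Scale-invariant reconstruction loss between estimated and reference
    spectrograms: invariant to a global rescaling of the estimate."""
    est = est.flatten(1)  # (batch, freq * time)
    ref = ref.flatten(1)
    dot = (est * ref).sum(dim=1, keepdim=True)
    # Projection of the estimate onto the reference (the "signal" part).
    proj = dot / (ref.pow(2).sum(dim=1, keepdim=True) + eps) * ref
    noise = est - proj
    ratio = proj.pow(2).sum(dim=1) / (noise.pow(2).sum(dim=1) + eps)
    return (-10 * torch.log10(ratio + eps)).mean()  # negative SI-ratio in dB
```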
Recent advances in speech recognition and translation rely on hundreds of thousands of hours of Internet speech data. We argue that state-of-the-art accuracy can be reached without relying on web-scale data. Canary, a multilingual ASR and speech translation model, outperforms current state-of-the-art models (Whisper, OWSM, and Seamless-M4T) on English, French, Spanish, and German, while being trained on an order of magnitude less data than these models. Three key factors enable such a data-efficient model: (1) a FastConformer-based attention...
Automatic speech recognition models are often adapted to improve their accuracy in a new domain. A potential drawback of model adaptation to new domains is catastrophic forgetting, where the Word Error Rate on the original domain is significantly degraded. This paper addresses the situation when we want to simultaneously adapt automatic speech recognition models to a new domain and limit the degradation on the original domain without access to the original training dataset. We propose several techniques, such as a limited training strategy and regularized adapter modules, for the Transducer encoder, prediction,...
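A minimal sketch of a residual bottleneck adapter with an L2 penalty that pulls it toward the identity mapping, one way to realize "regularized adapter modules"; the sizes and the penalty form are illustrative, not necessarily the paper's.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck adapter inserted into a frozen encoder layer.
    Regularizing its weights toward zero keeps the adapted model close to
    the original one, which limits forgetting on the original domain."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # adapter starts as an identity map
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

    def l2_penalty(self):
        # Added to the adaptation loss: pulls the adapter back toward identity.
        return sum(p.pow(2).sum() for p in self.parameters())
```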