- Speech Recognition and Synthesis
- Speech and Audio Processing
- Advanced Image Processing Techniques
- Advanced Data Compression Techniques
- Music and Audio Processing
- Image and Signal Denoising Methods
- Speech and Dialogue Systems
- Voice and Speech Disorders
- Colorectal Cancer Screening and Detection
- Advanced Vision and Imaging
- Radiomics and Machine Learning in Medical Imaging
- Wireless Communication Networks Research
- Topic Modeling
- Wireless Networks and Protocols
- Influenza Virus Research Studies
- COVID-19 Epidemiological Studies
- Cooperative Communication and Network Coding
- Empathy and Medical Education
- Language Development and Disorders
- AI in Cancer Detection
- Lung Cancer Diagnosis and Treatment
- COVID-19 Diagnosis Using AI
- Natural Language Processing Techniques
- Data-Driven Disease Surveillance
- Algorithms and Data Compression
Enzo Life Sciences (United States), 2024
Alphabet (United States), 2024
Sapporo Medical University, 2024
Kyushu University, 2024
Google (United States), 2016-2022
Shibuya (Japan), 2021
Google (Israel), 2020
Stony Brook University, 2002
Northrop Grumman (United States), 1993
This paper presents a set of full-resolution lossy image compression methods based on neural networks. Each of the architectures we describe can provide variable compression rates during deployment without requiring retraining of the network: each network need only be trained once. All of our architectures consist of a recurrent neural network (RNN)-based encoder and decoder, a binarizer, and a neural network for entropy coding. We compare RNN types (LSTM, associative LSTM) and introduce a new hybrid of GRU and ResNet. We also study "one-shot" versus additive reconstruction...
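The additive reconstruction idea mentioned in the abstract can be illustrated with a toy, pure-Python sketch for a single pixel value. The sign-based binarizer and the halving step size below are illustrative stand-ins of my own, not the paper's trained encoder and decoder networks:

```python
def additive_reconstruction(pixel, iterations=6):
    """Toy additive scheme for one pixel value in [-1, 1]: each pass
    binarizes the current residual (the "binarizer" here is just its
    sign) and the reconstruction accumulates shrinking refinements."""
    reconstruction = 0.0
    codes = []
    for i in range(iterations):
        residual = pixel - reconstruction
        bit = 1.0 if residual >= 0 else -1.0   # binarizer emits {-1, +1}
        codes.append(bit)
        reconstruction += bit * 0.5 ** (i + 1)  # "decoder": halving step size
    return reconstruction, codes
```

Each additional pass spends one more bit per pixel and roughly halves the worst-case reconstruction error, which is the variable-rate property the abstract describes: one model, many rates.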
In this work, we propose "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The embeddings are trained with no explicit labels, yet learn to model a large range of acoustic expressiveness. GSTs lead to a rich set of significant results. The soft, interpretable "labels" they generate can be used to control synthesis in novel ways, such as varying speed and speaking style - independently of the text content. They can also be used for style transfer, replicating the speaking style of a single audio clip...
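The token-bank mechanism can be sketched as dot-product attention over a small bank of vectors. All values here are hypothetical placeholders; in the actual system the tokens are learned jointly with Tacotron and the attention is multi-head:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def style_embedding(reference, token_bank):
    """Attend a reference embedding over a bank of style tokens.
    The attention weights act as soft, interpretable "labels"; the
    weighted sum of tokens is the style embedding fed to synthesis."""
    scores = [sum(r * t for r, t in zip(reference, token)) for token in token_bank]
    weights = softmax(scores)
    embedding = [
        sum(w * token[d] for w, token in zip(weights, token_bank))
        for d in range(len(token_bank[0]))
    ]
    return embedding, weights
```

At inference time the weights can be set by hand instead of computed from a reference, which is one way the "control in novel ways" claim can be realized.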
We propose a method for lossy image compression based on recurrent, convolutional neural networks that outperforms BPG (4:2:0), WebP, JPEG2000, and JPEG as measured by MS-SSIM. We introduce three improvements over previous research that lead to this state-of-the-art result using a single model. First, we modify the recurrent architecture to improve spatial diffusion, which allows the network to more effectively capture and propagate image information through the network's hidden state. Second, in addition to lossless entropy...
We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning on this learned embedding space results in synthesized audio that matches the prosody of the reference signal with fine time detail, even when the reference and synthesis speakers are different. Additionally, we show that the embedding can be used to synthesize text that is different from that of the reference utterance. We define several quantitative and subjective metrics for evaluating prosody transfer, and report...
We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained on large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data. In particular, on an ASR task with 34k hours of labeled data, by fine-tuning an 8 billion parameter Conformer model we can match state-of-the-art (SoTA)...
The ultimate goal of transfer learning is to reduce labeled data requirements by exploiting a pre-existing embedding model trained for different datasets or tasks. The visual and language communities have established benchmarks to compare embeddings, but the speech community has yet to do so. This paper proposes a benchmark for comparing speech representations on non-semantic tasks, and proposes a representation based on an unsupervised triplet-loss objective. The proposed representation outperforms others on the benchmark, and even exceeds state-of-the-art...
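The triplet-loss objective underlying the proposed representation can be written in a few lines. This is the generic margin-based triplet loss; how anchors, positives, and negatives are paired from unlabeled audio is the paper's contribution and is not reproduced here:

```python
def triplet_loss(anchor, positive, negative, margin=1.0):
    """Margin-based triplet loss on embedding vectors: pull the
    anchor-positive pair together, push the anchor-negative pair
    apart until their squared distances differ by at least `margin`."""
    d_pos = sum((a - p) ** 2 for a, p in zip(anchor, positive))
    d_neg = sum((a - n) ** 2 for a, n in zip(anchor, negative))
    return max(0.0, d_pos - d_neg + margin)
```

The loss is zero once the negative is sufficiently farther from the anchor than the positive, so training focuses on triplets that still violate the margin.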
Automatic speech recognition (ASR) systems have dramatically improved over the last few years. ASR systems are most often trained on 'typical' speech, which means that underrepresented groups don't experience the same level of improvement. In this paper, we present and evaluate finetuning techniques to improve ASR for users with non-standard speech. We focus on two types of non-standard speech: that of people with amyotrophic lateral sclerosis (ALS) and accented speech. We train personalized models that achieve 62% and 35% relative WER improvement on these...
Many speech applications require understanding aspects beyond the words being spoken, such as recognizing emotion, detecting whether the speaker is wearing a mask, or distinguishing real from synthetic speech. In this work, we introduce a new state-of-the-art paralinguistic representation derived from large-scale, fully self-supervised training of a 600M+ parameter Conformer-based architecture. We benchmark on a diverse set of tasks and demonstrate that simple linear classifiers trained on top of our time-averaged...
Deep neural networks represent a powerful class of function approximators that can learn to compress and reconstruct images. Existing image compression algorithms based on neural networks learn quantized representations with a constant spatial bit rate across each image. While entropy coding introduces some variation, traditional codecs have benefited significantly by explicitly adapting the bit rate to local image complexity and visual saliency. This paper introduces an algorithm that combines deep networks with quality-sensitive bit rate adaptation using a tiled network. We...
Prosodic modeling is a core problem in speech synthesis. The key challenge is producing desirable prosody from textual input containing only phonetic information. In this preliminary study, we introduce the concept of "style tokens" in Tacotron, a recently proposed end-to-end neural speech synthesis model. Using style tokens, we aim to extract independent prosodic styles from training data. We show that without annotation data or an explicit supervision signal, our approach can automatically learn a variety...
Recent advances in self-supervision have dramatically improved the quality of speech representations. However, deployment of state-of-the-art embedding models on devices has been restricted due to their limited public availability and large resource footprint. Our work addresses these issues by publicly releasing a collection of paralinguistic models that are small and near state-of-the-art in performance. Our approach is based on knowledge distillation, and our models are distilled on public data only. We explore different architectures and thoroughly evaluate...
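Embedding-level knowledge distillation of the kind described can be sketched as a regression loss from a small student to a frozen teacher. This is a minimal illustration under that assumption, not the released models' actual training recipe:

```python
def distillation_loss(student_emb, teacher_emb):
    """Mean squared error between the student's embedding and the
    frozen teacher's embedding for the same audio clip; minimizing
    this trains the small model to mimic the large one."""
    assert len(student_emb) == len(teacher_emb)
    return sum((s - t) ** 2 for s, t in zip(student_emb, teacher_emb)) / len(teacher_emb)
```

Because the target is the teacher's output rather than human labels, any unlabeled (here, public) audio can serve as distillation data.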
The COVID-19 pandemic has highlighted the global need for reliable models of disease spread. We propose an AI-augmented forecast modeling framework that provides daily predictions of the expected number of confirmed deaths, cases, and hospitalizations during the following 4 weeks. We present an international, prospective evaluation of our models' performance across all states and counties in the USA and prefectures in Japan. Nationally, the incident mean absolute percentage error (MAPE) for predicting associated deaths...
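The MAPE metric quoted in the evaluation is straightforward to compute; the standard definition below assumes nonzero ground-truth counts:

```python
def mape(actual, predicted):
    """Mean absolute percentage error: average of |actual - predicted|
    as a percentage of the actual value, over all forecast points."""
    assert len(actual) == len(predicted) and all(a != 0 for a in actual)
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)
```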
The generation of transmission schedules for self-organizing radio networks by traffic-sensitive algorithms is described. A centralized 'Traffic' algorithm that can be used as a performance benchmark is presented. Also described is a distributed 'degree' algorithm, a traffic-sensitized version of an algorithm developed by A. Ephremides and T. Truong (1990). Two measures for comparing simulation results are also presented.
Learned speech representations can drastically improve performance on tasks with limited labeled data. However, due to their size and complexity, learned representations have limited utility in mobile settings where run-time performance can be a significant bottleneck. In this work, we propose a class of lightweight non-semantic speech embedding models that run efficiently on mobile devices, based on the recently proposed TRILL embedding. We combine novel architectural modifications with existing speed-up techniques to create embedding models that are fast enough to run in real-time on a mobile device...
We introduce a stop-code tolerant (SCT) approach to training recurrent convolutional neural networks for lossy image compression. Our methods introduce a multi-pass training method to combine the goals of high-quality reconstructions in areas around stop-code masking as well as in highly-detailed areas. These methods lead to lower true bitrates for a given recursion count, both pre- and post-entropy coding, even using unstructured LZ77 code compression. The pre-LZ77 gains are achieved by trimming stop codes. The post-LZ77 gains are due to the highly unequal distributions of 0/1 codes...
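Stop-code trimming can be illustrated on a hypothetical per-patch code sequence: once the stop symbol appears, subsequent codes for that patch carry no information and can be dropped before entropy coding. The list-of-symbols representation and the `-1` stop symbol below are illustrative choices, not the paper's encoding:

```python
def trim_stop_codes(codes, stop_symbol=-1):
    """Drop everything after the first stop symbol in one patch's
    code sequence; the stop symbol itself is kept so the decoder
    knows where the patch's refinement passes ended."""
    if stop_symbol in codes:
        return codes[: codes.index(stop_symbol) + 1]
    return codes
```

Trimming shortens the bitstream directly (the pre-LZ77 gain); the abstract attributes further gains to the skewed symbol statistics that remain after trimming.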
Automatic classification of disordered speech can provide an objective tool for identifying the presence and severity of speech impairment. Classification approaches can also help identify hard-to-recognize speech samples to teach ASR systems about the variable manifestations of impaired speech. Here, we develop and compare different deep learning techniques to classify the intelligibility of disordered speech on selected phrases. We collected samples from a diverse set of 661 speakers with a variety of self-reported disorders speaking 29 words or phrases, which...
Health anxiety has many damaging effects on patients with chronic illness. Physicians are often unable to alleviate concerns related to living with a disease that has an impact on daily life, and unregulated websites can overrepresent extreme anxiety-inducing outcomes. Educational clinician video interventions have shown some success as acute anxiolytics in health settings. However, little research has evaluated whether peer-based videos would be a feasible alternative or improvement. This pilot study assesses the efficacy of...