- Speech Recognition and Synthesis
- Natural Language Processing Techniques
- Topic Modeling
- Speech and Dialogue Systems
- Speech and Audio Processing
- Music and Audio Processing
- Biomedical Text Mining and Ontologies
- Tensor Decomposition and Applications
- Radiomics and Machine Learning in Medical Imaging
- Hydrocarbon Exploration and Reservoir Analysis
- Hydraulic Fracturing and Reservoir Analysis
- Model Reduction and Neural Networks
- AI in Cancer Detection
- Soil Mechanics and Vehicle Dynamics
- Neural Networks and Applications
- Gaussian Processes and Bayesian Inference
- Artificial Intelligence in Healthcare and Education
- COVID-19 Diagnosis Using AI
- Machine Learning in Healthcare
- Vehicle Dynamics and Control Systems
- Geographic Information Systems Studies
- Numerical Methods for Differential Equations
- Intelligent Tutoring Systems and Adaptive Learning
- Control Systems in Engineering
- Data Management and Algorithms
Carnegie Mellon University
2022-2025
Chongqing University
2025
RS Dynamics (Czechia)
2025
State Key Laboratory of Coal Mine Disaster Dynamics and Control
2025
Cornell University
2022-2023
Weill Cornell Medicine
2022-2023
Hunan University of Science and Technology
2023
University of Pittsburgh
2023
The University of Texas at Austin
2023
Shanghai Jiao Tong University
2022
Conformer, which combines convolution and self-attention sequentially to capture both local and global information, has shown remarkable performance and is currently regarded as the state-of-the-art for automatic speech recognition (ASR). Several other studies have explored integrating convolution and self-attention, but they have not managed to match Conformer's performance. The recently introduced Branchformer achieves performance comparable to Conformer by using dedicated branches for convolution and self-attention and merging the local and global context from each branch. In this paper, we propose...
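To make the "sequential convolution plus self-attention" idea concrete, here is a minimal PyTorch sketch of such a block. It is an illustration under assumptions, not the paper's implementation: a real Conformer block also includes macaron-style feed-forward modules and relative positional attention, and all sizes below are arbitrary.

```python
import torch
import torch.nn as nn

class SequentialConvAttnBlock(nn.Module):
    """Simplified Conformer-style block: self-attention (global context)
    followed by a depthwise convolution (local context), each with a
    residual connection. Illustrative only."""

    def __init__(self, d_model=256, n_heads=4, kernel_size=31):
        super().__init__()
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(d_model)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2, groups=d_model)

    def forward(self, x):                      # x: (batch, time, d_model)
        # Global information via self-attention.
        h = self.attn_norm(x)
        h, _ = self.attn(h, h, h)
        x = x + h
        # Local information via depthwise convolution over time.
        h = self.conv_norm(x).transpose(1, 2)  # (batch, d_model, time)
        h = self.conv(h).transpose(1, 2)
        return x + h
```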
As Automatic Speech Recognition (ASR) systems are getting better, there is an increasing interest in using the ASR output for downstream Natural Language Processing (NLP) tasks. However, there are few open source toolkits that can be used to generate reproducible results on different Spoken Language Understanding (SLU) benchmarks. Hence, there is a need to build an open standard that gives researchers a faster start in SLU research. We present ESPnet-SLU, which is designed for the quick development of spoken language understanding in a single framework. The ESPnet-SLU project...
Recent studies have highlighted the importance of fully open foundation models. The Open Whisper-style Speech Model (OWSM) is an initial step towards reproducing OpenAI Whisper using public data and open-source toolkits. However, previous versions of OWSM (v1 to v3) are still based on the standard Transformer, which might lead to inferior performance compared to state-of-the-art speech encoder architectures. This work aims to improve the performance and efficiency of OWSM without additional data. We present a series...
Conformer has proven to be effective in many speech processing tasks. It combines the benefits of extracting local dependencies using convolutions and global dependencies using self-attention. Inspired by this, we propose a more flexible, interpretable and customizable encoder alternative, Branchformer, with parallel branches for modeling various ranged dependencies in end-to-end speech processing. In each layer, one branch employs self-attention or its variant to capture long-range dependencies, while the other branch utilizes an MLP module...
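In contrast to the sequential design above, the parallel-branch idea can be sketched as follows in PyTorch. This is an assumption-laden simplification: the local branch here is a plain MLP with a depthwise-convolution gate standing in for the cgMLP module, and the merge is a simple concatenation plus projection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelBranchBlock(nn.Module):
    """Branchformer-style layer sketch: one self-attention branch for
    long-range dependencies, one gated-MLP branch for local ones,
    merged by concatenation + linear projection. Illustrative only."""

    def __init__(self, d_model=256, n_heads=4, kernel_size=31):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp_in = nn.Linear(d_model, d_model)
        self.dw_conv = nn.Conv1d(d_model, d_model, kernel_size,
                                 padding=kernel_size // 2, groups=d_model)
        self.merge = nn.Linear(2 * d_model, d_model)

    def forward(self, x):                        # x: (batch, time, d_model)
        h = self.norm(x)
        # Branch 1: self-attention captures long-range dependencies.
        g, _ = self.attn(h, h, h)
        # Branch 2: MLP with a depthwise-conv gate captures local patterns
        # (a simplification of the cgMLP module named in the abstract).
        l = F.gelu(self.mlp_in(h))
        l = l * torch.sigmoid(self.dw_conv(l.transpose(1, 2)).transpose(1, 2))
        # Merge both branches with a residual connection.
        return x + self.merge(torch.cat([g, l], dim=-1))
```

Keeping the branches parallel rather than sequential is what makes the design inspectable: the merge weights (or branch ablations) reveal how much each layer relies on local versus global context.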
Multilingual Automatic Speech Recognition (ASR) models have extended the usability of speech technologies to a wide variety of languages. Given how many languages these models handle, however, a key to understanding their imbalanced performance across different languages is to examine whether a model actually knows which language it should transcribe. In this paper, we introduce our work on improving performance on FLEURS, a 102-language open ASR benchmark, by conditioning the entire model on language identity (LID). We investigate techniques inspired from recent...
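One common way to condition a model on language identity is to reserve one token per language and prepend it to the target sequence; details vary by system, and the token IDs and helper below are hypothetical, not taken from the paper.

```python
# Hypothetical vocabulary: ordinary subword IDs plus one ID per language.
LID_TOKENS = {"en": 50001, "fr": 50002, "yo": 50003}  # illustrative IDs
SOS, EOS = 1, 2

def build_decoder_prompt(lid, target_ids):
    """Inject language identity by prepending the LID token, so every
    prediction is made knowing which language to transcribe."""
    return [SOS, LID_TOKENS[lid]] + target_ids + [EOS]

print(build_decoder_prompt("fr", [815, 42, 99]))
# [1, 50002, 815, 42, 99, 2]
```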
Self-supervised speech representation learning (SSL) has shown to be effective in various downstream tasks, but SSL models are usually large and slow. Model compression techniques such as pruning aim to reduce the model size and computation without degradation in accuracy. Prior studies focus on the pruning of Transformers; however, speech SSL models not only utilize a stack of Transformer blocks, but also combine a frontend network based on multiple convolutional layers for low-level feature learning. This frontend has a small size but a heavy computational cost. In...
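As a rough illustration of structured pruning applied to such a convolutional frontend, the sketch below drops whole output channels by weight magnitude. This is a stand-in heuristic: the paper's method learns which units to prune jointly with a sparsity objective, which this example does not implement.

```python
import torch
import torch.nn as nn

def prune_conv_channels(conv: nn.Conv1d, keep_ratio: float) -> nn.Conv1d:
    """Structured pruning sketch: keep the output channels of a conv
    layer with the largest L2 weight norms, removing the rest entirely
    (so compute actually shrinks, unlike unstructured masking)."""
    n_keep = max(1, int(conv.out_channels * keep_ratio))
    norms = conv.weight.detach().flatten(1).norm(dim=1)   # one norm per channel
    keep = norms.topk(n_keep).indices.sort().values
    pruned = nn.Conv1d(conv.in_channels, n_keep, conv.kernel_size[0],
                       stride=conv.stride[0], padding=conv.padding[0],
                       bias=conv.bias is not None)
    pruned.weight.data = conv.weight.data[keep]
    if conv.bias is not None:
        pruned.bias.data = conv.bias.data[keep]
    return pruned

frontend = nn.Conv1d(1, 512, kernel_size=10, stride=5)
print(prune_conv_channels(frontend, keep_ratio=0.5))  # 256 channels remain
```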
Text language models have shown remarkable zero-shot capability in generalizing to unseen tasks when provided with well-formulated instructions. However, existing studies in speech processing primarily focus on limited or specific tasks. Moreover, the lack of standardized benchmarks hinders a fair comparison across different approaches. Thus, we present Dynamic-SUPERB, a benchmark designed for building universal speech models capable of leveraging instruction tuning to perform multiple tasks in a zero-shot fashion. To achieve...
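For a sense of what instruction-conditioned speech evaluation looks like, here is a hypothetical sample in that style; the field names and task are illustrative, not the benchmark's actual schema.

```python
# Hypothetical instruction-tuning sample: the model receives audio plus a
# natural-language instruction and must produce the answer as text.
sample = {
    "audio": "utt_0001.wav",
    "instruction": "Identify the emotion of the speaker. "
                   "Answer with one of: happy, sad, angry, neutral.",
    "label": "neutral",
}
```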
We propose a decoder-only language model, VoxtLM, that can perform four tasks: speech recognition, speech synthesis, text generation, and speech continuation. VoxtLM integrates the text vocabulary with discrete speech tokens from self-supervised speech features and uses special tokens to enable multitask learning. Compared to a single-task model, VoxtLM exhibits a significant improvement in speech synthesis, with improvements in both speech intelligibility from 28.9 to 5.6 and objective quality from 2.68 to 3.90. VoxtLM also improves speech generation and speech recognition performance over the single-task counterpart. Further, VoxtLM is trained on publicly...
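The joint-vocabulary idea can be sketched as follows; the special tokens, offsets, and sizes are invented for illustration and are not VoxtLM's actual vocabulary.

```python
# Sketch of a joint token stream for a decoder-only speech+text LM.
TEXT_VOCAB_SIZE = 32_000
SPECIAL = {"<st>": 32_000, "<et>": 32_001,   # start/end of text
           "<ss>": 32_002, "<es>": 32_003}   # start/end of speech
SPEECH_OFFSET = 32_004                       # discrete SSL units live here

def asr_example(speech_units, text_ids):
    """Speech recognition as next-token prediction: speech tokens in,
    text tokens out, with special tokens marking each modality. Other
    tasks (TTS, continuation) just reorder the same building blocks."""
    return ([SPECIAL["<ss>"]] + [SPEECH_OFFSET + u for u in speech_units]
            + [SPECIAL["<es>"], SPECIAL["<st>"]] + text_ids
            + [SPECIAL["<et>"]])

print(asr_example([5, 17, 17, 3], [904, 121]))
```

Because every task is expressed in the same flat token space, a single decoder-only model can be trained on all four tasks with ordinary next-token cross-entropy.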
Neural scaling laws offer valuable insights for designing robust sequence processing architectures. While these laws have been extensively characterized in other modalities, their behavior in speech remains comparatively underexplored. In this work, we introduce OWLS, an open-access, reproducible suite of multilingual speech recognition and translation models spanning 0.25B to 18B parameters, with the 18B version being the largest such model, to the best of our knowledge. OWLS leverages up to 360K hours of public data across 150...
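A scaling law of the usual power-law form, L(N) = a * N^(-b), can be fitted by linear regression in log space. The numbers below are made up purely to show the procedure; they are not OWLS results.

```python
import numpy as np

# Toy power-law fit L(N) = a * N^(-b): linear regression in log space.
params = np.array([0.25e9, 0.5e9, 1e9, 2e9, 4e9])   # model sizes N
error  = np.array([12.0, 10.1, 8.6, 7.4, 6.3])      # e.g. dev-set WER (fabricated)

b, log_a = np.polyfit(np.log(params), np.log(error), 1)  # slope, intercept
a = np.exp(log_a)
print(f"L(N) ~ {a:.2f} * N^({b:.3f})")

# Such fits are typically used to extrapolate to larger models before
# training them (illustration only, not a claim about any real system):
print("predicted error at 18B params:", a * (18e9) ** b)
```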
Pre-training speech models on large volumes of data has achieved remarkable success. OpenAI Whisper is a multilingual multitask model trained on 680k hours of supervised speech data. It generalizes well to various speech recognition and translation benchmarks even in a zero-shot setup. However, the full pipeline for developing such models (from data collection to training) is not publicly accessible, which makes it difficult for researchers to further improve its performance and address training-related issues such as efficiency, robustness,...
Collecting sufficient labeled data for spoken language understanding (SLU) is expensive and time-consuming. Recent studies achieved promising results by using pre-trained models in low-resource scenarios. Inspired by this, we aim to ask: which (if any) pre-training strategies can improve performance across SLU benchmarks? To answer this question, we employ four types of pre-trained models and their combinations for SLU. We leverage self-supervised speech and language models (LM) pre-trained on large quantities of un-paired data to extract strong speech and text...
Brian Yan, Jiatong Shi, Yun Tang, Hirofumi Inaguma, Yifan Peng, Siddharth Dalmia, Peter Polak, Patrick Fernandes, Dan Berrebbi, Tomoki Hayashi, Xiaohui Zhang, Zhaoheng Ni, Moto Hira, Soumi Maiti, Juan Pino, Shinji Watanabe. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). 2023.
While human evaluation is the most reliable metric for evaluating speech generation systems, it is generally costly and time-consuming. Previous studies on automatic speech quality assessment address the problem by predicting human evaluation scores with machine learning models. However, they rely on supervised learning and thus suffer from high annotation costs and domain-shift problems. We propose SpeechLMScore, an unsupervised metric to evaluate generated speech using a speech language model. SpeechLMScore computes the average log-probability of a speech signal by mapping...
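The core computation, the average log-probability (1/T) * sum_t log p(u_t | u_<t) over a sequence of discrete units, is easy to state in code. The sketch below assumes the speech signal has already been mapped to units (e.g. by an SSL tokenizer) and fed through a unit language model; it is not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def speech_lm_score(logits: torch.Tensor, units: torch.Tensor) -> float:
    """Average log-probability of a discrete-unit sequence under a speech
    LM. `logits` holds the LM's next-token predictions at each position
    (shape (T, vocab)); `units` holds the actual units (shape (T,))."""
    log_probs = F.log_softmax(logits, dim=-1)           # (T, vocab)
    token_lp = log_probs.gather(1, units.unsqueeze(1))  # log p(u_t | u_<t)
    return token_lp.mean().item()

# Toy usage with random "LM" outputs over a 50-unit vocabulary:
T, V = 8, 50
score = speech_lm_score(torch.randn(T, V), torch.randint(0, V, (T,)))
print(score)  # higher (closer to 0) = more natural under the LM
```

No human labels enter this computation, which is what makes the metric unsupervised: the speech LM itself, trained on unlabeled audio, supplies the notion of naturalness.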
Automatic radiology report summarization is a crucial clinical task, whose key challenge is to maintain factual accuracy between produced summaries and ground truth findings. Existing research adopts reinforcement learning to directly optimize factual consistency metrics such as the CheXBert or RadGraph score. However, their decoding method, using greedy search or beam search, considers no factual consistency when picking the optimal candidate, leading to limited improvement. To address it, we propose a novel second-stage summarizing approach...
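In the spirit of a second-stage approach, one can generate several candidate summaries and let a factuality metric pick among them, as sketched below. `generate_candidates` and `consistency_score` are placeholders for a summarization model and a factual-consistency metric (e.g. a RadGraph-based score), not real APIs, and this is not necessarily the paper's exact procedure.

```python
# Second-stage selection sketch: instead of trusting the single
# beam-search output, score each candidate for factual consistency
# against the source findings and return the best one.

def rerank(findings: str, generate_candidates, consistency_score, n=8):
    candidates = generate_candidates(findings, num_return_sequences=n)
    # Greedy/beam decoding ignores factuality when it picks a candidate;
    # explicit scoring lets factual consistency drive the final choice.
    return max(candidates, key=lambda s: consistency_score(findings, s))
```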
In mammography, calcifications are one of the most common signs of breast cancer. Detection of such lesions is an active area of research for computer-aided diagnosis and machine learning algorithms. Due to limited numbers of positive cases, many supervised detection models suffer from overfitting and fail to generalize. We present a one-class, semi-supervised framework using a deep convolutional autoencoder trained with over 50,000 images from 11,000 negative-only cases. Since the model learned only normal parenchymal...
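The one-class principle is that an autoencoder trained only on normal tissue reconstructs normal tissue well and fails on anything it has not seen, so high reconstruction error flags candidate lesions. The PyTorch sketch below illustrates this; the architecture and threshold are invented for illustration and are much smaller than the paper's model.

```python
import torch
import torch.nn as nn

class TinyAE(nn.Module):
    """Minimal convolutional autoencoder for one-class detection."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1))

    def forward(self, x):
        return self.dec(self.enc(x))

def anomaly_map(model, patch, threshold=0.05):
    """Per-pixel squared reconstruction error; regions the normal-tissue
    model cannot explain (high error) are candidate calcifications."""
    with torch.no_grad():
        err = (model(patch) - patch) ** 2
    return err > threshold

# Usage: train TinyAE on negative-only patches, then call anomaly_map.
flags = anomaly_map(TinyAE(), torch.rand(1, 1, 64, 64))
print(flags.shape)  # (1, 1, 64, 64) boolean mask
```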
Brian Yan, Patrick Fernandes, Siddharth Dalmia, Jiatong Shi, Yifan Peng, Dan Berrebbi, Xinyi Wang, Graham Neubig, Shinji Watanabe. Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022). 2022.
End-to-end (E2E) automatic speech recognition (ASR) methods exhibit remarkable performance. However, since the performance of such methods is intrinsically linked to the context present in the training data, E2E-ASR methods do not perform as desired for unseen user contexts (e.g., technical terms, personal names, and playlists). Thus, E2E-ASR methods must be easily contextualizable by the user or developer. This paper proposes an attention-based contextual biasing method that can be customized using an editable phrase list (referred to as a bias list)....
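A minimal PyTorch sketch of attention over a bias list follows. It assumes each phrase has already been embedded into a single vector; a real system would also encode phrases from their subwords and handle the no-bias case, and all sizes here are illustrative rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class BiasListAttention(nn.Module):
    """Contextual biasing sketch: decoder states attend over embeddings
    of an editable phrase (bias) list, and the attended context is added
    back, nudging the model toward user-specified phrases."""

    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, dec_states, bias_embs):
        # dec_states: (batch, steps, d_model); bias_embs: (batch, phrases, d_model)
        ctx, weights = self.attn(dec_states, bias_embs, bias_embs)
        return dec_states + ctx, weights  # weights show which phrase fired

bias = torch.randn(1, 5, 256)   # embeddings of 5 user-edited phrases
dec = torch.randn(1, 7, 256)    # decoder states for 7 output steps
out, w = BiasListAttention()(dec, bias)
print(out.shape, w.shape)       # (1, 7, 256) (1, 7, 5)
```

Because the bias list enters only through this attention module, the user can edit the phrase list at inference time without retraining the underlying ASR model.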