- Speech and Audio Processing
- Music and Audio Processing
- Speech Recognition and Synthesis
- Hearing Loss and Rehabilitation
- Music Technology and Sound Studies
- Topic Modeling
- Speech and Dialogue Systems
- Diverse Musicological Studies
- Video Analysis and Summarization
- Natural Language Processing Techniques
- Machine Learning in Healthcare
- Acoustic Wave Phenomena Research
- Blind Source Separation Techniques
- Text and Document Classification Technologies
- Animal Vocal Communication and Behavior
- Indoor and Outdoor Localization Technologies
- Artificial Intelligence in Healthcare
- Optical Measurement and Interference Techniques
- Data Stream Mining Techniques
- Cardiovascular Health and Risk Factors
- Digital Radiography and Breast Imaging
- Breast Cancer Treatment Studies
- Stochastic Dynamics and Bifurcation
- Noise Effects and Management
- Artificial Intelligence in Healthcare and Education
Johns Hopkins University
2023-2025
Peking University
2019-2023
Harbin Institute of Technology
2023
Arizona State University
2023
Changchun University of Science and Technology
2023
Peking University Shenzhen Hospital
2022
China Medical University
2018
Generating sound effects that people want is an important topic. However, there are limited studies in this area for sound generation. In this study, we investigate generating sound conditioned on a text prompt and propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a token-decoder, and a vocoder. The framework first uses the token-decoder to transfer the text features extracted from the text encoder to a mel-spectrogram with the help of the VQ-VAE, and then the vocoder is used to transform the generated mel-spectrogram into...
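A minimal sketch of how such a text-to-sound pipeline could be wired together; every module internal, size, and name here is an illustrative assumption, not the paper's implementation:

```python
# Sketch of the described pipeline: text encoder -> token decoder -> VQ-VAE
# codebook/decoder -> mel-spectrogram (a vocoder would follow). Placeholder only.
import torch
import torch.nn as nn

class TextToSoundPipeline(nn.Module):
    def __init__(self, vocab_size=10000, d_model=256, n_codes=512, n_mels=80):
        super().__init__()
        # Text encoder: embeds the prompt into a sequence of features.
        self.text_encoder = nn.Sequential(
            nn.Embedding(vocab_size, d_model),
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
                num_layers=2),
        )
        # Token decoder: maps text features to discrete VQ-VAE code indices.
        self.token_decoder = nn.Linear(d_model, n_codes)
        # VQ-VAE codebook and decoder: turn code indices into a mel-spectrogram.
        self.codebook = nn.Embedding(n_codes, d_model)
        self.mel_decoder = nn.Linear(d_model, n_mels)

    def forward(self, text_ids):
        feats = self.text_encoder(text_ids)            # (B, T, d_model)
        logits = self.token_decoder(feats)             # (B, T, n_codes)
        codes = logits.argmax(dim=-1)                  # discrete mel tokens
        mel = self.mel_decoder(self.codebook(codes))   # (B, T, n_mels)
        return mel  # a neural vocoder would turn this into a waveform

mel = TextToSoundPipeline()(torch.randint(0, 10000, (1, 12)))
print(mel.shape)  # torch.Size([1, 12, 80])
```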
Transformer-based models attain excellent results and generalize well when trained on sufficient amounts of data. However, constrained by the limited data available in the audio domain, most transformer-based models for audio tasks are finetuned from models pre-trained in other domains (e.g. image), which has a notable gap with the audio domain. Other methods explore self-supervised learning approaches directly in the audio domain but currently do not perform well on downstream tasks. In this paper, we present a novel self-supervised learning method for transformer-based audio models, called masked...
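A toy sketch of the masked-prediction idea: mask random time-frequency patches of a spectrogram and train a network to reconstruct them, with the loss taken only on masked regions. The patch size, mask ratio, and the stand-in convolutional encoder are assumptions for illustration:

```python
import torch
import torch.nn as nn

def mask_patches(spec, patch=16, mask_ratio=0.75):
    """Zero out a random subset of non-overlapping patches; return the mask."""
    B, F, T = spec.shape
    mask = torch.rand(B, F // patch, T // patch) < mask_ratio
    mask = mask.repeat_interleave(patch, 1).repeat_interleave(patch, 2)
    return spec.masked_fill(mask, 0.0), mask

# Stand-in for a transformer encoder; any reconstruction network works here.
encoder = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(32, 1, 3, padding=1))

spec = torch.randn(4, 64, 256)                 # (batch, mel bins, frames)
masked, mask = mask_patches(spec)
recon = encoder(masked.unsqueeze(1)).squeeze(1)
# Loss is computed only on the masked patches, as in masked-prediction pre-training.
loss = ((recon - spec)[mask] ** 2).mean()
loss.backward()
```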
Convolutional neural networks (CNN) are one of the best-performing network architectures for environmental sound classification (ESC). Recently, temporal attention mechanisms have been used in CNNs to capture the useful information from the relevant time frames for audio classification, especially for weakly labelled data where the onset and offset times of the sound events are not applied. In these methods, however, the inherent spectral characteristics and variations are not explicitly exploited when obtaining the deep features. In this paper, we...
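For context, a common form of temporal attention pooling over frame-level CNN features under weak labels looks like the following (an illustrative sketch, not this paper's spectral extension):

```python
import torch
import torch.nn as nn

class TemporalAttentionPooling(nn.Module):
    def __init__(self, n_features, n_classes):
        super().__init__()
        self.cla = nn.Linear(n_features, n_classes)   # frame-level class scores
        self.att = nn.Linear(n_features, n_classes)   # frame-level attention logits

    def forward(self, x):                             # x: (batch, frames, features)
        scores = torch.sigmoid(self.cla(x))           # (B, T, C)
        weights = torch.softmax(self.att(x), dim=1)   # attention over time
        # Clip-level prediction: attention-weighted average of frame scores,
        # letting the model focus on the relevant time frames.
        return (weights * scores).sum(dim=1)          # (B, C)

pool = TemporalAttentionPooling(n_features=128, n_classes=10)
print(pool(torch.randn(2, 100, 128)).shape)  # torch.Size([2, 10])
```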
Current speaker verification models rely on supervised training with massive annotated data. But the collection of labeled utterances from multiple speakers is expensive and faces privacy issues. To open up an opportunity for utilizing unlabeled utterance data, our work exploits a contrastive self-supervised learning (CSSL) approach for the text-independent speaker verification task. The core principle of CSSL lies in minimizing the distance between the embeddings of augmented segments truncated from the same utterance, as well as maximizing those...
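A toy sketch of this contrastive objective: pull together the embeddings of two augmented segments from the same utterance and push apart those from different utterances, in the style of an NT-Xent loss (the temperature value is an assumption):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    """z1, z2: (batch, dim) embeddings of two segments per utterance."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature      # pairwise cosine similarities
    labels = torch.arange(z1.size(0))       # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

z1, z2 = torch.randn(8, 192), torch.randn(8, 192)  # e.g. speaker embeddings
print(contrastive_loss(z1, z2))
```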
An ideal multimodal agent should be aware of the quality of its input modalities. Recent advances have enabled large language models (LLMs) to incorporate auditory systems for handling various speech-related tasks. However, most audio LLMs remain unaware of the quality of the speech they process. This limitation arises because speech quality evaluation is typically excluded from multi-task training due to the lack of suitable datasets. To address this, we introduce the first natural language-based speech quality evaluation corpus, generated from authentic human ratings. In...
Recent advancements in large language models (LLMs) have transformed the field of question answering (QA). However, evaluating LLMs in the medical field is challenging due to the lack of standardized and comprehensive datasets. To address this gap, we introduce CMExam, sourced from the Chinese National Medical Licensing Examination. CMExam consists of 60K+ multiple-choice questions for objective evaluations, as well as solution explanations for model reasoning evaluation in an open-ended manner. For in-depth analyses of LLMs,...
In this paper, we present SpecAugment++, a novel data augmentation method for deep neural networks based acoustic scene classification (ASC). Different from other popular methods such as SpecAugment and mixup that only work on the input space, SpecAugment++ is applied to both the input space and the hidden space of the neural networks to enhance intermediate feature representations. For an intermediate hidden state, the augmentation techniques consist of masking blocks of frequency channels and masking blocks of time frames, which improve generalization by enabling the model to attend not only to the most discriminative...
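A sketch of the core idea: apply frequency and time masking not only to the input spectrogram but also to an intermediate hidden feature map. The mask widths and counts below are illustrative, not the paper's settings:

```python
import torch

def mask_freq_and_time(h, max_f=8, max_t=20):
    """h: (batch, channels, freq, time) input or hidden feature map."""
    B, C, Fb, T = h.shape
    f = torch.randint(0, max_f + 1, (1,)).item()
    f0 = torch.randint(0, max(Fb - f, 1), (1,)).item()
    h[:, :, f0:f0 + f, :] = 0.0          # mask a block of frequency channels
    t = torch.randint(0, max_t + 1, (1,)).item()
    t0 = torch.randint(0, max(T - t, 1), (1,)).item()
    h[:, :, :, t0:t0 + t] = 0.0          # mask a block of time frames
    return h

hidden = torch.randn(4, 64, 32, 250)     # e.g. output of an intermediate CNN layer
hidden = mask_freq_and_time(hidden)      # same masking as applied to the input
```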
While Machine Comprehension (MC) has attracted extensive research interest in recent years, existing approaches mainly belong to the category of the Machine Reading Comprehension task, which mines textual inputs (paragraphs and questions) to predict answers (choices or text spans). However, there are a lot of MC tasks that accept audio input in addition to text input, e.g. the English listening comprehension test. In this paper, we target the problem of Audio-Oriented Multimodal Machine Comprehension, whose goal is to answer questions based on the given...
Although the prototypical network (ProtoNet) has proved to be an effective method for few-shot sound event detection, two problems still exist. Firstly, the small-scale support set is insufficient, so the class prototypes may not represent the class center accurately. Secondly, the feature extractor is task-agnostic (or class-agnostic): it is trained with base-class data and directly applied to unseen-class data. To address these issues, we present a novel mutual learning framework with transductive learning, which aims at...
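For context, a minimal prototypical-network step: class prototypes are the means of support embeddings, and queries are classified by their distance to the prototypes. The mutual-learning and transductive components of the paper are not shown:

```python
import torch

def prototype_logits(support, support_labels, queries, n_classes):
    """support: (N, D), queries: (M, D); returns (M, n_classes) logits."""
    protos = torch.stack([support[support_labels == c].mean(0)
                          for c in range(n_classes)])          # (C, D)
    # Negative squared Euclidean distance serves as the classification logit.
    return -torch.cdist(queries, protos) ** 2

support = torch.randn(10, 64)                       # 5-way, 2-shot support set
labels = torch.arange(5).repeat_interleave(2)
queries = torch.randn(3, 64)
print(prototype_logits(support, labels, queries, 5).shape)  # torch.Size([3, 5])
```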
As a multi-label classification task, audio tagging aims to predict the presence or absence of certain sound events in an audio recording. Existing works do not explicitly consider the probabilities of co-occurrences between sound events, which is termed as label dependencies in this study. To address this issue, we propose to model the label dependencies via a graph-based method, where each node of the graph represents a label. An adjacency matrix is constructed by mining the statistical relations between labels to represent the graph structure information, and a graph convolutional...
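A sketch of this construction: build an adjacency matrix from label co-occurrence statistics and propagate label embeddings through one graph-convolution step. The threshold, sizes, and normalization details are illustrative assumptions:

```python
import torch

labels = torch.randint(0, 2, (1000, 10)).float()    # (clips, classes) multi-hot
cooc = labels.t() @ labels                          # co-occurrence counts
p = cooc / cooc.diag().clamp(min=1).unsqueeze(1)    # P(label_j | label_i)
adj = (p > 0.3).float()                             # binarize away weak edges

# One GCN-style propagation: A_hat @ X @ W with symmetric degree normalization.
deg = adj.sum(1)
a_hat = adj / (deg.sqrt().unsqueeze(1) * deg.sqrt().unsqueeze(0)).clamp(min=1e-6)
X = torch.randn(10, 64)                             # initial label embeddings
W = torch.randn(64, 64)
label_features = torch.relu(a_hat @ X @ W)          # (classes, dim)
```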
Automated audio captioning (AAC) has developed rapidly in recent years, involving acoustic signal processing and natural language processing to generate human-readable sentences for audio clips. The current models are generally based on the neural encoder-decoder architecture, and their decoder mainly uses the acoustic information that is extracted from the CNN-based encoder. However, they have ignored the semantic information that could help the AAC model generate meaningful descriptions. This paper proposes a novel approach to automated audio captioning by incorporating semantic information...
Recently, convolutional neural networks (CNN) have achieved state-of-the-art performance in the acoustic scene classification (ASC) task. The audio data is often transformed into two-dimensional spectrogram representations, which are then fed to the networks. In this paper, we study the problem of efficiently taking advantage of different spectrogram representations through discriminative processing strategies. There are two main contributions. The first contribution is exploring the impact of the combination of multiple spectrogram representations at different stages,...
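A toy illustration of one such combination strategy: two time-frequency views of the same clip stacked as CNN input channels (early fusion; the paper studies combinations at several stages, so this is only an assumed example):

```python
import torch
import torch.nn as nn

logmel = torch.randn(4, 1, 64, 250)     # (batch, 1, bins, frames)
second = torch.randn(4, 1, 64, 250)     # a second representation, same shape

fused = torch.cat([logmel, second], dim=1)         # early fusion as channels
cnn = nn.Conv2d(2, 32, kernel_size=3, padding=1)   # CNN consumes both views
features = cnn(fused)
```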
In this paper, we exploit an effective way to leverage contextual information to improve speech dereverberation performance in real-world reverberant environments. We propose a temporal-contextual attention approach on the deep neural network (DNN) for environment-aware speech dereverberation, which can adaptively attend to the contextual information. More specifically, a FullBand based Temporal Attention approach (FTA) is proposed, which models the correlations between the fullband information of the context frames. In addition, considering the difference in the attenuation...
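A sketch of attending over fullband context frames: for each frame, compute attention weights over the other frames and form a context-weighted summary. Layer sizes are assumptions, and causal masking is omitted for brevity:

```python
import torch
import torch.nn as nn

class FullbandTemporalAttention(nn.Module):
    def __init__(self, n_freq=257, d_attn=64):
        super().__init__()
        self.q = nn.Linear(n_freq, d_attn)
        self.k = nn.Linear(n_freq, d_attn)

    def forward(self, frames):            # frames: (batch, time, freq)
        q, k = self.q(frames), self.k(frames)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)
        return attn @ frames              # context-enhanced fullband features

fta = FullbandTemporalAttention()
print(fta(torch.randn(2, 100, 257)).shape)  # torch.Size([2, 100, 257])
```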
Weakly labelled audio tagging aims to predict the classes of sound events within an audio clip, where the onset and offset times of the sound events are not provided. Previous works have used the multiple instance learning (MIL) framework and exploited the information of the whole clip by MIL pooling functions. However, detailed information about sound events, such as their durations, may not be considered under this framework. To address this issue, we propose a novel two-stream framework for exploiting the global and local information of sound events. The global stream is used to analyze the whole clip in order to capture the local clips that need...
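For context, common MIL pooling functions that map frame-level probabilities to a clip-level prediction under weak labels (illustrative only; the two-stream design itself is not shown):

```python
import torch

frame_probs = torch.rand(4, 100, 10)           # (batch, frames, classes)

clip_max = frame_probs.max(dim=1).values       # max pooling: most salient frame
clip_mean = frame_probs.mean(dim=1)            # average pooling: whole clip

# Attention-style pooling weights each frame by its own probability, a simple
# compromise between the global (whole-clip) and local (salient-event) views.
w = frame_probs / frame_probs.sum(dim=1, keepdim=True).clamp(min=1e-8)
clip_attn = (w * frame_probs).sum(dim=1)
```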
Convolutional neural networks (CNN) have played an important role in Audio Event Classification (AEC). Both 1D-CNN and 2D-CNN methods have been applied to improve the classification accuracy of AEC, and there are many factors affecting the performance of CNN-based models. In this paper, we study different CNN factors for AEC, including the sampling rate, signal segmentation methods, window size, mel bins and filter size. The segmentation method of audio events is one of them. It may lead to the overfitting problem because audio events usually happen...
It is well known that the mismatch between the training (source) and test (target) data distributions will significantly decrease the performance of acoustic scene classification (ASC) systems. To address this issue, domain adaptation (DA) is one solution, and many unsupervised DA methods have been proposed. These methods focus on a scenario of adapting a single source domain to a single target domain. However, in practice we may face the problem that the training data comes from multiple source domains. This can be addressed by producing one model per source domain, but that is too costly. In this paper, we propose...
Target sound extraction (TSE) aims to extract the part of a target sound event class from a mixture audio with multiple sound events. Previous works mainly focus on the problems of weakly-labelled data and jointly learning base and new classes; however, none of them considers the onset and offset times of the target sound event, which have been emphasized in auditory scene analysis. In this paper, we study how to utilize such timestamp information to help extraction via a target sound detection network and a target-weighted time-frequency loss function. More specifically, we use the result of target sound detection (TSD) as...
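A sketch of what a detection-weighted time-frequency loss could look like: frames that the target sound detector marks as active are up-weighted in the extraction loss. The specific weighting scheme below is an illustrative assumption, not the paper's formulation:

```python
import torch

def target_weighted_tf_loss(est, ref, active, alpha=2.0):
    """est, ref: (batch, freq, time) magnitudes; active: (batch, time) in {0,1}."""
    w = 1.0 + (alpha - 1.0) * active.unsqueeze(1)   # boost detected frames
    return (w * (est - ref) ** 2).mean()

est, ref = torch.rand(2, 257, 100), torch.rand(2, 257, 100)
active = (torch.rand(2, 100) > 0.5).float()          # per-frame TSD output
print(target_weighted_tf_loss(est, ref, active))
```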
Automated audio captioning (AAC) aims at generating natural language descriptions for an audio clip. Due to the difficulty and high cost of annotating audio-caption pairs, the existing datasets are of a very small scale, which leads to the unsatisfactory performance of AAC models. One intuitive and effective solution is to augment the training data to boost performance, instead of collecting more data. To this end, we propose an online data augmentation method (FeatureCut), incorporated into the encoder-decoder framework, to enable the decoder to fully make use of the acoustic features in...
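A sketch of the FeatureCut idea as described: randomly zero out a contiguous chunk of the encoder's output features during training, so the decoder must rely on the remaining acoustic context. The chunk-size bound is an illustrative assumption:

```python
import torch

def feature_cut(enc_out, max_cut=0.2):
    """enc_out: (batch, time, dim) encoder features; cut a random time chunk."""
    B, T, D = enc_out.shape
    cut = int(T * max_cut * torch.rand(1).item())
    if cut > 0:
        start = torch.randint(0, T - cut + 1, (1,)).item()
        enc_out = enc_out.clone()
        enc_out[:, start:start + cut, :] = 0.0   # drop a contiguous feature chunk
    return enc_out

augmented = feature_cut(torch.randn(8, 50, 256))
```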