- Speech and Audio Processing
- Music and Audio Processing
- Speech Recognition and Synthesis
- Hearing Loss and Rehabilitation
- Music Technology and Sound Studies
- Topic Modeling
- Speech and Dialogue Systems
- Diverse Musicological Studies
- Video Analysis and Summarization
- Natural Language Processing Techniques
- Machine Learning in Healthcare
- Acoustic Wave Phenomena Research
- Blind Source Separation Techniques
- Text and Document Classification Technologies
- Animal Vocal Communication and Behavior
- Indoor and Outdoor Localization Technologies
- Artificial Intelligence in Healthcare
- Optical Measurement and Interference Techniques
- Data Stream Mining Techniques
- Cardiovascular Health and Risk Factors
- Digital Radiography and Breast Imaging
- Breast Cancer Treatment Studies
- Stochastic Dynamics and Bifurcation
- Noise Effects and Management
- Artificial Intelligence in Healthcare and Education
Johns Hopkins University
2023-2025
Peking University
2019-2023
Harbin Institute of Technology
2023
Arizona State University
2023
Changchun University of Science and Technology
2023
Peking University Shenzhen Hospital
2022
China Medical University
2018
Generating sound effects that people want is an important topic. However, there are limited studies in this area for sound generation. In this study, we investigate generating sound conditioned on a text prompt and propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a token-decoder, and a vocoder. The framework first uses the token-decoder to transfer the text features extracted from the text encoder to a mel-spectrogram with the help of the VQ-VAE, and then the vocoder is used to transform the generated mel-spectrogram into...
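A minimal sketch of how such a text-to-sound pipeline could be wired together; every module internal, size, and name here is an illustrative assumption, not the paper's implementation:

```python
# Sketch of the described pipeline: text encoder -> token decoder -> VQ-VAE
# codebook/decoder -> mel-spectrogram (a vocoder would follow). Placeholder only.
import torch
import torch.nn as nn

class TextToSoundPipeline(nn.Module):
    def __init__(self, vocab_size=10000, d_model=256, n_codes=512, n_mels=80):
        super().__init__()
        # Text encoder: embeds the prompt into a sequence of features.
        self.text_encoder = nn.Sequential(
            nn.Embedding(vocab_size, d_model),
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
                num_layers=2),
        )
        # Token decoder: maps text features to discrete VQ-VAE code indices.
        self.token_decoder = nn.Linear(d_model, n_codes)
        # VQ-VAE codebook and decoder: turn code indices into a mel-spectrogram.
        self.codebook = nn.Embedding(n_codes, d_model)
        self.mel_decoder = nn.Linear(d_model, n_mels)

    def forward(self, text_ids):
        feats = self.text_encoder(text_ids)            # (B, T, d_model)
        logits = self.token_decoder(feats)             # (B, T, n_codes)
        codes = logits.argmax(dim=-1)                  # discrete mel tokens
        mel = self.mel_decoder(self.codebook(codes))   # (B, T, n_mels)
        return mel  # a neural vocoder would turn this into a waveform

mel = TextToSoundPipeline()(torch.randint(0, 10000, (1, 12)))
print(mel.shape)  # torch.Size([1, 12, 80])
```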
Transformer-based models attain excellent results and generalize well when trained on sufficient amounts of data. However, constrained by the limited data available in the audio domain, most transformer-based models for audio tasks are finetuned from models pre-trained in other domains (e.g. image), which has a notable gap with the audio domain. Other methods explore self-supervised learning approaches directly in the audio domain but currently do not perform well on downstream tasks. In this paper, we present a novel self-supervised learning method for transformer-based audio models, called masked...
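A toy sketch of the masked-prediction idea: mask random time-frequency patches of a spectrogram and train a network to reconstruct them, with the loss taken only on masked regions. The patch size, mask ratio, and the stand-in convolutional encoder are assumptions for illustration:

```python
import torch
import torch.nn as nn

def mask_patches(spec, patch=16, mask_ratio=0.75):
    """Zero out a random subset of non-overlapping patches; return the mask."""
    B, F, T = spec.shape
    mask = torch.rand(B, F // patch, T // patch) < mask_ratio
    mask = mask.repeat_interleave(patch, 1).repeat_interleave(patch, 2)
    return spec.masked_fill(mask, 0.0), mask

# Stand-in for a transformer encoder; any reconstruction network works here.
encoder = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(32, 1, 3, padding=1))

spec = torch.randn(4, 64, 256)                 # (batch, mel bins, frames)
masked, mask = mask_patches(spec)
recon = encoder(masked.unsqueeze(1)).squeeze(1)
# Loss is computed only on the masked patches, as in masked-prediction pre-training.
loss = ((recon - spec)[mask] ** 2).mean()
loss.backward()
```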
Convolutional neural networks (CNN) are one of the best-performing network architectures for environmental sound classification (ESC). Recently, temporal attention mechanisms have been used in CNNs to capture the useful information from the relevant time frames for audio classification, especially for weakly labelled data where the onset and offset times of the sound events are not applied. In these methods, however, the inherent spectral characteristics and variations are not explicitly exploited when obtaining the deep features. In this paper, we...
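For context, a common form of temporal attention pooling over frame-level CNN features under weak labels looks like the following (an illustrative sketch, not this paper's spectral extension):

```python
import torch
import torch.nn as nn

class TemporalAttentionPooling(nn.Module):
    def __init__(self, n_features, n_classes):
        super().__init__()
        self.cla = nn.Linear(n_features, n_classes)   # frame-level class scores
        self.att = nn.Linear(n_features, n_classes)   # frame-level attention logits

    def forward(self, x):                             # x: (batch, frames, features)
        scores = torch.sigmoid(self.cla(x))           # (B, T, C)
        weights = torch.softmax(self.att(x), dim=1)   # attention over time
        # Clip-level prediction: attention-weighted average of frame scores,
        # letting the model focus on the relevant time frames.
        return (weights * scores).sum(dim=1)          # (B, C)

pool = TemporalAttentionPooling(n_features=128, n_classes=10)
print(pool(torch.randn(2, 100, 128)).shape)  # torch.Size([2, 10])
```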
Current speaker verification models rely on supervised training with massive annotated data. But the collection of labeled utterances from multiple speakers is expensive and faces privacy issues. To open up an opportunity for utilizing unlabeled utterance data, our work exploits a contrastive self-supervised learning (CSSL) approach for the text-independent speaker verification task. The core principle of CSSL lies in minimizing the distance between the embeddings of augmented segments truncated from the same utterance, as well as maximizing those...
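A toy sketch of this contrastive objective: pull together the embeddings of two augmented segments from the same utterance and push apart those from different utterances, in the style of an NT-Xent loss (the temperature value is an assumption):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    """z1, z2: (batch, dim) embeddings of two segments per utterance."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature      # pairwise cosine similarities
    labels = torch.arange(z1.size(0))       # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

z1, z2 = torch.randn(8, 192), torch.randn(8, 192)  # e.g. speaker embeddings
print(contrastive_loss(z1, z2))
```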
An ideal multimodal agent should be aware of the quality of its input modalities. Recent advances have enabled large language models (LLMs) to incorporate auditory systems for handling various speech-related tasks. However, most audio LLMs remain unaware of the quality of the speech they process. This limitation arises because speech quality evaluation is typically excluded from multi-task training due to the lack of suitable datasets. To address this, we introduce the first natural language-based speech quality evaluation corpus, generated from authentic human ratings. In...
Recent advancements in large language models (LLMs) have transformed the field of question answering (QA). However, evaluating LLMs in the medical field is challenging due to the lack of standardized and comprehensive datasets. To address this gap, we introduce CMExam, sourced from the Chinese National Medical Licensing Examination. CMExam consists of 60K+ multiple-choice questions for objective evaluations, as well as solution explanations for model reasoning evaluation in an open-ended manner. For in-depth analyses of LLMs,...
In this paper, we present SpecAugment++, a novel data augmentation method for deep neural networks based acoustic scene classification (ASC). Different from other popular methods such as SpecAugment and mixup that only work on the input space, SpecAugment++ is applied to both the input space and the hidden space of the neural networks to enhance intermediate feature representations. For an intermediate hidden state, the augmentation techniques consist of masking blocks of frequency channels and masking blocks of time frames, which improve generalization by enabling the model to attend not only to the most discriminative...
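A sketch of the core idea: apply frequency and time masking not only to the input spectrogram but also to an intermediate hidden feature map. The mask widths and counts below are illustrative, not the paper's settings:

```python
import torch

def mask_freq_and_time(h, max_f=8, max_t=20):
    """h: (batch, channels, freq, time) input or hidden feature map."""
    B, C, Fb, T = h.shape
    f = torch.randint(0, max_f + 1, (1,)).item()
    f0 = torch.randint(0, max(Fb - f, 1), (1,)).item()
    h[:, :, f0:f0 + f, :] = 0.0          # mask a block of frequency channels
    t = torch.randint(0, max_t + 1, (1,)).item()
    t0 = torch.randint(0, max(T - t, 1), (1,)).item()
    h[:, :, :, t0:t0 + t] = 0.0          # mask a block of time frames
    return h

hidden = torch.randn(4, 64, 32, 250)     # e.g. output of an intermediate CNN layer
hidden = mask_freq_and_time(hidden)      # same masking as applied to the input
```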
While Machine Comprehension (MC) has attracted extensive research interest in recent years, existing approaches mainly belong to the category of the Machine Reading Comprehension task, which mines textual inputs (paragraphs and questions) to predict answers (choices or text spans). However, there are a lot of MC tasks that accept audio input in addition to text input, e.g. the English listening comprehension test. In this paper, we target the problem of Audio-Oriented Multimodal Machine Comprehension, whose goal is to answer questions based on the given...
Although the prototypical network (ProtoNet) has proved to be an effective method for few-shot sound event detection, two problems still exist. Firstly, the small-scale support set is insufficient, so the class prototypes may not represent the class center accurately. Secondly, the feature extractor is task-agnostic (or class-agnostic): it is trained with base-class data and directly applied to unseen-class data. To address these issues, we present a novel mutual learning framework with transductive learning, which aims at...
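For context, a minimal prototypical-network step: class prototypes are the means of support embeddings, and queries are classified by their distance to the prototypes. The mutual-learning and transductive components of the paper are not shown:

```python
import torch

def prototype_logits(support, support_labels, queries, n_classes):
    """support: (N, D), queries: (M, D); returns (M, n_classes) logits."""
    protos = torch.stack([support[support_labels == c].mean(0)
                          for c in range(n_classes)])          # (C, D)
    # Negative squared Euclidean distance serves as the classification logit.
    return -torch.cdist(queries, protos) ** 2

support = torch.randn(10, 64)                       # 5-way, 2-shot support set
labels = torch.arange(5).repeat_interleave(2)
queries = torch.randn(3, 64)
print(prototype_logits(support, labels, queries, 5).shape)  # torch.Size([3, 5])
```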
As a multi-label classification task, audio tagging aims to predict the presence or absence of certain sound events in an audio recording. Existing works do not explicitly consider the probabilities of co-occurrences between sound events, which is termed as label dependencies in this study. To address this issue, we propose to model the label dependencies via a graph-based method, where each node of the graph represents a label. An adjacency matrix is constructed by mining the statistical relations between labels to represent the graph structure information, and a graph convolutional...
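A sketch of this construction: build an adjacency matrix from label co-occurrence statistics and propagate label embeddings through one graph-convolution step. The threshold, sizes, and normalization details are illustrative assumptions:

```python
import torch

labels = torch.randint(0, 2, (1000, 10)).float()    # (clips, classes) multi-hot
cooc = labels.t() @ labels                          # co-occurrence counts
p = cooc / cooc.diag().clamp(min=1).unsqueeze(1)    # P(label_j | label_i)
adj = (p > 0.3).float()                             # binarize away weak edges

# One GCN-style propagation: A_hat @ X @ W with symmetric degree normalization.
deg = adj.sum(1)
a_hat = adj / (deg.sqrt().unsqueeze(1) * deg.sqrt().unsqueeze(0)).clamp(min=1e-6)
X = torch.randn(10, 64)                             # initial label embeddings
W = torch.randn(64, 64)
label_features = torch.relu(a_hat @ X @ W)          # (classes, dim)
```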
Automated audio captioning (AAC) has developed rapidly in recent years, involving acoustic signal processing and natural language processing to generate human-readable sentences for audio clips. The current models are generally based on the neural encoder-decoder architecture, and their decoder mainly uses the acoustic information that is extracted from the CNN-based encoder. However, they have ignored the semantic information that could help the AAC model generate meaningful descriptions. This paper proposes a novel approach to automated audio captioning by incorporating semantic information...
Recently, convolutional neural networks (CNN) have achieved state-of-the-art performance in the acoustic scene classification (ASC) task. The audio data is often transformed into two-dimensional spectrogram representations, which are then fed to the networks. In this paper, we study the problem of efficiently taking advantage of different spectrogram representations through discriminative processing strategies. There are two main contributions. The first contribution is exploring the impact of the combination of multiple spectrogram representations at different stages,...
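A toy illustration of one such combination strategy: two time-frequency views of the same clip stacked as CNN input channels (early fusion; the paper studies combinations at several stages, so this is only an assumed example):

```python
import torch
import torch.nn as nn

logmel = torch.randn(4, 1, 64, 250)     # (batch, 1, bins, frames)
second = torch.randn(4, 1, 64, 250)     # a second representation, same shape

fused = torch.cat([logmel, second], dim=1)         # early fusion as channels
cnn = nn.Conv2d(2, 32, kernel_size=3, padding=1)   # CNN consumes both views
features = cnn(fused)
```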
In this paper, we exploit an effective way to leverage contextual information to improve speech dereverberation performance in real-world reverberant environments. We propose a temporal-contextual attention approach on the deep neural network (DNN) for environment-aware speech dereverberation, which can adaptively attend to the contextual information. More specifically, a FullBand based Temporal Attention approach (FTA) is proposed, which models the correlations between the fullband information of the context frames. In addition, considering the difference in the attenuation...
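A sketch of attending over fullband context frames: for each frame, compute attention weights over the other frames and form a context-weighted summary. Layer sizes are assumptions, and causal masking is omitted for brevity:

```python
import torch
import torch.nn as nn

class FullbandTemporalAttention(nn.Module):
    def __init__(self, n_freq=257, d_attn=64):
        super().__init__()
        self.q = nn.Linear(n_freq, d_attn)
        self.k = nn.Linear(n_freq, d_attn)

    def forward(self, frames):            # frames: (batch, time, freq)
        q, k = self.q(frames), self.k(frames)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)
        return attn @ frames              # context-enhanced fullband features

fta = FullbandTemporalAttention()
print(fta(torch.randn(2, 100, 257)).shape)  # torch.Size([2, 100, 257])
```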
Weakly labelled audio tagging aims to predict the classes of sound events within an audio clip, where the onset and offset times of the sound events are not provided. Previous works have used the multiple instance learning (MIL) framework and exploited the information of the whole clip by MIL pooling functions. However, detailed information about sound events, such as their durations, may not be considered under this framework. To address this issue, we propose a novel two-stream framework for exploiting the global and local information of sound events. The global stream is used to analyze the whole clip in order to capture the local clips that need...
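For context, common MIL pooling functions that map frame-level probabilities to a clip-level prediction under weak labels (illustrative only; the two-stream design itself is not shown):

```python
import torch

frame_probs = torch.rand(4, 100, 10)           # (batch, frames, classes)

clip_max = frame_probs.max(dim=1).values       # max pooling: most salient frame
clip_mean = frame_probs.mean(dim=1)            # average pooling: whole clip

# Attention-style pooling weights each frame by its own probability, a simple
# compromise between the global (whole-clip) and local (salient-event) views.
w = frame_probs / frame_probs.sum(dim=1, keepdim=True).clamp(min=1e-8)
clip_attn = (w * frame_probs).sum(dim=1)
```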
Convolutional neural networks (CNN) have played an important role in Audio Event Classification (AEC). Both 1D-CNN and 2D-CNN methods have been applied to improve the classification accuracy of AEC, and there are many factors affecting the performance of CNN-based models. In this paper, we study different CNN factors for AEC, including the sampling rate, signal segmentation methods, window size, mel bins and filter size. The segmentation method of audio events is one of them. It may lead to the overfitting problem because audio events usually happen...
It is well known that the mismatch between the training (source) and test (target) data distributions will significantly decrease the performance of acoustic scene classification (ASC) systems. To address this issue, domain adaptation (DA) is one solution, and many unsupervised DA methods have been proposed. These methods focus on a scenario of adapting a single source domain to a single target domain. However, in practice we may face the problem that the training data comes from multiple source domains. This can be addressed by producing one model per source domain, but that is too costly. In this paper, we propose...
Target sound extraction (TSE) aims to extract the part of a target sound event class from a mixture audio with multiple sound events. Previous works mainly focus on the problems of weakly-labelled data and jointly learning base and new classes; however, none of them considers the onset and offset times of the target sound event, which have been emphasized in auditory scene analysis. In this paper, we study how to utilize such timestamp information to help extraction via a target sound detection network and a target-weighted time-frequency loss function. More specifically, we use the result of target sound detection (TSD) as...
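A sketch of what a detection-weighted time-frequency loss could look like: frames that the target sound detector marks as active are up-weighted in the extraction loss. The specific weighting scheme below is an illustrative assumption, not the paper's formulation:

```python
import torch

def target_weighted_tf_loss(est, ref, active, alpha=2.0):
    """est, ref: (batch, freq, time) magnitudes; active: (batch, time) in {0,1}."""
    w = 1.0 + (alpha - 1.0) * active.unsqueeze(1)   # boost detected frames
    return (w * (est - ref) ** 2).mean()

est, ref = torch.rand(2, 257, 100), torch.rand(2, 257, 100)
active = (torch.rand(2, 100) > 0.5).float()          # per-frame TSD output
print(target_weighted_tf_loss(est, ref, active))
```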
Automated audio captioning (AAC) aims at generating natural language descriptions for an audio clip. Due to the difficulty and high cost of annotating audio-caption pairs, the existing datasets are of a very small scale, which leads to the unsatisfactory performance of AAC models. One intuitive and effective solution is to augment the training data to boost performance, instead of collecting more data. To this end, we propose an online data augmentation method (FeatureCut), incorporated into the encoder-decoder framework, to enable the decoder to fully make use of the acoustic features in...
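A sketch of the FeatureCut idea as described: randomly zero out a contiguous chunk of the encoder's output features during training, so the decoder must rely on the remaining acoustic context. The chunk-size bound is an illustrative assumption:

```python
import torch

def feature_cut(enc_out, max_cut=0.2):
    """enc_out: (batch, time, dim) encoder features; cut a random time chunk."""
    B, T, D = enc_out.shape
    cut = int(T * max_cut * torch.rand(1).item())
    if cut > 0:
        start = torch.randint(0, T - cut + 1, (1,)).item()
        enc_out = enc_out.clone()
        enc_out[:, start:start + cut, :] = 0.0   # drop a contiguous feature chunk
    return enc_out

augmented = feature_cut(torch.randn(8, 50, 256))
```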