Guanglu Wan

ORCID: 0009-0003-1061-3724
Research Areas
  • Speech Recognition and Synthesis
  • Natural Language Processing Techniques
  • Topic Modeling
  • Speech and dialogue systems
  • Music and Audio Processing
  • Speech and Audio Processing
  • Sentiment Analysis and Opinion Mining
  • Domain Adaptation and Few-Shot Learning
  • Multimodal Machine Learning Applications
  • Adversarial Robustness in Machine Learning
  • Advanced Text Analysis Techniques
  • Machine Learning in Healthcare
  • Web Data Mining and Analysis
  • Emotion and Mood Recognition
  • Advancements in Photolithography Techniques
  • Spam and Phishing Detection
  • Robotics and Automated Systems
  • Advanced Data Storage Technologies
  • Fault Detection and Control Systems
  • Text and Document Classification Technologies
  • Anomaly Detection Techniques and Applications
  • Imbalanced Data Classification Techniques
  • Service-Oriented Architecture and Web Services
  • AI in Service Interactions
  • Advancements in Semiconductor Devices and Circuit Design

Meizu (China)
2021-2023

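Data efficient voice cloning aims at synthesizing target speaker's voice with only a few enrollment samples at hand. To this end, speaker adaptation and speaker encoding are two typical methods based on a base model trained from multiple speakers. The former uses a small set of target speaker data to transfer the multi-speaker model to the target speaker's voice through direct model update, while in the latter, only a few seconds of the target speaker's audio directly goes through an extra speaker encoding model along with the multi-speaker model to synthesize the target speaker's voice without model update. Nevertheless, the two methods need clean target speaker data. However, the samples provided by the user may inevitably contain acoustic noise...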

10.21437/interspeech.2020-2530 article EN Interspeech 2020 2020-10-25

Dialogue topic segmentation is a challenging task in which dialogues are split into segments with pre-defined topics. Existing works on topic segmentation adopt a two-stage paradigm, including text segmentation and segment labeling. However, such methods tend to focus on the local context in segmentation, and the inter-segment dependency is not well captured. Besides, the ambiguity and labeling noise in dialogue segment bounds bring further challenges to existing models. In this work, we propose the Parallel Extraction Network with Neighbor Smoothing (PEN-NS) to address...

10.1145/3477495.3531817 article EN Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval 2022-07-06

Pre-trained language models have achieved noticeable performance on the intent detection task. However, due to assigning an identical weight to each sample, they suffer from the overfitting of simple samples and the failure to learn complex samples well. To handle this problem, we propose a density-based dynamic curriculum learning model. Our model defines each sample's difficulty level according to its eigenvector's density. In this way, we exploit the overall distribution of all samples' eigenvectors simultaneously. Then we apply...

10.1145/3459637.3482082 preprint EN 2021-10-26

In the existing cross-speaker style transfer task, a source speaker with multi-style recordings is necessary to provide the style for a target speaker. However, it is hard for one speaker to express all expected styles. In this paper, a more general task, which is to produce expressive speech by combining any styles and timbres from a multi-speaker corpus in which each speaker has a unique style, is proposed. To realize this task, a novel method is proposed. This method is a Tacotron2-based framework but with a fine-grained text-based prosody predicting module and a speaker identity controller. Experiments...

10.1109/iscslp57327.2022.10038056 article EN 2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP) 2022-12-11

Given a long text, the summarization system aims to obtain a shorter highlight while keeping the important information of the original text. For customer service, the summaries of most dialogues between an agent and a user focus on several fixed key points, such as the user's question, the user's purpose, the agent's solution, and so on. Traditional extractive methods have difficulty extracting all predefined key points exactly. Furthermore, there is a lack of large-scale and high-quality datasets containing key points. In order to solve the above challenges,...

10.1145/3404835.3463046 article EN Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval 2021-07-11

Very deep models for speaker recognition (SR) have demonstrated remarkable performance improvement in recent research. However, it is impractical to deploy these models for on-device applications with constrained computational resources. On the other hand, light-weight models are highly desired in practice despite their sub-optimal performance. This research aims to improve light-weight SR models through large-scale label-free knowledge distillation (KD). Existing KD approaches for SR typically require speaker labels to learn task-specific knowledge,...

10.1109/iscslp57327.2022.10038276 article EN 2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP) 2022-12-11

State-of-the-art speaker verification (SV) systems use a back-end model to score the similarity of speaker embeddings extracted from a neural network model. The commonly used back-end models are cosine scoring and probabilistic linear discriminant analysis (PLDA) scoring. With the recently developed neural embeddings, the theoretically more appealing PLDA approach is found to have no advantage against, or even to be inferior to, the simple cosine scoring in terms of SV system performance. This paper presents an investigation on the relation between the two...

10.21437/interspeech.2022-10021 article EN Interspeech 2022 2022-09-16

Speaker verification can be formulated as a representation learning task, where speaker-discriminative embeddings are extracted from utterances of variable lengths. Momentum Contrast (MoCo) is a recently proposed unsupervised representation learning framework, and has shown its effectiveness for learning good feature representation for downstream vision tasks. In this work, we apply MoCo to learn speaker embedding from speech segments. We explore MoCo for both unsupervised learning and pretraining settings. In the unsupervised scenario, the embedding is learned by MoCo from audio data without using any speaker specific information....

10.48550/arxiv.2001.01986 preprint EN other-oa arXiv (Cornell University) 2020-01-01

Utilizing text-only data with an external language model (ELM) in end-to-end RNN-Transducer (RNN-T) for speech recognition is challenging. Recently, a class of methods such as density ratio (DR) and internal language model estimation (ILME) have been developed, outperforming the classic shallow fusion (SF) method. The basic idea behind these methods is that the RNN-T posterior should first subtract the implicitly learned internal language model (ILM) prior, in order to integrate the ELM. While recent studies suggest that the RNN-T only learns some low-order language model information, the DR...

10.21437/interspeech.2022-10576 article EN Interspeech 2022 2022-09-16

The CTC model has been widely applied to many application scenarios because of its simple structure, excellent performance, and fast inference speed. There are many peaks in the probability distribution predicted by CTC models, and each peak represents a non-blank token. The recognition latency of CTC models can be reduced by encouraging the model to predict peaks earlier. Existing methods to reduce latency require modifying the transition relationship between tokens in the forward-backward algorithm, and the gradient calculation. Some of these methods even depend on the forced...

10.1109/icassp49357.2023.10095377 article EN ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023-05-05

Sentiment analysis is a fundamental task, and structure sentiment analysis (SSA) is an important component of sentiment analysis. However, traditional SSA is suffering from some important issues: (1) lack of interactive knowledge of different languages; (2) small amount of annotation data or even no annotation data. To address the above problems, we incorporate data augmentation and auxiliary tasks within a cross-lingual pretrained language model into SSA. Specifically, we employ XLM-Roberta to enhance mutually interactive information when parallel data is available in the pretraining...

10.18653/v1/2022.semeval-1.185 article EN cc-by Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022) 2022-01-01

Multilingual intelligent assistants, such as ChatGPT, have recently gained popularity. To further expand the applications of multilingual artificial intelligence (AI) assistants and facilitate international communication, it is essential to enhance the performance of multilingual speech recognition, which is a crucial component of speech interaction. In this paper, we propose two simple and parameter-efficient methods: language prompt tuning and frame-level language adapter, respectively to enhance language-configurable and language-agnostic multilingual speech recognition....

10.1109/icassp48485.2024.10446990 article EN ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024-03-18

Parameter quantization for Large Language Models (LLMs) has attracted increasing attention recently for reducing memory costs and improving computational efficiency. Early approaches have been widely adopted. However, the existing methods suffer from poor performance in low-bit (such as 2 to 3 bits) scenarios. In this paper, we present a novel and effective Column-Level Adaptive weight Quantization (CLAQ) framework by introducing three different types of adaptive strategies for LLM quantization....

10.48550/arxiv.2405.17233 preprint EN arXiv (Cornell University) 2024-05-27

Recently, multilingual artificial intelligence assistants, exemplified by ChatGPT, have gained immense popularity. As a crucial gateway to human-computer interaction, multilingual automatic speech recognition (ASR) has also garnered significant attention, as evidenced by systems like Whisper. However, the proprietary nature of the training data has impeded researchers' efforts to study multilingual ASR. This paper introduces MSR-86K, an evolving, large-scale multilingual corpus for speech recognition research. The corpus is derived from publicly accessible videos on...

10.48550/arxiv.2406.18301 preprint EN arXiv (Cornell University) 2024-06-26

Recently, multilingual artificial intelligence assistants, exemplified by ChatGPT, have gained immense popularity. As a crucial gateway to human-computer interaction, multilingual automatic speech recognition (ASR) has also garnered significant attention, as evidenced by systems like Whisper. However, the proprietary nature of the training data has impeded researchers' efforts to study multilingual ASR. This paper introduces MSR-86K, an evolving, large-scale multilingual corpus for speech recognition research. The corpus is derived from publicly accessible videos on...

10.21437/interspeech.2024-890 article EN Interspeech 2024 2024-09-01

In recent years, there has been significant progress in Text-to-Speech (TTS) synthesis technology, enabling the high-quality synthesis of voices in common scenarios. In unseen situations, adaptive TTS requires a strong generalization capability to speaker style characteristics. However, the existing adaptive methods can only extract and integrate coarse-grained timbre or mixed rhythm attributes separately. In this paper, we propose AS-Speech, an adaptive style methodology that integrates the speaker timbre characteristics and rhythmic attributes into a unified framework...

10.48550/arxiv.2409.05730 preprint EN arXiv (Cornell University) 2024-09-09

10.1109/slt61566.2024.10832337 article EN 2024 IEEE Spoken Language Technology Workshop (SLT) 2024-12-02

Multi-label unknown intent detection is a challenging task where each utterance may contain not only multiple known but also unknown intents. To tackle this challenge, pioneers proposed to predict the intent number of the utterance first, then compare it with the results of known intent matching to decide whether the utterance contains unknown intent(s). Though they have made remarkable progress on this task, their method still suffers from two important issues: 1) It is inadequate to extract multiple intents using only utterance encoding; 2) Optimizing the two sub-tasks (intent number prediction and known intent matching)...

10.1145/3583780.3615163 article EN cc-by 2023-10-21

Data-driven methods have achieved notable performance on intent detection, which is a task to comprehend user queries. Nonetheless, they are controversial for over-confident predictions. In some scenarios, users do not only care about the accuracy but also the confidence of the model. Unfortunately, mainstream neural networks are poorly calibrated, with a large gap between accuracy and confidence. To handle this problem, defined as confidence calibration, we propose a model using the hyperspherical space and rebalanced...

10.1609/aaai.v36i10.21314 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2022-06-28

Event detection (ED) identifies and classifies event triggers from unstructured texts, serving as a fundamental task for information extraction. Despite the remarkable progress achieved in the past several years, most research efforts focus on detecting events from formal texts (e.g., news articles, Wikipedia documents, financial announcements). Moreover, the texts in each dataset are either from a single source or from multiple yet relatively homogeneous sources. With massive amounts of user-generated text accumulating on the Web...

10.18653/v1/2022.emnlp-main.191 article EN cc-by Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing 2022-01-01

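Data efficient voice cloning aims at synthesizing target speaker's voice with only a few enrollment samples at hand. To this end, speaker adaptation and speaker encoding are two typical methods based on a base model trained from multiple speakers. The former uses a small set of target speaker data to transfer the multi-speaker model to the target speaker's voice through direct model update, while in the latter, only a few seconds of the target speaker's audio directly goes through an extra speaker encoding model along with the multi-speaker model to synthesize the target speaker's voice without model update. Nevertheless, the two methods need clean target speaker data. However, the samples provided by the user may inevitably contain acoustic...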

10.48550/arxiv.2008.04265 preprint EN other-oa arXiv (Cornell University) 2020-01-01

Though widely used in industry, traditional task-oriented dialogue systems suffer from three bottlenecks: (i) difficult ontology construction (e.g., intents and slots); (ii) poor controllability and interpretability; (iii) annotation-hungry. In this paper, we propose to represent an utterance with a simpler concept named Dialogue Action, upon which we construct a tree-structured TaskFlow and further build a task-oriented chatbot with TaskFlow as the core component. A framework is presented to automatically construct TaskFlow from large-scale dialogues and deploy online....

10.1145/3477495.3536331 article EN Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval 2022-07-06