Xulong Zhang

ORCID: 0000-0001-7005-992X
Research Areas
  • Speech Recognition and Synthesis
  • Music and Audio Processing
  • Speech and Audio Processing
  • Diverse Musicological Studies
  • Natural Language Processing Techniques
  • Music Technology and Sound Studies
  • Topic Modeling
  • Generative Adversarial Networks and Image Synthesis
  • Emotion and Mood Recognition
  • Asian Culture and Media Studies
  • Face recognition and analysis
  • Human Motion and Animation
  • Nasal Surgery and Airway Studies
  • Handwritten Text Recognition Techniques
  • Advancements in Battery Materials
  • Domain Adaptation and Few-Shot Learning
  • Multimodal Machine Learning Applications
  • Speech and dialogue systems
  • Digital Media Forensic Detection
  • AI in cancer detection
  • Data Analysis with R
  • Advanced Battery Materials and Technologies
  • Scientific Computing and Data Management
  • Advanced Neural Network Applications
  • Research Data Management Practices

Ping An (China)
2021-2025

Shenzhen Technology University
2021-2025

Lanzhou University
2015-2024

Chinese Academy of Medical Sciences & Peking Union Medical College
2022-2024

Lamar University
2021-2024

Jinling Institute of Technology
2024

Ningxia University
2021-2023

Foundation for Biomedical Research
2023

Committee on Publication Ethics
2023

Wuhan University of Science and Technology
2023

Voice Conversion (VC) refers to changing the timbre of speech while retaining the discourse content. Recently, many works have focused on disentanglement-based learning techniques that separate speaker and linguistic content information from the speech signal. Once this succeeds, voice conversion becomes feasible and straightforward. This paper proposes a novel one-shot framework, called AVQVC, that combines vector-quantization-based voice conversion (VQVC) with AutoVC. A new training method is applied so that VQVC separates the two kinds of information more effectively. The result shows that this approach has...

10.1109/icassp43922.2022.9746369 article EN ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022-04-27
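The vector-quantization step at the heart of VQVC-style content encoders can be illustrated with a minimal nearest-neighbour codebook lookup. This is only a sketch of the generic VQ operation; the codebook size, feature dimension, and data below are made up for illustration and are not the paper's configuration.

```python
import numpy as np

def quantize(frames, codebook):
    """Snap each content frame to its nearest codebook entry (L2 distance).

    frames:   (T, D) continuous encoder outputs
    codebook: (K, D) learned discrete content tokens
    Returns the quantized frames and the chosen indices.
    """
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (T, K)
    idx = dists.argmin(axis=1)
    return codebook[idx], idx

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))
# Frames lying close to codes 1, 3, 3, 5 should map back to those indices.
frames = codebook[[1, 3, 3, 5]] + 0.01 * rng.normal(size=(4, 4))
quantized, idx = quantize(frames, codebook)
```

Because the quantized output can only take codebook values, speaker-specific nuance in the continuous frames is discarded, which is what makes the discrete bottleneck useful for isolating content.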

The vibration caused by blade High Cycle Fatigue (HCF) seriously affects the safe operation of turbomachinery, especially aero-engines. Thus, it is crucially important to identify vibration parameters and then evaluate the dynamic stress amplitude. The Blade Tip Timing (BTT) method is one promising way to solve these problems. However, it needs a high-resolution Once Per Revolution (OPR) signal, which is difficult to obtain. Here, a Coupled Vibration Analysis (CVA) method for identifying the parameters from BTT without an OPR signal is proposed. It assumes that every real blade has its own...

10.1016/j.cja.2020.01.014 article EN cc-by-nc-nd Chinese Journal of Aeronautics 2020-03-19

Voice Conversion (VC) aims to convert the style of a source speaker, such as timbre and pitch, to that of any target speaker while preserving the linguistic content. However, ground truth for the converted speech does not exist in the non-parallel VC scenario, which induces a train-inference mismatch problem. Moreover, existing methods still suffer from inaccurate pitch and low adaptation quality, and there is a significant disparity between the domains. As a result, models tend to generate speech with hoarseness, posing challenges to achieving...

10.48550/arxiv.2501.01861 preprint EN arXiv (Cornell University) 2025-01-03

The effect of the process aid "OPS" on the rheological properties of hydroxyl-terminated polybutadiene propellant was investigated by formulating high-solid-content slurries with different components; the change in slurry viscosity with shear rate, the surface morphology of the solid-phase particles, and the contact angles of the relevant interfaces were characterized. The results showed that the polyalkene polyamine surfactant OPS could significantly reduce the apparent viscosity, with up to a 30% reduction, achieved by adjusting the interfacial properties of the aluminum...

10.3390/polym17030286 article EN Polymers 2025-01-23

10.1109/icassp49660.2025.10889497 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

10.1109/icassp49660.2025.10890303 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

The Metaverse is an interactive world that combines reality and virtuality, where participants can appear as virtual avatars. Anyone can hold a concert in a virtual hall, and users can quickly identify the real singer behind a virtual idol through singer identification. Most singer identification methods operate on frame-level features. However, beyond the singer's timbre, a music frame also includes other information, such as melodiousness, rhythm, and tonality. This information acts as noise when frame-level features are used to identify singers. In this paper, instead of using only frame-level features, we...

10.1109/ijcnn55064.2022.9892657 article EN 2022 International Joint Conference on Neural Networks (IJCNN) 2022-07-18
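One simple way to move beyond purely frame-level features, as the abstract motivates, is to pool frame embeddings into a single track-level vector before matching singers. The mean-pooling-plus-cosine sketch below is a generic illustration under toy data, not the paper's actual method.

```python
import numpy as np

def track_embedding(frame_feats):
    """Pool (T, D) frame-level features into one track-level vector."""
    return frame_feats.mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
singer_a = rng.normal(size=4)
singer_b = -singer_a  # a clearly different "voice" direction
# Two tracks by singer A (different frame-level noise), one by singer B.
track1 = singer_a + 0.1 * rng.normal(size=(50, 4))
track2 = singer_a + 0.1 * rng.normal(size=(50, 4))
track3 = singer_b + 0.1 * rng.normal(size=(50, 4))
e1, e2, e3 = map(track_embedding, (track1, track2, track3))
```

Averaging over frames suppresses the per-frame "noise" (melody, rhythm, tonality) that the abstract identifies, so tracks by the same singer end up closer than tracks by different singers.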

The any-to-any voice conversion problem aims to convert voices between source and target speakers that are outside the training data. Previous works widely utilize disentangle-based models. Such a model assumes that speech consists of content and speaker style information, untangles them, and changes the style information to achieve conversion. These works focus on reducing the embedding dimension to isolate content information, but the right size is hard to determine, which can lead to an overlapping problem. We propose the Disentangled Representation Voice Conversion (DRVC) model to address this issue. DRVC is an end-to-end self-supervised...

10.1109/icassp43922.2022.9747434 article EN ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022-04-27
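The disentangle-then-recombine idea behind such models can be caricatured in a few lines: if an utterance embedding were literally content plus an additive speaker-style offset, conversion would just swap offsets. DRVC learns this decomposition end-to-end; the additive model below is only a toy assumption, not the paper's architecture.

```python
import numpy as np

def convert(src_utt, src_style, tgt_style):
    """Toy conversion: strip the source speaker offset, add the target's."""
    content = src_utt - src_style   # the "disentangled" content part
    return content + tgt_style

content = np.array([1.0, 2.0, 3.0])
style_a = np.array([10.0, 10.0, 10.0])   # hypothetical speaker-A style
style_b = np.array([-5.0, -5.0, -5.0])   # hypothetical speaker-B style
utt_a = content + style_a
converted = convert(utt_a, style_a, style_b)
```

The dimension-size problem the abstract mentions shows up here too: if the "content" subspace is too large it leaks style, and if too small it loses content, which motivates learning the split rather than fixing it by hand.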

Existing emotional speech synthesis methods often utilize an utterance-level style embedding extracted from reference audio, neglecting the inherent multi-scale property of prosody. We introduce ED-TTS, a model that leverages Speech Emotion Diarization (SED) and Speech Emotion Recognition (SER) to model emotions at different levels. Specifically, our proposed approach integrates the utterance-level emotion embedding extracted by SER with the fine-grained frame-level embeddings obtained from SED. These embeddings are used to condition the reverse process of the denoising diffusion...

10.1109/icassp48485.2024.10446467 article EN ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024-03-18
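Combining an utterance-level emotion embedding with frame-level ones, as ED-TTS describes, can be sketched as broadcasting the utterance vector over the frame axis to form a per-frame conditioning signal. The shapes and the simple additive fusion below are illustrative assumptions, not the model's actual conditioning scheme.

```python
import numpy as np

def condition_signal(utt_emb, frame_embs):
    """Broadcast an utterance-level embedding (D,) across frame-level
    embeddings (T, D) to produce a per-frame conditioning signal (T, D)."""
    return frame_embs + utt_emb[None, :]

utt = np.ones(8)             # e.g. an SER-style utterance-level embedding
frames = np.zeros((20, 8))   # e.g. SED-style frame-level embeddings
cond = condition_signal(utt, frames)
```

A per-frame signal like this is what a denoising diffusion decoder can consume at every step, so coarse emotion and fine-grained frame labels act on the same time grid.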

Voice conversion refers to transferring speaker identity with well-preserved content. Better disentanglement of speech representations leads to better voice conversion. Recent studies have found that phonetic information extracted from the input audio has the potential to represent content well. Besides, speaker-style modeling with pre-trained models makes the conversion process more complex. To tackle these issues, we introduce a new method named "CTVC", which utilizes disentangled representations based on contrastive learning and time-invariant...

10.1109/icassp48485.2024.10447283 article EN ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024-03-18
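Contrastive objectives of the kind CTVC builds on are commonly instantiated as InfoNCE: each anchor is pulled toward its matched positive and pushed from in-batch negatives. This NumPy version shows the generic loss only; the embeddings, batch size, and temperature are made up, and this is not the paper's training objective verbatim.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss: row i of `positives` is the positive for anchor i;
    every other row in the batch serves as a negative."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                  # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))       # -log p(positive | anchor)

x = np.eye(4)                          # four orthogonal toy embeddings
loss_matched = info_nce(x, x)          # positives correctly aligned
loss_shuffled = info_nce(x, x[[1, 2, 3, 0]])  # positives misaligned
```

When pairs line up, the diagonal dominates the softmax and the loss is near zero; shuffling the positives makes the loss large, which is the gradient signal that organizes the representation space.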

The lithium-ion battery is widely utilized in space applications for its significant performance advantages. The safety and reliability of lithium-ion batteries are critical for spacecraft, so it is essential to assess the degradation and estimate the state of the battery. Meanwhile, as a brand-new concept from Cyber-Physical Systems (CPS), the Digital Twin is used in the smart manufacturing industry due to its advantages in real-time capability, stability, and reliability. Thus, it can be applied to the battery pack to ensure safety. So far, however, there has been no application of the Digital Twin to battery management and assessment. As...

10.1109/i2mtc.2019.8827160 article EN 2019 IEEE International Instrumentation and Measurement Technology Conference (I2MTC) 2019-05-01

Singing voice detection, or vocal detection, is a classification task that determines whether a given audio segment contains singing voices. This task plays a very important role in vocal-related music information retrieval tasks, such as singer identification. Although humans can easily distinguish between singing and non-singing parts, it is still difficult for machines to do so. Most existing methods focus on feature engineering with classifiers, which relies on the experience of the algorithm designer. In recent years, deep...

10.3390/electronics9091458 article EN Electronics 2020-09-07
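Framewise vocal/non-vocal decisions like those the abstract describes are typically post-processed to remove spurious single-frame flips, for example with median smoothing. The thresholding and window length below are illustrative choices, not the paper's pipeline.

```python
import numpy as np

def smooth_predictions(frame_probs, threshold=0.5, win=5):
    """Threshold framewise vocal probabilities, then median-filter the
    binary sequence to suppress isolated mis-classified frames."""
    labels = (np.asarray(frame_probs) > threshold).astype(int)
    pad = win // 2
    padded = np.pad(labels, pad, mode="edge")
    return np.array([int(np.median(padded[i:i + win]))
                     for i in range(len(labels))])

# A vocal region with one dropped frame, followed by a non-vocal region.
probs = [0.9, 0.9, 0.2, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1]
smoothed = smooth_predictions(probs)
```

The isolated low-probability frame inside the vocal region is filled in, while the sustained non-vocal tail is left untouched.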

Vocal melody extraction is an important and challenging task in music information retrieval. One main difficulty is that, most of the time, various instruments and singing voices are mixed according to the harmonic structure, making it hard to identify the fundamental frequency (F0) of a singing voice. Therefore, reducing the interference of the accompaniment is beneficial to pitch estimation. In this paper, we first adopted a high-resolution network (HRNet) to separate vocals from polyphonic music, and then designed an encoder-decoder network to estimate...

10.3390/electronics10030298 article EN Electronics 2021-01-26
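Once an encoder-decoder network emits a time-frequency salience map for the separated vocal, a minimal decoding step is to take the per-frame argmax and convert the bin index to Hz. The semitone grid, voicing threshold, and `fmin` below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def f0_from_salience(salience, fmin=55.0):
    """Pick the per-frame F0 as the argmax over a (T, B) salience map,
    mapping bin index to Hz on a one-bin-per-semitone grid from `fmin`.
    Frames whose peak salience is below 0.5 are marked unvoiced (0 Hz)."""
    idx = salience.argmax(axis=1)
    voiced = salience.max(axis=1) >= 0.5
    hz = fmin * 2.0 ** (idx / 12.0)
    return np.where(voiced, hz, 0.0)

sal = np.zeros((3, 25))
sal[0, 12] = 0.9   # one octave above fmin -> 110 Hz
sal[1, 0] = 0.8    # fmin itself -> 55 Hz
# frame 2 has no salient peak -> unvoiced
f0 = f0_from_salience(sal)
```

Separating the vocal first, as the abstract argues, makes this argmax far more reliable because accompaniment harmonics no longer compete for the peak.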

Multi-speaker text-to-speech (TTS) using only a few adaptation samples is a challenge in practical applications. To address this, we propose a zero-shot multi-speaker TTS model, named nnSpeech, that can synthesize a new speaker's voice without fine-tuning and with only one utterance. Compared with methods that use a speaker representation module to extract the characteristics of new speakers, our method is based on a speaker-guided conditional variational autoencoder that generates a latent variable Z containing both speaker and content information. The latent variable Z...

10.1109/icassp43922.2022.9746875 article EN ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022-04-27
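A conditional variational autoencoder samples its latent variable Z with the standard reparameterization trick, z = mu + sigma * eps, so gradients flow through the mean and variance. The sketch below shows only that generic trick with made-up dimensions; it is not nnSpeech's speaker-guided model itself.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps (reparameterization trick), keeping
    the sampling step differentiable w.r.t. mu and log_var."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 2.0])
# log_var -> very negative means sigma -> 0: z collapses to the mean.
z_deterministic = reparameterize(mu, np.full(3, -1e9), rng)
z_stochastic = reparameterize(mu, np.zeros(3), rng)
```

In a speaker-guided CVAE the mean and variance would be predicted from the text and a speaker condition, so a single reference utterance is enough to shape the distribution Z is drawn from.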

Recent expressive text-to-speech (TTS) models focus on synthesizing emotional speech, but some fine-grained styles such as intonation are neglected. In this paper, we propose QI-TTS, which aims to better transfer and control intonation, and further to deliver the speaker's questioning intention, while transferring emotion from reference speech. We use a multi-style extractor to extract style embeddings at two different levels: the sentence level represents emotion, and the final syllable level represents intonation. For intonation control, we use relative...

10.1109/icassp49357.2023.10095623 article EN ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023-05-05