Xinsheng Wang

ORCID: 0000-0003-1826-7419
Research Areas
  • Speech and Audio Processing
  • Multimodal Machine Learning Applications
  • Natural Language Processing Techniques
  • Speech Recognition and Synthesis
  • Astrophysics and Cosmic Phenomena
  • Advanced Image and Video Retrieval Techniques
  • Domain Adaptation and Few-Shot Learning
  • Dark Matter and Cosmic Phenomena
  • Music and Audio Processing
  • Generative Adversarial Networks and Image Synthesis
  • Translation Studies and Practices
  • Remote Sensing and Land Use
  • Image Retrieval and Classification Techniques
  • Voice and Speech Disorders
  • Neutrino Physics Research
  • Video Analysis and Summarization
  • Subtitles and Audiovisual Media
  • Advanced Wireless Communication Techniques
  • Topic Modeling
  • Dermatologic Treatments and Research
  • Wireless Communication Networks Research
  • Particle Detector Development and Performance
  • Web Applications and Data Management
  • Advanced Computational Techniques and Applications
  • Face Recognition and Analysis

Jiangsu University
2025

Northwestern Polytechnical University
2022-2024

Jiangxi Agricultural University
2024

Friedrich-Alexander-Universität Erlangen-Nürnberg
2023

Xi'an Jiaotong University
2017-2023

Delft University of Technology
2020-2022

Hubei University
2007-2021

Pennsylvania State University
2009

Southeast University
2007

Expressive synthetic speech is essential for many human-computer interaction and audio broadcast scenarios, and thus synthesizing expressive speech has attracted much attention in recent years. Previous methods performed the synthesis either with explicit labels or with a fixed-length style embedding extracted from reference audio, both of which can only learn an average style and therefore ignore the multi-scale nature of speech prosody. In this paper, we propose MsEmoTTS, a multi-scale emotional speech synthesis framework, to model emotion at different levels....

10.1109/taslp.2022.3145293 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2022-01-01

The cross-speaker emotion transfer task in text-to-speech (TTS) synthesis aims to synthesize speech for a target speaker with the emotion transferred from reference speech recorded by another (source) speaker. During the transfer process, the identity information of the source speaker could also affect the synthesized results, resulting in the issue of speaker leakage, i.e., the synthetic speech may have the voice of the source speaker rather than the target speaker. This paper proposes a new method that aims at controllable and expressive emotional synthesis while maintaining the target speaker's identity in the cross-speaker emotion TTS task. The proposed method is...

10.1109/taslp.2022.3164181 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2022-01-01

Restless legs syndrome (RLS) is a neurological disorder that is thought to involve decreased iron availability in the brain. Iron is required for oxidative metabolism and plays a critical role in redox reactions in mitochondria. The recent discovery of mitochondrial ferritin (FtMt) provided an opportunity to identify a potential correlation between mitochondrial function and RLS. Human substantia nigra (SN) and putamen autopsy samples from 8 RLS cases and controls were analyzed. Mitochondrial ferritin levels in SN tissue homogenate were assessed by...

10.1097/nen.0b013e3181bdc44f article EN Journal of Neuropathology & Experimental Neurology 2009-10-22

Generative models have attracted considerable attention for speech separation tasks, and among these, diffusion-based methods are being explored. Despite the notable success of diffusion techniques in speech generation, their adaptation to speech separation has encountered challenges, notably slow convergence and suboptimal outcomes. To address these issues and enhance the efficacy of diffusion-based separation, we introduce EDSep, a novel single-channel separation method grounded in score matching via a stochastic differential equation (SDE). This method enhances...

10.48550/arxiv.2501.15965 preprint EN arXiv (Cornell University) 2025-01-27
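The score-matching-via-SDE idea behind EDSep can be illustrated with a minimal, self-contained sketch: a reverse-time variance-exploding (VE) SDE is integrated with Euler-Maruyama steps, with the paper's learned score network replaced by the analytic score of a known Gaussian so the example runs on its own. The schedule, target `mu`, and step counts are all illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def sigma(t, sigma_min=0.01, sigma_max=1.0):
    # Variance-exploding (VE) noise schedule.
    return sigma_min * (sigma_max / sigma_min) ** t

def score(x, t, mu=2.0):
    # Analytic score of N(mu, sigma(t)^2); stands in for a trained
    # score network (EDSep's actual network is not reproduced here).
    return (mu - x) / sigma(t) ** 2

def reverse_sde_sample(n_steps=1000, mu=2.0, seed=0):
    # Euler-Maruyama integration of the reverse-time VE SDE
    # from t = 1 (noisy prior) down to t = 0 (clean sample).
    rng = np.random.default_rng(seed)
    log_ratio = np.log(1.0 / 0.01)            # log(sigma_max / sigma_min)
    x = rng.normal(mu, sigma(1.0))            # draw from the noisy prior
    dt = 1.0 / n_steps
    for i in range(n_steps, 0, -1):
        t = i / n_steps
        g2 = 2.0 * log_ratio * sigma(t) ** 2  # g(t)^2 = d/dt sigma(t)^2
        x += g2 * score(x, t, mu) * dt        # drift toward the data
        x += np.sqrt(g2 * dt) * rng.standard_normal()  # injected noise
    return x

samples = [reverse_sde_sample(seed=s) for s in range(20)]
print(abs(np.mean(samples) - 2.0) < 0.1)
```

With a correct score, the samples concentrate around the data mean as the noise level shrinks; the separation setting additionally conditions the score on the mixture signal.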

Text-to-Audio (TTA) generation is an emerging area within AI-generated content (AIGC), where audio is created from natural language descriptions. Despite growing interest, developing robust TTA models remains challenging due to the scarcity of well-labeled datasets and the prevalence of noisy or inaccurate captions in large-scale, weakly labeled corpora. To address these challenges, we propose CosyAudio, a novel framework that utilizes confidence scores and synthetic captions to enhance the quality of audio generation. CosyAudio...

10.48550/arxiv.2501.16761 preprint EN arXiv (Cornell University) 2025-01-28

Recent advances in text-based large language models (LLMs), particularly the GPT series and the o1 model, have demonstrated the effectiveness of scaling both training-time and inference-time compute. However, current state-of-the-art TTS systems leveraging LLMs are often multi-stage, requiring separate models (e.g., a diffusion model after the LLM), which complicates the decision of whether to scale a particular model during training or testing. This work makes the following contributions: First, we explore the scaling of train-time compute for speech...

10.48550/arxiv.2502.04128 preprint EN arXiv (Cornell University) 2025-02-06

Text-based technologies, such as text translation from one language to another and image captioning, are gaining popularity. However, approximately half of the world's languages are estimated to lack a commonly used written form. Consequently, these languages cannot benefit from text-based technologies. This paper presents 1) a new speech technology task, i.e., a speech-to-image generation (S2IG) framework which translates speech descriptions to photo-realistic images, 2) without using any text information, thus allowing...

10.1109/taslp.2021.3053391 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2021-01-01

The production performance of laying hens is influenced by various environmental factors within the henhouse. The intricate interactions among these factors make the impact process highly complicated, and the exact relationships between environmental and production variables are still not well understood. In this study, we measured environmental factors across different parts of the henhouse, evaluated the weight of each variable, and constructed a laying rate prediction model. Results displayed that body weight, laying rate, egg weight, and eggshell thickness decrease gradually from WCA to FA (P <...

10.1016/j.psj.2024.104185 article EN cc-by-nc-nd Poultry Science 2024-08-20

10.1016/j.ijheatmasstransfer.2017.12.149 article EN International Journal of Heat and Mass Transfer 2018-01-03

Multimodal word discovery (MWD) is often treated as a byproduct of the speech-to-image retrieval problem. However, our theoretical analysis shows that some kind of alignment/attention mechanism is crucial for an MWD system to learn a meaningful word-level representation. We verify this theory by conducting retrieval and word discovery experiments on MSCOCO and Flickr8k, and empirically demonstrate that both neural MT with self-attention and statistical MT achieve word discovery scores that are superior to those of a state-of-the-art speech-to-image retrieval system, outperforming it by 2% and 5% in alignment F1...

10.1109/icassp39728.2021.9414418 article EN ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021-05-13
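The alignment F1 metric mentioned above can be sketched as a set comparison between predicted and gold word-to-region alignments. The `(word, region)` pairs below are invented for illustration; the paper's actual alignment units are not reproduced here.

```python
# Alignment F1: compare predicted word-to-region alignments against
# gold alignments as sets of (word, region) pairs.

def alignment_f1(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)          # correctly discovered alignments
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical toy alignments.
pred = [("dog", "region_1"), ("ball", "region_2"), ("park", "region_3")]
gold = [("dog", "region_1"), ("ball", "region_2"), ("grass", "region_4")]
print(round(alignment_f1(pred, gold), 3))  # → 0.667
```

A 2% or 5% absolute gain on this metric means proportionally more predicted pairs land in the gold set without extra spurious pairs.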

An estimated half of the world's languages do not have a written form, making it impossible for these languages to benefit from any existing text-based technologies. In this paper, a speech-to-image generation (S2IG) framework is proposed which translates speech descriptions to photo-realistic images without using text information, thus allowing unwritten languages to potentially benefit from this technology. The proposed S2IG framework, named S2IGAN, consists of a speech embedding network (SEN) and a relation-supervised densely-stacked generative model...

10.21437/interspeech.2020-1759 article EN Interspeech 2020 2020-10-25

Automatically generating videos in which synthesized speech is synchronized with lip movements in a talking head has great potential in many human-computer interaction scenarios. In this paper, we present an automatic method to generate synchronized speech and talking-head videos on the basis of text and a single face image of an arbitrary person as input. In contrast to previous text-driven talking-head generation methods, which can only synthesize the voice of a specific person, the proposed method is capable of synthesizing speech for any person. Specifically, the proposed method decomposes the generation into two stages,...

10.1109/tmm.2022.3214100 article EN IEEE Transactions on Multimedia 2022-10-12

Zero-shot learning, which aims to recognize new categories that are not included in the training set, has gained popularity owing to its potential ability in real-world applications. Zero-shot learning models rely on an embedding space, where both semantic descriptions of classes and visual features of instances can be embedded for nearest neighbor search. Recently, most existing works consider the visual space formulated by deep features as an ideal choice of the embedding space. However, the discrete distribution of instances in the visual space makes the data structure unremarkable. We...

10.48550/arxiv.1907.00330 preprint EN other-oa arXiv (Cornell University) 2019-01-01
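The nearest-neighbor search that zero-shot models perform in the shared embedding space can be sketched as follows. The class prototypes (attribute vectors) and the test embedding are toy values invented for illustration; a real system would learn the visual-to-semantic mapping.

```python
import numpy as np

def zero_shot_classify(visual_embedding, class_prototypes):
    """Assign the instance to the class prototype with the highest
    cosine similarity in the shared embedding space."""
    labels = list(class_prototypes)
    protos = np.stack([class_prototypes[l] for l in labels])
    v = visual_embedding / np.linalg.norm(visual_embedding)
    p = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    return labels[int(np.argmax(p @ v))]

# Unseen classes described only by (hypothetical) attribute vectors:
# dimensions stand for "striped", "four-legged", "aquatic".
prototypes = {
    "zebra":   np.array([1.0, 0.9, 0.0]),
    "dolphin": np.array([0.0, 0.0, 1.0]),
}
print(zero_shot_classify(np.array([0.9, 1.0, 0.1]), prototypes))  # → zebra
```

The paper's observation is about which space these vectors live in: embedding into a sparsely populated visual feature space makes such nearest-neighbor decisions less reliable than a space with clearer structure.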

Image captioning technology has great potential in many scenarios. However, current text-based image captioning methods cannot be applied to approximately half of the world's languages, due to these languages' lack of a written form. To solve this problem, the image-to-speech task was recently proposed, which generates spoken descriptions of images, bypassing any text, via an intermediate representation consisting of phonemes (image-to-phoneme). Here, we present a comprehensive study on this task, in which 1) several representative image-to-text...

10.1109/taslp.2021.3120644 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2021-01-01

In addition to conveying the linguistic content from the source speech to the converted speech, maintaining the speaking style of the source speech also plays an important role in the voice conversion (VC) task, which is essential in many scenarios with highly expressive source speech, such as dubbing and data augmentation. Previous work generally took explicit prosodic features or a fixed-length style embedding extracted by a pre-trained model to represent the speaking style, which is insufficient to achieve comprehensive style modeling and target speaker timbre preservation. Inspired by style's multi-scale nature in human speech, a...

10.1109/taslp.2023.3313414 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2023-01-01

As technology in GIS and computer science develops, more and more GIS applications have been put into practice, and many general spatial databases have been constructed which are tightly connected with the development of the social economy. With complete databases, the priority for facilitating and improving application is to retrieve data effectively and to output it efficiently and accurately. Current popular database products include only limited functions for spatial data retrieval. The main drawback is that retrieval can only be operated on one dataset at...

10.1109/csss.2011.5974145 article EN 2011-06-01

In the case of unwritten languages, acoustic models cannot be trained in the standard way, i.e., using speech and textual transcriptions. Recently, several methods have been proposed to learn speech representations from images, i.e., through visual grounding. Existing studies have focused on scene images. Here, we investigate whether fine-grained semantic information, reflecting the relationship between attributes and objects, can be learned from spoken language. To this end, a Fine-grained Semantic Embedding Network (FSEN) for learning...

10.1109/iscas51556.2021.9401232 article EN 2022 IEEE International Symposium on Circuits and Systems (ISCAS) 2021-04-27

A new fault-tolerant onboard computer (OBC) with dual processing modules is presented to improve micro-satellite data handling. Each module is composed of a 32-bit ARM processor. Using a fault tolerance method, the OBC's hardware structure is implemented based on commercial-off-the-shelf (COTS) devices. In addition, a detailed analysis of the data handling mechanism and software architecture is given. Considering the demanding and extremely tight constraints on mass, volume, power consumption, and space environmental conditions,...

10.13190/jbupt.200504.23.wangxsh article EN Beijing Youdian Xueyuan xuebao 2005-08-28

Inspired by the ability of human beings to recognize relations between visual scenes and sounds, many cross-modal learning methods have been developed for modeling images or videos associated with sounds. In this work, for the first time, a Look, Listen and Infer Network (LLINet) is proposed to learn a zero-shot model that can infer relations between scenes and sounds from novel categories never seen before. LLINet is mainly designed to qualify for two tasks, i.e., image-audio retrieval and sound localization in each image. Towards this end, it...

10.1145/3394171.3414023 article EN Proceedings of the 30th ACM International Conference on Multimedia 2020-10-12