Xinsheng Wang

ORCID: 0000-0003-1826-7419
Research Areas
  • Speech and Audio Processing
  • Multimodal Machine Learning Applications
  • Natural Language Processing Techniques
  • Speech Recognition and Synthesis
  • Astrophysics and Cosmic Phenomena
  • Advanced Image and Video Retrieval Techniques
  • Domain Adaptation and Few-Shot Learning
  • Dark Matter and Cosmic Phenomena
  • Music and Audio Processing
  • Generative Adversarial Networks and Image Synthesis
  • Translation Studies and Practices
  • Remote Sensing and Land Use
  • Image Retrieval and Classification Techniques
  • Voice and Speech Disorders
  • Neutrino Physics Research
  • Video Analysis and Summarization
  • Subtitles and Audiovisual Media
  • Advanced Wireless Communication Techniques
  • Topic Modeling
  • Dermatologic Treatments and Research
  • Wireless Communication Networks Research
  • Particle Detector Development and Performance
  • Web Applications and Data Management
  • Advanced Computational Techniques and Applications
  • Face Recognition and Analysis

Jiangsu University
2025

Northwestern Polytechnical University
2022-2024

Jiangxi Agricultural University
2024

Friedrich-Alexander-Universität Erlangen-Nürnberg
2023

Xi'an Jiaotong University
2017-2023

Delft University of Technology
2020-2022

Hubei University
2007-2021

Pennsylvania State University
2009

Southeast University
2007

Expressive synthetic speech is essential for many human-computer interaction and audio broadcast scenarios, and thus synthesizing expressive speech has attracted much attention in recent years. Previous methods performed the synthesis either with explicit labels or with a fixed-length style embedding extracted from reference audio, both of which can only learn an average style and therefore ignore the multi-scale nature of speech prosody. In this paper, we propose MsEmoTTS, a multi-scale emotional speech synthesis framework, to model emotion at different levels....

10.1109/taslp.2022.3145293 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2022-01-01

The cross-speaker emotion transfer task in text-to-speech (TTS) synthesis aims to synthesize speech for a target speaker with the emotion transferred from reference speech recorded by another (source) speaker. During the transfer process, the identity information of the source speaker could also affect the synthesized results, resulting in the issue of speaker leakage, i.e., the synthetic speech may have the voice of the source speaker rather than the target speaker. This paper proposes a new method that aims at controllable and expressive emotional synthesis while maintaining the target speaker's identity in the cross-speaker emotion TTS task. The proposed method is...

10.1109/taslp.2022.3164181 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2022-01-01

Restless legs syndrome (RLS) is a neurological disorder that is thought to involve decreased iron availability in the brain. Iron is required for oxidative metabolism and plays a critical role in redox reactions in mitochondria. The recent discovery of mitochondrial ferritin (FtMt) provided an opportunity to identify a potential correlation between mitochondrial function and RLS. Human substantia nigra (SN) and putamen autopsy samples from 8 RLS cases and controls were analyzed. Mitochondrial ferritin levels in SN tissue homogenate were assessed by...

10.1097/nen.0b013e3181bdc44f article EN Journal of Neuropathology & Experimental Neurology 2009-10-22

Generative models have attracted considerable attention for speech separation tasks, and among these, diffusion-based methods are being explored. Despite the notable success of diffusion techniques in speech generation, their adaptation to speech separation has encountered challenges, notably slow convergence and suboptimal outcomes. To address these issues and enhance the efficacy of diffusion-based separation, we introduce EDSep, a novel single-channel separation method grounded in score matching via a stochastic differential equation (SDE). This method enhances...

10.48550/arxiv.2501.15965 preprint EN arXiv (Cornell University) 2025-01-27
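The score-matching-via-SDE idea behind EDSep can be illustrated with a minimal, self-contained sketch: a reverse-time variance-exploding (VE) SDE is integrated with Euler-Maruyama steps, with the paper's learned score network replaced by the analytic score of a known Gaussian so the example runs on its own. The schedule, target `mu`, and step counts are all illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def sigma(t, sigma_min=0.01, sigma_max=1.0):
    # Variance-exploding (VE) noise schedule.
    return sigma_min * (sigma_max / sigma_min) ** t

def score(x, t, mu=2.0):
    # Analytic score of N(mu, sigma(t)^2); stands in for a trained
    # score network (EDSep's actual network is not reproduced here).
    return (mu - x) / sigma(t) ** 2

def reverse_sde_sample(n_steps=1000, mu=2.0, seed=0):
    # Euler-Maruyama integration of the reverse-time VE SDE
    # from t = 1 (noisy prior) down to t = 0 (clean sample).
    rng = np.random.default_rng(seed)
    log_ratio = np.log(1.0 / 0.01)            # log(sigma_max / sigma_min)
    x = rng.normal(mu, sigma(1.0))            # draw from the noisy prior
    dt = 1.0 / n_steps
    for i in range(n_steps, 0, -1):
        t = i / n_steps
        g2 = 2.0 * log_ratio * sigma(t) ** 2  # g(t)^2 = d/dt sigma(t)^2
        x += g2 * score(x, t, mu) * dt        # drift toward the data
        x += np.sqrt(g2 * dt) * rng.standard_normal()  # injected noise
    return x

samples = [reverse_sde_sample(seed=s) for s in range(20)]
print(abs(np.mean(samples) - 2.0) < 0.1)
```

With a correct score, the samples concentrate around the data mean as the noise level shrinks; the separation setting additionally conditions the score on the mixture signal.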

Text-to-Audio (TTA) generation is an emerging area within AI-generated content (AIGC), where audio is created from natural language descriptions. Despite growing interest, developing robust TTA models remains challenging due to the scarcity of well-labeled datasets and the prevalence of noisy or inaccurate captions in large-scale, weakly labeled corpora. To address these challenges, we propose CosyAudio, a novel framework that utilizes confidence scores and synthetic captions to enhance the quality of audio generation. CosyAudio...

10.48550/arxiv.2501.16761 preprint EN arXiv (Cornell University) 2025-01-28

Recent advances in text-based large language models (LLMs), particularly the GPT series and the o1 model, have demonstrated the effectiveness of scaling both training-time and inference-time compute. However, current state-of-the-art TTS systems leveraging LLMs are often multi-stage, requiring separate models (e.g., a diffusion model after the LLM), which complicates the decision of whether to scale a particular model during training or testing. This work makes the following contributions: First, we explore the scaling of train-time compute for speech...

10.48550/arxiv.2502.04128 preprint EN arXiv (Cornell University) 2025-02-06

Text-based technologies, such as text translation from one language to another and image captioning, are gaining popularity. However, approximately half of the world's languages are estimated to lack a commonly used written form. Consequently, these languages cannot benefit from text-based technologies. This paper presents 1) a new speech technology task, i.e., a speech-to-image generation (S2IG) framework which translates speech descriptions to photo-realistic images, 2) without using any text information, thus allowing...

10.1109/taslp.2021.3053391 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2021-01-01

The production performance of laying hens is influenced by various environmental factors within the henhouse. The intricate interactions among these factors make the impact process highly complicated, and the exact relationships between environmental and production variables are still not well understood. In this study, we measured environmental factors across different parts of the henhouse, evaluated the weight of each variable, and constructed a laying rate prediction model. Results displayed that body weight, laying rate, egg weight, and eggshell thickness decrease gradually from WCA to FA (P <...

10.1016/j.psj.2024.104185 article EN cc-by-nc-nd Poultry Science 2024-08-20

10.1016/j.ijheatmasstransfer.2017.12.149 article EN International Journal of Heat and Mass Transfer 2018-01-03

Multimodal word discovery (MWD) is often treated as a byproduct of the speech-to-image retrieval problem. However, our theoretical analysis shows that some kind of alignment/attention mechanism is crucial for an MWD system to learn a meaningful word-level representation. We verify this theory by conducting retrieval and word discovery experiments on MSCOCO and Flickr8k, and empirically demonstrate that both neural MT with self-attention and statistical MT achieve word discovery scores that are superior to those of a state-of-the-art speech-to-image retrieval system, outperforming it by 2% and 5% in alignment F1...

10.1109/icassp39728.2021.9414418 article EN ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021-05-13
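The alignment F1 metric mentioned above can be sketched as a set comparison between predicted and gold word-to-region alignments. The `(word, region)` pairs below are invented for illustration; the paper's actual alignment units are not reproduced here.

```python
# Alignment F1: compare predicted word-to-region alignments against
# gold alignments as sets of (word, region) pairs.

def alignment_f1(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)          # correctly discovered alignments
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical toy alignments.
pred = [("dog", "region_1"), ("ball", "region_2"), ("park", "region_3")]
gold = [("dog", "region_1"), ("ball", "region_2"), ("grass", "region_4")]
print(round(alignment_f1(pred, gold), 3))  # → 0.667
```

A 2% or 5% absolute gain on this metric means proportionally more predicted pairs land in the gold set without extra spurious pairs.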

An estimated half of the world's languages do not have a written form, making it impossible for these languages to benefit from any existing text-based technologies. In this paper, a speech-to-image generation (S2IG) framework is proposed which translates speech descriptions to photo-realistic images without using text information, thus allowing unwritten languages to potentially benefit from this technology. The proposed S2IG framework, named S2IGAN, consists of a speech embedding network (SEN) and a relation-supervised densely-stacked generative model...

10.21437/interspeech.2020-1759 article EN Interspeech 2020 2020-10-25

Automatically generating videos in which synthesized speech is synchronized with lip movements in a talking head has great potential in many human-computer interaction scenarios. In this paper, we present an automatic method to generate synchronized speech and talking-head videos on the basis of text and a single face image of an arbitrary person as input. In contrast to previous text-driven talking-head generation methods, which can only synthesize the voice of a specific person, the proposed method is capable of synthesizing speech for any person. Specifically, the proposed method decomposes the generation into two stages,...

10.1109/tmm.2022.3214100 article EN IEEE Transactions on Multimedia 2022-10-12

Zero-shot learning, which aims to recognize new categories that are not included in the training set, has gained popularity owing to its potential ability in real-world applications. Zero-shot learning models rely on an embedding space, where both semantic descriptions of classes and visual features of instances can be embedded for nearest neighbor search. Recently, most existing works consider the visual space formulated by deep features as an ideal choice of the embedding space. However, the discrete distribution of instances in the visual space makes the data structure unremarkable. We...

10.48550/arxiv.1907.00330 preprint EN other-oa arXiv (Cornell University) 2019-01-01
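The nearest-neighbor search that zero-shot models perform in the shared embedding space can be sketched as follows. The class prototypes (attribute vectors) and the test embedding are toy values invented for illustration; a real system would learn the visual-to-semantic mapping.

```python
import numpy as np

def zero_shot_classify(visual_embedding, class_prototypes):
    """Assign the instance to the class prototype with the highest
    cosine similarity in the shared embedding space."""
    labels = list(class_prototypes)
    protos = np.stack([class_prototypes[l] for l in labels])
    v = visual_embedding / np.linalg.norm(visual_embedding)
    p = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    return labels[int(np.argmax(p @ v))]

# Unseen classes described only by (hypothetical) attribute vectors:
# dimensions stand for "striped", "four-legged", "aquatic".
prototypes = {
    "zebra":   np.array([1.0, 0.9, 0.0]),
    "dolphin": np.array([0.0, 0.0, 1.0]),
}
print(zero_shot_classify(np.array([0.9, 1.0, 0.1]), prototypes))  # → zebra
```

The paper's observation is about which space these vectors live in: embedding into a sparsely populated visual feature space makes such nearest-neighbor decisions less reliable than a space with clearer structure.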

Image captioning technology has great potential in many scenarios. However, current text-based image captioning methods cannot be applied to approximately half of the world's languages, due to these languages' lack of a written form. To solve this problem, the image-to-speech task was recently proposed, which generates spoken descriptions of images, bypassing any text, via an intermediate representation consisting of phonemes (image-to-phoneme). Here, we present a comprehensive study on this task, in which 1) several representative image-to-text...

10.1109/taslp.2021.3120644 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2021-01-01

In addition to conveying the linguistic content from the source speech to the converted speech, maintaining the speaking style of the source speech also plays an important role in the voice conversion (VC) task, which is essential in many scenarios with highly expressive source speech, such as dubbing and data augmentation. Previous work generally took explicit prosodic features or a fixed-length style embedding extracted by a pre-trained model to represent the speaking style, which is insufficient to achieve comprehensive style modeling and target speaker timbre preservation. Inspired by style's multi-scale nature in human speech, a...

10.1109/taslp.2023.3313414 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2023-01-01

As technology in GIS and computer science develops, more and more GIS applications have been put into practice, and many general spatial databases have been constructed which are tightly connected with the development of the social economy. With complete databases, the priority for facilitating and improving application is to retrieve data effectively and to output it efficiently and accurately. Current popular database products include only limited functions for spatial data retrieval. The main drawback is that retrieval can only be operated on one dataset at...

10.1109/csss.2011.5974145 article EN 2011-06-01

In the case of unwritten languages, acoustic models cannot be trained in the standard way, i.e., using speech and textual transcriptions. Recently, several methods have been proposed to learn speech representations from images, i.e., through visual grounding. Existing studies have focused on scene images. Here, we investigate whether fine-grained semantic information, reflecting the relationship between attributes and objects, can be learned from spoken language. To this end, a Fine-grained Semantic Embedding Network (FSEN) for learning...

10.1109/iscas51556.2021.9401232 article EN 2022 IEEE International Symposium on Circuits and Systems (ISCAS) 2021-04-27

A new fault-tolerant onboard computer (OBC) with dual processing modules is presented to improve micro-satellite data handling. Each module is composed of a 32-bit ARM processor. Using a fault tolerance method, the OBC's hardware structure is implemented based on commercial-off-the-shelf (COTS) devices. In addition, a detailed analysis of the data handling mechanism and software architecture is given. Considering the demanding and extremely tight constraints on mass, volume, power consumption, and space environmental conditions,...

10.13190/jbupt.200504.23.wangxsh article EN Beijing Youdian Xueyuan xuebao 2005-08-28

Inspired by the ability of human beings to recognize relations between visual scenes and sounds, many cross-modal learning methods have been developed for modeling images or videos associated with sounds. In this work, for the first time, a Look, Listen and Infer Network (LLINet) is proposed to learn a zero-shot model that can infer relations between scenes and sounds from novel categories never seen before. LLINet is mainly designed to qualify for two tasks, i.e., image-audio retrieval and sound localization in each image. Towards this end, it...

10.1145/3394171.3414023 article EN Proceedings of the 30th ACM International Conference on Multimedia 2020-10-12