- Speech and Audio Processing
- Multimodal Machine Learning Applications
- Natural Language Processing Techniques
- Speech Recognition and Synthesis
- Astrophysics and Cosmic Phenomena
- Advanced Image and Video Retrieval Techniques
- Domain Adaptation and Few-Shot Learning
- Dark Matter and Cosmic Phenomena
- Music and Audio Processing
- Generative Adversarial Networks and Image Synthesis
- Translation Studies and Practices
- Remote Sensing and Land Use
- Image Retrieval and Classification Techniques
- Voice and Speech Disorders
- Neutrino Physics Research
- Video Analysis and Summarization
- Subtitles and Audiovisual Media
- Advanced Wireless Communication Techniques
- Topic Modeling
- Dermatologic Treatments and Research
- Wireless Communication Networks Research
- Particle Detector Development and Performance
- Web Applications and Data Management
- Advanced Computational Techniques and Applications
- Face Recognition and Analysis
Jiangsu University
2025
Northwestern Polytechnical University
2022-2024
Jiangxi Agricultural University
2024
Friedrich-Alexander-Universität Erlangen-Nürnberg
2023
Xi'an Jiaotong University
2017-2023
Delft University of Technology
2020-2022
Hubei University
2007-2021
Pennsylvania State University
2009
Southeast University
2007
Expressive synthetic speech is essential for many human-computer interaction and audio broadcast scenarios, and thus synthesizing expressive speech has attracted much attention in recent years. Previous methods performed the synthesis either with explicit labels or with a fixed-length style embedding extracted from reference audio, both of which can only learn an average style and thus ignore the multi-scale nature of prosody. In this paper, we propose MsEmoTTS, a multi-scale emotional speech synthesis framework, to model emotion at different levels....
The cross-speaker emotion transfer task in text-to-speech (TTS) synthesis aims to synthesize speech for a target speaker with the emotion transferred from a reference recorded by another (source) speaker. During the transfer process, the identity information of the source speaker could also affect the synthesized results, resulting in the issue of speaker leakage, i.e., the synthetic speech may have the source speaker's voice rather than the target speaker's. This paper proposes a new method that aims to make the emotional expression controllable while maintaining the target speaker's identity in the cross-speaker emotion transfer TTS task. The proposed method is...
Restless legs syndrome (RLS) is a neurological disorder that is thought to involve decreased iron availability in the brain. Iron is required for oxidative metabolism and plays a critical role in redox reactions in mitochondria. The recent discovery of mitochondrial ferritin (FtMt) provided an opportunity to identify a potential correlation between mitochondrial function and RLS. Human substantia nigra (SN) and putamen autopsy samples from 8 RLS cases and controls were analyzed. Mitochondrial ferritin levels in SN tissue homogenate were assessed by...
Generative models have attracted considerable attention for speech separation tasks, and among these, diffusion-based methods are being explored. Despite the notable success of diffusion techniques in speech generation, their adaptation to speech separation has encountered challenges, notably slow convergence and suboptimal outcomes. To address these issues and enhance the efficacy of diffusion-based separation, we introduce EDSep, a novel single-channel method grounded in score matching via stochastic differential equations (SDEs). This method enhances...
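The abstract above rests on score matching trained through an SDE's perturbation kernel. The following is a minimal sketch of denoising score matching under a variance-exploding forward process; all names and the oracle "model" are illustrative, not EDSep's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb(x0, sigma):
    """Variance-exploding forward process: x_t = x_0 + sigma * z, z ~ N(0, I)."""
    z = rng.standard_normal(x0.shape)
    return x0 + sigma * z, z

def dsm_loss(score_fn, x0, sigma):
    """Denoising score matching objective.

    For a Gaussian perturbation kernel, the target score is
    grad_x log p(x_t | x_0) = -(x_t - x_0) / sigma**2 = -z / sigma.
    """
    xt, z = perturb(x0, sigma)
    target = -z / sigma
    pred = score_fn(xt, sigma)
    return float(np.mean((pred - target) ** 2))

# An "oracle" score that memorizes the clean signal drives the loss to zero;
# a real separator would be a neural network conditioned on the mixture.
x0 = rng.standard_normal(16000)  # one second of toy waveform at 16 kHz
oracle = lambda xt, sigma: -(xt - x0) / sigma**2
```

A trained score network approximates this oracle without access to `x0`, which is what lets reverse-time SDE sampling recover the clean sources.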
Text-to-Audio (TTA) generation is an emerging area within AI-generated content (AIGC), where audio is created from natural language descriptions. Despite growing interest, developing robust TTA models remains challenging due to the scarcity of well-labeled datasets and the prevalence of noisy or inaccurate captions in large-scale, weakly labeled corpora. To address these challenges, we propose CosyAudio, a novel framework that utilizes confidence scores and synthetic captions to enhance the quality of audio generation. CosyAudio...
Recent advances in text-based large language models (LLMs), particularly the GPT series and the o1 model, have demonstrated the effectiveness of scaling both training-time and inference-time compute. However, current state-of-the-art TTS systems leveraging LLMs are often multi-stage, requiring separate models (e.g., a diffusion model after the LLM), complicating the decision of whether to scale a particular model during training or testing. This work makes the following contributions: First, we explore train-time compute scaling for speech...
Text-based technologies, such as text translation from one language to another and image captioning, are gaining popularity. However, approximately half of the world's languages are estimated to lack a commonly used written form. Consequently, these languages cannot benefit from text-based technologies. This paper presents 1) a new speech technology task, i.e., a speech-to-image generation (S2IG) framework which translates speech descriptions into photo-realistic images 2) without using any text information, thus allowing...
The production performance of laying hens is influenced by various environmental factors within the henhouse. The intricate interactions among these factors make the impact process highly complicated, and the exact relationships between environmental and production variables are still not well understood. In this study, we measured environmental variables across different parts of the henhouse, evaluated the weight of each variable, and constructed a laying rate prediction model. Results showed that body weight, laying rate, egg weight, and eggshell thickness decrease gradually from WCA to FA (P <...
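The workflow above, weighting environmental variables and then building a rate prediction model, can be sketched as an ordinary least-squares fit. The variable names, ranges, and coefficients below are entirely hypothetical stand-ins, not the paper's measurements:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-zone measurements: temperature (deg C),
# relative humidity (%), NH3 concentration (ppm).
X = rng.uniform([18, 40, 5], [30, 80, 25], size=(200, 3))

# Synthetic ground truth: laying rate (%) as a weighted combination.
true_w = np.array([-0.8, -0.1, -0.5])
y = 95.0 + X @ true_w + rng.normal(0, 0.2, size=200)

# Fit a linear prediction model; the coefficients act as variable weights.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, weights = coef[0], coef[1:]
```

On clean synthetic data the fitted `weights` recover `true_w` closely; with real henhouse data one would also report significance levels, as the abstract's "(P < ...)" suggests.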
Multimodal word discovery (MWD) is often treated as a byproduct of the speech-to-image retrieval problem. However, our theoretical analysis shows that some kind of alignment/attention mechanism is crucial for an MWD system to learn meaningful word-level representations. We verify this theory by conducting retrieval and word-discovery experiments on MSCOCO and Flickr8k, and empirically demonstrate that both neural MT with self-attention and statistical MT achieve scores superior to those of the state-of-the-art system, outperforming it by 2% and 5% in alignment F1...
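The alignment F1 metric mentioned above compares predicted alignment links against gold links. A small self-contained sketch (the example pairs are made up for illustration):

```python
def alignment_f1(predicted, gold):
    """F1 between predicted and gold alignment links.

    Each link is a (source_index, target_index) pair, e.g. linking a
    discovered spoken-word segment to a target word or image region.
    """
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = [(0, 0), (1, 2), (2, 1), (3, 3)]
pred = [(0, 0), (1, 2), (2, 2)]
# precision = 2/3, recall = 2/4, so F1 = 4/7
```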
An estimated half of the world's languages do not have a written form, making it impossible for these languages to benefit from any existing text-based technologies. In this paper, a speech-to-image generation (S2IG) framework is proposed which translates speech descriptions into photo-realistic images without using text information, thus allowing unwritten languages to potentially benefit from this technology. The S2IG framework, named S2IGAN, consists of a speech embedding network (SEN) and a relation-supervised densely-stacked generative model...
Automatically generating videos in which synthesized speech is synchronized with the lip movements of a talking head has great potential in many human-computer interaction scenarios. In this paper, we present an automatic method to generate synchronized speech and talking-head videos on the basis of text and a single face image of an arbitrary person as input. In contrast to previous text-driven generation methods, which can only synthesize the voice of a specific person, the proposed method is capable of synthesizing speech for any person. Specifically, it decomposes the generation into two stages,...
Zero-shot learning, which aims to recognize new categories that are not included in the training set, has gained popularity owing to its potential in real-world applications. Zero-shot learning models rely on an embedding space, where both semantic descriptions of classes and visual features of instances can be embedded for nearest neighbor search. Recently, most existing works consider the space formulated by deep visual features as the ideal choice of embedding space. However, the discrete distribution of instances in this space makes the data structure unremarkable. We...
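The nearest-neighbor-search step described above can be sketched concretely: embed class descriptions and an instance into a shared space, then pick the closest class by cosine similarity. The classes and attribute vectors below are toy assumptions, not from the paper:

```python
import numpy as np

def zero_shot_classify(visual_feat, class_embeddings):
    """Nearest-neighbor search in a shared embedding space.

    visual_feat: embedded feature of one instance, shape (d,).
    class_embeddings: {class_name: semantic vector of shape (d,)} for
    unseen classes. The closest class by cosine similarity wins.
    """
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(class_embeddings,
               key=lambda c: cosine(visual_feat, class_embeddings[c]))

# Toy semantic descriptions (striped, four-legged, aquatic) for two
# unseen classes; purely illustrative.
classes = {
    "zebra": np.array([1.0, 1.0, 0.0]),
    "dolphin": np.array([0.0, 0.0, 1.0]),
}
```

The paper's point is that the quality of this search depends heavily on which space the features live in; the mechanics of the lookup stay the same.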
Image captioning technology has great potential in many scenarios. However, current text-based image captioning methods cannot be applied to approximately half of the world's languages, because these languages lack a written form. To solve this problem, the image-to-speech task was recently proposed, which generates spoken descriptions of images, bypassing any text, via an intermediate representation consisting of phonemes (image-to-phoneme). Here, we present a comprehensive study in which, 1) several representative image-to-text...
In addition to conveying the linguistic content from source speech to converted speech, maintaining the speaking style of the source speech also plays an important role in the voice conversion (VC) task, which is essential in many scenarios with highly expressive speech, such as dubbing and data augmentation. Previous work generally took explicit prosodic features or a fixed-length style embedding extracted by a pre-trained model, which is insufficient to achieve comprehensive style modeling and target speaker timbre preservation. Inspired by the multi-scale nature of style in human speech, a...
As technology in GIS and computer science develops, more GIS applications have been put into practice, and many general spatial databases have been constructed that are tightly connected with the development of the social economy. With complete databases in place, the priority for facilitating and improving applications is to retrieve data effectively and output it efficiently and accurately. Current popular database products include only limited retrieval functions; the main drawback is that retrieval can be operated on only one dataset at...
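The single-dataset limitation described above can be contrasted with a query that spans several spatial layers at once. A toy sketch of a bounding-box retrieval across multiple datasets; the layer names and point schema are invented for illustration:

```python
def bbox_query(datasets, xmin, ymin, xmax, ymax):
    """Retrieve features from several spatial datasets with one query.

    datasets: {layer_name: [(x, y, attribute), ...]}. Results come back
    grouped by layer, so wells, stations, etc. are returned together
    rather than via one query per dataset.
    """
    hits = {}
    for layer, features in datasets.items():
        matched = [f for f in features
                   if xmin <= f[0] <= xmax and ymin <= f[1] <= ymax]
        if matched:
            hits[layer] = matched
    return hits

layers = {
    "wells": [(2.0, 3.0, "W1"), (9.0, 9.0, "W2")],
    "stations": [(2.5, 2.5, "S1")],
}
```

A production system would back each layer with a spatial index (e.g., an R-tree) instead of a linear scan, but the cross-dataset grouping is the point here.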
In the case of unwritten languages, acoustic models cannot be trained in the standard way, i.e., using speech and textual transcriptions. Recently, several methods have been proposed to learn speech representations using images, i.e., through visual grounding. Existing studies have focused on scene images. Here, we investigate whether fine-grained semantic information, reflecting the relationship between attributes and objects, can be learned from spoken language. To this end, a Fine-grained Semantic Embedding Network (FSEN) for learning...
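Visually grounded embedding networks of this kind are commonly trained with a ranking objective that pulls matching speech-image pairs together in the shared space. A minimal triplet-loss sketch, assuming cosine similarity and a fixed margin (this is a generic formulation, not necessarily FSEN's exact loss):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss for a shared speech-image embedding space.

    Pulls the matching (speech, image) pair together and pushes a
    mismatched image away until the similarity gap exceeds `margin`.
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(0.0, margin - cos(anchor, positive) + cos(anchor, negative))

# Toy 2-D embeddings: the speech utterance nearly aligns with its image.
speech = np.array([1.0, 0.0])
img_match = np.array([0.9, 0.1])
img_other = np.array([0.0, 1.0])
```

When the positive already beats the negative by more than the margin, the loss is zero and gradients vanish, which is why hard-negative mining is often paired with this objective.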
A new fault-tolerant onboard computer (OBC) with dual processing modules is presented to improve micro-satellite data handling. Each module is composed of a 32-bit ARM processor. Using a fault tolerance method, the OBC's hardware structure is implemented based on commercial-off-the-shelf (COTS) devices. In addition, a detailed analysis of the data handling mechanism and software architecture is given. Considering the demanding and extremely tight constraints on mass, volume, power consumption, and space environmental conditions,...
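The dual-module fault tolerance idea can be illustrated with a failover pattern: the primary module serves until a health check fails, then the standby takes over. This is a generic sketch of the pattern, not the OBC design from the paper:

```python
class DualModuleOBC:
    """Toy dual-module failover: module "A" is active until its health
    check fails, at which point the healthy standby "B" takes over."""

    def __init__(self):
        self.active = "A"
        self.healthy = {"A": True, "B": True}

    def heartbeat(self):
        """Periodic check: switch to the standby if the active module failed."""
        if not self.healthy[self.active]:
            standby = "B" if self.active == "A" else "A"
            if self.healthy[standby]:
                self.active = standby
        return self.active

    def inject_fault(self, module):
        """Simulate a radiation-induced fault in one module."""
        self.healthy[module] = False
```

A flight implementation would drive the switch from a hardware watchdog and resynchronize state after failover; the sketch only captures the control flow.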
Inspired by the ability of human beings to recognize relations between visual scenes and sounds, many cross-modal learning methods have been developed for modeling images or videos associated with sounds. In this work, for the first time, a Look, Listen and Infer Network (LLINet) is proposed to learn a zero-shot model that can infer sounds from novel categories that never appeared before. LLINet is mainly designed to qualify for two tasks, i.e., image-audio retrieval and sound localization in each image. Towards this end, it...