- Speech Recognition and Synthesis
- Speech and Audio Processing
- Remote-Sensing Image Classification
- Music and Audio Processing
- Remote Sensing and Land Use
- Advanced Image Fusion Techniques
- Hydrocarbon exploration and reservoir analysis
- Image Retrieval and Classification Techniques
- Seismic Imaging and Inversion Techniques
- Drilling and Well Engineering
- Natural Language Processing Techniques
- Geological and Geophysical Studies
- Advanced Image and Video Retrieval Techniques
- Face and Expression Recognition
- Sparse and Compressive Sensing Techniques
- Speech and dialogue systems
- Image and Signal Denoising Methods
- Quantum Computing Algorithms and Architecture
- Constructed Wetlands for Wastewater Treatment
- Gut microbiota and health
- Control and Dynamics of Mobile Robots
- Environmental Chemistry and Analysis
- Melamine detection and toxicity
- Marine and Coastal Research
- Nanocomposite Films for Food Packaging
Jiangsu Normal University
2017-2024
North China University of Science and Technology
2016-2024
Tencent (China)
2024
Xinjiang Petroleum Society
2024
Zhejiang University
2023
Bellevue Hospital Center
2019-2022
Beijing Haidian Hospital
2022
AviChina Industry & Technology (China)
2021
The University of Texas at Dallas
2017
Yangzhou University
2017
Traditional studies on voice conversion (VC) have made progress with parallel training data and known speakers. Good quality is obtained by exploring better alignment modules or expressive mapping functions. In this study, we investigate zero-shot VC from a novel perspective of self-supervised disentangled speech representation learning. Specifically, we achieve the disentanglement by balancing the information flow between the global speaker representation and the time-varying content representation in a sequential variational autoencoder...
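As a loose, hypothetical illustration of splitting a frame sequence into a global (speaker-like) component and a time-varying (content-like) residual: the DSVAE mentioned in the abstract learns both branches variationally, whereas the mean/residual split below is only an analogy.

```python
import numpy as np

# Illustrative sketch only: decompose an utterance's feature frames into a
# time-invariant "global" vector and a per-frame residual. This mirrors the
# global-speaker / time-varying-content split in spirit, not the actual
# variational model described in the abstract.
def split_global_local(frames):
    # frames: (T, D) array of T feature frames of dimension D
    global_part = frames.mean(axis=0, keepdims=True)  # one vector per utterance
    local_part = frames - global_part                 # zero-mean, time-varying
    return global_part, local_part
```

By construction the two parts sum back to the original frames, and the time-varying part has zero mean over time.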
The vision transformer (ViT) has become a hot topic in image processing due to its global feature extraction capabilities. However, the ViT suffers from over-smoothing and over-fitting during the training procedure, so it is hard to achieve satisfactory performance in hyperspectral image (HSI) classification. To address these issues, we propose a ViT with contrastive learning (CViT). The network architecture includes a patch embedding module, transformer blocks, and a classifier. Training of CViT can be considered as an optimization problem supervised...
Deep learning has made significant progress in hyperspectral image (HSI) classification, and its powerful ability to automatically learn abstract features is well recognized. Recently, the simple architecture of the multi-layer perceptron (MLP) has been extensively employed to extract long-range dependencies in HSI and has achieved impressive results. However, existing MLP-based models exhibit insufficient representation of spectral–spatial information and generally aggregate features with fixed weights, which limits their...
Pixel‐wise classification of hyperspectral images (HSI) is a hot spot in the field of remote sensing. HSI requires the model to be more sensitive to dense features, which is quite different from the modelling requirements of traditional tasks. The Cycle‐Multilayer Perceptron (MLP) has achieved satisfactory results in feature prediction tasks because it is an expert at extracting high‐resolution features. In order to obtain a stable receptive field and enhance the effect of feature extraction in multiple directions, we propose an MLP‐like...
Despite the rapid progress in automatic speech recognition (ASR) research, recognizing multilingual speech using a unified ASR system remains highly challenging. Previous works mainly focus on two directions: multiple monolingual settings, or code-switched speech that uses different languages interchangeably within a single utterance. However, a pragmatic recognizer is expected to be compatible with both directions. In this work, a novel language-aware encoder (LAE) architecture is proposed to handle both situations by...
Expressive speech introduces variations in the acoustic features, affecting the performance of speech technology such as speaker verification systems. It is important to identify the range of emotions for which we can reliably estimate these tasks. This paper studies a speaker verification system as a function of emotions. Instead of categorical classes such as happiness or anger, which have intra-class variability, we use the continuous attributes arousal, valence, and dominance to facilitate the analysis. We evaluate a system trained with the i-vector framework and probabilistic linear...
Disentangling content and speaking style information is essential for zero-shot non-parallel voice conversion (VC). Our previous study investigated a novel framework with a disentangled sequential variational autoencoder (DSVAE) as the backbone for decomposition. We have demonstrated that simultaneously disentangling the content embedding and the speaker embedding from one utterance is feasible for zero-shot VC. In this study, we continue this direction by raising a concern about the prior distribution of the content branch in the DSVAE baseline. We find that a randomly initialized...
We study traveling pulses on a lattice and in a continuum where all pairs of particles interact, contributing to the potential energy. The interaction may be positive or negative, depending on the particular pair, but is overall positive in a certain sense. For such an interaction kernel $J$ with unit integral (or sum), the operator $\frac{1}{\varepsilon^2}[J*u - u]$, with $*$ denoting continuous or discrete convolution, shares some common features with the spatial second derivative operator, especially when $\varepsilon$ is small. Therefore, the equation $u_{tt} - \frac{1}{\varepsilon^2}[J*u - u] + f(u) = 0$...
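A small numerical sketch of the operator $\frac{1}{\varepsilon^2}[J*u - u]$ from the abstract (the Gaussian kernel, grid spacing, and support width below are illustrative assumptions, not taken from the paper); for a symmetric unit-sum kernel it annihilates constants and acts like a scaled second derivative on smooth data:

```python
import numpy as np

# Sketch of the nonlocal operator (1/eps^2) * (J*u - u) on a 1-D grid.
# The truncated Gaussian kernel J (normalized to unit sum) is an
# illustrative choice; any symmetric unit-sum kernel behaves similarly.
def nonlocal_operator(u, eps, h):
    offsets = np.arange(-5, 6) * h        # 11-tap support (assumption)
    J = np.exp(-(offsets / eps) ** 2)
    J /= J.sum()                          # enforce unit sum, as required of J
    Ju = np.convolve(u, J, mode="same")   # discrete convolution J*u
    return (Ju - u) / eps ** 2
```

Away from the boundary, applying the operator to a constant gives zero, and applying it to $x^2$ gives the constant $m_2/\varepsilon^2$, where $m_2$ is the kernel's second moment, which is the same qualitative behaviour as a scaled second derivative.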
Deep learning has dominated hyperspectral image (HSI) classification due to its modular design and powerful feature extraction capabilities. Recently, a modern macro-architecture-based framework with high-order interactions has been proposed, inspiring the design of HSI classification models. As the spatial mixer in the macro-architecture, high-order interaction facilitates the aggregation of discriminative information through gated mechanisms and standard convolutions. However, homogeneous operators such as standard convolution struggle to consider different...
This document briefly describes the systems submitted by the Center for Robust Speech Systems (CRSS) from The University of Texas at Dallas (UTD) to the 2016 National Institute of Standards and Technology (NIST) Speaker Recognition Evaluation (SRE). We developed several UBM and DNN i-Vector based speaker recognition systems with different data sets and feature representations. Given that the emphasis of the NIST SRE is on the language mismatch between training and enrollment/test data, the so-called domain mismatch, in our system...
Deep learning methods have shown great promise in automatically extracting features from hyperspectral images (HSIs) for classification purposes. Recently, researchers have recognized the importance of high-order feature interactions (capturing relationships between different image regions) in learning discriminative features. Despite their effectiveness, existing deep models for HSI classification often overlook such interactions, resulting in suboptimal performance. To address this issue, we propose a novel spectral–spatial...
A novel Spectral-Spatial Difference Convolution Network (S²DCN) is proposed for hyperspectral image (HSI) classification, which integrates the difference principle into a deep learning framework. S²DCN employs a learnable gradient encoding pattern to extract important detail features in the spectral and spatial domains, alleviating the information loss caused by the over-smoothing effect of feature...
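The "difference principle" can be sketched with a 1-D central-difference convolution, where each weight acts on the deviation of a neighbour from the centre sample (a toy operator only; S²DCN's learnable spectral and spatial formulation differs):

```python
import numpy as np

# Toy 1-D difference convolution: y[n] = sum_k w[k] * (x[n+k-c] - x[n]),
# where c is the centre tap. Responses depend only on local differences,
# so flat (constant) regions produce exactly zero output.
def diff_conv1d(x, w):
    K = len(w)
    c = K // 2
    pad = np.pad(x, (c, c), mode="edge")  # replicate-pad the borders
    y = np.empty_like(x, dtype=float)
    for n in range(len(x)):
        window = pad[n:n + K]
        y[n] = np.sum(w * (window - x[n]))
    return y
```

Constant inputs map to zero for any weights, and a symmetric kernel also annihilates linear ramps in the interior, so only genuine detail (edges, texture) produces a response.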
Architectures based on the Multi-Layer Perceptron (MLP) have attracted great attention in hyperspectral image (HSI) classification recently, due to their simplified and efficient designs. However, such architectures are constrained by the rigid positional relationships between weights and feature elements, which inhibits their capacity to effectively extract diversified features. To address these challenges, an adaptive spatial-shift MLP (AS2MLP) is presented to dynamically modify spatial features by parameterizing...
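A fixed four-direction spatial shift, the basic operation that shift-MLP architectures build on, can be sketched as follows; AS2MLP's adaptive, parameterized shifts are not reproduced here, this is only the non-adaptive baseline idea with an assumed channel-group layout:

```python
import numpy as np

# Fixed spatial-shift sketch: split channels into four groups and shift each
# group one pixel in a different direction, so a following per-pixel MLP can
# mix information from the four neighbours. Group layout is an assumption.
def spatial_shift(x):
    H, W, C = x.shape
    g = C // 4
    y = x.copy()
    y[1:, :, 0:g]       = x[:-1, :, 0:g]       # shift down
    y[:-1, :, g:2*g]    = x[1:, :, g:2*g]      # shift up
    y[:, 1:, 2*g:3*g]   = x[:, :-1, 2*g:3*g]   # shift right
    y[:, :-1, 3*g:4*g]  = x[:, 1:, 3*g:4*g]    # shift left
    return y
```

Because the shift itself has no parameters, all spatial mixing capacity lives in the channel-mixing MLP that follows it; AS2MLP instead learns where each group shifts.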
This paper describes an end-to-end adversarial singing voice conversion (EA-SVC) approach. It can directly generate an arbitrary waveform given a phonetic posteriorgram (PPG) representing content, an F0 representing pitch, and a speaker embedding representing timbre, respectively. The proposed system is composed of three modules: the generator $G$, the audio generation discriminator $D_{A}$, and the feature disentanglement discriminator $D_F$. The generator $G$ encodes the features in parallel and inversely transforms them into the target waveform. In order to make the timbre...
Speaker diarization consists of many components, e.g., front-end processing, speech activity detection (SAD), overlapped speech detection (OSD), and speaker segmentation/clustering. Conventionally, most of the involved components are separately developed and optimized. The resulting systems are complicated and sometimes lack satisfying generalization capabilities. In this study, we present a novel system with a generalized neural clustering module as the backbone. The whole system can be simplified to contain only two major parts,...
Various applications of voice synthesis have been developed independently, despite the fact that they all generate "voice" as output. In addition, the majority of models currently rely on annotated audio data, but it is crucial to scale them to self-supervised datasets in order to effectively capture the wide range of acoustic variations present in the human voice, including speaker identity, emotion, and prosody. In this work, we propose Make-A-Voice, a unified framework for synthesizing and manipulating voice signals from...