- Natural Language Processing Techniques
- Speech Recognition and Synthesis
- Topic Modeling
- Speech and Audio Processing
- Image Retrieval and Classification Techniques
- Handwritten Text Recognition Techniques
- Music and Audio Processing
- Advanced Image and Video Retrieval Techniques
- Image Processing and 3D Reconstruction
- Web Data Mining and Analysis
- Multimodal Machine Learning Applications
- Advanced Computational Techniques and Applications
- Speech and dialogue systems
- Emotion and Mood Recognition
- Text and Document Classification Technologies
- Advanced Graph Neural Networks
- Advanced Text Analysis Techniques
- Educational Technology and Assessment
- Neural Networks and Applications
- Sentiment Analysis and Opinion Mining
- Face and Expression Recognition
- Linguistics and Cultural Studies
- Advanced Adaptive Filtering Techniques
- Digital Media Forensic Detection
- Data Management and Algorithms
Inner Mongolia University
2016-2025
National University of Mongolia
2017-2021
University of Delaware
2009
Louisiana State University
2009
Inner Mongolia University of Technology
2009
Text-to-Speech (TTS) aims to convert the input text to a human-like voice. With the development of deep learning, encoder-decoder based TTS models achieve superior performance, in terms of naturalness, in mainstream languages such as Chinese, English, etc. Note that the linguistic information learning capability of the encoder is the key. However, for low-resource agglutinative languages, the scale ...
We propose a novel training strategy for a Tacotron-based text-to-speech (TTS) system that improves the speech styling at the utterance level. One of the key challenges in prosody modeling is the lack of reference that makes explicit modeling difficult. The proposed technique doesn't require annotations from the data. It doesn't attempt to model prosody explicitly either, but rather encodes the association between the input text and its styles using the TTS framework. This study marks a departure from the style token paradigm where prosody is modeled by a bank of embeddings....
Multimodal emotion recognition leverages complementary information across modalities to gain performance. However, we cannot guarantee that the data of all modalities are always present in practice. In studies that predict the missing modalities, the inherent difference between heterogeneous modalities, namely the modality gap, presents a challenge. To address this, we propose to use invariant features for a missing modality imagination network (IF-MMIN), which includes two novel mechanisms: 1) an invariant feature learning strategy that is based on the central moment...
Supervised speech separation methods train a learning machine to cast the noisy speech to the target clean speech. Most of them use mean-square error (MSE) as the loss function. However, MSE is not a perfect choice because it doesn't match human auditory perception. Short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ) are closely related to human auditory perception and are widely used in speech separation research as evaluation criteria. Therefore, STOI and PESQ may be better choices for the loss function. However, they are nondifferentiable functions which cannot...
While neural end-to-end text-to-speech (TTS) is superior to conventional statistical methods in many ways, the exposure bias problem in autoregressive models remains an issue to be resolved. The exposure bias problem arises from the mismatch between the training and inference process, which results in unpredictable performance for out-of-domain test data at run-time. To overcome this, we propose a teacher-student training scheme for Tacotron-based TTS by introducing a distillation loss function in addition to the feature loss function. We first train...
Accented text-to-speech (TTS) synthesis seeks to generate speech with an accent (L2) as a variant of the standard version (L1), which is challenging as L2 is different from L1 in terms of both phonetic rendering and prosody pattern (pitch, energy, duration variance, etc.). Accented TTS has several significant real-world applications, such as language learning and preserving and documenting endangered languages and dialects, that make it an important area of research and development. Moreover, changing the accent intensity in any conversational AI...
The great challenge of handwritten mathematical expression recognition (HMER) is the complex structures of expressions, which are directly related to symbol spatial positions. Existing HMER methods typically employ attention mechanisms in the decoder of their models to implicitly perceive the symbol positions, or employ counting and tree-based strategies to model the symbol relation. However, these methods still cannot effectively capture the structural information of formulas, thus negatively impacting the decoding of HMER. To deal with this problem...
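The dynamic system described by f_c(z) = z^2 + c is called the Mandelbrot set (M-set), which is important for fractal and chaos theories due to its simple expression and complex structure. The system f_c(z) = z^k + c is called the generalized M-set (k-M set). This paper proposes a new theory to compute the higher and lower bounds of the generalized M-set while the exponent k is rational, and proves relevant properties, such as that the generalized M-set could cover the whole complex number plane when k < 1, and that the boundary of the generalized M-set ranges from a circle with radius 1 to infinitely large. This paper explores the characteristics of the generalized M-set, such as that the k-M set is determined by k, when k = p/q, where p...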
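As a rough illustration of the iteration f_c(z) = z^k + c mentioned in the abstract above, the following minimal Python sketch counts how many steps a point c takes to leave a disc around the origin. The function name `escape_time`, the iteration cap, and the escape radius of 2 are assumptions made for this example and are not taken from the paper. The radius 2 is the usual choice for k = 2, and the sketch does not implement the paper's bounds for rational k.

```python
def escape_time(c, k=2, max_iter=100, radius=2.0):
    """Iterate z -> z**k + c from z = 0 and return the first step at which
    |z| exceeds `radius`, or `max_iter` if it never does within the cap.

    Assumption: `radius` = 2.0 is the standard escape radius for k = 2 only.
    Python evaluates non-integer powers of complex numbers on the principal
    branch, so non-integer k would need separate care.
    """
    z = 0j
    for n in range(max_iter):
        if abs(z) > radius:
            return n
        z = z**k + c
    return max_iter


if __name__ == "__main__":
    # c = 0 stays at the origin, c = -1 cycles between 0 and -1, c = 1 escapes.
    for c in (0, -1, 1):
        print(c, escape_time(c))
```

Points that reach `max_iter` are treated as belonging to the set only up to the chosen iteration cap, so the result is an approximation.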
Speech separation and pitch estimation in noisy conditions are considered to be a "chicken-and-egg" problem. On one hand, pitch information is an important cue for speech separation. On the other hand, speech separation makes pitch estimation easier when the background noise is removed. In this paper, we propose a supervised learning architecture to solve these two problems iteratively. The proposed algorithm is based on the deep stacking network (DSN), which provides a method of stacking simple processing modules to build deep architectures. Each module is a classifier whose target...
Prosodic phrasing is an important factor that affects naturalness and intelligibility in text-to-speech synthesis. Studies show that deep learning techniques improve prosodic phrasing when a large text and speech corpus is available. However, for low-resource languages, such as Mongolian, prosodic phrasing remains a challenge for various reasons. First, the database suitable for system training is limited. Second, word composition knowledge that is prosody-informing has not been used in prosodic phrase modeling. To address these problems, in this article, we...
Temporal knowledge graph embedding (TKGE) models are commonly utilized to infer the missing facts and facilitate reasoning and decision-making in temporal knowledge graph based systems. However, existing methods fuse temporal information into the entities, potentially leading to the evolution of entity information and limiting the link prediction performance of TKG. Meanwhile, current TKGE models often lack the ability to simultaneously model important relation patterns and provide interpretability, which hinders their effectiveness and potential applications. To address...
Linear Graph Convolutional Networks (GCNs) are used to classify the nodes in graph data. However, we note that most existing linear GCN models perform neural network operations in Euclidean space, which do not explicitly capture the tree-like hierarchical structure exhibited in real-world datasets that are modeled as graphs. In this paper, we attempt to introduce hyperbolic space into linear GCN and propose a novel framework for Lorentzian linear GCN. Specifically, we map the learned features of graph nodes into hyperbolic space and then perform a feature transformation to capture the underlying...
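The step of mapping Euclidean node features into hyperbolic space can be sketched with the standard exponential map at the origin of the Lorentz (hyperboloid) model. This is a generic textbook construction under the assumption of curvature -1, not necessarily the exact mapping used in the paper. The helper names `exp_map_origin` and `lorentz_inner` are hypothetical.

```python
import math


def exp_map_origin(v):
    """Map a Euclidean feature vector v (a tangent vector at the hyperboloid
    origin) onto the unit hyperboloid, assuming curvature -1.

    Returns a point x = (x0, x1, ..., xn) with <x, x>_L = -1.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        # The zero vector maps to the hyperboloid origin (1, 0, ..., 0).
        return [1.0] + [0.0] * len(v)
    scale = math.sinh(norm) / norm
    return [math.cosh(norm)] + [scale * x for x in v]


def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


if __name__ == "__main__":
    point = exp_map_origin([0.3, -0.4])
    # A valid hyperboloid point has Lorentzian norm -1 (up to rounding).
    print(point, lorentz_inner(point, point))
```

Checking that the Lorentzian inner product of a mapped point with itself is -1 is a quick sanity test that features stay on the manifold.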
In single-channel speech enhancement, methods based on full-band spectral features have been widely studied. However, only a few methods pay attention to non-full-band spectral features. In this paper, we explore a knowledge distillation framework based on sub-band spectral mapping for single-channel speech enhancement. Specifically, we divide the full frequency band into multiple sub-bands and pre-train an elite-level sub-band enhancement model (teacher model) for each sub-band. These teacher models are dedicated to processing their own sub-bands. Next, under the teacher models'...
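The idea of dividing the full frequency band into multiple sub-bands can be sketched as a simple partition of spectrogram frequency-bin indices. The function name `split_subbands`, the near-equal band widths, and the example of 257 bins are assumptions for illustration only. The abstract does not state how the bands are actually chosen.

```python
def split_subbands(num_bins, num_bands):
    """Partition frequency-bin indices 0..num_bins-1 into `num_bands`
    contiguous, near-equal sub-bands.

    Returns a list of (start, stop) half-open index ranges. Any remainder
    bins are given to the lowest bands (an arbitrary choice for this sketch).
    """
    if num_bands <= 0 or num_bins < num_bands:
        raise ValueError("need 0 < num_bands <= num_bins")
    base, extra = divmod(num_bins, num_bands)
    bands, start = [], 0
    for i in range(num_bands):
        width = base + (1 if i < extra else 0)
        bands.append((start, start + width))
        start += width
    return bands


if __name__ == "__main__":
    # e.g. 257 bins (a 512-point FFT, assumed here) split into 4 sub-bands
    for lo, hi in split_subbands(257, 4):
        print(f"bins {lo}..{hi - 1}")
```

Each (start, stop) range could then be used to slice a spectrogram along the frequency axis before passing it to that sub-band's own model.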
We propose a novel training strategy for a Tacotron-based text-to-speech (TTS) system to improve the expressiveness of speech. One of the key challenges in prosody modeling is the lack of reference that makes explicit modeling difficult. The proposed technique doesn't require annotations from the data. It doesn't attempt to model prosody explicitly either, but rather encodes the association between the input text and its styles using the TTS framework. Our idea marks a departure from the style token paradigm where prosody is modeled by a bank of embeddings. The proposed technique adopts a combination...