- Topic Modeling
- Natural Language Processing Techniques
- Speech Recognition and Synthesis
- Advanced Computational Techniques and Applications
- Chinese History and Philosophy
- Speech and Audio Processing
- Advanced Text Analysis Techniques
- Service-Oriented Architecture and Web Services
- Semantic Web and Ontologies
- Music and Audio Processing
- Translation Studies and Practices
- Web Data Mining and Analysis
- Speech and Dialogue Systems
- Simulation and Modeling Applications
- Industrial Technology and Control Systems
- Biomedical Text Mining and Ontologies
- Educational Reforms and Innovations
- Remote Sensing and Land Use
- Language, Metaphor, and Cognition
- Educational Technology and Pedagogy
- Recommender Systems and Techniques
- Text and Document Classification Technologies
- Geomechanics and Mining Engineering
- Multimodal Machine Learning Applications
- Data Quality and Management
Harbin Institute of Technology
2010-2024
Jiamusi University
2024
China University of Geosciences (Beijing)
2024
Affiliated Hospital of Chengde Medical College
2023
Dalian University of Technology
2019-2022
Google (United States)
2019-2022
Qingdao University
2022
BGI Group (China)
2021
Kunming Metallurgy College
2021
Guangdong Institute of Intelligent Manufacturing
2020
This paper introduces a new speech corpus called "LibriTTS" designed for text-to-speech use. It is derived from the original audio and text materials of the LibriSpeech corpus, which has been used for training and evaluating automatic speech recognition systems. The new corpus inherits desired properties of the LibriSpeech corpus while addressing a number of issues which make LibriSpeech less than ideal for text-to-speech work. The released corpus consists of 585 hours of speech data at a 24kHz sampling rate from 2,456 speakers and the corresponding texts. Experimental results show that neural end-to-end TTS models trained...
Sequence-to-sequence models have shown success in end-to-end speech recognition. However, these models have only used shallow acoustic encoder networks. In our work, we successively train very deep convolutional networks to add more expressive power and better generalization for end-to-end ASR models. We apply network-in-network principles, batch normalization, residual connections, and convolutional LSTMs to build very deep recurrent and convolutional structures. Our models exploit the spectral structure in the feature space and add computational depth without overfitting issues...
We employ a combination of recent developments in semi-supervised learning for automatic speech recognition to obtain state-of-the-art results on LibriSpeech, utilizing the unlabeled audio of the Libri-Light dataset. More precisely, we carry out noisy student training with SpecAugment using giant Conformer models pre-trained using wav2vec 2.0 pre-training. By doing so, we are able to achieve word-error-rates (WERs) of 1.4%/2.6% on the test/test-other sets, against the current state-of-the-art WERs of 1.7%/3.3%.
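The noisy student loop described above can be sketched with a deliberately tiny stand-in model: a teacher pseudo-labels the unlabeled pool, and a student is retrained on the noisy union of real and pseudo-labeled data, becoming the next teacher. The nearest-centroid classifier, the 1-D "features", and all names here are illustrative assumptions, not the paper's Conformer/wav2vec 2.0 setup.

```python
import random

def centroid_fit(examples):
    """Fit a nearest-centroid classifier; examples is a list of (x, label)."""
    sums, counts = {}, {}
    for x, y in examples:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def centroid_predict(model, x):
    """Predict the label whose centroid is closest to x."""
    return min(model, key=lambda y: abs(x - model[y]))

def noisy_student(labeled, unlabeled, rounds=3, noise=0.1, seed=0):
    """Iteratively pseudo-label the unlabeled pool and retrain on noisy inputs."""
    rng = random.Random(seed)
    teacher = centroid_fit(labeled)
    for _ in range(rounds):
        pseudo = [(x, centroid_predict(teacher, x)) for x in unlabeled]
        # Input noise plays the role SpecAugment plays for speech features.
        noisy = [(x + rng.gauss(0, noise), y) for x, y in labeled + pseudo]
        teacher = centroid_fit(noisy)  # the student becomes the next teacher
    return teacher

labeled = [(0.0, "a"), (1.0, "b")]     # small paired set
unlabeled = [0.1, 0.2, 0.9, 0.8]       # large "Libri-Light-like" pool
model = noisy_student(labeled, unlabeled)
```

The key design point mirrored here is that the unlabeled pool, once pseudo-labeled, lets the student see far more (noised) training examples than the paired set alone provides.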
This paper presents a method to train end-to-end automatic speech recognition (ASR) models using unpaired data. Although the end-to-end approach can eliminate the need for expert knowledge such as pronunciation dictionaries to build ASR systems, it still requires a large amount of paired data, i.e., utterances and their transcriptions. Cycle-consistency losses have been recently proposed as a way to mitigate the problem of limited paired data. These approaches compose a reverse operation with a given transformation, e.g., text-to-speech (TTS)...
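The compose-a-reverse-operation idea can be made concrete with a toy pipeline: a stand-in "ASR" maps audio frames to discrete tokens, a stand-in "TTS" maps tokens back to frames, and the cycle-consistency loss measures how poorly the round trip reconstructs the original audio, with no transcription required. The codebook and both mappings below are hypothetical simplifications, not the paper's models.

```python
def asr(audio, codebook):
    """Toy "ASR": map each frame to the index of its nearest codebook entry."""
    return [min(range(len(codebook)), key=lambda i: abs(f - codebook[i]))
            for f in audio]

def tts(tokens, codebook):
    """Toy "TTS": map each token back to its codebook value."""
    return [codebook[t] for t in tokens]

def cycle_consistency_loss(audio, codebook):
    """Mean squared error between audio and TTS(ASR(audio)).
    Only the audio itself is needed -- no paired transcription."""
    recon = tts(asr(audio, codebook), codebook)
    return sum((a - r) ** 2 for a, r in zip(audio, recon)) / len(audio)

codebook = [0.0, 1.0]           # two "phoneme" embeddings
clean = [0.0, 1.0, 1.0, 0.0]    # frames matching the codebook exactly
noisy = [0.2, 0.7, 1.3, 0.1]
loss_clean = cycle_consistency_loss(clean, codebook)  # 0.0
loss_noisy = cycle_consistency_loss(noisy, codebook)  # > 0
```

Minimizing this loss over unpaired audio is what lets the forward model improve without transcriptions; in the real system both directions are learned networks rather than fixed lookups.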
Recent neural text-to-speech (TTS) models with fine-grained latent features enable precise control of the prosody of synthesized speech. Such models typically incorporate a fine-grained variational autoencoder (VAE) structure, extracting latent features at each input token (e.g., phonemes). However, generating samples with the standard VAE prior often results in unnatural and discontinuous speech, with dramatic prosodic variation between tokens. This paper proposes a sequential prior in a discrete latent space which can generate more naturally sounding samples...
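The contrast between a standard (independent) prior and a sequential prior over discrete latent codes can be illustrated with sampling alone: independent draws change code at almost every token, while a sequential prior that conditions each code on its predecessor yields smoother trajectories. The Markov "stay probability" below is a hypothetical stand-in for the paper's learned autoregressive prior.

```python
import random

def sample_independent(n, codes, rng):
    """Standard-prior-style sampling: each token's code drawn independently."""
    return [rng.choice(codes) for _ in range(n)]

def sample_sequential(n, codes, rng, stay=0.9):
    """Sequential-prior-style sampling: each code usually repeats its
    predecessor, giving smoother prosody-like trajectories."""
    seq = [rng.choice(codes)]
    for _ in range(n - 1):
        seq.append(seq[-1] if rng.random() < stay else rng.choice(codes))
    return seq

def jumps(seq):
    """Count code changes between adjacent tokens (a roughness proxy)."""
    return sum(1 for a, b in zip(seq, seq[1:]) if a != b)

rng = random.Random(0)
codes = [0, 1, 2, 3]               # discrete latent codebook indices
ind = sample_independent(200, codes, rng)
seq = sample_sequential(200, codes, rng)
# jumps(seq) is far smaller than jumps(ind): fewer abrupt prosody switches.
```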
We present Maestro, a self-supervised training method to unify representations learnt from speech and text modalities. Self-supervised learning from speech signals aims to learn the latent structure inherent in the signal, while self-supervised learning from text attempts to capture lexical information. Learning aligned representations from unpaired speech and text sequences is a challenging task. Previous work either implicitly enforced these two modalities to be aligned in a shared space through multitasking and parameter sharing, or explicitly through conversion of modalities via speech synthesis. While the former suffers from interference between...
Unsupervised pre-training is now the predominant approach for both text and speech understanding. Self-attention models pre-trained on large amounts of unannotated data have been hugely successful when fine-tuned on downstream tasks from a variety of domains and languages. This paper takes the universality of unsupervised language pre-training one step further, by unifying speech and text pre-training within a single model. We build a single encoder with the BERT objective on unlabeled text together with the w2v-BERT objective on unlabeled speech. To further align our model representations across...
Building ASR models across many languages is a challenging multi-task learning problem due to large variations and heavily unbalanced data. Existing work has shown positive transfer from high resource to low resource languages. However, degradations on high resource languages are commonly observed due to interference from the heterogeneous multilingual data and reduction in per-language capacity. We conduct a capacity study on a 15-language task, with the amount of data per language varying from 7.6K to 53.5K hours. We adopt GShard [1] to efficiently scale up to 10B...
Recent neural network models for Chinese zero pronoun resolution gain great performance by capturing semantic information of zero pronouns and candidate antecedents, but tend to be short-sighted, operating solely by making local decisions. They typically predict coreference links between the zero pronoun and one single candidate antecedent at a time, while ignoring their influence on future decisions. Ideally, modeling useful information of preceding potential antecedents is crucial for classifying later zero pronoun-candidate pairs, a need which leads...
Table-based fact verification is expected to perform both linguistic reasoning and symbolic reasoning. Existing methods pay little attention to taking advantage of the combination of linguistic information and symbolic information. In this work, we propose HeterTFV, a graph-based reasoning approach that learns to combine the two kinds of information effectively. We first construct a program graph to encode programs, a kind of LISP-like logical form, and learn the semantic compositionality of the programs. Then we construct a heterogeneous graph to incorporate both kinds of information by introducing program nodes into the graph. Finally, we propose a graph-based approach to reason...
English promotional videos are crucial tools for image dissemination by universities, playing a key role in shaping institutional branding, attracting potential students, and enhancing social awareness. While numerous studies have explored the linguistic characteristics of university promotional videos from a semiotic perspective, systematic analyses from the perspective of the textual meta-function remain relatively scarce. This study, utilizing the UAM Corpus Tool, investigates the thematic structure of promotional videos from 89 universities in China and abroad through...
Research indicates that the performance gap between English Language Learners (ELLs) and their non-ELL peers is partly due to ELLs' difficulty in understanding assessment language. Accommodations have been shown to narrow this performance gap, but many accommodation studies have not used a randomized design and are based on relatively small sample sizes. Addressing such issues, we administered standards-based mathematics tests to approximately 3,000 Grade 9 ELL students under five different...
Automatically extracting relations between chemicals and diseases plays an important role in biomedical text mining. Chemical-disease relation (CDR) extraction aims at extracting the complex semantic relationships between entities in documents, which contain both intra-sentence and inter-sentence relations. Most previous methods did not consider the dependency syntactic information across sentences, which is very valuable for the task, in particular for extracting inter-sentence relations accurately. In this paper, we propose a novel end-to-end neural network based on graph...
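The document-level graph idea above can be illustrated with one unweighted neighbour-averaging step, the core of a graph-convolution layer stripped of learned weights and nonlinearities. The toy graph mixes intra-sentence dependency arcs with one inter-sentence link, so information reaches entities in other sentences after a couple of hops; the node features and edges are hypothetical.

```python
def gcn_layer(features, edges):
    """One round of neighbour averaging over a word graph (self-loops
    included) -- a graph-convolution step without learned parameters."""
    n = len(features)
    neigh = {i: [i] for i in range(n)}       # self-loop for each node
    for a, b in edges:
        neigh[a].append(b)
        neigh[b].append(a)
    return [sum(features[j] for j in neigh[i]) / len(neigh[i])
            for i in range(n)]

# Toy document: 4 word nodes. Edges (0,1) and (2,3) are intra-sentence
# dependency arcs; (1,2) is an inter-sentence link between the sentences.
feats = [1.0, 0.0, 0.0, 0.0]     # only node 0 carries a signal initially
edges = [(0, 1), (2, 3), (1, 2)]
h1 = gcn_layer(feats, edges)     # node 1 now sees node 0's feature
h2 = gcn_layer(h1, edges)        # after two hops, node 2 (next sentence) does too
```

Stacking such layers is what lets a real model propagate evidence between a chemical mention and a disease mention that never share a sentence.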
In this paper, we propose a novel approach to identifying the user intents of search engine queries. Specifically, we recast it as a classification problem, in which four types of features are adopted. The features are based on deep linguistic analysis of queries as well as user feedback. We evaluate the method with real web query data. The results show that about 88% of test queries can be correctly identified by the framework via combining all 4 feature types.
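A minimal sketch of combining several feature families into one intent decision might look as follows. The feature extractors, the navigational/informational/transactional intent labels, and the hand-set weights are all illustrative assumptions standing in for the paper's learned classifier over its four feature types.

```python
def extract_features(query):
    """Toy versions of a few query feature families (illustrative only)."""
    q = query.lower()
    return {
        "has_url_token": any(t in q for t in (".com", "www", "http")),
        "has_question_word": q.split()[0] in {"how", "what", "why", "where"},
        "has_action_verb": any(t in q for t in ("download", "buy", "order")),
        "is_entity_like": len(q.split()) <= 2,
    }

WEIGHTS = {  # hand-set weights per intent; a real system learns these
    "navigational": {"has_url_token": 2.0, "is_entity_like": 1.0},
    "informational": {"has_question_word": 2.0},
    "transactional": {"has_action_verb": 2.0},
}

def classify(query):
    """Score each intent as the sum of weights for the features that fire."""
    feats = extract_features(query)
    scores = {intent: sum(w for f, w in fw.items() if feats[f])
              for intent, fw in WEIGHTS.items()}
    return max(scores, key=scores.get)
```

Combining feature families this way is why the paper's framework does best with all four types enabled: each family covers queries the others miss.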
Acoustic feature similarity between search results has been shown to be very helpful for the task of spoken term detection (STD). A graph-based re-ranking approach to STD is proposed, based on the concept that results which are acoustically similar to other results with higher confidence scores should have higher scores themselves. In this approach, all search results for a given query are considered as nodes in a graph, and confidence scores propagate through the graph. Since this can improve performance without any additional labelled data, it is especially suitable for languages with limited amounts...
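The score-propagation idea can be sketched as PageRank-style smoothing over a similarity graph: each result's new score mixes its original detector confidence with a similarity-weighted average of its neighbours' scores. The similarity matrix, the mixing weight `alpha`, and the iteration count below are illustrative assumptions, not the paper's exact formulation.

```python
def rerank(scores, sim, alpha=0.85, iters=50):
    """Propagate confidence scores over an acoustic-similarity graph.
    new score = (1 - alpha) * original score
              + alpha * similarity-weighted average of neighbours."""
    n = len(scores)
    # Row-normalize the similarity matrix so each row sums to 1.
    norm = [[sim[i][j] / sum(sim[i]) for j in range(n)] for i in range(n)]
    s = scores[:]
    for _ in range(iters):
        s = [(1 - alpha) * scores[i]
             + alpha * sum(norm[i][j] * s[j] for j in range(n))
             for i in range(n)]
    return s

# Three hypothesized detections: results 0 and 1 are acoustically very
# similar to each other; result 2 is largely isolated.
scores = [0.9, 0.2, 0.5]
sim = [[0.0, 1.0, 0.1],
       [1.0, 0.0, 0.1],
       [0.1, 0.1, 0.0]]
new = rerank(scores, sim)
# Result 1 is pulled up by its high-confidence neighbour 0.
```

No labels are consumed anywhere in the update, which is the property that makes the approach attractive for low-resource languages.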
Effective communication between humans often embeds both temporal and spatial context. While spatial context captures the geographic settings of objects in the environment, temporal context describes their changes over time. In this paper, we propose temporal-spatial inverse semantics (TeSIS) to extend the inverse semantics approach to also consider temporal context for robots communicating with humans. Inverse semantics generates natural language requests while taking into account how well human listeners would interpret those requests given the current context. Compared to inverse semantics, our approach incorporates temporal context by...
End-to-end automatic speech recognition (ASR) models, including both attention-based models and the recurrent neural network transducer (RNN-T), have shown superior performance compared to conventional systems. However, previous studies have focused primarily on short utterances that typically last for just a few seconds or, at most, tens of seconds. Whether such architectures are practical on long-form speech lasting from minutes to hours remains an open question. In this paper, we investigate and improve end-to-end...
In the pre-deep-learning era, part-of-speech (POS) tags were considered indispensable ingredients for feature engineering in dependency parsing, and quite a few works focused on joint tagging and parsing models to avoid error propagation. In contrast, recent studies suggest that POS tagging becomes much less important or even useless for neural parsing, especially when using character-based word representations. Yet there are not enough investigations focusing on this issue, both empirically and linguistically...