Zhang Yu

ORCID: 0000-0003-2012-226X
Research Areas
  • Topic Modeling
  • Natural Language Processing Techniques
  • Speech Recognition and Synthesis
  • Advanced Computational Techniques and Applications
  • Chinese history and philosophy
  • Speech and Audio Processing
  • Advanced Text Analysis Techniques
  • Service-Oriented Architecture and Web Services
  • Semantic Web and Ontologies
  • Music and Audio Processing
  • Translation Studies and Practices
  • Web Data Mining and Analysis
  • Speech and dialogue systems
  • Simulation and Modeling Applications
  • Industrial Technology and Control Systems
  • Biomedical Text Mining and Ontologies
  • Educational Reforms and Innovations
  • Remote Sensing and Land Use
  • Language, Metaphor, and Cognition
  • Educational Technology and Pedagogy
  • Recommender Systems and Techniques
  • Text and Document Classification Technologies
  • Geomechanics and Mining Engineering
  • Multimodal Machine Learning Applications
  • Data Quality and Management

Harbin Institute of Technology
2010-2024

Jiamusi University
2024

China University of Geosciences (Beijing)
2024

Affiliated Hospital of Chengde Medical College
2023

Dalian University of Technology
2019-2022

Google (United States)
2019-2022

Qingdao University
2022

BGI Group (China)
2021

Kunming Metallurgy College
2021

Guangdong Institute of Intelligent Manufacturing
2020

This paper introduces a new speech corpus called "LibriTTS" designed for text-to-speech use. It is derived from the original audio and text materials of the LibriSpeech corpus, which has been used for training and evaluating automatic speech recognition systems. The new corpus inherits desired properties of the LibriSpeech corpus while addressing a number of issues which make LibriSpeech less than ideal for text-to-speech work. The released corpus consists of 585 hours of speech data at a 24kHz sampling rate from 2,456 speakers and the corresponding texts. Experimental results show that neural end-to-end TTS models trained...

10.21437/interspeech.2019-2441 article EN Interspeech 2019 2019-09-13

Sequence-to-sequence models have shown success in end-to-end speech recognition. However, these models have only used shallow acoustic encoder networks. In our work, we successively train very deep convolutional networks to add more expressive power and better generalization for end-to-end ASR models. We apply network-in-network principles, batch normalization, residual connections and convolutional LSTMs to build very deep recurrent and convolutional structures. Our models exploit the spectral structure of the feature space and add computational depth without overfitting issues....

10.1109/icassp.2017.7953077 preprint EN 2017-03-01

We employ a combination of recent developments in semi-supervised learning for automatic speech recognition to obtain state-of-the-art results on LibriSpeech, utilizing the unlabeled audio of the Libri-Light dataset. More precisely, we carry out noisy student training with SpecAugment using giant Conformer models pre-trained using wav2vec 2.0 pre-training. By doing so, we are able to achieve word-error-rates (WERs) of 1.4%/2.6% on the LibriSpeech test/test-other sets against the current state-of-the-art WERs of 1.7%/3.3%.

10.48550/arxiv.2010.10504 preprint EN other-oa arXiv (Cornell University) 2020-01-01

This paper presents a method to train end-to-end automatic speech recognition (ASR) models using unpaired data. Although the end-to-end approach can eliminate the need for expert knowledge such as pronunciation dictionaries to build ASR systems, it still requires a large amount of paired data, i.e., speech utterances and their transcriptions. Cycle-consistency losses have been recently proposed as a way to mitigate the problem of limited paired data. These approaches compose a reverse operation with a given transformation, e.g., text-to-speech (TTS)...

10.1109/icassp.2019.8683307 article EN ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019-04-17

Recent neural text-to-speech (TTS) models with fine-grained latent features enable precise control of the prosody of synthesized speech. Such models typically incorporate a fine-grained variational autoencoder (VAE) structure, extracting latent features at each input token (e.g., phonemes). However, generating samples with the standard VAE prior often results in unnatural and discontinuous speech, with dramatic prosodic variation between tokens. This paper proposes a sequential prior in a discrete latent space which can generate more naturally sounding samples....

10.1109/icassp40776.2020.9053436 article EN ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020-04-09
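
The motivation above, that an independent standard-normal prior makes adjacent token latents jump erratically while a sequential prior varies smoothly, can be illustrated with a toy numerical sketch (the AR(1) prior and all numbers here are illustrative assumptions, not the paper's model):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 2000  # number of input tokens (e.g., phonemes)

# Independent standard-normal prior: each token's latent is drawn i.i.d.
z_iid = rng.standard_normal(T)

# Toy sequential (AR(1)) prior: each latent depends on the previous one,
# with noise scale sqrt(1 - a^2) so the marginal variance stays at 1.
a = 0.95
z_ar = np.empty(T)
z_ar[0] = rng.standard_normal()
for t in range(1, T):
    z_ar[t] = a * z_ar[t - 1] + np.sqrt(1 - a**2) * rng.standard_normal()

# Average jump between adjacent tokens: the sequential prior varies far less,
# mirroring the motivation for smoother, more natural prosody.
jump_iid = np.mean(np.abs(np.diff(z_iid)))
jump_ar = np.mean(np.abs(np.diff(z_ar)))
print(jump_ar < jump_iid)
```

Both priors have the same per-token marginal distribution; only the dependence between neighboring tokens differs, which is exactly the property the sequential prior exploits.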

We present Maestro, a self-supervised training method to unify representations learnt from speech and text modalities. Self-supervised learning from speech signals aims to learn the latent structure inherent in the signal, while self-supervised learning from text attempts to capture lexical information. Learning aligned representations from unpaired speech and text sequences is a challenging task. Previous work either implicitly enforced these two modalities to be aligned in the latent space through multitasking and parameter sharing, or explicitly through conversion of modalities via speech synthesis. While the former suffers from interference between...

10.21437/interspeech.2022-10937 article EN Interspeech 2022 2022-09-16

Unsupervised pre-training is now the predominant approach for both text and speech understanding. Self-attention models pre-trained on large amounts of unannotated data have been hugely successful when fine-tuned on downstream tasks from a variety of domains and languages. This paper takes the universality of unsupervised language pre-training one step further, by unifying speech and text pre-training within a single model. We build a single encoder with the BERT objective on unlabeled text together with the w2v-BERT objective on unlabeled speech. To further align our model representations across...

10.48550/arxiv.2110.10329 preprint EN cc-by arXiv (Cornell University) 2021-01-01

Building ASR models across many languages is a challenging multi-task learning problem due to large variations and heavily unbalanced data. Existing work has shown positive transfer from high resource to low resource languages. However, degradations on high resource languages are commonly observed due to interference from the heterogeneous multilingual data and reduction in per-language capacity. We conduct a capacity study on a 15-language task, with the amount of data per language varying from 7.6K to 53.5K hours. We adopt GShard [1] to efficiently scale up to 10B...

10.1109/asru51503.2021.9687871 article EN 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2021-12-13

Recent neural network models for Chinese zero pronoun resolution gain great performance by capturing semantic information for zero pronouns and candidate antecedents, but tend to be short-sighted, operating solely by making local decisions. They typically predict coreference links between the zero pronoun and one single candidate antecedent at a time while ignoring their influence on future decisions. Ideally, modeling useful information of preceding potential antecedents is crucial for classifying later zero pronoun-candidate antecedent pairs, a need which leads...

10.18653/v1/p18-1053 article EN cc-by Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2018-01-01

Table-based fact verification is expected to perform both linguistic reasoning and symbolic reasoning. Existing methods lack attention to take advantage of the combination of linguistic information and symbolic information. In this work, we propose HeterTFV, a graph-based reasoning approach, that learns to combine linguistic information and symbolic information effectively. We first construct a program graph to encode programs, a kind of LISP-like logical form, to learn the semantic compositionality of the programs. Then we construct a heterogeneous graph to incorporate both kinds of information by introducing program nodes into the graph. Finally, we propose a graph-based reasoning approach to reason...

10.18653/v1/2020.coling-main.466 article EN cc-by Proceedings of the 28th International Conference on Computational Linguistics 2020-01-01

English promotional videos are crucial tools for image dissemination in universities, playing a key role in shaping institutional branding, attracting potential students, and enhancing social awareness. While numerous studies have explored the linguistic characteristics of university promotional videos from a semiotic perspective, systematic analyses from the perspective of textual meta-function remain relatively scarce. This study, utilizing the UAM Corpus Tool, investigates the thematic structure of promotional videos from 89 universities in China and abroad through...

10.32996/ijllt.2025.6.4.21 article EN International Journal of Linguistics, Literature and Translation 2025-04-17

Research indicates that the performance gap between English Language Learners (ELLs) and their non-ELL peers is partly due to ELLs' difficulty in understanding assessment language. Accommodations have been shown to narrow this performance gap, but many accommodation studies have not used a randomized design and are based on relatively small sample sizes. Addressing such issues, we administered a standard-based mathematics assessment to approximately 3,000 Grade 9 ELL students under five different...

10.1111/emip.12328 article EN Educational Measurement Issues and Practice 2020-04-12

Automatically extracting relations between chemicals and diseases plays an important role in biomedical text mining. Chemical-disease relation (CDR) extraction aims at extracting complex semantic relationships between entities in documents, which contain both intrasentence and intersentence relations. Most previous methods did not consider the dependency syntactic information across the sentences, which is very valuable for the task, in particular, for extracting intersentence relations accurately. In this paper, we propose a novel end-to-end neural network based on graph...

10.2196/17638 article EN cc-by JMIR Medical Informatics 2020-04-25

In this paper, we propose a novel approach to identifying user intents of search engine queries. Specifically, we recast it as a classification problem, in which four types of features are adopted. The features are based on deep linguistic analysis of queries as well as user feedbacks. We evaluate the method with real web query data. The results show that about 88% of test queries can be correctly identified by the framework via combining all four features.

10.1109/pcspa.2010.40 article EN 2010-09-01
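
The combining step described above can be sketched as a minimal score combiner (the feature names, intent classes, and scores below are invented for illustration; the paper's actual features and combination method are not reproduced here):

```python
# Toy combiner: four hypothetical feature types each score the candidate
# intent classes; the intent with the highest combined score wins.
INTENTS = ["navigational", "informational", "transactional"]

def classify(feature_scores):
    """feature_scores: dict mapping feature name -> {intent: score}."""
    combined = {intent: 0.0 for intent in INTENTS}
    for scores in feature_scores.values():
        for intent, s in scores.items():
            combined[intent] += s
    return max(combined, key=combined.get)

# Example: linguistic cues lean informational, but click feedback and
# URL-like tokens in the query lean navigational.
query_features = {
    "linguistic":   {"informational": 0.6, "navigational": 0.2, "transactional": 0.2},
    "feedback":     {"informational": 0.3, "navigational": 0.5, "transactional": 0.2},
    "url_tokens":   {"informational": 0.1, "navigational": 0.7, "transactional": 0.2},
    "query_length": {"informational": 0.4, "navigational": 0.3, "transactional": 0.3},
}
print(classify(query_features))  # -> navigational
```

The point of combining feature types is that no single signal decides: here the linguistic feature alone would say "informational", but the summed evidence tips the decision to "navigational".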

This paper introduces a new speech corpus called "LibriTTS" designed for text-to-speech use. It is derived from the original audio and text materials of the LibriSpeech corpus, which has been used for training and evaluating automatic speech recognition systems. The new corpus inherits desired properties of the LibriSpeech corpus while addressing a number of issues which make LibriSpeech less than ideal for text-to-speech work. The released corpus consists of 585 hours of speech data at a 24kHz sampling rate from 2,456 speakers and the corresponding texts. Experimental results show that neural end-to-end TTS models trained...

10.48550/arxiv.1904.02882 preprint EN other-oa arXiv (Cornell University) 2019-01-01

Acoustic feature similarity between search results has been shown to be very helpful for the task of spoken term detection (STD). A graph-based re-ranking approach for STD is proposed, based on the concept that search results which are acoustically similar to other results with higher confidence scores should have higher scores themselves. In this approach, all search results for a given query are considered as nodes in a graph, and the confidence scores can propagate through the graph. Since this approach can improve performance without any additional labelled data, it is especially suitable for languages with limited amounts...

10.21437/interspeech.2014-526 article EN Interspeech 2014 2014-09-14
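
The propagation idea can be sketched numerically (a minimal illustration with made-up similarities and an interpolation-style update; this is not the paper's exact formulation): detector confidences are repeatedly mixed with scores diffused over a row-normalized acoustic-similarity graph.

```python
import numpy as np

def rerank(scores, similarity, alpha=0.5, iters=50):
    """Propagate confidence scores over an acoustic-similarity graph.

    scores: initial detector confidences, shape (n,)
    similarity: symmetric nonnegative similarity matrix, shape (n, n)
    alpha: weight on propagated (neighbor) scores vs. the original ones
    """
    P = similarity / similarity.sum(axis=1, keepdims=True)  # row-normalize
    s = scores.astype(float)
    for _ in range(iters):
        s = (1 - alpha) * scores + alpha * P @ s  # mix in neighbor scores
    return s

# Three hypothesized detections: 0 and 1 are acoustically similar; 2 is isolated.
scores = np.array([0.9, 0.3, 0.3])
similarity = np.array([
    [0.0, 1.0, 0.1],
    [1.0, 0.0, 0.1],
    [0.1, 0.1, 0.0],
])
new = rerank(scores, similarity)
# Result 1, being similar to high-confidence result 0, rises above result 2
# even though both started with the same score.
print(new[1] > new[2])
```

This captures the core intuition: no new labels are needed, since the re-ranking uses only the detector's own scores plus pairwise acoustic similarity.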

Recent neural text-to-speech (TTS) models with fine-grained latent features enable precise control of the prosody of synthesized speech. Such models typically incorporate a fine-grained variational autoencoder (VAE) structure, extracting latent features at each input token (e.g., phonemes). However, generating samples with the standard VAE prior often results in unnatural and discontinuous speech, with dramatic prosodic variation between tokens. This paper proposes a sequential prior in a discrete latent space which can generate more naturally sounding samples....

10.48550/arxiv.2002.03788 preprint EN other-oa arXiv (Cornell University) 2020-01-01

Effective communication between humans often embeds both temporal and spatial context. While spatial context captures the geographic settings of objects in the environment, temporal context describes their changes over time. In this paper, we propose temporal spatial inverse semantics (TeSIS) to extend the inverse semantics approach to also consider temporal context for robots communicating with humans. Inverse semantics generates natural language requests while taking into account how well human listeners would interpret those requests given the current context. Compared to inverse semantics, our approach incorporates temporal context by...

10.1109/icra.2018.8460754 article EN 2018-05-01

End-to-end automatic speech recognition (ASR) models, including both attention-based models and the recurrent neural network transducer (RNN-T), have shown superior performance compared to conventional systems. However, previous studies have focused primarily on short utterances that typically last for just a few seconds or, at most, a few tens of seconds. Whether such architectures are practical on long utterances that last from minutes to hours remains an open question. In this paper, we investigate and improve the performance of end-to-end...

10.48550/arxiv.1911.02242 preprint EN other-oa arXiv (Cornell University) 2019-01-01

10.16511/j.cnki.qhdxxb.2018.25.016 article EN Journal of Tsinghua University(Science and Technology) 2018-03-15

In the pre-deep-learning era, part-of-speech (POS) tags were considered indispensable ingredients for feature engineering in dependency parsing, and quite a few works focused on joint tagging and parsing models to avoid error propagation. In contrast, recent studies suggest that POS tagging becomes much less important or even useless for neural parsing, especially when using character-based word representations. Yet there are not enough investigations focusing on this issue, both empirically and linguistically....

10.48550/arxiv.2003.03204 preprint EN other-oa arXiv (Cornell University) 2020-01-01