Quoc Truong

ORCID: 0000-0003-1472-1370
Research Areas
  • Speech Recognition and Synthesis
  • Natural Language Processing Techniques
  • Topic Modeling
  • Speech and Dialogue Systems
  • Speech and Audio Processing
  • Language, Metaphor, and Cognition
  • Music and Audio Processing
  • Emotion and Mood Recognition
  • Advanced Algorithms and Applications
  • Microplastics and Plastic Pollution
  • Translation Studies and Practices
  • Recycling and Waste Management Techniques
  • Diabetes Management and Education
  • Subtitles and Audiovisual Media
  • Health and Wellbeing Research

Bình Dương University
2025

Nara Institute of Science and Technology
2014-2019

In recent years, studies on automatic speech recognition (ASR) have shown outstanding results that reach human parity on short segments. However, there are still difficulties in standardizing the output of ASR, such as capitalization and punctuation restoration for long-speech transcription. These problems obstruct readers from understanding transcripts semantically and also degrade natural language processing models for NER, POS tagging, and semantic parsing. In this paper, we propose a restoration method based on a Transformer with chunk merging...
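The chunk-and-merge idea described in the abstract can be pictured as splitting a long token stream into overlapping chunks, restoring each chunk with the model, then stitching the outputs back together. This is a minimal sketch of the split/merge bookkeeping only; the function names and the policy of trusting the earlier chunk in each overlap region are assumptions, not the paper's exact algorithm.

```python
def split_chunks(tokens, size=8, overlap=2):
    """Split a long token stream into overlapping fixed-size chunks."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

def merge_chunks(chunks, overlap=2):
    """Merge per-chunk outputs, keeping the earlier chunk's tokens
    in each overlap region."""
    merged = list(chunks[0])
    for chunk in chunks[1:]:
        merged.extend(chunk[overlap:])
    return merged

tokens = "hello how are you doing today my friend".split()
chunks = split_chunks(tokens, size=4, overlap=2)
assert merge_chunks(chunks, overlap=2) == tokens
```

In a real pipeline each chunk would be passed through the restoration model before merging; the overlap gives the model context at chunk boundaries.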

10.1109/o-cocosda46868.2019.9041202 article EN 2019-10-01

Objectives: This study evaluated the quality of life and associated factors among patients with type 2 diabetes mellitus at My Phuoc Hospital in 2024. Subjects and methods: A cross-sectional design was conducted, and the Vietnamese version of the Asian Diabetes Quality of Life questionnaire (AsianDQOL) was used for data collection from 151 participants. Results: The mean AsianDQOL score of respondents was 55.2 (SD 15.4). The majority had a moderate quality of life (60.9%). At the same time, 6.6% had a good quality of life, while the rate of participants with a poor quality of life was 32.5%. The average...

10.51298/vmj.v550i1.13869 article EN Tạp chí Y học Việt Nam 2025-04-29

Speech-to-speech translation (S2ST) is a technology that translates speech across languages and can remove barriers in cross-lingual communication. In conventional S2ST systems, the linguistic meaning of speech is translated, but paralinguistic information conveying other features, such as emotion or emphasis, is ignored. In this paper, we propose a method to translate paralinguistic information, specifically focusing on emphasis. The method consists of a series of components that accurately translate emphasis using all acoustic aspects of speech. First,...
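One way to picture an emphasis-translation component like the one described above is as word-level emphasis weights carried from source to target words through a word alignment. This is an illustrative sketch under that assumption; the function name, the (src_idx, tgt_idx) alignment format, and the max-pooling choice are hypothetical, not the paper's exact formulation.

```python
def transfer_emphasis(src_weights, alignment, tgt_len):
    """Map word-level emphasis weights from source to target words
    through a word alignment given as (src_idx, tgt_idx) pairs.
    When several source words align to one target word, keep the max."""
    tgt_weights = [0.0] * tgt_len
    for s, t in alignment:
        tgt_weights[t] = max(tgt_weights[t], src_weights[s])
    return tgt_weights

# Emphasis on source word 1 follows its aligned target word (index 2)
# after reordering during translation.
weights = transfer_emphasis([0.1, 0.9, 0.2], [(0, 0), (1, 2), (2, 1)], 3)
assert weights == [0.1, 0.2, 0.9]
```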

10.1109/taslp.2016.2643280 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2016-12-21

Language documentation begins by gathering speech. Manual or automatic transcription at the word level is typically not possible because of the absence of an orthography or prior lexicon, and though manual phonemic transcription is possible, it is prohibitively slow. On the other hand, translations of the minority language into a major language are more easily acquired. We propose a method to harness such translations to improve phoneme recognition. The method assumes no prior lexicon or translation model, instead learning them from phoneme lattices over the speech being transcribed. Experiments...

10.18653/v1/d16-1263 article EN cc-by Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing 2016-01-01

Speech-to-speech translation (S2ST) systems are capable of breaking language barriers in cross-lingual communication by translating speech across languages. Recent studies have introduced many improvements that allow existing S2ST systems to handle not only linguistic meaning but also paralinguistic information such as emphasis, by proposing additional estimation and translation components. However, the approach previously used for these components is not optimal for sequence tasks and easily fails to capture long-term dependencies between words and emphasis levels. It also requires...

10.1109/taslp.2018.2846402 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2018-06-11

In speech, emphasis is an important type of paralinguistic information that helps convey the focus of an utterance, new information, and emotion. If emphasis can be incorporated into a speech-to-speech (S2S) translation system, it will be possible to convey this information across the language barrier. However, previous related work focuses only on particular prosodic features, such as F0, or works with extremely small vocabularies, such as 10 digits. In this paper, we describe an S2S translation method able to translate emphasis between languages considering multiple features...

10.21437/interspeech.2015-727 article EN Interspeech 2015 2015-09-06

As the number of Japanese-English bilingual speakers continues to increase, code-switching phenomena also occur more frequently. The units and locations of switches may vary widely, from a single word to whole phrases (beyond the length of loanword units). Therefore, speech recognition systems must be developed that can handle not only Japanese or English but also code-switching. Consequently, a large-scale database is required for model training. But collecting natural conversation dialogue data in both...
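The switch units and locations the abstract mentions can be made concrete with a toy detector: assign each token a language by script, then report the indices where the language changes. This is only an illustrative sketch; real code-switching databases annotate switches manually, and script-based language ID fails on romanized Japanese.

```python
def token_language(token):
    """Crude per-token language ID: 'ja' if the token contains
    kana (U+3040-U+30FF) or CJK ideographs (U+4E00-U+9FFF)."""
    for ch in token:
        if '\u3040' <= ch <= '\u30ff' or '\u4e00' <= ch <= '\u9fff':
            return 'ja'
    return 'en'

def switch_points(tokens):
    """Indices where the language changes between adjacent tokens."""
    langs = [token_language(t) for t in tokens]
    return [i for i in range(1, len(tokens)) if langs[i] != langs[i - 1]]

# Switch into Japanese at index 2 and back to English at index 3.
assert switch_points(['I', 'ate', '寿司', 'today']) == [2, 3]
```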

10.1109/icsda.2018.8693044 article EN 2018-05-01

The Multi-Genre Broadcast challenge is an official challenge of the IEEE Automatic Speech Recognition and Understanding Workshop. This paper presents NAIST's contribution to the premiere of this challenge. The presented speech-to-text system for English makes use of various front-ends (e.g., MFCC, i-vector, FBANK), DNN acoustic models, and several language models for decoding and rescoring (N-gram, RNNLM). Subsets of the training data with varying sizes were evaluated with respect to overall quality. Two speech segmentation systems were developed...

10.1109/asru.2015.7404858 article EN 2015-12-01

Automatic Speech Recognition (ASR) systems convert human speech into the corresponding transcription automatically. They have a wide range of applications, such as controlling robots, call center analytics, and voice chatbots. Recent studies on ASR for English have achieved performance that surpasses human ability. The models were trained on large amounts of training data and performed well under many environments. With regard to Vietnamese, there have been studies improving existing systems; however, most of them were conducted on a small scale...

10.15625/1813-9663/34/4/13165 article EN Journal of Computer Science and Cybernetics 2019-01-30

Since paralinguistic aspects must be considered to understand speech, we construct a deep learning framework that utilizes multi-modal features to simultaneously recognize both speakers and emotions. There are three kinds of feature modalities: acoustic, lexical, and facial. To fuse the features from multiple modalities, we experimented with three methods: majority voting, concatenation, and hierarchical fusion. The recognition was done on a TV-series dataset that simulates actual conversations.
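Two of the fusion strategies named above can be sketched in a few lines: majority voting fuses the per-modality *decisions* (late fusion), while concatenation fuses the per-modality *feature vectors* before classification (early fusion). This is a minimal illustration of the two schemes, not the paper's network architecture.

```python
from collections import Counter

def majority_vote(labels):
    """Late fusion: each modality votes with its own predicted label;
    the most common label wins."""
    return Counter(labels).most_common(1)[0][0]

def concat_fusion(acoustic, lexical, facial):
    """Early fusion: concatenate per-modality feature vectors into
    one joint vector fed to a single classifier."""
    return acoustic + lexical + facial

assert majority_vote(['happy', 'happy', 'sad']) == 'happy'
assert concat_fusion([0.1, 0.2], [1.0], [0.5, 0.5]) == [0.1, 0.2, 1.0, 0.5, 0.5]
```

Hierarchical fusion, the third method, would combine modalities in stages (e.g., acoustic+lexical first, then facial) rather than all at once.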

10.1109/icsda.2018.8693020 article EN 2018-05-01

This paper presents an approach using a Multi-Space Distribution Hidden Markov Model (MSD-HMM) for Vietnamese speech recognition. An MSD-HMM prototype with four independent streams is proposed for modeling phonemes with embedded tonal information corresponding to their syllables. These phonemes are built by adding a tone symbol to each phoneme of the syllables, based on the International Phonetic Alphabet (IPA). This improves accuracy by 2.49% compared with the baseline system. A feature extraction process suitable for this model is also described. The result shows...
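The tone-embedding step described above, attaching the syllable's tone symbol to each of its phonemes, can be sketched as a trivial relabeling that turns a toneless phoneme set into a tone-dependent one. The `_<tone>` suffix notation here is an assumption for illustration; the paper's actual tone-symbol scheme may differ.

```python
def tonalize(phonemes, tone):
    """Attach a syllable's tone id to each of its phonemes, producing
    tone-dependent phoneme labels for acoustic modeling."""
    return [f"{p}_{tone}" for p in phonemes]

# A hypothetical Vietnamese syllable with tone id 2: every phoneme
# in the syllable carries the same tone marker.
assert tonalize(["m", "a"], 2) == ["m_2", "a_2"]
```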

10.15625/1813-9663/30/1/3553 article EN Journal of Computer Science and Cybernetics 2014-04-16

This paper proposes a method to train Weighted Finite State Transducer (WFST) based structural classifiers using deep neural network (DNN) acoustic features and recurrent neural network (RNN) language models for speech recognition. Structural classification is an effective approach to achieve highly accurate recognition of structured data, in which the classifier is optimized to maximize discriminative performance over different kinds of features. A WFST-based classifier, which can integrate acoustic, pronunciation, and embedded language models, is composed...

10.1109/icassp.2015.7178914 article EN 2015-04-01

Emphasis is an important factor of human speech that helps convey emotion and the focused information of utterances. Recently, studies have been conducted on speech-to-speech translation to preserve emphasis from the source language to the target language. However, since different cultures have various ways of expressing emphasis, just considering acoustic-to-acoustic feature mapping may not always reflect the experiences of users. On the other hand, emphasis can be expressed at different levels in both text and speech. Thus, it remains unclear how we...

10.1109/slt.2018.8639641 article EN 2018 IEEE Spoken Language Technology Workshop (SLT) 2018-12-01

This paper presents a high-quality Vietnamese speech corpus that can be used for analyzing Vietnamese speech characteristics as well as for building speech synthesis models. The corpus consists of 5400 clean-speech utterances spoken by 12 speakers, including 6 males and 6 females. It is designed with phonetic balance in mind so that it suits synthesis and, especially, adaptation approaches. Specifically, all speakers utter a common dataset containing 250 sentences. To increase the variety of context, each speaker also utters another 200 non-shared, phonetically balanced...

10.48550/arxiv.1904.05569 preprint EN other-oa arXiv (Cornell University) 2019-01-01

Speech-to-speech (S2S) translation [10] is gradually starting to break down the language barrier, bringing opportunities for people to understand each other while using different languages. However, one of the limitations of current S2S systems is that they usually do not translate paralinguistic information included in the input speech. Among the various types of paralinguistic information, we focus on emphasis, a type used to convey the focus of a sentence or the emotion of the speaker, and highly useful in communication. This paper describes the collection of an...

10.1109/icsda.2014.7051424 article EN 2014-09-01

Inverse text normalization (ITN) is the task that transforms text in spoken form into written form. While automatic speech recognition (ASR) produces spoken-form output, humans and natural language understanding systems prefer to consume written form. ITN generally deals with semiotic phrases (e.g., numbers, dates, times). However, there is a lack of studies dealing with phonetized phrases, which appear in ASR's output when it handles unseen data (foreign-named entities, domain names), although these exist in the same form in text. The reason is that they are infinite...
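The semiotic-phrase side of ITN can be illustrated with the simplest possible case: rewriting runs of spoken digit words into a written-form digit string. This toy sketch covers only English cardinal digits; production ITN systems typically use grammars or neural taggers over many semiotic classes, and the function name here is an assumption.

```python
WORD_TO_DIGIT = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

def itn_digits(spoken):
    """Rewrite each maximal run of spoken digit words into a single
    written-form digit string, leaving other tokens untouched."""
    out, run = [], []
    for tok in spoken.split():
        if tok in WORD_TO_DIGIT:
            run.append(WORD_TO_DIGIT[tok])
        else:
            if run:
                out.append("".join(run))
                run = []
            out.append(tok)
    if run:
        out.append("".join(run))
    return " ".join(out)

assert itn_digits("call nine one one now") == "call 911 now"
```

The phonetized phrases the abstract highlights (foreign names, domain names spelled out by the ASR) are exactly what such closed-vocabulary rules cannot cover, which motivates the paper.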

10.1109/icassp49357.2023.10094599 article EN ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023-05-05