Dat Quoc Nguyen

ORCID: 0000-0001-8214-2878
Research Areas
  • Topic Modeling
  • Natural Language Processing Techniques
  • Biomedical Text Mining and Ontologies
  • Advanced Graph Neural Networks
  • Text Readability and Simplification
  • Semantic Web and Ontologies
  • Speech Recognition and Synthesis
  • Speech and dialogue systems
  • Recommender Systems and Techniques
  • Sentiment Analysis and Opinion Mining
  • Advanced Text Analysis Techniques
  • Multimodal Machine Learning Applications
  • Data Quality and Management
  • Text and Document Classification Technologies
  • Handwritten Text Recognition Techniques
  • Computational Drug Discovery Methods
  • Web Data Mining and Analysis
  • Computational and Text Analysis Methods
  • CAR-T cell therapy research
  • Complex Network Analysis Techniques
  • Bayesian Modeling and Causal Inference
  • Plant and Fungal Species Descriptions
  • Domain Adaptation and Few-Shot Learning
  • Immune Cell Function and Interaction
  • Machine Learning in Materials Science

The University of Melbourne
2017-2024

VinUniversity
2020-2024

Peter MacCallum Cancer Centre
2023-2024

Vietnam Academy of Science and Technology
2015-2022

Ho Chi Minh City University of Technology
2022

ORCID
2020

Machine Intelligence Research Institute
2020

Macquarie University
2015-2018

Vietnam National University, Hanoi
2009-2014

Vietnam National University Ho Chi Minh City
2013-2014

Dai Quoc Nguyen, Tu Dinh Nguyen, Dat Quoc Nguyen, Dinh Phung. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). 2018.

10.18653/v1/n18-2053 preprint EN cc-by 2018-01-01

Probabilistic topic models are widely used to discover latent topics in document collections, while latent feature vector representations of words have been used to obtain high performance in many NLP tasks. In this paper, we extend two different Dirichlet multinomial topic models by incorporating latent feature vector representations of words trained on very large corpora to improve the word-topic mapping learnt on a smaller corpus. Experimental results show that by using information from the external corpora, our new models produce significant improvements on topic coherence, document clustering and...

10.1162/tacl_a_00140 article EN cc-by Transactions of the Association for Computational Linguistics 2015-12-01

We present PhoBERT with two versions, PhoBERT-base and PhoBERT-large, the first public large-scale monolingual language models pre-trained for Vietnamese. Experimental results show that PhoBERT consistently outperforms the recent best pre-trained multilingual model XLM-R (Conneau et al., 2020) and improves the state-of-the-art in multiple Vietnamese-specific NLP tasks including Part-of-speech tagging, Dependency parsing, Named-entity recognition and Natural language inference. We release PhoBERT to facilitate future research and downstream...

10.18653/v1/2020.findings-emnlp.92 article EN cc-by 2020-01-01

Dai Quoc Nguyen, Thanh Vu, Tu Dinh Nguyen, Dat Quoc Nguyen, Dinh Phung. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.

10.18653/v1/n19-1226 article EN 2019-01-01

Dat Quoc Nguyen, Kairit Sirts, Lizhen Qu, Mark Johnson. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2016.

10.18653/v1/n16-1054 preprint EN cc-by Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 2016-01-01

In public health surveillance, measuring how information enters and spreads through online communities may help us understand geographical variation in decision making associated with poor outcomes. Our aim was to evaluate the use of community structure and topic modeling methods as a process for characterizing the clustering of opinions about human papillomavirus (HPV) vaccines on Twitter. The study examined Twitter posts (tweets) collected between October 2013 and October 2015 about HPV vaccines. We tested Latent...

10.2196/jmir.6045 article EN cc-by Journal of Medical Internet Research 2016-08-29

Chimeric antigen receptor (CAR) T cell therapy has transformed the treatment of haematological malignancies such as acute lymphoblastic leukaemia, B cell lymphoma and multiple myeloma, but the efficacy of CAR T cell therapy in solid tumours has been limited. This is owing to a number of factors, including the immunosuppressive tumour microenvironment that gives rise to poorly persisting and metabolically dysfunctional T cells. Analysis of anti-CD19 CAR T cells used clinically has shown that positive outcomes are associated with more...

10.1038/s41586-024-07242-1 article EN cc-by Nature 2024-04-10

Thanh Vu, Dat Quoc Nguyen, Dai Quoc Nguyen, Mark Dras, Mark Johnson. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations. 2018.

10.18653/v1/n18-5012 preprint EN cc-by 2018-01-01

In this paper, we provide an overview of the WNUT-2020 shared task on the identification of informative COVID-19 English Tweets. We describe how we construct a corpus of 10K Tweets and organize the development and evaluation phases for this task. In addition, we also present a brief summary of results obtained from the final system submissions of 55 teams, finding that (i) many systems obtain very high performance, up to 0.91 F1 score, (ii) the majority of the submissions achieve substantially higher results than the baseline fastText (Joulin et al., 2017), and (iii)...

10.18653/v1/2020.wnut-1.41 article EN cc-by 2020-01-01

This paper describes our robust, easy-to-use and language-independent toolkit, RDRPOSTagger, which employs an error-driven approach to automatically construct a Single Classification Ripple Down Rules tree of transformation rules for the POS tagging task. During the demonstration session, we will run the tagger on data sets in 15 different languages.

10.3115/v1/e14-2005 article EN cc-by 2014-01-01

We investigate the incorporation of character-based word representations into a standard CNN-based relation extraction model. We experiment with two common neural architectures, CNN and LSTM, to learn word vector representations from character embeddings. Through a task on the BioCreative-V CDR corpus, extracting relationships between chemicals and diseases, we show that models exploiting the character-based word representations improve on models that do not use this information, obtaining a state-of-the-art result relative to previous neural approaches.

10.18653/v1/w18-2314 article EN cc-by 2018-01-01

ICD coding is a process of assigning the International Classification of Disease diagnosis codes to clinical/medical notes documented by health professionals (e.g. clinicians). This process requires significant human resources, and thus is costly and prone to error. To handle the problem, machine learning has been utilized for automatic ICD coding. Previous state-of-the-art models were based on convolutional neural networks, using a single/several fixed window sizes. However, the lengths and interdependence between text fragments...

10.24963/ijcai.2020/461 preprint EN 2020-07-01

In this paper, we propose a new approach to construct a system of transformation rules for the Part-of-Speech (POS) tagging task. Our approach is based on an incremental knowledge acquisition method where rules are stored in an exception structure and new rules are only added to correct the errors of existing rules; thus allowing systematic control of the interaction between the rules. Experimental results on 13 languages show that our approach is fast in terms of training time and tagging speed. Furthermore, our approach obtains very competitive accuracy in comparison to state-of-the-art POS...

10.3233/aic-150698 article EN AI Communications 2016-04-26

Semantic parsing is an important NLP task. However, Vietnamese is a low-resource language in this research area. In this paper, we present the first public large-scale Text-to-SQL semantic parsing dataset for Vietnamese. We extend and evaluate two strong semantic parsing baselines EditSQL (Zhang et al., 2019) and IRNet (Guo et al., 2019) on our dataset. We compare the two baselines with key configurations and find that: automatic Vietnamese word segmentation improves the parsing results of both baselines; the normalized pointwise mutual information (NPMI) score (Bouma, 2009) is useful for schema...

10.18653/v1/2020.findings-emnlp.364 article EN 2020-01-01
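
The NPMI score mentioned in the abstract above can be illustrated with a minimal sketch of its standard definition (Bouma, 2009): pointwise mutual information divided by the negative log of the joint probability, giving a value in [-1, 1]. The function name `npmi` and the count-based inputs are hypothetical, chosen only for illustration; how the paper applies the score to schema linking is not shown in the truncated abstract and is not reproduced here.

```python
import math


def npmi(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Normalized pointwise mutual information from co-occurrence counts.

    Hypothetical helper: counts are assumed to come from `total` observations
    (for example, sentences or question/column pairs).
    """
    if total <= 0 or count_x <= 0 or count_y <= 0:
        raise ValueError("total, count_x and count_y must be positive")
    if count_xy == 0:
        return -1.0  # the two items never co-occur: lower bound of NPMI
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    if p_xy == 1.0:
        return 1.0  # both items occur in every observation; avoid dividing by log(1) = 0
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)
```

A score near 1 means the two items almost always occur together, a score near 0 means they co-occur about as often as chance would predict, and -1 means they never co-occur.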

This paper presents a framework for automatically constructing timeline summaries from collections of web news articles. We also evaluate our solution against manually created timelines and in comparison with related work.

10.1145/2487788.2487829 article EN 2013-05-13

Knowledge bases are useful resources for many natural language processing tasks; however, they are far from complete. In this paper, we define a novel entity representation as a mixture of its neighborhood in the knowledge base and apply this technique on TransE, a well-known embedding model for knowledge base completion. Experimental results show that the neighborhood information significantly helps to improve the results of TransE, leading to better performance than obtained by other state-of-the-art embedding models on three benchmark datasets for triple classification,...

10.18653/v1/k16-1005 preprint EN cc-by 2016-01-01

We compare the use of LSTM-based and CNN-based character-level word embeddings in BiLSTM-CRF models to approach chemical and disease named entity recognition (NER) tasks. Empirical results over the BioCreative V CDR corpus show that the use of either type of character-level word embeddings in conjunction with the BiLSTM-CRF models leads to comparable state-of-the-art performance. However, the models using CNN-based character-level word embeddings have a computational performance advantage, increasing training time over word-based models by 25% while the LSTM-based character-level word embeddings more than double the required training time.

10.18653/v1/w18-5605 article EN cc-by 2018-01-01

In this paper, we propose a novel embedding model, named ConvKB, for knowledge base completion. Our model ConvKB advances state-of-the-art models by employing a convolutional neural network, so that it can capture global relationships and transitional characteristics between entities and relations in knowledge bases. In ConvKB, each triple (head entity, relation, tail entity) is represented as a 3-column matrix where each column vector represents a triple element. This 3-column matrix is then fed to a convolution layer where multiple filters are operated on...

10.3233/sw-180318 article EN Semantic Web 2018-08-24
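
The scoring idea described in the ConvKB abstract above can be sketched roughly as follows. This is a minimal, untrained illustration and not the authors' implementation: it assumes 1×3 filters slid over the rows of the 3-column matrix, a ReLU activation, no bias terms, and a final dot product with a weight vector. The function name `convkb_score` and all parameter names are hypothetical.

```python
from typing import List, Sequence, Tuple


def convkb_score(
    head: Sequence[float],
    relation: Sequence[float],
    tail: Sequence[float],
    filters: Sequence[Tuple[float, float, float]],
    weights: Sequence[float],
) -> float:
    """Score a triple by convolving 1x3 filters over a k x 3 embedding matrix.

    Assumptions (not taken from the truncated abstract): ReLU activation,
    no bias, and one weight per entry of the concatenated feature maps.
    """
    k = len(head)
    if len(relation) != k or len(tail) != k:
        raise ValueError("head, relation and tail must share one dimension")
    if len(weights) != k * len(filters):
        raise ValueError("weights must have k * number_of_filters entries")

    features: List[float] = []
    for w_h, w_r, w_t in filters:
        # Each filter reads one row (h_i, r_i, t_i) of the 3-column matrix at a time.
        for i in range(k):
            value = w_h * head[i] + w_r * relation[i] + w_t * tail[i]
            features.append(max(value, 0.0))  # ReLU

    # Concatenated feature maps are reduced to a single score by a dot product.
    return sum(f * w for f, w in zip(features, weights))
```

With a filter such as `(1, 1, -1)`, each feature is zero whenever `h_i + r_i - t_i` is not positive, which hints at how a 1×3 filter can pick up translation-style patterns between the head, relation and tail embeddings.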

Chemical patents are an important resource for chemical information. However, few chemical Named Entity Recognition (NER) systems have been evaluated on patent documents, due in part to their structural and linguistic complexity. In this paper, we explore the NER performance of a BiLSTM-CRF model utilising pre-trained word embeddings, character-level word representations and contextualized ELMo word representations for chemical patents. We compare word embeddings pre-trained on biomedical and chemical patent corpora. The effect of tokenizers optimized for the chemical domain on NER performance in chemical patents is also explored. The results...

10.18653/v1/w19-5035 article EN cc-by 2019-01-01

The current COVID-19 pandemic has led to the creation of many corpora that facilitate NLP research and downstream applications to help fight the pandemic. However, most of these corpora are exclusively for English. As the pandemic is a global problem, it is worth creating COVID-19 related datasets for languages other than English. In this paper, we present the first manually-annotated COVID-19 domain-specific dataset for Vietnamese. Particularly, our dataset is annotated for the named entity recognition (NER) task with newly-defined entity types that can be used in other future epidemics. Our dataset also...

10.18653/v1/2021.naacl-main.173 article EN cc-by Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 2021-01-01

Intent detection and slot filling are important tasks in spoken and natural language understanding. However, Vietnamese is a low-resource language in these research topics. In this paper, we present the first public intent detection and slot filling dataset for Vietnamese. In addition, we also propose a joint model for intent detection and slot filling, that extends the recent state-of-the-art JointBERT+CRF model [1] with an intent-slot attention layer to explicitly incorporate intent context information into slot filling via "soft" intent label embedding. Experimental results on our Vietnamese dataset show that our proposed...

10.21437/interspeech.2021-618 article EN Interspeech 2021 2021-08-27