- Topic Modeling
- Natural Language Processing Techniques
- Immune cells in cancer
- Computational and Text Analysis Methods
- Biomedical Text Mining and Ontologies
- Single-cell and spatial transcriptomics
- Advanced Graph Neural Networks
- Metaheuristic Optimization Algorithms Research
- Speech and dialogue systems
- Neuroinflammation and Neurodegeneration Mechanisms
- Energy Efficient Wireless Sensor Networks
- Multimodal Machine Learning Applications
- Domain Adaptation and Few-Shot Learning
- Software Engineering Research
- Anomaly Detection Techniques and Applications
- Advanced Text Analysis Techniques
- Philippine History and Culture
- Software Testing and Debugging Techniques
- Immune responses and vaccinations
Vietnam National University Ho Chi Minh City
2024
National Cancer Institute
2021-2022
Viet Tri University of Industry
2022
Case Western Reserve University
2021-2022
As language models grow ever larger, the need for large-scale, high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration...
In this report, we introduce SciFive, a domain-specific T5 model that has been pre-trained on large biomedical corpora. Our model outperforms the current SOTA methods (i.e. BERT, BioBERT, Base T5) on tasks in named entity recognition, relation extraction, natural language inference, and question answering. We show that text-generation methods have significant potential in a broad array of NLP tasks, particularly those requiring longer, more complex outputs. Our results support the exploration of more difficult text generation...
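To illustrate the text-to-text framing such T5-style models rely on, the sketch below casts two biomedical tasks as plain (source, target) string pairs. The task prefixes and label strings here are hypothetical examples, not SciFive's actual prompts.

```python
# Sketch: casting biomedical NLP tasks as text-to-text pairs, in the T5 style.
# Task prefixes and labels are illustrative assumptions, not SciFive's prompts.

def to_text2text(task: str, text: str, target: str) -> dict:
    """Build a (source, target) pair where both sides are plain text."""
    return {"source": f"{task}: {text}", "target": target}

# Natural language inference becomes generation of a label word.
nli = to_text2text(
    "mednli",
    "premise: The patient denies chest pain. hypothesis: The patient has chest pain.",
    "contradiction",
)

# Relation extraction becomes generation of a relation name.
rel = to_text2text(
    "chemprot",
    "<e1>aspirin</e1> inhibits <e2>COX-1</e2>",
    "inhibition",  # example relation label, for illustration only
)

print(nli["source"])
print(rel["target"])
```

Framing every task as generation is what lets one encoder-decoder model handle classification, extraction, and long-form outputs with a single training objective.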
We present CoTexT, a pre-trained, transformer-based encoder-decoder model that learns the representative context between natural language (NL) and programming language (PL). Using self-supervision, CoTexT is pre-trained on large programming-language corpora to learn a general understanding of code. CoTexT supports downstream NL-PL tasks such as code summarization/documentation, code generation, defect detection, and code debugging. We train CoTexT on different combinations of the available PL corpora, including both “bimodal” and “unimodal” data. Here, bimodal data...
Data clustering plays a significant role in the biomedical sciences, particularly in single-cell data analysis. Researchers use clustering algorithms to group individual cells into populations that can be evaluated across different levels of disease progression, drug response, and other clinical statuses. In many cases, multiple sets of clusters must be generated to assess varying levels of cluster specificity. For example, there are multiple subtypes of leukocytes (e.g. T cells), whose preponderance and phenotype are assessed for...
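A minimal sketch of the "multiple cluster sets at varying specificity" idea, using scikit-learn's KMeans on synthetic 2-D data; this is a generic illustration of coarse versus fine population calls, not the authors' method, and the data are made up.

```python
# Sketch: one clustering per desired specificity (coarse k=2 vs. fine k=6).
# Synthetic data and generic KMeans only -- not the authors' algorithm.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic "cells": 300 points around 6 fine-grained centroids that fall
# into 2 well-separated coarse populations (e.g. T cells vs. B cells).
centers = np.array([[0, 0], [0, 2], [2, 0], [10, 10], [10, 12], [12, 10]])
cells = np.vstack([c + rng.normal(scale=0.3, size=(50, 2)) for c in centers])

# Generate one set of cluster labels per granularity.
cluster_sets = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(cells)
    for k in (2, 6)
}

for k, labels in cluster_sets.items():
    print(k, np.bincount(labels))
```

Keeping both label sets side by side is what allows the same cells to be read as a broad lineage at one resolution and as subtypes at another.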
Text summarization is a challenging task within natural language processing that involves text generation from lengthy input sequences. While this task has been widely studied in English, there is very limited research on summarization for Vietnamese text. In this paper, we investigate the robustness of transformer-based encoder-decoder architectures for Vietnamese abstractive summarization. Leveraging transfer learning and self-supervised learning, we validate the performance of these methods on two Vietnamese datasets.
This paper proposes several transformer-based approaches for Reliable Intelligence Identification on Vietnamese social network sites at the VLSP 2020 evaluation campaign. We exploit both monolingual and multilingual pre-trained models. In addition, we utilize an ensemble method to improve the robustness of the different approaches. Our team achieved a score of 0.9378 on the ROC-AUC metric on the private test set, which is competitive with other participants.
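One common way an ensemble stabilizes predictions across approaches is simple probability averaging; the sketch below shows this with made-up scores from three hypothetical models (the paper does not specify its exact ensembling scheme, so this is an assumption).

```python
# Sketch: ensembling by averaging per-model probabilities.
# The scores are fabricated outputs from hypothetical models, for illustration.
import numpy as np

# Predicted probabilities of the "unreliable" class from three models
# (e.g. monolingual, multilingual, and a fine-tuned variant), for 4 posts.
preds = np.array([
    [0.90, 0.20, 0.55, 0.70],
    [0.80, 0.10, 0.60, 0.65],
    [0.85, 0.30, 0.40, 0.75],
])

ensemble = preds.mean(axis=0)          # average probability per post
labels = (ensemble >= 0.5).astype(int)  # threshold for a hard decision
print(ensemble, labels)
```

Averaging tends to cancel out the idiosyncratic errors of any single model, which is why ensembles often improve threshold-free metrics such as ROC-AUC.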
Biomedical data and benchmarks are highly valuable yet very limited in low-resource languages other than English, such as Vietnamese. In this paper, we make use of a state-of-the-art English-Vietnamese translation model to produce both pretrained and supervised data in the biomedical domains. Thanks to such large-scale translation, we introduce ViPubmedT5, a pretrained Encoder-Decoder Transformer model trained on 20 million translated abstracts from the high-quality public PubMed corpus. ViPubmedT5 demonstrates...