Long Phan

ORCID: 0000-0003-0980-7057
Research Areas
  • Topic Modeling
  • Natural Language Processing Techniques
  • Immune cells in cancer
  • Computational and Text Analysis Methods
  • Biomedical Text Mining and Ontologies
  • Single-cell and spatial transcriptomics
  • Advanced Graph Neural Networks
  • Metaheuristic Optimization Algorithms Research
  • Speech and dialogue systems
  • Neuroinflammation and Neurodegeneration Mechanisms
  • Energy Efficient Wireless Sensor Networks
  • Multimodal Machine Learning Applications
  • Domain Adaptation and Few-Shot Learning
  • Software Engineering Research
  • Anomaly Detection Techniques and Applications
  • Advanced Text Analysis Techniques
  • Philippine History and Culture
  • Software Testing and Debugging Techniques
  • Immune responses and vaccinations

Vietnam National University Ho Chi Minh City
2024

National Cancer Institute
2021-2022

Viet Tri University of Industry
2022

Case Western Reserve University
2021-2022

As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration...

10.48550/arxiv.2303.03915 preprint EN cc-by arXiv (Cornell University) 2023-01-01

In this report, we introduce SciFive, a domain-specific T5 model that has been pre-trained on large biomedical corpora. Our model outperforms the current SOTA methods (i.e. BERT, BioBERT, Base T5) on tasks in named entity recognition, relation extraction, natural language inference, and question-answering. We show that text-generation methods have significant potential in a broad array of biomedical NLP tasks, particularly those requiring longer, more complex outputs. Our results support further exploration of difficult text generation...

10.48550/arxiv.2106.03598 preprint EN cc-by arXiv (Cornell University) 2021-01-01
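SciFive follows the T5 recipe of casting every task as text-to-text generation. A minimal sketch of that framing is below; the task prefixes are illustrative assumptions, not the exact strings expected by the released checkpoints.

```python
# Sketch of T5-style text-to-text task framing in the spirit of SciFive.
# Prefix strings here are hypothetical, not the model's actual prompts.

PREFIXES = {
    "ner": "ncbi_ner: ",    # named entity recognition
    "re": "chemprot_re: ",  # relation extraction
    "nli": "mednli: ",      # natural language inference
    "qa": "bioasq_qa: ",    # question answering
}

def to_text2text(task: str, text: str) -> str:
    """Cast one biomedical example into a single generation prompt."""
    return PREFIXES[task] + text

prompt = to_text2text("ner", "Mutations in BRCA1 increase breast cancer risk.")
```

Because inputs and outputs are both plain strings, one encoder-decoder model can serve all four tasks, including those with long, structured answers.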

We present CoTexT, a pre-trained, transformer-based encoder-decoder model that learns the representative context between natural language (NL) and programming language (PL). Using self-supervision, CoTexT is pre-trained on large corpora to learn a general understanding of code. It supports downstream NL-PL tasks such as code summarizing/documentation, code generation, defect detection, and code debugging. We train CoTexT on different combinations of the available PL corpora, including both “bimodal” and “unimodal” data. Here, bimodal data...

10.18653/v1/2021.nlp4prog-1.5 article EN cc-by 2021-01-01
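The bimodal/unimodal split can be sketched as follows: a function paired with a docstring yields NL↔PL training pairs in both directions, while bare code yields code-only examples. The prefix strings and pairing scheme below are hypothetical illustrations, not CoTexT's own.

```python
# Sketch of building "bimodal" vs "unimodal" pre-training examples in the
# spirit of CoTexT. Prefixes and pairing are illustrative assumptions.

def make_examples(code, docstring=None):
    """Yield (input, target) text-to-text pairs from one function."""
    if docstring:
        # Bimodal: paired NL and PL, usable in both directions.
        yield ("code_to_text: " + code, docstring)
        yield ("text_to_code: " + docstring, code)
    else:
        # Unimodal: code only (e.g. for denoising/span-corruption objectives).
        yield ("code_only: " + code, code)

pairs = list(make_examples("def add(a, b):\n    return a + b", "Add two numbers."))
```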

Data clustering plays a significant role in the biomedical sciences, particularly in single-cell data analysis. Researchers use clustering algorithms to group individual cells into populations that can be evaluated across different levels of disease progression, drug response, and other clinical statuses. In many cases, multiple sets of clusters must be generated to assess varying cluster specificity. For example, there are multiple subtypes of leukocytes (e.g. T cells), whose preponderance and phenotype are assessed for...

10.1371/journal.pcbi.1010349 article EN public-domain PLoS Computational Biology 2022-10-03
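The multi-resolution idea, one clustering per desired level of specificity, can be sketched with scikit-learn's KMeans standing in for the graph-based methods typical of single-cell pipelines; the data and cluster counts below are synthetic stand-ins.

```python
# Sketch: multiple cluster sets at different resolutions, as in single-cell
# workflows where broad populations (e.g. T cells) are later subdivided
# into subtypes. KMeans is a stand-in for graph-based clustering methods.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
cells = rng.normal(size=(300, 10))  # 300 cells x 10 expression features (synthetic)

# One clustering per resolution: a coarse k for major lineages,
# finer k values for subtypes within them.
cluster_sets = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(cells)
    for k in (3, 8, 15)
}
```

Keeping every resolution (rather than a single "best" k) lets downstream analyses pick the granularity that matches the biological question.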

Text summarization is a challenging task within natural language processing that involves text generation from lengthy input sequences. While this task has been widely studied in English, there is very limited research on summarization for Vietnamese text. In this paper, we investigate the robustness of transformer-based encoder-decoder architectures for abstractive summarization. Leveraging transfer learning and self-supervised learning, we validate the performance of our methods on two datasets.

10.48550/arxiv.2110.04257 preprint EN cc-by arXiv (Cornell University) 2021-01-01

We present CoTexT, a pre-trained, transformer-based encoder-decoder model that learns the representative context between natural language (NL) and programming language (PL). Using self-supervision, CoTexT is pre-trained on large corpora to learn a general understanding of code. It supports downstream NL-PL tasks such as code summarizing/documentation, code generation, defect detection, and code debugging. We train CoTexT on different combinations of the available PL corpora, including both "bimodal" and "unimodal" data. Here, bimodal data...

10.48550/arxiv.2105.08645 preprint EN cc-by arXiv (Cornell University) 2021-01-01

This paper proposes several transformer-based approaches for Reliable Intelligence Identification on Vietnamese social network sites at the VLSP 2020 evaluation campaign. We exploit both monolingual and multilingual pre-trained models. Besides, we utilize the ensemble method to improve the robustness of the different approaches. Our team achieved a score of 0.9378 under the ROC-AUC metric on the private test set, which is competitive with other participants.

10.48550/arxiv.2012.07557 preprint EN cc-by arXiv (Cornell University) 2020-01-01
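The ensembling step amounts to averaging the predicted probabilities of the individual models and scoring the result with ROC-AUC. A self-contained sketch, with toy scores standing in for the transformer outputs and ROC-AUC computed from the rank-sum (Mann-Whitney) identity:

```python
# Sketch of probability-averaging ensembling scored by ROC-AUC.
# The two score rows are toy stand-ins for real model outputs.
import numpy as np

def roc_auc(y_true, scores):
    """ROC-AUC via the rank-sum (Mann-Whitney) formulation."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    ranks = np.empty(len(scores))
    ranks[scores.argsort()] = np.arange(1, len(scores) + 1)
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

y = np.array([0, 0, 1, 1, 1, 0])
model_probs = np.array([
    [0.2, 0.4, 0.9, 0.6, 0.7, 0.1],  # model A (e.g. monolingual)
    [0.3, 0.2, 0.8, 0.7, 0.9, 0.4],  # model B (e.g. multilingual)
])
ensemble = model_probs.mean(axis=0)  # simple probability averaging
auc = roc_auc(y, ensemble)
```

Averaging tends to smooth out the idiosyncratic errors of any single model, which is why it improves robustness across the different pre-trained backbones.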

Biomedical data and benchmarks are highly valuable yet very limited in low-resource languages other than English, such as Vietnamese. In this paper, we make use of a state-of-the-art English-Vietnamese translation model to translate and produce both pretrained and supervised data in the biomedical domains. Thanks to such large-scale translation, we introduce ViPubmedT5, a pretrained Encoder-Decoder Transformer trained on 20 million translated abstracts from the high-quality public PubMed corpus. ViPubMedT5 demonstrates...

10.1101/2022.10.11.511776 preprint EN cc-by-nc-nd bioRxiv (Cold Spring Harbor Laboratory) 2022-10-14

Data clustering plays a significant role in the biomedical sciences, particularly in single-cell data analysis. Researchers use clustering algorithms to group individual cells into populations that can be evaluated across different levels of disease progression, drug response, and other clinical statuses. In many cases, multiple sets of clusters must be generated to assess varying cluster specificity. For example, there are multiple subtypes of leukocytes (e.g. T cells), whose preponderance and phenotype are assessed for...

10.1101/2021.08.01.454697 preprint EN bioRxiv (Cold Spring Harbor Laboratory) 2021-08-02