Long Phan

ORCID: 0000-0003-0980-7057
Research Areas
  • Topic Modeling
  • Natural Language Processing Techniques
  • Immune cells in cancer
  • Computational and Text Analysis Methods
  • Biomedical Text Mining and Ontologies
  • Single-cell and spatial transcriptomics
  • Advanced Graph Neural Networks
  • Metaheuristic Optimization Algorithms Research
  • Speech and dialogue systems
  • Neuroinflammation and Neurodegeneration Mechanisms
  • Energy Efficient Wireless Sensor Networks
  • Multimodal Machine Learning Applications
  • Domain Adaptation and Few-Shot Learning
  • Software Engineering Research
  • Anomaly Detection Techniques and Applications
  • Advanced Text Analysis Techniques
  • Philippine History and Culture
  • Software Testing and Debugging Techniques
  • Immune responses and vaccinations

Vietnam National University Ho Chi Minh City
2024

National Cancer Institute
2021-2022

Viet Tri University of Industry
2022

Case Western Reserve University
2021-2022

As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration...

10.48550/arxiv.2303.03915 preprint EN cc-by arXiv (Cornell University) 2023-01-01

In this report, we introduce SciFive, a domain-specific T5 model that has been pre-trained on large biomedical corpora. Our model outperforms the current SOTA methods (i.e. BERT, BioBERT, Base T5) on tasks in named entity recognition, relation extraction, natural language inference, and question-answering. We show that text-generation methods have significant potential in a broad array of biomedical NLP tasks, particularly those requiring longer, more complex outputs. Our results support further exploration of difficult text generation...

10.48550/arxiv.2106.03598 preprint EN cc-by arXiv (Cornell University) 2021-01-01
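SciFive follows the T5 recipe of casting every task as text-to-text generation. A minimal sketch of that framing is below; the task prefixes are illustrative assumptions, not the exact strings expected by the released checkpoints.

```python
# Sketch of T5-style text-to-text task framing in the spirit of SciFive.
# Prefix strings here are hypothetical, not the model's actual prompts.

PREFIXES = {
    "ner": "ncbi_ner: ",    # named entity recognition
    "re": "chemprot_re: ",  # relation extraction
    "nli": "mednli: ",      # natural language inference
    "qa": "bioasq_qa: ",    # question answering
}

def to_text2text(task: str, text: str) -> str:
    """Cast one biomedical example into a single generation prompt."""
    return PREFIXES[task] + text

prompt = to_text2text("ner", "Mutations in BRCA1 increase breast cancer risk.")
```

Because inputs and outputs are both plain strings, one encoder-decoder model can serve all four tasks, including those with long, structured answers.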

We present CoTexT, a pre-trained, transformer-based encoder-decoder model that learns the representative context between natural language (NL) and programming language (PL). Using self-supervision, CoTexT is pre-trained on large corpora to learn a general understanding of code. It supports downstream NL-PL tasks such as code summarizing/documentation, code generation, defect detection, and code debugging. We train CoTexT on different combinations of the available PL corpora, including both “bimodal” and “unimodal” data. Here, bimodal data...

10.18653/v1/2021.nlp4prog-1.5 article EN cc-by 2021-01-01
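The bimodal/unimodal split can be sketched as follows: a function paired with a docstring yields NL↔PL training pairs in both directions, while bare code yields code-only examples. The prefix strings and pairing scheme below are hypothetical illustrations, not CoTexT's own.

```python
# Sketch of building "bimodal" vs "unimodal" pre-training examples in the
# spirit of CoTexT. Prefixes and pairing are illustrative assumptions.

def make_examples(code, docstring=None):
    """Yield (input, target) text-to-text pairs from one function."""
    if docstring:
        # Bimodal: paired NL and PL, usable in both directions.
        yield ("code_to_text: " + code, docstring)
        yield ("text_to_code: " + docstring, code)
    else:
        # Unimodal: code only (e.g. for denoising/span-corruption objectives).
        yield ("code_only: " + code, code)

pairs = list(make_examples("def add(a, b):\n    return a + b", "Add two numbers."))
```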

Data clustering plays a significant role in the biomedical sciences, particularly in single-cell data analysis. Researchers use clustering algorithms to group individual cells into populations that can be evaluated across different levels of disease progression, drug response, and other clinical statuses. In many cases, multiple sets of clusters must be generated to assess varying cluster specificity. For example, there are multiple subtypes of leukocytes (e.g. T cells), whose preponderance and phenotype are assessed for...

10.1371/journal.pcbi.1010349 article EN public-domain PLoS Computational Biology 2022-10-03
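The multi-resolution idea, one clustering per desired level of specificity, can be sketched with scikit-learn's KMeans standing in for the graph-based methods typical of single-cell pipelines; the data and cluster counts below are synthetic stand-ins.

```python
# Sketch: multiple cluster sets at different resolutions, as in single-cell
# workflows where broad populations (e.g. T cells) are later subdivided
# into subtypes. KMeans is a stand-in for graph-based clustering methods.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
cells = rng.normal(size=(300, 10))  # 300 cells x 10 expression features (synthetic)

# One clustering per resolution: a coarse k for major lineages,
# finer k values for subtypes within them.
cluster_sets = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(cells)
    for k in (3, 8, 15)
}
```

Keeping every resolution (rather than a single "best" k) lets downstream analyses pick the granularity that matches the biological question.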

Text summarization is a challenging task within natural language processing that involves text generation from lengthy input sequences. While this task has been widely studied in English, there is very limited research on summarization for Vietnamese text. In this paper, we investigate the robustness of transformer-based encoder-decoder architectures for abstractive summarization. Leveraging transfer learning and self-supervised learning, we validate the performance of our methods on two datasets.

10.48550/arxiv.2110.04257 preprint EN cc-by arXiv (Cornell University) 2021-01-01

We present CoTexT, a pre-trained, transformer-based encoder-decoder model that learns the representative context between natural language (NL) and programming language (PL). Using self-supervision, CoTexT is pre-trained on large corpora to learn a general understanding of code. It supports downstream NL-PL tasks such as code summarizing/documentation, code generation, defect detection, and code debugging. We train CoTexT on different combinations of the available PL corpora, including both "bimodal" and "unimodal" data. Here, bimodal data...

10.48550/arxiv.2105.08645 preprint EN cc-by arXiv (Cornell University) 2021-01-01

This paper proposes several transformer-based approaches for Reliable Intelligence Identification on Vietnamese social network sites at the VLSP 2020 evaluation campaign. We exploit both monolingual and multilingual pre-trained models. Besides, we utilize the ensemble method to improve the robustness of the different approaches. Our team achieved a score of 0.9378 under the ROC-AUC metric on the private test set, which is competitive with other participants.

10.48550/arxiv.2012.07557 preprint EN cc-by arXiv (Cornell University) 2020-01-01
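The ensembling step amounts to averaging the predicted probabilities of the individual models and scoring the result with ROC-AUC. A self-contained sketch, with toy scores standing in for the transformer outputs and ROC-AUC computed from the rank-sum (Mann-Whitney) identity:

```python
# Sketch of probability-averaging ensembling scored by ROC-AUC.
# The two score rows are toy stand-ins for real model outputs.
import numpy as np

def roc_auc(y_true, scores):
    """ROC-AUC via the rank-sum (Mann-Whitney) formulation."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    ranks = np.empty(len(scores))
    ranks[scores.argsort()] = np.arange(1, len(scores) + 1)
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

y = np.array([0, 0, 1, 1, 1, 0])
model_probs = np.array([
    [0.2, 0.4, 0.9, 0.6, 0.7, 0.1],  # model A (e.g. monolingual)
    [0.3, 0.2, 0.8, 0.7, 0.9, 0.4],  # model B (e.g. multilingual)
])
ensemble = model_probs.mean(axis=0)  # simple probability averaging
auc = roc_auc(y, ensemble)
```

Averaging tends to smooth out the idiosyncratic errors of any single model, which is why it improves robustness across the different pre-trained backbones.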

Biomedical data and benchmarks are highly valuable yet very limited in low-resource languages other than English, such as Vietnamese. In this paper, we make use of a state-of-the-art English-Vietnamese translation model to translate and produce both pretrained and supervised data in the biomedical domains. Thanks to such large-scale translation, we introduce ViPubmedT5, a pretrained Encoder-Decoder Transformer trained on 20 million translated abstracts from the high-quality public PubMed corpus. ViPubMedT5 demonstrates...

10.1101/2022.10.11.511776 preprint EN cc-by-nc-nd bioRxiv (Cold Spring Harbor Laboratory) 2022-10-14

Data clustering plays a significant role in the biomedical sciences, particularly in single-cell data analysis. Researchers use clustering algorithms to group individual cells into populations that can be evaluated across different levels of disease progression, drug response, and other clinical statuses. In many cases, multiple sets of clusters must be generated to assess varying cluster specificity. For example, there are multiple subtypes of leukocytes (e.g. T cells), whose preponderance and phenotype are assessed for...

10.1101/2021.08.01.454697 preprint EN bioRxiv (Cold Spring Harbor Laboratory) 2021-08-02