LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset

Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
DOI: 10.48550/arxiv.2402.09391
Publication Date: 2024-02-14
ABSTRACT
Chemistry plays a crucial role in many domains, such as drug discovery and material science. While large language models (LLMs) such as GPT-4 exhibit remarkable capabilities on natural language processing tasks, existing work shows that their performance on chemistry tasks is discouragingly low. In this paper, however, we demonstrate that our developed LLMs can achieve very strong results on a comprehensive set of chemistry tasks, outperforming the most advanced GPT-4 across all the tasks by a substantial margin and approaching the SoTA task-specific models. The key to our success is a large-scale, comprehensive, high-quality dataset for instruction tuning named SMolInstruct. It contains 14 meticulously selected chemistry tasks and over three million samples, laying a solid foundation for training and evaluating LLMs for chemistry. Based on SMolInstruct, we fine-tune a set of open-source LLMs, among which we find that Mistral serves as the best base model for chemistry tasks. We further conduct an analysis of the impact of trainable parameters, providing insights for future research.
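
To make the setup concrete, the following is a minimal sketch of what instruction tuning an open-source LLM on SMolInstruct could look like with the Hugging Face transformers, datasets, and peft libraries. The dataset and model IDs ("osunlp/SMolInstruct", "mistralai/Mistral-7B-v0.1"), the "input"/"output" field names, and all hyperparameters are illustrative assumptions, not the paper's exact recipe; LoRA is used here simply as one common way to vary the number of trainable parameters that the paper's analysis concerns.

```python
# Sketch: LoRA instruction tuning on SMolInstruct (assumed IDs/fields/hyperparameters).
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

dataset = load_dataset("osunlp/SMolInstruct")  # assumed Hugging Face dataset ID

base = "mistralai/Mistral-7B-v0.1"  # assumed base-model ID
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA adapters keep the trainable-parameter count small; rank and target
# modules below are illustrative, not the paper's configuration.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # inspect how many parameters will train

def tokenize(example):
    # Assumes each sample carries "input"/"output" text fields forming an
    # instruction-response pair; concatenate and train with a causal LM loss.
    text = example["input"] + example["output"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=1024)

train_data = dataset["train"].map(
    tokenize, remove_columns=dataset["train"].column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llasmol-lora",
                           per_device_train_batch_size=4,
                           num_train_epochs=1,
                           learning_rate=2e-4,
                           fp16=True,
                           logging_steps=50),
    train_dataset=train_data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Varying the LoRA rank (or which modules receive adapters) changes the trainable-parameter count, which is one straightforward way to probe the kind of trainable-parameter impact the abstract mentions.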