LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset

Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
DOI: 10.48550/arxiv.2402.09391
Publication Date: 2024-02-14
ABSTRACT
Chemistry plays a crucial role in many domains, such as drug discovery and material science. While large language models (LLMs) such as GPT-4 exhibit remarkable capabilities on natural language processing tasks, existing work shows that their performance on chemistry tasks is discouragingly low. In this paper, however, we demonstrate that our developed LLMs can achieve very strong results on a comprehensive set of chemistry tasks, outperforming the most advanced GPT-4 across all the tasks by a substantial margin and approaching the SoTA task-specific models. The key to our success is a large-scale, comprehensive, high-quality dataset for instruction tuning named SMolInstruct. It contains 14 meticulously selected chemistry tasks and over three million samples, laying a solid foundation for training and evaluating LLMs for chemistry. Based on SMolInstruct, we fine-tune a set of open-source LLMs, among which we find that Mistral serves as the best base model for chemistry tasks. We further conduct an analysis of the impact of trainable parameters, providing insights for future research.
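
To make the setup concrete, the following is a minimal sketch of what instruction tuning an open-source LLM on SMolInstruct could look like with the Hugging Face transformers, datasets, and peft libraries. The dataset and model IDs ("osunlp/SMolInstruct", "mistralai/Mistral-7B-v0.1"), the "input"/"output" field names, and all hyperparameters are illustrative assumptions, not the paper's exact recipe; LoRA is used here simply as one common way to vary the number of trainable parameters that the paper's analysis concerns.

```python
# Sketch: LoRA instruction tuning on SMolInstruct (assumed IDs/fields/hyperparameters).
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

dataset = load_dataset("osunlp/SMolInstruct")  # assumed Hugging Face dataset ID

base = "mistralai/Mistral-7B-v0.1"  # assumed base-model ID
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA adapters keep the trainable-parameter count small; rank and target
# modules below are illustrative, not the paper's configuration.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # inspect how many parameters will train

def tokenize(example):
    # Assumes each sample carries "input"/"output" text fields forming an
    # instruction-response pair; concatenate and train with a causal LM loss.
    text = example["input"] + example["output"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=1024)

train_data = dataset["train"].map(
    tokenize, remove_columns=dataset["train"].column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llasmol-lora",
                           per_device_train_batch_size=4,
                           num_train_epochs=1,
                           learning_rate=2e-4,
                           fp16=True,
                           logging_steps=50),
    train_dataset=train_data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Varying the LoRA rank (or which modules receive adapters) changes the trainable-parameter count, which is one straightforward way to probe the kind of trainable-parameter impact the abstract mentions.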