MolLM: a unified language model for integrating biomedical text with 2D and 3D molecular representations

Keywords: Representation, Feature Learning, Modalities, ENCODE
DOI: 10.1093/bioinformatics/btae260 Publication Date: 2024-06-28T09:30:32Z
ABSTRACT
Motivation: The current paradigm of deep learning models for the joint representation of molecules and text primarily relies on 1D or 2D molecular formats, neglecting significant 3D structural information that offers valuable physical insight. This narrow focus inhibits the models' versatility and adaptability across a wide range of modalities. Conversely, the limited research focusing on explicit 3D molecular representation tends to overlook textual data within the biomedical domain.

Results: We present a unified pre-trained language model, MolLM, that concurrently captures 2D and 3D molecular information alongside biomedical text. MolLM consists of a text Transformer encoder and a molecular Transformer encoder, designed to encode both 2D and 3D molecular structures. To support MolLM's self-supervised pre-training, we constructed 160K molecule-text pairings. Employing contrastive learning as the supervisory signal for cross-modal learning, MolLM demonstrates robust molecular representation capabilities across four downstream tasks, including cross-modal molecule and text matching, property prediction, molecule captioning, and text-prompted molecular editing. Through ablation, we demonstrate that the inclusion of 3D representations improves performance in these tasks.

Availability and implementation: Our code, data, model weights, and examples of using our model are all available at https://github.com/gersteinlab/MolLM. In particular, we provide Jupyter Notebooks offering step-by-step guidance on how to use MolLM to extract embeddings.
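The abstract describes contrastive learning as the supervisory signal aligning text and molecule embeddings across modalities. As a rough illustration of that idea, the following is a minimal sketch of a symmetric InfoNCE-style objective over a batch of paired embeddings; the function name, shapes, and temperature value are illustrative assumptions, not taken from the paper or its repository:

```python
import numpy as np

def info_nce(text_emb, mol_emb, temperature=0.1):
    """Symmetric InfoNCE loss for paired text/molecule embeddings.

    text_emb, mol_emb: (batch, dim) arrays; row i of each is a positive pair.
    Names and the temperature value are illustrative, not from MolLM itself.
    """
    # L2-normalize so dot products are cosine similarities
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    m = mol_emb / np.linalg.norm(mol_emb, axis=1, keepdims=True)
    logits = t @ m.T / temperature               # (batch, batch) similarities
    idx = np.arange(len(logits))                 # matching pairs on the diagonal

    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()            # cross-entropy on diagonal

    # average both directions: text -> molecule and molecule -> text
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
text = rng.normal(size=(8, 32))
mol = text + 0.05 * rng.normal(size=(8, 32))     # nearly aligned pairs
print(info_nce(text, mol))                       # small loss for aligned pairs
```

Minimizing this loss pulls each text embedding toward its paired molecule embedding while pushing it away from the other molecules in the batch, which is the standard mechanism behind cross-modal matching tasks like those evaluated in the paper.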