Impact of the Characteristics of Quantum Chemical Databases on Machine Learning Prediction of Tautomerization Energies

DOI: 10.1021/acs.jctc.1c00363 Publication Date: 2021-07-21T19:35:21Z
ABSTRACT
An essential aspect for adequate predictions of chemical properties by machine learning models is the database used for training them. However, studies that analyze how the content and structure of the databases used for training impact the prediction quality are scarce. In this work, we analyze and quantify the relationships learned by a machine learning model (Neural Network) trained on five different reference databases (QM9, PC9, ANI-1E, ANI-1, and ANI-1x) to predict tautomerization energies from molecules in Tautobase. For this, the influence of characteristics such as the number of heavy atoms in a molecule, the number of atoms of a given element, bond composition, or initial geometry on the quality of the predictions is considered. The results indicate that training on a chemically diverse database is crucial for obtaining good results and also that conformational sampling can partly compensate for limited coverage of chemical diversity. The overall best-performing reference database (ANI-1x) performs on average 1 kcal/mol better than PC9, which, however, contains about 2 orders of magnitude fewer reference structures. On the other hand, PC9 is chemically more diverse by a factor of ∼5, as quantified by the number of atom-in-molecule-based fragments (amons) it contains compared with the ANI family of databases. A quantitative measure for deficiencies is the Kullback–Leibler divergence between reference and target distributions. It is explicitly demonstrated that when certain types of bonds need to be covered in the target database (Tautobase) but are undersampled in the reference databases, the resulting predictions are poor. Examples include the poor performance of all databases analyzed in predicting C(sp2)–C(sp2) double bonds close to heteroatoms and azoles containing N–N and N–O bonds. Analysis of the results with a Tree MAP algorithm provides a deeper understanding of specific deficiencies in predicting tautomerization energies by the reference datasets due to inadequate coverage of chemical space. Capitalizing on this information can be used to either improve existing databases or generate new databases of sufficient diversity for a range of machine learning (ML) applications in chemistry.
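To make the Kullback–Leibler measure mentioned in the abstract concrete, the following is a minimal sketch of how such a divergence could be computed between two binned distributions, for example histograms of a bond length in the target and in a reference database. The function names, the choice of histogram bins, the direction D_KL(target || reference), and the small floor `eps` used to avoid division by zero are all assumptions made for illustration. The abstract does not specify how the authors computed the divergence.

```python
import math


def normalize(counts):
    """Turn non-negative histogram counts into probabilities."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("histogram must contain at least one count")
    return [c / total for c in counts]


def kl_divergence(target_counts, reference_counts, eps=1e-12):
    """D_KL(target || reference) in nats for two histograms on the same bins.

    Assumption: the target distribution plays the role of P and the
    reference distribution the role of Q. Empty reference bins are floored
    at eps (hypothetical smoothing) so the logarithm stays finite.
    """
    if len(target_counts) != len(reference_counts):
        raise ValueError("histograms must share the same bins")
    p = normalize(target_counts)
    q = [max(x, eps) for x in normalize(reference_counts)]
    q_total = sum(q)
    q = [x / q_total for x in q]  # renormalize after flooring
    # Bins with p == 0 contribute nothing to the sum by convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Illustrative counts only; these are not data from the paper.
target = [10, 10, 10]        # bond type evenly spread over three bins
well_sampled = [100, 95, 105]
undersampled = [280, 10, 10]

print(kl_divergence(target, well_sampled))
print(kl_divergence(target, undersampled))
```

In this toy setting, the reference histogram that undersamples two of the three bins yields the larger divergence, which is the kind of signal the abstract describes for bond types that are covered in Tautobase but rare in the reference databases.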
REFERENCES (83)
CITATIONS (14)