
Improved GNNs for Log D7.4 Prediction by Transferring Knowledge from Low-Fidelity Data

Authors :
Duan, Yan-Jing
Fu, Li
Zhang, Xiao-Chen
Long, Teng-Zhi
He, Yuan-Hang
Liu, Zhao-Qian
Lu, Ai-Ping
Deng, Ya-Feng
Hsieh, Chang-Yu
Hou, Ting-Jun
Cao, Dong-Sheng
Source :
Journal of Chemical Information and Modeling; April 2023, Vol. 63 Issue: 8 p2345-2359, 15p
Publication Year :
2023

Abstract

The n-octanol/buffer solution distribution coefficient at pH = 7.4 (log D7.4) is an indicator of lipophilicity, and it influences a wide variety of absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties and druggability of compounds. In log D7.4 prediction, graph neural networks (GNNs) can uncover subtle structure–property relationships (SPRs) by automatically extracting features from molecular graphs that facilitate the learning of SPRs, but their performances are often limited by the small size of available datasets. Herein, we present a transfer learning strategy called pretraining on computational data and then fine-tuning on experimental data (PCFE) to fully exploit the predictive potential of GNNs. PCFE works by pretraining a GNN model on 1.71 million computational log D data (low-fidelity data) and then fine-tuning it on 19,155 experimental log D7.4 data (high-fidelity data). The experiments for three GNN architectures (graph convolutional network (GCN), graph attention network (GAT), and Attentive FP) demonstrated the effectiveness of PCFE in improving GNNs for log D7.4 predictions. Moreover, the optimal PCFE-trained GNN model (cx-Attentive FP, test R² = 0.909) outperformed four excellent descriptor-based models (random forest (RF), gradient boosting (GB), support vector machine (SVM), and extreme gradient boosting (XGBoost)). The robustness of the cx-Attentive FP model was also confirmed by evaluating the models with different training data sizes and dataset splitting strategies. Therefore, we developed a webserver and defined the applicability domain for this model. The webserver (http://tools.scbdd.com/chemlogd/) provides free log D7.4 prediction services. In addition, the important descriptors for log D7.4 were detected by the Shapley additive explanations (SHAP) method, and the most relevant substructures of log D7.4 were identified by the attention mechanism.
Finally, the matched molecular pair analysis (MMPA) was performed to summarize the contributions of common chemical substituents to log D7.4, including a variety of hydrocarbon groups, halogen groups, heteroatoms, and polar groups. In conclusion, we believe that the cx-Attentive FP model can serve as a reliable tool to predict log D7.4 and hope that pretraining on low-fidelity data can help GNNs make accurate predictions of other endpoints in drug discovery.
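The pretrain-then-fine-tune idea behind PCFE can be illustrated with a deliberately small sketch. This is not the authors' code and does not use a GNN: it assumes a one-parameter-pair linear model, synthetic data, and hypothetical names (`train`, `make_data`, `run_demo`). The "low-fidelity" set is large but systematically biased and noisy, the "high-fidelity" set is small but accurate, and the comparison is between fine-tuning from pretrained weights and training from scratch under the same small fine-tuning budget.

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def train(w, b, xs, ys, lr=0.1, epochs=5):
    """Full-batch gradient descent on mean squared error, starting from (w, b)."""
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def make_data(rng, n, slope, intercept, noise):
    """Synthetic 1-D regression data (an assumption; not real log D values)."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [slope * x + intercept + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def run_demo(seed=0):
    rng = random.Random(seed)
    # Low-fidelity stand-in: plentiful, but biased and noisy relative to the truth.
    low_x, low_y = make_data(rng, 1000, slope=1.8, intercept=0.7, noise=0.3)
    # High-fidelity stand-in: scarce, close to the true relation y = 2x + 1.
    high_x, high_y = make_data(rng, 20, slope=2.0, intercept=1.0, noise=0.05)
    test_x, test_y = make_data(rng, 200, slope=2.0, intercept=1.0, noise=0.0)

    # Baseline: train from scratch on the small high-fidelity set only.
    w_s, b_s = train(0.0, 0.0, high_x, high_y, epochs=5)

    # PCFE-style: pretrain on low-fidelity data, then fine-tune on high-fidelity data.
    w_p, b_p = train(0.0, 0.0, low_x, low_y, epochs=200)
    w_f, b_f = train(w_p, b_p, high_x, high_y, epochs=5)

    r2_scratch = r2_score(test_y, [w_s * x + b_s for x in test_x])
    r2_finetuned = r2_score(test_y, [w_f * x + b_f for x in test_x])
    return r2_scratch, r2_finetuned


if __name__ == "__main__":
    scratch, finetuned = run_demo()
    print(f"from scratch: R2 = {scratch:.3f}")
    print(f"pretrained + fine-tuned: R2 = {finetuned:.3f}")
```

In this toy setting the pretrained weights start close to the target relation, so a few fine-tuning steps only need to correct the systematic bias of the low-fidelity data. Whether the same benefit holds for a given GNN and endpoint depends on how well the computational values track the experimental ones.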

Details

Language :
English
ISSN :
1549-9596 and 1549-960X
Volume :
63
Issue :
8
Database :
Supplemental Index
Journal :
Journal of Chemical Information and Modeling
Publication Type :
Periodical
Accession number :
ejs62691866
Full Text :
https://doi.org/10.1021/acs.jcim.2c01564