An NLP-based technique to extract meaningful features from drug SMILES

Authors :
Rahul Sharma
Ehsan Saghapour
Jake Y. Chen
Source :
iScience, Vol 27, Iss 3, Pp 109127- (2024)
Publication Year :
2024
Publisher :
Elsevier, 2024.

Abstract

Summary: Natural language processing (NLP) is a well-established field in machine learning (ML) for developing language models that capture the sequence of words in a sentence. Similarly, drug molecule structures can be represented as sequences using the SMILES notation. However, unlike natural language texts, special characters in drug SMILES have specific meanings and cannot be ignored. We introduce a novel NLP-based method that extracts interpretable sequences and essential features from drug SMILES notation using N-grams. We compare these features to Morgan fingerprint bit-vectors using UMAP-based embedding, and we validate the method's effectiveness through two personalized drug screening (PDS) case studies. Our NLP-based features are sparse and, when combined with gene expressions and disease phenotype features, produce better ML models for PDS. This approach provides a new way to analyze drug molecule structures represented as SMILES notation, which can help accelerate drug discovery efforts. We have also made our method accessible through a Python library.
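
Illustrative sketch (not the authors' implementation or their Python library): the general idea of N-gram features on SMILES can be shown with plain character-level N-grams that keep special characters such as "(", "=", and ring-closure digits. The function names smiles_ngrams and build_feature_matrix, the 1-to-3 N-gram range, and the three example molecules are assumptions made here for illustration. Splitting by character is a simplification: two-letter atoms such as "Cl" or "Br" are broken apart at N = 1, which a chemistry-aware tokenizer would avoid.

```python
from collections import Counter


def smiles_ngrams(smiles, n_min=1, n_max=3):
    """Count every contiguous character n-gram in a SMILES string.

    Special characters ('(', ')', '=', '#', digits, ...) are kept,
    because in SMILES they encode branches, bond orders, and rings.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(smiles) - n + 1):
            counts[smiles[i:i + n]] += 1
    return counts


def build_feature_matrix(smiles_list, n_min=1, n_max=3):
    """Return (vocabulary, matrix) with one row of n-gram counts per molecule."""
    per_molecule = [smiles_ngrams(s, n_min, n_max) for s in smiles_list]
    vocabulary = sorted(set().union(*per_molecule))
    matrix = [[counts.get(gram, 0) for gram in vocabulary]
              for counts in per_molecule]
    return vocabulary, matrix


# Hypothetical toy input: three small molecules, chosen only for illustration.
drugs = {"ethanol": "CCO", "acetic acid": "CC(=O)O", "benzene": "c1ccccc1"}
vocabulary, matrix = build_feature_matrix(list(drugs.values()))

# Most entries are zero because each molecule uses only a few of the n-grams.
zero_fraction = sum(row.count(0) for row in matrix) / (len(matrix) * len(vocabulary))
print(f"{len(vocabulary)} n-gram features, {zero_fraction:.0%} of entries are zero")
```

Each column of the resulting matrix corresponds to a readable substring of the SMILES notation, which is what makes features of this kind interpretable, and the high share of zero entries illustrates why such features are sparse. In a screening model these columns would be concatenated with other feature groups, such as the gene expression and disease phenotype features mentioned in the abstract.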

Details

Language :
English
ISSN :
25890042
Volume :
27
Issue :
3
Database :
Directory of Open Access Journals
Journal :
iScience
Publication Type :
Academic Journal
Accession number :
edsdoj.9a8c5246d434d24a059e85dece6f1e2
Document Type :
article
Full Text :
https://doi.org/10.1016/j.isci.2024.109127