An NLP-based technique to extract meaningful features from drug SMILES
- Source :
- iScience, Vol 27, Iss 3, Pp 109127- (2024)
- Publication Year :
- 2024
- Publisher :
- Elsevier, 2024.
Abstract
- Summary: Natural language processing (NLP) is a well-established field in machine learning (ML) for developing language models that capture the sequence of words in a sentence. Similarly, drug molecule structures can be represented as sequences using the SMILES notation. However, unlike natural language texts, special characters in drug SMILES have specific meanings and cannot be ignored. We introduce a novel NLP-based method that uses N-grams to extract interpretable sequences and essential features from drug SMILES notation. We compare these features to Morgan fingerprint bit-vectors using UMAP-based embedding, and we validate the method's effectiveness through two personalized drug screening (PSD) case studies. Our NLP-based features are sparse and, when combined with gene expression and disease phenotype features, produce better ML models for PSD. This approach provides a new way to analyze drug molecule structures represented in SMILES notation, which can help accelerate drug discovery efforts. We have also made our method accessible through a Python library.
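The general idea of N-gram features over SMILES strings can be illustrated with a minimal sketch. This is not the authors' Python library, whose API is not described in this record. The tokenization rule and the function names (`tokenize`, `ngrams`, `ngram_counts`, `featurize`) are assumptions made for illustration only. The sketch keeps special characters such as `(`, `=`, and `[...]` as tokens, in line with the abstract's point that they carry meaning and cannot be dropped.

```python
import re
from collections import Counter

# Assumed tokenization rule (not taken from the paper): bracket atoms such as
# [NH3+] stay whole, two-letter halogens Br/Cl stay whole, two-digit ring
# closures (%12) stay whole, and every other character is its own token.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    """Split a SMILES string into tokens, keeping special characters."""
    return SMILES_TOKEN.findall(smiles)


def ngrams(tokens, n):
    """Return all contiguous runs of n tokens, joined back into strings."""
    return ["".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(smiles, n_values=(1, 2, 3)):
    """Count N-grams of each requested length in one SMILES string."""
    tokens = tokenize(smiles)
    counts = Counter()
    for n in n_values:
        counts.update(ngrams(tokens, n))
    return counts


def featurize(smiles_list, n_values=(2,)):
    """Build a shared N-gram vocabulary and a count matrix (one row per molecule)."""
    per_molecule = [ngram_counts(s, n_values) for s in smiles_list]
    vocab = sorted(set().union(*per_molecule))
    matrix = [[c.get(g, 0) for g in vocab] for c in per_molecule]
    return vocab, matrix


if __name__ == "__main__":
    # Ethanol and ethylamine share the "CC" bigram and differ in the second one.
    vocab, matrix = featurize(["CCO", "CCN"], n_values=(2,))
    print(vocab)
    print(matrix)
```

Because each molecule contains only a small fraction of the N-grams seen across a whole drug collection, most entries in a count matrix built this way are zero, which is consistent with the sparsity the abstract mentions.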
- Subjects :
- Pharmaceutical science
Chemistry
Computer science
Science
Details
- Language :
- English
- ISSN :
- 25890042
- Volume :
- 27
- Issue :
- 3
- Database :
- Directory of Open Access Journals
- Journal :
- iScience
- Publication Type :
- Academic Journal
- Accession number :
- edsdoj.9a8c5246d434d24a059e85dece6f1e2
- Document Type :
- article
- Full Text :
- https://doi.org/10.1016/j.isci.2024.109127