1. Interpretable detection of novel human viruses from genome sequencing data
- Author
- Bartoszewicz, Jakub M, Seidel, Anja, and Renard, Bernhard Y
- Subjects
AcademicSubjects/SCI01140, AcademicSubjects/SCI01060, Computer science, Standard Article, AcademicSubjects/SCI01180, medicine.disease_cause, computer.software_genre, Genome, interpretability tools, 0302 clinical medicine, Structural Biology, Nucleotide, Pathogen, Coronavirus, Interpretability, chemistry.chemical_classification, 0303 health sciences, Suite, Applied Mathematics, Computer Science Applications, AcademicSubjects/SCI00980, Host (network), viral host prediction, Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), AcademicSubjects/SCI00030, education, Genomics, Machine learning, Virus, DNA sequencing, 03 medical and health sciences, medicine, Genetics, ddc:610, Molecular Biology, 030304 developmental biology, business.industry, Deep learning, 500 Natural sciences and mathematics::570 Life sciences, Biology::570 Life sciences, Biology, Visualization, chemistry, Filter (video), deep neural architectures, Artificial intelligence, business, 610 Medicine and health, computer, 030217 neurology & neurosurgery
- Abstract
Viruses evolve extremely quickly, so reliable methods for viral host prediction are necessary to safeguard biosecurity and biosafety alike. Novel human-infecting viruses are difficult to detect with standard bioinformatics workflows. Here, we predict whether a virus can infect humans directly from next-generation sequencing reads. We show that deep neural architectures significantly outperform both shallow machine learning and standard, homology-based algorithms, cutting the error rates in half and generalizing to taxonomic units distant from those presented during training. Further, we develop a suite of interpretability tools and show that it can be applied also to other models beyond the host prediction task. We propose a new approach for convolutional filter visualization to disentangle the information content of each nucleotide from its contribution to the final classification decision. Nucleotide-resolution maps of the learned associations between pathogen genomes and the infectious phenotype can be used to detect regions of interest in novel agents, for example, the SARS-CoV-2 coronavirus, unknown before it caused a COVID-19 pandemic in 2020. All methods presented here are implemented as easy-to-install packages not only enabling analysis of NGS datasets without requiring any deep learning skills, but also allowing advanced users to easily train and explain new models for genomics.
- Published
- 2021