The application of deep learning for the classification of correct and incorrect SNP genotypes from whole-genome DNA sequencing pipelines
- Source :
- J Appl Genet, Journal of Applied Genetics
- Publication Year :
- 2020
- Publisher :
- Springer Science and Business Media LLC, 2020.
Abstract
- A downside of next-generation sequencing (NGS) technology is its high technical error rate. We built a tool that uses array-based genotype information to classify NGS-based SNPs as correct or incorrect calls. The deep learning algorithms were implemented via Keras. Several algorithms were tested: (i) the basic, naïve algorithm, (ii) the naïve algorithm modified by pre-imposing different weights on the incorrect and correct SNP classes when calculating the loss metric, and (iii)–(v) the naïve algorithm modified by random re-sampling (with replacement) of the incorrect SNPs to match 30%/60%/100% of the number of correct SNPs. The training data set was composed of data from three bulls and consisted of 2,227,995 correct (97.94%) and 46,920 incorrect SNPs, while the validation data set consisted of data from one bull with 749,506 correct (98.05%) and 14,908 incorrect SNPs. The results showed that for a rare-event classification problem, such as incorrect SNP detection in NGS data, the most parsimonious naïve model and the model with weighting of SNP classes provided the best results for the classification of the validation data set. Both classified 19% of truly incorrect SNPs as incorrect and 99% of truly correct SNPs as correct, and both resulted in an F1 score of 0.21, the highest among the compared algorithms. We conclude that the basic models were less closely adapted to the specifics of the training data set and therefore classified the independent validation data set better than the other tested models. Electronic supplementary material: The online version of this article (10.1007/s13353-020-00586-0) contains supplementary material, which is available to authorized users.
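The class weighting, re-sampling and F1 figures described in the abstract can be illustrated with a minimal sketch. This is not the authors' code. The inverse-frequency weighting scheme, the label encoding (0 = correct, 1 = incorrect) and all function names are assumptions made for illustration; the abstract does not state the weights that were used, nor whether the re-sampled incorrect SNPs replaced or supplemented the original ones. Only the class counts and the rounded classification rates are taken from the abstract.

```python
import numpy as np

# Class counts reported in the abstract.
TRAIN_CORRECT, TRAIN_INCORRECT = 2_227_995, 46_920
VALID_CORRECT, VALID_INCORRECT = 749_506, 14_908


def inverse_frequency_weights(n_correct, n_incorrect):
    """Per-class loss weights, assuming an inverse-frequency scheme.

    Hypothetical encoding: label 0 = correct SNP, label 1 = incorrect SNP.
    """
    total = n_correct + n_incorrect
    return {0: total / (2 * n_correct), 1: total / (2 * n_incorrect)}


def oversample_incorrect(labels, ratio, rng):
    """Row indices in which incorrect SNPs are drawn with replacement
    until they number `ratio` times the correct SNPs (assumed procedure).
    """
    correct = np.flatnonzero(labels == 0)
    incorrect = np.flatnonzero(labels == 1)
    target = int(round(ratio * correct.size))
    drawn = rng.choice(incorrect, size=target, replace=True)
    indices = np.concatenate([correct, drawn])
    rng.shuffle(indices)  # mix the two classes before training
    return indices


def f1_from_rates(recall, specificity, n_positive, n_negative):
    """F1 for the positive (incorrect SNP) class from class-wise rates."""
    tp = recall * n_positive
    fn = n_positive - tp
    fp = (1.0 - specificity) * n_negative
    return 2 * tp / (2 * tp + fp + fn)


weights = inverse_frequency_weights(TRAIN_CORRECT, TRAIN_INCORRECT)

# Toy labels only; the real data set is not reproduced here.
rng = np.random.default_rng(0)
toy_labels = np.array([0] * 1000 + [1] * 20)
toy_indices = oversample_incorrect(toy_labels, ratio=0.3, rng=rng)

# Uses the rounded rates from the abstract (19% and 99%), so the value
# is only approximately comparable to the reported F1 score.
approx_f1 = f1_from_rates(0.19, 0.99, VALID_INCORRECT, VALID_CORRECT)
print(weights, np.bincount(toy_labels[toy_indices]), round(approx_f1, 2))
```

A dictionary of this shape could be passed as the `class_weight` argument of Keras `Model.fit`. Because the rates in the abstract are rounded to whole percentages, the F1 value computed here should be expected to land near, rather than exactly on, the reported 0.21.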
- Subjects :
- 0301 basic medicine
Genotyping Techniques
TensorFlow
Computational biology
Biology
Polymorphism, Single Nucleotide
Genome
DNA sequencing
03 medical and health sciences
Deep Learning
0302 clinical medicine
Genotype
Genetics
Animals
SNP
SNP microarray
Whole Genome Sequencing
business.industry
Deep learning
SNP calling
Correction
General Medicine
Classification
Animal Genetics • Original Paper
030104 developmental biology
Next-generation sequencing
Cattle
Artificial intelligence
business
Algorithms
030217 neurology & neurosurgery
Keras
Python
Details
- ISSN :
- 2190-3883 and 1234-1983
- Volume :
- 61
- Database :
- OpenAIRE
- Journal :
- Journal of Applied Genetics
- Accession number :
- edsair.doi.dedup.....e28f49e7e5b2d14823f1d9458b386fbf