The application of deep learning for the classification of correct and incorrect SNP genotypes from whole-genome DNA sequencing pipelines

Authors :
Magda Mielczarek
Krzysztof Kotlarz
Joanna Szyda
Bernt Guldbrandtsen
Bartosz Czech
Tomasz Suchocki
Source :
J Appl Genet, Journal of Applied Genetics
Publication Year :
2020
Publisher :
Springer Science and Business Media LLC, 2020.

Abstract

A downside of next-generation sequencing (NGS) technology is its high technical error rate. We built a tool that uses array-based genotype information to classify NGS-based SNPs into correct and incorrect calls. The deep learning algorithms were implemented via Keras. Several algorithms were tested: (i) the basic, naïve algorithm; (ii) the naïve algorithm modified by pre-imposing different weights on the incorrect and correct SNP classes when calculating the loss metric; and (iii)–(v) the naïve algorithm modified by random re-sampling (with replacement) of the incorrect SNPs to match 30%/60%/100% of the number of correct SNPs. The training data set was composed of data from three bulls and consisted of 2,227,995 correct (97.94%) and 46,920 incorrect SNPs, while the validation data set consisted of data from one bull with 749,506 correct (98.05%) and 14,908 incorrect SNPs. The results showed that for a rare-event classification problem, such as incorrect SNP detection in NGS data, the most parsimonious naïve model and the model with weighting of SNP classes provided the best classification of the validation data set. Both classified 19% of truly incorrect SNPs as incorrect and 99% of truly correct SNPs as correct, and both achieved an F1 score of 0.21, the highest among the compared algorithms. We conclude that the basic models were less adapted to the specifics of the training data set and therefore classified the independent validation data set better than the other tested models.

Electronic supplementary material: The online version of this article (10.1007/s13353-020-00586-0) contains supplementary material, which is available to authorized users.
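The weighting and re-sampling variants (ii) to (v) named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `oversample_minority` and `inverse_frequency_weights`, the label coding (0 = correct, 1 = incorrect), the toy data, and the inverse-frequency weighting scheme are all assumptions made for illustration, since the abstract does not state how the class weights were chosen.

```python
import numpy as np

def oversample_minority(labels, ratio, rng):
    """Return row indices in which the minority class (label 1, assumed to
    mean 'incorrect SNP') is re-sampled with replacement so that its count
    equals ratio * (number of majority-class rows)."""
    labels = np.asarray(labels)
    majority_idx = np.flatnonzero(labels == 0)
    minority_idx = np.flatnonzero(labels == 1)
    target = int(round(ratio * majority_idx.size))
    resampled = rng.choice(minority_idx, size=target, replace=True)
    idx = np.concatenate([majority_idx, resampled])
    rng.shuffle(idx)  # mix the classes so batches are not ordered by label
    return idx

def inverse_frequency_weights(labels):
    """One possible (assumed) weighting: each class is weighted by the
    inverse of its frequency, scaled so a balanced set gives weight 1."""
    labels = np.asarray(labels)
    counts = np.bincount(labels, minlength=2)
    return {cls: labels.size / (2.0 * counts[cls]) for cls in (0, 1)}

rng = np.random.default_rng(0)
# Toy labels with roughly the imbalance reported in the abstract (~2% incorrect).
labels = (rng.random(10_000) < 0.02).astype(int)

for ratio in (0.3, 0.6, 1.0):  # the 30%/60%/100% settings of variants (iii)-(v)
    idx = oversample_minority(labels, ratio, rng)
    n_incorrect = int(labels[idx].sum())
    n_correct = idx.size - n_incorrect
    print(f"ratio={ratio}: correct={n_correct}, incorrect={n_incorrect}")

print(inverse_frequency_weights(labels))
```

In Keras, a dictionary of this form could be passed to `Model.fit` through its `class_weight` argument, which scales each sample's contribution to the loss by the weight of its class. Whether the authors used that mechanism is not stated in the abstract.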

Details

ISSN :
2190-3883 and 1234-1983
Volume :
61
Database :
OpenAIRE
Journal :
Journal of Applied Genetics
Accession number :
edsair.doi.dedup.....e28f49e7e5b2d14823f1d9458b386fbf