Understanding the Causes of Errors in Eukaryotic Protein-coding Gene Prediction: A Case Study of Primate Proteomes
- Source :
- BMC Bioinformatics, BioMed Central, Vol 21, Iss 1, Pp 1-16 (2020), ⟨10.1186/s12859-020-03855-1⟩
- Publication Year :
- 2020
- Publisher :
- Research Square Platform LLC, 2020.
Abstract
- Background: Recent advances in sequencing technologies have led to an explosion in the number of genomes available, but accurate genome annotation remains a major challenge. The prediction of protein-coding genes in eukaryotic genomes is especially problematic, due to their complex exon–intron structures. Even the best eukaryotic gene prediction algorithms can make serious errors that will significantly affect subsequent analyses.
Results: We first investigated the prevalence of gene prediction errors in a large set of 176,478 proteins from ten primate proteomes available in public databases. Using the well-studied human proteins as a reference, a total of 82,305 potential errors were detected, including 44,001 deletions, 27,289 insertions and 11,015 mismatched segments where part of the correct protein sequence is replaced with an alternative erroneous sequence. We then focused on the mismatched sequence errors that cause particular problems for downstream applications. A detailed characterization allowed us to identify the potential causes for the gene misprediction in approximately half (5446) of these cases. As a proof-of-concept, we also developed a simple method which allowed us to propose improved sequences for 603 primate proteins.
Conclusions: Gene prediction errors in primate proteomes affect up to 50% of the sequences. Major causes of errors include undetermined genome regions, genome sequencing or assembly issues, and limitations in the models used to represent gene exon–intron structures. Nevertheless, existing genome sequences can still be exploited to improve protein sequence quality. Perspectives of the work include the characterization of other types of gene prediction errors, as well as the development of a more comprehensive algorithm for protein sequence error correction.
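The error counts quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check on the reported figures, not part of the authors' pipeline; the variable names are illustrative assumptions.

```python
# Figures as reported in the abstract; variable names are hypothetical.
errors_by_type = {
    "deletion": 44_001,
    "insertion": 27_289,
    "mismatched_segment": 11_015,
}

# The three error classes should add up to the reported total of 82,305.
total_errors = sum(errors_by_type.values())

# Share of mismatched segments for which a potential cause was identified
# (5446 cases, described in the abstract as "approximately half").
explained_mismatches = 5_446
explained_fraction = explained_mismatches / errors_by_type["mismatched_segment"]

print(total_errors)                  # 82305
print(f"{explained_fraction:.1%}")   # 49.4%
```

The sum matches the stated total, and 5446 of 11,015 is about 49.4%, consistent with "approximately half".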
- Subjects :
- Primates
Proteome
Gene prediction
Receptor-Like Protein Tyrosine Phosphatases
Computational biology
Biology
lcsh:Computer applications to medicine. Medical informatics
Biochemistry
Genome
DNA sequencing
Open Reading Frames
03 medical and health sciences
0302 clinical medicine
Protein sequencing
Structural Biology
[SDV.BBM.GTP]Life Sciences [q-bio]/Biochemistry, Molecular Biology/Genomics [q-bio.GN]
Animals
Humans
Amino Acid Sequence
Error correction
Databases, Protein
lcsh:QH301-705.5
Molecular Biology
Gene
030304 developmental biology
Sequence (medicine)
0303 health sciences
Applied Mathematics
Protein sequence errors
Genome project
Computer Science Applications
Mutagenesis, Insertional
lcsh:Biology (General)
lcsh:R858-859.7
Sequence Alignment
Gene Deletion
030217 neurology & neurosurgery
Research Article
Genome annotation
Details
- ISSN :
- 1471-2105
- Database :
- OpenAIRE
- Journal :
- BMC Bioinformatics, BioMed Central, Vol 21, Iss 1, Pp 1-16 (2020), ⟨10.1186/s12859-020-03855-1⟩
- Accession number :
- edsair.doi.dedup.....d6752c39ad5440b570253b3489a5ffe7
- Full Text :
- https://doi.org/10.21203/rs.3.rs-50810/v1