Understanding the Causes of Errors in Eukaryotic Protein-coding Gene Prediction: A Case Study of Primate Proteomes

Authors :: Nicolas Scalzitti
Corentin Meyer
Julie D. Thompson
Pierre Collet
Anne Jeannin-Girardon
Olivier Poch
Laboratoire des sciences de l'ingénieur, de l'informatique et de l'imagerie (ICube)
Institut National des Sciences Appliquées - Strasbourg (INSA Strasbourg)
Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Université de Strasbourg (UNISTRA)-Centre National de la Recherche Scientifique (CNRS)-École Nationale du Génie de l'Eau et de l'Environnement de Strasbourg (ENGEES)-Réseau nanophotonique et optique
Centre National de la Recherche Scientifique (CNRS)-Université de Strasbourg (UNISTRA)-Université de Haute-Alsace (UHA) Mulhouse - Colmar (Université de Haute-Alsace (UHA))-Centre National de la Recherche Scientifique (CNRS)-Université de Strasbourg (UNISTRA)-Université de Haute-Alsace (UHA) Mulhouse - Colmar (Université de Haute-Alsace (UHA))-Matériaux et nanosciences d'Alsace (FMNGE)
Institut de Chimie du CNRS (INC)-Université de Strasbourg (UNISTRA)-Université de Haute-Alsace (UHA) Mulhouse - Colmar (Université de Haute-Alsace (UHA))-Institut National de la Santé et de la Recherche Médicale (INSERM)-Centre National de la Recherche Scientifique (CNRS)-Institut de Chimie du CNRS (INC)-Université de Strasbourg (UNISTRA)-Institut National de la Santé et de la Recherche Médicale (INSERM)-Centre National de la Recherche Scientifique (CNRS)
École Nationale du Génie de l'Eau et de l'Environnement de Strasbourg (ENGEES)-Université de Strasbourg (UNISTRA)-Institut National des Sciences Appliquées - Strasbourg (INSA Strasbourg)
Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Institut National de Recherche en Informatique et en Automatique (Inria)-Les Hôpitaux Universitaires de Strasbourg (HUS)-Centre National de la Recherche Scientifique (CNRS)-Matériaux et Nanosciences Grand-Est (MNGE)
Université de Strasbourg (UNISTRA)-Université de Haute-Alsace (UHA) Mulhouse - Colmar (Université de Haute-Alsace (UHA))-Institut National de la Santé et de la Recherche Médicale (INSERM)-Institut de Chimie du CNRS (INC)-Centre National de la Recherche Scientifique (CNRS)-Université de Strasbourg (UNISTRA)-Université de Haute-Alsace (UHA) Mulhouse - Colmar (Université de Haute-Alsace (UHA))-Institut National de la Santé et de la Recherche Médicale (INSERM)-Institut de Chimie du CNRS (INC)-Centre National de la Recherche Scientifique (CNRS)-Réseau nanophotonique et optique
Université de Strasbourg (UNISTRA)-Université de Haute-Alsace (UHA) Mulhouse - Colmar (Université de Haute-Alsace (UHA))-Centre National de la Recherche Scientifique (CNRS)-Université de Strasbourg (UNISTRA)-Centre National de la Recherche Scientifique (CNRS)
Thompson, Julie
Source :: BMC Bioinformatics, BioMed Central, 2020, Vol 21, Iss 1, Pp 1-16, ⟨10.1186/s12859-020-03855-1⟩
Publication Year :: 2020
Publisher :: Research Square Platform LLC, 2020.
Abstract:
Background: Recent advances in sequencing technologies have led to an explosion in the number of genomes available, but accurate genome annotation remains a major challenge. The prediction of protein-coding genes in eukaryotic genomes is especially problematic, due to their complex exon–intron structures. Even the best eukaryotic gene prediction algorithms can make serious errors that will significantly affect subsequent analyses.
Results: We first investigated the prevalence of gene prediction errors in a large set of 176,478 proteins from ten primate proteomes available in public databases. Using the well-studied human proteins as a reference, a total of 82,305 potential errors were detected, including 44,001 deletions, 27,289 insertions and 11,015 mismatched segments where part of the correct protein sequence is replaced with an alternative erroneous sequence. We then focused on the mismatched sequence errors that cause particular problems for downstream applications. A detailed characterization allowed us to identify the potential causes for the gene misprediction in approximately half (5446) of these cases. As a proof-of-concept, we also developed a simple method which allowed us to propose improved sequences for 603 primate proteins.
Conclusions: Gene prediction errors in primate proteomes affect up to 50% of the sequences. Major causes of errors include undetermined genome regions, genome sequencing or assembly issues, and limitations in the models used to represent gene exon–intron structures. Nevertheless, existing genome sequences can still be exploited to improve protein sequence quality. Perspectives of the work include the characterization of other types of gene prediction errors, as well as the development of a more comprehensive algorithm for protein sequence error correction.

Subjects :: Primates
Proteome
Gene prediction
Receptor-Like Protein Tyrosine Phosphatases
Computational biology
Biology
lcsh:Computer applications to medicine. Medical informatics
Biochemistry
Genome
DNA sequencing
Open Reading Frames
03 medical and health sciences
0302 clinical medicine
Protein sequencing
Structural Biology
[SDV.BBM.GTP]Life Sciences [q-bio]/Biochemistry, Molecular Biology/Genomics [q-bio.GN]
Animals
Humans
Amino Acid Sequence
Error correction
Databases, Protein
lcsh:QH301-705.5
Molecular Biology
Gene
030304 developmental biology
Sequence (medicine)
0303 health sciences
Applied Mathematics
Protein sequence errors
Genome project
Computer Science Applications
Mutagenesis, Insertional
lcsh:Biology (General)
lcsh:R858-859.7
Sequence Alignment
Gene Deletion
030217 neurology & neurosurgery
Research Article
Genome annotation

Details

ISSN :: 14712105
Database :: OpenAIRE
Journal :: BMC Bioinformatics, BioMed Central, 2020, Vol 21, Iss 1, Pp 1-16, ⟨10.1186/s12859-020-03855-1⟩
Accession number :: edsair.doi.dedup.....d6752c39ad5440b570253b3489a5ffe7
Full Text :: https://doi.org/10.21203/rs.3.rs-50810/v1
