In this issue, Krishna et al [1] present an integrated analysis of RNA-seq and mass spectrometry data to improve genome annotation in the closely related protozoan parasites Toxoplasma gondii and Neospora caninum. They build new RNA-seq-derived gene models and provide the first global proteomic dataset for N. caninum. The resulting genome annotations cover 35% and 17% of the T. gondii and N. caninum predicted proteomes, respectively. Furthermore, this analysis led to the identification of a significant number of novel protein-coding genes that are absent from current annotations.

The genomes of many important human pathogens have now been sequenced, providing an essential resource for research. For most of these organisms, however, there is a very limited understanding of their encoded transcripts and proteins. Accurate genome annotations are essential for many approaches, particularly genetic studies and the construction of databases for proteomics. Two ab initio prediction programs for eukaryotic genes are widely used to annotate genomic sequence: TigrScan and GlimmerHMM [2]. While tools such as these are essential for the prediction of gene models, inaccuracies in these models are common and create significant issues for global proteomic studies of many organisms. Gene models can have an incorrect or missing start site or wrong intron or exon boundaries, and a novel gene may not be predicted at all by such approaches. Furthermore, information on alternative splicing is lacking from most annotations. The accuracy of genome annotations is unclear and varies from organism to organism. Proteogenomics, the integration of proteomic, transcriptomic and genomic data, can be a powerful approach to improving genome annotation and identifying novel genes.

The first efforts to sequence the T. gondii genome were performed by shotgun sequencing and EST assembly [3]. Strains representing the main lineages of T.
gondii have been sequenced, providing critically important data for understanding the biology of this ubiquitous pathogen. The most recent annotations of the T. gondii and N. caninum genomes [4] are maintained by ToxoDB.org as part of the Eukaryotic Pathogen Database Resource Center (EuPathDB) [5], an important resource for the Apicomplexa community. Over 8000 genes are currently annotated in the draft T. gondii genome; these were originally annotated using conventional computational algorithms (including TigrScan, Twinscan and GlimmerHMM) [3,6]. While such tools have been useful for predicting T. gondii genes, the different algorithms on which they are based predict different gene models, which has led to uncertainty about the accuracy of these predictions [7]. By comparing T. gondii gene annotations generated from TigrScan and GlimmerHMM with proteomics and EST data, Dybas et al calculated a false negative rate for these gene models of up to 41% [8], illustrating the problems inherent in gene annotation based on the analytical programs available at the time that paper was published.

Gene models can be significantly improved by combining experimental data with existing annotations. T. gondii gene models are continuously reassessed by semi-automated reannotation using experimental data, or by manual curation [3,6]. Proteomics has played an important role in shaping the current T. gondii genome annotations, amounting to at least 68% coverage of the predicted proteome [9]. Proteomic data can be used to validate gene annotations, and are also a resource for identifying new open reading frames and novel proteins. A global proteomic study of T. gondii tachyzoites, performed by Xia et al [10], provided coverage of 27% of the predicted proteome, and was the first study to use mass spectrometry data to validate gene models in T. gondii.
Integration of these data with EST information led to the validation of 91% of the proteins in the proteome, arguing that transcriptomic data can be used to validate proteomic datasets. A subsequent study by Che et al [11] that employed three proteomic strategies (LC-MS/MS, TLSGE MudPIT, and BDAP LC-MS/MS) identified 2241 T. gondii proteins that were classified into 841 protein clusters. For analysis, they employed a hypothetical T. gondii proteome based on a combination of computationally predicted proteins from TigrScan, TwinScan, GlimmerHMM and ToxoDB Release 6.0, together with the available experimental T. gondii sequences from the NCBI nonredundant protein database. The experimental proteomic data confirmed valid predictions that were unique to each computational model.

Next generation sequencing has greatly improved the quality of transcriptional information that can be obtained. It can provide information on alternative splicing and intron-exon boundaries, and can lead to the identification of novel transcripts. In their study, Krishna et al queried mass spectrometry data against RNA-seq-derived gene models for T. gondii and N. caninum [1], leading to the identification of loci that were not present in current genome annotations and indicating that RNA-seq is a valuable tool for the validation of genome annotation models. Furthermore, Krishna et al introduced an RNA-seq-compliant version of CRAIG, a tool for gene model generation that is an alternative to TigrScan and other widely used genome annotation algorithms [12].

Sequencing-based studies such as these are beginning to play an important role in genome annotation in T. gondii. RNA-seq has been used to perform de novo assembly in the ME49 strain [13], which led to the identification of over 2000 transcripts that did not correspond to any previously annotated gene. In addition, this work provided information on alternative splicing, which was previously uncharacterised in T. gondii.
TSS-seq has also been used to profile the 5′ UTRs of genes and determine transcriptional start site locations [14]. In addition, a recent study used strand-specific RNA-seq to explore untranslated regions in T. gondii and N. caninum [15], which led to the identification of putative antisense transcripts and long noncoding RNAs. With RNA-seq data on different T. gondii life cycle stages and in various strains now available [16,17], it may be possible to further increase coverage of genome annotations and mine these datasets for novel transcripts that are stage- and/or strain-specific.

Considering the value that proteomic and next generation sequencing information has had for gene annotation in T. gondii, Krishna et al have combined both types of data for a more powerful proteogenomic analysis [1]. This resulted in a significant improvement in the genome annotation, providing the greatest coverage of the predicted proteome compared with previous studies on these organisms that used other models.

The genomes of many Apicomplexan parasites and other pathogens have been sequenced, providing an important resource for researchers; however, the annotations are far from complete. Combinatorial approaches such as those used by Krishna et al can provide important validation of computationally predicted annotations. Incorporation of proteogenomics into genome annotation pipelines as standard practice is likely to be highly useful in the generation of accurate gene models in the future. To this end, the current version of ToxoDB.org is actively curated, employing heuristic gene prediction methods that incorporate experimental datasets such as proteomic and transcriptomic data to improve gene annotations. This represents a significant model for gene annotation and illustrates the importance of maintaining active curation efforts to improve and maintain the utility of these critical scientific community resources.
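
As a concrete illustration of the peptide-level reasoning behind the comparisons discussed above, the following minimal Python sketch computes two quantities for a toy dataset: the fraction of a predicted proteome supported by at least one identified peptide, and the peptides that match RNA-seq-derived gene models but no protein in the reference annotation. It is not the pipeline used by Krishna et al or by any study cited here. The function names (`proteins_hit`, `coverage`, `novel_peptides`) and all sequences are hypothetical, and exact substring matching is assumed in place of a real database search with scoring and false discovery rate control.

```python
from typing import Dict, Iterable, List, Set


def proteins_hit(peptides: Iterable[str], proteome: Dict[str, str]) -> Set[str]:
    """Return IDs of proteins that contain at least one peptide as an exact substring."""
    hits: Set[str] = set()
    for pep in peptides:
        for protein_id, seq in proteome.items():
            if pep in seq:
                hits.add(protein_id)
    return hits


def coverage(peptides: Iterable[str], proteome: Dict[str, str]) -> float:
    """Fraction of proteins in `proteome` supported by at least one peptide."""
    if not proteome:
        return 0.0
    return len(proteins_hit(list(peptides), proteome)) / len(proteome)


def novel_peptides(
    peptides: Iterable[str],
    reference: Dict[str, str],
    alternative: Dict[str, str],
) -> List[str]:
    """Peptides found in the alternative gene models but in no reference protein."""
    novel = []
    for pep in peptides:
        in_reference = any(pep in seq for seq in reference.values())
        in_alternative = any(pep in seq for seq in alternative.values())
        if in_alternative and not in_reference:
            novel.append(pep)
    return novel


# Toy, made-up sequences (assumptions for illustration only).
reference = {
    "ref_1": "MKTAYIAKQR",
    "ref_2": "MSLLNDEGWK",
    "ref_3": "MPPQRSTVLK",
    "ref_4": "MGGHHWYFCK",
}
rnaseq_models = {
    "rna_1": "MKTAYIAKQR",  # agrees with ref_1
    "rna_2": "MAEELVNCDR",  # locus absent from the reference annotation
}
peptides = ["TAYIAK", "LLNDEGWK", "EELVNCDR"]

print("reference coverage:", coverage(peptides, reference))
print("RNA-seq model coverage:", coverage(peptides, rnaseq_models))
print("peptides supporting novel loci:", novel_peptides(peptides, reference, rnaseq_models))
```

In this toy case, two of the four reference proteins are supported by peptides, and one peptide maps only to an RNA-seq-derived model, which is the kind of evidence that would flag a candidate novel protein-coding locus for further curation. A real analysis would additionally need to handle isobaric residues, shared peptides and statistical control of false identifications.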