1. MVAR
- Author
Govindarajan Kunde-Ramamoorthy, Bahá El Kassaby, Francisco Castellanos, and Carol J. Bult
- Subjects
Annotation, dbSNP, Data access, Computer science, Ensembl, Human genome, Computational biology, Mouse Genome Informatics, JSON, Genome
- Abstract
Model organisms are essential to understanding the biological and disease consequences of human genome variation. Bioinformatics resources that support meaningful comparisons of mouse and human genotype-to-phenotype data and knowledge are needed to support the translation from bench to bedside and back again [1]. There is no genome variation resource for mouse comparable to the resources available for human genome variation data, such as ExAC [2], ClinVar [3], or ClinGen [4]. NCBI resources such as dbSNP and ClinVar no longer accept data from model organisms. The European Variation Archive (EVA) serves as a repository of SNP data for mouse, but it does not accept imputed variation data or the curated phenotype annotations associated with variation data that are central to data interpretation and analysis. Although the Mouse Genome Informatics database (MGI) [5] serves as a comprehensive mouse allele registry and curates information about the association of mouse variants with phenotypes and disease, the variation data in MGI are not currently available in a format consistent with the Human Genome Variation Society (HGVS) standards [6]. The Mouse Variation Registry (MVAR) will integrate all mouse genome variation data and includes processes to automatically canonicalize variants so that each is uniquely represented in the database, with comprehensive annotation and its distribution across strains. The starting dataset used as input to MVAR was downloaded in VCF format [7] (as a 42 GB gzipped file) from the Mouse Genomes Project [8] and contains about 81 million single-nucleotide variants (SNVs), ~9 million deletions, and ~8 million insertions. Other data will be obtained from MGI, the Mouse Mutant Repository Database (MMRDB), the Diversity Outbred Database (DODB), and computationally imputed SNP data. The MVAR data ingest workflow has been developed to normalize, prepare, and annotate input variation data.
With the help of the GATK framework [9], the first step of the pipeline normalizes the data, i.e., it left-aligns each variant and decomposes multi-allelic variants (rows of data that contain more than one variation). The next step uses the Ensembl Variant Effect Predictor (VEP) [10] to annotate the variation data with the corresponding HGVS nomenclature and any existing external ID. The final step uses the Jannovar library [11] to enrich the data with functional consequence annotations. After the data have been pre-processed through the pipeline, they are inserted into a MySQL database with the help of custom tools developed to create the canonical variant representations. MVAR supports programmatic access to the registry through an API for interoperability. This API is used by a user-friendly web application with rich user interfaces to query the database and display results. The API is also available as a resource for other services or applications over HTTP with JSON data payloads. Widely used industry frameworks such as Angular and Groovy Grails were leveraged to build the MVAR web application. To conclude, the lack of a comprehensive, annotated genome variation resource for mouse is a significant barrier to comparing variation and its biological consequences between mouse and human, and it limits the impact of many research and resource development programs. The MVAR project seeks to address this resource gap by bringing together investigators who have active projects in the area of genome variation in mouse, human, or both. Many of the investigators on this project have developed independent resources to curate or manage genome variation. This project aims to unify these efforts and build a common data resource. Future work will include the incorporation of structural variants into the MVAR registry.
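The decomposition and canonicalization ideas described in the abstract can be illustrated with a minimal Python sketch. This is not the MVAR implementation, which relies on GATK and custom tools. The names `Variant`, `decompose`, `trim`, `canonical_key`, and `register` are hypothetical, and the `chrom:pos:ref:alt` key format is an assumption chosen for illustration. The sketch only trims bases shared by REF and ALT; true left alignment also requires the reference genome sequence and is not attempted here.

```python
from typing import Dict, List, NamedTuple, Set


class Variant(NamedTuple):
    chrom: str
    pos: int   # 1-based position, as in VCF
    ref: str
    alt: str


def decompose(chrom: str, pos: int, ref: str, alts: str) -> List[Variant]:
    """Split a multi-allelic record (comma-separated ALT) into one variant per ALT."""
    return [Variant(chrom, pos, ref, alt) for alt in alts.split(",")]


def trim(v: Variant) -> Variant:
    """Drop bases shared by REF and ALT, keeping at least one base in each.

    Shared suffix bases are removed first, then shared prefix bases
    (advancing the position for every prefix base removed).
    """
    pos, ref, alt = v.pos, v.ref, v.alt
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return Variant(v.chrom, pos, ref, alt)


def canonical_key(v: Variant) -> str:
    """Hypothetical canonical identifier: one string per trimmed variant."""
    t = trim(v)
    return f"{t.chrom}:{t.pos}:{t.ref}:{t.alt}"


def register(registry: Dict[str, Set[str]], v: Variant, strain: str) -> None:
    """Store each canonical variant once and record the strains that carry it."""
    registry.setdefault(canonical_key(v), set()).add(strain)
```

For example, `decompose("1", 100, "CTT", "C,CT")` yields two variants whose keys are `1:100:CTT:C` and `1:100:CT:C`. Keying the registry on the trimmed form means that the same variant reported with different padding in two input files ends up as a single entry with both strains attached.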
- Published
- 2021