High Performance Merging of Massive Data from Genome-Wide Association Studies
- Source :
- Data Management and Analytics for Medicine and Healthcare ISBN: 9783319671857, DMAH@VLDB
- Publication Year :
- 2017
- Publisher :
- Springer International Publishing, 2017.
Abstract
- Traditional single-machine data processing methods scale poorly and run inefficiently when performing the ordered full outer join needed to merge large numbers of individual Genome-Wide Association Study (GWAS) datasets. Although big data platforms such as Hadoop and Spark shed light on this problem, the inefficiency of keeping data in total sorted order and the problem of workload imbalance limit their performance. In this study, we designed and compared three new methodologies, based on MapReduce, HBase, and Spark respectively, to merge hundreds of individual VCF files by Single Nucleotide Polymorphism (SNP) location into a single TPED file. Our methodologies overcame the limitations stated above and considerably improved performance, with good scalability in both input size and computing resources.
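The ordered full outer join named in the abstract can be illustrated with a small single-machine sketch. This is not the authors' MapReduce, HBase, or Spark method; it only shows the merge semantics those methods have to preserve. The function name `merge_sorted_genotypes`, the `(chrom, pos, genotype)` input tuples, the integer chromosome codes, the `chrom:pos` SNP identifier, and the use of `0 0` for an individual with no record at a position are all assumptions made for illustration.

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Assumption: "0 0" marks a genotype that is absent for an individual.
MISSING = "0 0"


def merge_sorted_genotypes(samples):
    """Ordered full outer join of per-individual genotype streams.

    `samples` is a list of iterables, one per individual. Each yields
    (chrom, pos, genotype) tuples already sorted by (chrom, pos), with
    chrom and pos given as integers (hypothetical input layout).
    Yields TPED-like rows: chrom, SNP id, genetic distance, position,
    then one genotype column per individual.
    """
    def tag(stream, idx):
        # Attach the individual's column index to every record.
        for chrom, pos, genotype in stream:
            yield (chrom, pos, idx, genotype)

    tagged = [tag(stream, idx) for idx, stream in enumerate(samples)]

    # k-way merge keeps the global (chrom, pos) order without a full re-sort.
    merged = heapq.merge(*tagged)

    # Records sharing a location become one output row.
    for (chrom, pos), group in groupby(merged, key=itemgetter(0, 1)):
        row = [MISSING] * len(samples)
        for _, _, idx, genotype in group:
            row[idx] = genotype
        yield f"{chrom} {chrom}:{pos} 0 {pos} " + " ".join(row)
```

Every location seen in any input appears exactly once in the output, in sorted order, which is what makes the join both "full outer" and "ordered". A distributed version additionally has to partition the key space across workers, which is where the total-ordering and workload-balance issues mentioned in the abstract arise.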
- Subjects :
- Computer science
business.industry
Big data
020206 networking & telecommunications
Genome-wide association study
02 engineering and technology
computer.software_genre
020204 information systems
Spark (mathematics)
Scalability
0202 electrical engineering, electronic engineering, information engineering
Data mining
business
computer
Merge (version control)
Details
- ISBN :
- 978-3-319-67185-7
- ISBNs :
- 9783319671857
- Database :
- OpenAIRE
- Journal :
- Data Management and Analytics for Medicine and Healthcare ISBN: 9783319671857, DMAH@VLDB
- Accession number :
- edsair.doi...........c91f42bc4cf3e6715988977959634a9b
- Full Text :
- https://doi.org/10.1007/978-3-319-67186-4_4