Back to Search Start Over

ESPRIT-Forest: Parallel clustering of massive amplicon sequence data in subquadratic time.

Authors :
Yunpeng Cai
Wei Zheng
Jin Yao
Yujie Yang
Volker Mai
Qi Mao
Yijun Sun
Source :
PLoS Computational Biology, Vol 13, Iss 4, p e1005518 (2017)
Publication Year :
2017
Publisher :
Public Library of Science (PLoS), 2017.

Abstract

The rapid development of sequencing technology has led to an explosive accumulation of genomic sequence data. Clustering is often the first step to perform in sequence analysis, and hierarchical clustering is one of the most commonly used approaches for this purpose. However, it is currently computationally expensive to perform hierarchical clustering of extremely large sequence datasets due to its quadratic time and space complexities. In this paper we developed a new algorithm called ESPRIT-Forest for parallel hierarchical clustering of sequences. The algorithm achieves subquadratic time and space complexity and maintains a high clustering accuracy comparable to the standard method. The basic idea is to organize sequences into a pseudo-metric based partitioning tree for sub-linear time searching of nearest neighbors, and then use a new multiple-pair merging criterion to construct clusters in parallel using multiple threads. The new algorithm was tested on the human microbiome project (HMP) dataset, currently one of the largest published microbial 16S rRNA sequence dataset. Our experiment demonstrated that with the power of parallel computing it is now compu- tationally feasible to perform hierarchical clustering analysis of tens of millions of sequences. The software is available at http://www.acsu.buffalo.edu/∼yijunsun/lab/ESPRIT-Forest.html.

Subjects

Subjects :
Biology (General)
QH301-705.5

Details

Language :
English
ISSN :
1553734X and 15537358
Volume :
13
Issue :
4
Database :
Directory of Open Access Journals
Journal :
PLoS Computational Biology
Publication Type :
Academic Journal
Accession number :
edsdoj.71b7e187f314d239e27493243bfdb71
Document Type :
article
Full Text :
https://doi.org/10.1371/journal.pcbi.1005518