RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data
- Source :
- Bioinformatics
- Publication Year :
- 2011
- Publisher :
- Oxford University Press (OUP), 2011.
Abstract
- Summary: With the wide application of next-generation sequencing (NGS) techniques, fast tools for protein similarity search that scale well to large query datasets and large databases are highly desirable. In previous work, we developed RAPSearch, an algorithm that achieved a ~20–90-fold speedup relative to BLAST while retaining similar sensitivity for short protein fragments derived from NGS data. RAPSearch, however, needs a substantial memory footprint to identify alignment seeds because it uses a suffix array data structure. Here we present RAPSearch2, a new memory-efficient implementation of the RAPSearch algorithm that uses a collision-free hash table to index a similarity search database. The optimized data structure speeds up the similarity search by a further 2–3 times. We also implemented multi-threading in RAPSearch2, and the multi-thread modes achieve significant acceleration (e.g. 3.5X for 4-thread mode). RAPSearch2 requires up to 2 GB of memory when running in single-thread mode, or up to 3.5 GB of memory when running in 4-thread mode.
- Availability and implementation: Implemented in C++, the source code is freely available for download at the RAPSearch2 website: http://omics.informatics.indiana.edu/mg/RAPSearch2/.
- Contact: yye@indiana.edu
- Supplementary information: Available at the RAPSearch2 website.
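As a rough illustration of what a collision-free hash index over a protein database can look like, the sketch below maps every k-mer to its own slot by reading it as a base-20 number, so two different k-mers never share a slot. This is not RAPSearch2's implementation: the 20-letter alphabet, the seed length `K = 3`, and the names `kmer_key`, `build_index`, and `find_seeds` are assumptions made only for this example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (assumed alphabet)
CODE = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
K = 3  # seed length, chosen only for illustration


def kmer_key(kmer):
    """Map a k-mer to a unique integer in [0, 20**K) by reading it as a base-20 number."""
    key = 0
    for aa in kmer:
        key = key * len(AMINO_ACIDS) + CODE[aa]
    return key


def build_index(sequences):
    """Return a table with one slot per possible k-mer; each slot lists (seq_id, offset) hits."""
    table = [[] for _ in range(len(AMINO_ACIDS) ** K)]
    for seq_id, seq in enumerate(sequences):
        for offset in range(len(seq) - K + 1):
            kmer = seq[offset:offset + K]
            if all(aa in CODE for aa in kmer):  # skip k-mers with non-standard residues
                table[kmer_key(kmer)].append((seq_id, offset))
    return table


def find_seeds(table, query):
    """Yield (query_offset, seq_id, db_offset) for every exact k-mer match."""
    for q_off in range(len(query) - K + 1):
        kmer = query[q_off:q_off + K]
        if all(aa in CODE for aa in kmer):
            for seq_id, db_off in table[kmer_key(kmer)]:
                yield q_off, seq_id, db_off


database = ["MKVLAAGIV", "GIVKLA"]  # toy database, not real data
index = build_index(database)
seeds = list(find_seeds(index, "AGIVK"))
print(seeds)
```

Because the key is computed directly from the k-mer, a lookup is a single array access with no probing or chaining. The trade-off is a table with 20**K slots, which is why a direct-addressed scheme like this one is only practical for short seeds or a smaller alphabet.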
- Subjects :
- Statistics and Probability
Speedup
Source code
Computer science
Nearest neighbor search
Biochemistry
03 medical and health sciences
Search engine
Databases, Protein
Molecular Biology
030304 developmental biology
0303 health sciences
030306 microbiology
Gene Expression Profiling
Suffix array
High-Throughput Nucleotide Sequencing
Proteins
Data structure
Hash table
Computer Science Applications
Applications Note
Computational Mathematics
Computational Theory and Mathematics
Memory footprint
Programming Languages
Data mining
Sequence Analysis
Algorithms
Software
Details
- ISSN :
- 1460-2059 and 1367-4803
- Volume :
- 28
- Database :
- OpenAIRE
- Journal :
- Bioinformatics
- Accession number :
- edsair.doi.dedup.....0c029c7bbb0da8710afef28a33202851
- Full Text :
- https://doi.org/10.1093/bioinformatics/btr595