1. PatMatch: a program for finding patterns in peptide and nucleotide sequences
- Author
- Danny Yoo, Seung Y. Rhee, Lukas A. Mueller, Thomas Yan, Dan C. Weems, J. Michael Cherry, Shuai Weng, and Tanya Z. Berardini
- Subjects
- 0106 biological sciences; DNA, Plant; Arabidopsis; Biology; Bioinformatics; computer.software_genre; 01 natural sciences; Article; User-Computer Interface; 03 medical and health sciences; Software; Sequence Analysis, Protein; Genetics; Code (cryptography); Literal (computer programming); Regular expression; 030304 developmental biology; computer.programming_language; Internet; 0303 health sciences; Sequence; Arabidopsis Proteins; business.industry; Programming language; Wildcard character; Sequence Analysis, DNA; computer.file_format; Approximate string matching; Perl; Peptides; business; computer; 010606 plant biology & botany
- Abstract
Here, we present PatMatch, an efficient, web-based pattern-matching program that enables searches for short nucleotide or peptide sequences such as cis-elements in nucleotide sequences or small domains and motifs in protein sequences. The program can be used to find matches to a user-specified sequence pattern that can be described using ambiguous sequence codes and a powerful and flexible pattern syntax based on regular expressions. A recent upgrade has improved performance and now supports both mismatches and wildcards in a single pattern. This enhancement has been achieved by replacing the previous searching algorithm, scan_for_matches [D’Souza et al. (1997), Trends in Genetics, 13, 497–498], with nondeterministic-reverse grep (NR-grep), a general pattern matching tool that allows for approximate string matching [Navarro (2001), Software Practice and Experience, 31, 1265–1312]. We have tailored NR-grep to be used for DNA and protein searches with PatMatch. The stand-alone version of the software can be adapted for use with any sequence dataset and is available for download at The Arabidopsis Information Resource (TAIR) at ftp://ftp.arabidopsis.org/home/tair/Software/Patmatch/. The PatMatch server is available on the web at http://www.arabidopsis.org/cgi-bin/patmatch/nph-patmatch.pl for searching Arabidopsis thaliana sequences.

INTRODUCTION

PatMatch is designed to find short (3–30 nt or amino acid) sequence matches. It can be useful for finding short patterns in nucleotide sequences, such as cis-elements, massively parallel signature sequence (MPSS), instances of known serial analysis of gene expression (SAGE) tags and small RNA binding sites, or small protein domains and motifs in protein sequences. PatMatch uses a short sequence or regular expression as input and allows ambiguous sequences to be represented by standard International Union of Pure and Applied Chemistry (IUPAC; http://www.chem.qmw.ac.uk/iupac) nomenclature. The program also allows inexact matching (mismatches) of the query sequence against literal or regular expression patterns. PatMatch requires users to explicitly enter a pattern to search for and is not meant for de novo pattern detection.

The original version of PatMatch was provided by the Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) (1) and was modified to be deployed at The Arabidopsis Information Resource (2,3). In this paper, we report on changes we made to the software to improve performance and support for mismatches when using wildcards in the query sequence by using Nondeterministic-Reverse grep (NR-grep) (4). In addition, the Common Gateway Interface (CGI) code has been restructured, and the auxiliary programs that displayed the results, which were written in C, were rewritten in Perl and modularized to facilitate maintenance and future extension. This new version of PatMatch is available at TAIR and is also available from SGD at http://db.yeastgenome.org/cgi-bin/PATMATCH/nph-patmatch.
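
To make the idea of ambiguity codes and mismatch-tolerant matching concrete, here is a minimal Python sketch. It is not PatMatch's code and does not implement the NR-grep algorithm: it assumes a plain regular-expression translation of IUPAC nucleotide codes for exact searches and a naive sliding-window scan for mismatches. The function names (`iupac_to_regex`, `find_exact`, `find_with_mismatches`) and the example sequence are hypothetical, chosen only for illustration.

```python
import re

# IUPAC nucleotide ambiguity codes mapped to the bases each one stands for.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def iupac_to_regex(pattern):
    """Translate an IUPAC nucleotide pattern into a regular expression string."""
    parts = []
    for code in pattern.upper():
        bases = IUPAC_DNA[code]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


def find_exact(pattern, sequence):
    """Return 0-based start positions of all exact matches, overlaps included."""
    # A lookahead lets the regex engine report overlapping occurrences.
    regex = re.compile("(?=" + iupac_to_regex(pattern) + ")")
    return [m.start() for m in regex.finditer(sequence.upper())]


def find_with_mismatches(pattern, sequence, max_mismatches):
    """Naive sliding-window search (illustrative only, not NR-grep).

    A position is a mismatch when the sequence base is not among the
    bases allowed by the pattern's IUPAC code at that position.
    Returns a list of (start, mismatch_count) tuples.
    """
    pattern = pattern.upper()
    sequence = sequence.upper()
    hits = []
    for start in range(len(sequence) - len(pattern) + 1):
        mismatches = 0
        for offset, code in enumerate(pattern):
            if sequence[start + offset] not in IUPAC_DNA[code]:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many mismatches, abandon this window
        else:
            hits.append((start, mismatches))
    return hits


if __name__ == "__main__":
    seq = "TTCACGTGAACATGTGCC"  # made-up example sequence
    print(iupac_to_regex("CANNTG"))
    print(find_exact("CANNTG", seq))
    print(find_with_mismatches("CACGTG", seq, 1))
```

The sliding-window scan compares every window against the whole pattern, which is adequate for a demonstration but is presumably far slower than the approximate-matching approach the authors adopted for searches over full sequence datasets.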
- Published
- 2005