100 results on '"Biosequence"'
Search Results
2. VLSI Implementation of Smith–Waterman Algorithm for Biological Sequence Scanning
- Author
-
Rajalakshmi, K., Nivedita, R., and Nath, Vijay, editor
- Published
- 2018
- Full Text
- View/download PDF
3. Peatland in Malaysia
- Author
-
Melling, Lulie, Osaki, Mitsuru, editor, and Tsuji, Nobuyuki, editor
- Published
- 2016
- Full Text
- View/download PDF
4. Soil chronosequence and biosequence on old lake sediments of the Burdur Lake in Turkey
- Author
-
Yakun Zhang, Sevda Altunbaş, Mustafa Sari, Alfred E. Hartemink, and Gafur Gözükara
- Subjects
chemistry.chemical_classification, Topsoil, Pedogenesis, Horizon (archaeology), chemistry, Chronosequence, Biosequence, Soil Science, Environmental science, Soil horizon, Soil science, Organic matter, Vegetation
- Abstract
The Burdur Lake is located in the southwest of Turkey, and its area has decreased by 40% from 211 km² in 1975 to 126 km² in 2019. In this study, we investigated how the soil has changed in the lacustrine material. Three soil profiles were sampled from the former lakebed (chronosequence profiles: P1, 2007; P2, 1994; and P3, 1975), and three soil profiles under different land use types (biosequence profiles: P4, native forest vegetation; P5, agriculture; and P6, lakebed) were sampled. The chronosequence and biosequence soil profiles represented various distances from the Burdur Lake and showed different stages of lacustrine evolution. Soil electrical conductivity (EC; 18.1 to 0.4 dS m⁻¹), exchangeable Na⁺ (34.7 to 1.4 cmol kg⁻¹) and K⁺ (0.61 to 0.56 cmol kg⁻¹), and water-soluble Cl⁻ (70.3 to 2.1 cmol L⁻¹) and SO₄²⁻ (275.9 to 25.0 cmol L⁻¹) decreased with increasing distance from the Burdur Lake, whereas the A horizon thickness (10 to 48 cm), structure formation (0 to 48 cm), gleization-oxidation depth (0 to 79 cm), and montmorillonite and organic matter (OM; 25.9 to 46.0 g kg⁻¹) contents increased in the chronosequence soil profiles. The formation of P3 in the chronosequence and P5 in the biosequence soil profiles increased due to longer exposure to pedogenic processes (time, land use, vegetation, etc.). Changes in EC, exchangeable cation (Na⁺ and K⁺) and water-soluble anion (Cl⁻ and SO₄²⁻) concentrations of the salt-enriched horizon, OM, gleization-oxidation depth, A horizon thickness, and structure formation of the chronosequence and biosequence soil profiles (especially the topsoil horizon) were highly related to the distance from the Burdur Lake, time, and land use.
- Published
- 2021
- Full Text
- View/download PDF
5. The role of oak species in long-term soil P loss in a humid river bottomland.
- Author
-
Stinchcomb, Gary E., El Masri, Bassil, and Ferguson, Benedict
- Subjects
*SOIL erosion, *FOREST soils, *OAK, *ACID soils, *SOIL testing, *WILDLIFE refuges
- Abstract
• Replicated experiment shows whole-soil P loss differs by forest type and drought tolerance.
• Greater ET in post oak forests causes positive feedback in P leaching.
• The coating of Fe-Mn concretions with smectite may drive greater P loss in soils.
There is a gap in our understanding of if and how bottomland forest type will affect long-term nutrient cycling and loss. This study aims to determine how different forests affect soil hydrologic variability and whole-soil P loss in a humid-subtropical setting. We used replicate-sampling and measured soil physical, chemical, and mineralogical properties at 12 sites in two forest ecosystems, post oak (Quercus stellata) and cherry bark oak (Quercus pagoda) in Clarks River National Wildlife Refuge in Western Kentucky. We hypothesize that wetting–drying events in redox soils of forested bottomlands can cause positive feedback in whole-soil P loss. Trees with greater P demand (e.g., post oak) take up more water creating more frequent and pronounced episodes in swelling and shrinking of expandable clays. Repeated swelling and shrinking of clays occlude the surface of Fe-Mn oxides from further adsorption of P in acidic soil. This can lead to greater loss or plant uptake of available P. Our results show (i) a significant difference in mean whole-soil P loss between the oak species with more loss in soils underlying the post oak forest, and (ii) a difference in the total P found in the sap- and heartwood of the two oak species. Soil analysis reveals that the clay mineralogy of the post and cherry bark oak sites are similar, and thus, may only play a minor role in governing the whole-soil P loss difference. However, the leaf data analysis suggests that the post oak site could be P and nitrogen-limited, while the cherry bark is only nitrogen limited. Our study shows that differences in the oak forest ecosystem may affect the long-term balance in water and nutrient uptake and may alter the redistribution of nutrients in the canopy and the underlying soils. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
6. PEER: A direct method for biosequence pattern mining through waits of optimal k-mers
- Author
-
Balaram Bhattacharyya, Uddalak Mitra, and Tathagato Mukhopadhyay
- Subjects
Information Systems and Management, Phylogenetic tree, Computer science, Direct method, 05 social sciences, Feature extraction, Biosequence, 050301 education, 02 engineering and technology, Computer Science Applications, Theoretical Computer Science, Euclidean distance, Artificial Intelligence, Control and Systems Engineering, Phylogenetics, 0202 electrical engineering, electronic engineering, information engineering, Entropy (information theory), 020201 artificial intelligence & image processing, Clade, 0503 education, Time complexity, Algorithm, Software
- Abstract
Achieving accuracy of alignment-based methods at linear time complexity is desirable for biosequence studies. k-mer statistics is the principal alternative, but selecting the optimal k is crucial for best feature extraction. Prevalent methods require successive trials upon incrementing k for best match with a reference phylogeny tree. We observe that successive intervals (or waits) of optimal length k-mers contain precise information of the sequence such that feature extraction is possible from entropies of the waits. We introduce a method, Pattern Extraction through Entropy Retrieval (PEER), that transforms a sequence into a vector of wait entropies of optimal k-mers. Distance between a pair of sequences amounts to the Euclidean Distance between their wait vectors. We present an analytical determination of optimal k from maximality of total wait entropy. This makes PEER free from the usual multiple trials for obtaining optimal k. We conduct experiments on several benchmark datasets of omics clades for phylogeny analysis and perform an in-depth comparison against seven state-of-the-art alignment-free methods. Phylogeny tree from PEER distance closely resembles the corresponding biological taxonomy and achieves the best Robinson-Foulds score. PEER can sense small artificial mutations within a sequence. It is highly scalable with linear time complexity, exceptionally useful for comparing long sequences.
- Published
- 2020
- Full Text
- View/download PDF
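The wait-entropy idea in the PEER abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `wait_entropies` and `wait_distance` are hypothetical, k is passed in by hand (the paper's analytical choice of the optimal k is not reproduced), and k-mers missing from one sequence are assumed to contribute an entropy of 0.

```python
import math
from collections import Counter, defaultdict


def wait_entropies(seq, k):
    """Map each k-mer to the Shannon entropy (in bits) of its waits.

    A "wait" is taken here to be the gap between the start positions of
    two successive occurrences of the same k-mer (an assumption).
    """
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)

    entropies = {}
    for kmer, pos in positions.items():
        waits = [b - a for a, b in zip(pos, pos[1:])]
        if not waits:
            # A k-mer seen only once has no waits; treat its entropy as 0.
            entropies[kmer] = 0.0
            continue
        counts = Counter(waits)
        total = len(waits)
        entropies[kmer] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropies


def wait_distance(seq_a, seq_b, k):
    """Euclidean distance between the wait-entropy vectors of two sequences."""
    ea, eb = wait_entropies(seq_a, k), wait_entropies(seq_b, k)
    keys = set(ea) | set(eb)
    return math.sqrt(sum((ea.get(x, 0.0) - eb.get(x, 0.0)) ** 2 for x in keys))
```

Each sequence is scanned once to collect k-mer positions, which is in line with the linear time complexity the abstract claims, although this sketch makes no attempt at the paper's accuracy.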
7. Spatial patterns of total and available N and P at alpine treeline.
- Author
-
Liptzin, Daniel, Sanford, Robert, and Seastedt, Timothy
- Subjects
*TIMBERLINE, *PLANTS, *SOILS, *FOREST canopies, *ECOSYSTEM dynamics
- Abstract
Background and aims: Vegetation can have direct and indirect effects on soil nutrients. To test the effects of trees on soils, we examined the patterns of soil nutrients and nutrient ratios at two spatial scales: at sites spanning the alpine tundra/subalpine forest ecotone (ecotone scale), and beneath and beyond individual tree canopies within the transitional krummholz zone (tree scale). Methods: Soils were collected and analyzed for total carbon (C), nitrogen (N), and phosphorus (P) as well as available N and P on Niwot Ridge in the Colorado Rocky Mountains. Results: Total C, N, and P were higher in the krummholz zone than the forest or tundra. Available P was also greatest in the krummholz zone while available N increased from the forest to the tundra. Throughout the krummholz zone, total soil nutrients and available P were higher downwind compared to upwind of trees. Conclusions: The krummholz zone in general, and downwind of krummholz trees in particular, are zones of nutrient accumulation. This pattern indicates that the indirect effects of trees on soils are more important than the direct effects. The higher N:P ratios in the tundra suggest nutrient dynamics differ from the lower elevation sites. We propose that evaluating soil N and P simultaneously in soils may provide a robust assay of ecosystem nutrient limitation. [ABSTRACT FROM AUTHOR]
- Published
- 2013
- Full Text
- View/download PDF
8. Regional and local patterns of soil nutrients at Rocky Mountain treelines
- Author
-
Liptzin, Daniel and Seastedt, Timothy R.
- Subjects
*SOILS & nutrition, *TIMBERLINE, *MOUNTAIN plants, *NITROGEN, *EOLIAN processes, *LEAD, *SEDIMENTATION & deposition
- Abstract
Abstract: The soils across treeline should vary because of direct effects of biological differences of coniferous subalpine forest and the herbaceous alpine tundra in Colorado. In addition, the change in life form may indirectly affect soils because of interactions of the vegetation and wind-driven deposition processes. This is particularly important as nitrogen (N) saturation is a growing concern in high elevation ecosystems, and treeline is predicted to be a deposition hotspot. The vegetation transition at treeline provides an opportunity to test the effects of vegetation, topography, and external inputs on soils at three spatial scales. First, a regional evaluation of soils at eleven abrupt treeline sites was made comparing sites on east and west aspects both east and west of the Continental Divide (CD). Second, soils were compared in the adjacent forest and tundra. Finally, edge effects were assessed along transects spanning treeline. At the regional scale, total soil N was higher east of the CD and on east aspects while exchangeable calcium was higher on east aspects and at sites west of the CD. Higher lead (Pb) concentration in the forest organic horizon was associated with lower 206Pb/207Pb ratios, an indication of greater anthropogenic Pb inputs. However, the spatial pattern in soil Pb suggests a different source area or transport mechanism than N. Within individual sites, the soils differed between the forest and tundra in almost every measured variable, but edge effects were minimal on both sides of these abrupt treelines. While a direct link between the observed soil patterns to deposition of external inputs cannot be made based on the study design, the observed soil patterns suggest that the impacts of acid deposition are amplified or attenuated by processes such as dust deposition. [ABSTRACT FROM AUTHOR]
- Published
- 2010
- Full Text
- View/download PDF
9. Biosequence
- Author
-
Chesworth, Ward, editor
- Published
- 2008
- Full Text
- View/download PDF
10. Intellectual property management of biosequence information from a patent searching perspective
- Author
-
Yoo, Heahyun, Ramanathan, Chandra, and Barcelon-Yang, Cynthia
- Subjects
*INTELLECTUAL property, *BIOTECHNOLOGY, *PATENT searching, *CLASSIFICATION of patents
- Abstract
Abstract: With recent advances in biotech research tools, the use of biosequence information has greatly facilitated the R&D process in the pharmaceutical and other life sciences industries. Concurrently, it has presented substantial challenges in the management of biosequence related intellectual property, due to the dramatic increase in the amount of sequence data, the nature of claims on sequences as well as limitations in available database resources for sequence analysis and searches. This paper discusses some of these challenges in-depth, and suggests ways to alleviate difficulties associated with conducting sequence-based patent information searches. [Copyright Elsevier]
- Published
- 2005
- Full Text
- View/download PDF
11. Evolution of biosequence search algorithms: a brief survey
- Author
-
Gregory Kucherov, Laboratoire d'Informatique Gaspard-Monge (LIGM), Centre National de la Recherche Scientifique (CNRS)-Fédération de Recherche Bézout-ESIEE Paris-École des Ponts ParisTech (ENPC)-Université Paris-Est Marne-la-Vallée (UPEM), Skolkovo Institute of Science and Technology [Moscow] (Skoltech), and Université Paris-Est Marne-la-Vallée (UPEM)-École des Ponts ParisTech (ENPC)-ESIEE Paris-Fédération de Recherche Bézout-Centre National de la Recherche Scientifique (CNRS)
- Subjects
Statistics and Probability, Computer science, 0206 medical engineering, [INFO.INFO-DS]Computer Science [cs]/Data Structures and Algorithms [cs.DS], 02 engineering and technology, Biochemistry, Population genomics, 03 medical and health sciences, Search algorithm, Surveys and Questionnaires, Quantitative Biology - Genomics, Molecular Biology, ComputingMilieux_MISCELLANEOUS, 030304 developmental biology, Genomics (q-bio.GN), 0303 health sciences, Biosequence, Computational Biology, High-Throughput Nucleotide Sequencing, Data science, Computer Science Applications, Computational Mathematics, Computational Theory and Mathematics, Metagenomics, FOS: Biological sciences, Key (cryptography), Sequence Analysis, 020602 bioinformatics, Algorithms
- Abstract
The paper surveys the evolution of the main algorithmic techniques to compare and search biological sequences. We highlight key algorithmic ideas that emerged in response to several interconnected factors: shifts of biological analytical paradigm, advent of new sequencing technologies, and a substantial increase in size of the available data. We discuss the expansion of alignment-free techniques coming to replace alignment-based algorithms in large-scale analyses. We further emphasize recently emerged and growing applications of sketching methods which support comparison of massive datasets, such as metagenomics samples. Finally, we focus on the transition to population genomics and outline associated algorithmic challenges. (11 pages, 71 references)
- Published
- 2019
- Full Text
- View/download PDF
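The survey above mentions sketching methods for comparing massive datasets. One widely known technique of that kind is MinHash, sketched below as an illustration only; the record does not say which sketching methods the survey covers in detail. The helper names, the k-mer length, the number of hash functions, and the use of seeded BLAKE2b hashes are all assumptions made for this example.

```python
import hashlib


def kmers(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """Deterministic 64-bit hash of a k-mer under a given seed."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(seq, k=4, num_hashes=64):
    """Keep, for each seeded hash function, the minimum hash over all k-mers.

    Assumes len(seq) >= k so that the k-mer set is not empty.
    """
    ks = kmers(seq, k)
    return [min(_hash(x, seed) for x in ks) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of matching sketch slots, an estimate of k-mer set Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)
```

The point of the design is that two sequences are compared through fixed-size sketches rather than through their full k-mer sets, so the comparison cost does not grow with sequence length.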
12. Defeating Fake Food Labels Using Watermarking and Biosequence Analysis
- Author
-
Manoranjan Mohanty and Vijay Naidu
- Subjects
Computer science, business.industry, media_common.quotation_subject, digestive, oral, and skin physiology, Biosequence, 020207 software engineering, Pattern recognition, Watermark, 02 engineering and technology, ComputingMethodologies_PATTERNRECOGNITION, 0202 electrical engineering, electronic engineering, information engineering, Fake food, 020201 artificial intelligence & image processing, Quality (business), Artificial intelligence, Food label, Food quality, business, Digital watermarking, True positive rate, media_common
- Abstract
Fake food labels are one of the leading ways to distribute a low-quality food item as a high-quality branded product. For example, under fake labels, significantly more fake Manuka honey is sold than is actually produced. In this paper, we propose a scheme to combat the spread of such low-quality food items by identifying fake food labels. In our scheme, a watermark is inserted into a genuine food label and biosequence analysis is used to detect this watermark. The proposed biosequence analysis can detect duplicate labels, for example a photocopy of the genuine label. The proposed method works by converting a label image into biological amino acid form (e.g., to A, C, D, G, H, etc. form) and then extracting a signature from the label (which is represented in amino acid form) using biological tools. These signatures are then matched against a query label image to determine its originality. Experiments with honey food labels (a honey watermarked dataset created by us) show that the proposed method has a true positive rate of 91.67%.
- Published
- 2019
- Full Text
- View/download PDF
13. Wildfire Effects on Soils of a 55-Year-Old Chaparral and Pine Biosequence
- Author
-
Jodi L. Johnson-Maynard, Paul D. Sternberg, Robert C. Graham, Peter J. Shouse, Joan M. Breiner, Louise M. Egerton-Warburton, Paul F. Hendrix, Jack A. Jobes, and Sylvie A. Quideau
- Subjects
0106 biological sciences, geography, geography.geographical_feature_category, Agroforestry, Biosequence, Soil Science, Forestry, 04 agricultural and veterinary sciences, Chaparral, 010603 evolutionary biology, 01 natural sciences, Soil water, 040103 agronomy & agriculture, 0401 agriculture, forestry, and fisheries, Environmental science
- Published
- 2016
- Full Text
- View/download PDF
14. Analyzing the protection scope limit for biosequence patent claims in China
- Author
-
Wei Li
- Subjects
0301 basic medicine, China, Economic Competition, Scope (project management), Bacteria, Drug Industry, Biomedical Engineering, Biosequence, Bioengineering, Legislation, Drug, Applied Microbiology and Biotechnology, Europe, Patents as Topic, 03 medical and health sciences, 030104 developmental biology, Pharmaceutical Preparations, Mutation, Molecular Medicine, Business, Limit (mathematics), Patent claim, Glucan 1,4-alpha-Glucosidase, Law and economics, Biotechnology
- Abstract
Though a patent's protection scope should be based on the content of the claims and written description, determining the protection scope of a biosequence patent has always been a controversial issue in practice.
- Published
- 2018
15. Negative Factor
- Author
-
Baihua Zheng, Xiaochun Yang, Yaoshu Wang, Bin Wang, Chen Li, and Tao Qiu
- Subjects
Matching (graph theory), business.industry, Computer science, String (computer science), Biosequence, 02 engineering and technology, Machine learning, computer.software_genre, Substring, Automaton, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, Regular expression, Pruning (decision trees), Performance improvement, business, computer, Information Systems
- Abstract
The problem of finding matches of a regular expression (RE) on a string exists in many applications, such as text editing, biosequence search, and shell commands. Existing techniques first identify candidates using substrings in the RE, then verify each of them using an automaton. These techniques become inefficient when there are many candidate occurrences that need to be verified. In this article, we propose a novel technique that prunes false negatives by utilizing negative factors, which are substrings that cannot appear in an answer. A main advantage of the technique is that it can be integrated with many existing algorithms to improve their efficiency significantly. We present a detailed description of this technique. We develop an efficient algorithm that utilizes negative factors to prune candidates, then improve it by using bit operations to process negative factors in parallel. We show that negative factors, when used with necessary factors (substrings that must appear in each answer), can achieve much better pruning power. We analyze the large number of negative factors, and develop an algorithm for finding a small number of high-quality negative factors. We conducted a thorough experimental study of this technique on real datasets, including DNA sequences, proteins, and text documents, and show significant performance improvement of the state-of-the-art tools by an order of magnitude.
- Published
- 2016
- Full Text
- View/download PDF
16. Nanoarrays for Systolic Biosequence Analysis
- Author
-
Malik Ashter Mehdy, Aleandro Antidormi, Gianluca Piccinini, and Mariagrazia Graziano
- Subjects
010302 applied physics, Computer science, Biosequence, Systolic array, 02 engineering and technology, General Medicine, Parallel computing, Dissipation, 021001 nanoscience & nanotechnology, 01 natural sciences, Hardware and Architecture, 0103 physical sciences, Overhead (computing), Electrical and Electronic Engineering, 0210 nano-technology
- Abstract
Applications like biosequence alignment are currently addressed using traditional technology at the price of a huge overhead in terms of area and power dissipation. Nanoarrays are expected to outperform current limits especially in terms of processing capabilities. The purpose of this work is to assess the real terms of these expectations. Our contribution deals with: (i) a new model for nanowire FETs used to evaluate transistor’s essential performance; (ii) a new switch-level simulator for nanoarray structure used to evaluate its switching activity; (iii) a nanoarray implementation for biosequence alignment based on a systolic array and the modeling of its essential performance based on (i) and (ii); (iv) the evaluation of the potential improvement of the nanoarray-based systolic structure with respect to an equivalent CMOS one in terms of processing capabilities, area, and power dissipation. Depending on the possible technological scenario, the performance of nanoarray is impressive, especially considering the density achievable in terms of processing per unit area. A wide solution space can be explored to find the optimal solution in terms of trading power and performance considering the technological limitations of a realistic implementation.
- Published
- 2018
17. Finding Homologs in Amino Acid Sequences Using Network BLAST Searches
- Author
-
Istvan Ladunga
- Subjects
0301 basic medicine, Computer science, Sequence analysis, Molecular Sequence Data, Inference, Information Storage and Retrieval, Computational biology, computer.software_genre, Biochemistry, Domain (software engineering), 03 medical and health sciences, Structural Biology, Protein methods, Sequence Analysis, Protein, Databases, Genetic, Statistical inference, Database search engine, Protein function prediction, Amino Acid Sequence, Databases, Protein, Smith–Waterman algorithm, Internet, 030102 biochemistry & molecular biology, Sequence Homology, Amino Acid, Basic Local Alignment Search Tool, Biosequence, Computational Biology, Proteins, Divergent evolution, 030104 developmental biology, ComputingMethodologies_PATTERNRECOGNITION, Sequence homology, Database Management Systems, Data mining, computer, Sequence Alignment, Function (biology), Software
- Abstract
The Basic Local Alignment Search Tool (BLAST) is the most fundamental (and most misused) algorithm and software in bioinformatics/computational biology for functional assessment of unknown proteins or discovery of similar proteins with potentially common evolutionary origins. We show how to balance sensitivity with selectivity (without generating massive output) by selecting and demonstrating proper database, algorithm, and alignment display options of the user-friendly Web sites of the National Center for Biotechnology Information (NCBI). We discuss protein query searches against protein databases and submission of all combinations of translated searches. Careful biological and statistical inferences are drawn to possible functions, taking into account the highly nonrandom nature of proteins. Guidelines for such inferences, using real-life biological examples (e.g., protein kinases with widely distributed structural and functional domains), are provided. We show how to avoid incorrect functional inference from misleading similarities, using the divergent evolution of a serine protease domain that erodes the protease function in haptoglobins. Curr. Protoc. Bioinform. 25:3.4.1-3.4.34. © 2009 by John Wiley & Sons, Inc. Keywords: BLAST; bioinformatics; computational biology; database search; functional assessment; statistical inference; local alignment; translated database search
- Published
- 2017
18. Data Structures for Parsimony Correlation and Biosequence Co-Evolution
- Author
-
Robert Hochberg and Treena Larrew Milam
- Subjects
Computer science, computer.software_genre, Evolution, Molecular, Correlation, Software, Genetics, Preprocessor, Computer Simulation, Base sequence, Molecular Biology, Interactive computation, Phylogeny, Base Sequence, Models, Genetic, business.industry, Biosequence, Pattern recognition, Sequence Analysis, DNA, Data structure, Computational Mathematics, ComputingMethodologies_PATTERNRECOGNITION, Computational Theory and Mathematics, Modeling and Simulation, Graph (abstract data type), Artificial intelligence, Data mining, business, computer, Algorithms
- Abstract
We give an algorithm for discovering co-evolution in biosequences from a dataset consisting of aligned data and a phylogeny. The method correlates vectors of parsimony scores on the edges of a graph, averaged over all optimally parsimonious reconstructions of the data. We describe an efficient data structure, and a preprocessing step that allows for rapid, interactive computation of many correlation scores, at the expense of storage space.
- Published
- 2014
- Full Text
- View/download PDF
19. Characteristics of melanic epipedon based on biosequence in the physiography of Marapi - Singgalang, West Sumatra
- Author
-
Dedi Hermon, Olivia Oktorie, Aprizon Putra, and Ganefri
- Subjects
Hydrology, Watershed, Buffer zone, Watershed area, Biosequence, Geology, Stratified sampling
- Abstract
This research aims to describe the melanic epipedon, based on biosequence, on the middle slopes of Mount Marapi and Mount Singgalang, which represent the upper part of the Batang Anai watershed. Samples were determined by stratified random sampling at each biosequence. The results show that the characteristics of the melanic epipedon of Mount Marapi differ from those of Mount Singgalang, so farmland on the middle slope of Singgalang needs land conservation measures, through a reforestation (reboisasi) program, to restore the land's function as a buffer zone in the Batang Anai watershed.
- Published
- 2019
- Full Text
- View/download PDF
20. String analysis by sliding positioning strategy
- Author
-
Rafael Morales-Bueno, José M. Carmona-Cejudo, and Manuel Baena-García
- Subjects
Sequence, Theoretical computer science, Computer Networks and Communications, Applied Mathematics, Process (computing), Biosequence, Data structure, Theoretical Computer Science, Matrix (mathematics), Computational Theory and Mathematics, Trie, Heuristics, Algorithm, Computer Science::Databases, Natural language, Mathematics
- Abstract
Discovering frequent factors from long strings is an important problem in many applications, such as biosequence mining. In classical approaches, the algorithms process a vast database of small strings. However, in this paper we analyze a small database of long strings. The main difference resides in the high number of patterns to analyze. To tackle the problem, we have developed a new algorithm for discovering frequent factors in long strings. We present an Apriori-like solution which exploits the fact that any super-pattern of a non-frequent pattern cannot be frequent. The SANSPOS algorithm uses a multiple-pass, candidate generation and test approach. Multiple length patterns can be generated in a pass. This algorithm uses a new data structure to arrange nodes in a trie. A Positioning Matrix is defined as a new positioning strategy. By using Positioning Matrices, we can apply advanced pruning heuristics in a trie with a minimal computational cost. The Positioning Matrices let us process strings including Short Tandem Repeats and calculate different interestingness measures efficiently. Furthermore, in our algorithm we apply parallelism to traverse different sections of the input strings concurrently, speeding up the resulting running time. The algorithm has been successfully used in natural language and biological sequence contexts.
- Published
- 2014
- Full Text
- View/download PDF
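The Apriori property stated in the abstract above (any super-pattern of a non-frequent pattern cannot be frequent) can be shown with a small level-wise sketch. This is not the SANSPOS algorithm: it uses no trie, no Positioning Matrix, and no parallelism. The function name `frequent_factors` and the definition of support as the number of input strings containing the factor are assumptions made for illustration.

```python
def frequent_factors(strings, min_support, max_len):
    """Return {factor: support} for all factors up to max_len characters.

    Support is assumed to be the number of input strings that contain
    the factor at least once.
    """
    def support(pattern):
        return sum(pattern in s for s in strings)

    alphabet = sorted(set("".join(strings)))
    frequent = {}

    # Level 1: single characters that meet the support threshold.
    current = {}
    for ch in alphabet:
        sup = support(ch)
        if sup >= min_support:
            current[ch] = sup

    length = 1
    while current and length <= max_len:
        frequent.update(current)
        # Candidate generation: extend a frequent factor by one character,
        # keeping the candidate only if its suffix is frequent as well.
        # Candidates with a non-frequent sub-factor are never counted.
        candidates = {
            p + ch for p in current for ch in alphabet if (p + ch)[1:] in current
        }
        # Test step: count support only for the surviving candidates.
        current = {}
        for cand in candidates:
            sup = support(cand)
            if sup >= min_support:
                current[cand] = sup
        length += 1
    return frequent
```

For the strings "ABAB", "BABA", and "ABBA" with a minimum support of 3, the candidates "ABB" and "BAA" are discarded without a scan because "BB" and "AA" are already known to be infrequent.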
21. Comparing Biosequence Similarity Using the Hamiltonian Lattice Model and the Prospect for Two-Dimensional Codes
- Author
-
Chi-Ching Yang, Chung-Jen Ou, Chien-Han Lin, and Chung-Ming Ou
- Subjects
Discrete mathematics, Health (social science), General Computer Science, General Mathematics, General Engineering, Biosequence, Education, Algebra, symbols.namesake, General Energy, symbols, Hamiltonian (quantum mechanics), General Environmental Science, Mathematics
- Published
- 2013
- Full Text
- View/download PDF
22. Increasing Efficiency of Computation Time For Hit Detection In BLASTN
- Author
-
Yuva Bharathi. R, Jebaraj Jegan. T, and Punitha. P
- Subjects
Sequence, ComputingMethodologies_PATTERNRECOGNITION, Sequence database, Chromosome (genetic algorithm), Computer science, Computation, Biosequence, General Earth and Planetary Sciences, One-to-many, Base (topology), Field-programmable gate array, Algorithm, General Environmental Science
- Abstract
Biologists have great difficulty comparing two sample sequences such as DNA, RNA, and protein sequences. A biosequence represents a single, continuous molecule of nucleic acid or protein. It can be anything from a band on a gel to a complete chromosome. Finding biologically significant similarities between two sequences therefore calls for a huge database system, which forces a compromise on computation time; this can be overcome by implementing the BLASTN process. In this paper, the BLAST process is made more efficient by a new approach to biological sequence database scanning. The scanning is performed on reconfigurable FPGA-based hardware by comparing one sequence against many sequences from the database. The experimental sequence matching reduces the computation time of BLAST.
- Published
- 2013
- Full Text
- View/download PDF
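Hit detection, the stage named in the title of the record above, is commonly described as finding short exact word matches between a query and a database sequence before any extension is attempted. The sketch below shows that idea in software only, under stated assumptions: the paper's FPGA design is not reproduced, the function names are hypothetical, and the word size of 4 is a toy value chosen so the example stays readable.

```python
from collections import defaultdict


def build_word_index(subject, w):
    """Index every length-w word of the subject by its start positions."""
    index = defaultdict(list)
    for i in range(len(subject) - w + 1):
        index[subject[i:i + w]].append(i)
    return index


def find_hits(query, subject, w=4):
    """Return (query_pos, subject_pos) pairs where a length-w word matches exactly."""
    index = build_word_index(subject, w)
    hits = []
    for q in range(len(query) - w + 1):
        for s in index.get(query[q:q + w], []):
            hits.append((q, s))
    return hits
```

Comparing one query against many database sequences, as the abstract describes, amounts to repeating `find_hits` per subject; that repetition is the part a hardware implementation would run in parallel.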
23. A Biosequence-Based Approach to Software Characterization
- Author
-
Darren S. Curtis, Christopher S. Oehmen, Elena S. Peterson, and Aaron R. Phillips
- Subjects
Cloning, 020203 distributed computing, Similarity (geometry), Sequence analysis, Computer science, business.industry, Process (computing), Biosequence, 020207 software engineering, 02 engineering and technology, computer.file_format, computer.software_genre, Identification (information), chemistry.chemical_compound, Software, chemistry, Software construction, 0202 electrical engineering, electronic engineering, information engineering, Executable, Data mining, business, Software analysis pattern, computer, DNA
- Abstract
For many applications, it is desirable to have a process for recognizing when software binaries are closely related without relying on them to be identical or have identical segments. But doing so in a dynamic environment is a nontrivial task because most approaches to software similarity require extensive and time-consuming analysis of a binary, or they fail to recognize executables that are similar but not identical. Presented herein is a novel biosequence-based method for quantifying similarity of executable binaries. Using this method, we show in an example application on large-scale multi-author codes that 1) the biosequence-based method has a statistical performance in recognizing and distinguishing between a collection of real-world high performance computing applications better than 90% of ideal, and 2) an example of using family-tree analysis to tune identification for a code subfamily can achieve better than 99% of ideal performance.
- Published
- 2016
- Full Text
- View/download PDF
24. Prediction of Protein-Protein Interaction via co-occurring Aligned Pattern Clusters
- Author
-
Antonio Sze-To, Sanderz Fung, Andrew K. C. Wong, and En-Shiun Annie Lee
- Subjects
0301 basic medicine ,Computer science ,business.industry ,Feature vector ,Supervised learning ,Biosequence ,Computational Biology ,Pattern recognition ,Bioinformatics ,General Biochemistry, Genetics and Molecular Biology ,Conserved sequence ,Support vector machine ,03 medical and health sciences ,030104 developmental biology ,String kernel ,Sequence Analysis, Protein ,Protein Interaction Mapping ,Feature (machine learning) ,Artificial intelligence ,Amino Acid Sequence ,Protein Interaction Maps ,business ,Molecular Biology ,Algorithms ,Sequence (medicine) - Abstract
Predicting Protein–Protein Interaction (PPI) is important for making new discoveries in the molecular mechanisms inside a cell. Traditionally, new PPIs are identified through biochemical experiments but such methods are labor-intensive, expensive, time-consuming and technically ineffective due to high false positive rates. Sequence-based prediction is currently the most readily applicable and cost-effective method. It exploits known PPI Databases to construct classifiers for predicting unknown PPIs based only on sequence data without requiring any other prior knowledge. Among existing sequence-based methods, most feature-based methods use exact sequence patterns with fixed length as features — a constraint which is biologically unrealistic. SVM with Pairwise String Kernel renders better predicting performance. However it is difficult to be biologically interpretable since it is kernel-based where no concrete feature values are computed. Here we have developed a novel method WeMine-P2P to overcome these drawbacks. By assuming that the regions/sites that mediate PPI are more conserved, WeMine-P2P first discovers/locates the conserved sequence patterns in protein sequences in the form of Aligned Pattern Clusters (APCs), allowing pattern variations with variable length. It then pairs up all APCs into a set of Co-Occurring APC (cAPC) pairs, and computes a cAPC-PPI score for each cAPC pair on all PPI pairs. It further constructs a feature vector composed of all cAPC pairs with their cAPC-PPI scores for each PPI pair and uses them for constructing a PPI predictor. 
Through 40 independent experiments, we showed that (1) WeMine-P2P outperforms the well-known algorithm, PIPE2, which also utilizes co-occurring amino acid sequence segments but does not allow variable lengths and pattern variations; (2) WeMine-P2P achieves satisfactory PPI prediction performance, comparable to the SVM-based methods particularly among unseen protein sequences with a potential reduction of feature dimension of 1280×; (3) Unlike SVM-based methods, WeMine-P2P renders interpretable biological features from which we observed that co-occurring sequence patterns from the compositional bias regions are more discriminative. WeMine-P2P is extendable to predict other biosequence interactions such as Protein–DNA interactions.
- Published
- 2016
25. Biosequence Similarity Search on the Mercury System
- Author
-
Praveen Krishnamurthy, Kwame Gyang, Arpith C. Jacob, Roger D. Chamberlain, Joseph M. Lancaster, Jeremy Buhler, and Mark A. Franklin
- Subjects
Computer science ,Nearest neighbor search ,Biosequence ,computer.software_genre ,Article ,DNA sequencing ,Search algorithm ,Signal Processing ,Data mining ,Electrical and Electronic Engineering ,Mercury (programming language) ,computer ,Information Systems ,computer.programming_language - Abstract
Biosequence similarity search is an important application in modern molecular biology. Search algorithms aim to identify sets of sequences whose extensional similarity suggests a common evolutionary origin or function. The most widely used similarity search tool for biosequences is BLAST, a program designed to compare query sequences to a database. Here, we present the design of BLASTN, the version of BLAST that searches DNA sequences, on the Mercury system, an architecture that supports high-volume, high-throughput data movement off a data store and into reconfigurable hardware. An important component of application deployment on the Mercury system is the functional decomposition of the application onto both the reconfigurable hardware and the traditional processor. Both the Mercury BLASTN application design and its performance analysis are described.
- Published
- 2007
- Full Text
- View/download PDF
26. Predicting Protein-protein interaction using co-occurring Aligned Pattern Clusters
- Author
-
Sanderz Fung, Antonio Sze-To, Andrew K. C. Wong, and En-Shiun Annie Lee
- Subjects
Support vector machine ,Feature Dimension ,Co occurring ,business.industry ,String kernel ,Supervised learning ,Biosequence ,Pattern recognition ,Artificial intelligence ,Biology ,business ,Protein–protein interaction ,Random forest - Abstract
Understanding Protein-protein interaction (PPI) is of fundamental importance in deciphering cellular processes. Predicting PPIs is thus critical in making new discoveries in the biological domains. Traditionally, new PPIs are identified through biochemical experiments but such methods are labor-intensive, expensive, time-consuming and technically ineffective due to high false positive rates. Computational docking is an alternative but requires the three-dimensional structures of the target proteins which are not always accessible. Sequence-based prediction is the most readily applicable and cost-effective method. It exploits known PPI Databases to construct classifiers for predicting unknown PPIs based only on sequence data. However, existing methods, adopting features that fix the pattern length and use exact patterns, are biologically unrealistic. Also, those based on SVM and String Kernel are hardly biologically interpretable since they do not compute the features. Recently, we have developed a new method for predicting PPI known as WeMine-P2P based on our WeMine Aligned Pattern Clustering algorithm which discovers and identifies the localized and co-occurring conserved patterns and regions allowing variable length and pattern variations. As our first attempt, under 40 independent experiments, we showed that (1) WeMine-P2P outperforms the well-known algorithm, PIPE2 which also utilizes co-occurring amino acid sequence segments but does not allow variable lengths and pattern variations; (2) Unlike SVM-based methods, WeMine-P2P renders interpretable biological features; (3) WeMine-P2P achieves satisfactory PPI prediction performance, comparable to the SVM-based methods particularly in unseen protein sequences, with a potential reduction of feature dimension of 1280x. WeMine-P2P is extendable to other biosequence interactions such as predicting Protein-DNA interactions.
- Published
- 2015
- Full Text
- View/download PDF
27. Efficient q-Gram Filters for Finding All ε-Matches over a Given Length
- Author
-
Kim R. Rasmussen, Jens Stoye, and Eugene W. Myers
- Subjects
Theoretical computer science ,sequence assembly ,Heuristic (computer science) ,Sequence assembly ,Word error rate ,Sequence alignment ,Biology ,Sensitivity and Specificity ,Sequence comparison ,Genetics ,EST ,Molecular Biology ,Alignment-free sequence analysis ,Mathematics ,Gram ,Smith–Waterman algorithm ,Insertion sort ,filter ,Genome ,Heuristic ,Biosequence ,Computational Biology ,q-grams ,Quantitative Biology::Genomics ,Computational Mathematics ,local alignment searching ,Computational Theory and Mathematics ,k-mer ,Filter (video) ,Modeling and Simulation ,Databases, Nucleic Acid ,Sequence Alignment ,Algorithm ,Algorithms ,clustering - Abstract
Fast and exact comparison of large genomic sequences remains a challenging task in biosequence analysis. We consider the problem of finding all epsilon-matches between two sequences, i.e., all local alignments over a given length with an error rate of at most epsilon. We study this problem theoretically, giving an efficient q-gram filter for solving it. Two applications of the filter are also discussed, in particular genomic sequence assembly and BLAST-like sequence comparison. Our results show that the method is 25 times faster than BLAST, while not being heuristic.
- Published
- 2006
- Full Text
- View/download PDF
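Illustration for result 27: the abstract refers to a q-gram filter for local alignments with a bounded error rate. The sketch below is not the filter from that paper. It only shows the classical q-gram lemma that such filters build on: a string of length n matched with at most e edit errors keeps at least n + 1 - q(e + 1) of its q-grams. The function names `qgram_threshold`, `shared_qgrams`, and `passes_filter` are hypothetical.

```python
from collections import Counter


def qgram_threshold(n: int, q: int, e: int) -> int:
    """Minimum number of shared q-grams implied by the q-gram lemma."""
    return n + 1 - q * (e + 1)


def shared_qgrams(a: str, b: str, q: int) -> int:
    """Count q-grams common to a and b, respecting multiplicity."""
    ca = Counter(a[i:i + q] for i in range(len(a) - q + 1))
    cb = Counter(b[i:i + q] for i in range(len(b) - q + 1))
    # Counter & Counter keeps the minimum count of each shared q-gram.
    return sum((ca & cb).values())


def passes_filter(a: str, b: str, q: int, e: int) -> bool:
    """True if b cannot be ruled out as a match of a with at most e errors."""
    return shared_qgrams(a, b, q) >= qgram_threshold(len(a), q, e)
```

A candidate that fails the test can be discarded without running a full alignment; a candidate that passes still has to be verified, because the lemma gives a necessary condition only.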
28. Beyond Mfold: Recent advances in RNA bioinformatics
- Author
-
Robert Giegerich, Marc Rehmsmeier, Jens Reeder, Björn Voss, and Matthias Höchsmann
- Subjects
Models, Molecular ,RNA pseudoknots ,Structural alignment ,Molecular Sequence Data ,Bioengineering ,miRNA target prediction ,Biology ,Bioinformatics ,RNA structure comparison ,Applied Microbiology and Biotechnology ,Mirna target ,Article ,Nucleic acid secondary structure ,RNA interference ,Sequence Homology, Nucleic Acid ,Animals ,Humans ,Computational analysis ,structure comparison ,Internet ,Base Sequence ,Biosequence ,RNA ,Computational Biology ,General Medicine ,RNA secondary structure ,Nucleic Acid Conformation ,Pseudoknot ,Software ,Biotechnology ,Consensus structure prediction - Abstract
Computational analysis of RNA secondary structure is a classical field of biosequence analysis, which has recently gained momentum due to the manifold regulatory functions of RNA that have become apparent. We present five recent computational approaches that address the problems of synoptic folding space analysis, pseudoknot prediction, structure alignment, comparative structure prediction, and miRNA target prediction. All these programs are in current use and are available via the Bielefeld Bioinformatics Server at http://bibiserv.techfak.uni-bielefeld.de. (c) 2006 Elsevier B.V. All rights reserved.
- Published
- 2006
29. VECTOR SPACE INDEXING FOR BIOSEQUENCE SIMILARITY SEARCHES
- Author
-
Hakan Ferhatosmanoglu and Ozgur Ozturk
- Subjects
business.industry ,Computer science ,Search engine indexing ,Biosequence ,Pattern recognition ,Domain (mathematical analysis) ,Wavelet ,Index (publishing) ,Similarity (network science) ,Artificial Intelligence ,Artificial intelligence ,Pruning (decision trees) ,business ,Vector space - Abstract
We present a multi-dimensional indexing approach for fast sequence similarity search in DNA and protein databases. In particular, we propose effective transformations of subsequences into numerical vector domains and build efficient index structures on the transformed vectors. We then define distance functions in the transformed domain and examine properties of these functions. We experimentally compared their (a) approximation quality for k-Nearest Neighbor (k-NN) queries and both (b) pruning ability and (c) approximation quality for ε-range queries. Results for k-NN queries, which we present here, show that our proposed distances FD2 and WD2 (i.e. Frequency and Wavelet Distance functions for 2-grams) perform significantly better than the others. We then develop effective index structures, based on R-trees and scalar quantization, on top of transformed vectors and distance functions. Promising results from the experiments on real biosequence data sets are presented.
- Published
- 2005
- Full Text
- View/download PDF
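Illustration for result 29: the abstract describes transforming subsequences into numerical vectors and defining distances on the transformed vectors. As a minimal sketch of that idea, the code below maps a sequence to its 2-gram count vector and compares two sequences by the L1 distance between those vectors. This is an assumption made for illustration; it is not claimed to be the paper's exact definition of FD2 or WD2, and the names `qgram_profile` and `frequency_distance` are hypothetical.

```python
from collections import Counter


def qgram_profile(seq: str, q: int = 2) -> Counter:
    """Sparse count vector of all overlapping q-grams in seq."""
    return Counter(seq[i:i + q] for i in range(len(seq) - q + 1))


def frequency_distance(a: str, b: str, q: int = 2) -> int:
    """L1 distance between the q-gram count vectors of a and b."""
    pa, pb = qgram_profile(a, q), qgram_profile(b, q)
    # Missing q-grams count as zero; Counter lookups do not insert keys.
    return sum(abs(pa[g] - pb[g]) for g in pa.keys() | pb.keys())
```

Because the vectors have a fixed dimension for a given alphabet and q, they can in principle be stored in a multi-dimensional index such as the R-tree mentioned in the abstract.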
30. THE DESIGN PRINCIPLES AND ALGORITHMS OF A WEIGHTED GRAMMAR LIBRARY
- Author
-
Cyril Allauzen, Mehryar Mohri, and Brian Roark
- Subjects
Theoretical computer science ,Grammar ,Programming language ,Computer science ,media_common.quotation_subject ,Biosequence ,Design elements and principles ,computer.software_genre ,Variety (linguistics) ,Automaton ,Rule-based machine translation ,Computer Science (miscellaneous) ,Software design ,Representation (mathematics) ,Algorithm ,computer ,media_common - Abstract
We present the software design principles, algorithms, and utilities of a general weighted grammar library, the GRM Library, that can be used in a variety of applications in text, speech, and biosequence processing. Several of the algorithms and utilities of this library are described, including in some cases their pseudocodes and pointers to their use in applications. The algorithms and the utilities were designed to support a wide variety of semirings and the representation and use of large grammars and automata of several hundred million rules or transitions.
- Published
- 2005
- Full Text
- View/download PDF
31. Combining Phylogenetic and Hidden Markov Models in Biosequence Analysis
- Author
-
David Haussler and Adam Siepel
- Subjects
Computer science ,Gene prediction ,Inference ,Biology ,Machine learning ,computer.software_genre ,Markov model ,Sequence Analysis, Protein ,Genetics ,Hidden Markov model ,Molecular Biology ,Phylogeny ,Likelihood Functions ,Phylogenetic tree ,business.industry ,Maximum-entropy Markov model ,Variable-order Markov model ,Autocorrelation ,Substitution (logic) ,Biosequence ,Computational Biology ,Genomics ,Sequence Analysis, DNA ,Markov Chains ,Variable-order Bayesian network ,Computational Mathematics ,Computational Theory and Mathematics ,Modeling and Simulation ,Artificial intelligence ,business ,computer - Abstract
A few models have appeared in recent years that consider not only the way substitutions occur through evolutionary history at each site of a genome, but also the way the process changes from one site to the next. These models combine phylogenetic models of molecular evolution, which apply to individual sites, and hidden Markov models, which allow for changes from site to site. Besides improving the realism of ordinary phylogenetic models, they are potentially very powerful tools for inference and prediction---for gene finding, for example, or prediction of secondary structure. In this paper, we review progress on combined phylogenetic and hidden Markov models and present some extensions to previous work. Our main result is a simple and efficient method for accommodating higher-order states in the HMM, which allows for context-sensitive models of substitution---that is, models that consider the effects of neighboring bases on the pattern of substitution. We present experimental results indicating that higher-order states, autocorrelated rates, and multiple functional categories all lead to significant improvements in the fit of a combined phylogenetic and hidden Markov model, with the effect of higher-order states being particularly pronounced.
- Published
- 2004
- Full Text
- View/download PDF
32. Biosequence Time–Frequency Processing: Pathogen Detection and Identification
- Author
-
Antonia Papandreou-Suppappola, Brian O'Donnell, and A. Maurer
- Subjects
Identification (information) ,Signal processing ,Diagnostic information ,Pathogen detection ,business.industry ,Computer science ,Biosequence ,Gene chip analysis ,Pattern recognition ,Artificial intelligence ,business ,Random sequence ,Time–frequency analysis - Abstract
Diagnostic information obtained from antibodies binding to random peptide sequences is now feasible using immunosignaturing, a recently developed microarray technology. The success of this technology is highly dependent upon the use of advanced algorithms to analyze the random sequence peptide arrays and to process variations in antibody profiles to discriminate between pathogens. This work presents the use of time–frequency signal processing methods for immunosignaturing. In particular, highly-localized waveforms and their parameters are used to uniquely map random peptide sequences and their properties in the time–frequency plane. Advanced time–frequency signal processing techniques are then applied for estimating antigenic determinants or epitope candidates for detecting and identifying potential pathogens.
- Published
- 2015
- Full Text
- View/download PDF
33. Blurring the line between bioinformatics and patent analysis
- Author
-
Barbara Hall Miller, Robert Poolman, and Seth E. Mendelson
- Subjects
Focus (computing) ,Biological patent ,Renewable Energy, Sustainability and the Environment ,Computer science ,Process Chemistry and Technology ,Biosequence ,Energy Engineering and Power Technology ,ComputingMilieux_LEGALASPECTSOFCOMPUTING ,Bioengineering ,Library and Information Sciences ,Bioinformatics ,Data science ,Computer Science Applications ,Disk formatting ,Patent analysis ,Fuel Technology ,Code (cryptography) ,Line (text file) ,Macro - Abstract
Biological patent analysts are often faced with querying multiple databases of sequence and text. After retrieving the query results, the results must be analyzed by reviewing patents and applications in a form of manual data reduction. Once families are identified for final analysis, the analyst must expand back out, looking at the members of those patent families. Again, sequence and text is reviewed manually, not algorithmically. One way for patent analysts to reduce the number of repetitive tasks performed is through the creation of macros. Such macros can be used for time-consuming tasks like global formatting, searching and replacing within a document, formatting extracted sequences into FASTA, or even converting three-letter amino acid code into single-letter code. Another way for a patent analyst to reduce repetitive tasks in biosequence patent analysis is through an alliance between biological patent analysts and bioinformaticians. Such an alliance could result in the development of tools that focus on these types of repetitive tasks. A bioinformatician is skilled in looking for solutions to various repetitive tasks. Such a solution could even be packaged and deployed to those colleagues who would benefit from access to such a time-saving program. Finding a way to automate repetitive tasks will free the patent analyst to spend more time on intellectual analysis of the results.
- Published
- 2011
- Full Text
- View/download PDF
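Illustration for result 33: the abstract mentions macros for formatting extracted sequences into FASTA and for converting three-letter amino acid code into single-letter code. A minimal Python sketch of both tasks follows. The function names `three_to_one` and `to_fasta`, the 60-character line width, and the use of "X" for unrecognized residues are assumptions for this example, not details from the article.

```python
# Standard three-letter to one-letter codes for the 20 common amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def three_to_one(residues: str) -> str:
    """Convert 'Met Ala Gly' or 'Met-Ala-Gly' to 'MAG' (unknown -> 'X')."""
    tokens = residues.replace("-", " ").split()
    return "".join(THREE_TO_ONE.get(tok.capitalize(), "X") for tok in tokens)


def to_fasta(name: str, sequence: str, width: int = 60) -> str:
    """Format one sequence as a FASTA record with wrapped lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"
```

For example, `to_fasta("claim1", three_to_one("Met-Ala-Gly"))` produces a record whose header is `>claim1` and whose sequence line is `MAG`.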
34. An integrated approach for genome-wide gene expression analysis
- Author
-
Yuh-Jyh Hu
- Subjects
Computer science ,Health Informatics ,Saccharomyces cerevisiae ,Computational biology ,Regulon ,Genome ,Gene expression ,Gene ,Regulation of gene expression ,Stochastic Processes ,Messenger RNA ,Models, Genetic ,business.industry ,Gene Expression Profiling ,Decision Trees ,Biosequence ,Nucleic acid sequence ,Computational Biology ,Sequence Analysis, DNA ,Computer Science Applications ,Gene expression profiling ,Artificial intelligence ,Genome, Fungal ,business ,Algorithms ,Software - Abstract
Since efficient and relatively cheap methods were developed for determining biosequences, a lot of biosequence data has been generated. As the main problem in molecular biology is the analysis of the data instead of the data acquisition, part of the study of computational biology is to extract all kinds of meaningful information from the sequences. Computer-assisted methods have become very important in analyzing biosequence data. However, most of the current computer-assisted methods are limited to finding motifs. Genes can be regulated in many ways, including combinations of regulatory elements. This research is aimed at developing a new integrated system for genome-wide gene expression analysis. This research begins with a new motif-finding method, using a new objective function combining multiple well defined components and an improved stochastic iterative sampling strategy. Combinatorial motif analysis is accomplished by constructive induction that analyzes potential motif combinations. We then apply standard inductive learning algorithms to generate hypotheses for different gene behaviors. A genome-wide gene expression analysis demonstrated the value of this novel integrated system.
- Published
- 2001
- Full Text
- View/download PDF
35. Geometric Approach to Biosequence Analysis
- Author
-
Valentin E. Brimkov and Boris Brimkov
- Subjects
Simple (abstract algebra) ,Significant difference ,String (computer science) ,Biosequence ,Linearity ,Algorithm ,Mathematics - Abstract
Tools that effectively analyze and compare sequences are of great importance in various areas of applied computational research, especially in the framework of molecular biology. In the present paper, we introduce simple geometric criteria based on the notion of string linearity and use them to compare DNA sequences of various organisms, as well as to distinguish them from random sequences. Our experiments reveal a significant difference between biosequences and random sequences, the former having much higher deviation from linearity than the latter, as well as a general trend of increasing deviation from linearity between primitive and biologically complex organisms. The proposed approach is potentially applicable to the construction of dendrograms representing the evolutionary relationships among species.
- Published
- 2014
- Full Text
- View/download PDF
36. Biosequence Analysis Using Intel® Xeon Phi
- Author
-
Pradeep K. Sinha, Kalyani Shewale, Abhishek Das, Deepu Vikranman, Goldi Misra, Shraddha Desai, and Sucheta Pawar
- Subjects
Biological data ,Sequence ,Coprocessor ,Computer science ,Sequence analysis ,Message passing ,Biosequence ,Parallel computing ,Supercomputer ,Xeon Phi - Abstract
Due to the ever-increasing size of sequence databases, it has become clear that faster techniques must be employed to effectively perform biological sequence analysis in a reasonable amount of time. In bioinformatics, protein sequence alignment is one of the fundamental tasks. MPI-HMMER is one of the applications that perform this kind of bio-sequence analysis. Since the growth of biological data is exponential, there is an ever-increasing demand for computational power. This paper discusses the behavior of MPI-HMMER on hybrid architecture by using native compilation technique on Intel Xeon Phi.
- Published
- 2013
- Full Text
- View/download PDF
37. BioSCAN: a network sharable computational resource for searching biosequence databases
- Author
-
Doug Hoffman, Raj K. Singh, Stephen G. Tell, and C.T. White
- Subjects
Statistics and Probability ,Databases, Factual ,Computer science ,Sequence alignment ,Client ,Computational resource ,computer.software_genre ,Biochemistry ,World Wide Web ,Computer Communication Networks ,Resource (project management) ,File server ,Software ,Nucleic Acids ,Amino Acid Sequence ,Molecular Biology ,Gene ,Peptide sequence ,Base Sequence ,Database ,business.industry ,Biosequence ,Nucleic acid sequence ,Proteins ,Computer Science Applications ,Computational Mathematics ,Computational Theory and Mathematics ,GenBank ,The Internet ,business ,Sequence Alignment ,computer ,Algorithms - Abstract
We describe a network sharable, interactive computational tool for rapid and sensitive search and analysis of biomolecular sequence databases such as GenBank, GenPept, Protein Identification Resource, and SWISS-PROT. The resource is accessible via the World Wide Web using popular client software such as Mosaic and Netscape. The client software is freely available on a number of computing platforms including Macintosh, IBM-PC, and Unix workstations.
- Published
- 1996
- Full Text
- View/download PDF
38. Trainer: A General-Purpose Trainable Short Biosequence Classifier
- Author
-
Hasan Ogul, Sinan Uğur Umu, Alper T. Kalkan, and Mahinur S. Akkaya
- Subjects
Sequence ,Computer science ,business.industry ,Trainer ,Biosequence ,Retraining ,Proteins ,General Medicine ,Bioinformatics ,Machine learning ,computer.software_genre ,Biochemistry ,Structural Biology ,Artificial Intelligence ,Sequence Analysis, Protein ,Feature (machine learning) ,Relevance (information retrieval) ,Artificial intelligence ,business ,Representation (mathematics) ,Databases, Protein ,computer ,Software ,Molecular entity - Abstract
Classifying sequences is one of the central problems in computational biosciences. Several tools have been released to map an unknown molecular entity to one of the known classes using solely its sequence data. However, all of the existing tools are problem-specific and restricted to an alphabet constrained by relevant biological structure. Here, we introduce TRAINER, a new online tool designed to serve as a generic sequence classification platform that enables users to provide their own training data with any alphabet defined therein. TRAINER allows users to select among several feature representation schemes and supervised machine learning methods with relevant parameters. Trained models can be saved for future use by other users without retraining. Two case studies are reported for effective use of the system for DNA and protein sequences: candidate effector prediction and nucleolar localization signal prediction. Biological relevance of the results is discussed.
- Published
- 2013
39. An Hardware Viewpoint on Biosequence Analysis: What's Next?
- Author
-
Mariagrazia Graziano, Maurizio Zamboni, and Stefano Frache
- Subjects
Nanofabrics ,Computer science ,business.industry ,Emerging technologies ,Biosequence ,Disruptive technology ,Beyond CMOS ,Hardware and Architecture ,silicon nanoarray ,Electrical and Electronic Engineering ,Silicon nanowires ,business ,Software ,Computer hardware - Abstract
Biosequence alignment recently received an increasing support from both commodity and dedicated hardware platforms. Processing capabilities are constantly rising, but still not satisfying the limitless requirements of this application. We give an insight on the contribution to this need that can possibly be expected from emerging technology devices and architectures, focusing as an example on nanofabrics based on silicon nanowires. By varying a few parameters we explore the solution space, and demonstrate with proper figures of merit how this family of beyond CMOS structures could be considered as the effective disruptive technology for biosequence analysis applications.
- Published
- 2013
40. The Effects of Different Representations on Malware Motif Identification
- Author
-
Shaoning Pang, Ban Tao, Yi Chen, and Ajit Narayanan
- Subjects
chemistry.chemical_classification ,Multiple sequence alignment ,Computer science ,business.industry ,Biosequence ,Sequence alignment ,computer.software_genre ,Machine learning ,Amino acid ,chemistry.chemical_compound ,chemistry ,Malware ,Artificial intelligence ,Data mining ,business ,computer ,Gene ,Alignment-free sequence analysis ,DNA - Abstract
Sequence alignment is widely used in bioinformatics for revealing the genetic diversity of organisms and annotating gene functions by finding regions of similarity across biosequences. Such alignment requires sequences to be represented in the DNA or protein alphabet for tools such as Clustal to work. Previous work has demonstrated the feasibility of applying biosequence multiple alignment techniques to computer viral and worm signatures to find regions of similarity that can serve as malware 'motifs', or meta-signatures. However, it was not known how different ways of representing signatures in an appropriate biosequence alphabet would affect the alignment results. This paper investigates the effects of adopting three different ways of representing malware signatures on sequence alignment and motif identification. The results of the alignment were checked with perceptrons, decision tree and logistic regression. The best performing representation was used to derive rules in PRISM that give rise to 'motifs' that can perform the role of 'meta-signatures'. All analysis was undertaken on the publicly available data mining tool, Weka (Waikato Environment for Knowledge Analysis: http://www.cs.waikato.ac.nz/ml/weka/).
- Published
- 2012
- Full Text
- View/download PDF
41. ALAE: Accelerating Local Alignment with Affine Gap Exactly in Biosequence Databases
- Author
-
Xiaochun Yang, Honglei Liu, and Bin Wang
- Subjects
Compressed suffix array ,Smith–Waterman algorithm ,FOS: Computer and information sciences ,Speedup ,Theoretical computer science ,Database ,Computer science ,Heuristic (computer science) ,General Engineering ,Biosequence ,Databases (cs.DB) ,computer.software_genre ,DNA sequencing ,Computer Science - Databases ,Affine transformation ,computer ,Algorithm - Abstract
We study the problem of local alignment, that is, finding pairs of similar subsequences with gaps. The problem arises in biosequence databases. BLAST is a typical tool for finding local alignments; it is based on heuristics and can miss results. Using the Smith-Waterman algorithm, we can find all local alignments in O(mn) time, where m and n are the lengths of a query and a text, respectively. A recent exact approach, BWT-SW, improves the complexity of the Smith-Waterman algorithm under constraints, but is still much slower than BLAST. This paper takes on the challenge of designing an accurate and efficient algorithm for evaluating local-alignment searches, especially for long queries. We propose an efficient tool called ALAE that speeds up BWT-SW using a compressed suffix array. ALAE utilizes a family of filtering techniques to prune meaningless calculations and an algorithm for reusing score calculations. We also give a mathematical analysis and show that the upper bound of the total number of calculated entries using ALAE could vary from $4.50mn^{0.520}$ to $9.05mn^{0.896}$ for random DNA sequences and from $8.28mn^{0.364}$ to $7.49mn^{0.723}$ for random protein sequences. We demonstrate the significant performance improvement of ALAE over BWT-SW in a thorough experimental study on real biosequences. ALAE guarantees correctness and accelerates BLAST for most parameters. (VLDB 2012)
- Published
- 2012
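Illustration for result 41: the abstract cites the O(mn) Smith-Waterman algorithm as the exact baseline for local alignment. The sketch below computes only the best local alignment score and, for brevity, uses a linear gap penalty rather than the affine gap model the paper addresses. The scoring values (match 2, mismatch -1, gap -2) and the name `smith_waterman_score` are assumptions for this example; ALAE's filtering and suffix-array techniques are not shown.

```python
def smith_waterman_score(query: str, text: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score in O(m*n) time and O(n) memory."""
    prev = [0] * (len(text) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(query) + 1):
        curr = [0] * (len(text) + 1)
        for j in range(1, len(text) + 1):
            sub = match if query[i - 1] == text[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment here
                prev[j - 1] + sub,   # align query[i-1] with text[j-1]
                prev[j] + gap,       # gap in the text
                curr[j - 1] + gap,   # gap in the query
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

Every cell of the m-by-n matrix is visited once, which is the O(mn) cost that exact methods such as BWT-SW and ALAE try to reduce by skipping entries that cannot contribute to a result.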
42. Protein alignment HW/SW optimizations
- Author
-
Stefano Frache, Marco Vacca, Mariagrazia Graziano, Gianvito Urgese, Maurizio Zamboni, and Muhammad Awais
- Subjects
Application-specific integrated circuit ,Filter (video) ,Computer science ,business.industry ,Algorithmic efficiency ,Embedded system ,Biosequence ,Point (geometry) ,Field-programmable gate array ,business ,Implementation ,Selection (genetic algorithm) - Abstract
Biosequence alignment recently received an amazing support from both commodity and dedicated hardware platforms. The limitless requirements of this application motivate the search for improved implementations to boost processing time and capabilities. We propose an unprecedented hardware improvement to the classic Smith-Waterman (S-W) algorithm based on a twofold approach: i) an on-the-fly gap-open/gap-extension selection that reduces the hardware implementation complexity; ii) a pre-selection filter that uses reduced amino-acid alphabets to screen out non-significant sequences and to shorten the S-W iterations on huge reference databases. We demonstrated the improvements with respect to a classic approach, both from the point of view of algorithm efficiency and of hardware performance (FPGA and ASIC post-synthesis analysis).
- Published
- 2012
43. Self-organized neural maps of human protein sequences
- Author
-
Bernard Pflugfelder, Pascual Ferrara, and Edgardo A. Ferrán
- Subjects
Self-organizing map ,Artificial neural network ,Computer science ,business.industry ,Biosequence ,Pattern recognition ,Bioinformatics ,Biochemistry ,Reduction (complexity) ,Matrix (mathematics) ,Dipeptide composition ,Artificial intelligence ,Swissprot database ,Cluster analysis ,business ,Molecular Biology - Abstract
We have recently described a method based on artificial neural networks to cluster protein sequences into families. The network was trained with Kohonen's unsupervised learning algorithm using, as inputs, the matrix patterns derived from the dipeptide composition of the proteins. We present here a large-scale application of that method to classify the 1,758 human protein sequences stored in the SwissProt database (release 19.0), whose lengths are greater than 50 amino acids. In the final 2-dimensional topologically ordered map of 15 x 15 neurons, proteins belonging to known families were associated with the same neuron or with neighboring ones. Also, as an attempt to reduce the time-consuming learning procedure, we compared 2 learning protocols: one of 500 epochs (100 SUN CPU-hours [CPU-h]), and another one of 30 epochs (6.7 CPU-h). A further reduction of learning-computing time, by a factor of about 3.3, with similar protein clustering results, was achieved using a matrix of 11 x 11 components to represent the sequences. Although network training is time consuming, the classification of a new protein in the final ordered map is very fast (14.6 CPU-seconds). We also show a comparison between the artificial neural network approach and conventional methods of biosequence analysis.
- Published
- 1994
- Full Text
- View/download PDF
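The network inputs described in the abstract above are matrix patterns derived from dipeptide composition. A minimal sketch of one way to compute such a matrix follows, assuming a 20 x 20 layout over the standard amino-acid alphabet and normalization to relative frequencies. Both choices are assumptions for illustration, and the reduced 11 x 11 representation mentioned in the abstract is not reproduced here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def dipeptide_composition(seq):
    """Return a 20 x 20 matrix of relative frequencies of adjacent residue pairs.

    Pairs containing a non-standard symbol are skipped (an assumption).
    """
    counts = [[0.0] * 20 for _ in range(20)]
    total = 0
    for x, y in zip(seq, seq[1:]):
        if x in INDEX and y in INDEX:
            counts[INDEX[x]][INDEX[y]] += 1
            total += 1
    if total:
        counts = [[c / total for c in row] for row in counts]
    return counts
```

Flattening the matrix row by row gives a fixed-length vector, so proteins of any length can be fed to the same map.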
44. New data structures for analyzing frequent factors in strings
- Author
-
Rafael Morales-Bueno and Manuel Baena-García
- Subjects
Sequence, Computer science, Trie, Biosequence, Process (computing), String searching algorithm, Data mining, Heuristics, computer.software_genre, Data structure, computer, Computer Science::Databases, Natural language
- Abstract
Discovering frequent factors in long strings is an important problem in many applications, such as biosequence mining. In classical approaches, the algorithms process a vast database of small strings. In this paper, however, we analyze a small database of long strings. The main difference lies in the high number of patterns to analyze. To tackle the problem, we have developed a new algorithm for discovering frequent factors in long strings. This algorithm uses a new data structure to arrange nodes in a trie. A positioning matrix is defined as a new positioning strategy. By using positioning matrices, we can apply advanced pruning heuristics in a trie at minimal computational cost. The positioning matrices let us process strings including Short Tandem Repeats and calculate different interestingness measures efficiently. The algorithm has been successfully used in natural language and biological sequence contexts.
- Published
- 2011
- Full Text
- View/download PDF
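To make the notion of a frequent factor concrete, here is a minimal level-wise sketch: a factor of length L is extended only if its length L-1 prefix was itself frequent, which mirrors growing a trie breadth-first and pruning infrequent nodes. It does not implement the positioning matrices described in the abstract above; the function name and parameters are hypothetical.

```python
from collections import defaultdict


def frequent_factors(text, min_count, max_len):
    """Return {factor: count} for factors (substrings) of length 1..max_len
    that occur at least min_count times in text, counting overlaps."""
    result = {}
    # Map each currently frequent factor to the positions where it starts.
    current = {"": list(range(len(text)))}
    for length in range(1, max_len + 1):
        extended = defaultdict(list)
        for starts in current.values():
            for s in starts:
                end = s + length
                if end <= len(text):
                    extended[text[s:end]].append(s)
        # Prune: only frequent factors survive to be extended further.
        current = {f: p for f, p in extended.items() if len(p) >= min_count}
        if not current:
            break
        result.update({f: len(p) for f, p in current.items()})
    return result
```

On the string `abababa` with `min_count=3`, the factor `bab` occurs only twice and is pruned, so no longer factor starting with `bab` is ever examined.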
45. Bloom Filter Performance on Graphics Engines
- Author
-
Jeremy Buhler, Lin Ma, Mark A. Franklin, and Roger D. Chamberlain
- Subjects
Set (abstract data type), Computer science, Biosequence, Parallel computing, Bloom filter, Data mining, Graphics, computer.software_genre, Throughput (business), computer
- Abstract
Bloom filters are a probabilistic technique for large-scale set membership tests. They exhibit no false negative test results but are susceptible to false positive results. They are well-suited to both large sets and large numbers of membership tests. We implement the Bloom filters present in an accelerated version of BLAST, a genome biosequence alignment application, on NVIDIA GPUs and develop an analytic performance model that helps potential users of Bloom filters to quantify the inherent tradeoffs between throughput and false positive rates.
- Published
- 2011
- Full Text
- View/download PDF
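The no-false-negatives property described in the abstract above can be seen in a minimal CPU-side sketch of a Bloom filter. This is not the GPU implementation from the paper; the class name, the use of SHA-256 with double hashing to derive bit positions, and the helper for the standard approximate false-positive rate are illustrative assumptions.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from one digest via double hashing (an assumption).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def approx_false_positive_rate(num_bits, num_hashes, num_items):
    """Standard approximation (1 - e^(-k n / m))^k for the false-positive rate."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes
```

A usage sketch: build `bf = BloomFilter(1024, 3)`, call `bf.add(kmer)` for each stored word, then test `kmer in bf`. Increasing `num_items` for a fixed `num_bits` raises the approximate false-positive rate, which is the kind of tradeoff the abstract refers to.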
46. Autotuned parallel I/O for highly scalable biosequence analysis
- Author
-
Qing Liu, Shirley Moore, Haihang You, and Bhanu Rekapalli
- Subjects
Identification (information), Speedup, Knowledge extraction, Computer science, Distributed computing, Scalability, Biosequence, Parallel computing, Supercomputer, Massively parallel, Parallel I/O
- Abstract
In recent years, the rate of genomic sequence generation has increased dramatically due to significant advances in sequencing technology. Genomic data are accumulating at an exponential rate in databases around the world, and rapid analysis techniques will enhance knowledge discovery in the fields of medicine and biotechnology. Analysis of such growing sequence databases demands tremendous computational power that can only be provided by massively parallel computers. Improving the performance and scalability of bioinformatics tools thus becomes a critical step in the quest to transform ever-growing raw genomic data into biological knowledge. In this paper we describe an efficient parallel implementation of a profile hidden Markov model (profile HMM) code used for protein domain identification, along with auto-tuned parallel I/O optimization. Experimental results show linear speedup with increasing numbers of computing cores on a supercomputer, allowing the domain identification of millions of proteins in a few minutes using hundreds of thousands of computing cores.
- Published
- 2011
- Full Text
- View/download PDF
47. Application of Recurrence Quantification Analysis (RQA) in Biosequence Pattern Recognition
- Author
-
Achuthsankar S. Nair, Alessandro Giuliani, Pawan K. Dhar, Saritha Namboodiri, and Chandra S. Verma
- Subjects
Sequence, biology, Computer science, business.industry, Allosteric regulation, Biosequence, Active site, Pattern recognition, Ligand (biochemistry), Recurrence quantification analysis, Pattern recognition (psychology), Feature (machine learning), biology.protein, Artificial intelligence, business
- Abstract
Recurrence Quantification Analysis is a relatively new pattern recognition tool well suited to short, non-linear and non-stationary systems. It is designed to detect recurrence patterns that are expressed as a set of Recurrence Quantification variables. In our work we applied this tool to an allosteric protein system to identify residues involved in transmitting the structural rearrangements that result from allostery. Allostery is the phenomenon of changes in the structure and activity of proteins that appear as a consequence of ligand binding at sites other than the active site. Here, we scrutinized the sequence landscape of the ‘ras’ protein by partitioning its residues into windows of equal size. An 11-element characteristic vector, comprising 10 features extracted from the Recurrence Quantification Analysis along with a feature relating to allosteric involvement, was defined for each windowed sequence set. By applying multivariate statistical analysis tools, including Principal Component Analysis and Multiple Regression Analysis, to the characteristic feature vectors extracted from all the windowed sequence sets, we could develop a significant linear model to identify the residues that are critical to allostery of the ‘ras’ protein.
- Published
- 2011
- Full Text
- View/download PDF
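As a rough illustration of what a Recurrence Quantification variable is, the sketch below builds a recurrence matrix for a numeric series and computes the recurrence rate, the simplest such variable. It assumes the sequence has already been mapped to numbers (for example, by a per-residue property scale) and uses no time-delay embedding. Both are simplifying assumptions, and the ten features used in the study above are not reproduced.

```python
def recurrence_matrix(series, radius):
    """R[i][j] = 1 when points i and j lie within radius of each other."""
    n = len(series)
    return [
        [1 if abs(series[i] - series[j]) <= radius else 0 for j in range(n)]
        for i in range(n)
    ]


def recurrence_rate(series, radius):
    """Fraction of recurrent point pairs, excluding the trivial main diagonal."""
    n = len(series)
    if n < 2:
        return 0.0
    rm = recurrence_matrix(series, radius)
    recurrent = sum(rm[i][j] for i in range(n) for j in range(n) if i != j)
    return recurrent / (n * (n - 1))
```

Computing such a value for each equal-sized window of a sequence yields one entry of a per-window feature vector.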
48. Nanofabric Power Analysis: Biosequence Alignment Case of Study
- Author
-
Luca Gaetano Amaru, Stefano Frache, Maurizio Zamboni, and Mariagrazia Graziano
- Subjects
Nanofabrics, Power analysis, Computer engineering, Application-specific integrated circuit, Computer science, Logic gate, Overhead (engineering), Biosequence, Electronic engineering
- Abstract
The promising features of nanoscale array structures pave the way for applications such as biosequence alignment, which can currently be addressed only at the price of a huge overhead in area and power dissipation. Nanofabrics, once the technology is mature enough, are expected to overcome these limits by a wide margin and to provide a clear advantage in processing capabilities.
- Published
- 2011
49. Motif Discovery and Data Mining in Bioinformatics
- Author
-
Qader, Nooruldeen and Al-Khafaji, Hussein Keitan
- Abstract
Bioinformatics analyses huge amounts of biological data that demand in-depth understanding. Data mining research, in turn, develops methods for discovering motifs in biosequences. Motif discovery involves both benefits and challenges. We show how the two fields, data mining and Bioinformatics, can be bridged for successful mining of biological data. We found that the motivation and justification factors favor naturalistic research methods for Bioinformatics, because a naturalistic method depends on real data. The method empowers Bioinformatics techniques to handle the true properties of the data and reduces assumptions about unmodeled or undiscovered biodata phenomena. The empowerment comes from recognizing and understanding biodata properties and processes.
- Published
- 2014
50. A system for pattern matching applications on biosequences
- Author
-
Gene Myers and Gerhard Mehldau
- Subjects
Statistics and Probability, Theoretical computer science, Base Sequence, Syntax (programming languages), Semantics (computer science), Computer science, Molecular Sequence Data, Biosequence, Nucleic acid sequence, Biochemistry, DNA sequencing, Pattern Recognition, Automated, Computer Science Applications, User-Computer Interface, Computational Mathematics, chemistry.chemical_compound, Computational Theory and Mathematics, chemistry, Feature (machine learning), Humans, Pattern matching, Molecular Biology, Software, DNA
- Abstract
ANREP is a system for finding matches to patterns composed of (i) spacing constraints called 'spacers', and (ii) approximate matches to 'motifs' that are, recursively, patterns composed of 'atomic' symbols. A user specifies such patterns via a declarative, free-format and strongly typed language called A that is presented here in a tutorial style through a series of progressively more complex examples. The sample patterns are for protein and DNA sequences, the application domain for which ANREP was specifically created. ANREP provides a unified framework for almost all previously proposed biosequence patterns and extends them by providing approximate matching, a feature heretofore unavailable except for the limited case of individual sequences. The performance of ANREP is discussed and an appendix gives a concise specification of syntax and semantics. A portable C software package implementing ANREP is available via anonymous remote file transfer.
- Published
- 1993
- Full Text
- View/download PDF
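The idea of a pattern built from approximately matched motifs separated by spacers, as described in the abstract above, can be sketched for the simplest case of two motifs and one spacer. This is not ANREP or its pattern language; the function names are hypothetical, and approximate matching is restricted to mismatches (Hamming distance) as a simplifying assumption.

```python
def hamming_matches(text, motif, max_mismatches):
    """Start positions where motif matches text with at most max_mismatches substitutions."""
    hits = []
    for i in range(len(text) - len(motif) + 1):
        window = text[i:i + len(motif)]
        mismatches = sum(1 for a, b in zip(window, motif) if a != b)
        if mismatches <= max_mismatches:
            hits.append(i)
    return hits


def find_spaced_pair(text, motif1, motif2, min_gap, max_gap, max_mismatches=0):
    """Find (start1, start2) pairs where motif2 begins min_gap..max_gap
    symbols after the end of a match to motif1."""
    starts2 = set(hamming_matches(text, motif2, max_mismatches))
    results = []
    for s1 in hamming_matches(text, motif1, max_mismatches):
        end1 = s1 + len(motif1)
        for gap in range(min_gap, max_gap + 1):
            if end1 + gap in starts2:
                results.append((s1, end1 + gap))
    return results
```

For example, `find_spaced_pair("AAACCGGTTT", "AAA", "TTT", 3, 5)` reports the pair starting at positions 0 and 7, since the two motifs are separated by a spacer of four symbols.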