25 results on '"Goethals, Bart"'
Search Results
2. Pattern Based Sequence Classification.
- Author
-
Zhou, Cheng, Cule, Boris, and Goethals, Bart
- Subjects
DATA mining; MATHEMATICAL sequences; MACHINE learning; CLASSIFICATION rule mining; ALGORITHMS
- Abstract
Sequence classification is an important task in data mining. We address the problem of sequence classification using rules composed of interesting patterns found in a dataset of labelled sequences and accompanying class labels. We measure the interestingness of a pattern in a given class of sequences by combining the cohesion and the support of the pattern. We use the discovered patterns to generate confident classification rules, and present two different ways of building a classifier. The first classifier is based on an improved version of the existing method of classification based on association rules, while the second ranks the rules by first measuring their value specific to the new data object. Experimental results show that our rule based classifiers outperform existing comparable classifiers in terms of accuracy and stability. Additionally, we test a number of pattern feature based models that use different kinds of patterns as features to represent each sequence as a feature vector. We then apply a variety of machine learning algorithms for sequence classification, experimentally demonstrating that the patterns we discover represent the sequences well, and prove effective for the classification task. [ABSTRACT FROM PUBLISHER]
- Published
- 2016
3. Quick Inclusion-Exclusion.
- Author
-
Bonchi, Francesco, Boulicaut, Jean-Francois, Calders, Toon, and Goethals, Bart
- Abstract
Many data mining algorithms make use of the well-known Inclusion-Exclusion principle. As a consequence, using this principle efficiently is crucial for the success of all these algorithms. Especially in the context of condensed representations, such as NDI, and in computing interesting measures, a quick inclusion-exclusion algorithm can be crucial for the performance. In this paper, we give an overview of several algorithms that depend on the inclusion-exclusion principle and propose an efficient algorithm to use it and evaluate its complexity. The theoretically obtained results are supported by experimental evaluation of the quick IE technique in isolation, and of an example application. [ABSTRACT FROM AUTHOR]
- Published
- 2006
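The abstract of result 3 does not describe the quick IE algorithm itself. As a minimal sketch of the inclusion-exclusion principle it builds on, the Python below derives the support of an itemset with negated items from the supports of plain itemsets. The function names and the toy database are hypothetical, and this naive enumeration is not the paper's optimized method.

```python
from itertools import combinations

def support(db, itemset):
    """Number of transactions that contain every item of itemset."""
    items = set(itemset)
    return sum(1 for t in db if items <= t)

def support_with_negations(db, present, absent):
    """Support of 'present' with every item of 'absent' missing.

    Inclusion-exclusion over all subsets J of 'absent':
        supp(P, not A) = sum_J (-1)^|J| * supp(P union J)
    """
    total = 0
    for k in range(len(absent) + 1):
        for subset in combinations(absent, k):
            total += (-1) ** k * support(db, set(present) | set(subset))
    return total

# Toy transaction database (assumption, for illustration only).
db = [{"a", "b"}, {"a"}, {"a", "b", "c"}, {"b", "c"}, {"a", "c"}]

# Direct count of transactions with "a" but neither "b" nor "c".
direct = sum(1 for t in db if "a" in t and "b" not in t and "c" not in t)
via_ie = support_with_negations(db, ["a"], ["b", "c"])
```

Here `via_ie` is computed as supp(a) - supp(ab) - supp(ac) + supp(abc) = 4 - 2 - 2 + 1, which matches the direct count. The loop visits 2^|absent| subsets, which is the cost a faster inclusion-exclusion technique would aim to reduce.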
4. Integrating Pattern Mining in Relational Databases.
- Author
-
Fürnkranz, Johannes, Scheffer, Tobias, Spiliopoulou, Myra, Calders, Toon, Goethals, Bart, and Prado, Adriana
- Abstract
Almost a decade ago, Imielinski and Mannila introduced the notion of Inductive Databases to manage KDD applications just as DBMSs successfully manage business applications. The goal is to follow one of the key DBMS paradigms: building optimizing compilers for ad hoc queries. During the past decade, several researchers proposed extensions to the popular relational query language, SQL, in order to express such mining queries. In this paper, we propose a completely different and new approach, which extends the DBMS itself, not the query language, and integrates the mining algorithms into the database query optimizer. To this end, we introduce virtual mining views, which can be queried as if they were traditional relational tables (or views). Every time the database system accesses one of these virtual mining views, a mining algorithm is triggered to materialize all tuples needed to answer the query. We show how this can be done effectively for the popular association rule and frequent set mining problems. [ABSTRACT FROM AUTHOR]
- Published
- 2006
5. Implicit Enumeration of Patterns.
- Author
-
Goethals, Bart, Siebes, Arno, and Mielikäinen, Taneli
- Abstract
Condensed representations of pattern collections have been recognized to be important building blocks of inductive databases, a promising theoretical framework for data mining, and recently they have been studied actively. However, there has not been much research on how condensed representations should actually be represented. In this paper we study implicit enumeration of patterns, i.e., how to represent pattern collections by listing only the interestingness values of the patterns. The main problem is that the pattern classes are typically huge compared to the collections of interesting patterns in them. We solve this problem by choosing a good ordering of listing the patterns in the class such that the ordering admits effective pruning and prediction of the interestingness values of the patterns. This representation of interestingness values enables us to quantify how surprising a pattern is in the collection. Furthermore, the encoding of the interestingness values reflects our understanding of the pattern collection. Thus the size of the encoding can be used to evaluate the correctness of our assumptions about the pattern collection and the interestingness measure. [ABSTRACT FROM AUTHOR]
- Published
- 2005
6. Condensed Representation of EPs and Patterns Quantified by Frequency-Based Measures.
- Author
-
Goethals, Bart, Siebes, Arno, Soulet, Arnaud, Crémilleux, Bruno, and Rioult, François
- Abstract
Emerging patterns (EPs) are associations of features whose frequencies increase significantly from one class to another. They have been proven useful to build powerful classifiers and to help establishing diagnosis. Because of the huge search space, mining and representing EPs is a hard and complex task for large datasets. Thanks to the use of recent results on condensed representations of frequent closed patterns, we propose here an exact condensed representation of EPs (i.e., all EPs and their growth rates). From this condensed representation, we give a method to provide interesting EPs, in fact those with the highest growth rates. We call strong emerging patterns (SEPs) these EPs. We also highlight a property characterizing the jumping emerging patterns. Experiments quantify the interests of SEPs (smaller number, ability to extract longer and less frequent patterns) and show their usefulness (in collaboration with the Philips company, SEPs successfully enabled to identify the failures of a production chain of silicon plates). These concepts of condensed representation and "strong patterns" with respect to a measure are generalized to other interestingness measures based on frequencies. Keywords: Emerging patterns, condensed representations, closed patterns, characterization of classes, frequency-based measures. [ABSTRACT FROM AUTHOR]
- Published
- 2005
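Result 6 characterizes emerging patterns by their growth rates. A minimal sketch, assuming the commonly used definition of growth rate as the ratio of a pattern's frequency in a target class to its frequency in a background class, with an infinite rate for patterns absent from the background (the jumping emerging patterns the abstract mentions). The helper names and toy data are hypothetical, and the condensed representation itself is not reproduced here.

```python
import math

def frequency(pattern, dataset):
    """Fraction of rows in dataset that contain every feature of pattern."""
    if not dataset:
        return 0.0
    return sum(1 for row in dataset if pattern <= row) / len(dataset)

def growth_rate(pattern, background, target):
    """Frequency in target divided by frequency in background.

    Assumed conventions: 0/0 gives 0.0, and a pattern that never occurs
    in the background but does occur in the target gives infinity.
    """
    f_bg = frequency(pattern, background)
    f_tg = frequency(pattern, target)
    if f_bg == 0.0:
        return math.inf if f_tg > 0.0 else 0.0
    return f_tg / f_bg

# Toy classes (assumption): each row is a set of single-letter features.
background = [frozenset("ab"), frozenset("a"), frozenset("bc"), frozenset("c")]
target = [frozenset("ab"), frozenset("abc"), frozenset("ab"), frozenset("cd")]

rate_ab = growth_rate(frozenset("ab"), background, target)  # 0.75 / 0.25
rate_d = growth_rate(frozenset("d"), background, target)    # absent from background
```

Under these assumptions, ranking patterns by `growth_rate` and keeping the highest values corresponds to the selection of strong emerging patterns described in the abstract.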
7. Models and Indices for Integrating Unstructured Data with a Relational Database.
- Author
-
Goethals, Bart, Siebes, Arno, and Sarawagi, Sunita
- Abstract
Database systems are islands of structure in a sea of unstructured data sources. Several real-world applications now need to create bridges for smooth integration of semi-structured sources with existing structured databases for seamless querying. This integration requires extracting structured column values from the unstructured source and mapping them to known database entities. Existing methods of data integration do not effectively exploit the wealth of information available in multi-relational entities. We present statistical models for co-reference resolution and information extraction in a database setting. We then go over the performance challenges of training and applying these models efficiently over very large databases. This requires us to break open a black box statistical model and extract predicates over indexable attributes of the database. We show how to extract such predicates for several classification models, including naive Bayes classifiers and support vector machines. We extend these indexing methods for supporting similarity predicates needed during data integration. [ABSTRACT FROM AUTHOR]
- Published
- 2005
8. An Automata Approach to Pattern Collections.
- Author
-
Goethals, Bart, Siebes, Arno, and Mielikäinen, Taneli
- Abstract
Condensed representations of pattern collections have been recognized to be important building blocks of inductive databases, a promising theoretical framework for data mining, and recently they have been studied actively. However, there has not been much research on how condensed representations should actually be represented. In this paper we study how condensed representations of frequent itemsets can be concretely represented: we propose the use of deterministic finite automata to represent pattern collections and study the properties of the automata representation. The automata representation supports visualization of the patterns in the collection and clustering of the patterns based on their structural properties and interestingness values. Furthermore, we show experimentally that finite automata provide a space-efficient way to represent itemset collections. [ABSTRACT FROM AUTHOR]
- Published
- 2005
9. An Efficient Algorithm for Mining String Databases Under Constraints.
- Author
-
Goethals, Bart, Siebes, Arno, Lee, Sau Dan, and De Raedt, Luc
- Abstract
We study the problem of mining substring patterns from string databases. Patterns are selected using a conjunction of monotonic and anti-monotonic predicates. Based on the earlier introduced version space tree data structure, a novel algorithm for discovering substring patterns is introduced. It has the nice property of requiring only one database scan, which makes it highly scalable and applicable in distributed environments, where the data are not necessarily stored in local memory or disk. The algorithm is experimentally compared to a previously introduced algorithm in the same setting. [ABSTRACT FROM AUTHOR]
- Published
- 2005
10. Database Transposition for Constrained (Closed) Pattern Mining.
- Author
-
Goethals, Bart, Siebes, Arno, Jeudy, Baptiste, and Rioult, François
- Abstract
Recently, different works proposed a new way to mine patterns in databases with pathological size. For example, experiments in genome biology usually provide databases with thousands of attributes (genes) but only tens of objects (experiments). In this case, mining the "transposed" database runs through a smaller search space, and the Galois connection allows to infer the closed patterns of the original database. We focus here on constrained pattern mining for those unusual databases and give a theoretical framework for database and constraint transposition. We discuss the properties of constraint transposition and look into classical constraints. We then address the problem of generating the closed patterns of the original database satisfying the constraint, starting from those mined in the "transposed" database. Finally, we show how to generate all the patterns satisfying the constraint from the closed ones. [ABSTRACT FROM AUTHOR]
- Published
- 2005
11. Mining Interesting XML-Enabled Association Rules with Templates.
- Author
-
Goethals, Bart, Siebes, Arno, Feng, Ling, and Dillon, Tharam
- Abstract
XML-enabled association rule framework [FDWC03] extends the notion of associated items to XML fragments to present associations among trees rather than simple-structured items of atomic values. They are more flexible and powerful in representing both simple and complex structured association relationships inherent in XML data. Compared with traditional association mining in the well-structured world, mining from XML data, however, is confronted with more challenges due to the inherent flexibilities of XML in both structure and semantics. The primary challenges include 1) a more complicated hierarchical data structure; 2) an ordered data context; and 3) a much bigger data size. In order to make XML-enabled association rule mining truly practical and computationally tractable, in this study, we present a template model to help users specify the interesting XML-enabled associations to be mined. Techniques for template-guided mining of association rules from large XML data are also described in the paper. We demonstrate the effectiveness of these techniques through a set of experiments on both synthetic and real-life data. [ABSTRACT FROM AUTHOR]
- Published
- 2005
12. Theoretical Bounds on the Size of Condensed Representations.
- Author
-
Goethals, Bart, Siebes, Arno, Dexters, Nele, and Calders, Toon
- Abstract
Recent studies demonstrate the usefulness of condensed representations as a semantic compression technique for the frequent itemsets. Especially in inductive databases, condensed representations are a useful tool as an intermediate format to support exploration of the itemset space. In this paper we establish theoretical upper bounds on the maximal size of an itemset in different condensed representations. A central notion in the development of the bounds are the l-free sets, that form the basis of many well-known representations. We will bound the maximal cardinality of an l-free set based on the size of the database. More concrete, we compute a lower bound for the size of the database in terms of the size of the l-free set, and when the database size is smaller than this lower bound, we know that the set cannot be l-free. An efficient method for calculating the exact value of the bound, based on combinatorial identities of partial row sums, is presented. We also present preliminary results on a statistical approximation of the bound and we illustrate the results with some simulations. [ABSTRACT FROM AUTHOR]
- Published
- 2005
13. Mining Formal Concepts with a Bounded Number of Exceptions from Transactional Data.
- Author
-
Goethals, Bart, Siebes, Arno, Besson, Jérémy, Robardet, Céline, and Boulicaut, Jean-François
- Abstract
We are designing new data mining techniques on boolean contexts to identify a priori interesting bi-sets (i.e., sets of objects or transactions associated to sets of attributes or items). A typical important case concerns formal concept mining (i.e., maximal rectangles of true values or associated closed sets by means of the so-called Galois connection). It has been applied with some success to, e.g., gene expression data analysis where objects denote biological situations and attributes denote gene expression properties. However in such real-life application domains, it turns out that the Galois association is a too strong one when considering intrinsically noisy data. It is clear that strong associations that would however accept a bounded number of exceptions would be extremely useful. We study the new pattern domain of α/β concepts, i.e., consistent maximal bi-sets with less than α false values per row and less than β false values per column. We provide a complete algorithm that computes all the α/β concepts based on the generation of concept unions pruned thanks to anti-monotonic constraints. An experimental validation on synthetic data is given. It illustrates that more relevant associations can be discovered in noisy data. We also discuss a practical application in molecular biology that illustrates an incomplete but quite useful extraction when all the concepts that are needed beforehand can not be discovered. [ABSTRACT FROM AUTHOR]
- Published
- 2005
14. Constraint Relaxations for Discovering Unknown Sequential Patterns.
- Author
-
Goethals, Bart, Siebes, Arno, Antunes, Cláudia, and Oliveira, Arlindo L.
- Abstract
The main drawbacks of sequential pattern mining have been its lack of focus on user expectations and the high number of discovered patterns. However, the solution commonly accepted - the use of constraints - approximates the mining process to a verification of what are the frequent patterns among the specified ones, instead of the discovery of unknown and unexpected patterns. In this paper, we propose a new methodology to mine sequential patterns, keeping the focus on user expectations, without compromising the discovery of unknown patterns. Our methodology is based on the use of constraint relaxations, and it consists on using them to filter accepted patterns during the mining process. We propose a hierarchy of relaxations, applied to constraints expressed as context-free languages, classifying the existing relaxations (legal, valid and naïve, previously proposed), and proposing several new classes of relaxations. The new classes range from the approx and non-accepted, to the composition of different types of relaxations, like the approx-legal or the non-prefix-valid relaxations. Finally, we present a case study that shows the results achieved with the application of this methodology to the analysis of the curricular sequences of computer science students. [ABSTRACT FROM AUTHOR]
- Published
- 2005
15. On Private Scalar Product Computation for Privacy-Preserving Data Mining.
- Author
-
Park, Choonsik, Chee, Seongtaek, Goethals, Bart, Laur, Sven, Lipmaa, Helger, and Mielikäinen, Taneli
- Abstract
In mining and integrating data from multiple sources, there are many privacy and security issues. In several different contexts, the security of the full privacy-preserving data mining protocol depends on the security of the underlying private scalar product protocol. We show that two of the private scalar product protocols, one of which was proposed in a leading data mining conference, are insecure. We then describe a provably private scalar product protocol that is based on homomorphic encryption and improve its efficiency so that it can also be used on massive datasets. Keywords: Privacy-preserving data mining, private scalar product protocol, vertically partitioned frequent pattern mining. [ABSTRACT FROM AUTHOR]
- Published
- 2005
16. Mining frequent itemsets in a stream
- Author
-
Calders, Toon, Gillis, Joris J. M., Dexters, Nele, and Goethals, Bart
- Subjects
Computer. Automation; Measure (data warehouse); Computer science; Data stream mining; Sentiment analysis; InformationSystems_DATABASEMANAGEMENT; computer.software_genre; Frequent itemset mining; Datastream; Theory; Algorithm; Experiments; ComputingMethodologies_PATTERNRECOGNITION; Hardware and Architecture; Data mining; computer; Software; Information Systems
- Abstract
Mining frequent itemsets in a datastream proves to be a difficult problem, as itemsets arrive in rapid succession and storing parts of the stream is typically impossible. Nonetheless, it has many useful applications; e.g., opinion and sentiment analysis from social networks. Current stream mining algorithms are based on approximations. In earlier work, mining frequent items in a stream under the max-frequency measure proved to be effective for items. In this paper, we extended our work from items to itemsets. Firstly, an optimized incremental algorithm for mining frequent itemsets in a stream is presented. The algorithm maintains a very compact summary of the stream for selected itemsets. Secondly, we show that further compacting the summary is non-trivial. Thirdly, we establish a connection between the size of a summary and results from number theory. Fourthly, we report results of extensive experimentation, both of synthetic and real-world datasets, showing the efficiency of the algorithm both in terms of time and space. (C) 2012 Elsevier Ltd. All rights reserved.
- Published
- 2014
17. Generating Diverse Realistic Data Sets for Episode Mining
- Author
-
Zimmermann, Albrecht, Vreeken, Jilles, Ling, Charles, Zaki, Mohammed Javeed, Siebes, Arno, Yu, Jeffrey Xu, Goethals, Bart, Webb, Geoffrey I., and Wu, Xindong
- Subjects
Computer science; business.industry; Test data generation; Data stream mining; data generation; Context (language use); Concept mining; computer.software_genre; Machine learning; Temporal database; Data set; Text mining; Knowledge extraction; episode mining; Artificial intelligence; Data mining; Data pre-processing; business; computer; Generator (mathematics)
- Abstract
Frequent episode mining has been proposed as a data mining task with the goal of recovering sequential patterns from temporal data sequences. While several episode mining approaches have been proposed in the last fifteen years, most of the developed techniques have not been evaluated on a common benchmark data set, limiting the insights gained from experimental evaluations. In particular, it is unclear how well episodes are actually being recovered, leaving an episode mining user without guidelines in the knowledge discovery process. One reason for this can be found in non-disclosure agreements that prevent real life data sets on which approaches have been evaluated from entering the public domain. But even easily accessible real life data sets would not allow to ascertain miners' abilities to identify underlying patterns. A solution to this problem can be seen in generating artificial data, which has the added advantage that patterns can be known, allowing to evaluate the accuracy of mined patterns. Based on insights and experiences stemming from consultations with industrial partners and work with real life data, we propose a data generator for the generation of diverse data sets that reflect realistic data characteristics. We discuss in detail which characteristics real life data can be expected to have and how our generator models them. Finally, we show that we can recreate artificial data that has been used in the literature, contrast it with real life data showing very different characteristics, and show how our generator can be used to create data with realistic characteristics. Published in: 12th IEEE International Conference on Data Mining Workshops (ICDM Workshops), Brussels, 10 Dec - 13 Dec 2012, pp. 611-618.
- Published
- 2012
18. An integrated workflow for robust alignment and simplified quantitative analysis of NMR spectrometry data
- Author
-
Vu, Trung Nghia, Valkenborg, Dirk, Smets, Koen, Verwaest, Kim A., Dommisse, Roger, Lemière, Filip, Verschoren, Alain, Goethals, Bart, and Laukens, Kris
- Subjects
Normalization (statistics); Magnetic Resonance Spectroscopy; Computer science; Fast Fourier transform; lcsh:Computer applications to medicine. Medical informatics; Mass spectrometry; Biochemistry; Workflow; Metabolomics; Structural Biology; medicine; Statistical inference; Cluster Analysis; lcsh:QH301-705.5; Biology; Molecular Biology; Bootstrapping (statistics); Computer. Automation; Analysis of Variance; medicine.diagnostic_test; Methodology Article; Applied Mathematics; Explained sum of squares; Experimental data; Magnetic resonance imaging; Nuclear magnetic resonance spectroscopy; Magnetic Resonance Imaging; Data science; Computer Science Applications; Hierarchical clustering; Chemistry; Tree (data structure); lcsh:Biology (General); lcsh:R858-859.7; Biochemical Research Methods; Biotechnology & Applied Microbiology; Mathematical & Computational Biology; peak alignment; mass-spectrometry; chromatographic data; cross-correlation; H1-NMR spectra; classification; algorithms; signals; Engineering sciences. Technology; Quantitative analysis (chemistry); Algorithm; Mathematics; Algorithms; Software
- Abstract
Background: Nuclear magnetic resonance spectroscopy (NMR) is a powerful technique to reveal and compare quantitative metabolic profiles of biological tissues. However, chemical and physical sample variations make the analysis of the data challenging, and typically require the application of a number of preprocessing steps prior to data interpretation. For example, noise reduction, normalization, baseline correction, peak picking, spectrum alignment and statistical analysis are indispensable components in any NMR analysis pipeline. Results: We introduce a novel suite of informatics tools for the quantitative analysis of NMR metabolomic profile data. The core of the processing cascade is a novel peak alignment algorithm, called hierarchical Cluster-based Peak Alignment (CluPA). The algorithm aligns a target spectrum to the reference spectrum in a top-down fashion by building a hierarchical cluster tree from peak lists of reference and target spectra and then dividing the spectra into smaller segments based on the most distant clusters of the tree. To reduce the computational time to estimate the spectral misalignment, the method makes use of Fast Fourier Transformation (FFT) cross-correlation. Since the method returns a high-quality alignment, we can propose a simple methodology to study the variability of the NMR spectra. For each aligned NMR data point the ratio of the between-group and within-group sum of squares (BW-ratio) is calculated to quantify the difference in variability between and within predefined groups of NMR spectra. This differential analysis is related to the calculation of the F-statistic or a one-way ANOVA, but without distributional assumptions. Statistical inference based on the BW-ratio is achieved by bootstrapping the null distribution from the experimental data. Conclusions: The workflow performance was evaluated using a previously published dataset. Correlation maps, spectral and grey scale plots show clear improvements in comparison to other methods, and the down-to-earth quantitative analysis works well for the CluPA-aligned spectra. The whole workflow is embedded into a modular and statistically sound framework that is implemented as an R package called "speaq" ("spectrum alignment and quantitation"), which is freely available from http://code.google.com/p/speaq/. TNV acknowledges support by a BOF interdisciplinary grant of the University of Antwerp. Koen Smets is supported by a Ph.D. Fellowship of the Research Foundation- Flanders (FWO). This work is further supported by an SBO grant (IWT-600450).
- Published
- 2011
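The BW-ratio mentioned in result 18 can be illustrated with a short Python sketch. It assumes the usual one-way ANOVA decomposition (group sizes times squared distance of group means from the overall mean, divided by the summed squared deviations within groups). The function names and the toy spectra are hypothetical; this is not the "speaq" R package's API, and the bootstrap step for inference is left out.

```python
def bw_ratio(values, groups):
    """Between-group / within-group sum of squares for one data point.

    values: intensities of all spectra at one aligned position.
    groups: group label of each spectrum, in the same order.
    """
    overall = sum(values) / len(values)
    by_group = {}
    for value, label in zip(values, groups):
        by_group.setdefault(label, []).append(value)

    between = 0.0
    within = 0.0
    for members in by_group.values():
        mean_g = sum(members) / len(members)
        between += len(members) * (mean_g - overall) ** 2
        within += sum((v - mean_g) ** 2 for v in members)

    if within == 0.0:
        # Assumed convention: no within-group spread at this position.
        return float("inf") if between > 0.0 else 0.0
    return between / within

def bw_profile(spectra, groups):
    """BW-ratio at every aligned position; spectra are equal-length lists."""
    return [bw_ratio(list(column), groups) for column in zip(*spectra)]

# Toy aligned spectra (assumption): four spectra, two positions, two groups.
spectra = [[1.0, 5.0], [3.0, 5.0], [5.0, 6.0], [7.0, 4.0]]
groups = ["A", "A", "B", "B"]
profile = bw_profile(spectra, groups)
```

In this toy data the first position separates the two groups (ratio 16 / 4) while the second does not (equal group means), so only the first would stand out in a differential analysis.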
19. Inductive Querying with Virtual Mining Views
- Author
-
Blockeel, H., Calders, T.G.K., Fromont, É., Goethals, B., Prado, A., Robardet, C., Dzeroski, S., and Panov, P.; Džeroski, Sašo, Goethals, Bart, and Panov, Panče (Eds.); European Project: 30147, IQ; Declarative Languages and Artificial Intelligence (DTAI); Université Catholique de Louvain = Catholic University of Louvain (UCL); Laboratoire Hubert Curien (LHC), Institut d'Optique Graduate School (IOGS)-Université Jean Monnet - Saint-Étienne (UJM)-Centre National de la Recherche Scientifique (CNRS); Advanced Database Research and Modelling (Adrem), University of Antwerp (UA); Data Mining and Machine Learning (DM2L); Laboratoire d'InfoRmatique en Image et Systèmes d'information (LIRIS), Université Lumière - Lyon 2 (UL2)-École Centrale de Lyon (ECL)-Université Claude Bernard Lyon 1 (UCBL)-Institut National des Sciences Appliquées de Lyon (INSA Lyon)-Université de Lyon-Centre National de la Recherche Scientifique (CNRS)
- Subjects
SQL; View; Computer science; Inductive databases; Concept mining; 02 engineering and technology; computer.software_genre; Query optimization; Constraint extraction; [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]; 020204 information systems; 0202 electrical engineering, electronic engineering, information engineering; Query by Example; Data mining; computer.programming_language; Computer. Automation; Information retrieval; Data stream mining; InformationSystems_DATABASEMANAGEMENT; data mining; inductive databases; 020201 artificial intelligence & image processing; Sargable; computer; RDF query language
- Abstract
An important motivation for the development of inductive databases and query languages for data mining is that such an approach will increase the flexibility with which data mining can be performed. By integrating data mining more closely into a database querying framework, separate steps such as data preprocessing, data mining, and postprocessing of the results, can all be handled using one query language. In this chapter, we compare 6 existing data mining query languages, all extensions of the standard relational query language SQL, from this point of view: how flexible are they with respect to the tasks they can be used for, and how easily can those tasks be performed? We verify whether and how these languages can be used to perform four prototypical data mining tasks in the domain of itemset and association rule mining, and summarize their stronger and weaker points. Besides offering a comparative evaluation of different data mining query languages, this chapter also provides a motivation for the next chapter, where a deeper integration of data mining into databases is proposed, one that does not rely on the development of a new query language, but where the structure of the database itself is extended.
- Published
- 2010
20. Predicting Gene Function using Predictive Clustering Trees
- Author
-
Vens, Celine, Blockeel, Hendrik, Struyf, Jan, Kocev, Dragi, Schietgat, Leander, Džeroski, Sašo, Goethals, Bart, and Panov, Pance
- Subjects
Hierarchy (mathematics); Computer science; business.industry; Decision tree learning; Function (mathematics); Machine learning; computer.software_genre; Task (project management); Tree (data structure); Tree structure; Protein function prediction; Artificial intelligence; Cluster analysis; business; computer
- Abstract
In this chapter, we show how the predictive clustering tree framework can be used to predict the functions of genes. The gene function prediction task is an example of a hierarchical multi-label classification (HMC) task: genes may have multiple functions and these functions are organized in a hierarchy. The hierarchy of functions can be such that each function has at most one parent (tree structure) or such that functions may have multiple parents (DAG structure). © 2010 Springer Science+Business Media, LLC. Published in: Inductive Databases and Constraint-Based Data Mining, pp. 365-387.
- Published
- 2010
21. Probabilistic Inductive Querying Using ProbLog
- Author
-
Kersting, Kristian, Santos Costa, Vítor, Kimmig, Angelika, Toivonen, Hannu, Gutmann, Bernd, De Raedt, Luc, Dzeroski, Saso, Goethals, Bart, and Panov, Pance; Finnish Centre of Excellence in Algorithmic Data Analysis Research (Algodan), Helsinki Institute for Information Technology, Department of Computer Science, and Discovery Research Group/Prof. Hannu Toivonen
- Subjects
Theoretical computer science; Binary decision diagram; business.industry; Computer science; Probabilistic logic; InformationSystems_DATABASEMANAGEMENT; ComputerApplications_COMPUTERSINOTHERSYSTEMS; 02 engineering and technology; Extension (predicate logic); Inductive reasoning; computer.software_genre; 113 Computer and information sciences; Prolog; Inductive logic programming; 020204 information systems; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; Artificial intelligence; business; computer; Logic programming; Natural language processing; Biological network; computer.programming_language
- Abstract
We study how probabilistic reasoning and inductive querying can be combined within ProbLog, a recent probabilistic extension of Prolog. ProbLog can be regarded as a database system that supports both probabilistic and inductive reasoning through a variety of querying mechanisms. After a short introduction to ProbLog, we provide a survey of the different types of inductive queries that ProbLog supports, and show how it can be applied to the mining of large biological networks. Part of: Inductive Databases and Constraint-Based Data Mining, pages 229-262. Status: published.
- Published
- 2010
22. Inductive Queries for a Drug Designing Robot Scientist
- Author
-
Jem J. Rowland, Ross D. King, Amanda Clare, Siegfried Nijssen, Andrew Sparkes, Jan Ramon, Amanda C. Schierz, Dzeroski, Saso, Goethals, Bart, Panov, Pance, and UCL - SST/ICTM/INGI - Pôle en ingénierie informatique
- Subjects
Discovery science ,ge ,business.industry ,Computer science ,aintel ,Machine learning ,computer.software_genre ,Drug design ,Business process discovery ,Knowledge extraction ,Inductive logic programming ,chem ,Key (cryptography) ,Robot ,Artificial intelligence ,Representation (mathematics) ,business ,Adaptation (computer science) ,Data mining ,computer - Abstract
It is increasingly clear that machine learning algorithms need to be integrated in an iterative scientific discovery loop, in which data is queried repeatedly by means of inductive queries and where the computer provides guidance to the experiments that are being performed. In this chapter, we summarise several key challenges in achieving this integration of machine learning and data mining algorithms in methods for the discovery of Quantitative Structure Activity Relationships (QSARs). We introduce the concept of a robot scientist, in which all steps of the discovery process are automated; we discuss the representation of molecular data such that knowledge discovery tools can analyse it, and we discuss the adaptation of machine learning and data mining algorithms to guide QSAR experiments. Part of: Inductive Databases and Constraint-Based Data Mining, pages 425-451. Status: published.
- Published
- 2010
- Full Text
- View/download PDF
23. A practical comparative study of data mining query languages
- Author
-
Hendrik Blockeel, Toon Calders, Adriana Prado, Elisa Fromont, Céline Robardet, Bart Goethals, Dzeroski, Saso, Goethals, Bart, Panov, Pance, Declarative Languages and Artificial Intelligence (DTAI), Université Catholique de Louvain = Catholic University of Louvain (UCL), Laboratoire Hubert Curien [Saint Etienne] (LHC), Institut d'Optique Graduate School (IOGS)-Université Jean Monnet [Saint-Étienne] (UJM)-Centre National de la Recherche Scientifique (CNRS), Advanced Database Research and Modelling (Adrem), University of Antwerp (UA), Data Mining and Machine Learning (DM2L), Laboratoire d'InfoRmatique en Image et Systèmes d'information (LIRIS), Institut National des Sciences Appliquées de Lyon (INSA Lyon), Institut National des Sciences Appliquées (INSA)-Université de Lyon-Institut National des Sciences Appliquées (INSA)-Université de Lyon-Centre National de la Recherche Scientifique (CNRS)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-École Centrale de Lyon (ECL), Université de Lyon-Université Lumière - Lyon 2 (UL2)-Institut National des Sciences Appliquées de Lyon (INSA Lyon), Université de Lyon-Université Lumière - Lyon 2 (UL2), Džeroski, Sašo, Goethals, Bart, Panov, Panče (Eds.), European Project: 30147,IQ, Laboratoire Hubert Curien (LHC), Institut d'Optique Graduate School (IOGS)-Université Jean Monnet - Saint-Étienne (UJM)-Centre National de la Recherche Scientifique (CNRS), Université Lumière - Lyon 2 (UL2)-École Centrale de Lyon (ECL), Université de Lyon-Université de Lyon-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Institut National des Sciences Appliquées de Lyon (INSA Lyon), Université de Lyon-Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Centre National de la Recherche Scientifique (CNRS)-Université Lumière - Lyon 2 (UL2)-École Centrale de Lyon (ECL), and Université de Lyon-Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Centre National de la Recherche Scientifique (CNRS)
- Subjects
View ,Computer science ,Inductive databases ,Concept mining ,02 engineering and technology ,Query optimization ,Query language ,computer.software_genre ,Constraint extraction ,[INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI] ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Query by Example ,Data mining ,computer.programming_language ,Computer. Automation ,Information retrieval ,Web search query ,Data stream mining ,InformationSystems_DATABASEMANAGEMENT ,data mining ,inductive databases ,020201 artificial intelligence & image processing ,computer ,RDF query language - Abstract
An important motivation for the development of inductive databases and query languages for data mining is that such an approach will increase the flexibility with which data mining can be performed. By integrating data mining more closely into a database querying framework, separate steps such as data preprocessing, data mining, and postprocessing of the results, can all be handled using one query language. In this chapter, we compare 6 existing data mining query languages, all extensions of the standard relational query language SQL, from this point of view: how flexible are they with respect to the tasks they can be used for, and how easily can those tasks be performed? We verify whether and how these languages can be used to perform four prototypical data mining tasks in the domain of itemset and association rule mining, and summarize their stronger and weaker points. Besides offering a comparative evaluation of different data mining query languages, this chapter also provides a motivation for the next chapter, where a deeper integration of data mining into databases is proposed, one that does not rely on the development of a new query language, but where the structure of the database itself is extended. Part of: Inductive Databases and Constraint-Based Data Mining, pages 59-77. Status: published.
- Published
- 2010
24. Mining frequent items in a stream using flexible windows
- Author
-
Nele Dexters, Bart Goethals, Toon Calders, GOETHALS, Bart, Calders, T., Dexters, N., and Databases and Hypermedia
- Subjects
Computer. Automation ,Computer science ,Data stream mining ,Window (computing) ,computer.software_genre ,Measure (mathematics) ,Theoretical Computer Science ,Current (stream) ,Artificial Intelligence ,Incremental algorithm ,Point (geometry) ,Computer Vision and Pattern Recognition ,Data mining ,State (computer science) ,computer - Abstract
In this paper we study the problem of finding frequent items in a continuous stream of items. A new frequency measure is introduced, based on a flexible window length. For a given item, its current frequency in the stream is defined as the maximal frequency over all windows from any point in the past until the current state. We study the properties of the new measure, and propose an incremental algorithm that can produce the current frequency of an item immediately at any time. It is shown experimentally that the memory requirements of the algorithm are extremely small for many different realistic data distributions.
- Published
- 2008
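The frequency measure described in the abstract of entry 24 can be illustrated with a naive sketch. This is not the incremental algorithm the paper proposes; it simply rescans the stream backwards from the current position and keeps the best ratio seen. The function name `max_frequency` and the sample stream are assumptions made for illustration only.

```python
from fractions import Fraction


def max_frequency(stream, item):
    """Maximal frequency of `item` over all windows ending at the current position.

    A window is any suffix stream[s:] of the items seen so far. The measure is
    the largest value of (occurrences of item in the window) / (window length).
    Naive O(n) rescan, for illustration; not the paper's incremental method.
    """
    best = Fraction(0)
    count = 0
    # Grow the window one step further into the past on each iteration.
    for length, x in enumerate(reversed(stream), start=1):
        if x == item:
            count += 1
        best = max(best, Fraction(count, length))
    return best


if __name__ == "__main__":
    stream = list("abaab")  # hypothetical stream, most recent item last
    print(max_frequency(stream, "a"))
    print(max_frequency(stream, "b"))
```

Exact fractions are used so that window ratios can be compared without floating-point rounding.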
25. On Private Scalar Product Computation for Privacy-Preserving Data Mining
- Author
-
Sven Laur, Helger Lipmaa, Taneli Mielikäinen, Bart Goethals, GOETHALS, Bart, Laur, Sven, Lipmaa, Helger, and Mielikäinen, Taneli
- Subjects
Information privacy ,business.industry ,Computer science ,Computation ,Scalar (mathematics) ,Homomorphic encryption ,Cryptography ,02 engineering and technology ,computer.software_genre ,Computer security ,Encryption ,Privacy preserving ,Information extraction ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Data mining ,business ,computer - Abstract
In mining and integrating data from multiple sources, there are many privacy and security issues. In several different contexts, the security of the full privacy-preserving data mining protocol depends on the security of the underlying private scalar product protocol. We show that two of the private scalar product protocols, one of which was proposed in a leading data mining conference, are insecure. We then describe a provably private scalar product protocol that is based on homomorphic encryption and improve its efficiency so that it can also be used on massive datasets.
- Published
- 2005
- Full Text
- View/download PDF
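The abstract of entry 25 mentions a scalar product protocol built on homomorphic encryption. The sketch below shows the general idea with a textbook Paillier-style additively homomorphic scheme; it is an assumption-laden illustration, not necessarily the protocol from the paper. The primes are toy-sized and insecure, no masking or share splitting is performed, and all function names (`keygen`, `encrypt`, `decrypt`, `encrypted_dot`) are hypothetical.

```python
import math
import secrets


def keygen(p=1009, q=1013):
    """Toy Paillier key pair with g = n + 1. The default primes are NOT secure."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse (Python 3.8+)
    return n, (lam, mu)


def encrypt(n, m):
    """Encrypt an integer m with 0 <= m < n under public key n."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random r in 1..n-1
        if math.gcd(r, n) == 1:
            break
    # (1 + n)^m mod n^2 equals 1 + m*n, which avoids one exponentiation.
    return (1 + m * n) % n2 * pow(r, n, n2) % n2


def decrypt(n, priv, c):
    """Recover the plaintext from ciphertext c using the private key."""
    lam, mu = priv
    u = pow(c, lam, n * n)
    return (u - 1) // n * mu % n


def encrypted_dot(n, cipher_x, y):
    """Combine Enc(x_i) with plaintext y_i >= 0 into Enc(sum of x_i * y_i).

    Multiplying ciphertexts adds plaintexts; raising a ciphertext to the
    power y_i multiplies its plaintext by y_i.
    """
    n2 = n * n
    acc = 1
    for c, yi in zip(cipher_x, y):
        acc = acc * pow(c, yi, n2) % n2
    return acc


if __name__ == "__main__":
    n, priv = keygen()
    x, y = [1, 2, 3], [4, 5, 6]             # each party's private vector
    cipher_x = [encrypt(n, v) for v in x]   # first party encrypts x
    c = encrypted_dot(n, cipher_x, y)       # second party never sees x
    print(decrypt(n, priv, c))              # first party decrypts the product
```

The result is only correct while the true scalar product stays below the modulus n, which is one reason realistic parameters are far larger than the toy values used here.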