163 results for '"mining methods and algorithms"'
Search Results
2. Using Subgroup Discovery to Relate Odor Pleasantness and Intensity to Peripheral Nervous System Reactions.
- Author
- Moranges, Maelle, Plantevit, Marc, and Bensafi, Moustafa
- Abstract
Activation of the autonomic nervous system is a primary characteristic of human hedonic responses to sensory stimuli. For smells, general tendencies of physiological reactions have been described using classical statistics. However, these physiological variations are generally not quantified precisely; each psychophysiological parameter has very often been studied separately, and individual variability has not been systematically considered. The current study presents an innovative approach based on data mining, whose goal is to extract knowledge from a dataset. This approach uses a subgroup discovery algorithm which allows extraction of rules that apply to as many olfactory stimuli and individuals as possible. These rules are described by intervals over a set of physiological attributes. The results allowed both quantifying how each physiological parameter relates to odor pleasantness and perceived intensity and describing the contribution of each individual to these rules. This approach can be applied to other fields of the affective sciences characterized by complex and heterogeneous datasets. [ABSTRACT FROM AUTHOR]
- Published
- 2023
3. Change pattern relationships in event logs.
- Author
- Cremerius, Jonas, Patzlaff, Hendrik, and Weske, Mathias
- Subjects
- BLOOD sugar measurement, PROCESS mining, HOSPITAL care, EXECUTIONS & executioners
- Abstract
Process mining utilises process execution data to discover and analyse business processes. Event logs represent process executions, providing information about the activities executed. In addition to generic event attributes like activity name and timestamp, events might contain domain-specific attributes, such as a blood sugar measurement in a healthcare environment. Many of these values change quite frequently during a typical process. We refer to those as dynamic event attributes. Change patterns can be derived from dynamic event attributes, describing whether the attribute values change from one activity to another. So far, change patterns can only be identified in an isolated manner, neglecting the chance of finding co-occurring change patterns. This paper provides an approach to identifying relationships between change patterns by utilising correlation methods from statistics. We applied the proposed technique to two event logs derived from the MIMIC-IV real-world dataset on hospitalisations in the US and evaluated the results with a medical expert. It turns out that relationships between change patterns can be detected within the same directly-follows or eventually-follows relation and even beyond that. Further, we identify unexpected relationships that occur only in certain parts of the process. Thus, the process perspective reveals novel insights into how dynamic event attributes change together during process execution. The approach is implemented in Python using the PM4Py framework. [ABSTRACT FROM AUTHOR]
- Published
- 2024
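As a hedged illustration of the correlation step described in this abstract, the sketch below computes a Pearson correlation between two invented per-case attribute deltas; the attribute names and values are assumptions, not the authors' PM4Py pipeline.

```python
# Hedged sketch of correlating two change patterns across process instances.
# The per-case deltas are invented; the paper extracts them from event logs.
import pandas as pd
from scipy.stats import pearsonr

# Change of two dynamic event attributes between two activities per case,
# e.g. the drop in blood glucose and in creatinine from admission to discharge.
changes = pd.DataFrame({
    "delta_glucose":    [-12.0, -5.5, -20.1, 3.2, -8.7],
    "delta_creatinine": [-0.3, -0.1, -0.6, 0.2, -0.2],
})

r, p = pearsonr(changes["delta_glucose"], changes["delta_creatinine"])
print(f"r = {r:.2f}, p = {p:.3f}")  # a large |r| suggests co-occurring change patterns
```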
4. Behavior Action Mining
- Author
- Peng Su, Daniel Zeng, and Huimin Zhao
- Subjects
- Business, decision support, knowledge and data engineering tools and techniques, mining methods and algorithms, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
Actionable behavioral rules suggest specific actions that may influence certain behavior in the stakeholders' best interest. Previous work on mining such rules assumed that all attributes are categorical and that numerical attributes have been discretized in advance. However, this assumption significantly reduces the solution space and thus hinders the potential of mining algorithms, especially when numerical attributes are prevalent. As numerical data are ubiquitous in business applications, there is a crucial need for new mining methodologies that can better leverage such data. To meet this need, in this paper we define a new data mining problem, named behavior action mining, as a continuous-variable optimization of the expected utility of an action. We then develop three approaches to solving this new problem, all of which use regression as their technical basis. Experimental results on a marketing dataset demonstrate the validity and superiority of the proposed approaches.
- Published
- 2019
5. Boosting Discrimination Information Based Document Clustering Using Consensus and Classification
- Author
- Ahmad Muqeem Sheri, Muhammad Aasim Rafique, Malik Tahir Hassan, Khurum Nazir Junejo, and Moongu Jeon
- Subjects
- Consensus clustering, discrimination information, document clustering, evidence combination, knowledge reuse, mining methods and algorithms, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
Good document clustering depends on an adequate choice of the term discrimination information measure (DIM). Making the right choice is empirical in nature: characteristics of the data in the documents help experts speculate about a viable solution, so a consistently suitable DIM is a mere conjecture and demands intelligent selection of the information measure. In this work, we propose an automated consensus-building measure based on a text classifier. Two distinct DIMs construct basic partitions of the documents and form base clusters. The consensus-building method uses the cluster information to find concordant documents and constitutes a dataset to train the text classifier. The classifier predicts labels for discordant documents from the earlier clustering stage and forms new clusters. Experiments on eight standard data sets test the efficacy of the proposed technique. The improvement observed by applying the proposed consensus clustering demonstrates its superiority over the individual results. Relative Risk (RR) and Measurement of Discrimination Information (MDI) are the two discrimination information measures used to obtain the base clustering solutions in our experiments.
- Published
- 2019
6. Multiangle P2P Borrower Characterization Analytics by Attributes Partition Considering Business Process.
- Author
- Liu, Shuaiqi and Wu, Sen
- Abstract
In research on P2P lending data, the study of borrower characteristics is of great value for identifying target customers and managing risk. Because of high dimensionality, mixed attributes, differing importance, and differing generation times of information, mining P2P lending data often yields results that fail to reflect the important borrower characteristics affecting the approval results and the approved loan amount. In this article, we are the first to propose an attributes partition of lending data that considers the business process in order to classify variables into different types. Furthermore, we propose a multiangle data mining method for lending data, based on this attributes partition, to discover the characteristics of P2P borrowers from multiple perspectives. Experimental results on a real dataset demonstrate that the method depicts the important characteristics of borrowers that affect the approval results and the loan amount, makes research on P2P borrower characteristics more comprehensive and specific, and provides new ideas for research on high-dimensional lending data. [ABSTRACT FROM AUTHOR]
- Published
- 2020
7. Recognition algorithm for cross-texting in text chat conversations.
- Author
- Lee, Da-Young and Cho, Hwan-Gue
- Subjects
- ONLINE chat, NATURAL language processing, DEEP learning, ALGORITHMS, KOREAN language, TEXT mining
- Abstract
With the development of the Internet and IT technology, short-text-based communication has become popular compared with voice-based communication. Chat-based communication enables the rapid, short and massive exchange of messages with many people, but it also creates new social problems. 'Cross-texting' is one of them: accidentally sending a text to an unintended person during concurrent conversations with multiple separate people. Cross-texting can be a serious problem in languages where respectful expressions are required. As text-based communication grows in popularity, it is crucial to prevent cross-texting by detecting it in advance, particularly in languages with honorific expressions such as Korean. In this paper, we propose two methods for detecting cross-texting with deep learning models. The first is the formal feature vector model, which models dialog by explicitly defining politeness and completeness features. The second is the graph2vec-based ChatGram-net model, which models dialog based on the syllable occurrence relationship. To evaluate detection performance, we suggest a method for generating cross-texting datasets from an actual messenger corpus. Experiments show that both proposed models detect cross-texting effectively and exceed the performance of the baseline models. [ABSTRACT FROM AUTHOR]
- Published
- 2024
8. XOnto-Apriori: An Effective Association Rule Mining Algorithm for Personalized Recommendation Systems
- Author
- Gim, Jangwon, Jung, Hanmin, Jeong, Do-Heon, Park, James J. (Jong Hyuk), editor, Stojmenovic, Ivan, editor, Jeong, Hwa Young, editor, and Yi, Gangman, editor
- Published
- 2015
9. Robotic process automation using process mining — A systematic literature review.
- Author
- El-Gharib, Najah Mary and Amyot, Daniel
- Subjects
- ROBOTIC process automation, PROCESS mining, SURGICAL robots
- Abstract
Process mining (PM) aims to construct, from event logs, process maps that can help discover, automate, improve, and monitor organizational processes. Robotic process automation (RPA) uses software robots to perform some tasks usually executed by humans. It is usually difficult to determine what processes and steps to automate, especially with RPA, and PM is seen as one way to address this difficulty. This paper aims to assess the applicability of process mining in accelerating and improving the implementation of RPA, along with the challenges encountered throughout the project lifecycle. A systematic literature review was conducted to examine the approaches where PM techniques were used to understand the as-is processes that can be automated with software robots. Seven databases were used to identify papers on this topic. A total of 32 papers, all published since 2018, were selected from 605 unique candidate papers and then analyzed. There is a steady increase in the number of publications in this domain, especially during the year 2022, which suggests a rising interest in the combined use of PM with RPA. The literature mainly focuses on methods to record the events that occur at the level of user interactions with the application, and on the preprocessing methods needed to discover routines with steps that can be automated. Important challenges are faced in preprocessing such event logs, and many lifecycle steps of automation projects are weakly supported by existing approaches, suggesting corresponding research areas in need of further attention. • Combining process mining (PM) and RPA offers unique process management opportunities. • PM techniques must better support discovering processes based on UI logs for RPA. • Tools need to better integrate PM and RPA features together, in a synergetic way. • Challenges common to PM and RPA remain, e.g., with data gathering and preprocessing. [ABSTRACT FROM AUTHOR]
- Published
- 2023
10. SOM++: Integration of Self-Organizing Map and K-Means++ Algorithms
- Author
- Dogan, Yunus, Birant, Derya, Kut, Alp, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Goebel, Randy, editor, Siekmann, Jörg, editor, Wahlster, Wolfgang, editor, and Perner, Petra, editor
- Published
- 2013
11. Experimental identification of hard data sets for classification and feature selection methods with insights on method selection.
- Author
- Luan, Cuiju and Dong, Guozhu
- Subjects
- BIG data, BENCHMARKING (Management), FEATURE selection, RANDOM forest algorithms, MINING methodology
- Abstract
The paper reports an experimentally identified list of benchmark data sets that are hard for representative classification and feature selection methods. This was done after systematically evaluating a total of 48 combinations of methods, involving eight state-of-the-art classification algorithms and six commonly used feature selection methods, on 129 data sets from the UCI repository (some data sets with known high classification accuracy were excluded). In this paper, a data set for classification is called hard if none of the 48 combinations can achieve an AUC over 0.8 and none of them can achieve an F-measure value over 0.8; it is called easy otherwise. A total of 15 of the 129 data sets were found to be hard in that sense. The paper also compares the performance of different methods and produces rankings of classification methods, separately on the hard data sets and on the easy data sets; it is the first to rank methods separately for hard and easy data sets. It turns out that the classifier rankings resulting from our experiments differ somewhat from those in the literature and hence offer new insights on method selection. Notably, the Random Forest method remains the best in all groups of experiments. [ABSTRACT FROM AUTHOR]
- Published
- 2018
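The hardness criterion above is easy to operationalize. The sketch below is a hedged stand-in that uses two scikit-learn estimators in place of the paper's 48 method combinations; the dataset is synthetic.

```python
# Sketch of the "hard data set" criterion with scikit-learn stand-ins for
# the 48 classifier/feature-selection combinations evaluated in the study.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
methods = [RandomForestClassifier(random_state=0),
           LogisticRegression(max_iter=1000)]

best_auc = max(cross_val_score(m, X, y, scoring="roc_auc", cv=5).mean() for m in methods)
best_f1 = max(cross_val_score(m, X, y, scoring="f1", cv=5).mean() for m in methods)

# Hard per the paper: no combination exceeds 0.8 on either AUC or F-measure.
print(f"best AUC={best_auc:.3f}, best F1={best_f1:.3f}, "
      f"hard={best_auc <= 0.8 and best_f1 <= 0.8}")
```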
12. On mining approximate and exact fault-tolerant frequent itemsets.
- Author
- Liu, Shengxin and Poon, Chung Keung
- Subjects
- DATA mining, FAULT-tolerant computing, ROBUST statistics, HEURISTIC algorithms, PROBLEM solving, NP-hard problems
- Abstract
Robust frequent itemset mining has attracted much attention due to the necessity of finding frequent patterns in noisy data in many applications. In this paper, we focus on a variant of robust frequent itemsets in which a small number of "faults" is allowed in each item and each supporting transaction. This problem is challenging since computing the fault-tolerant support count is NP-hard and the anti-monotone property does not hold when the number of allowable faults is proportional to the size of the itemset. We develop heuristic methods to solve an approximation version of the problem and propose speedup techniques for the exact problem. Experimental results show that our heuristic algorithms are substantially faster than the state-of-the-art exact algorithms while the error remains acceptable. In addition, the proposed speedup techniques substantially improve the efficiency of the exact algorithms. [ABSTRACT FROM AUTHOR]
- Published
- 2018
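To make fault-tolerant support concrete, here is a hedged brute-force sketch of a simplified per-transaction variant; the paper's definition additionally bounds faults per item, which this toy check omits.

```python
# Brute-force fault-tolerant support: a transaction FT-supports an itemset
# if it misses at most `delta` of its items. Simplified per-transaction
# variant; the paper also bounds the faults allowed per item.
def ft_support(itemset, transactions, delta=1):
    return sum(1 for t in transactions if len(itemset - t) <= delta)

db = [{"a", "b", "c"}, {"a", "c"}, {"b"}, {"a", "b"}]
print(ft_support({"a", "b", "c"}, db, delta=1))  # 3 transactions miss at most one item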
13. Using subgroup discovery to relate odor pleasantness and intensity to peripheral nervous system reactions
- Author
- Maelle Moranges, Marc Plantevit, Moustafa Bensafi, Université Claude Bernard Lyon 1 (UCBL), Université de Lyon, Laboratoire d'InfoRmatique en Image et Systèmes d'information (LIRIS), Université Lumière - Lyon 2 (UL2)-École Centrale de Lyon (ECL), Université de Lyon-Université de Lyon-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Institut National des Sciences Appliquées de Lyon (INSA Lyon), Université de Lyon-Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Centre National de la Recherche Scientifique (CNRS)
- Subjects
- Human-Computer Interaction, [SCCO] Cognitive science, Pattern analysis, Mining methods and algorithms, Physiological measures, [INFO] Computer Science [cs], Software
- Abstract
Activation of the autonomic nervous system is a primary characteristic of human hedonic responses to sensory stimuli. For smells, general tendencies of physiological reactions have been described using classical statistics. However, these physiological variations are generally not quantified precisely; each psychophysiological parameter has very often been studied separately, and individual variability has not been systematically considered. The current study presents an innovative approach based on data mining, whose goal is to extract knowledge from a dataset. This approach uses a subgroup discovery algorithm which allows extraction of rules that apply to as many olfactory stimuli and individuals as possible. These rules are described by intervals over a set of physiological attributes. The results allowed both quantifying how each physiological parameter relates to odor pleasantness and perceived intensity and describing the contribution of each individual to these rules. This approach can be applied to other fields of the affective sciences characterized by complex and heterogeneous datasets.
- Published
- 2022
14. Feature weighted clustering for user profiling.
- Author
- Cufoglu, Ayse, Lohi, Mahi, and Everiss, Colin
- Subjects
- INTERNET user profiling, WEB personalization, ALGORITHMS, CLUSTER analysis (Statistics), COMPUTER simulation
- Abstract
Personalization is the adaptation of services to fit the user's interests, characteristics and needs. The key to effective personalization is user profiling. Apart from traditional collaborative and content-based approaches, a number of classification and clustering algorithms have been used to classify user-related information and create user profiles, but they have not been able to achieve accurate profiles. In this paper, we present a new clustering algorithm, Multi-Dimensional Clustering (MDC), for user profiling. MDC is a version of the Instance-Based Learner (IBL) algorithm that assigns weights to feature values and considers these weights during clustering. Three feature weighting methods are proposed for MDC, and all three have been tested and evaluated. Simulations were conducted using two user profile datasets: a training set of 10,000 instances and a test set of 1,000 instances. These datasets reflect each user's personal information, preferences and interests. Additional simulations and comparisons with existing weighted and non-weighted instance-based algorithms were carried out to demonstrate the performance of the proposed algorithm. Experimental results using the user profile datasets demonstrate that the proposed algorithm has better clustering accuracy than the other algorithms. This work is based on the doctoral thesis of the corresponding author. [ABSTRACT FROM AUTHOR]
- Published
- 2017
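As a hedged illustration of the feature-weighting idea underlying MDC (not the paper's three weighting methods), a weighted distance can damp features whose raw scale would otherwise dominate:

```python
# Feature-weighted Euclidean distance of the kind weighted instance-based
# clustering builds on. Profiles and weights are invented for illustration.
import numpy as np

def weighted_dist(a, b, w):
    return np.sqrt(np.sum(w * (a - b) ** 2))

profiles = np.array([[25, 1.0],    # e.g. [age, interest score]
                     [30, 0.9],
                     [60, 0.1]])
weights = np.array([0.01, 1.0])    # damp the large-scale age feature

print(weighted_dist(profiles[0], profiles[1], weights))  # similar profiles
print(weighted_dist(profiles[0], profiles[2], weights))  # dissimilar profiles
```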
15. A hybrid framework for mining high-utility itemsets in a sparse transaction database.
- Author
- Dawar, Siddharth, Goyal, Vikram, and Bera, Debajyoti
- Subjects
- DATA mining, DATABASES, INVENTORY management systems, INVERSE document frequency, INFORMATION retrieval
- Abstract
High-utility itemset mining aims to find the itemsets whose utility is no less than a user-defined threshold in a transaction database. It is an emerging research area in the field of data mining, with important applications in inventory management, query recommendation, systems operations research, bio-medical analysis, etc. Currently known algorithms for this problem can be classified as either 1-phase or 2-phase algorithms. The 2-phase algorithms are typically tree-based: they generate candidate high-utility itemsets and verify them later. A tree data structure generates candidate high-utility itemsets quickly by storing an upper-bound utility estimate at each node. The 1-phase algorithms are typically inverted-list based or transaction-projection based and avoid generating candidate itemsets, since the inverted list and transaction projection allow computation of exact utility estimates. We propose a novel hybrid framework that combines a tree-based and an inverted-list based algorithm to efficiently mine high-utility itemsets; algorithms based on the framework can harness the benefits of both types. We report experimental results on real and synthetic datasets to demonstrate the effectiveness of our framework. [ABSTRACT FROM AUTHOR]
- Published
- 2017
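The quantity every algorithm in this abstract thresholds is the utility of an itemset over a transaction database. A toy computation, with invented quantities and profits, looks like this:

```python
# Utility of an itemset: sum of quantity * unit profit over all transactions
# that contain every item of the itemset. Data is invented for illustration.
transactions = [
    {"bread": (2, 1.0), "milk": (1, 2.5)},   # item -> (quantity, unit profit)
    {"bread": (1, 1.0), "butter": (3, 4.0)},
    {"milk": (2, 2.5), "butter": (1, 4.0)},
]

def utility(itemset, db):
    total = 0.0
    for t in db:
        if all(item in t for item in itemset):
            total += sum(t[item][0] * t[item][1] for item in itemset)
    return total

# Only the third transaction contains both items: 2*2.5 + 1*4.0 = 9.0
print(utility({"milk", "butter"}, transactions))
```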
16. Target-Based, Privacy Preserving, and Incremental Association Rule Mining.
- Author
- Ahluwalia, Madhu V., Gangopadhyay, Aryya, Chen, Zhiyuan, and Yesha, Yelena
- Abstract
We consider a special case in association rule mining where mining is conducted by a third party over data located at a central location that is updated from several source locations. The data at the central location is at rest while that flowing in through source locations is in motion. We impose some limitations on the source locations, so that the central target location tracks and privatizes changes and a third party mines the data incrementally. Our results show high efficiency, privacy and accuracy of rules for small to moderate updates in large volumes of data. We believe that the framework we develop is therefore applicable and valuable for securely mining big data. [ABSTRACT FROM PUBLISHER]
- Published
- 2017
17. Continuous Learning Graphical Knowledge Unit for Cluster Identification in High Density Data Sets.
- Author
- Adikaram, K. K. L. B., Hussein, Mohamed A., Effenberger, Mathias, and Becker, Thomas
- Subjects
- BIG data, BIT-mapped graphics, GRAPHICS processing units, REAL-time computing, PIXELS
- Abstract
Big data are visually cluttered by overlapping data points. Rather than removing, reducing or reformulating overlap, we propose a simple, effective and powerful technique for density cluster generation and visualization, where point marker (graphical symbol of a data point) overlap is exploited in an additive fashion in order to obtain bitmap data summaries in which clusters can be identified visually, aided by automatically generated contour lines. In the proposed method, the plotting area is a bitmap and the marker is a shape of more than one pixel. As the markers overlap, the red, green and blue (RGB) colour values of pixels in the shared region are added. Thus, a pixel of a 24-bit RGB bitmap can code up to 2^24 (over 16 million) overlaps. A higher number of overlaps at the same location gives the area a distinctive colour that can be identified by the naked eye. A bitmap is a matrix of colour values that can be represented as integers, and the proposed method updates this matrix while adding new points; the matrix can therefore be considered an up-to-date knowledge unit of the processed data. Results show the cluster generation, cluster identification, missing and out-of-range data visualization, and outlier detection capabilities of the newly proposed method. [ABSTRACT FROM AUTHOR]
- Published
- 2016
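A minimal sketch of the additive-overlap idea follows; it accumulates counts in a plain integer matrix rather than packing them into 24-bit RGB pixels as the paper does. The point cloud is synthetic.

```python
# Accumulate marker hits per pixel in an integer matrix so that dense
# clusters appear as large counts (the paper encodes these counts as
# 24-bit RGB pixel values instead).
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(loc=50, scale=5, size=(10_000, 2))

bitmap = np.zeros((100, 100), dtype=np.uint32)
for x, y in points:
    ix, iy = int(x), int(y)
    if 0 <= ix < 100 and 0 <= iy < 100:
        bitmap[iy, ix] += 1  # each overlapping marker increments the pixel

print(bitmap.max(), "overlaps at the densest pixel")
```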
18. Block-Constraint Robust Principal Component Analysis and its Application to Integrated Analysis of TCGA Data.
- Author
- Liu, Jin-Xing, Gao, Ying-Lian, Zheng, Chun-Hou, Xu, Yong, and Yu, Jiguo
- Abstract
The Cancer Genome Atlas (TCGA) dataset provides more opportunities to systematically and comprehensively study the biological mechanisms of cancer formation, growth and metastasis. Since the TCGA dataset includes heterogeneous data, mining meaningful information from it is a bioinformatics bottleneck. In this paper, to improve the performance of Robust Principal Component Analysis (RPCA) on these heterogeneous data, a modified RPCA-based method, Block-Constraint Robust Principal Component Analysis (BCRPCA), is proposed. Since different categories of data have different peculiarities, BCRPCA enforces different constraint intensities on different categories to improve the performance of RPCA. Firstly, the observation matrix of TCGA data is decomposed into two additive matrices A and S using BCRPCA. Secondly, we use a ranking scheme to evaluate every feature and project these features onto the genes. The genes with high scores are then identified as differentially expressed. The main contributions of this paper are as follows: firstly, it proposes, for the first time, the idea and method of BCRPCA to model TCGA data; secondly, it provides a BCRPCA-based framework for integrated analysis of TCGA data. The results show that our method is effective and suitable for analyzing these data. [ABSTRACT FROM PUBLISHER]
- Published
- 2016
19. Parallel community detection on large graphs with MapReduce and GraphChi.
- Author
- Moon, Seunghyeon, Lee, Jae-Gil, Kang, Minseo, Choy, Minsoo, and Lee, Jin-woo
- Subjects
- SOCIAL networks, DOCUMENT clustering, DATA mining, HIERARCHICAL clustering (Cluster analysis), ALGORITHMS
- Abstract
Community detection from social network data gains much attention from academia and industry since it has many real-world applications. The Girvan-Newman (GN) algorithm is a divisive hierarchical clustering algorithm for community detection and is regarded as one of the most popular algorithms. It exploits the concept of edge betweenness to divide a network into multiple communities. Though widely used, it has limitations in supporting large-scale networks since it needs to calculate the shortest path between every pair of vertices in a network. In this paper, we develop two parallel versions of the GN algorithm to support large-scale networks. First, we propose a new algorithm, the Shortest Path Betweenness MapReduce Algorithm (SPB-MRA), that utilizes the MapReduce model. Second, we propose another new algorithm, the Shortest Path Betweenness Vertex-Centric Algorithm (SPB-VCA), that utilizes the vertex-centric model. An approximation technique is also developed to further speed up community detection. We implemented SPB-MRA using Hadoop and SPB-VCA using GraphChi, and then evaluated the performance of SPB-MRA on Amazon EC2 instances and that of SPB-VCA on a single commodity PC. The evaluation results showed that the elapsed time of SPB-MRA decreased almost linearly as the number of reducers increased, that SPB-VCA on just a single PC outperformed SPB-MRA by 4-6 times, and that the approximation technique introduced negligible errors. [ABSTRACT FROM AUTHOR]
- Published
- 2016
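For reference, the serial Girvan-Newman baseline that the paper parallelizes can be run via networkx; this is not the authors' MapReduce or GraphChi implementation.

```python
# Serial Girvan-Newman community detection on a small benchmark graph.
import networkx as nx
from networkx.algorithms.community import girvan_newman

G = nx.karate_club_graph()
# Each step removes the highest edge-betweenness edges; take the first split.
first_split = next(girvan_newman(G))
print([sorted(c) for c in first_split])
```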
20. An Automatic and Clause-Based Approach to Learn Relations for Ontologies.
- Author
- THENMOZHI, D. and ARAVINDAN, CHANDRABOSE
- Subjects
- ONTOLOGIES (Information retrieval), KNOWLEDGE acquisition (Expert systems), SEMANTIC computing, SOFT computing, MACHINE learning
- Abstract
Ontology learning from text is a knowledge acquisition process that facilitates the construction of ontologies. Considerable research has addressed learning concepts and relations, especially acquiring semantic relations between concepts for a specific domain. However, most research contributions learn either taxonomic relations or semantic relations, but not both. Even the few works that address learning both types of relations deal only with simple sentences, resulting in low recall. Further, these approaches are semi-automatic, requiring either user feedback or domain-expert knowledge. In this paper, we propose a single framework, automatic and domain-independent, that helps in learning both taxonomic and non-taxonomic relations. We have developed a clause-based approach that automatically extracts relations for concepts from unstructured text documents. Our approach can handle complex sentences by identifying hidden triples present in the sentences. We evaluated our relation-learning methodology on the concepts specified by AGROVOC and the Open Directory Project, using a corpus of web documents. The precision, recall and F1-measure of our method were observed to be considerably higher than those of state-of-the-art methodologies for relation learning. [ABSTRACT FROM AUTHOR]
- Published
- 2016
21. A parameter-free KNN for rating prediction.
- Author
- Fopa, Medjeu, Gueye, Modou, Ndiaye, Samba, and Naacke, Hubert
- Subjects
- K-nearest neighbor classification, MATHEMATICAL optimization, RECOMMENDER systems, FORECASTING
- Abstract
Among the most popular collaborative filtering algorithms are methods based on the K nearest neighbors (KNN). In their basic operation, KNN methods consider a fixed number of neighbors to make recommendations. However, it is not easy to choose an appropriate number of neighbors, so it is generally fixed by calibration to avoid inappropriate values that would hurt recommendation accuracy. In the literature, some authors have addressed the problem of dynamically finding an appropriate number of neighbors, but they use additional parameters that limit their proposals because these parameters also require calibration. In this paper, we propose a parameter-free KNN method for rating prediction, able to dynamically select an appropriate number of neighbors to use. Experiments on four publicly available datasets demonstrate the efficiency of our proposal, which rivals state-of-the-art methods in their best configurations. • Parameter-free KNN for rating prediction. • Optimization of the KNN algorithm through the choice of neighbors. • Dynamic selection of an optimal number of neighbors. [ABSTRACT FROM AUTHOR]
- Published
- 2022
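For context, a plain fixed-k user-based KNN rating predictor, the baseline that the paper makes parameter-free, can be sketched as follows; the ratings matrix is invented.

```python
# Fixed-k user-based KNN rating prediction on a toy ratings matrix
# (0 means unrated). The paper's contribution is choosing k dynamically.
import numpy as np

ratings = np.array([[5, 3, 0, 1],
                    [4, 0, 0, 1],
                    [1, 1, 0, 5],
                    [1, 0, 0, 4]], dtype=float)

def predict(u, item, k=2):
    """Average the item's rating over the k most similar users who rated it."""
    sims = np.full(len(ratings), -np.inf)
    for v in range(len(ratings)):
        if v != u and ratings[v, item] > 0:
            sims[v] = np.corrcoef(ratings[u], ratings[v])[0, 1]
    neighbors = np.argsort(sims)[-k:]       # indices of the top-k similarities
    return ratings[neighbors, item].mean()

print(predict(u=1, item=1))  # predicted from users 0 and 2, who rated item 1
```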
22. Temporal expression extraction with extensive feature type selection and a posteriori label adjustment.
- Author
- Filannino, Michele and Nenadic, Goran
- Subjects
- FEATURE extraction, FEATURE selection, INFORMATION theory, NATURAL language processing, NOUN phrases (Grammar), MORPHOLOGY (Grammar)
- Abstract
The automatic extraction of temporal information from written texts is pivotal for many Natural Language Processing applications such as question answering, text summarisation and information retrieval. It allows filtering information and inferring temporal flows of events. This paper presents ManTIME, a general-domain temporal expression identification and normalisation system, and systematically explores the impact of different features and training corpora on performance. The identification phase combines conditional random fields with a post-processing pipeline, whereas the normalisation phase is carried out using NorMA, an open-source rule-based temporal normaliser. We investigate the performance variation with respect to different feature types. Specifically, we show that the use of WordNet-based features in the identification task negatively affects overall performance, and that there is no statistically significant difference from adding gazetteers, shallow parsing and propositional noun phrase labels on top of the morpho-lexical features. We also show that the use of silver data (alone or in addition to the human-annotated data) does not improve performance. We evaluate six combinations of training data and post-processing pipeline against the TempEval-3 benchmark test set. The best run achieved 0.95 (precision), 0.85 (recall) and 0.90 (F1) in the identification phase. Normalisation accuracies are 0.86 (for the type attribute) and 0.77 (for the value attribute). The proposed approach ranked 3rd in the TempEval-3 challenge (task A) as the best-performing machine-learning-based system among 21 participants. [ABSTRACT FROM AUTHOR]
- Published
- 2015
23. Efficient community identification and maintenance at multiple resolutions on distributed datastores.
- Author
- Aksu, Hidayet, Canim, Mustafa, Chang, Yuan-Chi, Korpeoglu, Ibrahim, and Ulusoy, Özgür
- Subjects
- VIRTUAL communities, DISTRIBUTED computing, WEBSITES, COMPUTER networks, INFORMATION storage & retrieval systems, BLOGS
- Abstract
The topic of network community identification at multiple resolutions is of great practical interest for learning highly cohesive subnetworks about different subjects in a network. For instance, one might examine the interconnections among web pages, blogs and social content to identify pockets of influencers on subjects like 'Big Data', 'smart phone' or 'global warming'. With dynamic changes to its graph representation and content, the incremental maintenance of a community poses significant computational challenges. Moreover, the intensity of community engagement can be distinguished at multiple levels, resulting in a multi-resolution community representation that has to be maintained over time. In this paper, we first formalize this problem using the k-core metric projected at multiple k values, so that multiple community resolutions are represented by multiple k-core graphs. Recognizing that large graphs and their even larger attributed content cannot be stored and managed by a single server, we then propose distributed algorithms to construct and maintain a multi-k-core graph, implemented on the scalable Big Data platform Apache HBase. Our experimental evaluation demonstrates orders-of-magnitude speedup from maintaining the multi-k-core incrementally rather than reconstructing it from scratch. Our algorithms thus enable practitioners to create and maintain communities at multiple resolutions on multiple subjects in rich network content simultaneously. [ABSTRACT FROM AUTHOR]
- Published
- 2015
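A static view of the multi-resolution idea, assuming networkx's k_core in place of the paper's incremental HBase algorithms:

```python
# k-cores of the same graph at several k values: each k is one community
# resolution. The paper maintains these incrementally; this is the static view.
import networkx as nx

G = nx.gnm_random_graph(200, 800, seed=1)
for k in (2, 4, 6):   # larger k means tighter community engagement
    core = nx.k_core(G, k)
    print(f"k={k}: {core.number_of_nodes()} nodes, {core.number_of_edges()} edges")
```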
24. Mining time-interval univariate uncertain sequential patterns.
- Author
- Liu, Ying-Ho
- Subjects
- UNIVARIATE analysis, SEQUENTIAL pattern mining, ALGORITHMS, PROBABILITY density function, DATA mining
- Abstract
In this study, we propose two algorithms to discover time-interval univariate uncertain (U2) sequential patterns from a set of univariate uncertain (U2) sequences. A U2-sequence is a sequence that contains transactions of univariate uncertain data, where each attribute in a transaction is associated with a quantitative interval and a probability density function indicating the possibility that each value exists in the interval. Many sources record U2-sequences, such as atmospheric pollution sensors and network monitoring systems. Mining sequential patterns from these U2-sequences is important for understanding their intrinsic characteristics. The two proposed algorithms are based on the candidate generate-and-test methodology and the pattern growth methodology, respectively. We performed a series of experiments to evaluate them in terms of runtime and memory consumption. The experimental results show that different algorithms excel under different conditions; in general, the algorithm based on the pattern growth methodology is the better choice. [ABSTRACT FROM AUTHOR]
- Published
- 2015
25. Pattern-Aided Regression Modeling and Prediction Model Analysis.
- Author
- Dong, Guozhu and Taslimitehrani, Vahid
- Subjects
- PATTERNS (Mathematics), REGRESSION analysis, PREDICTION models, STATISTICAL correlation, ERROR analysis in mathematics, DATA mining
- Abstract
This paper first introduces pattern-aided regression (PXR) models, a new type of regression model designed to represent accurate and interpretable prediction models. This was motivated by two observations: (1) regression modeling applications often involve complex and diverse predictor-response relationships, which occur when the optimal regression models (of a given regression model type) fitting two or more distinct logical groups of data are highly different; (2) state-of-the-art regression methods are often unable to adequately model such relationships. This paper defines PXR models using several patterns and local regression models, which respectively serve as logical and behavioral characterizations of distinct predictor-response relationships. The paper also introduces a contrast pattern aided regression (CPXR) method to build accurate PXR models. In experiments, the PXR models built by CPXR are very accurate in general, often outperforming state-of-the-art regression methods by wide margins. Usually using (a) around seven simple patterns and (b) linear local regression models, those PXR models are easy to interpret; in fact, their complexity is just a bit higher than that of (piecewise) linear regression models and significantly lower than that of traditional ensemble-based regression models. CPXR is especially effective for high-dimensional data. The paper also discusses how to use the CPXR methodology for analyzing prediction models and correcting their prediction errors. [ABSTRACT FROM AUTHOR]
- Published
- 2015
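A toy reading of the pattern-aided idea, with an invented pattern and data; CPXR itself mines contrast patterns rather than being handed one:

```python
# A pattern splits the data into two logical groups, each with its own
# local linear model, mirroring the PXR structure described above.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
pattern = X[:, 0] > 5                      # the "pattern" defining one group
y = np.where(pattern, 3 * X[:, 0] - 4, -2 * X[:, 0] + 1) + rng.normal(size=200)

# One local regression model per group.
models = {flag: LinearRegression().fit(X[pattern == flag], y[pattern == flag])
          for flag in (False, True)}

x_new = np.array([[7.0]])                  # matches the pattern, so use its model
print(models[True].predict(x_new))
```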
26. Anonymizing graphs: measuring quality for clustering.
- Author
- Casas-Roma, Jordi, Herrera-Joancomartí, Jordi, and Torra, Vicenç
- Subjects
- DATA mining, MINING methodology, ALGORITHMS, XML (Extensible Markup Language), INFORMATION processing
- Abstract
Anonymization of graph-based data is a problem that has been widely studied in recent years, and several anonymization methods have been developed. Information loss measures have been devised to evaluate the noise introduced into the anonymized data. Generic information loss measures ignore the intended use of the anonymized data; when data has to be released to third parties with no control over what analyses users might perform, these measures are the standard choice. In this paper we study different generic information loss measures for graphs, comparing them to cluster-specific ones. We want to evaluate whether the generic information loss measures are indicative of the usefulness of the data for subsequent data mining processes. [ABSTRACT FROM AUTHOR]
- Published
- 2015
27. Design of computationally efficient density-based clustering algorithms.
- Author
- Nanda, Satyasai Jagannath and Panda, Ganapati
- Subjects
- ALGORITHMS, APPLICATION software, DATABASES, COMPUTATIONAL complexity, DATA mining, ASSOCIATION rule mining
- Abstract
The basic DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm uses a minimal number of input parameters and is very effective for clustering large spatial databases, but involves high computational complexity. This paper proposes a new strategy to reduce the computational complexity of DBSCAN by efficiently implementing new merging criteria at the initial stage of cluster evolution. Further, new density-based clustering (DBC) algorithms are proposed that use the correlation coefficient as the similarity measure. Although not computationally efficient, these algorithms are effective when the patterns in a dataset are highly similar; their computations are then reduced with the new cluster merging criteria. Tests on several synthetic and real datasets demonstrate that these computationally efficient algorithms are comparable in accuracy to the traditional one. An interesting application of the proposed algorithm is demonstrated by identifying regional hazard regions in the seismic catalog of Japan. [ABSTRACT FROM AUTHOR]
- Published
- 2015
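As a hedged sketch of density-based clustering under a correlation similarity (not the authors' merging criteria), scikit-learn's DBSCAN can be run on a precomputed correlation-distance matrix:

```python
# DBSCAN over a correlation-distance matrix, in the spirit of the DBC
# variants above. Data is random; eps and min_samples are illustrative.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))

D = np.clip(1.0 - np.corrcoef(X), 0.0, None)  # correlation distance, kept >= 0
np.fill_diagonal(D, 0.0)

labels = DBSCAN(eps=0.9, min_samples=5, metric="precomputed").fit_predict(D)
print(np.unique(labels))  # -1 marks noise points
```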
28. Process Discovery Algorithms Using Numerical Abstract Domains.
- Author
- Carmona, Josep and Cortadella, Jordi
- Subjects
- FORMAL methods (Computer science), ARTIFICIAL neural networks, INFORMATION technology, SOFTWARE engineering, PETRI nets
- Abstract
The discovery of process models from event logs has emerged as one of the crucial problems for enabling the continuous support in the life-cycle of an information system. However, in a decade of process discovery research, the algorithms and tools that have appeared are known to have strong limitations in several dimensions. The size of the logs and the formal properties of the model discovered are the two main challenges nowadays. In this paper we propose the use of numerical abstract domains for tackling these two problems, for the particular case of the discovery of Petri nets. First, numerical abstract domains enable the discovery of general process models, requiring no knowledge (e.g., the bound of the Petri net to derive) for the discovery algorithm. Second, by using divide and conquer techniques we are able to control the size of the process discovery problems. The methods proposed in this paper have been implemented in a prototype tool and experiments are reported illustrating the significance of this fresh view of the process discovery problem. [ABSTRACT FROM AUTHOR]
- Published
- 2014
29. Motif-Based Hyponym Relation Extraction from Wikipedia Hyperlinks.
- Author
- Wei, Bifan, Liu, Jun, Ma, Jian, Zheng, Qinghua, Zhang, Wei, and Feng, Boqin
- Subjects
- FEATURE extraction, HYPERLINKS, MACHINE learning, KNOWLEDGE acquisition (Expert systems), DATA mining, ELECTRONIC publishing
- Abstract
Discovering hyponym relations among domain-specific terms is a fundamental task in taxonomy learning and knowledge acquisition. However, the great diversity of domain corpora and the lack of labeled training sets make this task very challenging for conventional methods based on text content. In this study, the hyperlink structure of Wikipedia article pages was found to contain recurring network motifs that indicate the probability of a hyperlink being a hyponym hyperlink. Hence, a novel hyponym relation extraction approach based on the network motifs of Wikipedia hyperlinks is proposed. The approach automatically constructs motif-based features from the hyperlink structure of a domain: every hyperlink is mapped to a 13-dimensional feature vector based on the 13 types of three-node motifs. The approach extracts structural information from Wikipedia and heuristically creates a labeled training set, from which classification models are determined for hyponym relation extraction. Two experiments were conducted to validate the approach on seven domain-specific datasets obtained from Wikipedia. The first experiment, using manually labeled data, verified the effectiveness of the motif-based features. The second, using an automatically labeled training set from different domains, showed that the proposed approach performs better than an approach based on lexico-syntactic patterns and achieves results comparable to an approach based on textual features. Experimental results show the practicability and fairly good domain scalability of the proposed approach. [ABSTRACT FROM AUTHOR]
- Published
- 2014
30. Cost-Sensitive Online Classification.
- Author
- Wang, Jialei, Zhao, Peilin, and Hoi, Steven C. H.
- Subjects
- DISTANCE education, COST effectiveness, PROBLEM solving, ALGORITHMS, MACHINE learning, DATA mining
- Abstract
Both cost-sensitive classification and online learning have been extensively studied in the data mining and machine learning communities, respectively. However, very few studies address an important intersecting problem: "Cost-Sensitive Online Classification". In this paper, we formally study this problem and propose a new framework for cost-sensitive online classification that directly optimizes cost-sensitive measures using online gradient descent techniques. Specifically, we propose two novel cost-sensitive online classification algorithms, designed to directly optimize two well-known cost-sensitive measures: (i) maximization of the weighted sum of sensitivity and specificity, and (ii) minimization of the weighted misclassification cost. We analyze the theoretical bounds of the cost-sensitive measures achieved by the proposed algorithms and extensively examine their empirical performance on a variety of cost-sensitive online classification tasks. Finally, we demonstrate the application of the proposed technique to several online anomaly detection tasks, showing that it can be a highly efficient and effective tool for cost-sensitive online classification in various application domains. [ABSTRACT FROM AUTHOR]
- Published
- 2014
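Measure (ii), minimizing a weighted misclassification cost online, can be sketched as online gradient descent on a class-weighted hinge loss; the cost weights and data below are assumptions, not the paper's algorithm.

```python
# Online gradient descent on a class-weighted hinge loss: updates are
# scaled by the misclassification cost of the example's class.
import numpy as np

def cost_sensitive_ogd(stream, dim, c_pos=5.0, c_neg=1.0, lr=0.1):
    """y in {-1, +1}; false negatives cost c_pos, false positives c_neg."""
    w = np.zeros(dim)
    for x, y in stream:
        cost = c_pos if y > 0 else c_neg
        if y * w.dot(x) < 1:           # weighted hinge loss is active
            w += lr * cost * y * x     # gradient step scaled by the class cost
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.where(X[:, 0] + 0.1 * rng.normal(size=1000) > 0, 1.0, -1.0)
print(cost_sensitive_ogd(zip(X, y), dim=3))
```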
31. Outlier Detection for Temporal Data: A Survey.
- Author
- Gupta, Manish, Gao, Jing, Aggarwal, Charu C., and Han, Jiawei
- Subjects
- OUTLIER detection, TEMPORAL databases, TIME series analysis, COMPUTER software, DISTRIBUTED databases, PATTERN matching
- Abstract
In the statistics community, outlier detection for time series data has been studied for decades. Recently, with advances in hardware and software technology, there has been a large body of work on temporal outlier detection from a computational perspective within the computer science community. In particular, advances in hardware technology have enabled the availability of various forms of temporal data collection mechanisms, and advances in software technology have enabled a variety of data management mechanisms. This has fueled the growth of different kinds of data sets such as data streams, spatio-temporal data, distributed streams, temporal networks, and time series data, generated by a multitude of applications. There arises a need for an organized and detailed study of the work done in the area of outlier detection with respect to such temporal datasets. In this survey, we provide a comprehensive and structured overview of a large set of interesting outlier definitions for various forms of temporal data, novel techniques, and application scenarios in which specific definitions and techniques have been widely used. [ABSTRACT FROM AUTHOR]
- Published
- 2014
32. Surfing the Network for Ranking by Multidamping.
- Author
- Kollias, Giorgos, Gallopoulos, Efstratios, and Grama, Ananth
- Subjects
- INTERNET, APPROXIMATION algorithms, MARKOV processes, MONTE Carlo method, INTERNET users
- Abstract
PageRank is one of the most commonly used techniques for ranking nodes in a network. It is a special case of a family of link-based rankings, commonly referred to as functional rankings. Functional rankings are computed as power series of a stochastic matrix derived from the adjacency matrix of the graph. This general formulation of functional rankings enables their use in diverse applications, ranging from traditional search applications to identification of spam and outliers in networks. This paper presents a novel algorithmic (re)formulation of commonly used functional rankings, such as LinearRank, TotalRank and Generalized Hyperbolic Rank. These rankings can be approximated by finite series representations. We prove that polynomials of stochastic matrices can be expressed as products of Google matrices (matrices having the form used in Google’s original PageRank formulation). Individual matrices in these products are parameterized by different damping factors. For this reason, we refer to our formulation as multidamping. We demonstrate that multidamping has a number of desirable characteristics: (i) for problems such as finding the highest ranked pages, multidamping admits extremely fast approximate solutions; (ii) multidamping provides an intuitive interpretation of existing functional rankings in terms of the surfing habits of model web users; (iii) multidamping provides a natural framework based on Monte Carlo type methods that have efficient parallel and distributed implementations. It also provides the basis for constructing new link-based rankings based on inhomogeneous products of Google matrices. We present algorithms for computing damping factors for existing functional rankings analytically and numerically. We validate various benefits of multidamping on a number of real datasets. [ABSTRACT FROM AUTHOR]
- Published
- 2014
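The functional rankings discussed here are power series of a stochastic matrix. Below is a hedged numerical sketch using PageRank's coefficients (1-a)a^k on a tiny invented graph; it illustrates the series form, not the multidamping reformulation itself.

```python
# A functional ranking as a truncated power series of the stochastic matrix,
# the object that multidamping re-expresses as products of Google matrices.
import numpy as np

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 1, 0]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)    # row-stochastic transition matrix

a, v = 0.85, np.full(3, 1 / 3)          # damping factor and teleport vector
rank = sum((1 - a) * a**k * v @ np.linalg.matrix_power(P, k) for k in range(50))
print(rank / rank.sum())
```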
33. Discovery of temporal neighborhoods through discretization methods.
- Author
- Dey, Sandipan, Janeja, Vandana P., and Gangopadhyay, Aryya
- Subjects
- KNOWLEDGE management research, TEMPORAL databases, DATA mining, DATA science, DISCRETIZATION methods
- Abstract
Neighborhood discovery is a precursor to knowledge discovery in complex and large datasets such as temporal data, i.e., sequences of data tuples measured at successive time instants. Instead of mining the entire dataset, we are interested in dividing it into several smaller intervals of interest, which we call temporal neighborhoods. In this paper we propose a class of algorithms to generate temporal neighborhoods through unequal-depth discretization. We describe four novel algorithms: (a) Similarity-based Merging (SMerg), (b) Stationary-distribution-based Merging (StMerg), (c) Greedy Merging (GMerg), and (d) Optimal Merging (OptMerg). SMerg and StMerg are based on the robust framework of Markov models and the Markov stationary distribution, respectively. GMerg is a greedy approach, and OptMerg is geared toward discovering optimal binning strategies for the most effective partitioning of the data into temporal neighborhoods; neither of these two uses Markov models. We identify temporal neighborhoods with distinct demarcations based on unequal-depth discretization of the data. We discuss detailed experimental results on both synthetic and real-world data. Specifically, we show (i) the efficacy of our algorithms through the precision and recall of labeled bins, (ii) ground-truth validation on real-world traffic monitoring datasets, and (iii) knowledge discovery in the temporal neighborhoods, such as global anomalies. Our results indicate that we are able to identify valuable knowledge, as confirmed by ground-truth validation on real-world traffic data. [ABSTRACT FROM AUTHOR]
- Published
- 2014
34. Probabilistic Aspect Mining Model for Drug Reviews.
- Author
- Cheng, Victor C., Leung, C. H. C., Liu, Jiming, and Milani, Alfredo
- Subjects
- PROBABILISTIC databases, DATA mining, CLINICAL drug trials, DATA modeling, TEXT mining, FEATURE extraction
- Abstract
Recent findings show that online reviews, blogs, and discussion forums on chronic diseases and drugs are becoming important supporting resources for patients. Extracting information from these substantial bodies of text is useful and challenging. We developed a generative probabilistic aspect mining model (PAMM) for identifying the aspects/topics relating to the class labels or categorical meta-information of a corpus. Unlike many other unsupervised or supervised approaches, PAMM has a unique feature: it focuses on finding aspects relating to one class only, rather than finding aspects for all classes simultaneously in each execution. This reduces the chance of aspects being formed from mixed concepts of different classes, so the identified aspects are easier for people to interpret. The aspects found are also class-distinguishing: they can be used to distinguish a class from other classes. An efficient EM algorithm is developed for parameter estimation. Experimental results on reviews of four different drugs show that PAMM finds better aspects than other common approaches when measured by mean pointwise mutual information and classification accuracy. In addition, the derived aspects were assessed by humans from different specified perspectives, and PAMM was rated highest. [ABSTRACT FROM PUBLISHER]
- Published
- 2014
35. A Simple but Powerful Heuristic Method for Accelerating k-Means Clustering of Large-Scale Data in Life Science.
- Author
- Ichikawa, Kazuki and Morishita, Shinichi
- Abstract
K-means clustering has been widely used to gain insight into biological systems from large-scale life science data. To quantify the similarities among biological data sets, Pearson correlation distance and standardized Euclidean distance are used most frequently; however, optimization methods have been largely unexplored. These two distance measurements are equivalent in the sense that they yield the same k-means clustering result for identical sets of k initial centroids. Thus, an efficient algorithm used for one is applicable to the other. Several optimization methods are available for the Euclidean distance and can be used for processing the standardized Euclidean distance; however, they are not customized for this context. We instead approached the problem by studying the properties of the Pearson correlation distance, and we invented a simple but powerful heuristic method for markedly pruning unnecessary computation while retaining the final solution. Tests using real biological data sets with 50-60K vectors of dimensions 10–2001 (∼400 MB in size) demonstrated marked reduction in computation time for k = 10-500 in comparison with other state-of-the-art pruning methods such as Elkan's and Hamerly's algorithms. The BoostKCP software is available at http://mlab.cb.k.u-tokyo.ac.jp/∼ichikawa/boostKCP/. [ABSTRACT FROM PUBLISHER]
- Published
- 2014
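The equivalence the abstract relies on can be checked numerically: after z-scoring, squared Euclidean distance is an affine function of Pearson correlation (||z(u) - z(v)||^2 = 2d(1 - r) for d-dimensional vectors), so the same k-means machinery serves both distances. A sketch with random data:

```python
# Verify ||z(u) - z(v)||^2 == 2 * d * (1 - r) for z-scored vectors.
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.normal(size=20), rng.normal(size=20)

def z(x):
    return (x - x.mean()) / x.std()    # zero mean, unit (population) variance

r = np.corrcoef(u, v)[0, 1]
print(np.sum((z(u) - z(v)) ** 2))      # matches the line below
print(2 * len(u) * (1 - r))            # up to floating-point error
```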
36. The Sum-over-Forests Density Index: Identifying Dense Regions in a Graph.
- Author
- Senelle, Mathieu, Garcia-Diez, Silvia, Mantrach, Amin, Shimbo, Masashi, Saerens, Marco, and Fouss, Francois
- Subjects
- GRAPH theory, MAXWELL-Boltzmann distribution law, RANDOM forest algorithms, SET theory, MATRIX inversion
- Abstract
This work introduces a novel nonparametric density index defined on graphs, the Sum-over-Forests (SoF) density index. It is based on a clear and intuitive idea: high-density regions in a graph are characterized by the fact that they contain a large number of low-cost trees with high outdegrees, while low-density regions contain few. Therefore, a Boltzmann probability distribution on the countable set of forests in the graph is defined so that large (high-cost) forests occur with low probability while short (low-cost) forests occur with high probability. The SoF density index of a node is then defined as the expected outdegree of this node on the set of forests, thus providing a measure of density around that node. Following the matrix-forest theorem and a statistical physics framework, it is shown that the SoF density index can be computed in closed form through a simple matrix inversion. Experiments on artificial and real datasets show that the proposed index performs well at finding dense regions, for graphs of various origins. [ABSTRACT FROM PUBLISHER]
- Published
- 2014
37. Discovering Conservation Rules.
- Author
- Golab, Lukasz, Karloff, Howard, Korn, Flip, Saha, Barna, and Srivastava, Divesh
- Subjects
- DATABASE management, DATA quality, INFORMATION storage & retrieval systems, APPROXIMATION algorithms, PROGRAMMING language semantics, TEXT mining
- Abstract
Many applications process data in which there exists a “conservation law” between related quantities. For example, in traffic monitoring, every incoming event, such as a packet's entering a router or a car's entering an intersection, should ideally have an immediate outgoing counterpart. We propose a new class of constraints—Conservation Rules—that express the semantics and characterize the data quality of such applications. We give confidence metrics that quantify how strongly a conservation rule holds and present approximation algorithms (with error guarantees) for the problem of discovering a concise summary of subsets of the data that satisfy a given conservation rule. Using real data, we demonstrate the utility of conservation rules and we show order-of-magnitude performance improvements of our discovery algorithms over naive approaches. [ABSTRACT FROM PUBLISHER]
- Published
- 2014
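A minimal reading of a conservation rule, with an invented event stream: incoming events should balance outgoing ones per entity, and a simple confidence is the balanced fraction (the paper's confidence metrics and discovery algorithms are more refined).

```python
# Per-entity in/out balance as a toy conservation-rule check.
from collections import Counter

events = [("r1", "in"), ("r1", "out"), ("r2", "in"), ("r2", "in"), ("r2", "out")]
ins, outs = Counter(), Counter()
for key, kind in events:
    (ins if kind == "in" else outs)[key] += 1

balanced = [k for k in ins if ins[k] == outs[k]]
print(f"confidence = {len(balanced) / len(ins):.2f}")  # r1 balances; r2 does not
```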
38. Mining Probabilistically Frequent Sequential Patterns in Large Uncertain Databases.
- Author
-
Zhao, Zhou, Yan, Da, and Ng, Wilfred
- Subjects
- *
DATABASES , *GLOBAL Positioning System , *DATA mining , *PROBABILITY theory , *TEMPERATURE sensors , *DATA modeling - Abstract
Data uncertainty is inherent in many real-world applications such as environmental surveillance and mobile tracking. Mining sequential patterns from inaccurate data, such as data arising from sensor readings and GPS trajectories, is important for discovering hidden knowledge in such applications. In this paper, we propose to measure pattern frequentness based on the possible world semantics. We establish two uncertain sequence data models abstracted from many real-life applications involving uncertain sequence data, and formulate the problem of mining probabilistically frequent sequential patterns (or p-FSPs) from data that conform to our models. However, the number of possible worlds is extremely large, which makes the mining prohibitively expensive. Inspired by the famous PrefixSpan algorithm, we develop two new algorithms, collectively called U-PrefixSpan, for p-FSP mining. U-PrefixSpan effectively avoids the problem of "possible worlds explosion", and when combined with our four pruning and validating methods, achieves even better performance. We also propose a fast validating method to further speed up our U-PrefixSpan algorithm. The efficiency and effectiveness of U-PrefixSpan are verified through extensive experiments on both real and synthetic datasets. [ABSTRACT FROM AUTHOR] (A sketch of the frequentness test follows this record.)
- Published
- 2014
- Full Text
- View/download PDF
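To see what probabilistic frequentness means without the U-PrefixSpan machinery, assume a simplified model in which sequence i contains the pattern with probability p[i], independently of the others. The support count then follows a Poisson-binomial distribution, and P(support >= minsup) falls out of a small dynamic program:

```python
# Sketch of the possible-world frequentness test under a simplified,
# sequence-level independence model (not U-PrefixSpan itself): the
# support count is Poisson-binomial, so P(support >= minsup) is a DP.
def prob_frequent(p, minsup):
    dp = [1.0] + [0.0] * len(p)          # dp[k] = P(support == k) so far
    for pi in p:
        for k in range(len(dp) - 1, 0, -1):
            dp[k] = dp[k] * (1 - pi) + dp[k - 1] * pi
        dp[0] *= (1 - pi)
    return sum(dp[minsup:])

print(prob_frequent([0.9, 0.5, 0.4, 0.8], minsup=2))   # ~0.91
```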
39. Adaptation Regularization: A General Framework for Transfer Learning.
- Author
-
Long, Mingsheng, Wang, Jianmin, Ding, Guiguang, Pan, Sinno Jialin, and Yu, Philip S.
- Subjects
- *
MATHEMATICAL regularization , *SUPPORT vector machines , *LEAST squares , *HILBERT space , *MARGINAL distributions , *MACHINE theory - Abstract
Domain transfer learning, which learns a target classifier using labeled data from a different distribution, has shown promising value in knowledge discovery yet remains a challenging problem. Most previous works designed adaptive classifiers by exploring two learning strategies independently: distribution adaptation and label propagation. In this paper, we propose a novel transfer learning framework, referred to as Adaptation Regularization based Transfer Learning (ARTL), to model them in a unified way based on the structural risk minimization principle and regularization theory. Specifically, ARTL learns the adaptive classifier by simultaneously optimizing the structural risk functional, the joint distribution matching between domains, and the manifold consistency underlying the marginal distribution. Based on the framework, we propose two novel methods using Regularized Least Squares (RLS) and Support Vector Machines (SVMs), respectively, and use the Representer theorem in reproducing kernel Hilbert space to derive the corresponding solutions. Comprehensive experiments verify that ARTL can significantly outperform state-of-the-art learning methods on several public text and image datasets. [ABSTRACT FROM AUTHOR] (A sketch of the distribution-matching term follows this record.)
- Published
- 2014
- Full Text
- View/download PDF
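One ingredient of the distribution matching above can be shown compactly: the empirical Maximum Mean Discrepancy (MMD) between source and target samples under an RBF kernel. This is a hedged sketch of that single term only; ARTL's objective combines it with the structural risk and a manifold regularizer and is solved via the Representer theorem:

```python
# Sketch of one ARTL ingredient: the (biased) empirical MMD^2 between a
# source sample Xs and a target sample Xt under an RBF kernel. A larger
# value indicates a larger marginal-distribution gap between domains.
import numpy as np

def rbf(X, Y, gamma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(Xs, Xt, gamma=1.0):
    return (rbf(Xs, Xs, gamma).mean()
            - 2 * rbf(Xs, Xt, gamma).mean()
            + rbf(Xt, Xt, gamma).mean())

rng = np.random.default_rng(0)
Xs = rng.normal(0.0, 1.0, size=(100, 5))    # source-domain sample
Xt = rng.normal(0.5, 1.0, size=(100, 5))    # mean-shifted target domain
print(mmd2(Xs, Xt))
```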
40. Multi-Relational Model Tree Induction Tightly-Coupled with a Relational Database.
- Author
-
Appice, Annalisa, Ceci, Michelangelo, and Malerba, Donato
- Subjects
- *
DATA mining , *RELATIONAL databases , *REGRESSION analysis , *GENERALIZATION , *DATA structures , *ELECTRONIC data processing , *STRUCTURAL frame models - Abstract
Multi-Relational Data Mining (MRDM) refers to the process of discovering implicit, previously unknown and potentially useful information from data scattered in multiple tables of a relational database. Following the mainstream of MRDM research, we tackle the regression task, where the goal is to examine samples of past experience with known continuous answers (response) and generalize to future cases through an inductive process. Mr-SMOTI, the solution we propose, resorts to the structural approach in order to recursively partition data stored in a tightly-coupled database and build a multi-relational model tree which captures the linear dependence between the response variable and one or more explanatory variables. The model tree is top-down induced by choosing, at each step, either to partition the training space or to introduce a regression variable into the linear models at the leaves. The tight coupling with the database makes the knowledge on data structures (foreign keys) available free of charge to guide the search in the multi-relational pattern space. Experiments on artificial and real databases demonstrate that in general Mr-SMOTI outperforms both SMOTI and M5', which are two propositional model tree induction systems, as well as TILDE-RT, a state-of-the-art structural model tree induction system. [ABSTRACT FROM AUTHOR] (A propositional model-tree sketch follows this record.)
- Published
- 2014
- Full Text
- View/download PDF
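The underlying model-tree idea, partition first and then fit linear models at the leaves, can be illustrated in a propositional setting. Mr-SMOTI works on relational data inside the DBMS and interleaves split and regression nodes, which the following scikit-learn sketch does not attempt; the two-stage pipeline below is only an analogue with synthetic data:

```python
# Propositional analogue of a model tree: a shallow regression tree
# partitions the space, then an ordinary linear model is fit per leaf.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
# Piecewise-linear target: the regime switches with the sign of X[:, 0].
y = np.where(X[:, 0] > 0, 5 + 2 * X[:, 1], -X[:, 1]) + rng.normal(0, .1, 500)

tree = DecisionTreeRegressor(max_depth=1).fit(X, y)      # partition step
leaf_of = tree.apply(X)
models = {leaf: LinearRegression().fit(X[leaf_of == leaf], y[leaf_of == leaf])
          for leaf in np.unique(leaf_of)}                # linear model per leaf

x_new = np.array([[1.0, 2.0]])
print(models[tree.apply(x_new)[0]].predict(x_new))       # ~ [9.0]
```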
41. Boosting Discrimination Information Based Document Clustering Using Consensus and Classification
- Author
-
Muhammad Rafique, Ahmad Muqeem Sheri, Moongu Jeon, Khurum Nazir Junejo, and Malik Tahir Hassan
- Subjects
Boosting (machine learning) ,Term Discrimination ,General Computer Science ,Computer science ,02 engineering and technology ,Machine learning ,computer.software_genre ,document clustering ,020204 information systems ,discrimination information ,Consensus clustering ,0202 electrical engineering, electronic engineering, information engineering ,General Materials Science ,Electrical and Electronic Engineering ,Cluster analysis ,business.industry ,General Engineering ,Document clustering ,knowledge reuse ,mining methods and algorithms ,evidence combination ,ComputingMethodologies_PATTERNRECOGNITION ,020201 artificial intelligence & image processing ,lcsh:Electrical engineering. Electronics. Nuclear engineering ,Information measure ,Artificial intelligence ,business ,lcsh:TK1-9971 ,computer - Abstract
An adequate choice of term discrimination information measure (DIM) is critical to reliable document clustering. Making the right choice is empirical in nature, and the characteristics of the data in the documents help experts speculate about a viable solution. Thus, no single DIM is consistently best for clustering, which calls for intelligent selection of the information measure. In this work, we propose an automated consensus-building method based on a text classifier. Two distinct DIMs construct basic partitions of the documents and form base clusters. The consensus-building method uses the cluster information to find concordant documents, which constitute a dataset for training the text classifier. The classifier then predicts labels for the documents that were discordant in the earlier clustering stage and forms new clusters. Experiments are performed on eight standard data sets to test the efficacy of the proposed technique. The improvement observed by applying the proposed consensus clustering demonstrates its superiority over the individual results. Relative Risk (RR) and Measurement of Discrimination Information (MDI) are the two discrimination information measures used to obtain the base clustering solutions in our experiments. (A sketch of the consensus step follows this record.)
- Published
- 2019
- Full Text
- View/download PDF
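A hedged sketch of the consensus step described above, taking the two base clusterings as given (the RR and MDI measures are not reproduced, and the choice of naive Bayes is arbitrary): align the clusterings with the Hungarian algorithm, train on the concordant documents, and let the classifier relabel the discordant ones:

```python
# Consensus step sketch: align clustering B to clustering A, treat
# agreeing documents as training data, relabel the disagreeing ones.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.naive_bayes import MultinomialNB

def consensus(X, labels_a, labels_b, k):
    # Map clusters of B onto clusters of A by maximizing co-occurrence.
    cost = -np.array([[np.sum((labels_a == i) & (labels_b == j))
                       for j in range(k)] for i in range(k)])
    _, mapping = linear_sum_assignment(cost)
    b_aligned = np.array([np.where(mapping == l)[0][0] for l in labels_b])
    concordant = labels_a == b_aligned
    clf = MultinomialNB().fit(X[concordant], labels_a[concordant])
    final = labels_a.copy()
    final[~concordant] = clf.predict(X[~concordant])   # relabel discordant
    return final

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(12, 20))     # toy term-count matrix
a = np.array([0] * 6 + [1] * 6)           # base clustering no. 1
b = np.array([1] * 5 + [0] * 7)           # base clustering no. 2
print(consensus(X, a, b, k=2))
```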
43. Mining frequent itemsets in data streams within a time horizon.
- Author
-
Troiano, Luigi and Scibelli, Giacomo
- Subjects
- *
DATA mining , *TIME perspective , *ALGORITHMS , *SMOOTHNESS of functions , *SET theory , *APPLICATION software - Abstract
In this paper, we present an algorithm for mining frequent itemsets in a stream of transactions within a limited time horizon. In contrast to other approaches presented in the literature, the proposed algorithm makes use of a test window that can discard non-frequent itemsets from a set of candidates. The efficiency of this approach relies on the property that the higher the support threshold is, the smaller the test window is. In addition to considering a sharp horizon, we consider a smooth window. Indeed, in many applications of practical interest, not all of the time slots have the same relevance; e.g., more recent slots can be more interesting than older slots. Smoothness can be determined in both qualitative and quantitative terms. A comparison to other algorithms is conducted. The experimental results show that the proposed solution is faster than other approaches but has a slightly higher cost in terms of memory. [Copyright Elsevier] (A minimal sliding-horizon counter follows this record.)
- Published
- 2014
- Full Text
- View/download PDF
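A minimal sliding-horizon support counter helps fix the setting, although it implements neither the paper's test-window pruning nor the smooth window: the last `horizon` transactions sit in a deque, and itemsets whose windowed support reaches the threshold are reported after each arrival:

```python
# Minimal sliding-horizon counter (no test-window pruning, no smoothing):
# enumerate small itemsets per transaction, keep counts over the last
# `horizon` transactions, and report those meeting the support threshold.
from collections import Counter, deque
from itertools import combinations

def frequent_in_horizon(stream, horizon, minsup, max_len=2):
    window, counts = deque(), Counter()
    for t in stream:
        itemsets = {frozenset(c) for n in range(1, max_len + 1)
                    for c in combinations(sorted(t), n)}
        window.append(itemsets)
        counts.update(itemsets)
        if len(window) > horizon:              # slide: forget oldest slot
            counts.subtract(window.popleft())
        yield {s for s, c in counts.items() if c >= minsup}

stream = [{"a", "b"}, {"a", "c"}, {"a", "b"}, {"b", "c"}]
for frequent in frequent_in_horizon(stream, horizon=3, minsup=2):
    print(frequent)
```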
44. Following the entire solution path of sparse principal component analysis by coordinate-pairwise algorithm.
- Author
-
Meng, Deyu, Cui, Hengbin, Xu, Zongben, and Jing, Kaili
- Subjects
- *
PATHS & cycles in graph theory , *PRINCIPAL components analysis , *ALGORITHMS , *ITERATIVE methods (Mathematics) , *PROBLEM solving , *COMPUTER simulation - Abstract
In this paper we derive an algorithm to follow the entire solution path of the sparse principal component analysis (PCA) problem. The core idea is to iteratively identify the pair of variables along which the objective function of the sparse PCA model can be increased the most, and then incrementally update the coefficients of the two selected variables by a small stepsize. The main strength of the new algorithm is its capability to provide a computational shortcut to the entire spectrum of solutions of the sparse PCA problem, which is always beneficial in real applications. The proposed algorithm is simple and easy to implement. The effectiveness of our algorithm is empirically verified by a series of experiments conducted on synthetic and real problems, in comparison with other typical sparse PCA methods. [Copyright Elsevier] (An illustrative pairwise-update sketch follows this record.)
- Published
- 2013
- Full Text
- View/download PDF
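The coordinate-pairwise flavor can be illustrated on the dense (non-sparse) problem of maximizing x^T S x on the unit sphere: a small plane rotation in a chosen pair of coordinates keeps the norm fixed, and the pair with the largest instantaneous gain is updated at each step. This is only an illustration, not the authors' path-following algorithm, and it omits the sparsity path entirely:

```python
# Coordinate-pairwise ascent on f(x) = x.T @ S @ x with ||x|| = 1, via
# small Givens-style plane rotations (norm-preserving by construction).
import numpy as np

def pairwise_step(S, x, eta=0.02):
    g = S @ x
    # d/da of f along a rotation in plane (i, j) is 2*(g[j]*x[i] - g[i]*x[j]).
    gain = np.abs(np.outer(x, g) - np.outer(g, x))
    i, j = np.unravel_index(np.argmax(gain), gain.shape)
    a = eta * np.sign(g[j] * x[i] - g[i] * x[j])
    xi, xj = x[i], x[j]
    x[i] = xi * np.cos(a) - xj * np.sin(a)
    x[j] = xi * np.sin(a) + xj * np.cos(a)
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 8))
S = A.T @ A                                  # sample scatter matrix
x = np.ones(8) / np.sqrt(8)
for _ in range(2000):
    x = pairwise_step(S, x)
print(x @ S @ x, np.linalg.eigvalsh(S)[-1])  # close to the top eigenvalue
```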
45. Efficient Cluster Labeling for Support Vector Clustering.
- Author
-
D'Orangeville, V., Mayers, M. Andre, Monga, M. Ernest, and Wang, M. Shengrui
- Subjects
- *
SUPPORT vector machines , *ALGORITHMS , *PROBLEM solving , *INTEGRATED circuit interconnections , *KERNEL operating systems , *DATA mining - Abstract
We propose a new efficient algorithm for solving the cluster labeling problem in support vector clustering (SVC). The proposed algorithm analyzes the topology of the function describing the SVC cluster contours and explores interconnection paths between critical points separating distinct cluster contours. This process makes it possible to distinguish disjoint clusters and associate each point with its respective cluster. The proposed algorithm implements a new fast method for detecting and classifying critical points while analyzing the interconnection patterns between them. Experiments indicate that the proposed algorithm significantly improves the accuracy of the SVC labeling process in the presence of clusters of complex shape, while reducing the processing time required by existing SVC labeling algorithms by orders of magnitude. [ABSTRACT FROM AUTHOR] (A sketch of the classic labeling baseline follows this record.)
- Published
- 2013
- Full Text
- View/download PDF
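For contrast, the classic labeling baseline that such algorithms improve upon can be sketched directly. Here the trained SVC support function is approximated by a scikit-learn OneClassSVM (not a true SVDD), and two points are connected when every sampled point on the segment between them stays inside the learned region:

```python
# Classic SVC labeling baseline sketch: connect two points when every
# sampled point on the segment between them stays inside the support
# region (decision_function >= 0); clusters = connected components.
import numpy as np
from sklearn.svm import OneClassSVM
from scipy.sparse.csgraph import connected_components

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, .3, (30, 2)), rng.normal(2, .3, (30, 2))])
svm = OneClassSVM(gamma=1.0, nu=0.05).fit(X)

n, steps = len(X), 10
adj = np.zeros((n, n), dtype=bool)
for i in range(n):
    for j in range(i + 1, n):
        seg = np.linspace(X[i], X[j], steps)          # sample the segment
        adj[i, j] = adj[j, i] = bool((svm.decision_function(seg) >= 0).all())
n_comp, labels = connected_components(adj, directed=False)
print(n_comp)   # 2 bulk clusters, plus any boundary outliers as singletons
```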
46. Sentiment Word Relations with Affect, Judgment, and Appreciation.
- Author
-
Neviarouskaya, Alena and Aono, Masaki
- Abstract
In this work, we propose a method for automatic analysis of attitude (affect, judgment, and appreciation) in sentiment words. The first stage of the proposed method is an automatic separation of unambiguous affective and judgmental adjectives from those that express appreciation or different attitudes depending on context. In our experiments with machine learning algorithms, we employed three feature sets based on Pointwise Mutual Information, word-pattern co-occurrence, and minimal path length. The next stage of the proposed method is to estimate the potentials of miscellaneous adjectives to convey affect, judgment, and appreciation. Based on the sentences automatically collected for each adjective, the algorithm analyses the context of phrases that contain sentiment words by considering morphological tags, high-level concepts, and named entities, and then makes decisions about contextual attitude labels. Finally, the appraisal potentials of a word are calculated based on the number of sentences related to each type of attitude. Our two-stage method was evaluated on two data sets, and promising results were obtained. The performance of our method was also compared with a method from previous work. [ABSTRACT FROM PUBLISHER] (A PMI sketch follows this record.)
- Published
- 2013
- Full Text
- View/download PDF
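The first of the three feature sets named above is easy to pin down: Pointwise Mutual Information between a target adjective and an attitude seed word. A sketch with toy co-occurrence counts standing in for real corpus statistics:

```python
# PMI sketch: log2 of how much more often the adjective and the seed
# word co-occur than independence would predict, from window counts.
import math

def pmi(count_xy, count_x, count_y, n_windows):
    p_xy = count_xy / n_windows
    p_x, p_y = count_x / n_windows, count_y / n_windows
    return math.log(p_xy / (p_x * p_y), 2) if p_xy > 0 else float("-inf")

# e.g. "happy" vs the affect seed "joy" in a toy corpus of 10,000 windows
print(pmi(count_xy=40, count_x=200, count_y=300, n_windows=10_000))  # ~2.7
```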
47. On detection of emerging anomalous traffic patterns using GPS data.
- Author
-
Pang, Linsey Xiaolin, Chawla, Sanjay, Liu, Wei, and Zheng, Yu
- Subjects
- *
GLOBAL Positioning System , *TRAFFIC flow , *DATA mining , *REMOTE sensing , *TRANSPORTATION , *METROPOLITAN areas - Abstract
The increasing availability of large-scale trajectory data provides a great opportunity to explore it for knowledge discovery in transportation systems using advanced data mining techniques. Nowadays, a large number of taxicabs in major metropolitan cities are equipped with GPS devices. Since taxis are on the road nearly 24 hours a day (with drivers changing shifts), they can act as reliable sensors for monitoring the behavior of traffic. In this article, we use GPS data from taxis to monitor the emergence of unexpected behavior in the Beijing metropolitan area, which has the potential to estimate and improve traffic conditions in advance. We adapt the likelihood ratio test (LRT) statistic, which has previously been used mostly in epidemiological studies, to describe traffic patterns. To the best of our knowledge, the use of the LRT in the traffic domain is not only novel but also results in accurate and rapid detection of anomalous behavior. [Copyright Elsevier] (A scan-statistic sketch follows this record.)
- Published
- 2013
- Full Text
- View/download PDF
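The flavor of the LRT the authors adapt can be sketched with the Kulldorff-style Poisson log-likelihood ratio used in epidemiological scan statistics. The counts below are invented: c is the observed number of events in a candidate region, e its expected count under the null baseline, and C the total observed count:

```python
# Kulldorff-style Poisson log-likelihood ratio sketch: scores how
# surprising an excess of events in a region is relative to its baseline.
import math

def poisson_llr(c, e, C):
    if c <= e:                      # only score unexpected excesses
        return 0.0
    return c * math.log(c / e) + (C - c) * math.log((C - c) / (C - e))

print(poisson_llr(c=120, e=80, C=5000))   # larger => more anomalous region
```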
48. Pattern-growth based frequent serial episode discovery.
- Author
-
Achar, Avinash, A, Ibrahim, and Sastry, P.S.
- Subjects
- *
PATTERN recognition systems , *SEQUENTIAL analysis , *TELECOMMUNICATION , *ALGORITHMS , *COMPUTER software correctness , *FAULT location (Engineering) - Abstract
Frequent episode discovery is a popular framework for pattern discovery from sequential data. It has found many applications in domains such as alarm management in telecommunication networks, fault analysis in manufacturing plants, and predicting user behavior in web click streams. In this paper, we address the discovery of serial episodes. In the episode context, there have been multiple ways to quantify the frequency of an episode. Most of the current algorithms for episode discovery under the various frequencies are apriori-based level-wise methods. These methods essentially perform a breadth-first search of the pattern space. However, there are currently no depth-first methods of pattern discovery in the frequent episode framework under many of the frequency definitions. In this paper, we try to bridge this gap. We provide new depth-first algorithms for serial episode discovery under the non-overlapped and total frequencies. Under the non-overlapped frequency, we present algorithms that can take care of span and gap constraints on episode occurrences. Under the total frequency, we present an algorithm that can handle a span constraint. We provide proofs of correctness for the proposed algorithms and demonstrate their effectiveness through extensive simulations. We also give detailed run-time comparisons with the existing apriori-based methods and illustrate scenarios under which the proposed pattern-growth algorithms perform better than their apriori counterparts. [Copyright Elsevier] (A minimal non-overlapped counter follows this record.)
- Published
- 2013
- Full Text
- View/download PDF
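The non-overlapped frequency mentioned above has a particularly simple counter for a single serial episode: scan the event sequence with a greedy leftmost automaton and reset after each complete match, so counted occurrences never share events. This illustrates the frequency definition only, not the paper's pattern-growth discovery:

```python
# Non-overlapped frequency of a serial episode (e.g. A -> B -> C):
# greedy leftmost matching; each full match increments the count and
# resets the automaton, so no two counted occurrences share an event.
def non_overlapped_frequency(events, episode):
    state, count = 0, 0
    for e in events:
        if e == episode[state]:
            state += 1
            if state == len(episode):   # full serial occurrence found
                count += 1
                state = 0               # reset => non-overlapped matches
    return count

print(non_overlapped_frequency("ABXACBYCABC", "ABC"))   # -> 2
```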
49. Text Categorization of Biomedical Data Sets Using Graph Kernels and a Controlled Vocabulary.
- Author
-
Bleik, Said, Mishra, Meenakshi, Huan, Jun, and Song, Min
- Abstract
Recently, graph representations of text have shown improved performance over conventional bag-of-words representations in text categorization applications. In this paper, we present a graph-based representation for biomedical articles and use graph kernels to classify those articles into high-level categories. In our representation, common biomedical concepts and semantic relationships are identified with the help of an existing ontology and are used to build a rich graph structure that provides a consistent feature set and preserves additional semantic information that could improve a classifier's performance. We attempt to classify the graphs using both a set-based graph kernel, which is capable of dealing with the disconnected nature of the graphs, and a simple linear kernel. Finally, we report results comparing the classification performance of the kernel classifiers to that of common text-based classifiers. [ABSTRACT FROM PUBLISHER] (A toy set-kernel sketch follows this record.)
- Published
- 2013
- Full Text
- View/download PDF
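A crude stand-in for the set-based kernel idea: treat each article graph as its set of ontology concept nodes and use a normalized intersection kernel, fed to an SVM as a precomputed Gram matrix. The toy concept sets and labels below are invented, and the paper's kernel also exploits the semantic relations between concepts:

```python
# Toy set-based kernel: K(G1, G2) = |V1 & V2|, cosine-normalized. The
# intersection kernel is PSD (a linear kernel on indicator vectors),
# so it can be passed to SVC as a precomputed Gram matrix.
import numpy as np
from sklearn.svm import SVC

graphs = [{"gene", "protein", "binding"},
          {"gene", "mutation"},
          {"therapy", "drug", "protein"},
          {"therapy", "drug"}]
y = np.array([0, 0, 1, 1])

def gram(gs):
    K = np.array([[len(a & b) for b in gs] for a in gs], dtype=float)
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)            # cosine-style normalization

clf = SVC(kernel="precomputed").fit(gram(graphs), y)
print(clf.predict(gram(graphs)))         # expected: [0 0 1 1] on this toy data
```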
50. Grammar-based multi-objective algorithms for mining association rules.
- Author
-
Luna, J.M., Romero, J.R., and Ventura, S.
- Subjects
- *
DATA mining , *APPLICATION software , *DATA quality , *GENETIC programming , *EVOLUTIONARY algorithms , *SOFTWARE reliability - Abstract
In association rule mining, the process of extracting relations from a dataset often requires the application of more than one quality measure, and, in many cases, such measures involve conflicting objectives. In such a situation, it is more appropriate to seek the optimal trade-off between measures. This paper deals with the association rule mining problem from a multi-objective perspective by proposing grammar guided genetic programming (G3P) models that enable the extraction of both numerical and nominal association rules in a single step. The strength of G3P is its ability to restrict the search space and build rules conforming to a given context-free grammar. Thus, the proposals presented in this paper combine the advantages of G3P models with those of multi-objective approaches. Both approaches follow the philosophy of two well-known multi-objective algorithms: the Non-dominated Sorting Genetic Algorithm (NSGA-II) and the Strength Pareto Evolutionary Algorithm (SPEA2). In the experimental stage, we compare both multi-objective algorithms to a single-objective G3P proposal for mining association rules and analyze the mined rules. The results obtained show that the multi-objective proposals obtain very frequent (with support values above 95% in most cases) and reliable (with confidence values close to 100%) rules when attaining the optimal trade-off between support and confidence. Furthermore, for the trade-off between support and lift, the multi-objective proposals also produce very interesting and representative rules. [Copyright Elsevier] (A Pareto-filtering sketch follows this record.)
- Published
- 2013
- Full Text
- View/download PDF
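The multi-objective core, independent of the G3P rule generation, is the non-dominated selection that NSGA-II and SPEA2 build on: keep the candidate rules that no other rule dominates in the support/confidence plane. A sketch with invented rules:

```python
# Pareto-front filtering over (support, confidence): a rule survives
# unless some other rule is at least as good on both measures and
# strictly better on one.
def pareto_front(rules):
    """rules: list of (name, support, confidence) triples."""
    def dominated(r, s):
        return (s[1] >= r[1] and s[2] >= r[2]) and (s[1] > r[1] or s[2] > r[2])
    return [r for r in rules if not any(dominated(r, s) for s in rules)]

rules = [("r1", 0.96, 0.99), ("r2", 0.90, 0.97),
         ("r3", 0.98, 0.95), ("r4", 0.97, 0.99)]
print(pareto_front(rules))      # r4 dominates r1 and r2; front is r3 and r4
```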