163 results for '"mining methods and algorithms"'
Search Results
2. Using Subgroup Discovery to Relate Odor Pleasantness and Intensity to Peripheral Nervous System Reactions.
- Author
- Moranges, Maelle, Plantevit, Marc, and Bensafi, Moustafa
- Abstract
Activation of the autonomic nervous system is a primary characteristic of human hedonic responses to sensory stimuli. For smells, general tendencies of physiological reactions have been described using classical statistics. However, these physiological variations are generally not quantified precisely; each psychophysiological parameter has very often been studied separately, and individual variability has not been systematically considered. The current study presents an innovative approach based on data mining, whose goal is to extract knowledge from a dataset. This approach uses a subgroup discovery algorithm which allows extraction of rules that apply to as many olfactory stimuli and individuals as possible. These rules are described by intervals over a set of physiological attributes. The results allowed both quantifying how each physiological parameter relates to odor pleasantness and perceived intensity and describing the contribution of each individual to these rules. This approach can be applied to other fields of the affective sciences characterized by complex and heterogeneous datasets. [ABSTRACT FROM AUTHOR]
- Published
- 2023
3. Change pattern relationships in event logs.
- Author
- Cremerius, Jonas, Patzlaff, Hendrik, and Weske, Mathias
- Subjects
- BLOOD sugar measurement, PROCESS mining, HOSPITAL care, EXECUTIONS & executioners
- Abstract
Process mining utilises process execution data to discover and analyse business processes. Event logs represent process executions, providing information about the activities executed. In addition to generic event attributes like activity name and timestamp, events might contain domain-specific attributes, such as a blood sugar measurement in a healthcare environment. Many of these values change quite frequently during a typical process. We refer to those as dynamic event attributes. Change patterns can be derived from dynamic event attributes, describing whether the attribute values change from one activity to another. So far, change patterns can only be identified in an isolated manner, neglecting the chance of finding co-occurring change patterns. This paper provides an approach to identifying relationships between change patterns by utilising correlation methods from statistics. We applied the proposed technique to two event logs derived from the MIMIC-IV real-world dataset on hospitalisations in the US and evaluated the results with a medical expert. It turns out that relationships between change patterns can be detected within the same directly-follows or eventually-follows relation and even beyond that. Further, we identify unexpected relationships that occur only in certain parts of the process. Thus, the process perspective reveals novel insights into how dynamic event attributes change together during process execution. The approach is implemented in Python using the PM4Py framework. [ABSTRACT FROM AUTHOR]
- Published
- 2024
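As a hedged illustration of the correlation step described in this abstract, the sketch below computes a Pearson correlation between two invented per-case attribute deltas; the attribute names and values are assumptions, not the authors' PM4Py pipeline.

```python
# Hedged sketch of correlating two change patterns across process instances.
# The per-case deltas are invented; the paper extracts them from event logs.
import pandas as pd
from scipy.stats import pearsonr

# Change of two dynamic event attributes between two activities per case,
# e.g. the drop in blood glucose and in creatinine from admission to discharge.
changes = pd.DataFrame({
    "delta_glucose":    [-12.0, -5.5, -20.1, 3.2, -8.7],
    "delta_creatinine": [-0.3, -0.1, -0.6, 0.2, -0.2],
})

r, p = pearsonr(changes["delta_glucose"], changes["delta_creatinine"])
print(f"r = {r:.2f}, p = {p:.3f}")  # a large |r| suggests co-occurring change patterns
```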
4. Behavior Action Mining
- Author
- Peng Su, Daniel Zeng, and Huimin Zhao
- Subjects
- Business, decision support, knowledge and data engineering tools and techniques, mining methods and algorithms, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
Actionable behavioral rules suggest specific actions that may influence certain behavior in the stakeholders' best interest. Previous work on mining such rules assumed that all attributes are categorical and that numerical attributes have been discretized in advance. However, this assumption significantly reduces the solution space and thus hinders the potential of mining algorithms, especially when numerical attributes are prevalent. As numerical data are ubiquitous in business applications, there is a crucial need for new mining methodologies that can better leverage such data. To meet this need, in this paper we define a new data mining problem, named behavior action mining, as a continuous-variable optimization of the expected utility of an action. We then develop three approaches to solving this new problem, all of which use regression as their technical basis. Experimental results on a marketing dataset demonstrate the validity and superiority of the proposed approaches.
- Published
- 2019
5. Boosting Discrimination Information Based Document Clustering Using Consensus and Classification
- Author
- Ahmad Muqeem Sheri, Muhammad Aasim Rafique, Malik Tahir Hassan, Khurum Nazir Junejo, and Moongu Jeon
- Subjects
- Consensus clustering, discrimination information, document clustering, evidence combination, knowledge reuse, mining methods and algorithms, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
Good document clustering depends on an adequate choice of the term discrimination information measure (DIM). Making the right choice is empirical in nature: characteristics of the data in the documents help experts speculate about a viable solution, so a consistently suitable DIM is a mere conjecture and demands intelligent selection of the information measure. In this work, we propose an automated consensus-building measure based on a text classifier. Two distinct DIMs construct basic partitions of the documents and form base clusters. The consensus-building method uses the cluster information to find concordant documents and constitutes a dataset to train the text classifier. The classifier predicts labels for discordant documents from the earlier clustering stage and forms new clusters. Experiments on eight standard data sets test the efficacy of the proposed technique. The improvement observed by applying the proposed consensus clustering demonstrates its superiority over the individual results. Relative Risk (RR) and Measurement of Discrimination Information (MDI) are the two discrimination information measures used to obtain the base clustering solutions in our experiments.
- Published
- 2019
6. Multiangle P2P Borrower Characterization Analytics by Attributes Partition Considering Business Process.
- Author
- Liu, Shuaiqi and Wu, Sen
- Abstract
In research on P2P lending data, the study of borrower characteristics is of great value for identifying target customers and managing risk. Because of high dimensionality, mixed attributes, differing importance, and differing generation times of information, mining P2P lending data often yields results that fail to reflect the important borrower characteristics affecting the approval results and the approved loan amount. In this article, we are the first to propose an attributes partition of lending data that considers the business process in order to classify variables into different types. Furthermore, we propose a multiangle data mining method for lending data, based on this attributes partition, to discover the characteristics of P2P borrowers from multiple perspectives. Experimental results on a real dataset demonstrate that the method depicts the important characteristics of borrowers that affect the approval results and the loan amount, makes research on P2P borrower characteristics more comprehensive and specific, and provides new ideas for research on high-dimensional lending data. [ABSTRACT FROM AUTHOR]
- Published
- 2020
7. Recognition algorithm for cross-texting in text chat conversations.
- Author
- Lee, Da-Young and Cho, Hwan-Gue
- Subjects
- ONLINE chat, NATURAL language processing, DEEP learning, ALGORITHMS, KOREAN language, TEXT mining
- Abstract
With the development of the Internet and IT technology, short-text-based communication has become popular compared with voice-based communication. Chat-based communication enables the rapid, short and massive exchange of messages with many people, but it also creates new social problems. 'Cross-texting' is one of them: accidentally sending a text to an unintended person during concurrent conversations with multiple separate people. Cross-texting can be a serious problem in languages where respectful expressions are required. As text-based communication grows in popularity, it is crucial to prevent cross-texting by detecting it in advance, particularly in languages with honorific expressions such as Korean. In this paper, we propose two methods for detecting cross-texting with deep learning models. The first is the formal feature vector model, which models dialog by explicitly defining politeness and completeness features. The second is the graph2vec-based ChatGram-net model, which models dialog based on the syllable occurrence relationship. To evaluate detection performance, we suggest a method for generating cross-texting datasets from an actual messenger corpus. Experiments show that both proposed models detect cross-texting effectively and exceed the performance of the baseline models. [ABSTRACT FROM AUTHOR]
- Published
- 2024
8. XOnto-Apriori: An Effective Association Rule Mining Algorithm for Personalized Recommendation Systems
- Author
- Gim, Jangwon, Jung, Hanmin, Jeong, Do-Heon, Park, James J. (Jong Hyuk), editor, Stojmenovic, Ivan, editor, Jeong, Hwa Young, editor, and Yi, Gangman, editor
- Published
- 2015
9. Robotic process automation using process mining — A systematic literature review.
- Author
- El-Gharib, Najah Mary and Amyot, Daniel
- Subjects
- ROBOTIC process automation, PROCESS mining, SURGICAL robots
- Abstract
Process mining (PM) aims to construct, from event logs, process maps that can help discover, automate, improve, and monitor organizational processes. Robotic process automation (RPA) uses software robots to perform some tasks usually executed by humans. It is usually difficult to determine what processes and steps to automate, especially with RPA, and PM is seen as one way to address this difficulty. This paper aims to assess the applicability of process mining in accelerating and improving the implementation of RPA, along with the challenges encountered throughout the project lifecycle. A systematic literature review was conducted to examine the approaches where PM techniques were used to understand the as-is processes that can be automated with software robots. Seven databases were used to identify papers on this topic. A total of 32 papers, all published since 2018, were selected from 605 unique candidate papers and then analyzed. There is a steady increase in the number of publications in this domain, especially during the year 2022, which suggests a rising interest in the combined use of PM with RPA. The literature mainly focuses on methods to record the events that occur at the level of user interactions with the application, and on the preprocessing methods needed to discover routines with steps that can be automated. Important challenges are faced in preprocessing such event logs, and many lifecycle steps of automation projects are weakly supported by existing approaches, suggesting corresponding research areas in need of further attention. • Combining process mining (PM) and RPA offers unique process management opportunities. • PM techniques must better support discovering processes based on UI logs for RPA. • Tools need to better integrate PM and RPA features together, in a synergetic way. • Challenges common to PM and RPA remain, e.g., with data gathering and preprocessing. [ABSTRACT FROM AUTHOR]
- Published
- 2023
10. SOM++: Integration of Self-Organizing Map and K-Means++ Algorithms
- Author
- Dogan, Yunus, Birant, Derya, Kut, Alp, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Goebel, Randy, editor, Siekmann, Jörg, editor, Wahlster, Wolfgang, editor, and Perner, Petra, editor
- Published
- 2013
11. Experimental identification of hard data sets for classification and feature selection methods with insights on method selection.
- Author
- Luan, Cuiju and Dong, Guozhu
- Subjects
- BIG data, BENCHMARKING (Management), FEATURE selection, RANDOM forest algorithms, MINING methodology
- Abstract
The paper reports an experimentally identified list of benchmark data sets that are hard for representative classification and feature selection methods. This was done after systematically evaluating a total of 48 combinations of methods, involving eight state-of-the-art classification algorithms and six commonly used feature selection methods, on 129 data sets from the UCI repository (some data sets with known high classification accuracy were excluded). In this paper, a data set for classification is called hard if none of the 48 combinations can achieve an AUC over 0.8 and none of them can achieve an F-measure value over 0.8; it is called easy otherwise. A total of 15 of the 129 data sets were found to be hard in that sense. The paper also compares the performance of different methods and produces rankings of classification methods, separately on the hard data sets and on the easy data sets; it is the first to rank methods separately for hard and easy data sets. It turns out that the classifier rankings resulting from our experiments differ somewhat from those in the literature and hence offer new insights on method selection. Notably, the Random Forest method remains the best in all groups of experiments. [ABSTRACT FROM AUTHOR]
- Published
- 2018
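The hardness criterion above is easy to operationalize. The sketch below is a hedged stand-in that uses two scikit-learn estimators in place of the paper's 48 method combinations; the dataset is synthetic.

```python
# Sketch of the "hard data set" criterion with scikit-learn stand-ins for
# the 48 classifier/feature-selection combinations evaluated in the study.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
methods = [RandomForestClassifier(random_state=0),
           LogisticRegression(max_iter=1000)]

best_auc = max(cross_val_score(m, X, y, scoring="roc_auc", cv=5).mean() for m in methods)
best_f1 = max(cross_val_score(m, X, y, scoring="f1", cv=5).mean() for m in methods)

# Hard per the paper: no combination exceeds 0.8 on either AUC or F-measure.
print(f"best AUC={best_auc:.3f}, best F1={best_f1:.3f}, "
      f"hard={best_auc <= 0.8 and best_f1 <= 0.8}")
```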
12. On mining approximate and exact fault-tolerant frequent itemsets.
- Author
- Liu, Shengxin and Poon, Chung Keung
- Subjects
- DATA mining, FAULT-tolerant computing, ROBUST statistics, HEURISTIC algorithms, PROBLEM solving, NP-hard problems
- Abstract
Robust frequent itemset mining has attracted much attention due to the necessity of finding frequent patterns in noisy data in many applications. In this paper, we focus on a variant of robust frequent itemsets in which a small number of "faults" is allowed in each item and each supporting transaction. This problem is challenging since computing the fault-tolerant support count is NP-hard and the anti-monotone property does not hold when the number of allowable faults is proportional to the size of the itemset. We develop heuristic methods to solve an approximation version of the problem and propose speedup techniques for the exact problem. Experimental results show that our heuristic algorithms are substantially faster than the state-of-the-art exact algorithms while the error remains acceptable. In addition, the proposed speedup techniques substantially improve the efficiency of the exact algorithms. [ABSTRACT FROM AUTHOR]
- Published
- 2018
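To make fault-tolerant support concrete, here is a hedged brute-force sketch of a simplified per-transaction variant; the paper's definition additionally bounds faults per item, which this toy check omits.

```python
# Brute-force fault-tolerant support: a transaction FT-supports an itemset
# if it misses at most `delta` of its items. Simplified per-transaction
# variant; the paper also bounds the faults allowed per item.
def ft_support(itemset, transactions, delta=1):
    return sum(1 for t in transactions if len(itemset - t) <= delta)

db = [{"a", "b", "c"}, {"a", "c"}, {"b"}, {"a", "b"}]
print(ft_support({"a", "b", "c"}, db, delta=1))  # 3 transactions miss at most one item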
13. Using subgroup discovery to relate odor pleasantness and intensity to peripheral nervous system reactions
- Author
- Maelle Moranges, Marc Plantevit, Moustafa Bensafi, Université Claude Bernard Lyon 1 (UCBL), Université de Lyon, Laboratoire d'InfoRmatique en Image et Systèmes d'information (LIRIS), Université Lumière - Lyon 2 (UL2)-École Centrale de Lyon (ECL), Université de Lyon-Université de Lyon-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Institut National des Sciences Appliquées de Lyon (INSA Lyon), Université de Lyon-Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Centre National de la Recherche Scientifique (CNRS)
- Subjects
- Human-Computer Interaction, [SCCO] Cognitive science, Pattern analysis, Mining methods and algorithms, Physiological measures, [INFO] Computer Science [cs], Software
- Abstract
Activation of the autonomic nervous system is a primary characteristic of human hedonic responses to sensory stimuli. For smells, general tendencies of physiological reactions have been described using classical statistics. However, these physiological variations are generally not quantified precisely; each psychophysiological parameter has very often been studied separately, and individual variability has not been systematically considered. The current study presents an innovative approach based on data mining, whose goal is to extract knowledge from a dataset. This approach uses a subgroup discovery algorithm which allows extraction of rules that apply to as many olfactory stimuli and individuals as possible. These rules are described by intervals over a set of physiological attributes. The results allowed both quantifying how each physiological parameter relates to odor pleasantness and perceived intensity and describing the contribution of each individual to these rules. This approach can be applied to other fields of the affective sciences characterized by complex and heterogeneous datasets.
- Published
- 2022
14. Feature weighted clustering for user profiling.
- Author
- Cufoglu, Ayse, Lohi, Mahi, and Everiss, Colin
- Subjects
- INTERNET user profiling, WEB personalization, ALGORITHMS, CLUSTER analysis (Statistics), COMPUTER simulation
- Abstract
Personalization is the adaptation of services to fit the user's interests, characteristics and needs. The key to effective personalization is user profiling. Apart from traditional collaborative and content-based approaches, a number of classification and clustering algorithms have been used to classify user-related information and create user profiles, but they have not been able to achieve accurate profiles. In this paper, we present a new clustering algorithm, Multi-Dimensional Clustering (MDC), for user profiling. MDC is a version of the Instance-Based Learner (IBL) algorithm that assigns weights to feature values and considers these weights during clustering. Three feature weighting methods are proposed for MDC, and all three have been tested and evaluated. Simulations were conducted using two user profile datasets: a training set of 10,000 instances and a test set of 1,000 instances. These datasets reflect each user's personal information, preferences and interests. Additional simulations and comparisons with existing weighted and non-weighted instance-based algorithms were carried out to demonstrate the performance of the proposed algorithm. Experimental results using the user profile datasets demonstrate that the proposed algorithm has better clustering accuracy than the other algorithms. This work is based on the doctoral thesis of the corresponding author. [ABSTRACT FROM AUTHOR]
- Published
- 2017
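As a hedged illustration of the feature-weighting idea underlying MDC (not the paper's three weighting methods), a weighted distance can damp features whose raw scale would otherwise dominate:

```python
# Feature-weighted Euclidean distance of the kind weighted instance-based
# clustering builds on. Profiles and weights are invented for illustration.
import numpy as np

def weighted_dist(a, b, w):
    return np.sqrt(np.sum(w * (a - b) ** 2))

profiles = np.array([[25, 1.0],    # e.g. [age, interest score]
                     [30, 0.9],
                     [60, 0.1]])
weights = np.array([0.01, 1.0])    # damp the large-scale age feature

print(weighted_dist(profiles[0], profiles[1], weights))  # similar profiles
print(weighted_dist(profiles[0], profiles[2], weights))  # dissimilar profiles
```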
15. A hybrid framework for mining high-utility itemsets in a sparse transaction database.
- Author
- Dawar, Siddharth, Goyal, Vikram, and Bera, Debajyoti
- Subjects
- DATA mining, DATABASES, INVENTORY management systems, INVERSE document frequency, INFORMATION retrieval
- Abstract
High-utility itemset mining aims to find the itemsets whose utility is no less than a user-defined threshold in a transaction database. It is an emerging research area in the field of data mining, with important applications in inventory management, query recommendation, systems operations research, bio-medical analysis, etc. Currently known algorithms for this problem can be classified as either 1-phase or 2-phase algorithms. The 2-phase algorithms are typically tree-based: they generate candidate high-utility itemsets and verify them later. A tree data structure generates candidate high-utility itemsets quickly by storing an upper-bound utility estimate at each node. The 1-phase algorithms are typically inverted-list based or transaction-projection based and avoid generating candidate itemsets, since the inverted list and transaction projection allow computation of exact utility estimates. We propose a novel hybrid framework that combines a tree-based and an inverted-list based algorithm to efficiently mine high-utility itemsets; algorithms based on the framework can harness the benefits of both types. We report experimental results on real and synthetic datasets to demonstrate the effectiveness of our framework. [ABSTRACT FROM AUTHOR]
- Published
- 2017
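The quantity every algorithm in this abstract thresholds is the utility of an itemset over a transaction database. A toy computation, with invented quantities and profits, looks like this:

```python
# Utility of an itemset: sum of quantity * unit profit over all transactions
# that contain every item of the itemset. Data is invented for illustration.
transactions = [
    {"bread": (2, 1.0), "milk": (1, 2.5)},   # item -> (quantity, unit profit)
    {"bread": (1, 1.0), "butter": (3, 4.0)},
    {"milk": (2, 2.5), "butter": (1, 4.0)},
]

def utility(itemset, db):
    total = 0.0
    for t in db:
        if all(item in t for item in itemset):
            total += sum(t[item][0] * t[item][1] for item in itemset)
    return total

# Only the third transaction contains both items: 2*2.5 + 1*4.0 = 9.0
print(utility({"milk", "butter"}, transactions))
```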
16. Target-Based, Privacy Preserving, and Incremental Association Rule Mining.
- Author
- Ahluwalia, Madhu V., Gangopadhyay, Aryya, Chen, Zhiyuan, and Yesha, Yelena
- Abstract
We consider a special case in association rule mining where mining is conducted by a third party over data located at a central location that is updated from several source locations. The data at the central location is at rest while that flowing in through source locations is in motion. We impose some limitations on the source locations, so that the central target location tracks and privatizes changes and a third party mines the data incrementally. Our results show high efficiency, privacy and accuracy of rules for small to moderate updates in large volumes of data. We believe that the framework we develop is therefore applicable and valuable for securely mining big data. [ABSTRACT FROM PUBLISHER]
- Published
- 2017
17. Continuous Learning Graphical Knowledge Unit for Cluster Identification in High Density Data Sets.
- Author
- Adikaram, K. K. L. B., Hussein, Mohamed A., Effenberger, Mathias, and Becker, Thomas
- Subjects
- BIG data, BIT-mapped graphics, GRAPHICS processing units, REAL-time computing, PIXELS
- Abstract
Big data are visually cluttered by overlapping data points. Rather than removing, reducing or reformulating overlap, we propose a simple, effective and powerful technique for density cluster generation and visualization, where point marker (graphical symbol of a data point) overlap is exploited in an additive fashion in order to obtain bitmap data summaries in which clusters can be identified visually, aided by automatically generated contour lines. In the proposed method, the plotting area is a bitmap and the marker is a shape of more than one pixel. As the markers overlap, the red, green and blue (RGB) colour values of pixels in the shared region are added. Thus, a pixel of a 24-bit RGB bitmap can code up to 2^24 (over 16 million) overlaps. A higher number of overlaps at the same location gives the area a distinctive colour that can be identified by the naked eye. A bitmap is a matrix of colour values that can be represented as integers, and the proposed method updates this matrix while adding new points; the matrix can therefore be considered an up-to-date knowledge unit of the processed data. Results show the cluster generation, cluster identification, missing and out-of-range data visualization, and outlier detection capabilities of the newly proposed method. [ABSTRACT FROM AUTHOR]
- Published
- 2016
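A minimal sketch of the additive-overlap idea follows; it accumulates counts in a plain integer matrix rather than packing them into 24-bit RGB pixels as the paper does. The point cloud is synthetic.

```python
# Accumulate marker hits per pixel in an integer matrix so that dense
# clusters appear as large counts (the paper encodes these counts as
# 24-bit RGB pixel values instead).
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(loc=50, scale=5, size=(10_000, 2))

bitmap = np.zeros((100, 100), dtype=np.uint32)
for x, y in points:
    ix, iy = int(x), int(y)
    if 0 <= ix < 100 and 0 <= iy < 100:
        bitmap[iy, ix] += 1  # each overlapping marker increments the pixel

print(bitmap.max(), "overlaps at the densest pixel")
```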
18. Block-Constraint Robust Principal Component Analysis and its Application to Integrated Analysis of TCGA Data.
- Author
- Liu, Jin-Xing, Gao, Ying-Lian, Zheng, Chun-Hou, Xu, Yong, and Yu, Jiguo
- Abstract
The Cancer Genome Atlas (TCGA) dataset provides more opportunities to systematically and comprehensively study the biological mechanisms of cancer formation, growth and metastasis. Since the TCGA dataset includes heterogeneous data, mining meaningful information from it is a bioinformatics bottleneck. In this paper, to improve the performance of Robust Principal Component Analysis (RPCA) on these heterogeneous data, a modified RPCA-based method, Block-Constraint Robust Principal Component Analysis (BCRPCA), is proposed. Since different categories of data have different peculiarities, BCRPCA enforces different constraint intensities on different categories to improve the performance of RPCA. Firstly, the observation matrix of TCGA data is decomposed into two additive matrices A and S using BCRPCA. Secondly, we use a ranking scheme to evaluate every feature and project these features onto the genes. The genes with high scores are then identified as differentially expressed. The main contributions of this paper are as follows: firstly, it proposes, for the first time, the idea and method of BCRPCA to model TCGA data; secondly, it provides a BCRPCA-based framework for integrated analysis of TCGA data. The results show that our method is effective and suitable for analyzing these data. [ABSTRACT FROM PUBLISHER]
- Published
- 2016
19. Parallel community detection on large graphs with MapReduce and GraphChi.
- Author
- Moon, Seunghyeon, Lee, Jae-Gil, Kang, Minseo, Choy, Minsoo, and Lee, Jin-woo
- Subjects
- SOCIAL networks, DOCUMENT clustering, DATA mining, HIERARCHICAL clustering (Cluster analysis), ALGORITHMS
- Abstract
Community detection from social network data gains much attention from academia and industry since it has many real-world applications. The Girvan-Newman (GN) algorithm is a divisive hierarchical clustering algorithm for community detection and is regarded as one of the most popular algorithms. It exploits the concept of edge betweenness to divide a network into multiple communities. Though widely used, it has limitations in supporting large-scale networks since it needs to calculate the shortest path between every pair of vertices in a network. In this paper, we develop two parallel versions of the GN algorithm to support large-scale networks. First, we propose a new algorithm, the Shortest Path Betweenness MapReduce Algorithm (SPB-MRA), that utilizes the MapReduce model. Second, we propose another new algorithm, the Shortest Path Betweenness Vertex-Centric Algorithm (SPB-VCA), that utilizes the vertex-centric model. An approximation technique is also developed to further speed up community detection. We implemented SPB-MRA using Hadoop and SPB-VCA using GraphChi, and then evaluated the performance of SPB-MRA on Amazon EC2 instances and that of SPB-VCA on a single commodity PC. The evaluation results showed that the elapsed time of SPB-MRA decreased almost linearly as the number of reducers increased, that SPB-VCA on just a single PC outperformed SPB-MRA by 4-6 times, and that the approximation technique introduced negligible errors. [ABSTRACT FROM AUTHOR]
- Published
- 2016
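For reference, the serial Girvan-Newman baseline that the paper parallelizes can be run via networkx; this is not the authors' MapReduce or GraphChi implementation.

```python
# Serial Girvan-Newman community detection on a small benchmark graph.
import networkx as nx
from networkx.algorithms.community import girvan_newman

G = nx.karate_club_graph()
# Each step removes the highest edge-betweenness edges; take the first split.
first_split = next(girvan_newman(G))
print([sorted(c) for c in first_split])
```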
20. An Automatic and Clause-Based Approach to Learn Relations for Ontologies.
- Author
- THENMOZHI, D. and ARAVINDAN, CHANDRABOSE
- Subjects
- ONTOLOGIES (Information retrieval), KNOWLEDGE acquisition (Expert systems), SEMANTIC computing, SOFT computing, MACHINE learning
- Abstract
Ontology learning from text is a knowledge acquisition process that facilitates the construction of ontologies. Considerable research has addressed learning concepts and relations, especially acquiring semantic relations between concepts for a specific domain. However, most research contributions learn either taxonomic relations or semantic relations, but not both. Even the few works that address learning both types of relations deal only with simple sentences, resulting in low recall. Further, these approaches are semi-automatic, requiring either user feedback or domain-expert knowledge. In this paper, we propose a single framework, automatic and domain-independent, that helps in learning both taxonomic and non-taxonomic relations. We have developed a clause-based approach that automatically extracts relations for concepts from unstructured text documents. Our approach can handle complex sentences by identifying hidden triples present in the sentences. We evaluated our relation-learning methodology on the concepts specified by AGROVOC and the Open Directory Project, using a corpus of web documents. The precision, recall and F1-measure of our method were observed to be considerably higher than those of state-of-the-art methodologies for relation learning. [ABSTRACT FROM AUTHOR]
- Published
- 2016
21. A parameter-free KNN for rating prediction.
- Author
- Fopa, Medjeu, Gueye, Modou, Ndiaye, Samba, and Naacke, Hubert
- Subjects
- K-nearest neighbor classification, MATHEMATICAL optimization, RECOMMENDER systems, FORECASTING
- Abstract
Among the most popular collaborative filtering algorithms are methods based on the K nearest neighbors (KNN). In their basic operation, KNN methods consider a fixed number of neighbors to make recommendations. However, it is not easy to choose an appropriate number of neighbors, so it is generally fixed by calibration to avoid inappropriate values that would hurt recommendation accuracy. In the literature, some authors have addressed the problem of dynamically finding an appropriate number of neighbors, but they use additional parameters that limit their proposals because these parameters also require calibration. In this paper, we propose a parameter-free KNN method for rating prediction, able to dynamically select an appropriate number of neighbors to use. Experiments on four publicly available datasets demonstrate the efficiency of our proposal, which rivals state-of-the-art methods in their best configurations. • Parameter-free KNN for rating prediction. • Optimization of the KNN algorithm through the choice of neighbors. • Dynamic selection of an optimal number of neighbors. [ABSTRACT FROM AUTHOR]
- Published
- 2022
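For context, a plain fixed-k user-based KNN rating predictor, the baseline that the paper makes parameter-free, can be sketched as follows; the ratings matrix is invented.

```python
# Fixed-k user-based KNN rating prediction on a toy ratings matrix
# (0 means unrated). The paper's contribution is choosing k dynamically.
import numpy as np

ratings = np.array([[5, 3, 0, 1],
                    [4, 0, 0, 1],
                    [1, 1, 0, 5],
                    [1, 0, 0, 4]], dtype=float)

def predict(u, item, k=2):
    """Average the item's rating over the k most similar users who rated it."""
    sims = np.full(len(ratings), -np.inf)
    for v in range(len(ratings)):
        if v != u and ratings[v, item] > 0:
            sims[v] = np.corrcoef(ratings[u], ratings[v])[0, 1]
    neighbors = np.argsort(sims)[-k:]       # indices of the top-k similarities
    return ratings[neighbors, item].mean()

print(predict(u=1, item=1))  # predicted from users 0 and 2, who rated item 1
```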
22. Temporal expression extraction with extensive feature type selection and a posteriori label adjustment.
- Author
- Filannino, Michele and Nenadic, Goran
- Subjects
- FEATURE extraction, FEATURE selection, INFORMATION theory, NATURAL language processing, NOUN phrases (Grammar), MORPHOLOGY (Grammar)
- Abstract
The automatic extraction of temporal information from written texts is pivotal for many Natural Language Processing applications such as question answering, text summarisation and information retrieval. It allows filtering information and inferring temporal flows of events. This paper presents ManTIME, a general-domain temporal expression identification and normalisation system, and systematically explores the impact of different features and training corpora on performance. The identification phase combines conditional random fields with a post-processing pipeline, whereas the normalisation phase is carried out using NorMA, an open-source rule-based temporal normaliser. We investigate the performance variation with respect to different feature types. Specifically, we show that the use of WordNet-based features in the identification task negatively affects overall performance, and that there is no statistically significant difference from adding gazetteers, shallow parsing and propositional noun phrase labels on top of the morpho-lexical features. We also show that the use of silver data (alone or in addition to the human-annotated data) does not improve performance. We evaluate six combinations of training data and post-processing pipeline against the TempEval-3 benchmark test set. The best run achieved 0.95 (precision), 0.85 (recall) and 0.90 (F1) in the identification phase. Normalisation accuracies are 0.86 (for the type attribute) and 0.77 (for the value attribute). The proposed approach ranked 3rd in the TempEval-3 challenge (task A) as the best-performing machine-learning-based system among 21 participants. [ABSTRACT FROM AUTHOR]
- Published
- 2015
23. Efficient community identification and maintenance at multiple resolutions on distributed datastores.
- Author
- Aksu, Hidayet, Canim, Mustafa, Chang, Yuan-Chi, Korpeoglu, Ibrahim, and Ulusoy, Özgür
- Subjects
- VIRTUAL communities, DISTRIBUTED computing, WEBSITES, COMPUTER networks, INFORMATION storage & retrieval systems, BLOGS
- Abstract
The topic of network community identification at multiple resolutions is of great practical interest for learning highly cohesive subnetworks about different subjects in a network. For instance, one might examine the interconnections among web pages, blogs and social content to identify pockets of influencers on subjects like 'Big Data', 'smart phone' or 'global warming'. With dynamic changes to its graph representation and content, the incremental maintenance of a community poses significant computational challenges. Moreover, the intensity of community engagement can be distinguished at multiple levels, resulting in a multi-resolution community representation that has to be maintained over time. In this paper, we first formalize this problem using the k-core metric projected at multiple k values, so that multiple community resolutions are represented by multiple k-core graphs. Recognizing that large graphs and their even larger attributed content cannot be stored and managed by a single server, we then propose distributed algorithms to construct and maintain a multi-k-core graph, implemented on the scalable Big Data platform Apache HBase. Our experimental evaluation demonstrates orders-of-magnitude speedup from maintaining the multi-k-core incrementally rather than reconstructing it from scratch. Our algorithms thus enable practitioners to create and maintain communities at multiple resolutions on multiple subjects in rich network content simultaneously. [ABSTRACT FROM AUTHOR]
- Published
- 2015
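A static view of the multi-resolution idea, assuming networkx's k_core in place of the paper's incremental HBase algorithms:

```python
# k-cores of the same graph at several k values: each k is one community
# resolution. The paper maintains these incrementally; this is the static view.
import networkx as nx

G = nx.gnm_random_graph(200, 800, seed=1)
for k in (2, 4, 6):   # larger k means tighter community engagement
    core = nx.k_core(G, k)
    print(f"k={k}: {core.number_of_nodes()} nodes, {core.number_of_edges()} edges")
```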
24. Mining time-interval univariate uncertain sequential patterns.
- Author
- Liu, Ying-Ho
- Subjects
- UNIVARIATE analysis, SEQUENTIAL pattern mining, ALGORITHMS, PROBABILITY density function, DATA mining
- Abstract
In this study, we propose two algorithms to discover time-interval univariate uncertain (U2) sequential patterns from a set of univariate uncertain (U2) sequences. A U2-sequence is a sequence that contains transactions of univariate uncertain data, where each attribute in a transaction is associated with a quantitative interval and a probability density function indicating the possibility that each value exists in the interval. Many sources record U2-sequences, such as atmospheric pollution sensors and network monitoring systems. Mining sequential patterns from these U2-sequences is important for understanding their intrinsic characteristics. The two proposed algorithms are based on the candidate generate-and-test methodology and the pattern growth methodology, respectively. We performed a series of experiments to evaluate them in terms of runtime and memory consumption. The experimental results show that different algorithms excel under different conditions; in general, the algorithm based on the pattern growth methodology is the better choice. [ABSTRACT FROM AUTHOR]
- Published
- 2015
25. Pattern-Aided Regression Modeling and Prediction Model Analysis.
- Author
- Dong, Guozhu and Taslimitehrani, Vahid
- Subjects
- PATTERNS (Mathematics), REGRESSION analysis, PREDICTION models, STATISTICAL correlation, ERROR analysis in mathematics, DATA mining
- Abstract
This paper first introduces pattern-aided regression (PXR) models, a new type of regression model designed to represent accurate and interpretable prediction models. This was motivated by two observations: (1) regression modeling applications often involve complex and diverse predictor-response relationships, which occur when the optimal regression models (of a given regression model type) fitting two or more distinct logical groups of data are highly different; (2) state-of-the-art regression methods are often unable to adequately model such relationships. This paper defines PXR models using several patterns and local regression models, which respectively serve as logical and behavioral characterizations of distinct predictor-response relationships. The paper also introduces a contrast pattern aided regression (CPXR) method to build accurate PXR models. In experiments, the PXR models built by CPXR are very accurate in general, often outperforming state-of-the-art regression methods by wide margins. Usually using (a) around seven simple patterns and (b) linear local regression models, those PXR models are easy to interpret; in fact, their complexity is just a bit higher than that of (piecewise) linear regression models and significantly lower than that of traditional ensemble-based regression models. CPXR is especially effective for high-dimensional data. The paper also discusses how to use the CPXR methodology for analyzing prediction models and correcting their prediction errors. [ABSTRACT FROM AUTHOR]
- Published
- 2015
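A toy reading of the pattern-aided idea, with an invented pattern and data; CPXR itself mines contrast patterns rather than being handed one:

```python
# A pattern splits the data into two logical groups, each with its own
# local linear model, mirroring the PXR structure described above.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
pattern = X[:, 0] > 5                      # the "pattern" defining one group
y = np.where(pattern, 3 * X[:, 0] - 4, -2 * X[:, 0] + 1) + rng.normal(size=200)

# One local regression model per group.
models = {flag: LinearRegression().fit(X[pattern == flag], y[pattern == flag])
          for flag in (False, True)}

x_new = np.array([[7.0]])                  # matches the pattern, so use its model
print(models[True].predict(x_new))
```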
26. Anonymizing graphs: measuring quality for clustering.
- Author
- Casas-Roma, Jordi, Herrera-Joancomartí, Jordi, and Torra, Vicenç
- Subjects
- DATA mining, MINING methodology, ALGORITHMS, XML (Extensible Markup Language), INFORMATION processing
- Abstract
Anonymization of graph-based data is a problem that has been widely studied in recent years, and several anonymization methods have been developed. Information loss measures have been devised to evaluate the noise introduced into the anonymized data. Generic information loss measures ignore the intended use of the anonymized data; when data has to be released to third parties with no control over what analyses users might perform, these measures are the standard choice. In this paper we study different generic information loss measures for graphs, comparing them to cluster-specific ones. We want to evaluate whether the generic information loss measures are indicative of the usefulness of the data for subsequent data mining processes. [ABSTRACT FROM AUTHOR]
- Published
- 2015
27. Design of computationally efficient density-based clustering algorithms.
- Author
- Nanda, Satyasai Jagannath and Panda, Ganapati
- Subjects
- ALGORITHMS, APPLICATION software, DATABASES, COMPUTATIONAL complexity, DATA mining, ASSOCIATION rule mining
- Abstract
The basic DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm uses a minimal number of input parameters and is very effective for clustering large spatial databases, but involves high computational complexity. This paper proposes a new strategy to reduce the computational complexity of DBSCAN by efficiently implementing new merging criteria at the initial stage of cluster evolution. Further, new density-based clustering (DBC) algorithms are proposed that use the correlation coefficient as the similarity measure. Although not computationally efficient, these algorithms are effective when the patterns in a dataset are highly similar; their computations are then reduced with the new cluster merging criteria. Tests on several synthetic and real datasets demonstrate that these computationally efficient algorithms are comparable in accuracy to the traditional one. An interesting application of the proposed algorithm is demonstrated by identifying regional hazard regions in the seismic catalog of Japan. [ABSTRACT FROM AUTHOR]
- Published
- 2015
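As a hedged sketch of density-based clustering under a correlation similarity (not the authors' merging criteria), scikit-learn's DBSCAN can be run on a precomputed correlation-distance matrix:

```python
# DBSCAN over a correlation-distance matrix, in the spirit of the DBC
# variants above. Data is random; eps and min_samples are illustrative.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))

D = np.clip(1.0 - np.corrcoef(X), 0.0, None)  # correlation distance, kept >= 0
np.fill_diagonal(D, 0.0)

labels = DBSCAN(eps=0.9, min_samples=5, metric="precomputed").fit_predict(D)
print(np.unique(labels))  # -1 marks noise points
```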
28. Process Discovery Algorithms Using Numerical Abstract Domains.
- Author
- Carmona, Josep and Cortadella, Jordi
- Subjects
- FORMAL methods (Computer science), ARTIFICIAL neural networks, INFORMATION technology, SOFTWARE engineering, PETRI nets
- Abstract
The discovery of process models from event logs has emerged as one of the crucial problems for enabling the continuous support in the life-cycle of an information system. However, in a decade of process discovery research, the algorithms and tools that have appeared are known to have strong limitations in several dimensions. The size of the logs and the formal properties of the model discovered are the two main challenges nowadays. In this paper we propose the use of numerical abstract domains for tackling these two problems, for the particular case of the discovery of Petri nets. First, numerical abstract domains enable the discovery of general process models, requiring no knowledge (e.g., the bound of the Petri net to derive) for the discovery algorithm. Second, by using divide and conquer techniques we are able to control the size of the process discovery problems. The methods proposed in this paper have been implemented in a prototype tool and experiments are reported illustrating the significance of this fresh view of the process discovery problem. [ABSTRACT FROM AUTHOR]
- Published
- 2014
29. Motif-Based Hyponym Relation Extraction from Wikipedia Hyperlinks.
- Author
- Wei, Bifan, Liu, Jun, Ma, Jian, Zheng, Qinghua, Zhang, Wei, and Feng, Boqin
- Subjects
- FEATURE extraction, HYPERLINKS, MACHINE learning, KNOWLEDGE acquisition (Expert systems), DATA mining, ELECTRONIC publishing
- Abstract
Discovering hyponym relations among domain-specific terms is a fundamental task in taxonomy learning and knowledge acquisition. However, the great diversity of domain corpora and the lack of labeled training sets make this task very challenging for conventional methods based on text content. In this study, the hyperlink structure of Wikipedia article pages was found to contain recurring network motifs that indicate the probability of a hyperlink being a hyponym hyperlink. Hence, a novel hyponym relation extraction approach based on the network motifs of Wikipedia hyperlinks is proposed. The approach automatically constructs motif-based features from the hyperlink structure of a domain: every hyperlink is mapped to a 13-dimensional feature vector based on the 13 types of three-node motifs. The approach extracts structural information from Wikipedia and heuristically creates a labeled training set, from which classification models are determined for hyponym relation extraction. Two experiments were conducted to validate the approach on seven domain-specific datasets obtained from Wikipedia. The first experiment, using manually labeled data, verified the effectiveness of the motif-based features. The second, using an automatically labeled training set from different domains, showed that the proposed approach performs better than an approach based on lexico-syntactic patterns and achieves results comparable to an approach based on textual features. Experimental results show the practicability and fairly good domain scalability of the proposed approach. [ABSTRACT FROM AUTHOR]
- Published
- 2014
30. Cost-Sensitive Online Classification.
- Author
- Wang, Jialei, Zhao, Peilin, and Hoi, Steven C. H.
- Subjects
- DISTANCE education, COST effectiveness, PROBLEM solving, ALGORITHMS, MACHINE learning, DATA mining
- Abstract
Both cost-sensitive classification and online learning have been extensively studied in the data mining and machine learning communities, respectively. However, very few studies address an important intersecting problem: "Cost-Sensitive Online Classification". In this paper, we formally study this problem and propose a new framework for cost-sensitive online classification that directly optimizes cost-sensitive measures using online gradient descent techniques. Specifically, we propose two novel cost-sensitive online classification algorithms, designed to directly optimize two well-known cost-sensitive measures: (i) maximization of the weighted sum of sensitivity and specificity, and (ii) minimization of the weighted misclassification cost. We analyze the theoretical bounds of the cost-sensitive measures achieved by the proposed algorithms and extensively examine their empirical performance on a variety of cost-sensitive online classification tasks. Finally, we demonstrate the application of the proposed technique to several online anomaly detection tasks, showing that it can be a highly efficient and effective tool for cost-sensitive online classification in various application domains. [ABSTRACT FROM AUTHOR]
- Published
- 2014
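Measure (ii), minimizing a weighted misclassification cost online, can be sketched as online gradient descent on a class-weighted hinge loss; the cost weights and data below are assumptions, not the paper's algorithm.

```python
# Online gradient descent on a class-weighted hinge loss: updates are
# scaled by the misclassification cost of the example's class.
import numpy as np

def cost_sensitive_ogd(stream, dim, c_pos=5.0, c_neg=1.0, lr=0.1):
    """y in {-1, +1}; false negatives cost c_pos, false positives c_neg."""
    w = np.zeros(dim)
    for x, y in stream:
        cost = c_pos if y > 0 else c_neg
        if y * w.dot(x) < 1:           # weighted hinge loss is active
            w += lr * cost * y * x     # gradient step scaled by the class cost
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.where(X[:, 0] + 0.1 * rng.normal(size=1000) > 0, 1.0, -1.0)
print(cost_sensitive_ogd(zip(X, y), dim=3))
```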
31. Outlier Detection for Temporal Data: A Survey.
- Author
- Gupta, Manish, Gao, Jing, Aggarwal, Charu C., and Han, Jiawei
- Subjects
- OUTLIER detection, TEMPORAL databases, TIME series analysis, COMPUTER software, DISTRIBUTED databases, PATTERN matching
- Abstract
In the statistics community, outlier detection for time series data has been studied for decades. Recently, with advances in hardware and software technology, there has been a large body of work on temporal outlier detection from a computational perspective within the computer science community. In particular, advances in hardware technology have enabled the availability of various forms of temporal data collection mechanisms, and advances in software technology have enabled a variety of data management mechanisms. This has fueled the growth of different kinds of data sets such as data streams, spatio-temporal data, distributed streams, temporal networks, and time series data, generated by a multitude of applications. There arises a need for an organized and detailed study of the work done in the area of outlier detection with respect to such temporal datasets. In this survey, we provide a comprehensive and structured overview of a large set of interesting outlier definitions for various forms of temporal data, novel techniques, and application scenarios in which specific definitions and techniques have been widely used. [ABSTRACT FROM AUTHOR]
- Published
- 2014
32. Surfing the Network for Ranking by Multidamping.
- Author
- Kollias, Giorgos, Gallopoulos, Efstratios, and Grama, Ananth
- Subjects
- INTERNET, APPROXIMATION algorithms, MARKOV processes, MONTE Carlo method, INTERNET users
- Abstract
PageRank is one of the most commonly used techniques for ranking nodes in a network. It is a special case of a family of link-based rankings, commonly referred to as functional rankings. Functional rankings are computed as power series of a stochastic matrix derived from the adjacency matrix of the graph. This general formulation of functional rankings enables their use in diverse applications, ranging from traditional search applications to identification of spam and outliers in networks. This paper presents a novel algorithmic (re)formulation of commonly used functional rankings, such as LinearRank, TotalRank and Generalized Hyperbolic Rank. These rankings can be approximated by finite series representations. We prove that polynomials of stochastic matrices can be expressed as products of Google matrices (matrices having the form used in Google’s original PageRank formulation). Individual matrices in these products are parameterized by different damping factors. For this reason, we refer to our formulation as multidamping. We demonstrate that multidamping has a number of desirable characteristics: (i) for problems such as finding the highest ranked pages, multidamping admits extremely fast approximate solutions; (ii) multidamping provides an intuitive interpretation of existing functional rankings in terms of the surfing habits of model web users; (iii) multidamping provides a natural framework based on Monte Carlo type methods that have efficient parallel and distributed implementations. It also provides the basis for constructing new link-based rankings based on inhomogeneous products of Google matrices. We present algorithms for computing damping factors for existing functional rankings analytically and numerically. We validate various benefits of multidamping on a number of real datasets. [ABSTRACT FROM AUTHOR]
- Published
- 2014
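The functional rankings discussed here are power series of a stochastic matrix. Below is a hedged numerical sketch using PageRank's coefficients (1-a)a^k on a tiny invented graph; it illustrates the series form, not the multidamping reformulation itself.

```python
# A functional ranking as a truncated power series of the stochastic matrix,
# the object that multidamping re-expresses as products of Google matrices.
import numpy as np

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 1, 0]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)    # row-stochastic transition matrix

a, v = 0.85, np.full(3, 1 / 3)          # damping factor and teleport vector
rank = sum((1 - a) * a**k * v @ np.linalg.matrix_power(P, k) for k in range(50))
print(rank / rank.sum())
```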
33. Discovery of temporal neighborhoods through discretization methods.
- Author
- Dey, Sandipan, Janeja, Vandana P., and Gangopadhyay, Aryya
- Subjects
- KNOWLEDGE management research, TEMPORAL databases, DATA mining, DATA science, DISCRETIZATION methods
- Abstract
Neighborhood discovery is a precursor to knowledge discovery in complex and large datasets such as temporal data, i.e., sequences of data tuples measured at successive time instants. Instead of mining the entire dataset, we are interested in dividing it into several smaller intervals of interest, which we call temporal neighborhoods. In this paper we propose a class of algorithms to generate temporal neighborhoods through unequal-depth discretization. We describe four novel algorithms: (a) Similarity-based Merging (SMerg), (b) Stationary-distribution-based Merging (StMerg), (c) Greedy Merging (GMerg), and (d) Optimal Merging (OptMerg). SMerg and StMerg are based on the robust framework of Markov models and the Markov stationary distribution, respectively. GMerg is a greedy approach, and OptMerg is geared toward discovering optimal binning strategies for the most effective partitioning of the data into temporal neighborhoods; neither of these two uses Markov models. We identify temporal neighborhoods with distinct demarcations based on unequal-depth discretization of the data. We discuss detailed experimental results on both synthetic and real-world data. Specifically, we show (i) the efficacy of our algorithms through the precision and recall of labeled bins, (ii) ground-truth validation on real-world traffic monitoring datasets, and (iii) knowledge discovery in the temporal neighborhoods, such as global anomalies. Our results indicate that we are able to identify valuable knowledge, as confirmed by ground-truth validation on real-world traffic data. [ABSTRACT FROM AUTHOR]
- Published
- 2014
34. Probabilistic Aspect Mining Model for Drug Reviews.
- Author
- Cheng, Victor C., Leung, C. H. C., Liu, Jiming, and Milani, Alfredo
- Subjects
- PROBABILISTIC databases, DATA mining, CLINICAL drug trials, DATA modeling, TEXT mining, FEATURE extraction
- Abstract
Recent findings show that online reviews, blogs, and discussion forums on chronic diseases and drugs are becoming important supporting resources for patients. Extracting information from these substantial bodies of text is useful and challenging. We developed a generative probabilistic aspect mining model (PAMM) for identifying the aspects/topics relating to the class labels or categorical meta-information of a corpus. Unlike many other unsupervised or supervised approaches, PAMM has a unique feature: it focuses on finding aspects relating to one class only, rather than finding aspects for all classes simultaneously in each execution. This reduces the chance of aspects being formed from mixed concepts of different classes, so the identified aspects are easier for people to interpret. The aspects found are also class-distinguishing: they can be used to distinguish a class from other classes. An efficient EM algorithm is developed for parameter estimation. Experimental results on reviews of four different drugs show that PAMM finds better aspects than other common approaches when measured by mean pointwise mutual information and classification accuracy. In addition, the derived aspects were assessed by humans from different specified perspectives, and PAMM was rated highest. [ABSTRACT FROM PUBLISHER]
- Published
- 2014
35. A Simple but Powerful Heuristic Method for Accelerating k-Means Clustering of Large-Scale Data in Life Science.
- Author
- Ichikawa, Kazuki and Morishita, Shinichi
- Abstract
K-means clustering has been widely used to gain insight into biological systems from large-scale life science data. To quantify the similarities among biological data sets, Pearson correlation distance and standardized Euclidean distance are used most frequently; however, optimization methods have been largely unexplored. These two distance measurements are equivalent in the sense that they yield the same k-means clustering result for identical sets of k initial centroids. Thus, an efficient algorithm used for one is applicable to the other. Several optimization methods are available for the Euclidean distance and can be used for processing the standardized Euclidean distance; however, they are not customized for this context. We instead approached the problem by studying the properties of the Pearson correlation distance, and we invented a simple but powerful heuristic method for markedly pruning unnecessary computation while retaining the final solution. Tests using real biological data sets with 50-60K vectors of dimensions 10–2001 (∼400 MB in size) demonstrated marked reduction in computation time for k = 10-500 in comparison with other state-of-the-art pruning methods such as Elkan's and Hamerly's algorithms. The BoostKCP software is available at http://mlab.cb.k.u-tokyo.ac.jp/∼ichikawa/boostKCP/. [ABSTRACT FROM PUBLISHER]
- Published
- 2014
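The equivalence the abstract relies on can be checked numerically: after z-scoring, squared Euclidean distance is an affine function of Pearson correlation (||z(u) - z(v)||^2 = 2d(1 - r) for d-dimensional vectors), so the same k-means machinery serves both distances. A sketch with random data:

```python
# Verify ||z(u) - z(v)||^2 == 2 * d * (1 - r) for z-scored vectors.
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.normal(size=20), rng.normal(size=20)

def z(x):
    return (x - x.mean()) / x.std()    # zero mean, unit (population) variance

r = np.corrcoef(u, v)[0, 1]
print(np.sum((z(u) - z(v)) ** 2))      # matches the line below
print(2 * len(u) * (1 - r))            # up to floating-point error
```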
36. The Sum-over-Forests Density Index: Identifying Dense Regions in a Graph.
- Author
- Senelle, Mathieu, Garcia-Diez, Silvia, Mantrach, Amin, Shimbo, Masashi, Saerens, Marco, and Fouss, Francois
- Subjects
- GRAPH theory, MAXWELL-Boltzmann distribution law, RANDOM forest algorithms, SET theory, MATRIX inversion
- Abstract
This work introduces a novel nonparametric density index defined on graphs, the Sum-over-Forests (SoF) density index. It is based on a clear and intuitive idea: high-density regions in a graph are characterized by the fact that they contain a large number of low-cost trees with high outdegrees, while low-density regions contain few. Therefore, a Boltzmann probability distribution on the countable set of forests in the graph is defined so that large (high-cost) forests occur with low probability while short (low-cost) forests occur with high probability. The SoF density index of a node is then defined as the expected outdegree of this node on the set of forests, thus providing a measure of density around that node. Following the matrix-forest theorem and a statistical physics framework, it is shown that the SoF density index can be computed in closed form through a simple matrix inversion. Experiments on artificial and real datasets show that the proposed index performs well at finding dense regions, for graphs of various origins. [ABSTRACT FROM PUBLISHER]
- Published
- 2014
37. Discovering Conservation Rules.
- Author
- Golab, Lukasz, Karloff, Howard, Korn, Flip, Saha, Barna, and Srivastava, Divesh
- Subjects
- DATABASE management, DATA quality, INFORMATION storage & retrieval systems, APPROXIMATION algorithms, PROGRAMMING language semantics, TEXT mining
- Abstract
Many applications process data in which there exists a “conservation law” between related quantities. For example, in traffic monitoring, every incoming event, such as a packet's entering a router or a car's entering an intersection, should ideally have an immediate outgoing counterpart. We propose a new class of constraints—Conservation Rules—that express the semantics and characterize the data quality of such applications. We give confidence metrics that quantify how strongly a conservation rule holds and present approximation algorithms (with error guarantees) for the problem of discovering a concise summary of subsets of the data that satisfy a given conservation rule. Using real data, we demonstrate the utility of conservation rules and we show order-of-magnitude performance improvements of our discovery algorithms over naive approaches. [ABSTRACT FROM PUBLISHER]
- Published
- 2014
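A minimal reading of a conservation rule, with an invented event stream: incoming events should balance outgoing ones per entity, and a simple confidence is the balanced fraction (the paper's confidence metrics and discovery algorithms are more refined).

```python
# Per-entity in/out balance as a toy conservation-rule check.
from collections import Counter

events = [("r1", "in"), ("r1", "out"), ("r2", "in"), ("r2", "in"), ("r2", "out")]
ins, outs = Counter(), Counter()
for key, kind in events:
    (ins if kind == "in" else outs)[key] += 1

balanced = [k for k in ins if ins[k] == outs[k]]
print(f"confidence = {len(balanced) / len(ins):.2f}")  # r1 balances; r2 does not
```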
38. Mining Probabilistically Frequent Sequential Patterns in Large Uncertain Databases.
- Author
-
Zhao, Zhou, Yan, Da, and Ng, Wilfred
- Subjects
- *
DATABASES , *GLOBAL Positioning System , *DATA mining , *PROBABILITY theory , *TEMPERATURE sensors , *DATA modeling - Abstract
Data uncertainty is inherent in many real-world applications such as environmental surveillance and mobile tracking. Mining sequential patterns from inaccurate data, such as data arising from sensor readings and GPS trajectories, is important for discovering hidden knowledge in such applications. In this paper, we propose to measure pattern frequentness based on the possible world semantics. We establish two uncertain sequence data models abstracted from many real-life applications involving uncertain sequence data, and formulate the problem of mining probabilistically frequent sequential patterns (or p-FSPs) from data that conform to our models. However, the number of possible worlds is extremely large, which makes the mining prohibitively expensive. Inspired by the famous PrefixSpan algorithm, we develop two new algorithms, collectively called U-PrefixSpan, for p-FSP mining. U-PrefixSpan effectively avoids the problem of "possible worlds explosion", and when combined with our four pruning and validating methods, achieves even better performance. We also propose a fast validating method to further speed up our U-PrefixSpan algorithm. The efficiency and effectiveness of U-PrefixSpan are verified through extensive experiments on both real and synthetic datasets. [ABSTRACT FROM AUTHOR] (A sketch of the frequentness test follows this record.)
- Published
- 2014
- Full Text
- View/download PDF
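To see what probabilistic frequentness means without the U-PrefixSpan machinery, assume a simplified model in which sequence i contains the pattern with probability p[i], independently of the others. The support count then follows a Poisson-binomial distribution, and P(support >= minsup) falls out of a small dynamic program:

```python
# Sketch of the possible-world frequentness test under a simplified,
# sequence-level independence model (not U-PrefixSpan itself): the
# support count is Poisson-binomial, so P(support >= minsup) is a DP.
def prob_frequent(p, minsup):
    dp = [1.0] + [0.0] * len(p)          # dp[k] = P(support == k) so far
    for pi in p:
        for k in range(len(dp) - 1, 0, -1):
            dp[k] = dp[k] * (1 - pi) + dp[k - 1] * pi
        dp[0] *= (1 - pi)
    return sum(dp[minsup:])

print(prob_frequent([0.9, 0.5, 0.4, 0.8], minsup=2))   # ~0.91
```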
39. Adaptation Regularization: A General Framework for Transfer Learning.
- Author
-
Long, Mingsheng, Wang, Jianmin, Ding, Guiguang, Pan, Sinno Jialin, and Yu, Philip S.
- Subjects
- *
MATHEMATICAL regularization , *SUPPORT vector machines , *LEAST squares , *HILBERT space , *MARGINAL distributions , *MACHINE theory - Abstract
Domain transfer learning, which learns a target classifier using labeled data from a different distribution, has shown promising value in knowledge discovery yet remains a challenging problem. Most previous works designed adaptive classifiers by exploring two learning strategies independently: distribution adaptation and label propagation. In this paper, we propose a novel transfer learning framework, referred to as Adaptation Regularization based Transfer Learning (ARTL), to model them in a unified way based on the structural risk minimization principle and regularization theory. Specifically, ARTL learns the adaptive classifier by simultaneously optimizing the structural risk functional, the joint distribution matching between domains, and the manifold consistency underlying the marginal distribution. Based on the framework, we propose two novel methods using Regularized Least Squares (RLS) and Support Vector Machines (SVMs), respectively, and use the Representer theorem in reproducing kernel Hilbert space to derive the corresponding solutions. Comprehensive experiments verify that ARTL can significantly outperform state-of-the-art learning methods on several public text and image datasets. [ABSTRACT FROM AUTHOR] (A sketch of the distribution-matching term follows this record.)
- Published
- 2014
- Full Text
- View/download PDF
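One ingredient of the distribution matching above can be shown compactly: the empirical Maximum Mean Discrepancy (MMD) between source and target samples under an RBF kernel. This is a hedged sketch of that single term only; ARTL's objective combines it with the structural risk and a manifold regularizer and is solved via the Representer theorem:

```python
# Sketch of one ARTL ingredient: the (biased) empirical MMD^2 between a
# source sample Xs and a target sample Xt under an RBF kernel. A larger
# value indicates a larger marginal-distribution gap between domains.
import numpy as np

def rbf(X, Y, gamma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(Xs, Xt, gamma=1.0):
    return (rbf(Xs, Xs, gamma).mean()
            - 2 * rbf(Xs, Xt, gamma).mean()
            + rbf(Xt, Xt, gamma).mean())

rng = np.random.default_rng(0)
Xs = rng.normal(0.0, 1.0, size=(100, 5))    # source-domain sample
Xt = rng.normal(0.5, 1.0, size=(100, 5))    # mean-shifted target domain
print(mmd2(Xs, Xt))
```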
40. Multi-Relational Model Tree Induction Tightly-Coupled with a Relational Database.
- Author
-
Appice, Annalisa, Ceci, Michelangelo, and Malerba, Donato
- Subjects
- *
DATA mining , *RELATIONAL databases , *REGRESSION analysis , *GENERALIZATION , *DATA structures , *ELECTRONIC data processing , *STRUCTURAL frame models - Abstract
Multi-Relational Data Mining (MRDM) refers to the process of discovering implicit, previously unknown and potentially useful information from data scattered in multiple tables of a relational database. Following the mainstream of MRDM research, we tackle the regression task, where the goal is to examine samples of past experience with known continuous answers (response) and generalize to future cases through an inductive process. Mr-SMOTI, the solution we propose, resorts to the structural approach in order to recursively partition data stored in a tightly-coupled database and build a multi-relational model tree which captures the linear dependence between the response variable and one or more explanatory variables. The model tree is top-down induced by choosing, at each step, either to partition the training space or to introduce a regression variable into the linear models at the leaves. The tight coupling with the database makes the knowledge on data structures (foreign keys) available free of charge to guide the search in the multi-relational pattern space. Experiments on artificial and real databases demonstrate that in general Mr-SMOTI outperforms both SMOTI and M5', which are two propositional model tree induction systems, as well as TILDE-RT, a state-of-the-art structural model tree induction system. [ABSTRACT FROM AUTHOR] (A propositional model-tree sketch follows this record.)
- Published
- 2014
- Full Text
- View/download PDF
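The underlying model-tree idea, partition first and then fit linear models at the leaves, can be illustrated in a propositional setting. Mr-SMOTI works on relational data inside the DBMS and interleaves split and regression nodes, which the following scikit-learn sketch does not attempt; the two-stage pipeline below is only an analogue with synthetic data:

```python
# Propositional analogue of a model tree: a shallow regression tree
# partitions the space, then an ordinary linear model is fit per leaf.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
# Piecewise-linear target: the regime switches with the sign of X[:, 0].
y = np.where(X[:, 0] > 0, 5 + 2 * X[:, 1], -X[:, 1]) + rng.normal(0, .1, 500)

tree = DecisionTreeRegressor(max_depth=1).fit(X, y)      # partition step
leaf_of = tree.apply(X)
models = {leaf: LinearRegression().fit(X[leaf_of == leaf], y[leaf_of == leaf])
          for leaf in np.unique(leaf_of)}                # linear model per leaf

x_new = np.array([[1.0, 2.0]])
print(models[tree.apply(x_new)[0]].predict(x_new))       # ~ [9.0]
```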
41. Boosting Discrimination Information Based Document Clustering Using Consensus and Classification
- Author
-
Muhammad Rafique, Ahmad Muqeem Sheri, Moongu Jeon, Khurum Nazir Junejo, and Malik Tahir Hassan
- Subjects
Boosting (machine learning) ,Term Discrimination ,General Computer Science ,Computer science ,02 engineering and technology ,Machine learning ,computer.software_genre ,document clustering ,020204 information systems ,discrimination information ,Consensus clustering ,0202 electrical engineering, electronic engineering, information engineering ,General Materials Science ,Electrical and Electronic Engineering ,Cluster analysis ,business.industry ,General Engineering ,Document clustering ,knowledge reuse ,mining methods and algorithms ,evidence combination ,ComputingMethodologies_PATTERNRECOGNITION ,020201 artificial intelligence & image processing ,lcsh:Electrical engineering. Electronics. Nuclear engineering ,Information measure ,Artificial intelligence ,business ,lcsh:TK1-9971 ,computer - Abstract
An adequate choice of term discrimination information measure (DIM) is critical to reliable document clustering. Making the right choice is empirical in nature, and the characteristics of the data in the documents help experts speculate about a viable solution. Thus, no single DIM is consistently best for clustering, which calls for intelligent selection of the information measure. In this work, we propose an automated consensus-building method based on a text classifier. Two distinct DIMs construct basic partitions of the documents and form base clusters. The consensus-building method uses the cluster information to find concordant documents, which constitute a dataset for training the text classifier. The classifier then predicts labels for the documents that were discordant in the earlier clustering stage and forms new clusters. Experiments are performed on eight standard data sets to test the efficacy of the proposed technique. The improvement observed by applying the proposed consensus clustering demonstrates its superiority over the individual results. Relative Risk (RR) and Measurement of Discrimination Information (MDI) are the two discrimination information measures used to obtain the base clustering solutions in our experiments. (A sketch of the consensus step follows this record.)
- Published
- 2019
- Full Text
- View/download PDF
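A hedged sketch of the consensus step described above, taking the two base clusterings as given (the RR and MDI measures are not reproduced, and the choice of naive Bayes is arbitrary): align the clusterings with the Hungarian algorithm, train on the concordant documents, and let the classifier relabel the discordant ones:

```python
# Consensus step sketch: align clustering B to clustering A, treat
# agreeing documents as training data, relabel the disagreeing ones.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.naive_bayes import MultinomialNB

def consensus(X, labels_a, labels_b, k):
    # Map clusters of B onto clusters of A by maximizing co-occurrence.
    cost = -np.array([[np.sum((labels_a == i) & (labels_b == j))
                       for j in range(k)] for i in range(k)])
    _, mapping = linear_sum_assignment(cost)
    b_aligned = np.array([np.where(mapping == l)[0][0] for l in labels_b])
    concordant = labels_a == b_aligned
    clf = MultinomialNB().fit(X[concordant], labels_a[concordant])
    final = labels_a.copy()
    final[~concordant] = clf.predict(X[~concordant])   # relabel discordant
    return final

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(12, 20))     # toy term-count matrix
a = np.array([0] * 6 + [1] * 6)           # base clustering no. 1
b = np.array([1] * 5 + [0] * 7)           # base clustering no. 2
print(consensus(X, a, b, k=2))
```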
43. Mining frequent itemsets in data streams within a time horizon.
- Author
-
Troiano, Luigi and Scibelli, Giacomo
- Subjects
- *
DATA mining , *TIME perspective , *ALGORITHMS , *SMOOTHNESS of functions , *SET theory , *APPLICATION software - Abstract
In this paper, we present an algorithm for mining frequent itemsets in a stream of transactions within a limited time horizon. In contrast to other approaches presented in the literature, the proposed algorithm makes use of a test window that can discard non-frequent itemsets from a set of candidates. The efficiency of this approach relies on the property that the higher the support threshold is, the smaller the test window is. In addition to considering a sharp horizon, we consider a smooth window. Indeed, in many applications of practical interest, not all of the time slots have the same relevance; e.g., more recent slots can be more interesting than older slots. Smoothness can be determined in both qualitative and quantitative terms. A comparison to other algorithms is conducted. The experimental results show that the proposed solution is faster than other approaches but has a slightly higher cost in terms of memory. [Copyright Elsevier] (A minimal sliding-horizon counter follows this record.)
- Published
- 2014
- Full Text
- View/download PDF
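A minimal sliding-horizon support counter helps fix the setting, although it implements neither the paper's test-window pruning nor the smooth window: the last `horizon` transactions sit in a deque, and itemsets whose windowed support reaches the threshold are reported after each arrival:

```python
# Minimal sliding-horizon counter (no test-window pruning, no smoothing):
# enumerate small itemsets per transaction, keep counts over the last
# `horizon` transactions, and report those meeting the support threshold.
from collections import Counter, deque
from itertools import combinations

def frequent_in_horizon(stream, horizon, minsup, max_len=2):
    window, counts = deque(), Counter()
    for t in stream:
        itemsets = {frozenset(c) for n in range(1, max_len + 1)
                    for c in combinations(sorted(t), n)}
        window.append(itemsets)
        counts.update(itemsets)
        if len(window) > horizon:              # slide: forget oldest slot
            counts.subtract(window.popleft())
        yield {s for s, c in counts.items() if c >= minsup}

stream = [{"a", "b"}, {"a", "c"}, {"a", "b"}, {"b", "c"}]
for frequent in frequent_in_horizon(stream, horizon=3, minsup=2):
    print(frequent)
```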
44. Following the entire solution path of sparse principal component analysis by coordinate-pairwise algorithm.
- Author
-
Meng, Deyu, Cui, Hengbin, Xu, Zongben, and Jing, Kaili
- Subjects
- *
PATHS & cycles in graph theory , *PRINCIPAL components analysis , *ALGORITHMS , *ITERATIVE methods (Mathematics) , *PROBLEM solving , *COMPUTER simulation - Abstract
In this paper we derive an algorithm to follow the entire solution path of the sparse principal component analysis (PCA) problem. The core idea is to iteratively identify the pair of variables along which the objective function of the sparse PCA model can be increased the most, and then incrementally update the coefficients of the two selected variables by a small stepsize. The main strength of the new algorithm is its capability to provide a computational shortcut to the entire spectrum of solutions of the sparse PCA problem, which is always beneficial in real applications. The proposed algorithm is simple and easy to implement. The effectiveness of our algorithm is empirically verified by a series of experiments conducted on synthetic and real problems, in comparison with other typical sparse PCA methods. [Copyright Elsevier] (An illustrative pairwise-update sketch follows this record.)
- Published
- 2013
- Full Text
- View/download PDF
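The coordinate-pairwise flavor can be illustrated on the dense (non-sparse) problem of maximizing x^T S x on the unit sphere: a small plane rotation in a chosen pair of coordinates keeps the norm fixed, and the pair with the largest instantaneous gain is updated at each step. This is only an illustration, not the authors' path-following algorithm, and it omits the sparsity path entirely:

```python
# Coordinate-pairwise ascent on f(x) = x.T @ S @ x with ||x|| = 1, via
# small Givens-style plane rotations (norm-preserving by construction).
import numpy as np

def pairwise_step(S, x, eta=0.02):
    g = S @ x
    # d/da of f along a rotation in plane (i, j) is 2*(g[j]*x[i] - g[i]*x[j]).
    gain = np.abs(np.outer(x, g) - np.outer(g, x))
    i, j = np.unravel_index(np.argmax(gain), gain.shape)
    a = eta * np.sign(g[j] * x[i] - g[i] * x[j])
    xi, xj = x[i], x[j]
    x[i] = xi * np.cos(a) - xj * np.sin(a)
    x[j] = xi * np.sin(a) + xj * np.cos(a)
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 8))
S = A.T @ A                                  # sample scatter matrix
x = np.ones(8) / np.sqrt(8)
for _ in range(2000):
    x = pairwise_step(S, x)
print(x @ S @ x, np.linalg.eigvalsh(S)[-1])  # close to the top eigenvalue
```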
45. Efficient Cluster Labeling for Support Vector Clustering.
- Author
-
D'Orangeville, V., Mayers, M. Andre, Monga, M. Ernest, and Wang, M. Shengrui
- Subjects
- *
SUPPORT vector machines , *ALGORITHMS , *PROBLEM solving , *INTEGRATED circuit interconnections , *KERNEL operating systems , *DATA mining - Abstract
We propose a new efficient algorithm for solving the cluster labeling problem in support vector clustering (SVC). The proposed algorithm analyzes the topology of the function describing the SVC cluster contours and explores interconnection paths between critical points separating distinct cluster contours. This process makes it possible to distinguish disjoint clusters and associate each point with its respective cluster. The proposed algorithm implements a new fast method for detecting and classifying critical points while analyzing the interconnection patterns between them. Experiments indicate that the proposed algorithm significantly improves the accuracy of the SVC labeling process in the presence of clusters of complex shape, while reducing the processing time required by existing SVC labeling algorithms by orders of magnitude. [ABSTRACT FROM AUTHOR] (A sketch of the classic labeling baseline follows this record.)
- Published
- 2013
- Full Text
- View/download PDF
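For contrast, the classic labeling baseline that such algorithms improve upon can be sketched directly. Here the trained SVC support function is approximated by a scikit-learn OneClassSVM (not a true SVDD), and two points are connected when every sampled point on the segment between them stays inside the learned region:

```python
# Classic SVC labeling baseline sketch: connect two points when every
# sampled point on the segment between them stays inside the support
# region (decision_function >= 0); clusters = connected components.
import numpy as np
from sklearn.svm import OneClassSVM
from scipy.sparse.csgraph import connected_components

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, .3, (30, 2)), rng.normal(2, .3, (30, 2))])
svm = OneClassSVM(gamma=1.0, nu=0.05).fit(X)

n, steps = len(X), 10
adj = np.zeros((n, n), dtype=bool)
for i in range(n):
    for j in range(i + 1, n):
        seg = np.linspace(X[i], X[j], steps)          # sample the segment
        adj[i, j] = adj[j, i] = bool((svm.decision_function(seg) >= 0).all())
n_comp, labels = connected_components(adj, directed=False)
print(n_comp)   # 2 bulk clusters, plus any boundary outliers as singletons
```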
46. Sentiment Word Relations with Affect, Judgment, and Appreciation.
- Author
-
Neviarouskaya, Alena and Aono, Masaki
- Abstract
In this work, we propose a method for automatic analysis of attitude (affect, judgment, and appreciation) in sentiment words. The first stage of the proposed method is an automatic separation of unambiguous affective and judgmental adjectives from those that express appreciation or different attitudes depending on context. In our experiments with machine learning algorithms, we employed three feature sets based on Pointwise Mutual Information, word-pattern co-occurrence, and minimal path length. The next stage of the proposed method is to estimate the potentials of miscellaneous adjectives to convey affect, judgment, and appreciation. Based on the sentences automatically collected for each adjective, the algorithm analyses the context of phrases that contain sentiment words by considering morphological tags, high-level concepts, and named entities, and then makes decisions about contextual attitude labels. Finally, the appraisal potentials of a word are calculated based on the number of sentences related to each type of attitude. Our two-stage method was evaluated on two data sets, and promising results were obtained. The performance of our method was also compared with a method from previous work. [ABSTRACT FROM PUBLISHER] (A PMI sketch follows this record.)
- Published
- 2013
- Full Text
- View/download PDF
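The first of the three feature sets named above is easy to pin down: Pointwise Mutual Information between a target adjective and an attitude seed word. A sketch with toy co-occurrence counts standing in for real corpus statistics:

```python
# PMI sketch: log2 of how much more often the adjective and the seed
# word co-occur than independence would predict, from window counts.
import math

def pmi(count_xy, count_x, count_y, n_windows):
    p_xy = count_xy / n_windows
    p_x, p_y = count_x / n_windows, count_y / n_windows
    return math.log(p_xy / (p_x * p_y), 2) if p_xy > 0 else float("-inf")

# e.g. "happy" vs the affect seed "joy" in a toy corpus of 10,000 windows
print(pmi(count_xy=40, count_x=200, count_y=300, n_windows=10_000))  # ~2.7
```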
47. On detection of emerging anomalous traffic patterns using GPS data.
- Author
-
Pang, Linsey Xiaolin, Chawla, Sanjay, Liu, Wei, and Zheng, Yu
- Subjects
- *
GLOBAL Positioning System , *TRAFFIC flow , *DATA mining , *REMOTE sensing , *TRANSPORTATION , *METROPOLITAN areas - Abstract
The increasing availability of large-scale trajectory data provides a great opportunity to explore it for knowledge discovery in transportation systems using advanced data mining techniques. Nowadays, a large number of taxicabs in major metropolitan cities are equipped with GPS devices. Since taxis are on the road nearly 24 hours a day (with drivers changing shifts), they can act as reliable sensors for monitoring the behavior of traffic. In this article, we use GPS data from taxis to monitor the emergence of unexpected behavior in the Beijing metropolitan area, which has the potential to estimate and improve traffic conditions in advance. We adapt the likelihood ratio test (LRT) statistic, which has previously been used mostly in epidemiological studies, to describe traffic patterns. To the best of our knowledge, the use of the LRT in the traffic domain is not only novel but also results in accurate and rapid detection of anomalous behavior. [Copyright Elsevier] (A scan-statistic sketch follows this record.)
- Published
- 2013
- Full Text
- View/download PDF
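The flavor of the LRT the authors adapt can be sketched with the Kulldorff-style Poisson log-likelihood ratio used in epidemiological scan statistics. The counts below are invented: c is the observed number of events in a candidate region, e its expected count under the null baseline, and C the total observed count:

```python
# Kulldorff-style Poisson log-likelihood ratio sketch: scores how
# surprising an excess of events in a region is relative to its baseline.
import math

def poisson_llr(c, e, C):
    if c <= e:                      # only score unexpected excesses
        return 0.0
    return c * math.log(c / e) + (C - c) * math.log((C - c) / (C - e))

print(poisson_llr(c=120, e=80, C=5000))   # larger => more anomalous region
```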
48. Pattern-growth based frequent serial episode discovery.
- Author
-
Achar, Avinash, A, Ibrahim, and Sastry, P.S.
- Subjects
- *
PATTERN recognition systems , *SEQUENTIAL analysis , *TELECOMMUNICATION , *ALGORITHMS , *COMPUTER software correctness , *FAULT location (Engineering) - Abstract
Frequent episode discovery is a popular framework for pattern discovery from sequential data. It has found many applications in domains such as alarm management in telecommunication networks, fault analysis in manufacturing plants, and predicting user behavior in web click streams. In this paper, we address the discovery of serial episodes. In the episode context, there have been multiple ways to quantify the frequency of an episode. Most of the current algorithms for episode discovery under the various frequencies are apriori-based level-wise methods. These methods essentially perform a breadth-first search of the pattern space. However, there are currently no depth-first methods of pattern discovery in the frequent episode framework under many of the frequency definitions. In this paper, we try to bridge this gap. We provide new depth-first algorithms for serial episode discovery under the non-overlapped and total frequencies. Under the non-overlapped frequency, we present algorithms that can take care of span and gap constraints on episode occurrences. Under the total frequency, we present an algorithm that can handle a span constraint. We provide proofs of correctness for the proposed algorithms and demonstrate their effectiveness through extensive simulations. We also give detailed run-time comparisons with the existing apriori-based methods and illustrate scenarios under which the proposed pattern-growth algorithms perform better than their apriori counterparts. [Copyright Elsevier] (A minimal non-overlapped counter follows this record.)
- Published
- 2013
- Full Text
- View/download PDF
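The non-overlapped frequency mentioned above has a particularly simple counter for a single serial episode: scan the event sequence with a greedy leftmost automaton and reset after each complete match, so counted occurrences never share events. This illustrates the frequency definition only, not the paper's pattern-growth discovery:

```python
# Non-overlapped frequency of a serial episode (e.g. A -> B -> C):
# greedy leftmost matching; each full match increments the count and
# resets the automaton, so no two counted occurrences share an event.
def non_overlapped_frequency(events, episode):
    state, count = 0, 0
    for e in events:
        if e == episode[state]:
            state += 1
            if state == len(episode):   # full serial occurrence found
                count += 1
                state = 0               # reset => non-overlapped matches
    return count

print(non_overlapped_frequency("ABXACBYCABC", "ABC"))   # -> 2
```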
49. Text Categorization of Biomedical Data Sets Using Graph Kernels and a Controlled Vocabulary.
- Author
-
Bleik, Said, Mishra, Meenakshi, Huan, Jun, and Song, Min
- Abstract
Recently, graph representations of text have shown improved performance over conventional bag-of-words representations in text categorization applications. In this paper, we present a graph-based representation for biomedical articles and use graph kernels to classify those articles into high-level categories. In our representation, common biomedical concepts and semantic relationships are identified with the help of an existing ontology and are used to build a rich graph structure that provides a consistent feature set and preserves additional semantic information that could improve a classifier's performance. We attempt to classify the graphs using both a set-based graph kernel, which is capable of dealing with the disconnected nature of the graphs, and a simple linear kernel. Finally, we report results comparing the classification performance of the kernel classifiers to that of common text-based classifiers. [ABSTRACT FROM PUBLISHER] (A toy set-kernel sketch follows this record.)
- Published
- 2013
- Full Text
- View/download PDF
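A crude stand-in for the set-based kernel idea: treat each article graph as its set of ontology concept nodes and use a normalized intersection kernel, fed to an SVM as a precomputed Gram matrix. The toy concept sets and labels below are invented, and the paper's kernel also exploits the semantic relations between concepts:

```python
# Toy set-based kernel: K(G1, G2) = |V1 & V2|, cosine-normalized. The
# intersection kernel is PSD (a linear kernel on indicator vectors),
# so it can be passed to SVC as a precomputed Gram matrix.
import numpy as np
from sklearn.svm import SVC

graphs = [{"gene", "protein", "binding"},
          {"gene", "mutation"},
          {"therapy", "drug", "protein"},
          {"therapy", "drug"}]
y = np.array([0, 0, 1, 1])

def gram(gs):
    K = np.array([[len(a & b) for b in gs] for a in gs], dtype=float)
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)            # cosine-style normalization

clf = SVC(kernel="precomputed").fit(gram(graphs), y)
print(clf.predict(gram(graphs)))         # expected: [0 0 1 1] on this toy data
```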
50. Grammar-based multi-objective algorithms for mining association rules.
- Author
-
Luna, J.M., Romero, J.R., and Ventura, S.
- Subjects
- *
DATA mining , *APPLICATION software , *DATA quality , *GENETIC programming , *EVOLUTIONARY algorithms , *SOFTWARE reliability - Abstract
In association rule mining, the process of extracting relations from a dataset often requires the application of more than one quality measure, and, in many cases, such measures involve conflicting objectives. In such a situation, it is more appropriate to seek the optimal trade-off between measures. This paper deals with the association rule mining problem from a multi-objective perspective by proposing grammar guided genetic programming (G3P) models that enable the extraction of both numerical and nominal association rules in a single step. The strength of G3P is its ability to restrict the search space and build rules conforming to a given context-free grammar. Thus, the proposals presented in this paper combine the advantages of G3P models with those of multi-objective approaches. Both approaches follow the philosophy of two well-known multi-objective algorithms: the Non-dominated Sorting Genetic Algorithm (NSGA-II) and the Strength Pareto Evolutionary Algorithm (SPEA2). In the experimental stage, we compare both multi-objective algorithms to a single-objective G3P proposal for mining association rules and analyze the mined rules. The results obtained show that the multi-objective proposals obtain very frequent (with support values above 95% in most cases) and reliable (with confidence values close to 100%) rules when attaining the optimal trade-off between support and confidence. Furthermore, for the trade-off between support and lift, the multi-objective proposals also produce very interesting and representative rules. [Copyright Elsevier] (A Pareto-filtering sketch follows this record.)
- Published
- 2013
- Full Text
- View/download PDF
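The multi-objective core, independent of the G3P rule generation, is the non-dominated selection that NSGA-II and SPEA2 build on: keep the candidate rules that no other rule dominates in the support/confidence plane. A sketch with invented rules:

```python
# Pareto-front filtering over (support, confidence): a rule survives
# unless some other rule is at least as good on both measures and
# strictly better on one.
def pareto_front(rules):
    """rules: list of (name, support, confidence) triples."""
    def dominated(r, s):
        return (s[1] >= r[1] and s[2] >= r[2]) and (s[1] > r[1] or s[2] > r[2])
    return [r for r in rules if not any(dominated(r, s) for s in rules)]

rules = [("r1", 0.96, 0.99), ("r2", 0.90, 0.97),
         ("r3", 0.98, 0.95), ("r4", 0.97, 0.99)]
print(pareto_front(rules))      # r4 dominates r1 and r2; front is r3 and r4
```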