1. Optimal selection of benchmarking datasets for unbiased machine learning algorithm evaluation.
- Author
- Pereira, João Luiz Junho, Smith-Miles, Kate, Muñoz, Mario Andrés, and Lorena, Ana Carolina
- Subjects
- MACHINE learning, SUPERVISED learning, METAHEURISTIC algorithms, CLASSIFICATION algorithms, ALGORITHMS
- Abstract
- Whenever a new supervised machine learning (ML) algorithm or solution is developed, it is imperative to evaluate the predictive performance it attains on diverse datasets. This is done to stress-test the strengths and weaknesses of the new algorithm and to provide evidence for the situations in which it is most useful. A common practice is to gather datasets from public benchmark repositories for such an evaluation, but few or no specific criteria guide the selection of these datasets, which is often ad hoc. This paper investigates the importance of assembling a diverse benchmark of datasets in order to properly evaluate ML models and understand their capabilities. Leveraging meta-learning studies that evaluate the diversity of public dataset repositories, it introduces an optimization method for choosing varied classification and regression datasets from a pool of candidates. The method is based on maximum coverage, circular packing, and the meta-heuristic Lichtenberg Algorithm, ensuring that diverse datasets capable of challenging the ML algorithms more broadly are chosen. The selections were compared experimentally with a random selection of datasets and with k-medoids clustering, and proved more effective with respect to the diversity of the chosen benchmarks and their ability to challenge the ML algorithms at different levels. [ABSTRACT FROM AUTHOR]
- Published
- 2024
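The selection strategy summarized in the abstract lends itself to a small illustration. The sketch below is not the paper's method (which relies on circular packing and the Lichtenberg Algorithm); it only conveys the general idea of diversity-driven benchmark selection via greedy maximum coverage in a meta-feature space. The meta-feature coordinates, coverage radius, and number of selected datasets are all hypothetical, and the random baseline mirrors the comparison mentioned in the abstract.

```python
# Illustrative sketch only: greedy maximum-coverage selection of benchmark
# datasets in a 2-D meta-feature space. The paper's actual method uses
# circular packing and the Lichtenberg Algorithm; all values here
# (meta-feature coordinates, radius, number of picks) are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical meta-feature coordinates for 200 candidate datasets
# (e.g., two projected hardness/complexity measures).
candidates = rng.uniform(0.0, 1.0, size=(200, 2))

def coverage(selected_idx, points, radius):
    """Fraction of candidate points within `radius` of any selected point."""
    if not selected_idx:
        return 0.0
    sel = points[list(selected_idx)]
    dists = np.linalg.norm(points[:, None, :] - sel[None, :, :], axis=2)
    return float(np.mean(dists.min(axis=1) <= radius))

def greedy_max_coverage(points, k, radius):
    """Greedily pick k datasets that maximize coverage of the meta-feature space."""
    selected = []
    for _ in range(k):
        base = coverage(selected, points, radius)
        best_gain, best_i = -1.0, None
        for i in range(len(points)):
            if i in selected:
                continue
            gain = coverage(selected + [i], points, radius) - base
            if gain > best_gain:
                best_gain, best_i = gain, i
        selected.append(best_i)
    return selected

k, radius = 10, 0.15
chosen = greedy_max_coverage(candidates, k, radius)
random_pick = rng.choice(len(candidates), size=k, replace=False).tolist()

print("greedy coverage:", coverage(chosen, candidates, radius))
print("random coverage:", coverage(random_pick, candidates, radius))
```

On such synthetic data the greedy selection generally covers a larger share of the candidate space than a random pick of the same size, which is the kind of diversity comparison the abstract describes between the proposed method, random selection, and k-medoids clustering.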