476 results
Search Results
2. A prediction model of student performance based on self-attention mechanism.
- Author
- Chen, Yan, Wei, Ganglin, Liu, Jiaxin, Chen, Yunwei, Zheng, Qinghua, Tian, Feng, Zhu, Haiping, Wang, Qianying, and Wu, Yaqiang
- Subjects
- PREDICTION models, DATA mining, PSYCHOLOGY of students, GRADE point average
- Abstract
Performance prediction is an important research facet of educational data mining. Most models extract student behavior features from campus card data for prediction. However, most of these methods suffer from coarse time granularity, difficulty in extracting useful high-order behavior combination features, dependence on historical achievements, etc. To solve these problems, this paper takes prediction of grade point average (GPA prediction) and of whether a specific student has failing subjects (failing prediction) in a term as the goals of performance prediction, and proposes a comprehensive performance prediction model of college students based on behavior features. First, a method for representing campus card data based on behavior flow is introduced to retain higher time accuracy. Second, a method for extracting student behavior features based on a multi-head self-attention mechanism is proposed to automatically select the more important high-order behavior combination features. Finally, a performance prediction model based on differences in student behavior feature modes is proposed to improve the model's prediction accuracy and increase its robustness for students with significant changes in performance. The performance of the model is verified on actual data collected by the teaching monitoring big data platform of Xi'an Jiaotong University. The results show that the model's prediction performance is better than the comparison algorithms on both failing prediction and GPA prediction. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
3. Entity linking for English and other languages: a survey.
- Author
- Guellil, Imane, Garcia-Dominguez, Antonio, Lewis, Peter R., Hussain, Shakeel, and Smith, Geoffrey
- Subjects
- ENGLISH language, MACHINE translating, DOMINANT language, SENTIMENT analysis, DATA mining
- Abstract
Extracting named entities from text forms the basis for many crucial tasks such as information retrieval and extraction, machine translation, opinion mining, sentiment analysis and question answering. This paper presents a survey of the research literature on named entity linking, including named entity recognition and disambiguation. We cover 200 works, focusing on 43 papers (5 surveys and 38 research works). We also describe and classify 56 resources, including 25 tools and 31 corpora. We focus on the most recent papers: more than 95% of the described research works date from after 2015. To show the efficiency of our construction methodology and the importance of this state of the art, we compare it to other surveys in the research literature, which were based on different criteria (such as domain, novelty, and the models and resources presented). We also present a set of open issues related to entity linking (including the dominance of the English language in the proposed studies and the frequent use of NER alone rather than end-to-end systems combining NED and EL), based on the research questions that this survey aims to answer. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
4. Seq2EG: a novel and effective event graph parsing approach for event extraction.
- Author
- Sun, Haotong, Zhou, Junsheng, Kong, Li, Gu, Yanhui, and Qu, Weiguang
- Subjects
- DATA mining, ARGUMENT, HOUGH transforms
- Abstract
Event extraction is a fundamental task in information extraction. Most previous approaches typically transform event extraction into two subtasks, trigger classification and argument classification, and solve them via classification-based methods, which suffer from some inherent drawbacks. To overcome these issues, in this paper we propose a novel event extraction model, Seq2EG, which first formulates event extraction as an event graph parsing problem and then exploits a pre-trained sequence-to-sequence (seq2seq) model to transduce an input sentence into an accurate event graph without the need for trigger words. Based on the generative event graph parsing formulation, our model Seq2EG can explicitly model multiple event correlations and argument sharing, and can naturally incorporate graph-structured features and the rich semantic information conveyed by the labels of event types and argument roles. Extensive experimental results on the public ACE2005 dataset show that our approach outperforms all previous state-of-the-art models for event extraction by a large margin, obtaining improvements of 3.4% F1 score for event detection and 4.7% F1 score for argument classification, respectively, over the best baselines. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
5. Special issue on selected papers from IEEE DMF 2008.
- Author
- Chan, Keith C. C. and Wu, Xindong
- Subjects
- ALGORITHMS, DATA mining
- Abstract
The article discusses various papers published within the issue, including one on a local-distributed privacy-preserving feature selection algorithm, one on optimization-based classification models, and another on a data mining problem.
- Published
- 2010
- Full Text
- View/download PDF
6. Feature extraction for chart pattern classification in financial time series.
- Author
- Zheng, Yuechu, Si, Yain-Whar, and Wong, Raymond
- Subjects
- FEATURE extraction, DATA mining, PATTERN recognition systems, PATTERN matching, ELECTRONIC data processing, CLASSIFICATION, TIME series analysis
- Abstract
Extracting shape-related features from a given query subsequence is a crucial preprocessing step for chart pattern matching in rule-based, template-based and hybrid pattern classification methods. The extracted features can significantly influence the accuracy of pattern recognition tasks during the data mining process. Although shape-related features are widely used for chart pattern matching in financial time series, the intrinsic properties of these features and their relationships to the patterns have rarely been investigated in the research community. This paper aims to formally identify the shape-related features used in chart patterns and investigates their impact on chart pattern classification in financial time series. In this paper, we describe a comprehensive analysis of 14 shape-related features which can be used to classify 41 known chart patterns in the technical analysis domain. In order to evaluate their effectiveness, the shape-related features are then translated into rules for chart pattern classification. We perform extensive experiments on real datasets containing historical price data of 24 stocks/indices to analyze the effectiveness of the rules. Experimental results reveal that the features put forward in this paper can be effectively used for recognizing chart patterns in financial time series. Our analysis also reveals that high-level features can be hierarchically composed from low-level features. Hierarchical composition allows construction of complex chart patterns from the features identified in this paper. We hope that the features identified in this paper can serve as a reference model for future research in chart pattern analysis. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
7. Review on novelty detection in the non-stationary environment.
- Author
- Agrahari, Supriya, Srivastava, Sakshi, and Singh, Anil Kumar
- Subjects
- MACHINE learning, DATA distribution, DATA mining
- Abstract
Novelty detection and concept drift detection are essential for a plethora of machine learning applications. In the streaming environment, the statistical properties of application-generated data change over time, a phenomenon known as concept drift. These changes have a profound influence on the learning model's performance. Along with concept drift, the emergence of new classes (i.e., novel class/novelty detection) is also challenging in a non-stationary distribution of data. Novel class detection determines whether incoming data points of a data stream are unknown or unusual. The paper presents a survey focusing on the challenges encountered while dealing with real-time data. In addition, a chronological discussion of the various existing novelty detectors, with their advantages, limitations, critical points, different research prospects, and future directions, is also incorporated in the paper. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
8. A residual utility-based concept for high-utility itemset mining.
- Author
- Sra, Pushp and Chand, Satish
- Subjects
- DATA mining, WEB analytics, DATA structures, DATABASES, UTILITY functions
- Abstract
Knowledge discovery in databases aims at finding useful information for decision-making. The problem of high-utility itemset mining (HUIM) has garnered particularly huge research attention, as it aims to find relevant information on patterns in a database which conform to a user-defined utility function. The mined patterns are used for making data-backed decisions in fields such as healthcare, e-commerce, and web analytics. Various algorithms exist in the literature for mining high-utility items from databases; however, most of them require multiple database scans or deploy complex data structures. The utility-list is an efficient list-based data structure that is widely adopted in the design of HUIM algorithms. The existing utility-list-based algorithms, however, suffer from drawbacks such as extensive use of inefficient join operations and multiple definitions of join operations. Though HUIM is an important research area, very little research has been directed towards improving the design of the data structures used for the mining process. In this paper, we introduce the concept of residual utility to design two new data structures, called residue-map and master-map. Using these two data structures, a new algorithm, called R-Miner, is introduced for mining high-utility items. In order to further optimise the mining process, the cumulative utility value is used as an upper bound, and additional pruning conditions are also discussed. Several experiments are carried out on both real and synthetic datasets to compare the performance of R-Miner with existing list-based algorithms. The experimental results show that R-Miner improves performance by up to an order of 2 as compared to the list-based algorithms EFIM, H-Miner, HUI-Miner, FHM, and ULB-Miner. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
9. Knowledge-based discovery of multi-level co-location patterns using ontology.
- Author
- Wang, Long, Chang, Liang, Bao, Xuguang, Zhu, Chuangying, and Gu, Tianlong
- Subjects
- DATA mining, ONTOLOGY, MINERS, ALGORITHMS
- Abstract
Spatial co-location pattern discovery (SCPD), a kind of knowledge discovery process, aims at discovering potentially unknown co-location patterns (co-locations). Co-locations have been widely used in many areas, including life services, the ecological environment, and business research. Many methods have been proposed to discover co-locations. However, these methods discover only co-locations consisting of fine-grained spatial features; since user knowledge is ignored, many interesting and general patterns remain undiscovered. Meanwhile, the co-locations discovered by current frameworks are numerous and independent, so their usefulness is strongly limited. To overcome these shortcomings, this paper introduces user knowledge into the process of SCPD, to discover general and intrinsic co-locations and help users quickly find the patterns they are interested in. First, a framework OCPM (Co-location Pattern Miner using Ontology) is proposed, where an ontology is employed to integrate user knowledge to guide the process of SCPD. Second, a new kind of co-location consisting of ontology concepts is proposed: under the guidance of the ontology, we propose prevalent semantic multi-level co-locations (PSMCs) consisting of ontology concepts to represent richer knowledge. Third, we design two different ways, i.e., Apriori-like and clique-based, to meet the requirements of OCPM, and propose a novel clique-based algorithm named IDG to discover PSMCs. Meanwhile, a top-down search strategy is proposed to help users quickly find interesting knowledge via the ontology. Finally, we validate OCPM and IDG on both real and synthetic datasets; the experimental results demonstrate their effectiveness. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
10. MFS-SubSC: an efficient algorithm for mining frequent sequences with sub-sequence constraint.
- Author
- Duong, Hai and Tran, Anh
- Subjects
- CONSTRAINT satisfaction, DATABASES, DATA mining, SCALABILITY, ALGORITHMS
- Abstract
Mining frequent sequences (FSs) with constraints in a sequence database (SDB) is a critical task in data mining, as it forms the basis for discovering meaningful patterns within sequential data. However, traditional algorithms tackling the direct mining of constrained FSs from the SDB often exhibit poor performance, especially when dealing with large SDBs and low support thresholds. Moreover, constraint-based sequence mining algorithms face additional challenges, such as increased runtime and memory usage, particularly when constraints change frequently. To address these issues, this paper introduces an efficient method for generating FSs that include a user-defined sub-sequence. Specifically, the discovered FSs must be super-sequences of the given sub-sequence. Rather than directly discovering these sequences from the SDB in the traditional manner, the proposed method quickly generates constrained FSs from frequent closed sequences (FCSs) and frequent generator sequences (FGSs). This process involves categorizing constrained FSs into equivalence classes, each represented by FCSs and FGSs. An efficient method is then adapted to swiftly generate the constrained FSs within each class based on the representative elements, which are FCSs and FGSs. Additionally, a novel technique called the Constraint Satisfaction Technique (CST) is introduced to circumvent computationally expensive checks of the inclusion relation among sequences during the generation process. Furthermore, a novel algorithm named MFS-SubSC is developed based on the proposed theoretical results to generate all constrained FSs efficiently. Experimental results demonstrate that the proposed algorithm surpasses state-of-the-art methods in terms of runtime, memory usage, and scalability. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
11. Methods for concept analysis and multi-relational data mining: a systematic literature review.
- Author
- Leutwyler, Nicolás, Lezoche, Mario, Franciosi, Chiara, Panetto, Hervé, Teste, Laurent, and Torres, Diego
- Subjects
- KNOWLEDGE graphs, EVIDENCE gaps, DATA mining, INTERNET of things, DATA analysis, RELATIONAL databases
- Abstract
The massive adoption of the Internet of Things in many industrial areas, in addition to the requirements of modern services, is posing huge challenges to the field of data mining. Moreover, the semantic interoperability of systems and enterprises requires operating across many different formats, such as ontologies, knowledge graphs, or relational databases, as well as different contexts, such as static, dynamic, or real time. Consequently, supporting this semantic interoperability requires a wide range of knowledge discovery methods with different capabilities suited to the context of distributed architectures (DA). However, to the best of our knowledge there has been no recent general review of the state of the art of Concept Analysis (CA) and multi-relational data mining (MRDM) methods regarding knowledge discovery in DA considering semantic interoperability. In this work, a systematic literature review on CA and MRDM is conducted, providing a discussion of the characteristics they have according to the papers reviewed, supported by a clustering technique based on association rules. Moreover, the review allowed the identification of three research gaps toward a more scalable set of methods in the context of DA and heterogeneous sources. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
12. Mining frequent generators and closures in data streams with FGC-Stream.
- Author
- Martin, Tomas, Valtchev, Petko, and Roux, Louis-Romain
- Subjects
- STEPFAMILIES, DATA mining, MINERS, CONFERENCES & conventions
- Abstract
Mining frequent itemsets (FIs) from data streams is a challenging task due to the limited resources available relative to the typically large size of the result and the need for frequent recalculations due to data evolution. Therefore, the mining of condensed representations, e.g. frequent closures (FCIs) or generators (FGIs), instead of plain FIs, has been explored. So far the tasks of mining FGIs and FCIs have only been addressed separately over data streams. Yet, both itemset families combine in the solutions of a range of practical problems, while they also underlie the definition of handy association rule bases. To date, the joint mining task can only be approached by combining two dedicated miners. As a remedy, we propose a holistic approach rooted in the support-set-based equivalence classes underlying a transaction dataset: the ensuing FGC-Stream miner exploits mathematical results about those classes' evolution to efficiently update both FCIs and FGIs. Thus, targeting a sliding window mode (where the window over a stream expands and shrinks), we enhance results from formal concept analysis to design an efficient expansion procedure. On window shrinking, we exploit thoroughly new results about class evolution. Overall, FGC-Stream achieves significant effort factoring through the collaborative maintenance of FCIs and FGIs. As a result, when confronted experimentally, it managed to largely outperform its unique FGI mining competitor while keeping up with two of the most efficient FCI miners. This outcome confirms that FGC-Stream will dominate any combination of miners for the joint task. This article is an extended version of our paper [27] presented at the 21st International Conference on Data Mining. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
13. CustRE: a rule based system for family relations extraction from english text.
- Author
- Mumtaz, Raabia and Qadir, Muhammad Abdul
- Subjects
- FAMILY relations, NATURAL language processing, KNOWLEDGE graphs, DATA mining
- Abstract
Relation extraction is an important information extraction task that must be solved in order to transform data into a Knowledge Graph (KG), as semantic relations between entities form the edges of the graph. Although much effort has been devoted to this task over the last three decades, the results achieved are still not satisfactory. For instance, the winner of the Text Analysis Conference's (TAC) Knowledge Base Population (KBP) 2015 slot filling task, the Stanford system, achieves an F1 score of 60.5% on the standard Relation Extraction (RE) dataset (Zhang et al., Position-aware attention and supervised data improve slot filling, in: EMNLP 2017 - Conference on Empirical Methods in Natural Language Processing, Proceedings, 2017. https://doi.org/10.18653/v1/d17-1004). The RE task therefore needs better solutions. This paper presents our system, CustRE, for better identification and classification of family relations in English text. CustRE is a rule-based system that uses regular expressions for pattern matching to extract family relations explicitly mentioned in text, and uses co-reference and propagation rules to extract family relations implicitly implied by the text. The proposed system, its implementation, and the results obtained are presented in this paper. The results show that our approach greatly improves over existing methods, achieving F1 scores of 79.7% and 76.6% on the TACRED family relations and CustFRE datasets respectively, which are 6.3 and 18.5 points higher than LUKE, the best previously reported score on TACRED. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
14. Recent advances in document summarization.
- Author
- Yao, Jin-ge, Wan, Xiaojun, and Xiao, Jianguo
- Subjects
- ABSTRACTING, NATURAL language processing, ARTIFICIAL intelligence, TEXT mining, DATA mining
- Abstract
The task of automatic document summarization aims at generating short summaries for originally long documents. A good summary should cover the most important information of the original document or a cluster of documents, while being coherent, non-redundant and grammatically readable. Numerous approaches for automatic summarization have been developed to date. In this paper we give a self-contained, broad overview of recent progress made in document summarization within the last 5 years. Specifically, we emphasize significant contributions made in recent years that represent the state of the art of document summarization, including progress on modern sentence extraction approaches that improve concept coverage, information diversity and content coherence, as well as summarization frameworks that integrate sentence compression, and more abstractive systems that are able to produce completely new sentences. In addition, we review progress made for document summarization in domains, genres and applications that differ from traditional settings. We also point out some of the latest trends and highlight a few possible future directions. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
15. Cost-sensitive learning using logical analysis of data.
- Author
- Osman, Hany
- Subjects
- MACHINE learning, DATA analysis, NAIVE Bayes classification, DATA mining, RECEIVER operating characteristic curves, COST control, FOOD recall, CLOTHES closets
- Abstract
Classification is a common task in data mining that assigns a class label to an unseen situation. It has been widely used in decision making for various applications, and many machine learning algorithms have been developed to accomplish this task. Classification becomes critical when the problem under concern is related to serious situations such as fraud detection, cancer diseases, and quality control. Learning in these situations is characterized by predetermined asymmetric costs of incorrect class prediction, or critical consequences associated with erroneous class prediction. In this paper, a novel approach to cost-sensitive learning is proposed. The approach is constructed by employing the theory of logical analysis of data (LAD) to build accurate cost-sensitive classifiers. Two classifiers are proposed. The first classifier is established by solving a proposed pattern selection model, the minimum misclassification cost model (MMCM), which aims at minimizing the asymmetric misclassification cost. The second classifier is established by solving another proposed pattern selection model, the maximum precision–recall model (MPRM), which maximizes precision and recall, aiming to reach 100% accuracy. A comparative study is conducted using real datasets. The proposed MMCM has enabled LAD to realize up to 32.22% cost reduction from the misclassification cost realized by the traditional implementation of LAD. Moreover, MPRM has provided up to a 19.15% increase in precision and up to a 37% increase in recall. MPRM has also enhanced the performance of LAD when compared to common machine learning algorithms by providing better combinations of recall and false positive rate. This enabled LAD to come closest to the optimal point on the receiver operating characteristic (ROC) diagram when compared with existing machine learning methods. Incorporating the MMCM and MPRM models into LAD establishes a novel implementation of LAD that makes it a promising cost-sensitive learning classifier compared to other machine learning classifiers. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
16. Multi-representation web service recommendation system based on attention mechanism.
- Author
- Dang, Depeng, Guo, Bilin, Fang, Tingting, and Zhang, Ying
- Subjects
- WEB services, RECOMMENDER systems, CHOICE (Psychology), DATA mining, INFORMATION processing
- Abstract
Recently, mashup developers have sought to integrate multiple services with complementary functionalities drawn from a large number of web services. With so many web services available, it is difficult for developers to choose the right ones to develop new mashups. Therefore, it is critical to recommend appropriate web services to mashup developers based on their development needs. In the past, various deep models have been proposed to facilitate web service recommendation based on semantic matching of textual descriptions. However, existing deep approaches mainly match global semantic representations while ignoring descriptive structure and tag information. In this paper, we propose a multi-representation web service recommendation model, which simultaneously extracts global, local and tag representations from the description and tag information. Moreover, we propose a tag-driven attention mechanism to guide the process of information extraction. Experiments on a real-world dataset demonstrate that our proposed service recommendation algorithm achieves remarkable performance. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
17. BPF: a novel cluster boundary points detection method for static and streaming data.
- Author
- Khalique, Vijdan, Kitagawa, Hiroyuki, and Amagasa, Toshiyuki
- Subjects
- OUTLIER detection, K-nearest neighbor classification
- Abstract
Data points situated near a cluster boundary are called boundary points, and they can represent useful information about the process generating the data. Existing methods of boundary point detection cannot differentiate boundary points from outliers, as they are affected by the presence of outliers as well as by the size and density of clusters in the dataset. They also require tuning of one or more parameters and prior knowledge of the number of outliers in the dataset. In this research, a boundary point detection method called BPF is proposed which can effectively differentiate boundary points from outliers and core points. BPF combines the well-known outlier detection method Local Outlier Factor (LOF) with a Gravity value to calculate the BPF score. Our proposed algorithm StaticBPF can detect the top-m boundary points in a given dataset. Importantly, StaticBPF requires tuning of only one parameter, the number of nearest neighbors (k), and can employ the same k used by LOF for outlier detection. This paper also extends BPF to streaming data and proposes StreamBPF. StreamBPF employs a grid structure to improve k-nearest neighbor computation and an incremental method of calculating BPF scores for a subset of data points in a sliding window over data streams. In our evaluation, the accuracy of StaticBPF and the runtime efficiency of StreamBPF are assessed on synthetic and real data, where they generally performed better than their competitors. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
18. MatchSim: a novel similarity measure based on maximum neighborhood matching.
- Author
- Lin, Zhenjiang, Lyu, Michael, and King, Irwin
- Subjects
- DATA mining, SOCIAL networks, INFORMATION retrieval, APPROXIMATION algorithms, DATABASE management
- Abstract
Measuring object similarity in a graph is a fundamental data-mining problem in various application domains, including Web linkage mining, social network analysis, information retrieval, and recommender systems. In this paper, we focus on the neighbor-based approach, which is based on the intuition that 'similar objects have similar neighbors', and propose a novel similarity measure called MatchSim. Our method recursively defines the similarity between two objects as the average similarity of the maximum-matched similar neighbor pairs between them. We show that MatchSim conforms to the basic intuition of similarity; therefore, it can overcome the counterintuitive contradiction in SimRank. Moreover, MatchSim can be viewed as an extension of the traditional neighbor-counting scheme that takes the similarities between neighbors into account, leading to higher flexibility. We present the MatchSim score computation process and prove its convergence. We also analyze its time and space complexity and suggest two accelerating techniques: (1) a simple pruning strategy and (2) an approximation algorithm for the maximum matching computation. Experimental results on real-world datasets show that although our method is computationally less efficient, it outperforms classic methods in terms of accuracy. [ABSTRACT FROM AUTHOR]
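The recursion this abstract describes can be sketched in a few lines. The following is a hypothetical Python illustration, not the authors' implementation: similarity is seeded with the identity, and each iteration scores a pair of nodes by a greedily approximated maximum-weight matching between their neighbor sets (the abstract itself suggests approximating the matching step); the normalization by the larger neighborhood size, the function name, and the toy graph are all assumptions made for the example.

```python
def matchsim(neighbors, iters=10):
    """Iterate a MatchSim-style similarity to (approximate) fixed point."""
    nodes = sorted(neighbors)
    # seed with the identity: a node is fully similar only to itself
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iters):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                    continue
                na, nb = neighbors[a], neighbors[b]
                if not na or not nb:
                    new[(a, b)] = 0.0
                    continue
                # greedy approximation of the maximum-weight matching
                # between the two neighbor sets
                pairs = sorted(((sim[(x, y)], x, y) for x in na for y in nb),
                               reverse=True)
                used_x, used_y, weight = set(), set(), 0.0
                for s, x, y in pairs:
                    if x not in used_x and y not in used_y:
                        used_x.add(x)
                        used_y.add(y)
                        weight += s
                # normalize by the larger neighborhood (assumed convention)
                new[(a, b)] = weight / max(len(na), len(nb))
        sim = new
    return sim

# toy graph: 'a' and 'b' share their only neighbor 'c'
graph = {'a': ['c'], 'b': ['c'], 'c': ['a', 'b']}
scores = matchsim(graph)
```

On this toy graph, 'a' and 'b' come out maximally similar because their neighbor sets match perfectly; the real paper additionally proves convergence and proposes pruning, which this sketch omits.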
- Published
- 2012
- Full Text
- View/download PDF
19. Editorial.
- Author
- Ng, Wee, Kitsuregawa, Masaru, and Li, Jianzhong
- Subjects
- WEBSITES, DATA mining
- Abstract
The authors introduce several articles published in the issue, including a paper on extracting and summarizing hot item features from multiple auction web sites and another on preserving privacy in data mining.
- Published
- 2008
- Full Text
- View/download PDF
20. Information extraction from electronic medical documents: state of the art and future research directions.
- Author
- Landolsi, Mohamed Yassine, Hlaoua, Lobna, and Ben Romdhane, Lotfi
- Subjects
- DATA mining, ELECTRONIC records, NATURAL language processing, EXPERIMENTAL literature, MEDICAL errors
- Abstract
In the medical field, a doctor must maintain comprehensive knowledge by reading and writing narrative documents, and is responsible for every decision taken for patients. Unfortunately, it is very tiring to read all the necessary information about drugs, diseases and patients, given the large and ever-growing amount of documents. Consequently, many medical errors can happen, and some can even kill people. Information extraction is an important field that can handle this problem. It comprises several tasks for extracting the desired information from unstructured text written in natural language. The principal tasks are named entity recognition and relation extraction, since they can structure the text by extracting the relevant information. However, in order to treat narrative text we should use natural language processing techniques to extract useful information and features. In our paper, we introduce and discuss the various techniques and solutions used in these tasks. Furthermore, we outline the challenges in information extraction from medical documents. To our knowledge, this is the most comprehensive survey in the literature, with an experimental analysis and suggestions for some uncovered directions. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
21. Integrated anchor and social link predictions across multiple social networks.
- Author
- Zhan, Qianyi, Zhang, Jiawei, and Yu, Philip S.
- Subjects
- SOCIAL networks, ONLINE social networks, SOCIAL sciences education, RANDOM walks, ONLINE business networks (Social networks), PROBABILISTIC databases
- Abstract
In recent years, various online social networks offering specific services have gained great popularity and success. To enjoy more online social services, some users are involved in multiple social networks simultaneously. A challenging problem in social network studies is to identify the common users across networks to gain a better understanding of user behavior. This is referred to as the anchor link prediction problem. Meanwhile, across these partially aligned social networks, users can be connected by different kinds of links, e.g., social links among users within one network and anchor links between accounts of the shared users in different networks. Many link prediction methods have been proposed so far to predict each type of link separately. In this paper, we want to predict the formation of social links among users in the target network as well as anchor links aligning the target network with other external social networks. The problem is formally defined as the "collective link identification" problem. Predicting the formation of links in social networks with traditional link prediction methods, e.g., classification-based methods, can be very challenging. The reason is that, from the network, we can only obtain the formed links (i.e., positive links) but no information about the links that will never be formed (i.e., negative links). To solve the collective link identification problem, a unified link prediction framework, collective link fusion (CLF), is proposed in this paper, which consists of two phases: (1) collective link prediction of anchor and social links with positive and unlabeled learning techniques, and (2) propagation of predicted links across the partially aligned "probabilistic networks" with a collective random walk. Extensive experiments conducted on two real-world partially aligned networks demonstrate that CLF performs very well in predicting social and anchor links concurrently. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
22. An efficient algorithm for mining closed high utility itemsets over data streams with one dataset scan.
- Author
-
Han, Meng, Cheng, Haodong, Zhang, Ni, Li, Xiaojuan, and Wang, Le
- Subjects
MINE closures ,ALGORITHMS ,PROCESS mining ,DATA mining - Abstract
The high utility itemsets mining over data streams will produce many redundant itemsets. To remove redundant itemsets, researchers proposed mining closed high utility itemsets, whose number is much smaller than that of the complete high utility itemsets while the result remains lossless. However, the existing closed high utility itemsets mining algorithm over data streams needs to scan the dataset twice, and algorithms that require multiple scans cannot meet the real-time processing requirements of a streaming environment. To solve this problem, this paper proposes a new algorithm, CHUIDS_OSc, which needs to scan the original dataset only once to mine closed high utility itemsets over data streams. A new utility-list structure is designed in CHUIDS_OSc; this structure can quickly complete the construction and update of batch information without rescanning the original dataset. In addition, effective pruning strategies are applied to improve the closed itemsets mining process and eliminate potential low utility candidates. Experimental evaluations show the efficiency and feasibility of the algorithm for scanning and processing datasets. In terms of running time, it outperforms the previously proposed closed high utility itemsets mining algorithms that require multiple scans over data streams. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
23. An overview of high utility itemsets mining methods based on intelligent optimization algorithms.
- Author
-
Han, Meng, Gao, Zhihui, Li, Ang, Liu, Shujuan, and Mu, Dongliang
- Subjects
SWARM intelligence ,MATHEMATICAL optimization ,PARTICLE swarm optimization ,EVOLUTIONARY algorithms ,GENETIC algorithms - Abstract
Mining high utility itemsets from massive data is one of the most active research directions in data mining at present. Intelligent optimization algorithms have been applied to high utility itemsets mining because of their flexibility and intelligence, and have achieved good results. In this paper, high utility itemsets mining strategies based on swarm intelligence optimization algorithms are analyzed and summarized comprehensively, and strategies based on evolutionary algorithms and other intelligent optimization algorithms are introduced in detail. The methods based on swarm intelligence optimization algorithms are summarized and compared in terms of update strategy, pruning strategy, comparison algorithms, datasets, parameter settings, advantages, and disadvantages. The methods based on particle swarm optimization are classified by particle update scheme: traditional update strategies, sigmoid function-based strategies, greed-based strategies, roulette mechanism-based strategies, and set-based strategies. The algorithms are compared experimentally in terms of operational efficiency and the number of high utility itemsets mined under the same dataset conditions. The experimental analysis shows that the strategies based on swarm intelligence optimization algorithms perform best, especially the high utility itemsets mining algorithms based on bionic algorithms, which have shorter running times and lose fewer high utility itemsets, while the strategies based on genetic algorithms are the least efficient and lose a large number of itemsets. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
24. Distribution-free bounds for relational classification.
- Author
-
Dhurandhar, Amit and Dobra, Alin
- Subjects
NONPARAMETRIC statistics ,MACHINE learning ,MATHEMATICAL statistics ,INFORMATION resources management ,DATA mining - Abstract
Statistical relational learning (SRL) is a subarea in machine learning which addresses the problem of performing statistical inference on data that is correlated and not independently and identically distributed (i.i.d.)-as is generally assumed. For the traditional i.i.d. setting, distribution-free bounds exist, such as the Hoeffding bound, which are used to provide confidence bounds on the generalization error of a classification algorithm given its hold-out error on a sample size of N. Bounds of this form are currently not present for the type of interactions that are considered in the data by relational classification algorithms. In this paper, we extend the Hoeffding bounds to the relational setting. In particular, we derive distribution-free bounds for certain classes of data generation models that do not produce i.i.d. data and are based on the type of interactions that are considered by relational classification algorithms that have been developed in SRL. We conduct empirical studies on synthetic and real data which show that these data generation models are indeed realistic and the derived bounds are tight enough for practical use. [ABSTRACT FROM AUTHOR]
- Published
- 2012
- Full Text
- View/download PDF
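The classical i.i.d. Hoeffding bound that this paper extends to the relational setting is simple to state: with probability at least 1 − δ, the generalization error deviates from the hold-out error on N examples by at most sqrt(ln(2/δ)/(2N)). A minimal sketch of that standard bound (the function name is a hypothetical helper, not from the paper):

```python
import math

def hoeffding_epsilon(n, delta):
    """Half-width of the two-sided Hoeffding confidence interval:
    with probability >= 1 - delta, |true_error - holdout_error| <= epsilon
    for a hold-out sample of n i.i.d. examples (errors bounded in [0, 1])."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

# e.g., 10,000 hold-out examples at 95% confidence
print(round(hoeffding_epsilon(10_000, 0.05), 4))  # 0.0136
```

The bound shrinks at rate 1/sqrt(N); the paper's contribution is deriving analogues of this for correlated, non-i.i.d. relational data, where this formula no longer applies directly.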
25. Surface pattern-enhanced relation extraction with global constraints.
- Author
-
Jiang, Haiyun, Liu, JunTao, Zhang, Sheng, Yang, Deqing, Xiao, Yanghua, and Wang, Wei
- Subjects
DATA mining ,GSM communications ,PRIOR learning ,EMBEDDINGS (Mathematics) - Abstract
Relation extraction is one of the most important tasks in information extraction. The traditional works either use sentences or surface patterns (i.e., the shortest dependency paths of sentences) to build extraction models. Intuitively, the integration of these two kinds of methods will further obtain more robust and effective extraction models, which is, however, ignored in most of the existing works. In this paper, we aim to learn the embeddings of surface patterns to further augment the sentence-based models. To achieve this purpose, we propose a novel pattern embedding learning framework with the weighted multi-dimensional attention mechanism. To suppress noise in the training dataset, we mine the global statistics between patterns and relations and introduce two kinds of prior knowledge to guide the pattern embedding learning. Based on the learned embeddings, we present two augmentation strategies to improve the existing relation extraction models. We conduct extensive experiments on two popular datasets (i.e., NYT and KnowledgeNet) and observe promising performance improvements. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
26. Elastic distances for time-series classification: Itakura versus Sakoe-Chiba constraints.
- Author
-
Geler, Zoltan, Kurbalija, Vladimir, Ivanović, Mirjana, and Radovanović, Miloš
- Subjects
EUCLIDEAN distance ,TIME series analysis ,NEAREST neighbor analysis (Statistics) ,DATA mining ,PARALLELOGRAMS ,CLASSIFICATION ,MATHEMATICAL continuum - Abstract
In the field of time series data mining, the accuracy of the simple, but very successful nearest neighbor (NN) classifier directly depends on the chosen similarity measure. To improve the efficiency of elastic measures introduced to overcome the shortcomings of Euclidean distance, the Sakoe-Chiba band is usually applied as a constraint. In this paper, we provide a detailed analysis of the influence of the alternative Itakura parallelogram constraint on the accuracy of the NN classifier in combination with four well-known elastic measures, compared to the Sakoe-Chiba constraint and the unconstrained variants of these measures. The findings suggest that, although the Sakoe-Chiba band generally produces better results, for certain types of datasets the Itakura parallelogram represents a better choice. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
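As background to the constraints compared above, a Sakoe-Chiba band of half-width band restricts the DTW warping path to cells with |i − j| ≤ band. A minimal pure-Python sketch for equal-length series (illustrative only; the function name and squared-distance local cost are assumptions, not code from the paper):

```python
import math

def dtw_sakoe_chiba(x, y, band):
    """DTW distance between equal-length series x and y, with the
    warping path restricted to the Sakoe-Chiba band |i - j| <= band."""
    n = len(x)
    INF = float("inf")
    cost = [[INF] * (n + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        # Only cells inside the band are ever filled in.
        for j in range(max(1, i - band), min(n, i + band) + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return math.sqrt(cost[n][n])
```

With band = 0 the path is forced onto the diagonal and the measure reduces to Euclidean distance; with band = n − 1 it is unconstrained DTW. The Itakura parallelogram studied in the paper instead uses a window that is narrow near the endpoints and widest in the middle.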
27. The examination of the effect of the criterion for neural network's learning on the effectiveness of the qualitative analysis of multidimensional data.
- Author
-
Jamróz, Dariusz
- Subjects
DATA analysis ,DATA modeling ,QUALITATIVE chemical analysis ,ARTIFICIAL neural networks ,DATA mining - Abstract
A variety of multidimensional visualization methods are applied for the qualitative analysis of multidimensional data. One such method uses autoassociative neural networks. In order to visualize n-dimensional data, such a network has n inputs, n outputs and an interlayer consisting of two outputs whose values represent the coordinates of the analyzed sample's image on the screen. The criterion for the network's learning is that each ith output reproduces the value presented at the ith input. If the network is trained in this way, the whole information from the n inputs is compressed into the two outputs of the interlayer and then decompressed to the n network outputs. The paper shows that applying different learning criteria can be more beneficial from the point of view of the results' readability. The analysis was conducted on seven-dimensional real data representing three coal classes, five-dimensional data representing printed characters, 216-dimensional data representing hand-written digits and, to illustrate additional explanations, artificially generated seven-dimensional data. The readability of the qualitative analysis results for these data was compared using neural network-based multidimensional visualization under different learning criteria. The results of applying all analyzed criteria on 20 randomly selected multidimensional datasets from a publicly available repository are also presented. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
28. Missing data imputation using decision trees and fuzzy clustering with iterative learning.
- Author
-
Nikfalazar, Sanaz, Yeh, Chung-Hsing, Bedingfield, Susan, and Khorshidi, Hadi A.
- Subjects
DECISION trees ,MISSING data (Statistics) ,MULTIPLE imputation (Statistics) ,CATEGORIES (Mathematics) ,DATA mining ,MACHINE learning - Abstract
Various imputation approaches have been proposed to address the issue of missing values in data mining and machine learning applications. To improve the accuracy of missing data imputation, this paper proposes a new method called DIFC by integrating the merits of decision trees and fuzzy clustering into an iterative learning approach. To compare the performance of the DIFC method against five effective imputation methods, extensive experiments are conducted on six widely used datasets with numerical and categorical missing data, and with various amounts and types of missing values. The experimental results show that the DIFC method outperforms other methods in terms of imputation accuracy. Further experiments on the effect of missing value types demonstrate the robustness of the DIFC method in dealing with different types of missing values. This paper contributes to missing data imputation research by providing an accurate and robust method. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
29. Exceptionally monotone models-the rank correlation model class for Exceptional Model Mining.
- Author
-
Downar, Lennart and Duivesteijn, Wouter
- Subjects
COHERENT analytic sheaves ,RANK correlation (Statistics) ,STATISTICAL correlation ,PEARSON correlation (Statistics) ,COMPUTATIONAL chemistry - Abstract
Exceptional Model Mining strives to find coherent subgroups of the dataset where multiple target attributes interact in an unusual way. One instance of such an investigated form of interaction is Pearson's correlation coefficient between two targets. EMM then finds subgroups with an exceptionally linear relation between the targets. In this paper, we enrich the EMM toolbox by developing the more general rank correlation model class. We find subgroups with an exceptionally monotone relation between the targets. Apart from catering for this richer set of relations, the rank correlation model class does not necessarily require the assumption of target normality, which is implicitly invoked in the Pearson's correlation model class. Furthermore, it is less sensitive to outliers. We provide pseudocode for the employed algorithm and analyze its computational complexity, and experimentally illustrate what the rank correlation model class for EMM can find for you on six datasets from an eclectic variety of domains. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
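The model-class change described above amounts to replacing Pearson's correlation of the raw targets with the correlation of their ranks (Spearman's ρ), which rewards monotone rather than strictly linear relations. A small self-contained illustration (tie handling omitted for brevity; all names are hypothetical, not from the paper):

```python
def ranks(values):
    """Rank positions 1..n (assumes no ties, for simplicity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    # Spearman's rho = Pearson correlation of the rank vectors
    return pearson(ranks(x), ranks(y))

# A monotone but non-linear relation: y = x**3 on a skewed grid
x = [1, 2, 3, 4, 20]
y = [v ** 3 for v in x]
print(round(spearman(x, y), 6))   # 1.0: perfectly monotone
print(pearson(x, y) < 1.0)        # True: not perfectly linear
```

A subgroup with near-maximal |ρ| but modest Pearson correlation is exactly the kind of "exceptionally monotone" interaction this model class surfaces.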
30. Combining supervised term-weighting metrics for SVM text classification with extended term representation.
- Author
-
Haddoud, Mounia, Mokhtari, Aïcha, Lecroq, Thierry, and Abdeddaïm, Saïd
- Subjects
SUPPORT vector machines ,CLASSIFICATION ,STATISTICAL weighting ,DATA mining ,VECTOR spaces - Abstract
The accuracy of a text classification method based on an SVM learner depends on the weighting metric used in order to assign a weight to a term. Weighting metrics can be classified as supervised or unsupervised according to whether they use prior information on the number of documents belonging to each category. A supervised metric should be highly informative about the relation of a document term to a category, and discriminative in separating the positive documents from the negative documents for this category. In this paper, we propose 80 metrics never used for the term-weighting problem and compare them to 16 functions from the literature. A large number of these metrics were initially proposed for other data mining problems: feature selection, classification rules and term collocations. While many previous works have shown the merits of using a particular metric, our experience suggests that the results obtained by such metrics can be highly dependent on the label distribution on the corpus and on the performance measures used (microaveraged or macroaveraged $$F_1$$-score). The solution that we propose consists in combining the metrics in order to improve the classification. More precisely, we show that using an SVM classifier which combines the outputs of SVM classifiers that utilize different metrics performs well in all situations. The second main contribution of this paper is an extended term representation for the vector space model that improves significantly the prediction of the text classifier. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
31. Empirical comparison of supervised learning techniques for missing value imputation.
- Author
-
Tsai, Chih-Fong and Hu, Ya-Han
- Subjects
SUPERVISED learning ,MISSING data (Statistics) ,SUPPORT vector machines ,K-nearest neighbor classification ,MACHINE learning ,DATA mining - Abstract
Many data mining algorithms cannot handle incomplete datasets where some data samples are missing attribute values. To solve this problem, missing value imputation is usually conducted and commonly based on reasoning from observed data or complete data to provide estimated replacements for missing values. In general, missing imputation methods can be classified into statistical and machine learning methods. The statistical methods are usually based on the mean for continuous attributes or mode for discrete attributes, whereas the machine learning methods are based on supervised learning techniques. However, which machine learning method performs optimally for missing value imputation is unknown. This paper compares five well-known supervised learning techniques, namely k-nearest neighbor, the multilayer perceptron neural network (MLP), the classification and regression tree (CART), naïve Bayes, and the support vector machine, to examine their imputation results for categorical, numerical, and mixed data types. The experimental results demonstrate that CART outperforms the other methods for categorical datasets, whereas the MLP is optimal for numerical and mixed datasets in terms of classification accuracy. However, when computational cost is a factor, CART is superior to the MLP because CART can provide reasonably accurate imputation results and requires the least amount of time to perform missing value imputation. Moreover, CART generates the lowest root-mean-squared error of all methods. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
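For context, the statistical baselines mentioned in the abstract (mean for continuous attributes, mode for discrete ones) fit in a few lines, whereas the supervised methods compared in the paper learn the fill value from the other attributes. A minimal sketch of the baseline (hypothetical function, not code from the paper):

```python
from collections import Counter

def impute_column(values, kind):
    """Fill None entries: column mean for 'numeric' attributes,
    column mode for 'categorical' ones."""
    observed = [v for v in values if v is not None]
    if kind == "numeric":
        fill = sum(observed) / len(observed)
    else:
        fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]

print(impute_column([1.0, None, 3.0], "numeric"))           # [1.0, 2.0, 3.0]
print(impute_column(["a", "b", None, "a"], "categorical"))  # ['a', 'b', 'a', 'a']
```

A supervised imputer such as CART replaces the single column-wide fill value with a prediction conditioned on the record's observed attributes, which is where the accuracy gains reported in the paper come from.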
32. Modeling and implementing distributed data mining strategies in JaCa-DDM.
- Author
-
Limón, Xavier, Guerra-Hernández, Alejandro, Cruz-Ramírez, Nicandro, and Grimaldo, Francisco
- Subjects
DATA mining ,MULTIAGENT systems ,DISTRIBUTED computing ,MACHINE learning ,DECISION trees ,AUTOMATIC extracting (Information science) ,COMPUTER software reusability - Abstract
This work introduces JaCa-DDM, a novel distributed data mining system founded on the agents and artifacts paradigm, conceived to design, implement, deploy, and evaluate learning strategies. Jason rational agents conform to such strategies to cope with distributed computing environments, where CArtAgO artifacts encapsulate learning algorithms, data sources, evaluation tools, and other services implemented in Weka for data mining tasks. The set of strategies presented in this paper aims at encouraging the use of JaCa-DDM to develop new ones, suited to different needs. For this, our system provides tools to evaluate the resulting models in terms of accuracy, number of instances employed to learn, time of convergence, and volume of communications. Although the emphasis is on decision trees, JaCa-DDM can be easily extended by adopting new artifacts, e.g., for meta-learning. The main contributions of the paper are as follows: (i) From the multi-agent systems perspective, our approach illustrates how to exploit the so-called "agentification" of Weka for the sake of code reusability, while preserving the benefits of reasoning at the Belief–Desire–Intention level with Jason; (ii) from the data mining perspective, JaCa-DDM is promoted as an extensible tool to define and test distributed strategies; and (iii) a set of strategies including centralizing, meta-learning and Windowing-based approaches, is carefully analyzed to provide comparisons among them. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
33. Analyzing large-scale human mobility data: a survey of machine learning methods and applications.
- Author
-
Toch, Eran, Lerner, Boaz, Ben-Zion, Eyal, and Ben-Gal, Irad
- Subjects
CELL phones ,MACHINE learning ,DATA mining ,GLOBAL Positioning System ,LOCATION-based services ,SOCIAL mobility - Abstract
Human mobility patterns reflect many aspects of life, from the global spread of infectious diseases to urban planning and daily commute patterns. In recent years, the prevalence of positioning methods and technologies, such as the global positioning system, cellular radio tower geo-positioning, and WiFi positioning systems, has driven efforts to collect human mobility data and to mine patterns of interest within these data in order to promote the development of location-based services and applications. The efforts to mine significant patterns within large-scale, high-dimensional mobility data have solicited use of advanced analysis techniques, usually based on machine learning methods, and therefore, in this paper, we survey and assess different approaches and models that analyze and learn human mobility patterns using mainly machine learning methods. We categorize these approaches and models in a taxonomy based on their positioning characteristics, the scale of analysis, the properties of the modeling approach, and the class of applications they can serve. We find that these applications can be categorized into three classes: user modeling, place modeling, and trajectory modeling, each class with its characteristics. Finally, we analyze the short-term trends and future challenges of human mobility analysis. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
34. A local asynchronous distributed privacy preserving feature selection algorithm for large peer-to-peer networks.
- Author
-
Das, Kamalika, Bhaduri, Kanishka, and Kargupta, Hillol
- Subjects
ALGORITHMS ,PEER-to-peer architecture (Computer networks) ,MACHINE learning ,DISTRIBUTED computing ,PRIVACY - Abstract
In this paper, we develop a local distributed privacy preserving algorithm for feature selection in a large peer-to-peer environment. Feature selection is often used in machine learning for data compaction and efficient learning by eliminating the curse of dimensionality. There exist many solutions for feature selection when the data are located at a central location. However, it becomes extremely challenging to perform the same when the data are distributed across a large number of peers or machines. Centralizing the entire dataset or portions of it can be very costly and impractical because of the large number of data sources, the asynchronous nature of the peer-to-peer networks, dynamic nature of the data/network, and privacy concerns. The solution proposed in this paper allows us to perform feature selection in an asynchronous fashion with a low communication overhead where each peer can specify its own privacy constraints. The algorithm works based on local interactions among participating nodes. We present results on a real-world dataset in order to test the performance of the proposed algorithm. [ABSTRACT FROM AUTHOR]
- Published
- 2010
- Full Text
- View/download PDF
35. Exploiting edge semantics in citation graphs using efficient, vertical ARM.
- Author
-
Rahal, Imad, Ren, Dongmei, Weihua Wu, Denton, Anne, Besemann, Christopher, and Perrizo, William
- Subjects
GRAPH theory ,SEMANTICS ,PUBLICATIONS ,ALGORITHMS ,ALGEBRA - Abstract
Graphs are increasingly becoming a vital source of information within which a great deal of semantics is embedded. As the size of available graphs increases, our ability to arrive at the embedded semantics grows into a much more complicated task. One form of important hidden semantics is that which is embedded in the edges of directed graphs. Citation graphs serve as a good example in this context. This paper attempts to understand temporal aspects in publication trends through citation graphs, by identifying patterns in the subject matters of scientific publications using an efficient, vertical association rule mining model. Such patterns can (a) indicate subject-matter evolutionary history, (b) highlight subject-matter future extensions, and (c) give insights on the potential effects of current research on future research. We highlight our major differences with previous work in the areas of graph mining, citation mining, and Web-structure mining, propose an efficient vertical data representation model, introduce a new subjective interestingness measure for evaluating patterns with a special focus on those patterns that signify strong associations between properties of cited papers and citing papers, and present an efficient algorithm for the purpose of discovering rules of interest followed by a detailed experimental analysis. [ABSTRACT FROM AUTHOR]
- Published
- 2006
- Full Text
- View/download PDF
36. Random-data perturbation techniques and privacy-preserving data mining.
- Author
-
Kargupta, Hillol, Datta, Souptik, Wang, Qi, and Sivakumar, Krishnamoorthy
- Subjects
DATA mining ,METHODOLOGY ,MATRICES (Mathematics) ,ALGORITHMS - Abstract
Privacy is becoming an increasingly important issue in many data-mining applications. This has triggered the development of many privacy-preserving data-mining techniques. A large fraction of them use randomized data-distortion techniques to mask the data for preserving the privacy of sensitive data. This methodology attempts to hide the sensitive data by randomly modifying the data values often using additive noise. This paper questions the utility of the random-value distortion technique in privacy preservation. The paper first notes that random matrices have predictable structures in the spectral domain and then it develops a random matrix-based spectral-filtering technique to retrieve original data from the dataset distorted by adding random values. The proposed method works by comparing the spectrum generated from the observed data with that of random matrices. This paper presents the theoretical foundation and extensive experimental results to demonstrate that, in many cases, random-data distortion preserves very little data privacy. The analytical framework presented in this paper also points out several possible avenues for the development of new privacy-preserving data-mining techniques. Examples include algorithms that explicitly guard against privacy breaches through linear transformations, exploiting multiplicative and colored noise for preserving privacy in data mining applications. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
37. Characterizing and Mining the Citation Graph of the Computer Science Literature.
- Author
-
An, Yuan, Janssen, Jeannette, and Milios, Evangelos E.
- Subjects
GRAPHIC methods ,DATA mining ,DATABASE searching ,KNOWLEDGE management ,COMPUTER science - Abstract
Citation graphs representing a body of scientific literature convey measures of scholarly activity and productivity. In this work we present a study of the structure of the citation graph of the computer science literature. Using a web robot we built several topic-specific citation graphs and their union graph from the digital library ResearchIndex. After verifying that the degree distributions follow a power law, we applied a series of graph theoretical algorithms to elicit an aggregate picture of the citation graph in terms of its connectivity. We discovered the existence of a single large weakly-connected and a single large biconnected component, and confirmed the expected lack of a large strongly-connected component. The large components remained even after removing the strongest authority nodes or the strongest hub nodes, indicating that such tight connectivity is widespread and does not depend on a small subset of important nodes. Finally, minimum cuts between authority papers of different areas did not result in a balanced partitioning of the graph into areas, pointing to the need for more sophisticated algorithms for clustering the graph. [ABSTRACT FROM AUTHOR]
- Published
- 2004
- Full Text
- View/download PDF
38. Generalized maximal utility for mining high average-utility itemsets.
- Author
-
Song, Wei, Liu, Lu, and Huang, Chaomin
- Subjects
ALGORITHMS ,DATA mining - Abstract
Mining high average-utility itemsets (HAUIs) is a promising research topic in data mining because, in contrast to high utility itemsets, they are not biased toward long itemsets. Regardless of what upper bounds and pruning strategies are used, most existing HAUI mining algorithms are founded on the concept of maximal utility, namely the highest utility of a single item in each transaction. In this paper, we study this problem by generalizing the typical maximal utility and average-utility upper bound from a single item to an itemset, and propose an efficient HAUI mining algorithm based on generalized maximal utility (HAUIM-GMU). For this algorithm, we first propose the concepts of generalized maximal utility and the generalized average-utility upper bound, and discuss how the proposed upper bound can be made tighter to generate fewer candidates. A new pruning strategy is then proposed based on the concept of support, and this is shown to be effective for filtering out unpromising itemsets. The final algorithm is described in detail. Extensive experimental results show that the HAUIM-GMU algorithm outperforms existing state-of-the-art algorithms. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
39. Visualisation and interaction for scientific exploration and knowledge discovery.
- Author
-
Zudilova-Seinstra, Elena and Adriaansen, Tony
- Subjects
DATA mining ,ONLINE data processing - Abstract
The article discusses reports published within the issue including one on the human-computer visual interaction approach to the data mining of complex spatio-temporal phenomena and another one on the distributed scientific collaboration.
- Published
- 2007
- Full Text
- View/download PDF
40. Class imbalance revisited: a new experimental setup to assess the performance of treatment methods.
- Author
-
Prati, Ronaldo, Batista, Gustavo, and Silva, Diego
- Subjects
MACHINE learning ,DATA mining ,PATTERN recognition systems ,EXPERIMENTAL design ,SUPPORT vector machines - Abstract
In the last decade, class imbalance has attracted a huge amount of attention from researchers and practitioners. Class imbalance is ubiquitous in Machine Learning, Data Mining and Pattern Recognition applications; therefore, these research communities have responded to such interest with literally dozens of methods and techniques. Surprisingly, there are still many fundamental open-ended questions such as 'Are all learning paradigms equally affected by class imbalance?', 'What is the expected performance loss for different imbalance degrees?' and 'How much of the performance losses can be recovered by the treatment methods?'. In this paper, we propose a simple experimental design to assess the performance of class imbalance treatment methods. This experimental setup uses real data sets with artificially modified class distributions to evaluate classifiers in a wide range of class imbalance. We apply such experimental design in a large-scale experimental evaluation with 22 data sets and seven learning algorithms from different paradigms. We also propose a statistical procedure aimed to evaluate the relative degradation and recoveries, based on confidence intervals. This procedure allows a simple yet insightful visualization of the results, as well as provides the basis for drawing statistical conclusions. Our results indicate that the expected performance loss, as a percentage of the performance obtained with the balanced distribution, is quite modest (below 5 %) for the most balanced distributions up to 10 % of minority examples. However, the loss tends to increase quickly for higher degrees of class imbalance, reaching 20 % for 1 % of minority class examples. Support Vector Machine is the classifier paradigm that is less affected by class imbalance, being almost insensitive to all but the most imbalanced distributions. Finally, we show that the treatment methods only partially recover the performance losses. On average, typically, about 30 % or less of the performance that was lost due to class imbalance was recovered by these methods. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
41. Privacy-preserving LOF outlier detection.
- Author
-
Li, Lu, Huang, Liusheng, Yang, Wei, Yao, Xiaohui, and Liu, An
- Subjects
DATA mining ,OUTLIER detection ,K-nearest neighbor classification ,PERMUTATIONS ,DATA encryption - Abstract
LOF is a well-known approach for density-based outlier detection and has received much attention recently. It is important to design a privacy-preserving LOF outlier detection algorithm as the data on which LOF runs is typically split among multiple participants and no one is willing to disclose his sensitive information due to legal or moral considerations. This is, however, a hard problem since participants need to find the maximum of the distances between an object and its k-Nearest Neighbors (k-NN) without learning the information of these objects. In this paper, we propose an efficient protocol for privacy-preserving LOF outlier detection. We first employ a shuffle protocol to permute the distance vectors owned by different participants. Then, we design a secure selection method to obtain the garbled k-NN indexes and shares of k-distance for given objects. For each object, we make use of the k-distance of all objects to construct a vector, based on which the permute protocol is executed again to obtain new shares of k-distance. Finally, the shares corresponding to the garbled k-NN indexes are selected as the expected result. Our protocol ensures that all the intermediates are shared between multiple participants and thus avoid information leaking. In addition, our protocol is efficient as we prove that the computation and communication complexity of our protocol is bounded by $$O(n^2)$$. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
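For reference, the plain (non-private) LOF computation that the protocol protects looks like this; a brute-force sketch for small data, with names chosen for the example:

```python
import math

def lof_scores(points, k):
    """Plain LOF: k-distance, local reachability density (lrd), then the
    ratio of neighbours' lrd to a point's own lrd. Scores near 1 indicate
    inliers; larger scores indicate outliers."""
    n = len(points)
    knn, kdist = [], []
    for i in range(n):
        ds = sorted((math.dist(points[i], points[j]), j) for j in range(n) if j != i)
        knn.append([j for _, j in ds[:k]])
        kdist.append(ds[k - 1][0])  # distance to the k-th nearest neighbour

    def lrd(i):
        # reachability distance to each neighbour, floored at its k-distance
        reach = [max(kdist[j], math.dist(points[i], points[j])) for j in knn[i]]
        return len(reach) / sum(reach)

    return [sum(lrd(j) for j in knn[i]) / (k * lrd(i)) for i in range(n)]
```

In the protocol, both the k-NN selection and the maximum in the k-distance step must be computed on shared, garbled values so no participant sees the others' objects.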
42. Improving the efficiency of traditional DTW accelerators.
- Author
-
Tavenard, Romain and Amsaleg, Laurent
- Subjects
DATABASE searching ,BIOINFORMATICS ,LIFE sciences ,DATA mining ,DATABASE management ,DATABASES - Abstract
Dynamic time warping (DTW) is the most popular approach for evaluating the similarity of time series, but its computation is costly. Therefore, simple functions lower bounding DTW distances have been designed, accelerating searches by quickly pruning sequences that could not possibly be best matches. The tighter the bounds, the more they prune and the better the performance. Designing new functions that are even tighter is difficult because their computation is likely to become complex, canceling the benefits of their pruning. It is possible, however, to design simple functions with a higher pruning power by relaxing the no-false-dismissal assumption, resulting in approximate lower bound functions. This paper describes how very popular approaches accelerating DTW, such as LB_Keogh and LB_PAA, can be made more efficient via approximations. The accuracy of the approximations can be tuned, ranging from no false dismissals to potential losses when aggressively set for great response time savings. At very large scale, indexing time series is mandatory. This paper also describes how approximate lower bound functions can be used with iSAX. Furthermore, it shows that a k-means-based quantization step for iSAX gives significant performance gains. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
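The envelope-based lower bound at the heart of LB_Keogh can be sketched as below, next to a plain quadratic-time DTW for comparison (squared-distance ground cost; a small illustrative implementation, not the paper's code):

```python
def dtw(a, b):
    """Plain O(len(a)*len(b)) DTW with squared ground cost."""
    INF = float("inf")
    D = [[INF] * (len(b) + 1) for _ in range(len(a) + 1)]
    D[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[-1][-1]

def lb_keogh(query, candidate, r):
    """LB_Keogh: distance from the candidate to the query's envelope,
    built over a warping window of radius r; cheap to compute and never
    larger than the corresponding constrained DTW distance."""
    total = 0.0
    for i, c in enumerate(candidate):
        window = query[max(0, i - r): i + r + 1]
        lo, hi = min(window), max(window)
        if c > hi:
            total += (c - hi) ** 2
        elif c < lo:
            total += (c - lo) ** 2
    return total
```

In a search loop, candidates whose lower bound already exceeds the best DTW distance found so far are pruned without running the full DTW; loosening the envelope further yields the approximate (possibly false-dismissing) variants the paper studies.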
43. Imputing missing value through ensemble concept based on statistical measures.
- Author
-
Jenghara, Moslem Mohammadi, Ebrahimpour-Komleh, Hossein, Rezaie, Vahideh, Nejatian, Samad, Parvin, Hamid, and Yusof, Sharifah Kamilah Syed
- Subjects
DATA mining ,STATISTICAL matching ,MULTIPLE imputation (Statistics) ,KERNEL functions ,DISTRIBUTION (Probability theory) ,KURTOSIS - Abstract
Many datasets include missing values in their attributes. Most data mining techniques are not applicable in the presence of missing values, so an important preprocessing step in a data mining task is missing value management. One of the most important categories of missing value management techniques is missing value imputation. This paper presents a new imputation technique based on statistical measurements. The suggested technique employs an ensemble of estimators built to estimate the missing values from the positively and negatively correlated observed attributes separately. Each estimator guesses a value for a missing entry based on the average and variance of that feature, estimated from the non-missing values of that feature. The final consensus value for a missing entry is the weighted aggregation of the values estimated by the different estimators. The main weight is the attribute correlation, and a minor weight depends on factors such as the kernel function, kurtosis, skewness, the number of involved samples, and their composition. The missing values are deliberately produced at random at different levels. The experiments indicate that the suggested technique has good accuracy in comparison with classical methods. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
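A heavily simplified, single-missing-value sketch of the idea: one estimator per observed attribute, each predicting from the target feature's mean and standard deviation via the standardized feature value, combined with correlation-magnitude weights. Function and variable names are invented, and the paper's weighting additionally involves kernel-function-based terms omitted here:

```python
import statistics

def corr(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den if den else 0.0

def impute(rows, miss_row, miss_col):
    """Estimate rows[miss_row][miss_col]: every other column acts as one
    estimator, and estimates are averaged with |correlation| weights."""
    complete = [r for i, r in enumerate(rows) if i != miss_row]
    t = [r[miss_col] for r in complete]
    mt, st = statistics.fmean(t), statistics.pstdev(t)
    est, wsum = 0.0, 0.0
    for j in range(len(rows[0])):
        if j == miss_col:
            continue
        f = [r[j] for r in complete]
        c = corr(f, t)
        if not c:
            continue
        mf, sf = statistics.fmean(f), statistics.pstdev(f)
        z = (rows[miss_row][j] - mf) / sf if sf else 0.0
        est += abs(c) * (mt + c * z * st)  # one estimator's guess, weighted
        wsum += abs(c)
    return est / wsum if wsum else mt
```

The sign of the correlation decides whether an estimator pushes the guess above or below the feature's mean, mirroring the paper's separate treatment of positively and negatively correlated attributes.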
44. Dynamic affinity-based classification of multi-class imbalanced data with one-versus-one decomposition: a fuzzy rough set approach.
- Author
-
Vluymans, Sarah, Fernández, Alberto, Saeys, Yvan, Cornelis, Chris, and Herrera, Francisco
- Subjects
ROUGH sets ,SET theory ,AGGREGATION (Statistics) ,BIG data ,DATA mining - Abstract
Class imbalance occurs when data elements are unevenly distributed among classes, which poses a challenge for classifiers. The core focus of the research community has been on binary-class imbalance, although there is a recent trend toward the general case of multi-class imbalanced data. The IFROWANN method, a classifier based on fuzzy rough set theory, stands out for its performance on two-class imbalanced problems. In this paper, we consider its extension to multi-class data by combining it with one-versus-one decomposition. The latter transforms a multi-class problem into two-class sub-problems. Binary classifiers are applied to these sub-problems, after which their outcomes are aggregated into one prediction. We enhance the integration of IFROWANN in the decomposition scheme in two steps. First, we propose an adaptive weight setting for the binary classifier, addressing the varying characteristics of the sub-problems. We call this modified classifier IFROWANN-WIR. Second, we develop a new dynamic aggregation method called WV-FROST that combines the predictions of the binary classifiers with the global class affinity before making a final decision. In a meticulous experimental study, we show that our complete proposal outperforms the state of the art on a wide range of multi-class imbalanced datasets. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
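The one-versus-one decomposition with weighted-vote aggregation can be sketched generically as follows (1-D toy data and a nearest-class-mean scorer stand in for IFROWANN; WV-FROST additionally folds a global class-affinity term into the aggregation):

```python
from itertools import combinations
from statistics import fmean

def nearest_mean_score(points, ys, a, x):
    """Toy binary confidence for class a versus the other class in a
    two-class sub-problem: relative closeness of x to class a's mean."""
    ma = fmean(p for p, y in zip(points, ys) if y == a)
    mb = fmean(p for p, y in zip(points, ys) if y != a)
    da, db = abs(x - ma), abs(x - mb)
    return db / (da + db) if da + db else 0.5

def ovo_predict(train, labels, x, score=nearest_mean_score):
    """One-versus-one: train a binary scorer per class pair, then aggregate
    the pairwise confidences by weighted voting."""
    classes = sorted(set(labels))
    votes = dict.fromkeys(classes, 0.0)
    for a, b in combinations(classes, 2):
        pts = [p for p, y in zip(train, labels) if y in (a, b)]
        ys = [y for y in labels if y in (a, b)]
        s = score(pts, ys, a, x)   # confidence for a in the (a, b) sub-problem
        votes[a] += s
        votes[b] += 1.0 - s
    return max(votes, key=votes.get)
```

Each sub-problem sees only two of the classes, which is what lets a strong binary-imbalance method like IFROWANN be reused unchanged inside the scheme.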
45. OntoILPER: an ontology- and inductive logic programming-based system to extract entities and relations from text.
- Author
-
Lima, Rinaldo, Espinasse, Bernard, and Freitas, Fred
- Subjects
LOGIC programming ,DATA mining ,MACHINE learning ,DATA extraction ,ELECTRONIC data processing - Abstract
Named entity recognition (NER) and relation extraction (RE) are two important subtasks of information extraction (IE). Most current learning methods for NER and RE rely on supervised machine learning techniques, with more accurate results for NER than for RE. This paper presents OntoILPER, a system for extracting entity and relation instances from unstructured texts using an ontology and inductive logic programming, a symbolic machine learning technique. OntoILPER uses the domain ontology and takes advantage of a highly expressive relational hypothesis space for representing examples whose structure is relevant to IE. It induces extraction rules that subsume examples of entity and relation instances from a specific graph-based model of sentence representation. Furthermore, OntoILPER enables the exploitation of the domain ontology and further background knowledge in the form of relational features. To evaluate OntoILPER, several experiments on the TREC corpus for both the NER and RE tasks were conducted, and the results demonstrate its effectiveness in both tasks. This paper also provides a comparative assessment of OntoILPER and other NER and RE systems, showing that OntoILPER is very competitive on NER and outperforms the selected systems on RE. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
46. Exploiting highly qualified pattern with frequency and weight occupancy.
- Author
-
Gan, Wensheng, Lin, Jerry Chun-Wei, Fournier-Viger, Philippe, Chao, Han-Chieh, Zhan, Justin, and Zhang, Ji
- Subjects
DATA mining ,SEARCH engines ,INTERNET searching ,WEB search engines ,ELECTRONIC information resource searching - Abstract
Identifying useful knowledge embedded in the behavior of search engine users can provide valuable information for web searching and data mining. Numerous algorithms have been proposed to find desired interesting patterns, i.e., frequent patterns, in real-world applications. Most of these studies use frequency to measure the interestingness of patterns. However, each object may have a different importance in real-world applications, and the frequent patterns do not usually contain a large portion of the desired ones. In this paper, we present a novel method, called exploiting highly qualified patterns with frequency and weight occupancy (QFWO), to suggest possible highly qualified patterns utilizing the ideas of co-occurrence and weight occupancy. By considering item weight, weight occupancy and the frequency of patterns, we design a new class of highly qualified patterns. A novel set-enumeration tree called the frequency-weight (FW)-tree and two compact data structures, named the weight-list and the FW-table, are designed to maintain the global and partial downward closure properties of quality and weight occupancy and thus further prune the search space. The proposed method can mine highly qualified patterns in a recursive manner without candidate generation. Extensive experiments were conducted on both real-world and synthetic datasets to evaluate the effectiveness and efficiency of the proposed algorithm. The results demonstrate that the obtained patterns are reasonable and acceptable. Moreover, the designed QFWO, with several pruning strategies, is quite efficient in terms of runtime and search space. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
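The weight-occupancy measure can be illustrated as follows. This is my reading of the measure, the fraction of a supporting transaction's total item weight covered by the pattern, averaged over the supporting transactions, and it may differ in detail from the paper's exact definition:

```python
def weight_occupancy(pattern, transactions, weights):
    """For each transaction containing `pattern`, compute the share of that
    transaction's total item weight accounted for by the pattern's items,
    then average the shares (illustrative reading of the measure)."""
    pat = set(pattern)
    ratios = []
    for t in transactions:
        if pat <= set(t):
            total = sum(weights[i] for i in t)
            ratios.append(sum(weights[i] for i in pat) / total)
    return sum(ratios) / len(ratios) if ratios else 0.0
```

Unlike plain frequency, this score rewards patterns that dominate the transactions they appear in, which is why weighty but infrequent combinations can rank highly.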
47. Discovering topic structures of a temporally evolving document corpus.
- Author
-
Beykikhoshk, Adham, Arandjelović, Ognjen, Phung, Dinh, and Venkatesh, Svetha
- Subjects
DIRICHLET problem ,MARKOV processes ,GRAPH theory ,ALGORITHMS ,AUTISM spectrum disorders ,DATA mining - Abstract
In this paper we describe a novel framework for the discovery of the topical content of a data corpus, and the tracking of its complex structural changes across the temporal dimension. In contrast to previous work our model does not impose a prior on the rate at which documents are added to the corpus nor does it adopt the Markovian assumption which overly restricts the type of changes that the model can capture. Our key technical contribution is a framework based on (i) discretization of time into epochs, (ii) epoch-wise topic discovery using a hierarchical Dirichlet process-based model, and (iii) a temporal similarity graph which allows for the modelling of complex topic changes: emergence and disappearance, evolution, splitting, and merging. The power of the proposed framework is demonstrated on two medical literature corpora concerned with autism spectrum disorder (ASD) and the metabolic syndrome (MetS)—both increasingly important research subjects with significant social and healthcare consequences. In addition to the collected ASD and metabolic syndrome literature corpora which we made freely available, our contribution also includes an extensive empirical analysis of the proposed framework. We describe a detailed and careful examination of the effects that our algorithm's free parameters have on its output and discuss the significance of the findings both in the context of the practical application of our algorithm as well as in the context of the existing body of work on temporal topic analysis. Our quantitative analysis is followed by several qualitative case studies highly relevant to the current research on ASD and MetS, on which our algorithm is shown to capture well the actual developments in these fields. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
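Step (iii), the temporal similarity graph, can be sketched like this: topics are word distributions, topics in consecutive epochs are linked when sufficiently similar, and node degrees then encode the change types, emergence (no inbound edge), disappearance (no outbound edge), splitting (out-degree above one) and merging (in-degree above one). A toy sketch with invented names; the paper's similarity measure need not be cosine:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse word distributions (dicts)."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def temporal_graph(epochs, tau=0.5):
    """Edges between topics of consecutive epochs whose similarity reaches
    the threshold tau; epochs is a list of lists of topic distributions."""
    edges = []
    for t in range(len(epochs) - 1):
        for i, u in enumerate(epochs[t]):
            for j, v in enumerate(epochs[t + 1]):
                if cosine(u, v) >= tau:
                    edges.append(((t, i), (t + 1, j)))
    return edges
```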
48. DeepAM: a heterogeneous deep learning framework for intelligent malware detection.
- Author
-
Ye, Yanfang, Chen, Lingwei, Hou, Shifu, Hardy, William, and Li, Xin
- Subjects
MALWARE ,MACHINE learning ,DATA mining ,FEATURE selection ,ARTIFICIAL intelligence - Abstract
With computers and the Internet being essential in everyday life, malware poses serious and evolving threats to their security, making the detection of malware of utmost concern. Accordingly, there has been much research on intelligent malware detection applying data mining and machine learning techniques. Though great results have been achieved with these methods, most of them are built on shallow learning architectures. Due to its superior ability in feature learning through a multilayer deep architecture, deep learning is starting to be leveraged in industrial and academic research for different applications. In this paper, based on the Windows application programming interface (API) calls extracted from portable executable files, we study how a deep learning architecture can be designed for intelligent malware detection. We propose a heterogeneous deep learning framework composed of an AutoEncoder stacked with multilayer restricted Boltzmann machines and a layer of associative memory to detect newly unknown malware. The proposed deep learning model performs greedy layer-wise training for unsupervised feature learning, followed by supervised parameter fine-tuning. Unlike existing works, which only made use of files with class labels (either malicious or benign) during the training phase, we utilize both labeled and unlabeled file samples to pre-train multiple layers in the heterogeneous deep learning framework from the bottom up for feature learning. A comprehensive experimental study on a real and large file collection from Comodo Cloud Security Center is performed to compare various malware detection approaches. Promising experimental results demonstrate that our proposed deep learning framework can further improve overall performance in malware detection compared with traditional shallow learning methods, deep learning methods with a homogeneous framework, and other existing anti-malware scanners.
The proposed heterogeneous deep learning framework can also be readily applied to other malware detection tasks. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
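The input representation, API calls extracted from portable executables mapped to fixed-length vectors, can be sketched as follows (a minimal binary bag-of-APIs; the API names and the function name are illustrative, not from the paper):

```python
def api_feature_matrix(samples):
    """Map each file's extracted API-call list to a binary feature vector
    over the global API vocabulary: the kind of fixed-length input a deep
    architecture is pre-trained on."""
    vocab = sorted({api for calls in samples for api in calls})
    index = {api: i for i, api in enumerate(vocab)}
    rows = []
    for calls in samples:
        v = [0] * len(vocab)
        for api in calls:
            v[index[api]] = 1
        rows.append(v)
    return vocab, rows
```

Because building this matrix needs no class labels, both labeled and unlabeled files can feed the unsupervised pre-training stage the paper describes.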
49. GrAFCI+: A fast generator-based algorithm for mining frequent closed itemsets.
- Author
-
Ledmi, Makhlouf, Zidat, Samir, and Hamdi-Cherif, Aboubekeur
- Subjects
ALGORITHMS ,ASSOCIATION rule mining ,DATA mining - Abstract
Mining itemsets for association rule generation is a fundamental data mining task originally stemming from the traditional market basket analysis problem. However, enumerating all frequent itemsets, especially in a dense dataset or with low support thresholds, remains costly. In this paper, a novel theorem establishes the relationship between frequent closed itemsets (FCIs) and frequent generator itemsets (FGIs) and proves that the process of mining FCIs is equivalent to mining FGIs unified with their full-support and extension items. On the basis of this theorem, a generator-based algorithm for mining FCIs, called GrAFCI+, is proposed and explained in detail, including its correctness. The comparative effectiveness of the algorithm in terms of scalability is first investigated, along with the compression rate, a measure of the interestingness of a given FI representation. Extensive experiments are further undertaken on eight datasets and four state-of-the-art algorithms, namely DCI_CLOSED*, DCI_PLUS, FPClose, and NAFCP. The results show that the proposed algorithm is more efficient regarding execution time in most cases compared to these algorithms. Because GrAFCI+'s main goal is to address the runtime issue, it pays a memory cost, especially when the support is very small. However, this cost is not high, since GrAFCI+ is bettered by only one of its four competitors in memory utilization, and only for large support values. As an overall assessment, GrAFCI+ gives better results than most of its competitors. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
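For orientation, closedness itself (no proper superset with the same support) can be checked with a brute-force reference miner on toy data; GrAFCI+ instead reaches the closed sets efficiently through their generators:

```python
from itertools import combinations

def frequent_closed_itemsets(transactions, minsup):
    """Enumerate all frequent itemsets by level, then keep those with no
    proper frequent superset of equal support (small-data reference only;
    exponential in the number of items)."""
    items = sorted({i for t in transactions for i in t})
    sets = [frozenset(t) for t in transactions]
    sup = lambda s: sum(1 for t in sets if s <= t)
    freq = {}
    for k in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, k):
            s = frozenset(combo)
            c = sup(s)
            if c >= minsup:
                freq[s] = c
                found = True
        if not found:
            break  # supports only shrink as itemsets grow
    return {s: c for s, c in freq.items()
            if not any(s < t and freq[t] == c for t in freq)}
```

The FCIs form a lossless compression of all frequent itemsets: every frequent itemset's support equals that of its smallest closed superset.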
50. A Nested Two-Stage Clustering Method for Structured Temporal Sequence Data.
- Author
-
Wang, Liang, Narayanan, Vignesh, Yu, Yao-Chi, Park, Yikyung, and Li, Jr-Shin
- Subjects
FOOD diaries ,PROBABILITY measures ,SMART meters ,ALGORITHMS ,DATA mining - Abstract
Mining patterns in temporal sequence data is an important problem across many disciplines. Under appropriate preprocessing procedures, a structured temporal sequence can be organized into a probability measure or a time series representation, which offers the potential to reveal distinctive temporal pattern characteristics. In this paper, we propose a nested two-stage clustering method that integrates the optimal transport and dynamic time warping distances to learn distributional and dynamic shape-based dissimilarity at the respective stages. The proposed clustering algorithm preserves both the distribution and shape patterns present in the data, which are critical for datasets composed of structured temporal sequences. The effectiveness of the method is tested against existing agglomerative and K-shape-based clustering algorithms on Monte Carlo simulated synthetic datasets, and the performance is compared through various cluster validation metrics. Furthermore, we apply the developed method to real-world datasets from three domains: temporal dietary records, online retail sales, and smart meter energy profiles. The expressiveness of the cluster and subcluster centroid patterns shows significant promise of our method for structured temporal sequence data mining. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
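Stage one's distributional dissimilarity can be illustrated with the 1-D optimal transport distance between equal-length samples, plus a greedy grouping; a toy sketch under those assumptions, not the paper's algorithm (names invented), with a shape-based stage such as DTW then refining subclusters inside each group:

```python
def wasserstein_1d(xs, ys):
    """1-D optimal transport distance between two equal-size samples:
    match sorted values position by position."""
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)

def stage_one_groups(samples, eps):
    """Greedy stage-one grouping: each sample joins the first group whose
    representative lies within eps in transport distance, else it starts
    a new group (all samples assumed equal length)."""
    groups = []
    for s in samples:
        for g in groups:
            if wasserstein_1d(s, g[0]) <= eps:
                g.append(s)
                break
        else:
            groups.append([s])
    return groups
```

Note that the transport distance ignores ordering entirely (it sorts the values), which is exactly why a second, shape-aware stage is needed to separate sequences that share a value distribution but differ in temporal shape.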