Journal: knowledge-based systems / Publisher: elsevier b.v. / Topic: data mining - Searchworks@Jio Institute Digital Library Search Results

1. HEPM: High-efficiency pattern mining.

Author: Zhang, Xiaojie, Chen, Guoting, Song, Linqi, Gan, Wensheng, and Song, Yunling
Abstract: Pattern mining (PM) is an important field of data mining and has gained considerable momentum recently, mainly owing to the massive growth of big data. PM often sets attentive objectives such as mining frequent or high utility patterns to obtain attractive patterns. High utility patterns address the defect of frequent patterns that cannot reveal the maximum profit. However, it neglects another vital factor, cost or investment. This paper proposes a new high-efficiency PM problem that considers both utility and investment. The problem aims to find patterns with the maximum profit-to-investment ratio. Our paper is devoted to studying high-efficiency itemsets in transaction databases. We first formulate the criteria for a high-efficiency PM problem. Subsequently, we propose a two-phase algorithm called HEPM and an improved one-phase algorithm called HEPMiner to discover high-efficiency patterns in a transaction database. We design a corresponding pruning strategy within HEPM to reduce the search space. In HEPMiner, we utilize a novel efficiency-list and an estimated efficiency co-occurrence structure in the pruning strategies to further improve the mining performance. Moreover, we derive the upper bounds of efficiency for both algorithms. The experimental results demonstrate the effectiveness and efficiency of our two algorithms. [ABSTRACT FROM AUTHOR]
Published: 2023
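The profit-to-investment ratio described in the abstract above can be illustrated with a small sketch. It assumes, for illustration only, that the efficiency of an itemset is its total utility across the transactions containing it, divided by the summed investment of its items. The function name, data layout, and sample values are hypothetical and are not taken from the paper; this is not the HEPM or HEPMiner algorithm, which add pruning and upper bounds on top of such a measure.

```python
def itemset_efficiency(itemset, transactions, unit_profit, investment):
    """Return utility(itemset) / investment(itemset).

    Assumed definitions (illustrative, not quoted from the paper):
      - utility: sum of quantity * unit profit over every transaction
        that contains all items of the itemset
      - investment: sum of the per-item investment values
    """
    items = set(itemset)
    utility = sum(
        sum(tx[i] * unit_profit[i] for i in items)
        for tx in transactions
        if items.issubset(tx)  # transaction must contain every item
    )
    total_investment = sum(investment[i] for i in items)
    return utility / total_investment


# Hypothetical toy database: each transaction maps item -> purchased quantity.
transactions = [
    {"a": 2, "b": 1},
    {"a": 1},
    {"a": 1, "b": 3, "c": 1},
]
unit_profit = {"a": 5, "b": 2, "c": 10}
investment = {"a": 4, "b": 1, "c": 20}

eff_ab = itemset_efficiency({"a", "b"}, transactions, unit_profit, investment)
eff_c = itemset_efficiency({"c"}, transactions, unit_profit, investment)
```

In this toy data, item c has the highest unit profit but a low efficiency because of its large investment, which is the kind of case a utility-only measure would miss.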

2. An efficient strategy for mining high-efficiency itemsets in quantitative databases.

Author: Huynh, Bao, Tung, N.T., Nguyen, Trinh D.D., Bui, Quang-Thinh, Nguyen, Loan T.T., Yun, Unil, and Vo, Bay
Abstract: The classic problems in itemset mining involve finding frequent itemsets and high-utility itemsets. However, frequent itemset mining has the disadvantage of not paying attention to the profit of products, while high-utility itemset mining does not address the issue of the cost price of the products. Therefore, neither can locate products with high-efficiency value on investment. To overcome these problems, the high-efficiency itemset mining (HEIM) problem was proposed. Despite its practicality, this issue has received little attention. The algorithms proposed to exploit high-efficiency itemsets (HEI) still use ineffective strategies on dense databases and unstrict upper bounds, requiring a lot of time and memory. To address the current issues with HEIM, the paper proposes tight upper bounds for the early pruning of candidates. Several techniques are also proposed, such as combining similar transactions and saving promising transaction locations, to reduce the cost of database scanning. Finally, the techniques are combined to propose a novel way to implement the MHEI (an efficient strategy for Mining High-Efficiency Itemsets in quantitative databases) to optimize the HEIM process. The experimental process also shows that the proposed algorithm has performance better than the state-of-the-art algorithm in HEIM. [ABSTRACT FROM AUTHOR]
Published: 2024

3. From content to links: Social image embedding with deep multimodal model.

Author: Huang, Feiran, Zhang, Xiaoming, Li, Zhoujun, Zhao, Zhonghua, and He, Yueying
Subjects: *IMAGE processing, *DEEP learning, *ONLINE social networks, *IMAGE retrieval, *DATA mining
Abstract: With the popularity of social network, social media data embedding has attracted extensive research interest and boomed many applications, such as image classification and cross-modal retrieval. In this paper, we examine the scenario of social images containing multimodal content (e.g., visual content and textual tags) and connecting with each other (e.g., two images submitted to the same group). In such a case, both the multimodal content and link information provide useful clues for representation learning. Therefore, simply learning the embedding from network structure or data content results in sub-optimal social image representation. In this paper, we propose a Deep Multimodal Attention Networks (DMAN) to combine multimodal content and link information for social image embedding. Specifically, to effectively incorporate the multimodal content, a visual-textual attention model is proposed to encode the fine-granularity correlation between multimodal content, i.e., the alignment between image regions and textual words. To incorporate the network structure for embedding learning, a novel Siamese-Triplet neural network is proposed to model the first-order proximity and the second-order proximity among images. Then the two modules are integrated into a joint deep model for social image embedding. Once the representation has been learned, a wide variety of data mining problems can be solved by using the task-specific algorithms designed for handling vector representations. Extensive experiments are conducted to demonstrate the effectiveness of our approach on multi-label classification and cross-modal search. [ABSTRACT FROM AUTHOR]
Published: 2018

4. Are we meeting a deadline? classification goal achievement in time in the presence of imbalanced data.

Author: Hlosta, Martin, Zdrahal, Zdenek, and Zendulka, Jaroslav
Subjects: *FORMAL description techniques (Computer science), *PROBLEM solving, *DATA mining, *MACHINE learning, *HIGHER education
Abstract: This paper addresses the problem of a finite set of entities which are required to achieve a goal within a predefined deadline. For example, a group of students is supposed to submit a homework by a specified cutoff. Further, we are interested in predicting which entities will achieve the goal within the deadline. The predictive models are built based only on the data from that population. The predictions are computed at various time instants by taking into account updated data about the entities. The first contribution of the paper is a formal description of the problem. The important characteristic of the proposed method for model building is the use of the properties of entities that have already achieved the goal. We call such an approach “Self-Learning”. Since typically only a few entities have achieved the goal at the beginning and their number gradually grows, the problem is inherently imbalanced. To mitigate the curse of imbalance, we improved the Self-Learning method by tackling information loss and by several sampling techniques. The original Self-Learning and the modifications have been evaluated in a case study for predicting submission of the first assessment in distance higher education courses. The results show that the proposed improvements outperform the specified two base-line models and the original Self-Learner, and also that the best results are achieved if domain-driven techniques are utilised to tackle the imbalance problem. We also showed that these improvements are statistically significant using Wilcoxon signed rank test. [ABSTRACT FROM AUTHOR]
Published: 2018

5. Multi-level feature interaction for open knowledge base canonicalization.

Author: Sui, Xuhui, Zhang, Ying, Song, Kehui, Zhou, Baohang, and Yuan, Xiaojie
Abstract: Open Information Extraction (OpenIE) aims to construct expansive open knowledge bases (OKBs) by extracting triples (noun phrase, relation phrase, noun phrase) from unstructured text. One critical problem in OKBs is the lack of canonicalization for noun phrases and relation phrases, leading to the storage of redundant and ambiguous facts. Consequently, open knowledge base canonicalization, which clusters synonymous phrases into the same group, has emerged as an active research area. Existing approaches either leverage fact triples or source context in isolation, or at best, interact them at the clustering level. However, these approaches lack explicit interaction or only loosely couple the two types of knowledge, resulting in the potential loss of valuable intermediate information. In this paper, we propose MuFIC, a novel unsupervised framework that interacts the fact triples and source context at the feature level to address these limitations. In order to capture and integrate fine-grained fact and context knowledge, we design three levels of feature interaction: low-level context-guided feature interaction, mid-level fact-guided feature interaction, and high-level gated fusion feature interaction. Furthermore, we introduce an additional objective function via contrastive learning to improve the quality of extracted features and reduce knowledge-specific noise. Finally, we design a bidirectional feedback mechanism to better guide the learning process of joint features by harnessing side information prototype learning, and to dynamically optimize side information based on the clustering results formed by joint features. Extensive experiments on three public benchmark datasets demonstrate the superiority of our proposed framework. [ABSTRACT FROM AUTHOR]
Published: 2024

6. Novel Privacy-preserving algorithm based on frequent path for trajectory data publishing.

Author: Dong, Yulan and Pi, Dechang
Subjects: *DATA mining, *DATA security, *DATA privacy, *DATA quality, *COMPUTER algorithms
Abstract: Existing location-based services have collected a large amount of location data, which contain users’ personal information and has serious personal privacy leakage threats. Therefore, the preservation of individual privacy when publishing data is receiving increasing attention. Most existing methods of preserving user privacy suffer a serious loss in data usability, resulting in low usability of data. In this paper, we address this problem and present TOPF, a novel approach for preserving privacy in trajectory data publishing based on frequent path. TOPF aims to achieve better quality of trajectory data for publishing and strike a balance between the conflicting goals of data usability and data privacy. To the best of our knowledge, this is the first paper that uses frequent path to preserve data privacy. First, infrequent roads in each trajectory are removed, and a new way is adopted to divide trajectories into candidate groups. A new method for finding the most frequent path is then proposed, and then, the representative trajectory is selected to represent all trajectories within a group. Experimental results show that our algorithm not only effectively guarantees the privacy of the user but also ensures the high usability of the data. [ABSTRACT FROM AUTHOR]
Published: 2018

7. Efficiently mining high utility itemsets with negative unit profits.

Author: Krishnamoorthy, Srikumar
Subjects: *DATA mining, *SET theory, *PROFIT -- Mathematical models, *DATA structures, *DECISION making, *MATHEMATICAL models
Abstract: A High Utility Itemset (HUI) mining is an important problem in the data mining literature that considers utilities of items (such as profits and margins) to discover interesting patterns from transactional databases. Several data structures, pruning strategies and algorithms have been proposed in the literature to efficiently mine high utility itemsets. Most of these works, however, do not consider itemsets with negative unit profits that provide greater flexibility to a decision maker to determine profitable itemsets. This paper aims to advance the state-of-the-art and presents a generalized high utility mining (GHUM) method that considers both positive and negative unit profits. The proposed method uses a simplified utility-list data structure for storing itemset information during the mining process. The paper also introduces a novel utility based anti-monotonic property to improve the performance of HUI mining. Furthermore, GHUM adapts key pruning strategies from the basic HUI mining literature and presents new pruning strategies to significantly improve the performance of mining. The proposed method is evaluated on a set of benchmark sparse and dense datasets and compared against a state-of-the-art method. Rigorous experimental evaluation is performed and implications of the key findings are also presented. In general, GHUM was found to deliver more than an order of magnitude improvement at a fraction of the memory over the state-of-the-art FHN method. [ABSTRACT FROM AUTHOR]
Published: 2018

8. A Selective Multiple Instance Transfer Learning Method for Text Categorization Problems.

Author: Liu, Bo, Xiao, Yanshan, and Hao, Zhifeng
Subjects: *SUPERVISED learning, *PROBLEM solving, *TASK performance, *BIG data, *DATA mining
Abstract: Multiple instance learning (MIL) is a generalization of supervised learning which attempts to learn a distinctive classifier from bags of instances. This paper addresses the problem of the transfer learning-based multiple instance method for text categorization problem. To provide a safe transfer of knowledge from a source task to a target task, this paper proposes a new approach, called selective multiple instance transfer learning (SMITL), which selects the case that the multiple instance transfer learning will work in step one, and then builds a multiple instance transfer learning classifier in step two. Specifically, in the first step, we measure whether the source task and the target task are related or not by investigating the similarity of the positive features of both tasks. In the second step, we construct a transfer learning-based multiple instance method to transfer knowledge from a source task to a target task if both tasks are found to be related in the first step. Our proposed approach explicitly addresses the problem of safe transfer of knowledge for multiple instance learning on the text classification problem. Extensive experiments have shown that SMITL can determine whether the two tasks are related for most data sets, and outperforms classic multiple instance learning methods. [ABSTRACT FROM AUTHOR]
Published: 2018

9. Improved Binary Meerkat Optimization Algorithm for efficient feature selection of supervised learning classification.

Author: Hussien, Reda M., Abohany, Amr A., Abd El-Mageed, Amr A., and Hosny, Khalid M.
Abstract: Feature selection (FS) is a crucial step in machine learning and data mining projects. It aims to remove redundant and uncorrelated features, thus improving the accuracy of models. However, it can be challenging to select the optimal features, especially for datasets with many features. This paper proposes the Binary Meerkat Optimization Algorithm (BMOA) to address the issue of FS. The BMOA selects the most related and useful features. Additionally, an improved BMOA (IBMOA) is introduced, which enhances exploration and exploitation capabilities by incorporating the Periodic Mode Boundary Handling (PMBH) strategy and the Local Search (LS) process. This helps to reduce dimensionality and enhance classification accuracy. To evaluate the significance of selected features, two widely used classifiers, namely k-nearest Neighbor (k-NN) and Support Vector Machine (SVM), were used as quality raters. The proposed IBMOA and the original BMOA were compared on 21 multi-scale and multi-faceted benchmark datasets. The binary structures of eight contemporary algorithms, including Binary Artificial Bee Colony (BABC), Binary Grey Wolf Optimization (BGWO), Binary Sailfish Optimizer (BSFO), Binary Particle Swarm Optimizer (BPSO), Binary Bird Swarm Algorithm (BBSA), Binary Whale Optimization Algorithm (BWOA), Binary Grasshopper Optimization Algorithm (BGOA), and Binary Bat Algorithm (BBA), were also analyzed and compared. The proposed IBMOA algorithm outperforms competitors on small and large datasets, obtaining optimal solutions within a reasonable timeframe. This is true for the proposed IBMOA, which has been statistically proven to be highly competitive using the Wilcoxon rank sum test (with alpha=0.05). [ABSTRACT FROM AUTHOR]
Published: 2024

10. Stepwise optimal scale selection for multi-scale decision tables via attribute significance.

Author: Li, Feng, Hu, Bao Qing, and Wang, Jun
Subjects: *MULTISCALE modeling, *COMPUTER programming, *DECISION logic tables, *COMPUTER software, *DATA mining, *BIG data
Abstract: Hierarchically structured data are very common or even unavoidable for data mining and knowledge discovering from the perspective of granular computing in real-life world. Based on this circumstance, multi-scale information system is introduced by Wu and Leung and extends the theory and application of information system. In such table, objects may take different values under the same attribute measured at different scales. Recently, scale selection is the main issue of multi-scale information system, and optimal scale selection is to choose a proper decision table for final decision making or classification. In this paper, we firstly propose the concept of multi-scale attribute significance, and, in the sense of binary classification, another two equivalent definitions are given. Then based on the concept of significance, this paper introduces a novel approach of stepwise optimal scale selection to obtain one optimal scale combination with less time cost compared with the lattice model. Specially, for inconsistent multi-scale decision tables, different types of consistence are considered with different requirements for optimal scale selection. Finally, five algorithms are designed and six numerical experiments are employed to illustrate the feasibility and efficiency of the proposed model. [ABSTRACT FROM AUTHOR]
Published: 2017

11. A survey of the applications of text mining in financial domain.

Author: Kumar, B. Shravan and Ravi, Vadlamani
Subjects: *DATA mining, *APPLICATION software, *PROBLEM solving, *LOGICAL prediction, *FUTURES studies
Abstract: Text mining has found a variety of applications in diverse domains. Of late, prolific work is reported in using text mining techniques to solve problems in financial domain. The objective of this paper is to provide a state-of-the-art survey of various applications of Text mining to finance. These applications are categorized broadly into FOREX rate prediction, stock market prediction, customer relationship management (CRM) and cyber security. Since finance is a service industry, these problems are paramount in operational and customer growth aspects. We reviewed 89 research papers that appeared during the period 2000–2016, highlighted some of the issues, gaps, key challenges in this area and proposed some future research directions. Finally, this review can be extremely useful to budding researchers in this area, as many open problems are highlighted. [ABSTRACT FROM AUTHOR]
Published: 2016

12. Clustering time-stamped data using multiple nonnegative matrices factorization.

Author: Huang, Xiaohui, Ye, Yunming, Xiong, Liyan, Wang, Shaokai, and Yang, Xiaofei
Subjects: *DATA mining, *CLUSTER analysis (Statistics), *UBIQUITOUS computing, *BIG data, *INFORMATION theory, *CONSTRAINT satisfaction
Abstract: Time-stamped data are ubiquitous in our daily life, such as twitter data, academic papers and sensor data. Finding clusters and their evolutionary trends in time-stamped data sets are receiving increasing attention from researchers. Most existing methods, however, can only tackle the clustering problem of a data set without time-stamped information which is inherent in almost all the data objects. Actually, not only the performance can be improved by effectively incorporating the time-stamped information in the clustering process on most data sets, but also we can find the evolutionary trends of the clusters with time information. In this paper, we introduce an approach for clustering time-stamped data and discovering the evolutionary trends of the clusters by using Multiple Nonnegative Matrices Factorization (MNMF) with smooth constraint over time. To utilize time-stamped information in the clustering process, an extra object-time matrix is constructed in our proposed method. Then, we jointly factorize multiple feature matrices using smooth constraint to perform the object-time matrix to obtain the clusters and their evolutionary trends. Experimental results on real data sets demonstrate that our proposed approach outperforms the comparative algorithms with respect to Fscore, NMI or Entropy. [ABSTRACT FROM AUTHOR]
Published: 2016

13. Estimating the structural diversity introduced by decision forest algorithms: A probabilistic approach.

Author: Ip, Ryan H.L., Bewong, Michael, Adnan, Md. Nasim, and Islam, Md. Zahidul
Abstract: Structurally diverse decision trees are important for knowledge discovery and classification/prediction accuracy. Over the years, researchers have devoted much effort to the development of algorithms to increase diversity among the trees within an ensemble. While Kappa is commonly used to measure diversity among the decision trees, it does not measure the ability of the tree building algorithms to introduce diversity. Further, Kappa does not consider the structural diversity amongst the trees. Instead, Kappa measures the diversity of the predictions made from the trees produced, and are dependent on the datasets used. This paper presents a novel data-independent metric, called R index, for measuring the diversity that can be introduced by a decision forest algorithm without building the entire decision forest. The proposed measure is applied to five well-known algorithms that involve bagging and random subspacing. An efficient practical approach for calculating the R index empirically – R finder – is also proposed, and is implemented. Both R finder and Kappa were applied to thirty-two publicly available benchmark datasets under various algorithms to estimate the resulting diversity. The results indicate a generally strong negative correlation between R finder and Kappa, implying that R finder is effective at estimating the diversity of trees without the added computational costs associated with calculating Kappa. [ABSTRACT FROM AUTHOR]
Published: 2024

14. An efficient algorithm for mining the top-k high utility itemsets, using novel threshold raising and pruning strategies.

Author: Duong, Quang-Huy, Liao, Bo, Fournier-Viger, Philippe, and Dam, Thu-Lan
Subjects: *ALGORITHMS, *DATA mining, *STRATEGIC planning, *DATABASES, *NUMERICAL calculations
Abstract: Top-k high utility itemset mining is the process of discovering the k itemsets having the highest utilities in a transactional database. In recent years, several algorithms have been proposed for this task. However, it remains very expensive both in terms of runtime and memory consumption. The reason is that current algorithms often generate a huge amount of candidate itemsets and are unable to prune the search space effectively. In this paper, we address this issue by proposing a novel algorithm named kHMC to discover the top-k high utility itemsets more efficiently. Unlike several algorithms for top-k high utility itemset mining, kHMC discovers high utility itemsets using a single phase. Furthermore, it employs three strategies named RIU, CUD, and COV to raise its internal minimum utility threshold effectively, and thus reduce the search space. The COV strategy introduces a novel concept of coverage. The concept of coverage can be employed to prune the search space in high utility itemset mining, or to raise the threshold in top-k high utility itemset mining, as proposed in this paper. Furthermore, kHMC relies on a novel co-occurrence pruning technique named EUCPT to avoid performing costly join operations for calculating the utilities of itemsets. Moreover, a novel pruning strategy named TEP is proposed for reducing the search space. To evaluate the performance of the proposed algorithm, extensive experiments have been conducted on six datasets having various characteristics. Results show that the proposed algorithm outperforms the state-of-the-art TKO and REPT algorithms for top-k high utility itemset mining both in terms of memory consumption and runtime. [ABSTRACT FROM AUTHOR]
Published: 2016
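The idea of raising an internal minimum utility threshold during top-k mining can be sketched generically. The example below is not kHMC and does not implement its RIU, CUD, or COV strategies. It only shows the basic mechanism such strategies build on: keep the k best utilities seen so far in a min-heap and use the smallest of them as the current threshold. The class name and sample utilities are hypothetical.

```python
import heapq


class TopKThreshold:
    """Track the k highest utilities seen and expose a rising threshold."""

    def __init__(self, k, initial=0):
        self.k = k
        self.initial = initial  # threshold used until k utilities are known
        self.heap = []          # min-heap holding at most k utilities

    @property
    def threshold(self):
        # Once k candidates are known, nothing below the k-th best can
        # belong to the final top-k, so the heap minimum is a safe threshold.
        return self.heap[0] if len(self.heap) == self.k else self.initial

    def offer(self, utility):
        """Record a candidate utility; return True if it enters the top-k."""
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, utility)
            return True
        if utility > self.heap[0]:
            heapq.heapreplace(self.heap, utility)  # evict the current minimum
            return True
        return False


tracker = TopKThreshold(k=2)
history = []
for u in [10, 4, 7, 3, 12]:  # hypothetical itemset utilities
    tracker.offer(u)
    history.append(tracker.threshold)
```

A search procedure could skip any branch whose utility upper bound falls below `tracker.threshold`, which is why raising the threshold early shrinks the search space.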

15. Recompiling learning processes from event logs.

Author: Vidal, Juan C., Vázquez-Barreiros, Borja, Lama, Manuel, and Mucientes, Manuel
Subjects: *MACHINE learning, *DECISION trees, *PROBLEM solving, *MATHEMATICAL models, *DATA mining, *COMPUTER software
Abstract: In this paper a novel approach to reuse units of learning (UoLs) – such as courses, seminars, workshops, and so on – is presented. Virtual learning environments (VLEs) do not usually provide the tools to export in a standardized format the designed UoLs, making thus more challenging their reuse in a different platform. Taking into account that many of these VLEs are legacy or proprietary systems, the implementation of a specific software is usually out of place. However, these systems have in common that they record the events of students and teachers during the learning process. The approach presented in this paper makes use of these logs (i) to extract the learning flow structure using process mining, and (ii) to obtain the underlying rules that control the adaptive learning of students by means of decision tree learning. Finally, (iii) the process structure and the adaptive rules are recompiled in IMS Learning Design (IMS LD) – the de facto educational modeling language standard. The three steps of our approach have been validated with UoLs from different domains. [ABSTRACT FROM AUTHOR]
Published: 2016

16. Tutorial on practical tips of the most influential data preprocessing algorithms in data mining.

Author: García, Salvador, Luengo, Julián, and Herrera, Francisco
Subjects: *DATA mining, *ELECTRONIC data processing, *COMPUTER algorithms, *FEATURE selection, *BENCHMARK problems (Computer science)
Abstract: Data preprocessing is a major and essential stage whose main goal is to obtain final data sets that can be considered correct and useful for further data mining algorithms. This paper summarizes the most influential data preprocessing algorithms according to their usage, popularity and extensions proposed in the specialized literature. For each algorithm, we provide a description, a discussion on its impact, and a review of current and further research on it. These most influential algorithms cover missing values imputation, noise filtering, dimensionality reduction (including feature selection and space transformations), instance reduction (including selection and generation), discretization and treatment of data for imbalanced preprocessing. They constitute all among the most important topics in data preprocessing research and development. This paper emphasizes on the most well-known preprocessing methods and their practical study, selected after a recent, generic book on data preprocessing that does not deepen on them. This manuscript also presents an illustrative study in two sections with different data sets that provide useful tips for the use of preprocessing algorithms. In the first place, we graphically present the effects on two benchmark data sets for the preprocessing methods. The reader may find useful insights on the different characteristics and outcomes generated by them. Secondly, we use a real world problem presented in the ECDBL’2014 Big Data competition to provide a thorough analysis on the application of some preprocessing techniques, their combination and their performance. As a result, five different cases are analyzed, providing tips that may be useful for readers. [ABSTRACT FROM AUTHOR]
Published: 2016

17. A biology-inspired, data mining framework for extracting patterns in sexual cyberbullying data.

Author: Potha, N., Maragoudakis, M., and Lyras, D.
Subjects: *DATA mining, *CYBERBULLYING, *SOCIAL media, *ONLINE social networks, *COMPUTER crimes
Abstract: With the rapid growth of social media, users, especially adolescents, are spending significant amount of time on various social networking sites to connect with others, to share information, and to pursue common interests. However, as social networking has become widespread, certain people are finding illegal and unethical ways to use these communities as means for opening the door of inappropriate online activities. Thus, they are providing an open way for cybercrimes such as cyberbullying. In this paper, we deal with the aforementioned issue as a time series modelling methodology, aiming at the recognition of bullying patterns within the questions posed by a predator to his victims. Given a set of real world transcripts (i.e. the whole set of predator's questions), in which each question is numerically labelled in terms of severity, we first model each set of predator's questions as a time series. The next step is the main contribution of this paper, in terms of changing the representation scheme from time series data into symbolic representation. More specifically, inspired by the Multiple Sequence Alignment (MSA) method, commonly used in computational biology for identifying conserved regions of similarity among raw molecular data, we represent the set of signals according to a SAX (Symbolic Aggregate approXimation) symbolic representation, transforming each signal into a symbol string. The main rationale behind this adoption lies to the fact that the collected cyberbullying data can be converted to string sequences via SAX conversion, which in turn can be aligned, thus revealing conserved temporal patterns or slight variations in the attacking strategies of the predators. Experimental results, based on the clustering improvement of the aforementioned data using the extracted patterns instead of the time series data, justify our claims. [ABSTRACT FROM AUTHOR]
Published: 2016
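The SAX conversion mentioned in the abstract above can be sketched in a few lines. This is a minimal illustration of the standard SAX steps (z-normalize, average over equal-width segments, map each average to a letter using equiprobable Gaussian breakpoints), not the authors' implementation. The function name and the requirement that the series length be divisible by the number of segments are simplifying assumptions made here.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def sax(series, n_segments, alphabet="abcd"):
    """Convert a numeric series into a SAX symbol string."""
    if len(series) % n_segments != 0:
        raise ValueError("series length must be divisible by n_segments")

    # Step 1: z-normalize (a constant series maps to all zeros).
    mu, sigma = mean(series), pstdev(series)
    z = [0.0 if sigma == 0 else (x - mu) / sigma for x in series]

    # Step 2: piecewise aggregate approximation (mean of each segment).
    width = len(z) // n_segments
    paa = [mean(z[i * width:(i + 1) * width]) for i in range(n_segments)]

    # Step 3: breakpoints that split the standard normal into
    # len(alphabet) equally probable regions.
    a = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(i / a) for i in range(1, a)]

    # Step 4: map each segment mean to the letter of its region.
    return "".join(alphabet[bisect_right(breakpoints, v)] for v in paa)


# Hypothetical severity scores rising over a conversation.
word = sax([0, 0, 5, 5, 10, 10], n_segments=3, alphabet="abc")
```

Once every transcript is reduced to such a string, string-alignment methods like the MSA approach named in the abstract can be applied to the symbol sequences.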

18. Double-quantitative fusion of accuracy and importance: Systematic measure mining, benign integration construction, hierarchical attribute reduction.

Author: Zhang, Xianyong and Miao, Duoqian
Subjects: *DATA mining, *ATTRIBUTE focusing (Data mining), *PROBABILITY theory, *APPROXIMATION theory, *UNCERTAINTY (Information theory)
Abstract: Uncertainty measure mining and applications are fundamental, and it is possible for double-quantitative fusion to acquire benign measures via heterogeneity and complementarity. This paper investigates the double-quantitative fusion of relative accuracy and absolute importance to provide systematic measure mining, benign integration construction, and hierarchical attribute reduction. (1) First, three-way probabilities and measures are analyzed. Thus, the accuracy and importance are systematically extracted, and both are further fused into importance-accuracy (IP-Accuracy), a synthetic causality measure. (2) By sum integration, IP-Accuracy gains a bottom-top granulation construction and granular hierarchical structure. IP-Accuracy holds benign granulation monotonicity at both the knowledge concept and classification levels. (3) IP-Accuracy attribute reduction is explored based on decision tables. A hierarchical reduct system is thereby established, including qualitative/quantitative reducts, tolerant/approximate reducts, reduct hierarchies, and heuristic algorithms. Herein, the innovative tolerant and approximate reducts quantitatively approach/expand/weaken the ideal qualitative reduct. (4) Finally, a decision table example is provided for illustration. This paper performs double-quantitative fusion of causality measures to systematically mine IP-Accuracy, and this measure benignly constructs a granular computing platform and hierarchical reduct system. By resorting to a monotonous uncertainty measure, this study provides an integration-evolution strategy of granular construction for attribute reduction. [ABSTRACT FROM AUTHOR]
Published: 2016
Full Text: View/download PDF

19. An uncertainty-based approach: Frequent itemset mining from uncertain data with different item importance.

Author: Lee, Gangin, Yun, Unil, and Ryang, Heungmo
Subjects: *UNCERTAINTY (Information theory), *DATA mining, *ELECTRONIC data processing, *DATABASES, *COMPUTER algorithms
Abstract: Since itemset mining was proposed, various approaches have been devised, ranging from processing simple item-based databases to dealing with more complex databases including sequence, utility, or graph information. In particular, in contrast to the mining approaches that process databases containing exact presence or absence information of items, uncertain pattern mining finds meaningful patterns from uncertain databases with items’ existential probability information. However, traditional uncertain mining methods have a problem in that they cannot apply the importance of each item obtained from the real world to the mining process. In this paper, to solve this problem and perform uncertain itemset mining operations more efficiently, we propose a new uncertain itemset mining algorithm that additionally considers the importance of items, such as weight constraints. In our algorithm, both items’ existential probabilities and weight factors are considered; as a result, we can selectively obtain more meaningful itemsets with high importance and existential probabilities. In addition, the algorithm can operate more quickly with less memory by efficiently reducing the number of calculations causing useless itemset generations. Experimental results in this paper show that the proposed algorithm is more efficient and scalable than state-of-the-art methods. [ABSTRACT FROM AUTHOR]
Published: 2015
Full Text: View/download PDF

20. A survey on opinion mining and sentiment analysis: Tasks, approaches and applications.

Author: Ravi, Kumar and Ravi, Vadlamani
Subjects: *SENTIMENT analysis, *SOCIAL media, *DATA mining, *WORD-of-mouth communication, *MACHINE learning
Abstract: With the advent of Web 2.0, people became more eager to express and share their opinions on the web regarding day-to-day activities and global issues as well. The evolution of social media has also contributed immensely to these activities, thereby providing us a transparent platform to share views across the world. These electronic Word of Mouth (eWOM) statements expressed on the web are prevalent in the business and service industries, enabling customers to share their points of view. In the last one and a half decades, research communities, academia, public and service industries have been working rigorously on sentiment analysis, also known as opinion mining, to extract and analyze public mood and views. In this regard, this paper presents a rigorous survey on sentiment analysis, which portrays views presented by over one hundred articles published in the last decade regarding necessary tasks, approaches, and applications of sentiment analysis. Several sub-tasks need to be performed for sentiment analysis, which in turn can be accomplished using various approaches and techniques. This survey, covering literature published during 2002–2015, is organized on the basis of sub-tasks to be performed, machine learning and natural language processing techniques used, and applications of sentiment analysis. The paper also presents open issues along with a summary table of a hundred and sixty-one articles. [ABSTRACT FROM AUTHOR]
Published: 2015
Full Text: View/download PDF

21. Incremental learning from news events.

Author: Hu, Linmei, Shao, Chao, Li, Juanzi, and Ji, Heng
Subjects: *MACHINE learning, *PROBLEM solving, *DATA mining, *DATABASES, *FEATURE selection
Abstract: As news events on the same subject occur, our knowledge about the subject will accumulate and become more comprehensive. In this paper, we formally define the problem of incremental knowledge learning from similar news events on the same subject, where each event consists of a set of news articles reporting about it. The knowledge is represented by a topic hierarchy presenting topics at different levels of granularity. Though topic (hierarchy) mining from text has been researched a lot, incremental learning from similar events remains underdeveloped. In this paper, we propose a scalable two-phase framework to incrementally learn a topic hierarchy for a subject from events on the subject as the events occur. First, we recursively construct a topic hierarchy for each event based on a novel topic model considering the named entities and entity types in news articles. Second, we incrementally merge the topic hierarchies through top-down hierarchical topic alignment. Extensive experimental results on real datasets demonstrate the effectiveness and efficiency of the proposed framework in terms of both qualitative and quantitative measures. [ABSTRACT FROM AUTHOR]
Published: 2015
Full Text: View/download PDF

22. ROSEFW-RF: The winner algorithm for the ECBDL’14 big data competition: An extremely imbalanced big data bioinformatics problem.

Author: Triguero, Isaac, del Río, Sara, López, Victoria, Bacardit, Jaume, Benítez, José M., and Herrera, Francisco
Subjects: *BIOINFORMATICS, *BIG data, *DATA mining, *MACHINE learning, *EVOLUTIONARY computation, *DECISION making, *COMPUTER algorithms
Abstract: The application of data mining and machine learning techniques to biological and biomedical data continues to be a ubiquitous research theme in current bioinformatics. The rapid advances in biotechnology are allowing us to obtain and store large quantities of data about cells, proteins, genes, etc., that should be processed. Moreover, in many of these problems, such as contact map prediction, the problem tackled in this paper, it is difficult to collect representative positive examples. Learning under these circumstances, known as imbalanced big data classification, may not be straightforward for most of the standard machine learning methods. In this work we describe the methodology that won the ECBDL’14 big data challenge for a bioinformatics big data problem. This algorithm, named ROSEFW-RF, is based on several MapReduce approaches to (1) balance the class distribution through random oversampling, (2) detect the most relevant features via an evolutionary feature weighting process and a threshold to choose them, (3) build an appropriate Random Forest model from the pre-processed data and finally (4) classify the test data. Across the paper, we detail and analyze the decisions made during the competition, showing an extensive experimental study that characterizes the way our methodology works. From this analysis we can conclude that this approach is very suitable for tackling large-scale bioinformatics classification problems. [ABSTRACT FROM AUTHOR]
Published: 2015
Full Text: View/download PDF
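
Illustrative note (not part of the catalog record): step (1) in the abstract above, balancing the class distribution through random oversampling, can be sketched in a few lines of Python. This is only a single-machine sketch under stated assumptions, not the authors' MapReduce implementation; the helper name `random_oversample`, its `seed` parameter, and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Hypothetical helper: duplicate randomly chosen samples of each
    smaller class until every class matches the largest class in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the majority class
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        # All original samples belonging to this class.
        pool = [s for s, y in zip(samples, labels) if y == cls]
        # Draw with replacement to fill the gap to the majority size.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_samples.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_samples, out_labels


# Toy data (assumed): four negative examples and one positive example.
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

After the call, both classes have the same number of examples; the added positives are exact copies of the single original positive, which is the usual trade-off of random oversampling.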

23. A unified motion planning method for parking an autonomous vehicle in the presence of irregularly placed obstacles.

Author: Li, Bai and Shao, Zhijiang
Subjects: *COMPARATIVE studies, *DATA mining, *TRAFFIC engineering, *DYNAMIC models, *COMPUTER simulation
Abstract: This paper proposes a motion planner for autonomous parking. Compared to the prevailing and emerging studies that handle specific or regular parking scenarios only, our method describes various kinds of parking cases in a unified way, regardless of whether they are regular parking scenarios (e.g., parallel, perpendicular or echelon parking cases) or not. First, we formulate a time-optimal dynamic optimization problem with vehicle kinematics, collision-avoidance conditions and mechanical constraints strictly described. Thereafter, an interior-point simultaneous approach is introduced to solve that formulated dynamic optimization problem. Simulation results validate that our proposed motion planning method can tackle general parking scenarios. The tested parking scenarios in this paper can be regarded as benchmark cases to evaluate the efficiency of methods that may emerge in the future. Our established dynamic optimization problem is an open and unified framework, where other complicated user-specific constraints/optimization criteria can be handled without additional difficulty, provided that they are expressed explicitly through inequalities/polynomials. This proposed motion planner may be suitable for the next-generation intelligent parking-garage system. [ABSTRACT FROM AUTHOR]
Published: 2015
Full Text: View/download PDF

24. A prioritization model for locating relief logistic centers using analytic hierarchy process with interval comparison matrix.

Author: Bozorgi-Amiri, Ali and Asvadi, Saman
Subjects: *HIERARCHY (Linguistics), *NATURAL disasters, *LOGISTICS, *DATA mining, *OPERATING costs
Abstract: When natural disasters happen, relief logistic centers (RLCs) and the quality of their services become absolutely important. In other words, choosing proper locations for RLCs has a direct impact on operating costs and timeliness of responses to the rising demands. This paper aims at proposing a decision support system for prioritizing RLC locations to facilitate providing emergency help when natural disasters occur. The present study focuses on considering availability, risk, technical issues, cost and coverage in locating RLCs. It is assumed that applying the analytic hierarchy process (AHP) can facilitate the problem of locating these centers. The most important step in this process is establishing pair-wise comparisons for the criteria and alternatives. As it is more logical to use interval comparisons instead of crisp ones in real-world problems due to some considerations, this paper has used two decision-making methods known as lexicographic goal programming (LGP) and two-step logarithmic goal programming (TLGP) to derive priorities from pair-wise matrices. To assess the proposed method, a case study of Tehran, the capital city of Iran, has also been discussed. [ABSTRACT FROM AUTHOR]
Published: 2015
Full Text: View/download PDF

25. Coverage-based resampling: Building robust consolidated decision trees.

Author: Ibarguren, Igor, Pérez, Jesús M., Muguerza, Javier, Gurrutxaga, Ibai, and Arbelaitz, Olatz
Subjects: *RESAMPLING (Statistics), *ROBUST control, *DECISION trees, *DATA mining, *DATA analysis, *MACHINE learning
Abstract: The class imbalance problem has attracted a lot of attention from the data mining community recently, becoming a current trend in machine learning research. The Consolidated Tree Construction (CTC) algorithm was proposed as an algorithm to solve a classification problem involving a high degree of class imbalance without losing the explaining capacity, a desirable characteristic of single decision trees and rule sets. CTC works by resampling the training sample and building a tree from each subsample, in a similar manner to ensemble classifiers, but applying the ensemble process during the tree construction phase, resulting in a unique final tree. In the ECML/PKDD 2013 conference the term “Inner Ensembles” was coined to refer to such methodologies. In this paper we propose a resampling strategy for classification algorithms that use multiple subsamples. This strategy is based on the class distribution of the training sample to ensure a minimum representation of all classes when resampling. This strategy has been applied to CTC over different classification contexts. A robust classification algorithm should not just be able to rank in the top positions for certain classification problems but should be able to excel when faced with a broad range of problems. In this paper we establish the robustness of the CTC algorithm against a wide set of classification algorithms with explaining capacity. [ABSTRACT FROM AUTHOR]
Published: 2015
Full Text: View/download PDF

26. Promoting the performance of vertical recommendation systems by applying new classification techniques.

Author: Saleh, Ahmed I., El Desouky, Ali I., and Ali, Shereen H.
Subjects: *PERFORMANCE evaluation, *RECOMMENDER systems, *WORKLOAD of computer networks, *DATA mining, *INTERNET users, *FUZZY systems
Abstract: Recommender systems (RSs) have proven to be valuable means for online users to cope with the information overload and have become one of the most powerful and popular tools in electronic commerce. RSs are software tools providing suggestions for items of interest to users; hence, they typically apply techniques and methodologies from Data Mining. The most frequently used technique is the classification as it matches the aims of RSs that basically classify items based on user’s preferences. The main contribution of this paper is in the area of applying classification techniques to enhance the performance of RSs. In this paper, an Intelligent Adaptive Vertical Recommendation (IAVR) system will be introduced. IAVR recommends text documents related to a specific domain. Basically, the paper concentrates on the first phase of IAVR, which contains two modules; the first is a distiller, while the second is a multi-class classifier. The proposed distiller is employed as a binary classifier that elects documents related to the domain of interest. It is built upon a novel neuro-fuzzy system as well as a modified K Nearest Neighbors (KNN) classifier. On the other hand, the proposed multi-class classifier merges a new instance of Naïve Bayes (NB) classifier, that depends on a proposed learning technique called “accumulative learning”, with association rules. Experimental results have proven the effectiveness of the proposed classifiers, which accordingly promote the overall system’s recommendation accuracy. [ABSTRACT FROM AUTHOR]
Published: 2015
Full Text: View/download PDF

27. Efficient approach for mining high-utility patterns on incremental databases with dynamic profits.

Author: Kim, Sinyoung, Kim, Hanju, Cho, Myungha, Kim, Hyeonmo, Vo, Bay, Lin, Jerry Chun-Wei, and Yun, Unil
Abstract: High-utility itemset mining (HUIM) is one of the heavily studied fields of data mining, which is due to its high compatibility with real-world applications. HUIM is a process of extracting a complete set of interesting patterns by considering the importance and the quantity of each item. Incremental high-utility itemset mining (IHUIM) further increases the compatibility of HUIM by mining from a data stream instead of a static database. It does this by mining interesting patterns in a single database scan and storing prior knowledge in a compact data structure. However, there is a critical limitation, because the importance of each item has to be guaranteed to remain unchanged throughout the data stream. This limitation hinders the compatibility of IHUIM to the real-world applications, because the importance of an item changes from time to time in the real world. Conventional IHUIM approaches consequently have to use outdated information when mining interesting patterns. This paper proposes a novel problem of mining high-utility itemsets in an incremental and dynamic profit environment in order to account for the fluctuation of the item's importance. Furthermore, a novel approach to mining patterns in this type of environment is introduced using an efficient list structure and tight upper bounds. Experiments on real and synthetic datasets show that the proposed approach performs well compared to the state-of-the-art approaches in terms of the runtime and memory usage, and it scales better than the other approaches. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

28. Multi-strategy ensemble binary hunger games search for feature selection.

Author: Ma, Benedict Jun, Liu, Shuai, and Heidari, Ali Asghar
Subjects: *FEATURE selection, *CHAOS theory, *HUNGER, *MACHINE learning, *GLOBAL optimization, *DATA mining
Abstract: Feature selection is a crucial preprocessing step in the sphere of machine learning and data mining, devoted to reducing the data dimensionality to improve the performance of learning models. In this paper, a vigorous metaheuristic named hunger games search (HGS) is integrated with a multi-strategy (MS) framework, including chaos theory, greedy selection, and vertical crossover, to boost search equilibrium between explorative and exploitative cores. The new MS-HGS algorithm is developed for global optimization, and its binary variant MS-bHGS is applied to the feature selection problem particularly. To evaluate and validate the performance of our proposed approach, on the one hand, MS-HGS is compared with HGS alongside single strategy embedded HGS on 23 benchmark functions and compared with seven state-of-the-art algorithms on IEEE CEC 2017 test suite. On the other hand, MS-bHGS is employed for feature selection in 20 datasets from the UCI repository and compared with three groups of methods, i.e., traditional, recent, and enhanced, respectively. The relevant experimental results confirm that MS-bHGS exceeds bHGS and most existing techniques in terms of classification accuracy, the number of selected features, fitness values, and execution time. Overall, this paper's findings suggest that MS-HGS is a superior optimizer, and MS-bHGS can be considered a valuable wrapper-mode feature selection technique. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

29. Adaptive feature fusion for time series classification.

Author: Wang, Tian, Liu, Zhaoying, Zhang, Ting, Hussain, Syed Fawad, Waqas, Muhammad, and Li, Yujian
Subjects: *DATA mining, *CLASSIFICATION, *PROBLEM solving
Abstract: Time series classification is one of the most critical and challenging problems in data mining, which exists widely in various fields and has essential research significance. However, to improve the accuracy of time series classification is still a challenging task. In this paper, we propose an Adaptive Feature Fusion Network (AFFNet) to enhance the accuracy of time series classification. The network can adaptively fuse multi-scale temporal features and distance features of time series for classification. Specifically, the main work of this paper includes three aspects: firstly, we propose a multi-scale dynamic convolutional network to extract multi-scale temporal features of time series. Thus, it retains the high efficiency of dynamic convolution and can extract multi-scale data features. Secondly, we present a distance prototype network to extract the distance features of time series. This network obtains the distance features by calculating the distance between the prototype and embedding. Finally, we construct an adaptive feature fusion module to effectively fuse multi-scale temporal and distance features, solving the problem that two features with different semantics cannot be effectively fused. Experimental results on a large number of UCR datasets indicate that our AFFNet achieves higher accuracies than state-of-the-art models on most datasets, as well as on the WISDM, HAR and Opportunity datasets, demonstrating its effectiveness. • Adaptive Feature Fusion Network (AFFNet) to enhance time series classification accuracy. • Multi-scale dynamic convolutional network to extract multi-scale temporal features. • Distance prototype network to extract the distance features of time series. • Fuse multi-scale temporal and distance features time series for classification. • Experimental results on a large number of UCR datasets. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

30. Graph-based approach for outlier detection in sequential data and its application on stock market and weather data.

Author: Rahmani, Ali, Afra, Salim, Zarour, Omar, Addam, Omar, Koochakzadeh, Negar, Kianmehr, Keivan, Alhajj, Reda, and Rokne, Jon
Subjects: *GRAPH theory, *OUTLIER detection, *DATA analysis, *STOCK exchanges, *WEATHER, *COMPUTER networks, *HURRICANES
Abstract: Outlier detection has a large variety of applications ranging from detecting intrusion in a computer network, to forecasting hurricanes and tornados in weather data, to identifying indicators of potential crisis in stock market data, etc. The problem of finding outliers in sequential data has been widely studied in the data mining literature and many techniques have been developed to tackle the problem in various application domains. However, many of these techniques rely on the peculiar characteristics of a specific type of data to detect the outliers. As a result, they cannot be easily applied to different types of data in other application domains; they should at least be tuned and customized to adapt to the new domain. They also may need a certain amount of training data to build their models. This makes them hard to apply, especially when only a limited amount of data is available. The work described in this paper tackles the problem by proposing a graph-based approach for the discovery of contextual outliers in sequential data. The developed algorithm offers a higher degree of flexibility and requires less information about the nature of the analyzed data compared to the previous approaches described in the literature. In order to validate our approach, we conducted experiments on stock market and weather data; we compared the results with the results from our previous work. Our analysis of the results demonstrates that the algorithm proposed in this paper is successful and effective in detecting outliers in data from different domains, one financial and the other meteorological. [Copyright Elsevier]
Published: 2014
Full Text: View/download PDF

31. Log based business process engineering using fuzzy web service discovery.

Author: Shafiq, Omair, Alhajj, Reda, and Rokne, Jon
Subjects: *FUZZY systems, *WEB services, *BUSINESS process management, *INFORMATION storage & retrieval systems, *MACHINE learning, *DATA mining, *CONSUMERS
Abstract: Business process engineering and mining is a technique that allows discovery, analysis and modeling of possible Business Processes based on information gathered from enterprise information systems. Most currently available business process engineering and mining techniques either focus on machine learning techniques to mine, discover and model any possible Business Processes from raw data, or use semantically-enabled process models and service descriptions to construct and represent complex Business Processes. However, in real-life scenarios, not all the required services are always available, and hence exact matching of the services in order to construct a Business Process is not possible. In this paper, we present our approach of using fuzzy Web Service discovery to construct and represent Business Processes. It helps in relaxing the matching criteria of Web Services, and allows service consumers to specify business requirements in a more fuzzy way, and hence increases the possibility of finding required Web Services that could construct Business Processes. The paper presents the proposed solution, then reports and discusses the evaluation. [Copyright Elsevier]
Published: 2014
Full Text: View/download PDF

32. A method for extracting rules from spatial data based on rough fuzzy sets.

Author: Bai, Hexiang, Ge, Yong, Wang, Jinfeng, Li, Deyu, Liao, Yilan, and Zheng, Xiaoying
Subjects: *SPATIAL data structures, *DATA extraction, *ROUGH sets, *FUZZY sets, *DATA mining, *SOFT computing
Abstract: With the development of data mining and soft computing techniques, it becomes possible to automatically mine knowledge from spatial data. Spatial rule extraction from spatial data with uncertainty is an important issue in spatial data mining. Rough set theory is an effective tool for rule extraction from data with roughness. In our previous studies, the rough set method has been successfully used in the analysis of social and environmental causes of neural tube birth defects. However, both roughness and fuzziness may co-exist in spatial data because of the complexity of the object and the subjective limitation of human knowledge. The situation of fuzzy decisions, which is often encountered in spatial data, is beyond the capability of classical rough set theory. This paper presents a model based on rough fuzzy sets to extract spatial fuzzy decision rules from spatial data that simultaneously have two types of uncertainties, roughness and fuzziness. Fuzzy entropy and fuzzy cross entropy are used to measure accuracies of the fuzzy decisions on unseen objects using the rules extracted. An example of neural tube birth defects is given in this paper. The identification result from the rough fuzzy set based model was compared with those from two classical rule extraction methods and three commonly used fuzzy set based rule extraction models. The comparison results support that the rule extraction model established is effective in dealing with spatial data which have roughness and fuzziness simultaneously. [Copyright Elsevier]
Published: 2014
Full Text: View/download PDF

33. Twitter user profiling based on text and community mining for market analysis.

Author: Ikeda, Kazushi, Hattori, Gen, Ono, Chihiro, Asoh, Hideki, and Higashino, Teruo
Subjects: *MARKETING research, *DATA mining, *ESTIMATION theory, *DEMOGRAPHIC databases, *INFORMATION retrieval
Abstract: This paper proposes demographic estimation algorithms for profiling Twitter users, based on their tweets and community relationships. Many people post their opinions via social media services such as Twitter. This huge volume of opinions, expressed in real time, has great appeal as a novel marketing application. When automatically extracting these opinions, it is desirable to be able to discriminate based on user demographics, because the ratio of positive and negative opinions differs depending on demographics such as age, gender, and residence area, all of which are essential for market analysis. In this paper, we propose a hybrid text-based and community-based method for the demographic estimation of Twitter users, where these demographics are estimated by tracking the tweet history and clustering of followers/followees. Our experimental results from 100,000 Twitter users show that the proposed hybrid method improves the accuracy of the text-based method. The proposed method is applicable to various user demographics and is suitable even for users who only tweet infrequently. [Copyright Elsevier]
Published: 2013
Full Text: View/download PDF

34. 2-Tuple linguistic hybrid arithmetic aggregation operators and application to multi-attribute group decision making.

Author: Wan, Shu-Ping
Subjects: *COMPUTER arithmetic, *AGGREGATION operators, *MULTIPLE criteria decision making, *ATTRIBUTE focusing (Data mining), *OPERATOR theory, *DATA mining
Abstract: The focus of this paper is on multi-attribute group decision making (MAGDM) problems in which the attribute values, attribute weights, and expert weights are all in the form of 2-tuple linguistic information, which are solved by developing a new decision method based on the 2-tuple linguistic hybrid arithmetic aggregation operator. First, the operation laws for 2-tuple linguistic information are defined and the related properties of the operation laws are studied. Hereby some hybrid arithmetic aggregation operators with 2-tuple linguistic information are developed, involving the 2-tuple hybrid weighted arithmetic average (THWA) operator, the 2-tuple hybrid linguistic weighted arithmetic average (T-HLWA) operator, and the extended 2-tuple hybrid linguistic weighted arithmetic average (ET-HLWA) operator. In the proposed decision method, the individual overall preference values of alternatives are derived by using the extended 2-tuple weighted arithmetic average operator (ET-WA). Utilizing the ET-HLWA operator, all the individual overall preference values of alternatives are further integrated into the collective ones of alternatives, which are used to rank the alternatives. A real example of personnel selection is given to illustrate the developed method, and the comparison analyses demonstrate the universality and flexibility of the method proposed in this paper. [Copyright Elsevier]
Published: 2013
Full Text: View/download PDF

35. Applying fuzzy quality function deployment to prioritize solutions of knowledge management for an international port in Taiwan

Author: Liang, Gin-Shuh, Ding, Ji-Feng, and Wang, Chun-Kai
Subjects: *QUALITY function deployment, *KNOWLEDGE management, *FUZZY logic, *CASE studies, *INFORMATION retrieval, *DATA mining, *QUALITY control
Abstract: The purpose of this paper is to apply a fuzzy quality function deployment (QFD) approach to prioritize knowledge management (KM) solutions for an international port in Taiwan. First, the paper examines house of quality (HOQ) matrices to facilitate handling of the ‘what’ (i.e., KM requirements) and ‘how’ (KM solutions) aspects of the QFD problem, and proposes procedures for the use of a fuzzy QFD method. A case study concerning port K in Taiwan is then used to demonstrate a systematic appraisal process for prioritizing KM solutions, and twenty attributes with sixteen feasible KM implementation solutions are measured employing an HOQ matrix. Finally, the top five feasible solutions for implementing KM at port K are identified. The empirical results show that ‘establishment of a data storage and data mining system’ in the technology dimension is the most urgent requirement for KM implementation at port K in Taiwan. [Copyright Elsevier]
Published: 2012
Full Text: View/download PDF

36. Consensus clustering based on constrained self-organizing map and improved Cop-Kmeans ensemble in intelligent decision support systems

Author: Yang, Yan, Tan, Wei, Li, Tianrui, and Ruan, Da
Subjects: *CLUSTER analysis (Statistics), *DECISION support systems, *DATA mining, *DATA structures, *PERFORMANCE evaluation, *COMPUTER algorithms, *SELF-organizing maps, *ELECTRONIC data processing
Abstract: Data mining processes data from different perspectives into useful knowledge, and has become an important component in designing intelligent decision support systems (IDSS). Clustering is an effective method to discover natural structures of data objects in data mining. Both clustering ensemble and semi-supervised clustering techniques have emerged to improve the clustering performance of unsupervised clustering algorithms. Cop-Kmeans is a K-means variant that incorporates background knowledge in the form of pairwise constraints. However, there exists a constraint violation in Cop-Kmeans. This paper proposes an improved Cop-Kmeans (ICop-Kmeans) algorithm to solve the constraint violation of Cop-Kmeans. The certainty of objects is computed to obtain a better assignment order of objects by the weighted co-association. The paper proposes a new constrained self-organizing map (SOM) to combine multiple semi-supervised clustering solutions for further enhancing the performance of ICop-Kmeans. The proposed methods effectively improve the clustering results from the validated experiments and the quality of complex decisions in IDSS. [Copyright Elsevier]
Published: 2012
Full Text: View/download PDF
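
Illustrative note (not part of the catalog record): the pairwise constraints mentioned in the abstract above are commonly expressed as must-link and cannot-link pairs. The Python sketch below shows one way a constraint check could look, and how an unlucky assignment order can leave an object with no feasible cluster. It is an assumption-laden illustration, not the ICop-Kmeans algorithm from the paper; the function name `violates` and the toy constraints are hypothetical.

```python
def violates(point, cluster, assignment, must_link, cannot_link):
    """Hypothetical check: return True if placing `point` in `cluster`
    breaks a pairwise constraint, given the partial `assignment`
    (a dict mapping already-assigned point ids to cluster ids)."""
    for a, b in must_link:
        # The partner of `point` in this pair, if `point` is part of it.
        other = b if a == point else a if b == point else None
        if other is not None and other in assignment and assignment[other] != cluster:
            return True  # must-link partner sits in a different cluster
    for a, b in cannot_link:
        other = b if a == point else a if b == point else None
        if other is not None and assignment.get(other) == cluster:
            return True  # cannot-link partner sits in the same cluster
    return False


# Toy case (assumed): two clusters, points 0 and 1 already assigned.
assignment = {0: 0, 1: 1}
must_link = []
cannot_link = [(0, 2), (1, 2)]

# Point 2 may not share a cluster with point 0 or point 1, so with
# only two clusters no feasible assignment remains for it.
feasible = [c for c in range(2)
            if not violates(2, c, assignment, must_link, cannot_link)]
print(feasible)
```

An empty `feasible` list is the kind of dead end that depends on the order in which objects are assigned, which is why a better assignment order can matter.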

37. Frequent approximate subgraphs as features for graph-based image classification

Author: Acosta-Mendoza, Niusvel, Gago-Alonso, Andrés, and Medina-Pagola, José E.
Subjects: *GRAPHIC methods, *COMPUTER algorithms, *IMAGE processing, *TOPOLOGY, *DATA mining, *APPROXIMATION theory, *IMAGE analysis, *IMAGING systems
Abstract: The use of approximate graph matching for frequent subgraph mining has been identified in different applications as a need. To meet this need, several algorithms have been developed, but there are applications where it has not been used yet, for example image classification. In this paper, a new algorithm for mining frequent connected subgraphs over undirected and labeled graph collections, VEAM (Vertex and Edge Approximate graph Miner), is presented. Slight variations of the data, keeping the topology of the graphs, are allowed in this algorithm. Approximate matching in the existing algorithm (APGM) is performed only on the vertex label set. In VEAM, approximate matching on the edge label set is also included in the mining process. Also, a framework for graph-based image classification is introduced. The approximate method of VEAM was tested on an artificial image collection using a graph-based image representation proposed in this paper. The experimentation on this collection shows that our proposal gets better results than graph-based image classification using some algorithms reported in related work. [Copyright Elsevier]
Published: 2012
Full Text: View/download PDF

38. A comparative study on feature reduction approaches in Hindi and Bengali named entity recognition

Author: Saha, Sujan Kumar, Mitra, Pabitra, and Sarkar, Sudeshna
Subjects: *COMPARATIVE studies, *HINDI language, *BENGALI language, *DATA mining, *CLASSIFIERS (Linguistics), *ONLINE data processing, *CONTENT mining, *DIMENSION reduction (Statistics)
Abstract: Features used for named entity recognition (NER) are often high dimensional in nature. These cause overfitting when training data is not sufficient. Dimensionality reduction leads to performance enhancement in such situations. There are a number of approaches for dimensionality reduction based on feature selection and feature extraction. In this paper we perform a comprehensive and comparative study on different dimensionality reduction approaches applied to the NER task. To compare the performance of the various approaches we consider two Indian languages, namely Hindi and Bengali. NER accuracies achieved in these languages are comparatively poor as yet, primarily due to the scarcity of annotated corpora. For both languages, dimensionality reduction is found to improve the performance of the classifiers. A comparative study of the effectiveness of several dimensionality reduction techniques is presented in detail in this paper. [Copyright Elsevier]
Published: 2012
Full Text: View/download PDF

39. Multi-task and multi-view learning based on particle swarm optimization for short-term traffic forecasting.

Author: Cheng, Shifen, Lu, Feng, Peng, Peng, and Wu, Sheng
Subjects: *PARTICLE swarm optimization, *TRAFFIC estimation, *SPATIOTEMPORAL processes, *WATER quality, *DATA mining
Abstract: Spatiotemporal prediction modeling of traffic is an important issue in the field of spatiotemporal data mining. However, it is encountering multiple challenges such as the global spatiotemporal correlation between predictive tasks, the balance between spatiotemporal heterogeneity and the global predictive power of the model, and parameter optimization of prediction models. Most existing short-term traffic prediction methods only emphasize spatiotemporal dependence and heterogeneity, so it is difficult to get satisfactory prediction accuracy. In this paper, spatiotemporal multi-task and multi-view feature learning models based on particle swarm optimization are combined to concurrently address these challenges. First, cross-correlation is used to construct the spatiotemporal proximity view, periodic view and trend view of each road segment to characterize spatiotemporal dependence and heterogeneity. Second, the prediction results of three spatiotemporal views are obtained using a set of kernels, which is further regarded as a high-level heterogeneous semantic feature as the input of the multi-task multi-view feature learning model. Third, additional regularization terms (e.g., group Lasso penalty, graph Laplacian regularization) are utilized to constrain all tasks to select a set of shared features and ensure the relatedness between tasks and consistency between views, so that the predictive model has a good global predictive ability and can capture global spatiotemporal correlation in the road network. Finally, particle swarm optimization is introduced to obtain the optimal parameter set and enhance the training speed of the proposed model. Experimental studies on real vehicular-speed datasets collected on city roads demonstrate that the proposed model significantly outperforms the existing nine baseline methods in terms of prediction accuracy. The results suggest that the proposed model merits further attention for other spatiotemporal prediction tasks, such as water quality and crowd flow, owing to the versatility of the modeling process for spatiotemporal data.
• Multi-task and multi-view feature learning models combined for traffic forecasting.
• Prediction results of three spatiotemporal views obtained using a set of kernels.
• Added regularization terms used to constrain all tasks to select shared features.
• Proposed model outperforms existing baseline methods in prediction accuracy.
[ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

40. Average Restrain Divider of Evaluation Value (ARDEV) in data stream algorithm for big data prediction.

Author: Wibisono, Ari and Sarwinda, Devvi
Subjects: *MINING methodology, *ALGORITHMS, *DATA mining, *RIVERS, *TREE development, *BIG data
Abstract: Today, big data processing has become a challenging task because the amount of data collected using various sensors is increasing significantly. To build knowledge and predict the data, traditional data mining methods load all numerical attributes into memory simultaneously. The data stream method is a solution for processing and calculating data. The method streams incrementally in batch form; therefore, infrastructure memory is sufficient to develop knowledge. The existing method for data stream prediction is FIMT-DD (Fast Incremental Model Tree-Drift Detection). Using this method, knowledge is developed in tree form for every instance. In this paper, enhanced FIMT-DD is proposed using ARDEV (Average Restrain Divider of Evaluation Value). ARDEV utilizes the Chernoff bound approach with error evaluation, improvement in learning rate, modification of perceptron rule calculation, and utilization of activation function. Standard FIMT-DD separates the tree formation process and perceptron prediction. The proposed method evaluates and connects the development of the tree for knowledge formation and the perceptron rule for prediction. The prediction accuracy of the proposed method is measured using MAE, RMSE and MAPE. From the experiment performed, the utilization of ARDEV enhancement shows significant improvement in terms of prediction accuracy. Statistically, the overall prediction accuracy improvement is approximately 6.99% compared to standard FIMT-DD with a traffic dataset. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF
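
The three accuracy measures named in this abstract (MAE, RMSE and MAPE) have standard definitions, sketched below in plain Python. The function names and the sample values are illustrative assumptions and are not taken from the paper.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (true values must be non-zero)."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up observations and predictions, for illustration only.
observed = [100.0, 200.0]
predicted = [110.0, 180.0]
print(mae(observed, predicted), rmse(observed, predicted), mape(observed, predicted))
```

MAE and RMSE are expressed in the units of the target variable, while MAPE is relative, so it is undefined when any true value is zero.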

41. Mining high-utility itemsets in dynamic profit databases.

Author: Nguyen, Loan T.T., Nguyen, Phuc, Nguyen, Trinh D.D., Vo, Bay, Fournier-Viger, Philippe, and Tseng, Vincent S.
Subjects: *MINING methodology, *PROFIT, *DATABASES, *CRITICAL currents, *RETAIL stores
Abstract: High-Utility Itemset (HUI) mining is an important data-mining task which has gained popularity in recent years due to its applications in numerous fields. HUI mining aims at discovering itemsets that have high utility (e.g., yield a high profit) in transactional databases. Although several algorithms have been designed to enumerate all HUIs, an important issue is that they assume that the utilities (e.g., unit profits) of items are static. But this simplifying assumption does not hold in real-life situations. For example, the unit profits of items often vary over time in a retail store due to fluctuating supply costs and promotions. Ignoring this important characteristic of real-life transactional databases makes current HUI-mining algorithms inapplicable in many real-world applications. To address this critical limitation of current HUI-mining techniques, this paper studies the novel problem of mining HUIs in databases having dynamic unit profits. To accurately assess the utility of any itemset in this context, a redefined utility measure is introduced. Furthermore, a novel algorithm named MEFIM (Modified EFficient high-utility Itemset Mining), which relies on a novel compact database format to discover the desired itemsets efficiently, is designed. An improved version of the MEFIM algorithm, named iMEFIM, is also introduced. This algorithm employs a novel structure called P-set to reduce the number of transaction scans and to speed up the mining process. Experimental results show that the proposed algorithms considerably outperform the state-of-the-art HUI-mining algorithms on dynamic profit databases in terms of runtime, memory usage, and scalability. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF
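
To make the idea of dynamic unit profits concrete, the sketch below computes the utility of an itemset when each transaction carries the unit profits that were in effect at the time it occurred. This is one plausible reading of the setting, stated as an assumption; it is not the redefined utility measure or the MEFIM algorithm from the paper, and the name `itemset_utility` and the data layout are hypothetical.

```python
def itemset_utility(itemset, transactions):
    """Sum quantity * unit profit over all transactions containing the itemset.

    transactions: list of (quantities, profits) pairs, where both are dicts
    keyed by item and `profits` holds the unit profits valid for that
    transaction (assumed layout for this sketch).
    """
    total = 0
    for quantities, profits in transactions:
        # Only transactions that contain every item of the itemset count.
        if all(item in quantities for item in itemset):
            total += sum(quantities[item] * profits[item] for item in itemset)
    return total


# Item "a" sells at unit profit 5, then 4 (a promotion), then 5 again.
transactions = [
    ({"a": 2, "b": 1}, {"a": 5, "b": 3}),
    ({"a": 1, "b": 2}, {"a": 4, "b": 3}),
    ({"a": 3}, {"a": 5}),
]
print(itemset_utility({"a", "b"}, transactions))
print(itemset_utility({"a"}, transactions))
```

An itemset would then be reported as high-utility when this value reaches a user-chosen minimum utility threshold.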

42. Differential privacy preservation in regression analysis based on relevance.

Author: Gong, Maoguo, Pan, Ke, and Xie, Yu
Subjects: *REGRESSION analysis, *PRIVACY, *DATA mining, *RELEVANCE, *DATA privacy, *INFORMATION technology security
Abstract: With the development of data release and data mining, protecting the sensitive information in the data from being leaked has attracted considerable attention in the information security field. Differential privacy is an excellent paradigm for providing protection against an adversary that attempts to infer the sensitive information of individuals. However, existing works show that the accuracy of the differentially private regression model is less than satisfactory since the amount of noise added is uncertain. In this paper, we present a novel framework PrivR, a differentially private regression analysis model based on relevance, which transforms the objective function into the form of a polynomial and perturbs the polynomial coefficients according to the magnitude of relevance between the input features and the model output. Specifically, we add less noise to the coefficients of the polynomial representation of the objective function that involve strongly relevant features, and vice-versa. Experiments on the Adult dataset and the Banking dataset demonstrate that PrivR not only prevents the leakage of data privacy effectively but also retains the utility of the model. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

43. Empower event detection with bi-directional neural language model.

Author: Zhang, Yunyan, Xu, Guangluan, Wang, Yang, Liang, Xiao, Wang, Lei, and Huang, Tinglei
Subjects: *DATA mining, *ARTIFICIAL neural networks, *COMPUTER multitasking, *LEARNING, *PARADIGMS (Social sciences)
Abstract: Event detection is an essential and challenging task in Information Extraction (IE). Recent advances in neural networks make it possible to build reliable models without complicated feature engineering. However, data scarcity hinders their further performance. Moreover, training data has been underused since the majority of labels in datasets are not event triggers and contribute very little to the training process. In this paper, we propose a novel multi-task learning framework to extract more general patterns from raw data and make better use of the training data. Specifically, we present two paradigms to incorporate a neural language model into the event detection model on both word and character levels: (1) we use the features extracted by the language model as an additional input to the event detection model; (2) we use a hard parameter sharing approach between the language model and the event detection model. The extensive experiments demonstrate the benefits of the proposed multi-task learning framework for event detection. Compared to the previous methods, our method does not rely on any additional supervision but still beats the majority of them and achieves a competitive performance on the ACE 2005 benchmark. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

44. Towards efficiently mining closed high utility itemsets from incremental databases.

Author: Dam, Thu-Lan, Ramampiaro, Heri, Nørvåg, Kjetil, and Duong, Quang-Huy
Subjects: *DATA mining, *MACHINE learning, *DATABASES, *STATICS, *UTILITY functions
Abstract: The set of closed high-utility itemsets (CHUIs) concisely represents the exact utility of all itemsets. Yet, it can be several orders of magnitude smaller than the set of all high-utility itemsets. Existing CHUI mining algorithms assume that databases are static, making them very expensive in the case of incremental data, since the whole dataset has to be processed for each batch of new transactions. To address this challenge, this paper presents the first approach, called IncCHUI, that mines CHUIs efficiently from incremental databases. In order to achieve this, we propose an incremental utility-list structure, which is built and updated with only one database scan. Further, we apply effective pruning strategies to quickly construct incremental utility-lists and eliminate candidates that are not updated. Finally, we suggest an efficient hash-based approach to update or insert new closed sets that are found. Our extensive experimental evaluation on both real-life and synthetic databases shows the efficiency, as well as the feasibility, of our approach. It significantly outperforms previously proposed methods that are mainly run in batch mode in terms of speed, and it is scalable with respect to the number of transactions. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

45. Nonstationary time series transformation methods: An experimental review.

Author: Salles, Rebecca, Belloze, Kele, Porto, Fabio, Gonzalez, Pedro H., and Ogasawara, Eduardo
Subjects: *TIME series analysis, *ELECTRONIC data processing, *DATA mining, *MATHEMATICAL transformations, *PREDICTION models
Abstract: Data preprocessing is a crucial step for mining and learning from data, and one of its primary activities is the transformation of data. This activity is very important in the context of time series prediction since most time series models assume the property of stationarity, i.e., statistical properties do not change over time, which in practice is the exception and not the rule in most real datasets. There are several transformation methods designed to treat nonstationarity in time series. However, the choice of a transformation that is appropriate to the adopted data model and to the problem at hand is not a simple task. This paper provides a review and experimental analysis of methods for transformation of nonstationary time series. The focus of this work is to provide a background on the subject and a discussion on their advantages and limitations to the problem of time series prediction. A subset of the reviewed transformation methods is compared through an experimental evaluation using benchmark datasets from time series prediction competitions and other real macroeconomic datasets. Suitable nonstationary time series transformation methods provided improvements of more than 30% in prediction accuracy for half of the evaluated time series and improved the prediction by more than 95% for 10% of the time series. Furthermore, the adoption of a validation phase during model training enables the selection of suitable transformation methods. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF
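
As a concrete example of a transformation for nonstationary series, the sketch below applies first-order differencing, a widely used way to remove a trend, and then inverts it so that predictions made on the transformed scale can be mapped back. The abstract does not list which methods the review covers, so the choice of differencing and the function names here are assumptions for illustration.

```python
def difference(series):
    """First-order differencing: y[t] - y[t-1] for t = 1..n-1."""
    return [b - a for a, b in zip(series, series[1:])]


def undifference(first_value, diffs):
    """Invert first-order differencing given the first original value."""
    restored = [first_value]
    for d in diffs:
        restored.append(restored[-1] + d)
    return restored


# A series with a growing trend becomes a much flatter sequence of increments.
series = [3, 5, 8, 12]
diffs = difference(series)
print(diffs)
print(undifference(series[0], diffs))
```

The inverse step matters in practice: a model fitted on the differenced series predicts increments, which must be accumulated from the last known value to recover a forecast on the original scale.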

46. Restricted Sensitive Attributes-based Sequential Anonymization (RSA-SA) approach for privacy-preserving data stream publishing.

Author: Abdelhameed, Saad A., Moussa, Sherin M., and Khalifa, Mohamed E.
Subjects: *DATA transmission systems, *DATA privacy, *CYBERTERRORISM, *SKEWNESS (Probability theory), *DATA mining
Abstract: Data streams have become a widely-adopted data representation format in many real-world applications. These data streams may be published for different scientific research, mining, or analysis purposes. However, such streams may contain personal-specific data that could be considered as sensitive about individuals. This makes preserving the privacy of data streams against privacy disclosure attacks, while maintaining their utility, a real challenge. Some studies have considered privacy-preserving publishing over data streams with only a Single Sensitive Attribute, in which they do not protect the published streams from all possible privacy attacks. In this paper, we propose a novel Restricted Sensitive Attributes-based Sequential Anonymization (RSA-SA) approach for privacy-preserving data stream publishing. Besides, two new privacy restrictions are introduced to restrict the published Sensitive Attributes values: Semantic-diversity and Sensitivity-diversity. RSA-SA can protect the sensitive values of the published data streams against the related privacy attacks, including the attribute disclosure, skewness, similarity, and sensitivity attacks. In addition, RSA-SA handles data streams that have either single or multiple sensitive attributes with minimum information loss and delay time. Thus, the data utility of the published data streams is efficiently maintained to provide more accurate mining and analytical results, where robust invulnerability to privacy attacks is sustained. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

47. Extractive single document summarization using multi-objective optimization: Exploring self-organized differential evolution, grey wolf optimizer and water cycle algorithm.

Author: Saini, Naveen, Saha, Sriparna, Jangra, Anubhav, and Bhattacharyya, Pushpak
Subjects: *MULTIDISCIPLINARY design optimization, *DIFFERENTIAL evolution, *DATA mining, *SELF-organizing maps, *CLUSTER analysis (Statistics)
Abstract: Text summarization techniques become paramount in extracting relevant information from large databases. The current paper attempts to build some extractive single document text summarization (ESDS) systems using multi-objective optimization (MOO) frameworks. Three techniques are proposed: (1) the first is an integration of self-organizing map (SOM) and multi-objective differential evolution (MODE) (named ESDS_SMODE), (2) the second is based on multi-objective grey wolf optimizer (ESDS_MGWO), and (3) the third is based on multi-objective water cycle algorithm (ESDS_MWCA). The sentences present in the document are first clustered utilizing the concept of multi-objective clustering. Two objective functions measuring compactness and separation of the sentence clusters in two different ways are optimized simultaneously using the MOO framework. The proposed approach is able to automatically detect the number of sentence clusters present in a document, and then representative sentences are selected from different clusters using some sentence scoring features to generate the summary. The experiments were conducted on two benchmark datasets, DUC2001 and DUC2002, and the obtained results are compared with various state-of-the-art techniques using ROUGE measures. Results illustrate the superiority of our approach in comparison to state-of-the-art techniques in terms of ROUGE-2 score for both datasets. Code of the developed approach ESDS_SMODE is available online at https://drive.google.com/open?id=1WagTeIDLgphttPrKHpnF_eO7QHWJxXxK. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

48. Knowledge-enhanced document embeddings for text classification.

Author: Sinoara, Roberta A., Camacho-Collados, Jose, Rossi, Rafael G., Navigli, Roberto, and Rezende, Solange O.
Subjects: *TEXT mining, *SEMANTIC computing, *COMPUTER science, *MACHINE learning, *DATA mining
Abstract: Accurate semantic representation models are essential in text mining applications. For a successful application of the text mining process, the text representation adopted must keep the interesting patterns to be discovered. Although competitive results for automatic text classification may be achieved with traditional bag of words, such representation model cannot provide satisfactory classification performances on hard settings where richer text representations are required. In this paper, we present an approach to represent document collections based on embedded representations of words and word senses. We bring together the power of word sense disambiguation and the semantic richness of word- and word-sense embedded vectors to construct embedded representations of document collections. Our approach results in semantically enhanced and low-dimensional representations. We overcome the lack of interpretability of embedded vectors, which is a drawback of this kind of representation, with the use of word sense embedded vectors. Moreover, the experimental evaluation indicates that the use of the proposed representations provides stable classifiers with strong quantitative results, especially in semantically-complex classification scenarios. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

49. Influence Factorization for identifying authorities in Twitter.

Author: Alp, Zeynep Zengin and Öğüdücü, Şule Gündüz
Subjects: *VIRAL marketing, *INTERNET marketing, *FACTORIZATION, *SOCIAL networks, *ONLINE social networks
Abstract: Prevalent usage of social media has attracted companies and researchers to analyze its dynamics and effects on user behavior. One of the most intriguing aspects of social networks is to identify influencers who are experts on a specific topic. With the identification of these users within the network, many applications can be built for user recommendation, information diffusion modeling, viral marketing, user modeling and many more. In this paper, we aim to identify topic-based experts using a large dataset collected from Twitter. Our proposed approach has three phases: (1) identification of topics on social media posts (more specifically, tweets), (2) user modeling, based on a group of user specific features, and (3) Influence Factorization to identify topical influencers. The main advantage of the proposed method is to identify future influencers as well as current ones on Twitter. Moreover, it is an easy-to-implement algorithm using Spark MLlib, which can be easily extended to include other user specific features and compared with other methodologies. The effectiveness of the proposed method is tested on a large dataset that contains tweets of 180K users over a 70-day period. The experimental results show that our proposed method identifies influencers successfully when used with a hybrid user specific feature that contains follower count and authenticity information, and is a highly scalable and extensible algorithm. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

50. Hierarchical feature selection with subtree based graph regularization.

Author: Tuo, Qianjuan, Zhao, Hong, and Hu, Qinghua
Subjects: *HIERARCHICAL Bayes model, *FEATURE selection, *MACHINE learning, *DATA mining, *SPARSE graphs
Abstract: Feature selection is an important and challenging task in machine learning and data mining. In many practical problems, the classes have a hierarchical structure. However, some existing feature selection algorithms ignored the dependence among different classes in the hierarchical structure. Other feature selection algorithms only focused on one-way dependence among different classes, ignoring two-way dependence. In this paper, we propose a novel feature selection method called hierarchical feature selection with subtree based graph regularization (HFSGR), which is aimed at exploring two-way dependence among different classes. First, we construct a subtree graph using the parent–child relationships of the subtrees in a predefined tree structure, where the subtree is obtained from its internal nodes. Second, we use the $\ell_{2,1}$-norm regularization term to encourage nearby subtrees that share similar sparsity patterns. Third, we extend our algorithm to a directed acyclic graph structure so that it can be applied to common situations. Our method is applied to eight datasets with different tree structures. Experimental comparisons of our proposed algorithm with five hierarchical feature selection algorithms justify its effectiveness and efficiency. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF
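
The regularization term named in this abstract is built on the l2,1 norm of a weight matrix, which in its standard form is the sum of the Euclidean norms of the matrix rows. The sketch below computes it for a small matrix. The convention that rows correspond to features, and the function name `l21_norm`, are assumptions for this example rather than details taken from the paper.

```python
import math


def l21_norm(weights):
    """Sum of the Euclidean (l2) norms of the rows of a matrix."""
    return sum(math.sqrt(sum(w * w for w in row)) for row in weights)


# Rows are features (assumed convention), columns are classes or tasks.
# The all-zero second row contributes nothing to the norm.
weights = [
    [3.0, 4.0],
    [0.0, 0.0],
    [5.0, 12.0],
]
print(l21_norm(weights))
```

Because each row contributes its whole l2 norm, penalizing this quantity tends to drive entire rows to zero, which is why it is commonly used to select or discard a feature across all classes at once.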

328 results

Search Results

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources