1,004 results
Search Results
2. Classification of Skin Disease using Ensemble Data Mining Techniques.
- Author
-
Verma AK, Pal S, and Kumar S
- Subjects
- Humans, Prognosis, Algorithms, Data Mining methods, Machine Learning, Skin Diseases classification, Skin Diseases diagnosis
- Abstract
Objective: Skin diseases are a major global health problem affecting a large number of people. With the rapid development of technology and the application of various data mining techniques in recent years, dermatological classification has become increasingly accurate. The development of machine learning techniques that can effectively differentiate skin disease classes is therefore of great importance. Among the machine learning techniques applied to skin disease prediction so far, no single technique outperforms all the others. Methods: In this research paper, we present a new method that applies five different data mining techniques and then develops an ensemble approach combining all five techniques into a single unit. We use the informative Dermatology dataset to analyze the individual data mining techniques for classifying skin disease, and then apply an ensemble machine learning method. Results: The proposed machine learning-based ensemble method was tested on the Dermatology dataset and classifies skin disease into six classes: C1: psoriasis, C2: seborrheic dermatitis, C3: lichen planus, C4: pityriasis rosea, C5: chronic dermatitis, C6: pityriasis rubra. The results show that prediction accuracy on the test set is higher than that of any single classifier. Conclusion: The ensemble method gives better performance on the Dermatology dataset than the individual classifier algorithms, yielding more accurate and effective skin disease prediction.
- Published
- 2019
- Full Text
- View/download PDF
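The abstract above does not name the five base classifiers, but the voting step such an ensemble relies on can be sketched in plain Python. The `majority_vote` / `ensemble_predict` names and the toy lambda "classifiers" below are illustrative assumptions, not from the paper:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine the labels predicted by several base classifiers for one
    sample by plain majority vote (Counter breaks ties by first
    occurrence)."""
    return Counter(predictions).most_common(1)[0][0]

def ensemble_predict(classifiers, sample):
    """Ask every base classifier for a label, then vote."""
    return majority_vote([clf(sample) for clf in classifiers])

# Three toy 'classifiers' that disagree on one sample: the ensemble
# returns the label backed by the majority.
clfs = [lambda x: "psoriasis", lambda x: "psoriasis", lambda x: "lichen planus"]
print(ensemble_predict(clfs, {"erythema": 2}))  # psoriasis
```

A real system would replace the lambdas with trained models sharing a `predict` interface; the combiner itself stays this small.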
3. To Generate an Ensemble Model for Women Thyroid Prediction Using Data Mining Techniques
- Author
-
Yadav DC and Pal S
- Subjects
- Female, Humans, Neural Networks, Computer, Predictive Value of Tests, Algorithms, Data Mining methods, Decision Support Systems, Clinical, Machine Learning, Thyroid Gland pathology
- Abstract
Objective: The main objective of this paper is to identify thyroid symptoms easily for treatment. Methods: Two main techniques are proposed for mining the hidden patterns in the dataset, Ensemble-I and Ensemble-II, both machine learning techniques. Ensemble-I is generated from decision tree, overfitting and neural network methods, and Ensemble-II from combinations of Bagging and Boosting techniques. Finally, the proposed experiment compares Ensemble-I against Ensemble-II. Results: Across the experimental setup, the Ensemble-II model scores higher than the Ensemble-I model. In each experiment, the ROC, MAE, RMSE, RAE and RRSE performance values are observed and compared. The stacking (Ensemble-I) model estimates weights for the input and output models on the thyroid dataset, giving ROC = 98.80, MAE = 0.89, RMSE = 0.21, RAE = 52.78 and RRSE = 83.71; the Ensemble-II model, measured on the same thyroid dataset, gives ROC = 98.79, MAE = 0.31, RMSE = 0.05, RAE = 35.89 and RRSE = 52.67. It is concluded that the (Bagging + Boosting) Ensemble-II model is the best of the models compared., (Creative Commons Attribution License)
- Published
- 2019
- Full Text
- View/download PDF
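Entry 3's Ensemble-II combines Bagging and Boosting. The bagging half, fitting one model per bootstrap replicate, can be sketched as follows; the `fit` callback and the toy mean "model" are assumptions for illustration, and boosting's sequential example-reweighting is only noted in a comment:

```python
import random

def bootstrap_sample(data, rng):
    """One bootstrap replicate: len(data) draws with replacement."""
    return [rng.choice(data) for _ in data]

def bagging_train(data, fit, n_models=5, seed=0):
    """Fit one model per bootstrap replicate. Prediction would then
    average (regression) or vote (classification) over the models;
    boosting, by contrast, reweights hard examples sequentially."""
    rng = random.Random(seed)
    return [fit(bootstrap_sample(data, rng)) for _ in range(n_models)]

# Toy 'model': just the mean of the replicate it was trained on.
models = bagging_train([1, 2, 3, 4, 5], fit=lambda d: sum(d) / len(d))
print(len(models))  # 5
```

Each trained "model" here is a float in [1, 5]; with real learners, `fit` would return a fitted estimator instead.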
4. Low-rank representation with adaptive graph regularization.
- Author
-
Wen J, Fang X, Xu Y, Tian C, and Fei L
- Subjects
- Cluster Analysis, Algorithms, Data Mining trends, Machine Learning trends
- Abstract
Low-rank representation (LRR) has aroused much attention in the community of data mining. However, it has the following two problems, which greatly limit its applications: (1) it cannot discover the intrinsic structure of data, owing to its neglect of the local structure of data; (2) the obtained graph is not the optimal graph for clustering. To solve these problems and improve clustering performance, we propose in this paper a novel graph learning method named low-rank representation with adaptive graph regularization (LRR_AGR). Firstly, a distance regularization term and a non-negative constraint are jointly integrated into the framework of LRR, which enables the method to simultaneously exploit the global and local information of the data for graph learning. Secondly, a novel rank constraint is further introduced into the model, which encourages the learned graph to have a very clear clustering structure, i.e., exactly c connected components for data with c clusters. These two approaches are meaningful and beneficial for learning the optimal graph that discovers the intrinsic structure of the data. Finally, an efficient iterative algorithm is provided to optimize the model. Experimental results on synthetic and real datasets show that the proposed method can significantly improve clustering performance., (Copyright © 2018 Elsevier Ltd. All rights reserved.)
- Published
- 2018
- Full Text
- View/download PDF
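The abstract's two ingredients, joint global/local graph learning and an explicit c-component rank constraint, are commonly written as an objective of the following generic form. This is a sketch only; the paper's exact terms, norms and multipliers may differ:

```latex
\min_{Z,\,E}\; \|Z\|_{*}
  \;+\; \lambda_{1}\,\|E\|_{2,1}
  \;+\; \lambda_{2}\sum_{i,j}\|x_{i}-x_{j}\|_{2}^{2}\, z_{ij}
\quad \text{s.t.}\quad X = XZ + E,\;\; z_{ij}\ge 0,\;\;
\operatorname{rank}(L_{Z}) = n - c
```

Here \(\|Z\|_{*}\) is the nuclear norm capturing global low-rank structure, the distance-weighted sum is the local graph regularizer, and forcing the Laplacian \(L_{Z}\) of the learned graph to have rank \(n-c\) makes the graph decompose into exactly \(c\) connected components, one per cluster.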
5. NEURIPS PAPERS AIM TO IMPROVE UNDERSTANDING AND ROBUSTNESS OF MACHINE LEARNING ALGORITHMS
- Subjects
Data mining ,Algorithms ,Machine learning ,Pellet fusion ,Data warehousing/data mining ,Algorithm ,News, opinion and commentary - Abstract
LIVERMORE, CA -- The following information was released by Lawrence Livermore National Laboratory (LLNL): The 34th Conference on Neural Information Processing Systems (NeurIPS) is featuring two papers advancing the reliability [...]
- Published
- 2020
6. Probabilistic and machine learning-based retrieval approaches for biomedical dataset retrieval.
- Author
-
Karisani P, Qin ZS, and Agichtein E
- Subjects
- Probability, Algorithms, Data Curation methods, Data Mining methods, Databases, Factual, Machine Learning, Models, Theoretical
- Abstract
The bioCADDIE dataset retrieval challenge brought together different approaches to retrieval of biomedical datasets relevant to a user’s query, expressed as a text description of a needed dataset. We describe experiments in applying a data-driven, machine learning-based approach to biomedical dataset retrieval as part of this challenge. We report on a series of experiments carried out to evaluate the performance of both probabilistic and machine learning-driven techniques from information retrieval, as applied to this challenge. Our experiments with probabilistic information retrieval methods, such as query term weight optimization, automatic query expansion and simulated user relevance feedback, demonstrate that automatically boosting the weights of important keywords in a verbose query is more effective than other methods. We also show that although there is a rich space of potential representations and features available in this domain, machine learning-based re-ranking models are not able to improve on probabilistic information retrieval techniques with the currently available training data. The models and algorithms presented in this paper can serve as a viable implementation of a search engine to provide access to biomedical datasets. The retrieval performance is expected to be further improved by using additional training data that is created by expert annotation, or gathered through usage logs, clicks and other processes during natural operation of the system. Database URL: https://github.com/emory-irlab/biocaddie
- Published
- 2018
- Full Text
- View/download PDF
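The finding in entry 6 that "boosting the weights of important keywords in a verbose query" works best can be illustrated with the simplest such weighting, inverse document frequency. This is a toy stand-in: the bioCADDIE system's actual weight optimization is more elaborate, and the tiny corpus below is invented:

```python
import math
from collections import Counter

def idf_weights(docs):
    """Inverse document frequency over whitespace-tokenized docs:
    rarer terms get larger weights, a simple way to boost the
    important keywords of a verbose query."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d.split()))
    return {t: math.log(n / df[t]) for t in df}

def score(query, doc, weights):
    """Weighted term-overlap score between query and document."""
    doc_terms = set(doc.split())
    return sum(weights.get(t, 0.0) for t in query.split() if t in doc_terms)

docs = ["gene expression dataset", "protein structure dataset", "gene ontology"]
w = idf_weights(docs)
ranked = sorted(docs, key=lambda d: score("gene expression data", d, w), reverse=True)
print(ranked[0])  # gene expression dataset
```

The rare term "expression" dominates the score, so the matching dataset outranks documents sharing only common terms.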
7. Adaptive Online Sequential ELM for Concept Drift Tackling.
- Author
-
Budiman A, Fanany MI, and Basaruddin C
- Subjects
- Computer Simulation, Humans, Serial Learning, Algorithms, Data Mining, Machine Learning, Neural Networks, Computer, Online Systems
- Abstract
A machine learning method needs to adapt to changes in the environment over time. Such changes are known as concept drift. In this paper, we propose a concept drift tackling method as an enhancement of the Online Sequential Extreme Learning Machine (OS-ELM) and Constructive Enhancement OS-ELM (CEOS-ELM) by adding adaptive capability for classification and regression problems. The scheme is named adaptive OS-ELM (AOS-ELM). It is a single-classifier scheme that works well to handle real drift, virtual drift, and hybrid drift. AOS-ELM also works well for sudden drift and the recurrent context change type. The scheme is a simple unified method implemented in a few lines of code. We evaluated AOS-ELM on regression and classification problems using public concept drift datasets (SEA and STAGGER) and other public datasets such as MNIST, USPS, and IDS. Experiments show that our method gives a higher kappa value than the multiclassifier ELM ensemble. Even though AOS-ELM in practice does not need an increase in hidden nodes, we address some issues related to increasing the hidden nodes, such as error conditions and rank values. We propose taking the rank of the pseudoinverse matrix as an indicator parameter to detect the "underfitting" condition.
- Published
- 2016
- Full Text
- View/download PDF
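The abstract's final proposal, using the rank of the pseudoinverse matrix as an "underfitting" indicator, presupposes a rank computation, which can be sketched with stdlib Gaussian elimination. The example matrix is invented, and a real ELM implementation would use an SVD-based rank from a numerical library:

```python
def matrix_rank(m, tol=1e-9):
    """Rank by Gaussian elimination. In AOS-ELM's terms, a rank lower
    than the number of hidden nodes signals the 'underfitting'
    condition the abstract describes (simplified stdlib sketch)."""
    m = [row[:] for row in m]
    rank, rows, cols = 0, len(m), len(m[0])
    for col in range(cols):
        pivot = next((r for r in range(rank, rows) if abs(m[r][col]) > tol), None)
        if pivot is None:
            continue  # no independent direction in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rows):
            if r != rank and abs(m[r][col]) > tol:
                f = m[r][col] / m[rank][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

# Third row is the sum of the first two: rank 2, i.e. fewer independent
# directions than rows -> the deficiency the indicator would flag.
H = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 2.0]]
print(matrix_rank(H))  # 2
```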
8. Semisupervised Learning Based Disease-Symptom and Symptom-Therapeutic Substance Relation Extraction from Biomedical Literature.
- Author
-
Feng Q, Gui Y, Yang Z, Wang L, and Li Y
- Subjects
- Animals, Humans, Algorithms, Data Mining methods, Machine Learning
- Abstract
With the rapid growth of the biomedical literature, a large amount of knowledge about diseases, symptoms, and therapeutic substances hidden in the literature can be used for drug discovery and disease therapy. In this paper, we present a method of constructing two models for extracting, respectively, the relations between disease and symptom and between symptom and therapeutic substance from biomedical texts. The former judges whether a disease causes a certain physiological phenomenon, while the latter determines whether a substance relieves or eliminates a certain physiological phenomenon. These two kinds of relations can be further combined to extract relations between disease and therapeutic substance. In our method, two training sets for extracting the disease-symptom and symptom-therapeutic substance relations are first manually annotated, and then two semisupervised learning algorithms, Co-Training and Tri-Training, are applied to exploit the unlabeled data to boost relation extraction performance. Experimental results show that exploiting the unlabeled data with both the Co-Training and Tri-Training algorithms enhances performance effectively.
- Published
- 2016
- Full Text
- View/download PDF
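The Co-Training idea entry 8 applies, two classifiers trained on different feature views that pseudo-label confident examples for each other, has roughly this control flow. It is a skeleton under assumed interfaces: the `fit(view, data)` factory returning an `x -> (label, confidence)` callable, the threshold, and the toy demo are all inventions for the sketch:

```python
def co_train(labeled, unlabeled, fit, rounds=2, threshold=0.8):
    """Minimal co-training skeleton. fit(view, data) must return a
    callable x -> (label, confidence). Each round, each view's
    classifier pseudo-labels the unlabeled samples it is confident
    about, growing the shared labeled pool."""
    for _ in range(rounds):
        clf_a, clf_b = fit(0, labeled), fit(1, labeled)
        still_unlabeled = []
        for x in unlabeled:
            for clf in (clf_a, clf_b):
                label, confidence = clf(x)
                if confidence >= threshold:
                    labeled.append((x, label))
                    break
            else:  # neither classifier was confident enough
                still_unlabeled.append(x)
        unlabeled = still_unlabeled
    return labeled

def fit(view, data):
    # Toy per-view rule: feature `view` encodes the label directly,
    # with confidence proportional to its magnitude.
    return lambda x: ("pos" if x[view] > 0 else "neg", min(1.0, abs(x[view])))

out = co_train([((1.0, 1.0), "pos")], [(0.9, 0.1), (-0.2, -1.0)], fit)
print(len(out))  # 3
```

Note how the second sample is rescued by the second view: view 0 is unsure about it, but view 1 labels it confidently, which is the whole point of using two views. Tri-Training (also used in the paper) replaces the two-view setup with three classifiers and an agreement rule.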
9. A methodology for mining clinical data: experiences from TRANSFoRm project.
- Author
-
Danger R, Corrigan D, Soler JK, Kazienko P, Kajdanowicz T, Majeed A, and Curcin V
- Subjects
- Decision Support Systems, Clinical organization & administration, Algorithms, Data Mining methods, Electronic Health Records organization & administration, Machine Learning, Natural Language Processing
- Abstract
Data mining of electronic health records (eHRs) allows us to identify patterns of patient data that characterize diseases and their progress and learn best practices for treatment and diagnosis. Clinical Prediction Rules (CPRs) are a form of clinical evidence that quantifies the contribution of different clinical data to a particular clinical outcome and help clinicians to decide the diagnosis, prognosis or therapeutic conduct for any given patient. The TRANSFoRm diagnostic support system (DSS) is based on the construction of an ontological repository of CPRs for diagnosis prediction in which clinical evidence is expressed using a unified vocabulary. This paper explains the proposed methodology for constructing this CPR repository, addressing algorithms and quality measures for filtering relevant rules. Some preliminary application results are also presented.
- Published
- 2015
10. Active semi-supervised learning method with hybrid deep belief networks.
- Author
-
Zhou S, Chen Q, and Wang X
- Subjects
- Culture, Internet, Pattern Recognition, Automated, Algorithms, Data Mining statistics & numerical data, Machine Learning
- Abstract
In this paper, we develop a novel semi-supervised learning algorithm, called active hybrid deep belief networks (AHD), to address the semi-supervised sentiment classification problem with deep learning. First, we construct the first several hidden layers using restricted Boltzmann machines (RBM), which can quickly reduce the dimension of the reviews and abstract their information. Second, we construct the following hidden layers using convolutional restricted Boltzmann machines (CRBM), which can abstract the information of reviews effectively. Third, the constructed deep architecture is fine-tuned by gradient-descent-based supervised learning with an exponential loss function. Finally, an active learning method is combined with the proposed deep architecture. We ran several experiments on five sentiment classification datasets and show that AHD is competitive with previous semi-supervised learning algorithms. Experiments were also conducted to verify the effectiveness of the proposed method with different numbers of labeled and unlabeled reviews, respectively.
- Published
- 2014
- Full Text
- View/download PDF
11. BSI issues position paper on the emergence of artificial intelligence and machine learning algorithms in healthcare
- Subjects
Algorithms ,Data mining ,Artificial intelligence ,Professional associations ,Machine learning ,Medical equipment ,Business, international ,Association for the Advancement of Medical Instrumentation - Abstract
London: The British Standards Institution has issued the following news release: BSI, the business standards company, has undertaken research in collaboration with the US standards organization for medical devices, the Association [...]
- Published
- 2019
12. A recent review on optimisation methods applied to credit scoring models
- Author
-
Kamimura, Elias Shohei, Pinto, Anderson Rogerio Faia, and Nagano, Marcelo Seido
- Published
- 2023
- Full Text
- View/download PDF
13. Knowledge Discovery in Inductive Databases: 4th International Workshop, KDID 2005, Porto, Portugal, October 3, 2005, Revised Selected and Invited Papers
- Author
-
Bonchi, Francesco, Boulicaut, Jean-François, KDD Lab CNR/U. Pisa (KDDLab), Laboratoire d'InfoRmatique en Image et Systèmes d'information (LIRIS), Institut National des Sciences Appliquées de Lyon (INSA Lyon), Université de Lyon-Institut National des Sciences Appliquées (INSA)-Université de Lyon-Institut National des Sciences Appliquées (INSA)-Centre National de la Recherche Scientifique (CNRS)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-École Centrale de Lyon (ECL), and Université de Lyon-Université Lumière - Lyon 2 (UL2)
- Subjects
query languages ,multi-objective regression ,constraint-based mining ,learning ,knowledge discovery ,data mining ,algorithms ,inductive databases ,machine learning ,classification ,[INFO]Computer Science [cs] ,data management ,database ,clustering ,pattern mining ,query optimization - Abstract
International audience; The 4th International Workshop on Knowledge Discovery in Inductive Databases (KDID 2005) was held in Porto, Portugal, on October 3, 2005 in conjunction with the 16th European Conference on Machine Learning and the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases. Ever since the start of the field of data mining, it has been realized that the integration of database technology into knowledge discovery processes was a crucial issue. This vision has been formalized into the inductive database perspective introduced by T. Imielinski and H. Mannila (CACM 1996, 39(11)). The main idea is to consider knowledge discovery as an extended querying process for which relevant query languages are to be specified. Therefore, inductive databases might contain not only the usual data but also inductive generalizations (e.g., patterns, models) holding within the data. Despite many recent developments, there is still a pressing need to understand the central issues in inductive databases. Constraint-based mining has been identified as a core technology for inductive querying, and promising results have been obtained for rather simple types of patterns (e.g., itemsets, sequential patterns). However, constraint-based mining of models remains a quite open issue. Also, coupling schemes between the available database technology and inductive querying proposals are not yet well understood. Finally, the definition of a general purpose inductive query language is still an on-going quest.
- Published
- 2005
14. Research on Chinese Medical Entity Recognition Based on Multi-Neural Network Fusion and Improved Tri-Training Algorithm.
- Author
-
Qi, Renlong, Lv, Pengtao, Zhang, Qinghui, and Wu, Meng
- Subjects
SUPERVISED learning ,CONVOLUTIONAL neural networks ,MEDICAL informatics ,DATA mining ,ALGORITHMS ,MACHINE learning ,MEDICAL research - Abstract
Chinese medical texts contain a large number of named medical entities. Automatic recognition of these medical entities from medical texts is key to developing medical informatics. In the field of Chinese medical information extraction, annotated Chinese medical text data are very scarce; in the named entity recognition task, this shortage of labeled data leads to low model recognition performance. Therefore, this paper proposes a Chinese medical entity recognition model based on multi-neural-network fusion and an improved Tri-Training algorithm. The model performs semi-supervised learning via the improved Tri-Training algorithm. Guided by the characteristics of the medical entity recognition task and of medical data, the method in this paper is improved in terms of the division of the initial sub-training sets, the construction of the base classifiers, and the integration of the voting method into learning. In addition, this paper proposes a multi-neural-network fusion entity recognition model for base classifier construction. The model learns feature information jointly by combining an Iterated Dilated Convolutional Neural Network (IDCNN) with BiLSTM. Experimental verification shows that the model proposed in this paper outperforms other models and improves the performance of Chinese medical entity recognition by incorporating and improving the semi-supervised learning algorithm. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
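The Tri-Training vote rule underlying entry 14's semi-supervised step, pseudo-label a sample for one classifier when the other two agree on it, can be sketched as below. The classifier callables and the toy demo are assumptions; the paper's improvements to sub-training-set division and base-classifier construction are not modeled:

```python
def tri_training_round(clfs, unlabeled):
    """One Tri-Training labeling pass over three classifiers: for each
    unlabeled sample, if the other two classifiers agree on a label,
    the sample is pseudo-labeled for the third classifier's next
    training set (skeleton of the agreement rule only)."""
    new_data = {0: [], 1: [], 2: []}
    for x in unlabeled:
        preds = [clf(x) for clf in clfs]
        for i in range(3):
            j, k = (i + 1) % 3, (i + 2) % 3
            if preds[j] == preds[k]:
                new_data[i].append((x, preds[j]))
    return new_data

# Three toy threshold classifiers; on these samples all three agree,
# so every classifier receives both pseudo-labeled examples.
clfs = [lambda x: x > 0, lambda x: x > 1, lambda x: x > -1]
out = tri_training_round(clfs, [5, -5])
print(out[0])  # [(5, True), (-5, False)]
```

When one classifier dissents, only it receives the sample, which is how Tri-Training lets the majority correct the minority.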
15. Analysis of Images, Social Networks and Texts : 9th International Conference, AIST 2020, Skolkovo, Moscow, Russia, October 15–16, 2020, Revised Selected Papers
- Author
-
Wil M. P. van der Aalst, Vladimir Batagelj, Dmitry I. Ignatov, Michael Khachay, Olessia Koltsova, Andrey Kutuzov, Sergei O. Kuznetsov, Irina A. Lomazova, Natalia Loukachevitch, Amedeo Napoli, Alexander Panchenko, Panos M. Pardalos, Marcello Pelillo, Andrey V. Savchenko, and Elena Tutubalina
- Subjects
- Data mining, Machine learning, Natural language processing (Computer science), Database management, Algorithms
- Abstract
This book constitutes revised selected papers from the 9th International Conference on Analysis of Images, Social Networks and Texts, AIST 2020, held during October 15-16, 2020. The conference was planned to take place in Moscow, Russia, but changed to an online format due to the COVID-19 pandemic. The 27 full papers and 4 short papers presented in this volume were carefully reviewed and selected from a total of 108 qualified submissions. The papers are organized in topical sections as follows: invited papers; natural language processing; computer vision; social network analysis; data analysis and machine learning; theoretical machine learning and optimization; and process mining.
- Published
- 2021
16. Machine Learning, Optimization, and Data Science : 4th International Conference, LOD 2018, Volterra, Italy, September 13-16, 2018, Revised Selected Papers
- Author
-
Giuseppe Nicosia, Panos Pardalos, Giovanni Giuffrida, Renato Umeton, and Vincenzo Sciacca
- Subjects
- Application software, Machine learning, Algorithms, Data mining
- Abstract
This book constitutes the post-conference proceedings of the 4th International Conference on Machine Learning, Optimization, and Data Science, LOD 2018, held in Volterra, Italy, in September 2018. The 46 full papers presented were carefully reviewed and selected from 126 submissions. The papers cover topics in the field of machine learning, artificial intelligence, reinforcement learning, computational optimization and data science, presenting a substantial array of ideas, technologies, algorithms, methods and applications.
- Published
- 2019
17. Systematic review of content analysis algorithms based on deep neural networks.
- Author
-
Rezaeenour, Jalal, Ahmadi, Mahnaz, Jelodar, Hamed, and Shahrooei, Roshan
- Subjects
ARTIFICIAL neural networks ,DEEP learning ,MACHINE learning ,INFORMATION technology ,NATURAL language processing ,ALGORITHMS - Abstract
Today, owing to social media, the internet, etc., data are produced rapidly and occupy a large space in systems, resulting in enormous data warehouses; progress in information technology has significantly increased the speed and ease of data flow. Text mining is one of the most important methods for extracting a useful model by extracting and adapting knowledge from data sets, and many studies have been conducted on the use of deep learning for text processing and text mining problems. Text mining seeks to extract useful information from unstructured textual data and is widely used today. Deep learning and machine learning techniques for classification and text mining, and their types, are discussed in this paper as well. Neural networks of various kinds, namely ANN, RNN, CNN, and LSTM, are studied in order to select the best technique. In this study, we conducted a Systematic Literature Review to extract and associate the algorithms and features that have been used in this area. Based on our search criteria, we retrieved 130 relevant studies from electronic databases published between 1997 and 2021, and selected 43 studies for further analysis using the inclusion and exclusion criteria in Section 3.2. According to this study, hybrid LSTM is the most widely used deep learning algorithm in these studies, while among machine learning methods SVM shows high accuracy in the results. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
18. High-Dimensional Data Analysis Using Parameter Free Algorithm Data Point Positioning Analysis.
- Author
-
Mustapha, S. M. F. D. Syed
- Subjects
PATTERN recognition systems ,DATA analysis ,K-means clustering ,DATA mining ,ALGORITHMS ,MACHINE learning ,HIGH-dimensional model representation - Abstract
Clustering is an effective statistical data analysis technique; it has several applications, including data mining, pattern recognition, image analysis, bioinformatics, and machine learning. Clustering helps to partition data into groups of objects with distinct characteristics. Most methods for clustering use manually selected parameters to find the clusters in a dataset. Consequently, extracting the optimal parameters for clustering a dataset can be very challenging and time-consuming. Moreover, some clustering methods are inadequate for locating clusters in high-dimensional data. To address these concerns systematically, this paper introduces a novel selection-free clustering technique named data point positioning analysis (DPPA). The proposed method is straightforward, since it calculates 1-NN and Max-NN by analyzing the data point placements without requiring an initial manual parameter assignment. The method is validated using two well-known publicly available datasets used by several clustering algorithms. To compare the performance of the proposed method, this study also investigated four popular clustering algorithms (DBSCAN, affinity propagation, Mean Shift, and K-means); the proposed method achieves higher performance in finding the clusters without using any manually selected parameters. The experimental findings demonstrated that the proposed DPPA algorithm is less time-consuming than existing traditional methods and achieves higher performance without using any manually selected parameters. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
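DPPA's two parameter-free quantities, 1-NN and Max-NN, are each point's nearest- and farthest-neighbour distances, which are straightforward to compute. How DPPA then turns these values into clusters is not reproduced here, and the sample points are invented:

```python
import math

def nn_distances(points):
    """For each point, return (1-NN distance, Max-NN distance): the
    Euclidean distances to its nearest and farthest neighbours, the
    two quantities the DPPA abstract derives from data point
    placements without any manually selected parameter."""
    out = []
    for i, p in enumerate(points):
        ds = [math.dist(p, q) for j, q in enumerate(points) if j != i]
        out.append((min(ds), max(ds)))
    return out

# Two nearby points and one distant point: the first point's nearest
# neighbour is 1 away, its farthest 10 away.
pts = [(0, 0), (0, 1), (10, 0)]
print(nn_distances(pts)[0])  # (1.0, 10.0)
```

This brute-force version is O(n²); a practical implementation would use a spatial index, but the quantities themselves are this simple.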
19. The Potential of MicroRNAs as Non-Invasive Prostate Cancer Biomarkers: A Systematic Literature Review Based on a Machine Learning Approach.
- Author
-
Bevacqua, Emilia, Ammirato, Salvatore, Cione, Erika, Curcio, Rosita, Dolce, Vincenza, and Tucci, Paola
- Subjects
PROSTATE tumors treatment ,DISEASE progression ,SYSTEMATIC reviews ,MICRORNA ,EARLY detection of cancer ,MACHINE learning ,TUMOR markers ,SOFTWARE analytics ,PROSTATE tumors ,DATA mining ,ALGORITHMS - Abstract
Simple Summary: Prostate cancer (PCa) is the most common cancer in men worldwide. Screening and diagnosis are based on prostate-specific antigen (PSA) blood testing and digital rectal examination. Nevertheless, these methods are not specific and have a high risk of mistaken results. This has led to overtreatment and unnecessary radical therapy; thus, better prognostic tools are urgently needed. In this view, microRNAs (miRs) appear as potential non-invasive biomarkers for PCa diagnosis, prognosis, and therapy. As the scientific literature available in this field is huge and very often controversial, we identified and discussed three topics that characterize the investigated research area by combining the big data from the literature with a novel machine learning approach. By analyzing the papers clustered into these topics, we offer a deeper understanding of the current research, which helps to contribute to the advancement of this research field. Background: Prostate cancer (PCa) is the second leading cause of cancer-related deaths in men. Although the prostate-specific antigen (PSA) test is used in clinical practice for screening and/or early detection of PCa, it is not specific, thus resulting in high false-positive rates. MicroRNAs (miRs) provide an opportunity as biomarkers for diagnosis, prognosis, and recurrence of PCa. Because this literature is growing and its findings are often controversial, this study aims to consolidate the state of the art of the relevant published research. Methods: A Systematic Literature Review (SLR) approach was applied to analyze a set of 213 scientific publications through a text mining method that makes use of the Latent Dirichlet Allocation (LDA) algorithm. Results and Conclusions: The result of this activity, performed through the MySLR digital platform, allowed us to identify a set of three relevant topics characterizing the investigated research area. We analyzed and discussed all the papers clustered into them. We highlighted that several miRs are associated with PCa progression, and that their detection in patients' urine seems to be the most reliable and promising non-invasive tool for PCa diagnosis. Finally, we proposed some future research directions to help future scientists advance the field further. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
20. A differential privacy protecting K-means clustering algorithm based on contour coefficients.
- Author
-
Zhang, Yaling, Liu, Na, and Wang, Shangping
- Subjects
K-means clustering ,INFORMATION storage & retrieval systems ,MACHINE learning ,COMPUTER algorithms ,DATA analysis - Abstract
This paper realizes privacy protection in the K-means clustering algorithm by adding data-disturbing Laplace noise to the cluster center points, based on differential privacy. To solve the problem that the randomness of the Laplace noise causes the center points to deviate, especially when the clustering results have poor availability under small privacy budget parameters, an improved differentially private K-means clustering algorithm is proposed in this paper. The improved algorithm uses contour (silhouette) coefficients to quantitatively evaluate the clustering effect of each iteration and adds different noise to different clusters. To accommodate very large amounts of data, this paper also provides an algorithm design in the MapReduce framework. Experimental findings show that the new algorithm improves the availability of the clustering results while ensuring individual privacy, without significantly increasing its running time. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
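The core mechanism in entry 20, perturbing cluster centers with Laplace noise whose scale depends on the privacy budget, can be sketched as follows. Sensitivity is fixed at 1 for simplicity, the per-cluster silhouette-based noise adjustment from the paper is omitted, and `noisy_centroid` is an illustrative name:

```python
import math
import random

def laplace(scale, rng):
    """Sample Laplace(0, scale) by inverse CDF of a uniform draw."""
    u = rng.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def noisy_centroid(cluster, epsilon, rng):
    """Mean of the cluster with Laplace(1/epsilon) noise added to each
    coordinate -- the differential-privacy step applied to K-means
    center points (sensitivity taken as 1 for this sketch). Smaller
    epsilon (stronger privacy) means larger noise, which is exactly
    the deviation problem the paper's improvement targets."""
    n = len(cluster)
    return [sum(p[d] for p in cluster) / n + laplace(1.0 / epsilon, rng)
            for d in range(len(cluster[0]))]

# True centroid of the two points is (1.0, 1.0); the published center
# is that mean plus calibrated noise.
rng = random.Random(42)
print(noisy_centroid([(0.0, 0.0), (2.0, 2.0)], epsilon=5.0, rng=rng))
```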
21. Digital‐first assessments: A security framework.
- Author
-
LaFlair, Geoffrey T., Langenfeld, Thomas, Baig, Basim, Horie, André Kenji, Attali, Yigal, and von Davier, Alina A.
- Subjects
NATIONAL competency-based educational tests ,COMPUTER software ,ENGLISH language ,RESEARCH evaluation ,DIGITAL technology ,MACHINE learning ,LEARNING ,ENGINEERING ,PSYCHOMETRICS ,DATA security ,AUTOMATION ,QUALITY assurance ,CERTIFICATION ,PROFESSIONAL licensure examinations ,DATA mining ,ALGORITHMS - Abstract
Background: Digital‐first assessments leverage the affordances of technology in all elements of the assessment process: from design and development to score reporting and evaluation to create test taker‐centric assessments. Objectives: The goal of this paper is to describe the engineering, machine learning, and psychometric processes and technologies of a test security framework (part of a larger ecosystem; Burstein et al., 2021) that can be used to create systems that protect the integrity of test scores. Methods: We use the Duolingo English Test to exemplify the processes and technologies that are presented. This includes methods for actively detecting and deterring malicious behaviour (e.g., a custom desktop app). It also includes methods for passively detecting and deterring malicious behaviour (e.g., a large item bank created through automatic generation methods). We describe the certification process that each test administration undergoes, which includes both automated and human review. Additionally, we describe our quality assurance dashboard which leverages psychometric data mining techniques to monitor test quality and inform decisions about item pool maintenance. Results and Conclusions: As assessment developers transition to online delivery and to a design approach that places the test taker at the centre, it becomes increasingly important to take advantage of the tools and methodological advances in different fields (e.g., engineering, machine learning, psychometrics). These tools and methods are essential to maintaining the security of assessments so that the score reliability is sustained and the interpretations and uses of test scores remain valid. 
Lay Description: What is known about this topic?: As more and more testing programmes transition to test taker‐centric administrations, effective measures to prevent cheating and protect content are critical to ensure the validity and integrity of scores. Two of the most common forms of cheating in online testing are (a) having someone other than the person who has registered take the test, and (b) stealing content and providing it to others to assist them in achieving a higher score. What does this paper add?: In designing a test taker‐centric digital‐first assessment, a test security framework must inform decisions from end‐to‐end (i.e., registration, onboarding, communications regarding test taker behaviours, test preparation and practice, test administration, and post‐administrative activities including scoring). Security is safeguarded through active and passive design methods; active methods include having test takers attest that they will follow the rules governing testing and informing them that they will be videoed during testing; passive methods include a computer adaptive design that limits item exposure and test overlap rates, development of a large item pool using automated item generation, and applying artificial intelligence to review test administration videos to flag unauthorized behaviours for human review. Implications: With more educational and assessment programmes transitioning to online digital models, the paper presents a comprehensive review of security issues and identifies an integrated approach for preventing cheating and other unauthorized behaviours. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
22. How Recommendation Algorithms Know What You'll Like.
- Author
-
Vitanova, Mirjana Kocaleva, Miteva, Marija, Gelova, Elena Karamazova, and Zlatanovska, Biljana
- Subjects
ALGORITHMS ,INTERNET stores ,DATA mining ,MACHINE learning ,ONLINE shopping ,ONLINE algorithms - Abstract
One of the most widely used statistical techniques, drawing on machine learning and data mining to predict future outcomes from data that already exist, is the predictive algorithm. Predictive models are not stable; they build assumptions based on past and present actions. In this paper we introduce the Amazon online store and examine how its algorithms know what we like, so that they can recommend products to us on their own. One of the biggest innovations in online shopping - first introduced by Amazon - was automatic recommendation generation. The more accurate prediction algorithms are, the more products online stores will sell. For that reason, prediction algorithms are of great significance for online stores. [ABSTRACT FROM AUTHOR]
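The abstract does not spell out the mechanism, but Amazon's published recommendation approach is item-to-item collaborative filtering: items bought by many of the same users are deemed similar, and a user is shown items similar to what they already own. A minimal sketch, with a hypothetical toy purchase log and plain cosine similarity over co-purchase sets:

```python
from collections import defaultdict
from math import sqrt

def item_similarities(purchases):
    """Cosine similarity between items, from a user -> {items bought} mapping."""
    buyers = defaultdict(set)                 # item -> set of users who bought it
    for user, items in purchases.items():
        for item in items:
            buyers[item].add(user)
    sims, items = {}, list(buyers)
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            overlap = len(buyers[a] & buyers[b])
            if overlap:
                s = overlap / sqrt(len(buyers[a]) * len(buyers[b]))
                sims[(a, b)] = sims[(b, a)] = s
    return sims

def recommend(user, purchases, sims, k=3):
    """Score unseen items by summed similarity to the user's purchases."""
    owned = purchases[user]
    scores = defaultdict(float)
    for item in owned:
        for (a, b), s in sims.items():
            if a == item and b not in owned:
                scores[b] += s
    return sorted(scores, key=scores.get, reverse=True)[:k]

purchases = {
    "u1": {"book", "lamp"},
    "u2": {"book", "lamp", "desk"},
    "u3": {"book", "desk"},
    "u4": {"lamp"},
}
print(recommend("u1", purchases, item_similarities(purchases)))  # → ['desk']
```

Here "u1" is recommended "desk" because it co-occurs with both of u1's purchases; production systems precompute the similarity table offline, which is what makes the approach scale.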
- Published
- 2023
23. Classification Accuracy of Hepatitis C Virus Infection Outcome: Data Mining Approach
- Author
-
Mario Frias, Mohamed Al-Twijri, Habib M. Fardoun, Jose M. Moyano, Antonio Rivero-Juárez, Angela Camacho, José María Luna, Isabel Machuca, Sebastián Ventura, and Antonio Rivero
- Subjects
Computer science ,Hepatitis C virus ,Decision tree ,Health Informatics ,02 engineering and technology ,Hepacivirus ,PART ,lcsh:Computer applications to medicine. Medical informatics ,medicine.disease_cause ,Machine learning ,computer.software_genre ,Outcome (game theory) ,HIV/HCV ,03 medical and health sciences ,classification accuracy ,0202 electrical engineering, electronic engineering, information engineering ,medicine ,Data Mining ,Humans ,Biomedicine ,030304 developmental biology ,Original Paper ,0303 health sciences ,Ensemble forecasting ,business.industry ,lcsh:Public aspects of medicine ,ensemble ,lcsh:RA1-1270 ,ComputingMethodologies_PATTERNRECOGNITION ,Patient classification ,lcsh:R858-859.7 ,020201 artificial intelligence & image processing ,Artificial intelligence ,Outcome data ,business ,computer ,Algorithms - Abstract
Background The dataset from genes used to predict hepatitis C virus outcome was evaluated in a previous study using a conventional statistical methodology. Objective The aim of this study was to reanalyze this same dataset using the data mining approach in order to find models that improve the classification accuracy of the genes studied. Methods We built predictive models using different subsets of factors, selected according to their importance in predicting patient classification. We then evaluated each independent model and also a combination of them, leading to a better predictive model. Results Our data mining approach identified genetic patterns that escaped detection using conventional statistics. More specifically, the partial decision trees and ensemble models increased the classification accuracy of hepatitis C virus outcome compared with conventional methods. Conclusions Data mining can be used more extensively in biomedicine, facilitating knowledge building and management of human diseases.
- Published
- 2021
24. STANFORD MACHINE LEARNING ALGORITHM PREDICTS BIOLOGICAL STRUCTURES MORE ACCURATELY THAN EVER BEFORE
- Subjects
Machine learning ,Data mining ,Algorithms ,Data warehousing/data mining ,Algorithm ,News, opinion and commentary ,Stanford University - Abstract
STANFORD, Calif. -- The following information was released by Stanford University: Stanford researchers develop machine learning methods that accurately predict the 3D shapes of drug targets and other important biological [...]
- Published
- 2021
25. Automated Categorization of Systemic Disease and Duration From Electronic Medical Record System Data Using Finite-State Machine Modeling: Prospective Validation Study
- Author
-
Ayush Deva, Gumpili Sai Prashanthi, Ranganath Vadapalli, and Anthony Vipin Das
- Subjects
020205 medical informatics ,Computer science ,data analysis ,Big data ,Medicine (miscellaneous) ,lcsh:Medicine ,Health Informatics ,02 engineering and technology ,algorithms ,computer.software_genre ,03 medical and health sciences ,0302 clinical medicine ,Health care ,0202 electrical engineering, electronic engineering, information engineering ,Medical history ,030212 general & internal medicine ,Duration (project management) ,Original Paper ,Past medical history ,Finite-state machine ,business.industry ,lcsh:R ,Unstructured data ,Computer Science Applications ,ophthalmology ,electronic health records ,machine learning ,Analytics ,Data mining ,business ,computer - Abstract
Background One of the major challenges in the health care sector is that approximately 80% of generated data remains unstructured and unused. Since it is difficult to handle unstructured data from electronic medical record systems, it tends to be neglected for analyses in most hospitals and medical centers. Therefore, there is a need to analyze unstructured big data in health care systems so that we can optimally utilize and unearth all unexploited information from it. Objective In this study, we aimed to extract a list of diseases and associated keywords along with the corresponding time durations from an indigenously developed electronic medical record system and describe the possibility of analytics from the acquired datasets. Methods We propose a novel, finite-state machine to sequentially detect and cluster disease names from patients’ medical history. We defined 3 states in the finite-state machine and transition matrix, which depend on the identified keyword. In addition, we also defined a state-change action matrix, which is essentially an action associated with each transition. The dataset used in this study was obtained from an indigenously developed electronic medical record system called eyeSmart that was implemented across a large, multitier ophthalmology network in India. The dataset included patients’ past medical history and contained records of 10,000 distinct patients. Results We extracted disease names and associated keywords by using the finite-state machine with an accuracy of 95%, sensitivity of 94.9%, and positive predictive value of 100%. For the extraction of the duration of disease, the machine’s accuracy was 93%, sensitivity was 92.9%, and the positive predictive value was 100%. Conclusions We demonstrated that the finite-state machine we developed in this study can be used to accurately identify disease names, associated keywords, and time durations from a large cohort of patient records obtained using an electronic medical record system.
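The abstract describes a three-state machine whose transitions are keyed to the class of the current keyword, with an action attached to each transition. The actual states, keyword lists, and actions of the eyeSmart system are not given; the sketch below uses hypothetical disease and unit vocabularies purely to illustrate the pattern:

```python
import re

DISEASES = {"diabetes", "hypertension", "asthma"}   # hypothetical keyword list
UNITS = {"year", "years", "month", "months"}        # hypothetical duration units

def extract(history):
    """Scan tokens with a 3-state FSM: SCAN -> DISEASE -> DURATION -> SCAN."""
    state, found, current, number = "SCAN", [], None, None
    for token in re.findall(r"[a-z0-9]+", history.lower()):
        if state == "SCAN" and token in DISEASES:
            state, current = "DISEASE", token              # action: open a record
        elif state == "DISEASE" and token.isdigit():
            state, number = "DURATION", int(token)         # action: hold the number
        elif state == "DURATION" and token in UNITS:
            found.append((current, f"{number} {token}"))   # action: emit record
            state, current, number = "SCAN", None, None
        elif token in DISEASES:                            # new disease resets
            state, current, number = "DISEASE", token, None
    if current and state == "DISEASE":
        found.append((current, None))                      # disease, no duration
    return found

print(extract("Known case of diabetes since 5 years, also hypertension"))
# → [('diabetes', '5 years'), ('hypertension', None)]
```

The transition and action matrices of the paper correspond to the `if/elif` branches here; a table-driven version would store them as dicts keyed by (state, token class).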
- Published
- 2020
26. Hybrid Clustering Algorithm Based on Improved Density Peak Clustering.
- Author
-
Guo, Limin, Qin, Weijia, Cai, Zhi, and Su, Xing
- Subjects
MACHINE learning ,DENSITY ,ALGORITHMS ,BIG data - Abstract
In the era of big data, unsupervised learning algorithms such as clustering are particularly prominent, and recent years have seen significant advances in clustering research. Clustering by Fast Search and Find of Density Peaks (density peak clustering, DPC), proposed in Science in 2014, automatically finds cluster centers; it is simple and efficient, does not require iterative computation, and is suitable for large-scale, high-dimensional data. However, DPC and most of its refinements have several drawbacks. The method primarily considers the overall structure of the data, often overlooking many clusters. The choice of truncation distance affects the calculation of local density values, and datasets of different sizes may require different computational methods, which affects the quality of the clustering results. In addition, the initial assignment of labels can cause a 'chain reaction': if one data point is incorrectly labeled, subsequent data points may be incorrectly labeled as well. In this paper, we propose an improved density peak clustering method, DPC-MS, which uses the mean-shift algorithm to find local density extremes, making the accuracy of the algorithm independent of the parameter dc. After finding the local density extreme points, the allocation strategy of the DPC algorithm is employed to assign the remaining points to appropriate local density extreme points, forming the final clusters. The robustness of this method to uncertain dataset sizes adds practical value. Several experiments were conducted on synthetic and real datasets to evaluate the performance of the proposed method; the results show that it outperforms some of the more recent methods in most cases. [ABSTRACT FROM AUTHOR]
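The DPC allocation strategy the abstract reuses can be sketched compactly: points are processed in order of decreasing local density, and each point inherits the cluster of its nearest denser neighbour unless that neighbour is far away, in which case it seeds a new cluster. The mean-shift peak search of DPC-MS is omitted here, and local density is a simple cutoff-radius count, an assumption for illustration only:

```python
from math import dist

def dpc_assign(points, radius=1.5, center_gap=3.0):
    # Local density = count of points within the cutoff radius (a simplification;
    # DPC-MS replaces this step with mean-shift density extremes).
    density = [sum(dist(p, q) < radius for q in points) for p in points]
    order = sorted(range(len(points)), key=lambda i: -density[i])  # densest first
    labels, next_label = [None] * len(points), 0
    for k, i in enumerate(order):
        if k == 0:
            labels[i], next_label = next_label, next_label + 1
            continue
        # Nearest point that is at least as dense (ties broken by processing order).
        j = min(order[:k], key=lambda j: dist(points[i], points[j]))
        if dist(points[i], points[j]) > center_gap:   # far from every peak
            labels[i], next_label = next_label, next_label + 1
        else:
            labels[i] = labels[j]                     # inherit the peak's cluster
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(dpc_assign(pts))  # → [0, 0, 0, 1, 1, 1]
```

The single-pass assignment is why a wrong early label can trigger the 'chain reaction' the abstract mentions: every later point that chains through it inherits the error.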
- Published
- 2024
- Full Text
- View/download PDF
27. A STUDY ON OPTIMIZING ERROR DETECTION AND CORRECTION STRATEGIES IN PHYSICAL EDUCATION AND SPORT TEACHING USING DATA MINING ALGORITHMS.
- Author
-
ZIYAO GAO, SHENGFEI HU, GUO YU, and YINHUI LI
- Subjects
DATA mining ,PHYSICAL education ,PHYSICAL training & conditioning ,MACHINE learning ,ALGORITHMS ,SPORTS psychology - Abstract
In the fiercely competitive realm of sports and physical education, the application of data mining algorithms has emerged as a vital solution. Machine learning has streamlined processes, offering a seamless means of elevating the quality of education and training provided to students, particularly in the context of sports. This technological support empowers the sports education system to make more informed decisions pertaining to the physical development of aspiring athletes. In this comprehensive study, a blended approach of qualitative methods has been leveraged to gather intricate insights, enriching the overall understanding of the subject. Additionally, an in-depth exploration of articles and journals has been undertaken to scrutinize the practical implementation of data algorithm techniques geared towards enhancing physical training. The resultant findings underscore a substantial and tangible nexus between data algorithms and the domain of sports education. Of paramount significance is the central role played by data mining algorithms in augmenting performance. Notably, the National Sports Board (NSB) has extensively harnessed this technology to meticulously monitor players' on-field performance, ultimately leading to a granular comprehension of each player's capabilities. This paper emphasizes methods for optimizing error detection and the correction strategies that accompany it in operational teaching procedures. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
28. Algorithm selection using edge ML and case-based reasoning.
- Author
-
Ali, Rahman, Zada, Muhammad Sadiq Hassan, Khatak, Asad Masood, and Hussain, Jamil
- Subjects
CASE-based reasoning ,DECISION trees ,CLASSIFICATION algorithms ,MACHINE learning ,ALGORITHMS ,DATA mining ,EMPIRICAL research ,FEATURE extraction - Abstract
In practical data mining, a wide range of classification algorithms is employed for prediction tasks. However, selecting the best algorithm poses a challenging task for machine learning practitioners and experts, primarily due to the inherent variability in the characteristics of classification problems, referred to as datasets, and the unpredictable performance of these algorithms. Dataset characteristics are quantified in terms of meta-features, while classifier performance is evaluated using various performance metrics. The assessment of classifiers through empirical methods across multiple classification datasets, while considering multiple performance metrics, presents a computationally expensive and time-consuming obstacle in the pursuit of selecting the optimal algorithm. Furthermore, the scarcity of sufficient training data, denoted by dimensions representing the number of datasets and the feature space described by meta-feature perspectives, adds further complexity to the process of algorithm selection using classical machine learning methods. This research paper presents an integrated framework called eML-CBR that combines edge ML and case-based reasoning methodologies to accurately address the algorithm selection problem. It adapts a multi-level, multi-view case-based reasoning methodology, considering data from diverse feature dimensions and the algorithms from multiple performance aspects, that distributes computations to both cloud edges and centralized nodes. On the edge, the first-level reasoning employs machine learning methods to recommend a family of classification algorithms, while at the second level, it recommends a list of the top-k algorithms within that family. This list is further refined by an algorithm conflict resolver module. 
The eML-CBR framework offers a suite of contributions, including integrated algorithm selection, multi-view meta-feature extraction, innovative performance criteria, improved algorithm recommendation, data scarcity mitigation through incremental learning, and an open-source CBR module, reshaping research paradigms. The CBR module, trained on 100 datasets and tested with 52 datasets using 9 decision tree algorithms, achieved an accuracy of 94% for correct classifier recommendations within the top k=3 algorithms, making it highly suitable for practical classification applications. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
29. Standardising Breast Radiotherapy Structure Naming Conventions: A Machine Learning Approach.
- Author
-
Haidar, Ali, Field, Matthew, Batumalai, Vikneswary, Cloak, Kirrily, Al Mouiee, Daniel, Chlap, Phillip, Huang, Xiaoshui, Chin, Vicky, Aly, Farhannah, Carolan, Martin, Sykes, Jonathan, Vinod, Shalini K., Delaney, Geoffrey P., and Holloway, Lois
- Subjects
SPECIALTY hospitals ,HUMAN body ,MACHINE learning ,RETROSPECTIVE studies ,ARTIFICIAL intelligence ,CANCER treatment ,TERMS & phrases ,RESEARCH funding ,RADIOTHERAPY ,DATA analysis ,ARTIFICIAL neural networks ,RECEIVER operating characteristic curves ,THREE-dimensional printing ,BREAST tumors ,ONCOLOGY ,ALGORITHMS ,LONGITUDINAL method ,RADIATION dosimetry ,DATA mining - Abstract
Simple Summary: In radiotherapy treatment, organs at risk and target volumes are contoured by the clinicians to prepare a dosimetry plan. In retrospective data, these structures are not often standardised to universal names across patients' plans, which is required to enable data mining and analysis. In this paper, a new method was proposed and evaluated to automatically standardise radiotherapy structure names using machine learning algorithms. The proposed approach was deployed over a dataset with 1613 patients collected from Liverpool & Macarthur Cancer Therapy Centres, New South Wales, Australia. It was concluded that machine learning techniques can standardise the dosimetry plan structures, taking into consideration the integration of multiple modalities representing each structure during the training process. In progressing the use of big data in health systems, standardised nomenclature is required to enable data pooling and analyses. In many radiotherapy planning systems and their data archives, target volumes (TV) and organ-at-risk (OAR) structure nomenclature has not been standardised. Machine learning (ML) has been utilised to standardise volumes nomenclature in retrospective datasets. However, only subsets of the structures have been targeted. Within this paper, we proposed a new approach for standardising all the structures nomenclature by using multi-modal artificial neural networks. A cohort consisting of 1613 breast cancer patients treated with radiotherapy was identified from Liverpool & Macarthur Cancer Therapy Centres, NSW, Australia. Four types of volume characteristics were generated to represent each target and OAR volume: textual features, geometric features, dosimetry features, and imaging data. Five datasets were created from the original cohort, the first four represented different subsets of volumes and the last one represented the whole list of volumes. 
For each dataset, 15 sets of combinations of features were generated to investigate the effect of using different characteristics on the standardisation performance. The best model reported 99.416% classification accuracy over the hold-out sample when used to standardise all the nomenclatures in a breast cancer radiotherapy plan into 21 classes. Our results showed that ML based automation methods can be used for standardising naming conventions in a radiotherapy plan taking into consideration the inclusion of multiple modalities to better represent each volume. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
30. Sipros Ensemble improves database searching and filtering for complex metaproteomics
- Author
-
Qiuming Yao, Jimmy K. Eng, William Judson Hervey IV, David L. Tabb, Zhou Li, Chongle Pan, Xuan Guo, and Ryan S. Mueller
- Subjects
0301 basic medicine ,Statistics and Probability ,Proteomics ,Computer science ,Peptide ,Machine learning ,computer.software_genre ,Biochemistry ,03 medical and health sciences ,Software ,Comet (programming) ,Molecular Biology ,chemistry.chemical_classification ,Database ,business.industry ,Systems Biology ,Microbiota ,Filter (signal processing) ,Original Papers ,Computer Science Applications ,Search Engine ,Computational Mathematics ,Identification (information) ,030104 developmental biology ,Computational Theory and Mathematics ,chemistry ,Metagenomics ,Metaproteomics ,Artificial intelligence ,Data mining ,business ,computer ,Algorithms - Abstract
Motivation Complex microbial communities can be characterized by metagenomics and metaproteomics. However, metagenome assemblies often generate enormous, and yet incomplete, protein databases, which undermines the identification of peptides and proteins in metaproteomics. This challenge calls for increased discrimination of true identifications from false identifications by database searching and filtering algorithms in metaproteomics. Results Sipros Ensemble was developed here for metaproteomics using an ensemble approach. Three diverse scoring functions from MyriMatch, Comet and the original Sipros were incorporated within a single database searching engine. Supervised classification with logistic regression was used to filter database searching results. Benchmarking with soil and marine microbial communities demonstrated a higher number of peptide and protein identifications by Sipros Ensemble than MyriMatch/Percolator, Comet/Percolator, MS-GF+/Percolator, Comet & MyriMatch/iProphet and Comet & MyriMatch & MS-GF+/iProphet. Sipros Ensemble was computationally efficient and scalable on supercomputers. Availability and implementation Freely available under the GNU GPL license at http://sipros.omicsbio.org. Supplementary information Supplementary data are available at Bioinformatics online.
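The filtering step the abstract describes, supervised classification with logistic regression over the scores from several search engines, can be sketched in miniature. The feature columns and values below are synthetic stand-ins, not the actual Sipros Ensemble features; labels 1/0 play the role of target vs. decoy matches:

```python
from math import exp

def train(X, y, lr=0.5, epochs=2000):
    """Per-sample gradient descent on logistic log-loss; last weight is the bias."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = w[-1] + sum(wj * xj for wj, xj in zip(w, xi))
            g = 1 / (1 + exp(-z)) - yi            # gradient of the log-loss
            for j, xj in enumerate(xi):
                w[j] -= lr * g * xj
            w[-1] -= lr * g
    return w

def predict(w, xi):
    z = w[-1] + sum(wj * xj for wj, xj in zip(w, xi))
    return 1 / (1 + exp(-z))

# columns: MyriMatch-like, Comet-like, Sipros-like scores (synthetic values)
X = [[0.9, 0.8, 0.7], [0.8, 0.9, 0.8], [0.2, 0.3, 0.1], [0.3, 0.1, 0.2]]
y = [1, 1, 0, 0]
w = train(X, y)
print([round(predict(w, xi)) for xi in X])  # → [1, 1, 0, 0]
```

In the real pipeline the fitted probabilities are thresholded to control the false discovery rate estimated from decoys, rather than simply rounded.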
- Published
- 2017
31. BUILDING DICTIONARIES FOR MACHINE LEARNING FROM SPARSE DATA
- Subjects
Data mining ,Algorithms ,Machine learning ,Cakes ,Decision making ,Libraries ,Encyclopedias and dictionaries ,Grocery stores ,Data warehousing/data mining ,Algorithm ,News, opinion and commentary ,Duke University - Abstract
DURHAM, N.C. -- The following information was released by Pratt School of Engineering at Duke University: Guillermo Sapiro's paper laying the foundations of modern machine learning earns a 'Test of [...]
- Published
- 2019
32. Data from Tongji University Broaden Understanding of Machine Learning (Risk prediction for cut-ins using multi-driver simulation data and machine learning algorithms: A comparison among decision tree, GBDT and LSTM)
- Subjects
Machine learning ,Data mining ,Physical fitness ,Algorithms ,Data warehousing/data mining ,Algorithm ,Health - Abstract
2023 SEP 16 (NewsRx) -- By a News Reporter-Staff News Editor at Obesity, Fitness & Wellness Week -- Fresh data on artificial intelligence are presented in a new report. According [...]
- Published
- 2023
33. A reconstruction method for cross-cut shredded documents based on the extreme learning machine algorithm.
- Author
-
Zhang, Zhenghui, Zou, Juan, Yang, Shengxiang, Zheng, Jinhua, Gong, Dunwei, and Pei, Tingrui
- Subjects
MACHINE learning ,DATA mining ,INFORMATION technology security ,DISTRIBUTED algorithms ,ALGORITHMS ,COMPUTER assisted language instruction - Abstract
Reconstruction of cross-cut shredded text documents (RCCSTD) has important applications for information security and judicial evidence collection. The traditional method of manual construction is a very time-consuming task, so the use of computer-assisted efficient reconstruction is a crucial research topic. Fragment consensus information extraction and fragment pair compatibility measurement are two fundamental processes in RCCSTD. Due to the limitations of the existing classical methods of these two steps, only documents with specific structures or characteristics can be spliced, and pairing error is larger when the cutting is more fine-grained. In order to reconstruct the fragments more effectively, this paper improves the extraction method for consensus information and constructs a new global pairwise compatibility measurement model based on the extreme learning machine algorithm. The purpose of the algorithm's design is to exploit all available information and computationally suggest matches to increase the algorithm's ability to discriminate between data in various complex situations, then find the best neighbor of each fragment for splicing according to pairwise compatibility. The overall performance of our approach in several practical experiments is illustrated. The results indicate that the matching accuracy of the proposed algorithm is better than that of the previously published classical algorithms and still ensures a higher matching accuracy in the noisy datasets, which can provide a feasible method for RCCSTD intelligent systems in real scenarios. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
34. An Adaptive Bandwidth Management Algorithm for Next-Generation Vehicular Networks.
- Author
-
Huang, Chenn-Jung, Hu, Kai-Wen, and Cheng, Hao-Wen
- Subjects
TERAHERTZ technology ,BANDWIDTH allocation ,IN-vehicle entertainment equipment ,OPTICAL communications ,BANDWIDTHS ,ALGORITHMS ,VEHICULAR ad hoc networks ,NEXT generation networks - Abstract
The popularity of video services such as video call or video on-demand has made it impossible for people to live without them in their daily lives. It can be anticipated that the explosive growth of vehicular communication owing to the widespread use of in-vehicle video infotainment applications in the future will result in increasing fragmentation and congestion of the wireless transmission spectrum. Accordingly, effective bandwidth management algorithms are demanded to achieve efficient communication and stable scalability in next-generation vehicular networks. To the best of our current knowledge, a noticeable gap remains in the existing literature regarding the application of the latest advancements in network communication technologies. Specifically, this gap is evident in the lack of exploration regarding how cutting-edge technologies can be effectively employed to optimize bandwidth allocation, especially in the realm of video service applications within the forthcoming vehicular networks. In light of this void, this paper presents a seamless integration of cutting-edge 6G communication technologies, such as terahertz (THz) and visible light communication (VLC), with the existing 5G millimeter-wave and sub-6 GHz base stations. This integration facilitates the creation of a network environment characterized by high transmission rates and extensive coverage. Our primary aim is to ensure the uninterrupted playback of real-time video applications for vehicle users. These video applications encompass video conferencing, live video, and on-demand video services. The outcomes of our simulations convincingly indicate that the proposed strategy adeptly addresses the challenge of bandwidth competition among vehicle users. Moreover, it notably boosts the efficient utilization of bandwidth from less crowded base stations, optimizes the fulfillment of bandwidth prerequisites for various video applications, and elevates the overall video quality experienced by users. 
Consequently, our findings serve as a successful validation of the practicality and effectiveness of the proposed methodology. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
35. Identifying the Regions of a Space with the Self-Parameterized Recursively Assessed Decomposition Algorithm (SPRADA).
- Author
-
Molinié, Dylan, Madani, Kurosh, Amarger, Véronique, and Chebira, Abdennasser
- Subjects
ANOMALY detection (Computer security) ,ALGORITHMS ,INDUSTRIALISM ,HUMAN behavior models ,MANUFACTURING processes - Abstract
This paper introduces a non-parametric methodology based on classical unsupervised clustering techniques to automatically identify the main regions of a space, without requiring the objective number of clusters, so as to identify the major regular states of unknown industrial systems. Indeed, useful knowledge on real industrial processes entails the identification of their regular states, and their historically encountered anomalies. Since both should form compact and salient groups of data, unsupervised clustering generally performs this task fairly accurately; however, this often requires the number of clusters upstream, knowledge which is rarely available. As such, the proposed algorithm operates a first partitioning of the space, then it estimates the integrity of the clusters, and splits them again and again until every cluster obtains an acceptable integrity; finally, a step of merging based on the clusters' empirical distributions is performed to refine the partitioning. Applied to real industrial data obtained in the scope of a European project, this methodology proved able to automatically identify the main regular states of the system. Results show the robustness of the proposed approach in the fully-automatic and non-parametric identification of the main regions of a space, knowledge which is useful to industrial anomaly detection and behavioral modeling. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
36. Drug susceptibility prediction against a panel of drugs using kernelized Bayesian multitask learning
- Author
-
Mehmet Gönen and Adam A. Margolin
- Subjects
Statistics and Probability ,Computer science ,Anti-HIV Agents ,Bayesian probability ,Multi-task learning ,Antineoplastic Agents ,Machine learning ,computer.software_genre ,Biochemistry ,03 medical and health sciences ,Bayes' theorem ,0302 clinical medicine ,Drug Resistance, Viral ,medicine ,Humans ,Molecular Biology ,030304 developmental biology ,0303 health sciences ,business.industry ,Dimensionality reduction ,Novelty ,Cancer ,Bayes Theorem ,medicine.disease ,Original Papers ,3. Good health ,Computer Science Applications ,Computational Mathematics ,ComputingMethodologies_PATTERNRECOGNITION ,Computational Theory and Mathematics ,Binary classification ,Drug Resistance, Neoplasm ,030220 oncology & carcinogenesis ,Pharmacogenomics ,Kernel (statistics) ,Benchmark (computing) ,Bioinformatics of Health and Disease ,HIV-1 ,Data mining ,Artificial intelligence ,business ,Eccb 2014 Proceedings Papers Committee ,computer ,Subspace topology ,Algorithms ,Software - Abstract
Motivation: Human immunodeficiency virus (HIV) and cancer require personalized therapies owing to their inherent heterogeneous nature. For both diseases, large-scale pharmacogenomic screens of molecularly characterized samples have been generated with the hope of identifying genetic predictors of drug susceptibility. Thus, computational algorithms capable of inferring robust predictors of drug responses from genomic information are of great practical importance. Most of the existing computational studies that consider drug susceptibility prediction against a panel of drugs formulate a separate learning problem for each drug, which cannot make use of commonalities between subsets of drugs. Results: In this study, we propose to solve the problem of drug susceptibility prediction against a panel of drugs in a multitask learning framework by formulating a novel Bayesian algorithm that combines kernel-based non-linear dimensionality reduction and binary classification (or regression). The main novelty of our method is the joint Bayesian formulation of projecting data points into a shared subspace and learning predictive models for all drugs in this subspace, which helps us to eliminate off-target effects and drug-specific experimental noise. Another novelty of our method is the ability of handling missing phenotype values owing to experimental conditions and quality control reasons. We demonstrate the performance of our algorithm via cross-validation experiments on two benchmark drug susceptibility datasets of HIV and cancer. Our method obtains statistically significantly better predictive performance on most of the drugs compared with baseline single-task algorithms that learn drug-specific models. These results show that predicting drug susceptibility against a panel of drugs simultaneously within a multitask learning framework improves overall predictive performance over single-task learning approaches. 
Availability and implementation: Our Matlab implementations for binary classification and regression are available at https://github.com/mehmetgonen/kbmtl. Contact: mehmet.gonen@sagebase.org Supplementary Information: Supplementary data are available at Bioinformatics online.
- Published
- 2014
37. Improve Quality and Efficiency of Textile Process using Data-driven Machine Learning in Industry 4.0.
- Author
-
Chia-Yun Lee, Jia-Ying Lin, and Ray-I Chang
- Subjects
MACHINE learning ,BIG data ,ARTIFICIAL intelligence ,DATA mining ,ALGORITHMS - Abstract
The capabilities of self-awareness, self-prediction, and self-maintenance are important for textile factories in Industry 4.0. One of the most important issues is to intellectualize the way of setting operation parameters as the cyber-physical system (CPS), instead of using the traditional trial and error method. To achieve these goals, this paper focuses on the relationship between key operation parameters and defects for machine learning, to design an operation parameters recommender system (OPRS) in the textile industry. From the perspective of data science, this paper integrates historic manufacturing process data, such as machine operation parameters from the warping, sizing, beaming and weaving processes, and management experience data, such as textile inspection results from the quality control section. Then, regression models are applied to predict the textile operation parameters. This research also uses classification models to predict the quality of the textile. Based on ten-fold cross-validation testing, experimental results show that our model can achieve 90.8% accuracy on quality level prediction, and the best regression model for predicting weaving operation parameters can reduce the mean square error (MSE) to 0.01%. By combining the above two models, the proposed OPRS can provide a complete analysis of operation parameters. It provides good performance when compared with previous stochastic methods. As the proposed OPRS can support technicians in setting operation parameters more precisely, even for a new type of yarn, it can help to fix the tech skills gap in the textile manufacturing process. [ABSTRACT FROM AUTHOR]
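The ten-fold cross-validation protocol the abstract reports can be sketched as follows. The data are split into k folds, each fold serves once as the test set, and the k accuracies are averaged; the trivial majority-class model and synthetic labels below stand in for the paper's actual classifiers:

```python
def k_fold(n, k=10):
    """Yield (train_indices, test_indices) for k folds over n samples."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

labels = [1] * 70 + [0] * 30           # synthetic quality labels (70% class 1)
acc = []
for train, test in k_fold(len(labels)):
    train_labels = [labels[j] for j in train]
    majority = max(set(train_labels), key=train_labels.count)   # stand-in model
    acc.append(sum(labels[j] == majority for j in test) / len(test))
print(round(sum(acc) / len(acc), 2))   # → 0.7
```

Averaging over folds is what makes the reported 90.8% accuracy an out-of-sample estimate rather than a fit to the training data.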
- Published
- 2018
- Full Text
- View/download PDF
38. Predictive policing poses discrimination risk, thinktank warns; Machine-learning algorithms could replicate or amplify bias on race, sexuality and age
- Subjects
Data mining, Algorithms, Machine learning, Sexuality, Risk assessment, Ethics, Journalists, Social media, Biometry, Data warehousing/data mining, Algorithm, News, opinion and commentary - Abstract
Byline: Jamie Grierson Home affairs correspondent Predictive policing -- the use of machine-learning algorithms to fight crime -- risks unfairly discriminating against protected characteristics including race, sexuality and age, a [...]
- Published
- 2019
39. Enhancing recall in automated record screening: A resampling algorithm.
- Author
-
Hou, Zhipeng and Tipton, Elizabeth
- Subjects
AUTOMATIC identification, ALGORITHMS, HUMAN error, PROBABILITY theory, TRACKING algorithms, TEXT mining - Abstract
Literature screening is the process of identifying all relevant records from a pool of candidate records in systematic review, meta-analysis, and other research synthesis tasks. This process is time consuming, expensive, and prone to human error. Screening prioritization methods attempt to help reviewers identify the most relevant records while screening only a proportion of high-priority candidate records. In previous studies, screening prioritization is often referred to as automatic literature screening or automatic literature identification. Numerous screening prioritization methods have been proposed in recent years; however, there is a lack of methods with reliable performance. Our objective is to develop a screening prioritization algorithm with reliable performance for practical use, for example, an algorithm that guarantees an 80% chance of identifying at least 80% of the relevant records. Based on a target-based method proposed by Cormack and Grossman, we propose a screening prioritization algorithm that uses sampling with replacement. The algorithm is a wrapper that can work with any current screening prioritization algorithm to guarantee its performance. We prove, using probability theory, that the algorithm guarantees the performance, and we run numeric experiments to test it in practice. The results show that the algorithm achieves reliable performance under different circumstances, so it can be reliably used in real-world research synthesis tasks. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
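The quantity this record's guarantee is stated in, recall after screening only the top-k prioritized records, can be illustrated with a minimal sketch. The labels below are invented, and this is not the paper's resampling wrapper, only the metric it targets:

```python
# Recall@k for a prioritized screening list: the fraction of all relevant
# records recovered after screening only the top k candidates.
def recall_at_k(ranked_labels, k):
    # ranked_labels: 1 = relevant, 0 = irrelevant, in screening-priority order.
    total_relevant = sum(ranked_labels)
    return sum(ranked_labels[:k]) / total_relevant if total_relevant else 0.0

labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]   # invented screening order
print(recall_at_k(labels, 4))   # 3 of 4 relevant found -> 0.75
```

A wrapper of the kind the record describes would then grow k until the probability (over the randomness of the base prioritizer) that recall@k ≥ 0.8 is itself at least 0.8.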
40. Integrative random forest for gene regulatory network inference
- Author
-
Francesca Petralia, Jialiang Yang, Pei Wang, and Zhidong Tu
- Subjects
Statistics and Probability, Computer science, Big data, Gene regulatory network, Inference, Saccharomyces cerevisiae, Machine learning, Biochemistry, Gene Regulatory Networks, Molecular Biology, Transcription factor, Biological data, Computer Science Applications, Random forest, Computational Mathematics, Computational Theory and Mathematics, Artificial intelligence, Data mining, Algorithms, Test data, Transcription Factors - Abstract
Motivation: Gene regulatory network (GRN) inference based on genomic data is one of the most actively pursued computational biological problems. Because different types of biological data usually provide complementary information regarding the underlying GRN, a model that integrates big data of diverse types is expected to increase both the power and accuracy of GRN inference. Towards this goal, we propose a novel algorithm named iRafNet: integrative random forest for gene regulatory network inference. Results: iRafNet is a flexible, unified integrative framework that allows information from heterogeneous data, such as protein–protein interactions, transcription factor (TF)-DNA-binding, gene knock-down, to be jointly considered for GRN inference. Using test data from the DREAM4 and DREAM5 challenges, we demonstrate that iRafNet outperforms the original random forest based network inference algorithm (GENIE3), and is highly comparable to the community learning approach. We apply iRafNet to construct GRN in Saccharomyces cerevisiae and demonstrate that it improves the performance in predicting TF-target gene regulations and provides additional functional insights to the predicted gene regulations. Availability and implementation: The R code of iRafNet implementation and a tutorial are available at: http://research.mssm.edu/tulab/software/irafnet.html Contact: zhidong.tu@mssm.edu Supplementary information: Supplementary data are available at Bioinformatics online.
- Published
- 2015
41. A Review on Electronic Health Record Text-Mining for Biomedical Name Entity Recognition in Healthcare Domain.
- Author
-
Ahmad, Pir Noman, Shah, Adnan Muhammad, and Lee, KangYoon
- Subjects
SUBJECT headings, MEDICAL information storage & retrieval systems, NATURAL language processing, MACHINE learning, ARTIFICIAL intelligence, BIOINFORMATICS, TERMS & phrases, INFORMATION retrieval, CLINICAL medicine, ELECTRONIC health records, DATA mining, ALGORITHMS - Abstract
Biomedical-named entity recognition (bNER) is critical in biomedical informatics. It identifies biomedical entities with special meanings, such as people, places, and organizations, as predefined semantic types in electronic health records (EHR). bNER is essential for discovering novel knowledge using computational methods and Information Technology. Early bNER systems were configured manually to include domain-specific features and rules. However, these systems were limited in handling the complexity of the biomedical text. Recent advances in deep learning (DL) have led to the development of more powerful bNER systems. DL-based bNER systems can learn the patterns of biomedical text automatically, making them more robust and efficient than traditional rule-based systems. This paper reviews the healthcare domain of bNER, using DL techniques and artificial intelligence in clinical records, for mining treatment prediction. bNER-based tools are categorized systematically and represent the distribution of input, context, and tag (encoder/decoder). Furthermore, to create a labeled dataset for our machine learning sentiment analyzer to analyze the sentiment of a set of tweets, we used a manual coding approach and the multi-task learning method to bias the training signals with domain knowledge inductively. To conclude, we discuss the challenges facing bNER systems and future directions in the healthcare field. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
42. Improved SVRG for finite sum structure optimization with application to binary classification.
- Author
-
Shao, Guangmei, Xue, Wei, Yu, Gaohang, and Zheng, Xiao
- Subjects
CONVEX functions, SMOOTHNESS of functions, ALGORITHMS, DATA mining, MACHINE learning, BINARY number system - Abstract
This paper looks at a stochastic variance reduced gradient (SVRG) method for minimizing the sum of a finite number of smooth convex functions, a problem that arises widely in machine learning and data mining. Inspired by the excellent performance of the two-point stepsize gradient method in batch learning, in this paper we present an improved SVRG algorithm, named the stochastic two-point stepsize gradient method. Under some mild conditions, the proposed method achieves a linear convergence rate O(ρ^k) for smooth and strongly convex functions, where ρ ∈ (0.68, 1). Simulation experiments on several benchmark data sets are reported to demonstrate the performance of the proposed method. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
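The SVRG scheme this record improves on can be sketched in a few lines. The example below uses a plain constant stepsize (not the paper's two-point stepsize rule) on an invented 1-D least-squares problem:

```python
import random

# Minimal SVRG sketch on an invented 1-D least-squares problem:
# minimize F(w) = (1/n) * sum_i 0.5 * (a[i]*w - b[i])^2.
# A plain constant stepsize is used here, not the paper's two-point rule.
def svrg(a, b, eta=0.05, epochs=50, inner=10, seed=0):
    rng = random.Random(seed)
    n = len(a)
    w = 0.0
    for _ in range(epochs):
        w_snap = w
        # Full gradient at the snapshot, computed once per epoch.
        full_grad = sum(a[i] * (a[i] * w_snap - b[i]) for i in range(n)) / n
        for _ in range(inner):
            i = rng.randrange(n)
            g = a[i] * (a[i] * w - b[i])            # stochastic gradient at w
            g_snap = a[i] * (a[i] * w_snap - b[i])  # same sample at snapshot
            w -= eta * (g - g_snap + full_grad)     # variance-reduced update
    return w

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]     # exact minimizer is w* = 2
print(round(svrg(a, b), 2))
```

The variance-reduced update keeps the step unbiased while its variance shrinks as the iterate approaches the snapshot, which is what yields the linear rate quoted above.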
43. IndeCut evaluates performance of network motif discovery algorithms
- Author
-
Molly Megraw, David Koslicki, and Mitra Ansariola
- Subjects
Statistics and Probability, Computer science, Systems biology, Machine learning, Biochemistry, Network motif, Escherichia coli, Humans, Gene Regulatory Networks, Molecular Biology, Computational Biology, Computer Science Applications, Computational Mathematics, Computational Theory and Mathematics, Artificial intelligence, Data mining, Algorithm, Algorithms, Software, Transcription Factors - Abstract
Motivation: Genomic networks represent a complex map of molecular interactions which are descriptive of the biological processes occurring in living cells. Identifying the small over-represented circuitry patterns in these networks helps generate hypotheses about the functional basis of such complex processes. Network motif discovery is a systematic way of achieving this goal. However, a reliable network motif discovery outcome requires generating random background networks which are the result of a uniform and independent graph sampling method. To date, there has been no method to numerically evaluate whether any network motif discovery algorithm performs as intended on realistically sized datasets; thus it was not possible to assess the validity of resulting network motifs. Results: In this work, we present IndeCut, the first method to date that characterizes network motif finding algorithm performance in terms of uniform sampling on realistically sized networks. We demonstrate that it is critical to use IndeCut prior to running any network motif finder for two reasons. First, IndeCut indicates the number of samples needed for a tool to produce an outcome that is both reproducible and accurate. Second, IndeCut allows users to choose the tool that generates samples in the most independent fashion for their network of interest among many available options. Availability and implementation: The open source software package is available at https://github.com/megrawlab/IndeCut. Supplementary information: Supplementary data are available at Bioinformatics online.
- Published
- 2017
- Full Text
- View/download PDF
44. Role of biological Data Mining and Machine Learning Techniques in Detecting and Diagnosing the Novel Coronavirus (COVID-19): A Systematic Review.
- Author
-
Albahri, A. S., Hamid, Rula A., Alwan, Jwan k., Al-qays, Z.T., Zaidan, A. A., Zaidan, B. B., Albahri, A O. S., AlAmoodi, A. H., Khlaf, Jamal Mawlood, Almahdi, E. M., Thabet, Eman, Hadi, Suha M., Mohammed, K I., Alsalem, M. A., Al-Obaidi, Jameel R., and Madhloom, H.T.
- Subjects
ALGORITHMS, ARTIFICIAL intelligence, MACHINE learning, MEDLINE, ONLINE information services, DATA mining, SYSTEMATIC reviews, COVID-19 - Abstract
Coronaviruses (CoVs) are a large family of viruses that are common in many animal species, including camels, cattle, cats and bats. Animal CoVs, such as Middle East respiratory syndrome-CoV, severe acute respiratory syndrome (SARS)-CoV, and the new virus named SARS-CoV-2, rarely infect and spread among humans. On January 30, 2020, the International Health Regulations Emergency Committee of the World Health Organisation declared the outbreak of the resulting disease from this new CoV called 'COVID-19', as a 'public health emergency of international concern'. This global pandemic has affected almost the whole planet and caused the death of more than 315,131 patients as of the date of this article. In this context, publishers, journals and researchers are urged to research different domains and stop the spread of this deadly virus. The increasing interest in developing artificial intelligence (AI) applications has addressed several medical problems. However, such applications remain insufficient given the high potential threat posed by this virus to global public health. This systematic review addresses automated AI applications based on data mining and machine learning (ML) algorithms for detecting and diagnosing COVID-19. We aimed to obtain an overview of this critical virus, address the limitations of utilising data mining and ML algorithms, and provide the health sector with the benefits of this technique. We used five databases, namely, IEEE Xplore, Web of Science, PubMed, ScienceDirect and Scopus and performed three sequences of search queries between 2010 and 2020. Accurate exclusion criteria and selection strategy were applied to screen the obtained 1305 articles. Only eight articles were fully evaluated and included in this review, and this number only emphasised the insufficiency of research in this important area. After analysing all included studies, the results were distributed following the year of publication and the commonly used data mining and ML algorithms. The results found in all papers were discussed to find the gaps in all reviewed papers. Characteristics, such as motivations, challenges, limitations, recommendations, case studies, and features and classes used, were analysed in detail. This study reviewed the state-of-the-art techniques for CoV prediction algorithms based on data mining and ML assessment. The reliability and acceptability of extracted information and datasets from implemented technologies in the literature were considered. Findings showed that researchers must proceed with insights they gain, focus on identifying solutions for CoV problems, and introduce new improvements. The growing emphasis on data mining and ML techniques in medical fields can provide the right environment for change and improvement. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
45. An ensemble approach to outlier detection using some conventional clustering algorithms.
- Author
-
Saha, Akash, Chatterjee, Agneet, Ghosh, Soulib, Kumar, Neeraj, and Sarkar, Ram
- Subjects
OUTLIER detection, ALGORITHMS, MACHINE learning, DATA mining, KALMAN filtering - Abstract
Outlier detection is an important requirement in data mining and machine learning. When data mining and machine learning algorithms are applied to datasets with outliers, they lead to erroneous conclusions about the data. Therefore, researchers have been working in this field to remove outliers from datasets so that meaningful information can be retrieved. In this paper, we take a cluster-based ensemble approach to outlier detection, the backbone of which is a set of conventional clustering algorithms. Keeping in mind the drawbacks of supervised and semi-supervised learning, we have relied on unsupervised learning algorithms. For our cluster-based ensemble approach, we use three clustering algorithms, namely K-means, K-means++, and Fuzzy C-means. Our model intelligently combines results from the individual clustering algorithms, assigning probabilities to each data point in order to decide its belongingness to a certain cluster. We have proposed a technique to assign a membership value to a data point in the case of hard clustering algorithms, as we want to keep the flexibility of combining hard and soft clustering algorithms. From the probabilities assigned by the ensemble model, we then identify the outliers in the dataset. After removing these data points from the dataset, we obtain better values of cluster validity indices, reaffirming that removal of outliers has resulted in more stringent clusters of data. We have used five different cluster validity indices in our work to measure the goodness of the clusters formed, considering eight widely used datasets, three of them large, for evaluation of the proposed model. We have noticed a significant improvement in the cluster validity indices after applying our outlier detection algorithm. The experimental results prove that the proposed method is empirically sound. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
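The core idea of this record, converting hard cluster assignments into soft memberships, averaging memberships across clusterers, and flagging weakly claimed points as outliers, can be sketched on 1-D toy data. The inverse-distance membership formula and the 0.75 threshold below are illustrative assumptions, not the paper's exact scheme:

```python
# Ensemble outlier detection sketch: memberships from several clusterers
# are averaged, and points that no cluster claims strongly are outliers.
def memberships(point, centroids):
    # Inverse-distance soft membership for a hard clusterer
    # (an assumed formula, not necessarily the paper's).
    d = [abs(point - c) + 1e-9 for c in centroids]
    inv = [1.0 / di for di in d]
    s = sum(inv)
    return [v / s for v in inv]

def ensemble_outliers(points, centroid_sets, threshold=0.75):
    outliers = []
    for p in points:
        # Average the membership vectors produced by each clusterer.
        avg = None
        for cents in centroid_sets:
            m = memberships(p, cents)
            avg = m if avg is None else [x + y for x, y in zip(avg, m)]
        avg = [x / len(centroid_sets) for x in avg]
        if max(avg) < threshold:   # no cluster claims the point strongly
            outliers.append(p)
    return outliers

pts = [0.1, 0.2, 0.15, 5.0, 5.1, 2.6]    # 2.6 sits between the clusters
cents = [[0.15, 5.05], [0.14, 5.06]]     # centroids from two clusterers
print(ensemble_outliers(pts, cents))     # [2.6]
```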
46. A novel feature selection approach with Pareto optimality for multi-label data.
- Author
-
Li, Guohe, Li, Yong, Zheng, Yifeng, Li, Ying, Hong, Yunfeng, and Zhou, Xiaoming
- Subjects
FEATURE selection, DATA mining, MACHINE learning, ALGORITHMS - Abstract
Multi-label learning has been widely applied in machine learning and data mining. The purpose of feature selection is to select an approximately optimal feature subset to characterize the original feature space. As with single-label data, feature selection is an important preprocessing step to enhance the performance of a multi-label classification model. In this paper, we propose a multi-label feature selection approach with Pareto optimality for continuous data, called MLFSPO. It maps multi-label features to a high-dimensional space to evaluate the correlation between features and labels by utilizing the Hilbert-Schmidt Independence Criterion (HSIC). The feature subset is then obtained by combining Pareto optimization with feature ordering criteria and label weighting. Extensive experimental results on publicly available data sets show the effectiveness of the proposed algorithm in multi-label tasks. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
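The HSIC dependence measure this record relies on has a simple empirical form: center the two Gram matrices and take their elementwise product. The sketch below uses linear kernels on invented toy data (an illustrative choice; the record does not specify the kernels):

```python
# Empirical HSIC with linear kernels: HSIC = tr(KHLH) / (n-1)^2,
# where H = I - (1/n) * ones is the centering matrix.
def hsic(x, y):
    n = len(x)
    K = [[xi * xj for xj in x] for xi in x]   # linear kernel on the feature
    L = [[yi * yj for yj in y] for yi in y]   # linear kernel on the label

    def center(M):
        # Subtract row means, column means, add back the grand mean.
        row = [sum(r) / n for r in M]
        tot = sum(row) / n
        return [[M[i][j] - row[i] - row[j] + tot for j in range(n)]
                for i in range(n)]

    Kc, Lc = center(K), center(L)
    return sum(Kc[i][j] * Lc[i][j]
               for i in range(n) for j in range(n)) / (n - 1) ** 2

x = [1.0, 2.0, 3.0, 4.0]
y_dep = [2.0, 4.0, 6.0, 8.0]    # perfectly dependent on x
y_ind = [1.0, 1.0, 1.0, 1.0]    # constant, carries no information about x
print(hsic(x, y_dep) > hsic(x, y_ind))   # True
```

A feature-selection scheme in the spirit of the record would rank features by their HSIC score against the (weighted) label matrix before applying the Pareto step.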
47. A Novel Comparison of Charotar Region Wheat Variety Classification Techniques using Purely Tree-based Data Mining Algorithms.
- Author
-
Raj, M.P. and Saini, Jatinderkumar R.
- Subjects
DATA mining, WHEAT, MACHINE learning, ALGORITHMS, CLASSIFICATION, DURUM wheat - Abstract
Techniques for classifying data using data mining are nowadays prevalent in agriculture. The method of classifying seeds involves grouping various seed varieties according to their morphological characteristics. To accomplish categorization of the typical Charotar region (generally comprising the Anand and Kheda districts of the Gujarat State of India) Gujarat Wheat (GW) varieties (TRITICUM AESTIVUM) viz. GW 273, GW 496, GW 322, LOK-1, and GDW 1255 (TRITICUM DURUM), Weka Explorer was used. The features used are area, perimeter, solidity, aspect ratio, major and minor axis of the seed kernel, Hue, Saturation, Value, and SF1 (empirical). Feature reduction was done using Information Gain (IG) and its modified version, Gain Ratio (GR). This paper compares the performance of tree-based data mining algorithms in classifying wheat varieties. For classification we used purely tree-based machine learning algorithms, viz. J48, Random Forest, Hoeffding Tree, Logistic Model Tree (LMT), and REPTree. LMT, which builds logistic regression models at the tree leaves, gives the highest accuracy, 96.4%, compared to the other classifiers; the Hoeffding Tree classifier stood second with 96% accuracy. For validation, 10-fold cross-validation was used. Reducing the number of folds in cross-validation decreased the performance of most algorithms, except J48, and the percentage of correctly classified instances increased for all algorithms except J48 when features were selected by GR. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
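The 10-fold cross-validation protocol the record uses splits the dataset into ten roughly equal folds, each serving once as the test set. A minimal index-splitting sketch (contiguous folds; Weka shuffles the data first, which is omitted here):

```python
# k-fold index split: distribute n instances into k folds whose sizes
# differ by at most one, so every instance is tested exactly once.
def kfold_indices(n, k=10):
    folds = []
    base, extra = divmod(n, k)
    start = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # first `extra` folds get +1
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = kfold_indices(25, k=10)
print([len(f) for f in folds])   # five folds of 3, five folds of 2
```

Each classifier is then trained on k-1 folds and evaluated on the held-out fold, and the k accuracies are averaged, which is how figures like the 96.4% above are obtained.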
48. RUCIB: a novel rule-based classifier based on BRADO algorithm.
- Author
-
Morovatian, Iman, Basiri, Alireza, and Rezaei, Samira
- Subjects
SUPERVISED learning, DATABASES, ALGORITHMS, MACHINE learning, RANDOM forest algorithms - Abstract
Classification is a widely used supervised learning technique that enables models to discover the relationship between a set of features and a specified label using available data. Its applications span various fields such as engineering, telecommunication, astronomy, and medicine. In this paper, we propose a novel rule-based classifier called RUCIB (RUle-based Classifier Inspired by BRADO), which draws inspiration from the socio-inspired swarm intelligence algorithm known as BRADO. RUCIB introduces two key aspects: the ability to accommodate multiple values for features within a rule and the capability to explore all data features simultaneously. To evaluate the performance of RUCIB, we conducted experiments using ten databases sourced from the UCI machine learning database repository. In terms of classification accuracy, we compared RUCIB to ten well-known classifiers. Our results demonstrate that, on average, RUCIB outperforms Naive Bayes, SVM, PART, Hoeffding Tree, C4.5, ID3, Random Forest, CORER, CN2, and RACER by 9.32%, 8.97%, 7.58%, 7.4%, 7.34%, 7.34%, 7.22%, 5.06%, 5.01%, and 1.92%, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
49. Widening: using parallel resources to improve model quality
- Author
-
Berthold, Michael R., Fillbrunn, Alexander, and Siebes, Arno
- Published
- 2021
- Full Text
- View/download PDF
50. Adaptive Hierarchical Density-Based Spatial Clustering Algorithm for Streaming Applications.
- Author
-
Vijayan, Darveen and Aziz, Izzatdin
- Subjects
MACHINE learning, SPANNING trees, ALGORITHMS, DEEP learning, DATA mining - Abstract
Clustering algorithms are commonly used in the mining of static data. Some examples include data mining for relationships between variables and data segmentation into components. The use of a clustering algorithm on real-time data is much less common. This is due to a variety of factors, including the algorithm's high computation cost; in other words, the algorithm may be impractical for real-time or near-real-time implementation. Furthermore, clustering algorithms necessitate the tuning of hyperparameters in order to fit the dataset. In this paper, we approach the clustering of moving points using our proposed Adaptive Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) algorithm, which implements an adaptive approach to building the minimum spanning tree. We switch between the Boruvka and Prim algorithms as a means to build the minimum spanning tree, which is one of the most expensive components of HDBSCAN. The Adaptive HDBSCAN yields an improvement in execution time of 5.31% without degrading the accuracy of the algorithm. The motivation for this research stems from the desire to cluster moving points on video. Cameras are used to monitor crowds and improve public safety. We can identify potential risks due to overcrowding and the movements of groups of people by understanding the movements and flow of crowds. Surveillance equipment combined with deep learning algorithms can assist in addressing this issue by detecting people or objects, and the Adaptive HDBSCAN is used to cluster these items in real time to generate information about the clusters. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
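The expensive step this record targets is the minimum spanning tree construction inside HDBSCAN. A naive sketch of Prim's algorithm, one of the two methods the record alternates between, on invented 1-D points (the adaptive Boruvka/Prim switching heuristic itself is not shown):

```python
# Prim's algorithm on the complete graph over a set of 1-D points:
# repeatedly add the cheapest edge crossing from the tree to the rest.
def prim_mst(points):
    n = len(points)

    def dist(a, b):
        return abs(points[a] - points[b])   # toy 1-D metric

    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Cheapest edge leaving the current tree (naive O(n^2) scan).
        u, v = min(((u, v) for u in in_tree for v in range(n)
                    if v not in in_tree), key=lambda e: dist(*e))
        in_tree.add(v)
        edges.append((u, v, dist(u, v)))
    return edges

pts = [0.0, 1.0, 3.0, 7.0]
mst = prim_mst(pts)
print(sum(w for _, _, w in mst))   # total edge weight of the tree
```

Production HDBSCAN implementations replace this naive scan with priority queues or dual-tree Boruvka; the sketch only shows the structure of the MST step.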