56 results for "Lior Rokach"
Search Results
2. The Effectiveness of Bivalent mRNA Omicron Containing Booster Vaccines Among Patients with Hematological Neoplasms
- Author
-
Tamar Tadmor, Guy Melamed, Hilel Alapi, Tal Patalone, and Lior Rokach
- Published
- 2023
3. Tltd: Transfer Learning for Tabular Data
- Author
-
Maxim Bragilovski, Zahi Kapri, Lior Rokach, and Shelly Levy-Tzedek
- Published
- 2023
4. F-PENN— Forest path encoding for neural networks
- Author
-
Yoni Cohen, Gilad Katz, and Lior Rokach
- Subjects
Artificial neural network, Pixel, Computer science, Sample (statistics), Machine learning, Random forest, Hardware and Architecture, Encoding (memory), Signal Processing, Path (graph theory), Alternating decision tree, Artificial intelligence, Gradient boosting, Software, Information Systems
- Abstract
Deep neural nets (DNNs) mostly tend to outperform other machine learning (ML) approaches when the training data is abundant, high-dimensional, sparse, or consisting of raw data (e.g., pixels). For datasets with other characteristics – for example, dense tabular numerical data – algorithms such as Gradient Boosting Machines and Random Forest often achieve comparable or better performance at a fraction of the time and resources. These differences suggest that combining these approaches has the potential to yield superior performance. Existing attempts to combine DNNs with other ML approaches, which usually consist of feeding the output of the latter into the former, often do not produce positive results. We argue that this lack of improvement stems from the fact that the final classifications fail to provide the DNN with an understanding of the other algorithms’ decision-making process (i.e., their “logic”). In this study we present F-PENN, a novel approach for combining decision forests and DNNs. Instead of providing the final output of the forest (or its trees) to the DNN, we provide the paths traveled by each sample. This information, when fed to the neural net, yields significant improvement in performance. We demonstrate the effectiveness of our approach by conducting an extensive evaluation on 56 datasets and comparing F-PENN to four leading baselines: DNNs, Gradient Boosted Decision Trees (GBDT), Random Forest and DeepFM. We show that F-PENN outperforms the baselines in 69%–89% of the datasets and achieves an overall average error reduction of 16%–26%.
- Published
- 2021
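The path-encoding idea summarized in the F-PENN abstract above can be approximated with off-the-shelf tooling: scikit-learn's decision_path reports, for every sample, the tree nodes it traverses, and those indicators can be fed to a neural network alongside the raw features. The sketch below only illustrates that general idea, not the authors' F-PENN implementation; the dataset, network size, and hyperparameters are arbitrary choices.

```python
# Sketch: encode the decision paths of a random forest as binary features
# and feed them to a small neural network (not the authors' F-PENN code).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 1) Fit a decision forest on the raw tabular features.
forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_tr, y_tr)

# 2) For every sample, record which tree nodes it visits ("the path traveled").
#    decision_path returns a sparse indicator matrix of shape (n_samples, n_nodes).
path_tr, _ = forest.decision_path(X_tr)
path_te, _ = forest.decision_path(X_te)

# 3) Train a neural net on the concatenation of raw features and path indicators.
Z_tr = np.hstack([X_tr, path_tr.toarray()])
Z_te = np.hstack([X_te, path_te.toarray()])
net = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
net.fit(Z_tr, y_tr)

print("forest only :", accuracy_score(y_te, forest.predict(X_te)))
print("net on paths:", accuracy_score(y_te, net.predict(Z_te)))
```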
5. Isolation forests and landmarking-based representations for clustering algorithm recommendation using meta-learning
- Author
-
Itay Gabbay, Bracha Shapira, and Lior Rokach
- Subjects
Information Systems and Management, Computational complexity theory, Meta learning (computer science), Computer science, Similarity measure, Machine learning, Theoretical Computer Science, Artificial Intelligence, Isolation (database systems), Cluster analysis, Rank (computer programming), Computer Science Applications, Ranking, Control and Systems Engineering, Benchmark (computing), Artificial intelligence, Software
- Abstract
The data clustering problem can be described as the task of organizing data into groups, where in each group the objects share some similar attributes. Most of the problems that clustering algorithms address do not have a known solution a priori. This paper addresses the algorithm selection challenge for data clustering, while taking the difficulty in evaluating clustering solutions into account. We present a new meta-learning method for recommending the most suitable clustering algorithm for a dataset. Based on concepts from the isolation forest algorithm, we propose a new similarity measure between datasets. Our proposed dataset characterization methods generate an embedding for a dataset using this similarity measure, which is then used to improve the quality of the problem’s characterization. The method utilizes landmarking concepts to characterize the dataset and then, inspired by the DeepFM algorithm, applies meta-learning to rank the candidate algorithms that are expected to perform the best for the current dataset. This ranking could, among other things, support AutoML systems. Our approach is evaluated on a corpus of 100 publicly available benchmark datasets. We compare our method’s ranking performance to that of existing meta-learning methods and show the dominance of our method in terms of both predictive performance and computational complexity.
- Published
- 2021
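A much-simplified illustration of the landmarking idea in the entry above: run a few cheap "landmark" clusterings on each dataset, use their quality scores as meta-features, and recommend for a new dataset whichever algorithm performed best on the most similar previously seen dataset. This is not the paper's method (no isolation-forest embedding, no DeepFM-style ranker); the datasets, landmarkers, and candidate list are placeholders.

```python
# Sketch: landmarking-based meta-features for clustering-algorithm recommendation.
import numpy as np
from sklearn.datasets import make_blobs, make_moons
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors

def landmark_features(X):
    """Cheap landmarkers: silhouette of k-means for a few values of k."""
    return np.array([silhouette_score(X, KMeans(k, n_init=10, random_state=0).fit_predict(X))
                     for k in (2, 3, 5)])

def best_algorithm(X):
    """'Winner' on a known dataset, judged by silhouette (no ground truth needed)."""
    candidates = {
        "kmeans": KMeans(3, n_init=10, random_state=0),
        "agglo": AgglomerativeClustering(3),
        "dbscan": DBSCAN(eps=0.5),
    }
    scores = {}
    for name, algo in candidates.items():
        labels = algo.fit_predict(X)
        scores[name] = silhouette_score(X, labels) if len(set(labels)) > 1 else -1
    return max(scores, key=scores.get)

# Meta-dataset: previously seen datasets with their meta-features and winners.
seen = [make_blobs(300, centers=3, random_state=i)[0] for i in range(3)] + \
       [make_moons(300, noise=0.05, random_state=i)[0] for i in range(3)]
meta_X = np.array([landmark_features(X) for X in seen])
meta_y = [best_algorithm(X) for X in seen]

# Recommend for a new, unseen dataset via nearest neighbour in meta-feature space.
new_X, _ = make_moons(300, noise=0.05, random_state=42)
nn = NearestNeighbors(n_neighbors=1).fit(meta_X)
idx = nn.kneighbors(landmark_features(new_X).reshape(1, -1), return_distance=False)[0][0]
print("recommended algorithm:", meta_y[idx])
```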
6. Approximating XGBoost with an interpretable decision tree
- Author
-
Omer Sagi and Lior Rokach
- Subjects
Information Systems and Management, Exploit, Computer science, Decision tree, Machine learning, Theoretical Computer Science, Artificial Intelligence, Order (exchange), Interpretability, Computer Science Applications, Random forest, Tree (data structure), Control and Systems Engineering, Transparency (graphic), Gradient boosting, Artificial intelligence, Software
- Abstract
The increasing usage of machine-learning models in critical domains has recently stressed the necessity of interpretable machine-learning models. In areas like healthcare and finance, the model consumer must understand the rationale behind the model output in order to use it when making a decision. For this reason, it is impossible to use black-box models in these scenarios, regardless of their high predictive performance. Decision forests, and in particular Gradient Boosting Decision Trees (GBDT), are examples of this kind of model. GBDT models are considered the state-of-the-art in many classification challenges, reflected by the fact that the majority of Kaggle’s recent winners used GBDT methods as a part of their solution (such as XGBoost). But despite their superior predictive performance, they cannot be used in tasks that require transparency. This paper presents a novel method for transforming a decision forest of any kind into an interpretable decision tree. The method extends the tool-set available for machine learning practitioners who want to exploit the interpretability of decision trees without significantly impairing the predictive performance gained by GBDT models like XGBoost. We show in an empirical evaluation that in some cases the generated tree is able to approximate the predictive performance of an XGBoost model while enabling better transparency of the outputs.
- Published
- 2021
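A generic way to approximate what the abstract above describes, without reproducing the paper's specific algorithm, is model distillation: fit a boosted ensemble, then fit one shallow decision tree to the ensemble's predictions instead of the raw labels. A minimal sketch, using scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost.

```python
# Sketch: distil a boosted ensemble into a single interpretable decision tree.
# This is a generic distillation baseline, not the paper's method.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Teacher: a black-box boosted ensemble (stand-in for XGBoost).
teacher = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Student: a shallow tree trained on the teacher's predicted labels,
# so it mimics the ensemble's decision surface rather than the noisy labels.
student = DecisionTreeClassifier(max_depth=4, random_state=0)
student.fit(X_tr, teacher.predict(X_tr))

feature_names = list(load_breast_cancer().feature_names)
print(export_text(student, feature_names=feature_names)[:500])  # readable rules
print("teacher AUC:", roc_auc_score(y_te, teacher.predict_proba(X_te)[:, 1]))
print("student AUC:", roc_auc_score(y_te, student.predict_proba(X_te)[:, 1]))
```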
7. A practical tutorial on bagging and boosting based ensembles for machine learning: Algorithms, software tools, performance study, practical perspectives and opportunities
- Author
-
Sergio González, Salvador García, Francisco Herrera, Lior Rokach, and Javier Del Ser
- Subjects
Boosting (machine learning), Computer science, Decision tree, Predictive capability, Machine learning, Ensemble learning, Random forest, Software, Workflow, Hardware and Architecture, Signal Processing, Artificial intelligence, Categorical variable, Algorithm, Information Systems
- Abstract
Ensembles, especially ensembles of decision trees, are one of the most popular and successful techniques in machine learning. Recently, the number of ensemble-based proposals has grown steadily; it is therefore necessary to identify which algorithms are appropriate for a certain problem. In this paper, we aim to help practitioners choose the best ensemble technique according to their problem characteristics and their workflow. To do so, we review the most renowned bagging and boosting algorithms and their software tools. These ensembles are described in detail, together with the variants and improvements available in the literature. Their publicly available software tools are reviewed with attention to the implemented versions and features, and are categorized according to their supported programming languages and computing paradigms. The performance of 14 different bagging and boosting based ensembles, including XGBoost, LightGBM and Random Forest, is empirically analyzed in terms of predictive capability and efficiency. This comparison is done under the same software environment with 76 different classification tasks. Their predictive capabilities are evaluated over a wide variety of scenarios, such as standard multi-class problems, scenarios with categorical features, and large datasets. The efficiency of these methods is analyzed on considerably large datasets. Several practical perspectives and opportunities for ensemble learning are also presented.
- Published
- 2020
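In the spirit of the empirical study described above, the snippet below benchmarks one bagging-based and several boosting-based ensembles on a single classification task with cross-validation. Only scikit-learn implementations are used; XGBoost and LightGBM, which the paper also covers, would slot in the same way if installed, and the dataset and settings are purely illustrative.

```python
# Sketch: compare bagging-based and boosting-based ensembles on one task.
from sklearn.datasets import load_digits
from sklearn.ensemble import (RandomForestClassifier, BaggingClassifier,
                              GradientBoostingClassifier, AdaBoostClassifier)
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)

models = {
    "Bagging (trees)": BaggingClassifier(n_estimators=100, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold accuracy
    print(f"{name:18s} {scores.mean():.3f} +/- {scores.std():.3f}")
```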
8. Explainable decision forest: Transforming a decision forest into an interpretable tree
- Author
-
Omer Sagi and Lior Rokach
- Subjects
Computer science, Best practice, Decision tree, Machine learning, Random forest, Set (abstract data type), Tree (data structure), Hardware and Architecture, Signal Processing, Path (graph theory), Use case, Artificial intelligence, Software, Information Systems
- Abstract
Decision forests are considered the best practice in many machine learning challenges, mainly due to their superior predictive performance. However, simple models like decision trees may be preferred over decision forests in cases in which the generated predictions must be efficient or interpretable (e.g. in insurance or health-related use cases). This paper presents a novel method for transforming a decision forest into an interpretable decision tree, which aims at preserving the predictive performance of decision forests while enabling efficient classifications that can be understood by humans. This is done by creating a set of rule conjunctions that represent the original decision forest; the conjunctions are then hierarchically organized to form a new decision tree. We evaluate the proposed method on 33 UCI datasets and show that the resulting model usually approximates the ROC AUC gained by random forest while providing an interpretable decision path for each classification.
- Published
- 2020
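The abstract above builds its interpretable tree from rule conjunctions harvested from the forest. The helper below sketches only the first stage under that reading: walking each tree of a fitted scikit-learn forest and emitting every root-to-leaf path as a conjunction of (feature, threshold, direction) conditions. Organizing those conjunctions into a new decision tree, as the paper does, is not shown.

```python
# Sketch: extract rule conjunctions (root-to-leaf paths) from a decision forest.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

def leaf_conjunctions(tree, feature_names):
    """Yield each leaf of one tree as a list of human-readable conditions."""
    t = tree.tree_
    def walk(node, conds):
        if t.children_left[node] == -1:          # leaf node
            yield conds, t.value[node]
            return
        name, thr = feature_names[t.feature[node]], t.threshold[node]
        yield from walk(t.children_left[node], conds + [f"{name} <= {thr:.2f}"])
        yield from walk(t.children_right[node], conds + [f"{name} > {thr:.2f}"])
    yield from walk(0, [])

X, y = load_iris(return_X_y=True)
names = load_iris().feature_names
forest = RandomForestClassifier(n_estimators=3, max_depth=2, random_state=0).fit(X, y)

for i, est in enumerate(forest.estimators_):
    for conds, value in leaf_conjunctions(est, names):
        print(f"tree {i}: IF {' AND '.join(conds)} THEN class counts {value.ravel()}")
```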
9. Inferring Demographic Characteristics of Mobile Subscribers from Cell Tower Interactions
- Author
-
Ariel Bar, Bracha Shapira, and Lior Rokach
- Published
- 2022
10. Automatic Feature Engineering for Learning Compact Decision Trees
- Author
-
Inbal Roshanski, Meir Kalech, and Lior Rokach
- Subjects
History, Polymers and Plastics, Artificial Intelligence, General Engineering, Business and International Management, Industrial and Manufacturing Engineering, Computer Science Applications
- Published
- 2022
11. Constraint learning based gradient boosting trees
- Author
-
Asaf Shabtai, Abraham Israeli, and Lior Rokach
- Subjects
Constraint learning, Boosting (machine learning), Computer science, General Engineering, Intelligent decision support system, Machine learning, Computer Science Applications, Artificial Intelligence, Gradient boosting
- Abstract
Predictive regression models aim to find the most accurate solution to a given problem, often without any constraints related to the model’s predicted values. Such constraints have been used in prior research where they have been applied to a subpopulation within the training dataset which is of greater interest and importance. In this research we introduce a new setting of regression problems, in which each instance can be assigned a different constraint, defined based on the value of the target (predicted) attribute. The new use of constraints is taken into account and incorporated into the learning process, and is also considered when evaluating the induced model. We propose two algorithms which are modifications of the regression boosting method. The proposed algorithms have two advantages: they are not dependent on the base learner used during the learning process, and they can be adopted by any boosting technique. We implemented the algorithms by modifying the gradient boosting trees (GBT) model, and we also introduced two measures for evaluating the models that were trained to solve the constraint problems. We compared the proposed algorithms to three baseline algorithms using four real-life datasets. Due to the algorithms’ focus on satisfying the constraints, in most cases the results showed significant improvement in the constraint-related measures, with just a minimal effect on the general prediction error. The main impact of the proposed approach is its ability to derive a model with a higher level of assurance for specific cases of interest (i.e., the constrained cases). This is extremely important and has great significance in various use cases and expert and intelligent systems, particularly critical systems, such as critical healthcare systems (e.g., when predicting blood pressure or blood sugar level), safety systems (e.g., when aiming to estimate the distance of cars or airplanes from other objects), or critical industrial systems (e.g., systems that need to estimate equipment usability over time). In each of these cases, there is a subpopulation of all instances that is of greater interest to the expert or system, and the sensitivity of the model’s error changes according to the real value of the predicted feature. For example, for a subpopulation of patients (e.g., patients under the age of eight, or patients known to be at risk), physicians often require a sensitive model that accurately predicts blood pressure values.
- Published
- 2019
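A crude stand-in for the constraint idea described above: give the instances whose target value falls in the "critical" range a larger sample weight, so the boosted model is penalized more for errors on them. This is not the paper's algorithms or datasets; the threshold and weight are arbitrary, and scikit-learn's GradientBoostingRegressor plays the role of the GBT model.

```python
# Sketch: emphasise a constrained subpopulation in gradient boosting via
# target-dependent sample weights (a crude stand-in for the paper's algorithms).
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# "Constraint": errors on low-valued targets (y < 100) matter most here,
# standing in for the critical cases an expert cares about.
critical_tr, critical_te = y_tr < 100, y_te < 100
weights = np.where(critical_tr, 5.0, 1.0)   # per-instance constraint strength

plain = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
constrained = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr, sample_weight=weights)

for name, model in [("plain", plain), ("constrained", constrained)]:
    pred = model.predict(X_te)
    print(name,
          "| MAE overall:", round(mean_absolute_error(y_te, pred), 1),
          "| MAE critical:", round(mean_absolute_error(y_te[critical_te], pred[critical_te]), 1))
```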
12. A deep learning framework for predicting burglaries based on multiple contextual factors
- Author
-
Adir Solomon, Mor Kertis, Bracha Shapira, and Lior Rokach
- Subjects
Artificial Intelligence, General Engineering, Computer Science Applications
- Published
- 2022
13. Integrated prediction intervals and specific value predictions for regression problems using neural networks
- Author
-
Eli Simhayev, Gilad Katz, and Lior Rokach
- Subjects
Information Systems and Management, Artificial Intelligence, Software, Management Information Systems
- Published
- 2022
14. Contextual security awareness: A context-based approach for assessing the security awareness of users
- Author
-
Adir Solomon, Michael Michaelshvili, Ron Bitton, Bracha Shapira, Lior Rokach, Rami Puzis, and Asaf Shabtai
- Subjects
Information Systems and Management ,Artificial Intelligence ,Software ,Management Information Systems - Published
- 2022
15. Explainable machine learning for chronic lymphocytic leukemia treatment prediction using only inexpensive tests
- Author
-
Amiel Meiseles, Denis Paley, Mira Ziv, Yarin Hadid, Lior Rokach, and Tamar Tadmor
- Subjects
Machine Learning, Male, Humans, Health Informatics, Leukemia, Lymphocytic, Chronic, B-Cell, Algorithms, Aged, Computer Science Applications
- Abstract
Chronic lymphocytic leukemia (CLL) is one of the most common types of leukemia in the western world and affects mainly the elderly population. Progression of the disease is very heterogeneous, both in terms of the necessity of treatment and life expectancy. The current scoring system for prognostic evaluation of patients with CLL, called CLL-IPI, predicts the general progression of the disease but is not a measure or a decision aid for the necessity of treatment. Due to the heterogeneous behavior of CLL, it is important to develop tools that will identify if and when patients will require treatment for CLL. Recently, Machine Learning (ML) has spread to many public health fields, including the diagnosis and prognosis of diseases. Existing machine learning methods for CLL treatment prediction rely on expensive tests, such as genetic tests, rendering them useless in peripheral or low-resource clinics such as those in developing countries. We aim to develop a machine learning model for predicting whether a patient will need treatment for CLL within two years of diagnosis, based only on demographic data and routine laboratory tests.
We conducted a single-center study that included adult patients (above the age of 18) who were diagnosed with CLL according to the IWCLL criteria and were under observation at the hematology unit of the Bnai-Zion medical center between 2009 and 2019. Patient data include demographic, clinical and laboratory measures that were extracted anonymously from patients' medical records. All laboratory results during the observation period were extracted for the entire cohort. Multiple ML approaches for classifying whether a patient will require treatment during a predetermined period of 2 years were evaluated. Performance of the ML models was measured using repeated cross validation. We evaluated the use of SHapley Additive exPlanations (SHAP) for explaining what influences the models' decisions. Additionally, we employ a method for extracting a single decision tree from the ML model, which enables the doctor to understand the main logic governing the model's predictions.
The study included 109 patients, of whom 67 were males (61%). Patients were under observation for a median of 44 months, and the median age was 65 (age range: 45-87). 64% of the cohort received therapy during follow-up. A Gradient Boosting Model (GBM) using all of the extracted variables to identify the need for treatment in the coming two years among patients with CLL achieved an AUPRC of 0.78 (±0.08). An identical GBM model, without genetic/FISH and flow cytometry (FACS) data, such that it can be used in peripheral clinics, scored an AUPRC of 0.7686 (±0.0837). A Generalized Linear Model (GLM) using the same features scored an AUPRC of 0.7535 (±0.0995). All the models described above surpassed the performance of CLL-IPI that was evaluated using the CLL-TIM model. According to the SHAP results, red blood cell (RBC) count was the most predictive value for the necessity of treatment, where a high value is associated with a low probability of requiring treatment in the coming two years. Additionally, the SHAP method was used for estimating the personal risk of a random patient and showed sensible results. A simple decision tree classifier showed that patients who had a hemoglobin level of less than 13 gm/dL and a Neutrophil to Lymphocyte Ratio (NLR) less than 0.063, who constituted 34% of the patients included in our study, had a high probability (76%) of requiring treatment.
The machine learning algorithms that were evaluated in this work for predicting the necessity of treatment for patients with CLL achieved reasonable accuracy, surpassing that of CLL-IPI as evaluated using the CLL-TIM model. Furthermore, we found that a machine learning model trained exclusively on inexpensive features incurred only a modest decrease in performance compared to the model trained using all of the features. Due to the small number of patients in this study, it is necessary to validate the results on a larger population.
- Published
- 2022
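A minimal sketch of the modelling pipeline the abstract describes: a gradient boosting classifier on tabular features, repeated cross-validation scored by AUPRC, and SHAP values for explanation. Synthetic data stands in for the (private) patient records, and the shap package is assumed to be installed.

```python
# Sketch: gradient boosting on routine tabular features, evaluated with
# repeated cross-validated AUPRC, then explained with SHAP values.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
import shap  # pip install shap

# Synthetic stand-in for demographic + routine laboratory features.
X, y = make_classification(n_samples=300, n_features=12, n_informative=5,
                           weights=[0.4, 0.6], random_state=0)

model = GradientBoostingClassifier(random_state=0)

# Repeated stratified CV, scored with area under the precision-recall curve.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
auprc = cross_val_score(model, X, y, cv=cv, scoring="average_precision")
print(f"AUPRC: {auprc.mean():.2f} (+/- {auprc.std():.2f})")

# Fit on everything and explain individual predictions with SHAP values.
model.fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
print("most influential feature index:", np.abs(shap_values).mean(axis=0).argmax())
```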
16. Predicting Application Usage Based on Latent Contextual Information
- Author
-
Adir Solomon, Bracha Shapira, and Lior Rokach
- Subjects
Computer Networks and Communications
- Published
- 2021
17. Machine-learning model for the prediction of preeclampsia – a step toward personalized risk assessment
- Author
-
Guy Shtar, Lior Rokach, Victor Novack, Lena Novack, Gabor Than, Hannele Laivouri, Antonio Farina, Amnon G. Hadar, and Ofer Erez
- Subjects
Obstetrics and Gynecology
- Published
- 2022
18. Personal price aware multi-seller recommender system: Evidence from eBay
- Author
-
Asnat Greenstein-Messica and Lior Rokach
- Subjects
Information Systems and Management, Computer science, Recommender system, Management Information Systems, Microeconomics, Product (business), Promotion (rank), Willingness to pay, Artificial Intelligence, Order (business), Revenue, Transaction data, Software, Market penetration, Reputation
- Abstract
Many e-commerce sites use recommender systems, which suggest products that consumers may want to purchase in order to increase site revenue. Though recommender systems have achieved great success, they have not reached their full potential. Most current systems share a common weakness: they fail to take into account dynamic properties of the offering which could dramatically improve the effectiveness of a recommendation; these characteristics include the product price, promotion indication, and seller's reputation. Particularly, in a multi-seller platform (e.g., eBay, Amazon), where competing firms sell products differentiated mainly by the seller's reputation and product price, modeling consumer's sensitivity to these dynamic properties and incorporating it into a recommender system will optimize sellers’ revenue and market penetration. In this research, we introduce a novel approach for a personal price aware multi-seller recommender system (PMSRS) which implicitly models a consumer's willingness to pay (WTP) for a specific product, taking into account discount indication and seller reputation, and incorporating it within a context-aware recommendation model to improve its effectiveness. We use six months of transactional data from eBay.com to test the proposed approach and prove its validity and effectiveness. Our results show that the proposed approach provides a good estimation of the consumer's WTP, and that incorporating the consumer's WTP and seller's reputation into a recommender system significantly improves its prediction accuracy (F-score improvements of 84% compared to a matrix factorization recommendation model which doesn't take into account the seller's reputation or consumer's WTP).
- Published
- 2018
19. Taxonomy of mobile users' security awareness
- Author
-
Asaf Shabtai, Lior Sidi, Ron Bitton, Andrey Finkelshtein, Rami Puzis, and Lior Rokach
- Subjects
General Computer Science, Computer science, Internet privacy, Vulnerability, Covert channel, Asset (computer security), Computer security, Security testing, Security information and event management, Security engineering, Cloud computing security, Social engineering (security), Information security, Computer security model, Security awareness, Information sensitivity, Security service, Human-computer interaction in information security, Security through obscurity, Law, Mobile device, Countermeasure (computer)
- Abstract
The popularity of smartphones, coupled with the amount of valuable and private information they hold, makes them attractive to attackers interested in exploiting the devices to harvest sensitive information. Exploiting human vulnerabilities (i.e., social engineering) is an approach widely used to achieve this goal. Improving the security awareness of users is an effective method for mitigating social engineering attacks. However, while in the domain of personal computers (PCs) the security awareness of users is relatively high, previous studies have shown that for the mobile platform, the security awareness level is significantly lower. The skills required from a mobile user to interact safely with his/her smartphone are different from those that are required for safe and responsible PC use. Therefore, the awareness of mobile users of security risks is an important aspect of information security. An essential and challenging requirement of assessing security awareness is the definition of measurable criteria for a security-aware user. In this paper, we present a hierarchical taxonomy for security awareness, specifically designed for mobile device users. The taxonomy defines a set of measurable criteria that are categorized according to different technological focus areas (e.g., applications and browsing) and within the context of psychological dimensions (e.g., knowledge, attitude, and behavior). We demonstrate the applicability of the proposed taxonomy by introducing an expert-based procedure for deriving mobile security awareness models for different attack classes (each class is an aggregation of social engineering attacks that exploit a similar set of human vulnerabilities). Each model reflects the contribution (weight) of each criterion to the mitigation of the corresponding attack class. Application of the proposed procedure, based on the input of 17 security experts, to derive mobile security awareness models of four different attack classes, confirms that the skills required from a smartphone user to mitigate an attack are different for different attack classes.
- Published
- 2018
20. Explaining anomalies detected by autoencoders using Shapley Additive Explanations
- Author
-
Liat Antwarg, Bracha Shapira, Lior Rokach, and Ronnie Mindlin Miller
- Subjects
Ground truth, Computer science, Deep learning, Anomaly (natural sciences), Supervised learning, General Engineering, Machine learning, Autoencoder, Computer Science Applications, Kernel (image processing), Artificial Intelligence, Outlier, Anomaly detection
- Abstract
Deep learning algorithms for anomaly detection, such as autoencoders, point out the outliers, saving experts the time-consuming task of examining normal cases in order to find anomalies. Most outlier detection algorithms output a score for each instance in the database. The top-k most intense outliers are returned to the user for further inspection; however, the manual validation of results becomes challenging without justification or additional clues. An explanation of why an instance is anomalous enables the experts to focus their investigation on the most important anomalies and may increase their trust in the algorithm. Recently, a game theory-based framework known as SHapley Additive exPlanations (SHAP) was shown to be effective in explaining various supervised learning models. In this paper, we propose a method that uses Kernel SHAP to explain anomalies detected by an autoencoder, which is an unsupervised model. The proposed explanation method aims to provide a comprehensive explanation to the experts by focusing on the connection between the features with high reconstruction error and the features that are most important in terms of their effect on the reconstruction error. We propose a black-box explanation method, because it has the advantage of being able to explain any autoencoder without being aware of the exact architecture of the autoencoder model. The proposed explanation method extracts and visually depicts both features that contribute the most to the anomaly and those that offset it. An expert evaluation using real-world data demonstrates the usefulness of the proposed method in helping domain experts better understand the anomalies. Our evaluation of the explanation method, in which a “perfect” autoencoder is used as the ground truth, shows that the proposed method explains anomalies correctly, using the exact features, and evaluation on real data demonstrates that (1) our explanation model, which uses SHAP, is more robust than the Local Interpretable Model-agnostic Explanations (LIME) method, and (2) the explanations our method provides are more effective at reducing the anomaly score than other methods.
- Published
- 2021
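A small sketch of the core idea above: treat the autoencoder's reconstruction error as the anomaly score and hand that scoring function, as a black box, to Kernel SHAP. Here a one-hidden-layer MLPRegressor plays the role of the autoencoder and the shap package is assumed to be installed; this is far simpler than the paper's full procedure for connecting high-error features to their drivers.

```python
# Sketch: explain an autoencoder's anomaly score (reconstruction error)
# with Kernel SHAP. Toy autoencoder = MLPRegressor trained to reproduce its input.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor
import shap  # pip install shap

X = StandardScaler().fit_transform(load_breast_cancer().data)

# Train the "autoencoder": the inputs are also the targets.
ae = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000, random_state=0).fit(X, X)

def anomaly_score(samples):
    """Mean squared reconstruction error per sample -- the quantity to explain."""
    recon = ae.predict(samples)
    return ((samples - recon) ** 2).mean(axis=1)

# Pick the most anomalous instance and explain its score with Kernel SHAP.
scores = anomaly_score(X)
top = X[[scores.argmax()]]
background = shap.kmeans(X, 10)              # summarised background data
explainer = shap.KernelExplainer(anomaly_score, background)
shap_values = explainer.shap_values(top, nsamples=200)
print("features pushing the anomaly score up:",
      np.argsort(shap_values[0])[-3:][::-1])
```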
21. Supporting unknown number of users in keystroke dynamics models
- Author
-
Itay Hazan, Oded Margalit, and Lior Rokach
- Subjects
Authentication, Information Systems and Management, Computer science, Machine learning, Popularity, Management Information Systems, Reduction (complexity), Keystroke dynamics, Artificial Intelligence, Factor (programming language), Identity (object-oriented programming), Leverage (statistics), Social media, Software
- Abstract
In recent years, keystroke dynamics has gained popularity as a reliable means of verifying user identity in remote systems. Due to its high performance in verification and the fact that it does not require additional effort from the user, keystroke dynamics has become one of the most preferred second factors of authentication. Despite its prominence, it has one major limitation: keystroke dynamics algorithms are good at fitting a model to one user and one user only. When such algorithms try to fit a model to more than one user, the verification accuracy decreases dramatically. However, in real-world applications it is common practice for two or more users to use the same credentials, such as in shared bank accounts, shared social media profiles, and shared streaming licenses which allow multiple users in one account. In these cases, keystroke dynamics solutions become unreliable. To address this limitation, we propose a method that can leverage existing keystroke dynamics algorithms to automatically determine the number of users sharing the account and accurately support accounts that are shared by multiple users. We evaluate our method using eight state-of-the-art keystroke dynamics algorithms and three public datasets, with up to five different users in one model, achieving an average improvement in verification of 9.2% for the AUC and 8.6% for the EER in the multi-user cases, with just a negligible reduction of 0.2% for the AUC and 0.3% for the EER in the one-user cases.
- Published
- 2021
22. A hybrid approach for improving unsupervised fault detection for robotic systems
- Author
-
Eliahu Khalastchi, Meir Kalech, and Lior Rokach
- Subjects
Computer science, Supervised learning, General Engineering, Intelligent decision support system, Machine learning, Fault (power engineering), Flight simulator, Fault detection and isolation, Computer Science Applications, Domain (software engineering), Artificial Intelligence, Robot
- Abstract
- From unsupervised to supervised learning of a fault detection model (for robots).
- Insights into why and when it becomes more accurate.
- Theoretical analysis and a prediction tool.
- Empirical results on 3 real-world domains that back these insights.
The use of robots in our daily lives is increasing. As we rely more on robots, it becomes more important for us that the robots will continue with their missions successfully. Unfortunately, these sophisticated, and sometimes very expensive, machines are susceptible to different kinds of faults. It therefore becomes important to apply a Fault Detection (FD) mechanism which is suitable for the domain of robots. Two important requirements of such a mechanism are high accuracy and low computational load during operation (online). Supervised learning can potentially produce very accurate FD models, and if the learning takes place offline then the online computational load can be reduced. Yet, the domain of robots is characterized by the absence of labeled data (e.g., faulty, normal) required by supervised approaches, and consequently, unsupervised approaches are being used. In this paper we propose a hybrid approach - an unsupervised approach can label a data set, with a low degree of inaccuracy, and then the labeled data set is used offline by a supervised approach to produce an online FD model. Now we are faced with a choice: should we use the unsupervised or the hybrid fault detector? Seemingly, there is no way to validate the choice due to the absence of (a priori) labeled data. In this paper we give an insight into why, and a tool to predict when, the hybrid approach is more accurate. In particular, the main impacts of our work are: (1) we theoretically analyze the conditions under which the hybrid approach is expected to be more accurate; (2) our theoretical findings are backed with empirical analysis, using data sets of three different robotic domains: a high fidelity flight simulator, a laboratory robot, and a commercial Unmanned Aerial Vehicle (UAV); (3) we analyze how different unsupervised FD approaches are improved by the hybrid technique and (4) how well this improvement fits our prediction tool. The significance of the hybrid approach and the prediction tool is the potential benefit to expert and intelligent systems in which labeled data is absent or expensive to create.
- Published
- 2017
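The hybrid recipe described above is easy to prototype with standard components: let an unsupervised detector pseudo-label the data, then train a cheap supervised classifier offline on those labels for online use. The sketch uses IsolationForest and a random forest on synthetic "telemetry"; the paper's detectors, domains, and theoretical analysis are not reproduced.

```python
# Sketch of the hybrid idea: an unsupervised detector labels the data,
# then a supervised model is trained offline for online fault detection.
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(2000, 5))          # synthetic "telemetry"
faulty = rng.normal(4, 1, size=(60, 5))            # rare faulty readings
X = np.vstack([normal, faulty])

# Step 1 (unsupervised): pseudo-label the unlabeled data, accepting some noise.
iso = IsolationForest(contamination=0.03, random_state=0).fit(X)
pseudo_labels = (iso.predict(X) == -1).astype(int)   # 1 = suspected fault

# Step 2 (supervised, offline): train a classifier on the pseudo-labels.
X_tr, X_te, y_tr, y_te = train_test_split(X, pseudo_labels, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Online use: the lightweight classifier is the fault detector.
new_reading = rng.normal(4, 1, size=(1, 5))
print("fault predicted:", bool(clf.predict(new_reading)[0]))
```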
23. Fast-CBUS: A fast clustering-based undersampling method for addressing the class imbalance problem
- Author
-
Roni Stern, Nir Ofek, Lior Rokach, and Asaf Shabtai
- Subjects
Computer science, Cognitive Neuroscience, Sentiment analysis, Pareto principle, Intrusion detection system, Machine learning, Computer Science Applications, Statistical classification, Artificial Intelligence, Undersampling, Data mining, Cluster analysis, Time complexity, Classifier (UML)
- Abstract
Datasets that have imbalanced class distributions pose a challenge for learning and classification algorithms. Imbalanced datasets exist in many domains, such as: fraud detection, sentiment analysis, churn prediction, and intrusion detection in computer networks. To solve the imbalance problem, three main approaches are typically used: data resampling, method adaptation and cost-sensitive learning; of these, data resampling, either oversampling the minority class instances or undersampling the majority class instances, is the most used approach. However, in most cases, when implementing these approaches, there is a trade-off between the predictive performance and the complexity. In this paper we introduce a fast, novel clustering-based undersampling technique for addressing binary-class imbalance problems, which demonstrates high predictive performance, while its time complexity is bound by the size of the minority class instances. During the training phase, the algorithm clusters the minority instances and selects a similar number of majority instances from each cluster. A specific classifier is then trained for each cluster. An unlabeled instance is classified as the majority class if it does not fit into any of the clusters. Otherwise, cluster-specific classifiers are used to return the instance's classification, and the results are weighted by the inverse-distance from the clusters. Our evaluation includes several state-of-the-art methods. We plot the Pareto frontier for various datasets, to consider both computational cost and predictive performance measures. Extensive sets of experiments demonstrate that only the suggested method is always found on the frontier.
- Published
- 2017
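A compact sketch of clustering-based undersampling in the spirit of the abstract above: cluster the minority class, sample a similar number of majority instances per cluster, train one classifier per cluster, and weight their votes by inverse distance to the cluster centers. The "fits no cluster" rule and the paper's exact design are simplified away; data and parameters are illustrative.

```python
# Sketch of clustering-based undersampling for class imbalance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=3000, weights=[0.95, 0.05], random_state=0)
X_min, X_maj = X[y == 1], X[y == 0]

k = 4
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_min)
classifiers = []
for c in range(k):
    Xc_min = X_min[km.labels_ == c]
    # sample a similar number of majority instances for this cluster
    idx = np.random.default_rng(c).choice(len(X_maj), size=len(Xc_min), replace=False)
    Xc = np.vstack([Xc_min, X_maj[idx]])
    yc = np.hstack([np.ones(len(Xc_min)), np.zeros(len(Xc_min))])
    classifiers.append(LogisticRegression(max_iter=1000).fit(Xc, yc))

def predict_proba_minority(x):
    """Inverse-distance-weighted average of the per-cluster classifiers."""
    x = x.reshape(1, -1)
    dists = np.linalg.norm(km.cluster_centers_ - x, axis=1)
    weights = 1.0 / (dists + 1e-9)
    probs = np.array([clf.predict_proba(x)[0, 1] for clf in classifiers])
    return float(np.average(probs, weights=weights))

print("P(minority) for a minority sample:", round(predict_proba_minority(X_min[0]), 3))
print("P(minority) for a majority sample:", round(predict_proba_minority(X_maj[0]), 3))
```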
24. Anomaly detection for smartphone data streams
- Author
-
Asaf Shabtai, Yuval Elovici, Yisroel Mirsky, Bracha Shapira, and Lior Rokach
- Subjects
Data stream, Computer Networks and Communications, Computer science, Data stream mining, Accelerometer, Computer Science Applications, Mobile security, Hardware and Architecture, Anomaly detection, Data mining, Private information retrieval, Real world data, Software, Information Systems
- Abstract
Smartphones centralize a great deal of users’ private information and are thus a primary target for cyber-attack. The main goal of the attacker is to try to access and exfiltrate the private information stored in the smartphone without detection. In situations where explicit information is lacking, these attackers can still be detected in an automated way by analyzing data streams (continuously sampled information such as an application’s CPU consumption, accelerometer readings, etc.). Once these data streams are clustered, anomaly detection techniques may be applied in order to detect attacks in progress. In this paper we utilize an algorithm called pcStream that is well suited for detecting clusters in real world data streams, and propose extensions to the pcStream algorithm designed to detect point, contextual, and collective anomalies. We provide a comprehensive evaluation that addresses mobile security issues on a unique dataset collected from 30 volunteers over eight months. Our evaluations show that the pcStream extensions can be used to effectively detect data leakage (point anomalies) and malicious activities (contextual anomalies) associated with malicious applications. Moreover, the algorithm can be used to detect when a device is being used by an unauthorized user (collective anomaly) within approximately 30 s with 1 false positive every two days.
- Published
- 2017
25. Artificial Intelligence – Game Changer in the Teratology Information Service
- Author
-
Lior Rokach, Guy Shtar, Maya Berlin, Tal De Haan, Bracha Shapira, Matitiahu Berkovitch, Natalie Dinavitser, Elkana Kohn, and Rana Cohen
- Subjects
Service (business), World Wide Web, Computer science, MEDLINE, Toxicology, Teratology
- Published
- 2020
26. An ensemble method for top-N recommendations from the SVD
- Author
-
Lior Rokach, David Ben-Shimon, and Bracha Shapira
- Subjects
General Engineering, Decision tree, Dot product, Recommender system, Machine learning, Ensemble learning, Computer Science Applications, Matrix decomposition, Set (abstract data type), Tree (data structure), Artificial Intelligence, Singular value decomposition, Data mining, Mathematics
- Abstract
- SVD suffers from a computational limitation when delivering top-N items online.
- An ensemble algorithm for getting the top-N items from the SVD results is proposed.
- The algorithm maps the items to the leaves of multiple compact trees offline.
- Users are assigned online to one leaf in each tree for obtaining their top-N items.
- The algorithm delivers faster and more accurate top-N items than the base SVD.
Matrix factorization methods such as the singular value decomposition technique have become very popular in the area of recommender systems. Given a rating matrix as input, these techniques output two matrices with lower dimensional space that represent the user and item features. The relevance of item i to user u is revealed by the score of the dot product between u's vector of features and i's vector of features. High scores indicate greater relevance. In order to deliver the best recommendations for a given user based on these latent features, one must obtain the list of scores of all the items for the given user and sort the resulting list. When the size of the catalogue is large, this phase consumes a large amount of computational time and cannot be done online. Another drawback of this approach is that once such a list is computed for a given user, it remains fixed, and it is impossible to incorporate within it new activities of the user. Hence, the online use of such techniques is limited. In this paper we propose an ensemble method for building a forest of trees offline, where each leaf in each tree holds a unique set of item vectors. Once a user is engaged with the system, its vector is classified to one leaf in each one of the trees in the forest for conducting a dot product with the corresponding items. By using this method we compute online only a small number of dot products for a given user vector, allowing us to quickly retrieve dynamic recommendations from the SVD, thereby presenting an alternative to the existing method which computes and caches all of the dot products among the items and users. The method maps the items to the leaves of multiple compact trees offline, each tree being a weak recommendation model, creating a forest-of-decision-trees algorithm in which users that are assigned to these leaves online are likely to produce high dot product scores with the items that are already in the leaves. We demonstrate the effectiveness of the suggested ensemble method by applying it to three public datasets and comparing it to a state-of-the-art algorithm aimed at solving the problem.
- Published
- 2016
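A rough sketch of the retrieval idea described above: take latent factors from an SVD of the rating matrix, partition the item vectors into leaves offline, route a user to a leaf online, and compute dot products only for the items stored there. The paper builds a forest of compact trees; for brevity a single k-means partition stands in for one tree, and the data is random.

```python
# Sketch: SVD latent factors plus one shallow partition of the item vectors,
# so that top-N retrieval only scores items in the user's leaf.
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

ratings = sparse_random(500, 2000, density=0.02, random_state=0).tocsr()  # users x items

# Latent factors from the SVD of the rating matrix.
svd = TruncatedSVD(n_components=32, random_state=0)
user_vecs = svd.fit_transform(ratings)          # (n_users, 32)
item_vecs = svd.components_.T                   # (n_items, 32)

# Offline: partition the items into leaves (one k-means level stands in for a tree).
km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(item_vecs)

def top_n(user_id, n=10):
    u = user_vecs[user_id]
    leaf = km.predict(u.reshape(1, -1))[0]              # route the user to a leaf
    candidates = np.where(km.labels_ == leaf)[0]        # items stored in that leaf
    scores = item_vecs[candidates] @ u                  # dot products for the leaf only
    return candidates[np.argsort(scores)[::-1][:n]]

print("top-10 items for user 0:", top_n(0))
```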
27. Analyzing movement predictability using human attributes and behavioral patterns
- Author
-
Lior Rokach, Adir Solomon, Amit Livne, Bracha Shapira, and Gilad Katz
- Subjects
Artificial neural network, Computer science, Ecological Modeling, Deep learning, Geography, Planning and Development, Behavioral pattern, Markov model, Machine learning, Urban Studies, Targeted advertising, Preprocessor, Artificial intelligence, Predictability, Predictive modelling, General Environmental Science
- Abstract
The ability to predict human mobility, i.e., transitions between a user's significant locations (the home, workplace, etc.) can be helpful in a wide range of applications, including targeted advertising, personalized mobile services, and transportation planning. Most studies on human mobility prediction have focused on the algorithmic perspective rather than on investigating human predictability. Human predictability has great significance, because it enables the creation of more robust mobility prediction models and the assignment of more accurate confidence scores to location predictions. In this study, we propose a novel method for detecting a user's stay points from millions of GPS samples. Then, after detecting these stay points, a long short-term memory (LSTM) neural network is used to predict future stay points. We explore the use of two types of stay point prediction models (a general model that is trained in advance and a personal model that is trained over time) and analyze the number of previous locations needed for accurate prediction. Our evaluation on two real-world datasets shows that by using our preprocessing approach, we can detect stay points from routine trajectories with higher accuracy than the methods commonly used in this domain, and that by utilizing various LSTM architectures instead of the traditional Markov models and advanced deep learning models, our method can predict human movement with high accuracy of more than 40% when using the Acc@1 measure and more than 59% when using the Acc@3 measure. We also demonstrate that the movement prediction accuracy varies for different user populations based on their trajectory characteristics and demographic attributes.
- Published
- 2021
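The prediction stage described above can be sketched with a small LSTM over stay-point IDs (the stay-point detection from raw GPS is omitted). The toy data below is a deterministic routine, and the architecture and hyperparameters are arbitrary; this is not the authors' model.

```python
# Sketch: an LSTM that predicts a user's next stay point from the previous ones.
# Stay points are toy integer IDs here; the paper detects them from raw GPS first.
import torch
import torch.nn as nn

n_locations, seq_len = 10, 5
torch.manual_seed(0)

class NextStayPoint(nn.Module):
    def __init__(self, n_loc, emb=16, hidden=32):
        super().__init__()
        self.emb = nn.Embedding(n_loc, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_loc)

    def forward(self, x):                 # x: (batch, seq_len) of location IDs
        h, _ = self.lstm(self.emb(x))
        return self.out(h[:, -1])         # logits over the next location

# Toy routine: the user cycles through locations 0..9, so the next one is (last+1) % 10.
seqs = torch.stack([torch.arange(i, i + seq_len) % n_locations for i in range(200)])
targets = (seqs[:, -1] + 1) % n_locations

model = NextStayPoint(n_locations)
opt = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
for _ in range(200):                      # brief training loop
    opt.zero_grad()
    loss = loss_fn(model(seqs), targets)
    loss.backward()
    opt.step()

acc = (model(seqs).argmax(dim=1) == targets).float().mean()
print("Acc@1 on the toy routine:", float(acc))
```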
28. SFEM: Structural feature extraction methodology for the detection of malicious office documents using machine learning methods
- Author
-
Nir Nissim, Yuval Elovici, Aviad Cohen, and Lior Rokach
- Subjects
Advanced persistent threat, Computer science, Feature extraction, General Engineering, Feature selection, Static analysis, Machine learning, Computer Science Applications, Artificial Intelligence, Ransomware, Malware, Executable, XML
- Abstract
Office documents are used extensively by individuals and organizations. Most users consider these documents safe for use. Unfortunately, Office documents can contain malicious components and perform harmful operations. Attackers increasingly take advantage of naive users and leverage Office documents in order to launch sophisticated advanced persistent threat (APT) and ransomware attacks. Recently, targeted cyber-attacks against organizations have been initiated with emails containing malicious attachments. Since most email servers do not allow the attachment of executable files to emails, attackers prefer the use of non-executable files (e.g., documents) for malicious purposes. Existing anti-virus engines primarily use signature-based detection methods, and therefore fail to detect new unknown malicious code which has been embedded in an Office document. Machine learning methods have been shown to be effective at detecting known and unknown malware in various domains; however, to the best of our knowledge, machine learning methods have not been used for the detection of malicious XML-based Office documents (*.docx, *.xlsx, *.pptx, *.odt, *.ods, etc.). In this paper we present a novel structural feature extraction methodology (SFEM) for XML-based Office documents. SFEM extracts discriminative features from documents, based on their structure. We leveraged SFEM’s features with machine learning algorithms for effective detection of malicious *.docx documents. We extensively evaluated SFEM with machine learning classifiers using a representative collection (16,938 *.docx documents collected "from the wild") which contains ∼4.9% malicious and ∼95.1% benign documents. We examined 1,600 unique configurations based on different combinations of feature extraction, feature selection, feature representation, top-feature selection methods, and machine learning classifiers. The results show that machine learning algorithms trained on features provided by SFEM successfully detect new unknown malicious *.docx documents. The Random Forest classifier achieves the highest detection rates, with an AUC of 99.12% and a true positive rate (TPR) of 97%, accompanied by a false positive rate (FPR) of 4.9%. In comparison, the best anti-virus engine achieves a TPR which is ∼25% lower.
- Published
- 2016
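The structural intuition behind SFEM is easy to demonstrate: an XML-based Office document is a ZIP archive of XML parts, so counting element tags per part yields a simple structural profile that a classifier could consume. The sketch below is far cruder than SFEM's feature set; the file path is a placeholder and no classifier training is shown.

```python
# Sketch: crude structural features from an XML-based Office document.
# A .docx file is a ZIP archive of XML parts; count element tags in each part.
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

def structural_features(docx_path):
    """Count XML element tags across every XML part of the document."""
    counts = Counter()
    with zipfile.ZipFile(docx_path) as zf:
        for name in zf.namelist():
            if not name.endswith(".xml"):
                continue
            try:
                root = ET.fromstring(zf.read(name))
            except ET.ParseError:
                continue                      # skip malformed parts
            for elem in root.iter():
                tag = elem.tag.split("}")[-1] # drop the XML namespace
                counts[f"{name}:{tag}"] += 1
    return counts

# Usage (the path is a placeholder): vectorize the counts for many documents with
# e.g. sklearn's DictVectorizer, then train a Random Forest on benign/malicious labels.
features = structural_features("example.docx")
print(features.most_common(10))
```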
29. Reducing preference elicitation in group decision making
- Author
-
Lior Rokach, Bracha Shapira, Lihi Naamani-Dery, and Meir Kalech
- Subjects
Preference learning, Operations research, Computer science, General Engineering, Recommender system, Expert system, Computer Science Applications, Group decision-making, Artificial Intelligence, Computational social choice, Preference elicitation, Data mining
- Abstract
- Reducing preference elicitation when selecting a winner item.
- Computing approximate winners with some confidence level.
- Terminating preference elicitation sooner by returning k alternatives.
- Two preference aggregation strategies: Least Misery and Majority.
- A user study on collected data from a group recommender system.
Groups may need assistance in reaching a joint decision. Elections can reveal the winning item, but this means the group members need to vote on, or at least consider, all available items. Our challenge is to minimize the number of preferences that need to be elicited and thus reduce the effort required from the group members. We present a model that offers a few innovations. First, rather than offering a single winner, we propose to offer the group the best top-k alternatives. This can be beneficial if a certain item suddenly becomes unavailable, or if the group wishes to choose manually from a few selected items. Secondly, rather than offering a definite winning item, we suggest approximating the item or the top-k items that best suit the group, according to a predefined confidence level. We study the tradeoff between the accuracy of the proposed winner item and the amount of preference elicitation required. Lastly, we offer to consider different preference aggregation strategies. These strategies differ in their emphasis: towards the individual users (Least Misery Strategy) or towards the majority of the group (Majority Based Strategy). We evaluate our findings on data collected in a user study as well as on real world and simulated datasets and show that selecting the suitable aggregation strategy and relaxing the termination condition can reduce the communication cost by up to 90%. Furthermore, the commonly used Majority strategy does not always outperform the Least Misery strategy. Addressing these three challenges contributes to the minimization of preference elicitation in expert systems.
- Published
- 2016
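The two aggregation strategies named in the abstract are easy to state in code; the sketch below applies them to a fully elicited toy rating matrix (the Majority strategy is approximated here by the group average). The incremental elicitation and confidence-based termination, which are the paper's main contribution, are not shown.

```python
# Sketch: Least Misery vs. Majority-style aggregation on a toy rating matrix.
import numpy as np

# rows = group members, columns = candidate items, values = ratings 1..5
ratings = np.array([
    [5, 3, 4, 1],
    [4, 4, 2, 5],
    [2, 5, 4, 4],
])

least_misery = ratings.min(axis=0)   # an item is only as good as its unhappiest member
majority = ratings.mean(axis=0)      # emphasis on the group as a whole (approximated by the mean)

k = 2
print("Least Misery top-k items:  ", np.argsort(least_misery)[::-1][:k])
print("Majority-based top-k items:", np.argsort(majority)[::-1][:k])
```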
30. Recommender systems for product bundling
- Author
-
Moran Beladev, Bracha Shapira, and Lior Rokach
- Subjects
Information Systems and Management, Operations research, Computer science, E-commerce, Recommender system, Marketing strategy, Management Information Systems, Artificial Intelligence, Order (business), Collaborative filtering, Revenue, Software, Information filtering system
- Abstract
Recommender systems (RS) are a class of information filtering applications whose main goal is to provide personalized recommendations, content, and services to users. Recommendation services may support a firm's marketing strategy and contribute to increasing revenues. Most RS methods were designed to provide recommendations of single items. Generating bundle recommendations, i.e., recommendations of two or more items together, can satisfy consumer needs, while at the same time increasing customers' buying scope and the firm's income. Thus, finding and recommending an optimal and personal bundle becomes very important. Recommendation of bundles of products should also involve personalized pricing to predict which price should be offered to a user in order for the bundle to maximize purchase probability. However, most recommendation methods do not involve such personal price adjustment. This paper introduces a novel model of bundle recommendations that integrates collaborative filtering (CF) techniques, demand functions, and price modeling. This model maximizes the expected revenue of a recommendation list by finding pairs of products and pricing them in a way that maximizes both the probability of the bundle's purchase by the user and the revenue received by selling the bundle. Experiments with several real-world datasets have been conducted in order to evaluate the accuracy of the bundling model predictions. This paper compares the proposed method with several state-of-the-art methods (collaborative filtering and SVD). It has been found that using bundle recommendation can improve the accuracy of results. Furthermore, the suggested price recommendation model provides a good estimate of the actual price paid by the user and at the same time can increase the firm's income.
- Published
- 2016
31. Matching entities across online social networks
- Author
-
Michael Fire, Olga Peled, Lior Rokach, and Yuval Elovici
- Subjects
Social and Information Networks (cs.SI), Matching (statistics), Social network, Computer science, Cognitive Neuroscience, Supervised learning, Computer Science - Social and Information Networks, Machine learning, Computer Science Applications, Artificial Intelligence, Identity (object-oriented programming), Personally identifiable information
- Abstract
Online Social Networks (OSNs), such as Facebook and Twitter, have become an integral part of our daily lives. There are hundreds of OSNs, each with its own focus, offering particular services and functionalities. Recent studies show that many OSN users create several accounts on multiple OSNs using the same or different personal information. Collecting all the available data of an individual from several OSNs and fusing it into a single profile can be useful for many purposes. In this paper, we introduce novel machine learning based methods for solving Entity Resolution (ER), a problem for matching user profiles across multiple OSNs. The presented methods are able to match two user profiles from two different OSNs based on supervised learning techniques, which use features extracted from each one of the user profiles. By using the extracted features and supervised learning techniques, we developed classifiers which can perform entity matching between two profiles for the following scenarios: (a) matching entities across two OSNs; (b) searching for a user by similar name; and (c) de-anonymizing a user's identity. The constructed classifiers were tested using data collected from two popular OSNs, Facebook and Xing. We then evaluated the classifiers' performance using various evaluation measures, such as true and false positive rates, accuracy, and the Area Under the receiver operating characteristic Curve (AUC). The classification performance of the constructed classifiers was quite remarkable, with an AUC of up to 0.982 and an accuracy of up to 95.9% in identifying user profiles across two OSNs.
- Published
- 2016
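A toy sketch of the supervised entity-resolution setup described above: compute a few similarity features for a pair of profiles (name similarity, friend-list overlap, location match) and train a classifier on labelled matching/non-matching pairs. The features, profiles, and labels below are invented for illustration and are much simpler than the paper's feature set.

```python
# Sketch: supervised profile matching across two social networks (toy data).
from difflib import SequenceMatcher
from sklearn.ensemble import RandomForestClassifier

def pair_features(p1, p2):
    name_sim = SequenceMatcher(None, p1["name"].lower(), p2["name"].lower()).ratio()
    f1, f2 = set(p1["friends"]), set(p2["friends"])
    friend_overlap = len(f1 & f2) / max(len(f1 | f2), 1)   # Jaccard of friend lists
    same_city = float(p1["city"] == p2["city"])
    return [name_sim, friend_overlap, same_city]

# Toy labelled pairs: 1 = same person on two networks, 0 = different people.
pairs = [
    ({"name": "Dana Levi", "friends": ["a", "b", "c"], "city": "Haifa"},
     {"name": "Dana Levy", "friends": ["a", "b", "d"], "city": "Haifa"}, 1),
    ({"name": "Dana Levi", "friends": ["a", "b", "c"], "city": "Haifa"},
     {"name": "John Smith", "friends": ["x", "y"], "city": "London"}, 0),
    ({"name": "Noa Cohen", "friends": ["k", "l"], "city": "Tel Aviv"},
     {"name": "N. Cohen", "friends": ["k", "l", "m"], "city": "Tel Aviv"}, 1),
    ({"name": "Noa Cohen", "friends": ["k", "l"], "city": "Tel Aviv"},
     {"name": "Ana Koch", "friends": ["z"], "city": "Berlin"}, 0),
]
X = [pair_features(p1, p2) for p1, p2, _ in pairs]
y = [label for _, _, label in pairs]

clf = RandomForestClassifier(random_state=0).fit(X, y)
candidate = pair_features({"name": "Dana Levi", "friends": ["a", "c"], "city": "Haifa"},
                          {"name": "D. Levi", "friends": ["a", "c", "e"], "city": "Haifa"})
print("match probability:", clf.predict_proba([candidate])[0, 1])
```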
32. Utilizing transfer learning for in-domain collaborative filtering
- Author
-
Aviram Dayan, Ariel Bar, Edita Grolman, Bracha Shapira, and Lior Rokach
- Subjects
Information Systems and Management, Computer science, Event (computing), RSS, Recommender system, Machine learning, Linear subspace, Management Information Systems, Domain (software engineering), Artificial Intelligence, Collaborative filtering, Data mining, Transfer of learning, Software, Sparse matrix
- Abstract
In recent years, transfer learning has been used successfully to improve the predictive performance of collaborative filtering (CF) for sparse data by transferring patterns across domains. In this work, we advance transfer learning (TL) in recommendation systems (RSs), facilitating improvement within a domain rather than across domains. Specifically, we utilize TL for in-domain usage. This reduces the need to obtain information from additional domains, while achieving stronger single domain results than other state-of-the-art CF methods. We present two new algorithms; the first utilizes different event data within the same domain and boosts recommendations of the target event (e.g., the buy event), and the second algorithm transfers patterns from dense subspaces of the dataset to sparse subspaces. Experiments on real-life and publicly available datasets reveal that the proposed methods outperform existing state-of-the-art CF methods.
- Published
- 2016
33. Towards latent context-aware recommendation systems
- Author
-
Bracha Shapira, Lior Rokach, Moshe Unger, and Ariel Bar
- Subjects
Information Systems and Management ,Information retrieval ,business.industry ,Computer science ,Process (engineering) ,Deep learning ,Context (language use) ,02 engineering and technology ,Recommender system ,Machine learning ,computer.software_genre ,Management Information Systems ,Set (abstract data type) ,Artificial Intelligence ,Mobile phone ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Artificial intelligence ,business ,Mobile device ,computer ,Software ,Curse of dimensionality - Abstract
The emergence and penetration of smart mobile devices has given rise to the development of context-aware systems that utilize sensors to collect available data about users in order to improve various user services. Recently, the use of context-aware recommender systems (CARS), which take the user's context into account when recommending items, has expanded. Adding context to recommendation systems is challenging, because incorporating various environmental contexts into the recommendation process expands its dimensionality and thus increases sparsity. Therefore, existing CARS tend to incorporate a small set of pre-defined explicit contexts which do not necessarily represent user context or reflect the optimal set of features for the recommendation process. We suggest a novel approach centered on representing environmental features as low-dimensional, unsupervised latent contexts. We extract data from a rich set of mobile sensors in order to infer unexplored user contexts in an unsupervised manner. The latent contexts are hidden context patterns modeled as numeric vectors which are efficiently extracted from raw sensor data. The latent contexts are automatically learned for each user using unsupervised deep learning techniques and PCA on the data collected from the user's mobile phone. Integrating the data extracted from high-dimensional sensors into a new latent context-aware recommendation algorithm results in up to a 20% increase in recommendation accuracy.
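To make the latent-context idea concrete, the sketch below compresses a matrix of raw sensor-derived features into a handful of latent dimensions with PCA, one of the unsupervised techniques the abstract mentions. The sensor matrix, its size, and the number of components are invented for illustration; the paper's deep-learning component is not shown.

```python
# Illustrative: extract low-dimensional latent contexts from mobile-sensor features.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical matrix: 500 time windows x 40 sensor-derived features
# (accelerometer statistics, light level, battery, Wi-Fi counts, ...).
sensor_features = rng.normal(size=(500, 40))

pca = PCA(n_components=5)              # compress to 5 latent context dimensions
latent_contexts = pca.fit_transform(sensor_features)

print(latent_contexts.shape)           # (500, 5): one latent context vector per window
print(pca.explained_variance_ratio_)   # variance retained by each latent dimension
```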
- Published
- 2016
34. Decision forest: Twenty years of research
- Author
-
Lior Rokach
- Subjects
Incremental decision tree ,Computer science ,business.industry ,Decision tree learning ,Decision tree ,ID3 algorithm ,02 engineering and technology ,computer.software_genre ,Machine learning ,Regression ,Random forest ,Hardware and Architecture ,020204 information systems ,Signal Processing ,Covariate ,0202 electrical engineering, electronic engineering, information engineering ,Alternating decision tree ,020201 artificial intelligence & image processing ,Data mining ,Artificial intelligence ,business ,computer ,Software ,Information Systems - Abstract
A decision tree is a predictive model that recursively partitions the covariate space into subspaces such that each subspace constitutes a basis for a different prediction function. Decision trees can be used for various learning tasks, including classification, regression and survival analysis. Due to their unique benefits, decision trees have become one of the most powerful and popular approaches in data science. A decision forest aims to improve the predictive performance of a single decision tree by training multiple trees and combining their predictions. This paper provides an introduction to the subject by explaining how a decision forest can be created and when it is most valuable. In addition, we review some popular methods for generating the forest, fusing the individual trees' outputs, and thinning large decision forests.
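As a minimal concrete instance of the ideas surveyed, the sketch below grows a small forest by training trees on bootstrap samples and fuses their predictions by majority vote. It is one simple way to create and combine a forest, not any specific algorithm from the paper.

```python
# Minimal decision forest: bootstrap sampling + majority-vote fusion.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(42)

trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))      # bootstrap sample of the training set
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    trees.append(tree.fit(X[idx], y[idx]))

votes = np.array([t.predict(X) for t in trees])     # shape: (n_trees, n_samples)
fused = np.array([np.bincount(col).argmax() for col in votes.T])  # majority vote per sample
print("training accuracy of the fused forest:", (fused == y).mean())
```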
- Published
- 2016
35. XML-AD: Detecting anomalous patterns in XML documents
- Author
-
Eitan Menahem, Alon Schclar, Yuval Elovici, and Lior Rokach
- Subjects
Information Systems and Management ,Information retrieval ,computer.internet_protocol ,Computer science ,XML Signature ,020206 networking & telecommunications ,02 engineering and technology ,computer.software_genre ,Computer Science Applications ,Theoretical Computer Science ,Simple API for XML ,Artificial Intelligence ,Control and Systems Engineering ,Feature (computer vision) ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,0202 electrical engineering, electronic engineering, information engineering ,Information system ,020201 artificial intelligence & image processing ,Data mining ,computer ,Software ,XML - Abstract
Many information systems use XML documents to store data and to interact with other systems. Abnormal documents, which can be the result of either an ongoing cyber attack or the actions of a benign user, can potentially harm the interacting systems and are therefore regarded as a threat. In this paper we address the problem of anomaly detection and localization in XML documents using machine learning techniques. We present XML-AD, a new XML anomaly detection framework. Within this framework, we developed an automatic method for extracting features from XML documents, as well as a practical method for transforming XML features into vectors of fixed dimensionality. With these two methods in place, the XML-AD framework makes it possible to utilize general learning algorithms for anomaly detection. The core of the framework consists of a novel multi-univariate anomaly detection algorithm, ADIFA. The framework was evaluated using four XML document datasets obtained from real information systems. It achieved over an 89% true positive detection rate with less than 0.2% false positives.
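The step of turning an XML document into a fixed-length vector can be illustrated with a toy example. The three structural features below (element count, maximum depth, total text length) are generic assumptions for illustration and are not the features extracted by XML-AD.

```python
# Illustrative: map an XML document to a fixed-dimensionality numeric vector.
import xml.etree.ElementTree as ET

def xml_to_vector(xml_text):
    root = ET.fromstring(xml_text)
    elems = list(root.iter())
    def depth(e, d=0):
        return max([depth(c, d + 1) for c in e] + [d])
    n_elements = len(elems)                                   # structural size
    max_depth = depth(root)                                   # nesting depth
    total_text = sum(len((e.text or "").strip()) for e in elems)  # payload length
    return [n_elements, max_depth, total_text]

doc = "<order><id>42</id><items><item qty='3'>pen</item></items></order>"
print(xml_to_vector(doc))   # [4, 2, 5] -> such vectors feed an anomaly detector
```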
- Published
- 2016
36. Keystroke dynamics obfuscation using key grouping
- Author
-
Lior Rokach, Itay Hazan, and Oded Margalit
- Subjects
Password ,0209 industrial biotechnology ,business.industry ,Computer science ,General Engineering ,02 engineering and technology ,Encryption ,Machine learning ,computer.software_genre ,Keystroke logging ,Session (web analytics) ,Computer Science Applications ,020901 industrial engineering & automation ,Keystroke dynamics ,Artificial Intelligence ,Identity theft ,Obfuscation ,0202 electrical engineering, electronic engineering, information engineering ,Key (lock) ,020201 artificial intelligence & image processing ,Artificial intelligence ,business ,computer - Abstract
Keystroke dynamics is one of the most widely adopted identity verification techniques in remote systems. It is based on modeling users' specific patterns of typing on the keyboard. When utilized in conjunction with commonly used passwords, keystroke dynamics can dramatically increase the level of security without interfering with the user experience. However, aspects of keystroke dynamics applied to passwords, such as processing keystroke events and storing feature vectors or user models, can expose users to identity theft and a new set of privacy risks, thus calling into question the added value of keystroke dynamics. In addition, common encryption techniques are unable to mitigate these threats, since the user's behavior changes from one session to another. In this paper, we suggest key grouping as an obfuscation method for ensuring keystroke dynamics privacy. When applied to the keystroke events, key grouping dramatically reduces the possibility of password theft. To perform the key grouping optimally, we present a novel method which produces groups that can be integrated with any keystroke dynamics algorithm. Our method divides the keys into groups using hierarchical clustering with a dedicated statistical heuristics algorithm. We tested our method's key grouping output on five keystroke dynamics algorithms using a public dataset and showed a consistent improvement of up to 7% in the AUC over more intuitive key groupings and random key groupings.
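A toy version of the grouping step is sketched below: keys are clustered hierarchically by simple timing statistics. The per-key features and timing values are invented, and the paper's dedicated statistical heuristics are not reproduced.

```python
# Toy sketch: group keyboard keys by typing-timing similarity via hierarchical clustering.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

keys = ["a", "s", "d", "f", "j", "k", "l", ";"]
# Hypothetical per-key features: [mean dwell time (ms), std of dwell time (ms)].
features = np.array([
    [95, 12], [98, 11], [93, 13], [97, 10],
    [120, 20], [118, 22], [122, 19], [125, 25],
], dtype=float)

Z = linkage(features, method="ward")             # agglomerative clustering of keys
groups = fcluster(Z, t=3, criterion="maxclust")  # cut the dendrogram into 3 key groups

for g in sorted(set(groups)):
    print(f"group {g}:", [k for k, gi in zip(keys, groups) if gi == g])
```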
- Published
- 2020
37. Machine learning and operation research based method for promotion optimization of products with no price elasticity history
- Author
-
Asnat Greenstein-Messica and Lior Rokach
- Subjects
Marketing ,Price elasticity of demand ,Operations research ,Computer Networks and Communications ,Computer science ,Model prediction ,05 social sciences ,Prediction interval ,02 engineering and technology ,Product price ,Computer Science Applications ,020204 information systems ,Management of Technology and Innovation ,0502 economics and business ,0202 electrical engineering, electronic engineering, information engineering ,050211 marketing ,Gradient boosting ,Elasticity (economics) - Abstract
Many leading e-commerce retailers adopt a consistent pricing strategy to build customer trust and promote just a small portion of their catalog each week. Promotion optimization for consistent-pricing retailers is a challenging problem, as they need to decide which products to promote with no historical price elasticity information for the candidate products. In this paper, we introduce a novel approach for predicting product price elasticity impact for e-commerce retailers who use a consistent pricing strategy. We combine the commonly used operations research-based log–log demand model with the nonlinear gradient boosting machines algorithm to predict the price elasticity impact of products with no historical price elasticity information. A pessimistic prediction interval measure is used to accelerate the learning period and reduce the probability of selecting low-impact promotions due to high model prediction uncertainty. We demonstrate the effectiveness of our approach on a real-world dataset collected from an online European department store.
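The log–log demand model mentioned in the abstract is the classic formulation ln(demand) = a + b·ln(price), where the slope b is the price elasticity. The sketch below fits it to invented data points; the paper's gradient boosting component and prediction-interval measure are not shown.

```python
# Worked sketch of the log-log demand model: ln(demand) = a + b * ln(price).
import numpy as np

price  = np.array([10.0, 12.0, 15.0, 18.0, 20.0, 25.0])   # invented observations
demand = np.array([520., 430., 350., 300., 270., 210.])

b, a = np.polyfit(np.log(price), np.log(demand), deg=1)   # slope b = price elasticity
print(f"estimated price elasticity: {b:.2f}")             # roughly -1.0 for this toy data

# Predicted demand after a hypothetical 20% price cut on an item priced at 15.0:
new_price = 15.0 * 0.8
print("predicted demand:", np.exp(a + b * np.log(new_price)))
```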
- Published
- 2020
38. Securing keystroke dynamics from replay attacks
- Author
-
Lior Rokach, Itay Hazan, and Oded Margalit
- Subjects
0209 industrial biotechnology ,Authentication ,Biometrics ,Computer science ,02 engineering and technology ,Keystroke logging ,Computer security ,computer.software_genre ,020901 industrial engineering & automation ,Keystroke dynamics ,0202 electrical engineering, electronic engineering, information engineering ,Identity (object-oriented programming) ,020201 artificial intelligence & image processing ,computer ,Replay attack ,Protocol (object-oriented programming) ,Software - Abstract
Keystroke dynamics is a viable behavioral biometric technique for identity verification based on users' keyboard interaction traits. Keystroke dynamics can help prevent credentials from being abused in case of theft or leakage. But what happens if the keystroke events are eavesdropped and replayed? Attackers who intercept keystroke dynamics authentication sessions of benign users can easily replay them from other sources, unchanged or with minor changes, and gain illegitimate privileges. Hence, even with its major security advantages, keystroke dynamics can still expose authentication mechanisms to replay attacks. Although the replay attack is one of the oldest techniques for manipulating authentication systems, keystroke dynamics does not help prevent it. We suggest a new protocol for dynamics exchange based on choosing a subset of real and fake information snippets shared between the client and service providers to lure potential attackers. We evaluated our method on four state-of-the-art keystroke dynamics algorithms and three publicly available datasets and showed that we can dramatically reduce the possibility of replay attacks while preserving highly accurate user verification.
- Published
- 2019
39. Local-shapelets for fast classification of spectrographic measurements
- Author
-
Daniel Gordon, Aryeh Kontorovich, Lior Rokach, and Danny Hendler
- Subjects
Series (mathematics) ,Artificial Intelligence ,Computer science ,Decision tree learning ,General Engineering ,Data mining ,computer.software_genre ,Throughput (business) ,Class (biology) ,computer ,Computer Science Applications - Abstract
We present an algorithm for classifying spectrographic measurements. The concept of locality is introduced into an established time series algorithm. A technique for estimating a tolerance parameter is presented. Learning and classification times are reduced by two orders of magnitude. Accuracy levels are retained. Spectroscopy is widely used in the food industry as a time-efficient alternative to chemical testing. Lightning-monitoring systems also employ spectroscopic measurements. The latter application is important as it can help predict the occurrence of severe storms, such as tornadoes. The shapelet-based classification method is particularly well suited for spectroscopic data sets. This technique for classifying time series extracts patterns unique to each class. A significant downside of this approach is the time required to build the classification tree. In addition, for high-throughput applications the classification time of long time series is prohibitive. Although some progress has been made in terms of reducing the time complexity of building shapelet-based models, the problem of reducing classification time has remained an open challenge. We address this challenge by introducing local-shapelets. This variant of the shapelet method restricts the search for a match between shapelets and time series to the vicinity of the location from which each shapelet was extracted. This significantly reduces the time required to examine each shapelet during both the learning and classification phases. Classification based on local-shapelets is well suited for spectroscopic data sets, as these are typically very tightly aligned. Our experimental results on such data sets demonstrate that the new approach reduces learning and classification time by two orders of magnitude while retaining the accuracy of regular (non-local) shapelet-based classification. In addition, we provide some theoretical justification for local-shapelets.
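The core locality idea can be shown in a few lines: instead of sliding a shapelet over the entire series, the distance is computed only over windows near the position from which the shapelet was extracted. This is a simplification for illustration, not the paper's full algorithm or its tolerance-estimation technique.

```python
# Minimal illustration of local-shapelet matching restricted to a neighborhood.
import numpy as np

def local_shapelet_distance(series, shapelet, center, tolerance):
    """Smallest Euclidean distance between the shapelet and any window whose
    start lies within +/- tolerance of the shapelet's original position."""
    m = len(shapelet)
    starts = range(max(0, center - tolerance),
                   min(len(series) - m, center + tolerance) + 1)
    return min(np.linalg.norm(series[s:s + m] - shapelet) for s in starts)

rng = np.random.default_rng(1)
series = rng.normal(size=200)
shapelet = series[50:60].copy()          # pretend it was extracted at position 50

# The local search touches ~2*tolerance windows instead of ~len(series) windows.
print(local_shapelet_distance(series, shapelet, center=50, tolerance=5))   # 0.0: exact match nearby
```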
- Published
- 2015
40. Novel active learning methods for enhanced PC malware detection in windows OS
- Author
-
Nir Nissim, Yuval Elovici, Robert Moskovitch, and Lior Rokach
- Subjects
Point (typography) ,Computer science ,Active learning (machine learning) ,business.industry ,General Engineering ,Information security ,computer.software_genre ,Machine learning ,Computer Science Applications ,Support vector machine ,Task (computing) ,Artificial Intelligence ,Microsoft Windows ,Malware ,Data mining ,Artificial intelligence ,Suspect ,business ,computer - Abstract
The creation of new malware every day poses a significant challenge to anti-virus vendors, since anti-virus tools, which use manually crafted signatures, are only capable of identifying known malware instances and their relatively similar variants. To identify new and unknown malware for updating their anti-virus signature repository, anti-virus vendors must collect new, suspicious files daily that need to be analyzed manually by information security experts who then label them as malware or benign. Analyzing suspected files is a time-consuming task, and it is impossible to manually analyze all of them. Consequently, anti-virus vendors use machine learning algorithms and heuristics in order to reduce the number of suspect files that must be inspected manually. These techniques, however, lack an essential element: they cannot be updated daily. In this work we introduce a solution for this updatability gap. We present an active learning (AL) framework and introduce two new AL methods that assist anti-virus vendors in focusing their analytical efforts by acquiring those files that are most probably malicious. These new AL methods are designed and oriented towards new malware acquisition. To test the capability of our methods for acquiring new malware from a stream of unknown files, we conducted a series of experiments over a ten-day period. A comparison of our methods to existing high-performance AL methods and to random selection, the naive baseline, indicates that the AL methods outperformed random selection for all performance measures. Our AL methods outperformed the existing AL method in two respects, both related to the number of new malware samples acquired daily, the core measure in this study. First, our best performing AL method, termed “Exploitation”, acquired on the 9th day of the experiment about 2.6 times more malware than the existing AL method and 7.8 times more than random selection. Second, while the existing AL method showed a decrease in the number of new malware samples acquired over the 10 days, our AL methods showed an increase and a daily improvement in the number of new malware samples acquired. Both results point towards increased efficiency that can possibly assist anti-virus vendors.
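An exploitation-style acquisition step can be illustrated generically: rank the unlabeled files by the classifier's malicious-class probability and send the top-ranked ones for manual analysis. The data, the SVM classifier, and the ranking below are assumptions for illustration and not the paper's exact selection criterion.

```python
# Simplified "exploitation"-style acquisition: pick the files most probably malicious.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(7)
X_labeled   = rng.normal(size=(200, 20))                                       # analyzed files
y_labeled   = (X_labeled[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)   # 1 = malware (toy labels)
X_unlabeled = rng.normal(size=(1000, 20))                                      # today's unknown stream

clf = SVC(probability=True).fit(X_labeled, y_labeled)
p_malicious = clf.predict_proba(X_unlabeled)[:, 1]

budget = 10                                         # daily manual-analysis capacity
to_analyze = np.argsort(p_malicious)[::-1][:budget] # highest-probability files first
print("indices of files to acquire:", to_analyze)
```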
- Published
- 2014
41. Reaching a joint decision with minimal elicitation of voter preferences
- Author
-
Meir Kalech, Bracha Shapira, Lior Rokach, and Lihi Dery
- Subjects
Information Systems and Management ,business.industry ,Heuristic ,Range voting ,media_common.quotation_subject ,Probabilistic logic ,Machine learning ,computer.software_genre ,Computer Science Applications ,Theoretical Computer Science ,Identification (information) ,Artificial Intelligence ,Control and Systems Engineering ,Voting ,Probability distribution ,Preference elicitation ,Artificial intelligence ,business ,Social choice theory ,computer ,Software ,Mathematics ,media_common - Abstract
Sometimes voters are required to reach a joint decision and find an item that best suits the group's preferences. Voters may wish to state preferences only when necessary, particularly in cases where there are many available options; it is therefore impractical to assume that all voter preferences are known at all times. In order to elicit voter preferences at a minimal cost, a preference elicitation process is required. We introduce a general approach for reaching a joint decision with minimal elicitation of voter preferences. The approach is probabilistic and uses voting rules to find a necessary winning item, which is presented to the group as their best option. We propose computing a voter-item probability distribution and develop methods based on this distribution that determine which voter-item pair to query. Computing the optimal minimal set of voter-item queries is computationally intractable; therefore we propose novel heuristic algorithms, named DIG and ES, which proceed iteratively until a winning item is identified. The probabilistic voting distribution is updated as more information is revealed. Experiments on simulated data examine the benefits of each of the algorithms under different settings. Experiments with the real-world Netflix data show that the proposed algorithms reduce the number of ratings required to identify the winning item by more than 50%.
- Published
- 2014
42. Mobile malware detection through analysis of deviations in application network behavior
- Author
-
Asaf Shabtai, Bracha Shapira, Yuval Elovici, Dudu Mimran, Lena Tenenboim-Chekina, and Lior Rokach
- Subjects
General Computer Science ,Computer science ,Real-time computing ,Behavioral pattern ,Network behavior ,Computer security ,computer.software_genre ,Mobile malware ,Android malware ,Malware ,Anomaly detection ,Android (operating system) ,Law ,computer ,Mobile device - Abstract
In this paper we present a new behavior-based anomaly detection system for detecting meaningful deviations in a mobile application's network behavior. The main goal of the proposed system is to protect mobile device users and cellular infrastructure companies from malicious applications by: (1) identifying malicious attacks or masquerading applications installed on a mobile device, and (2) identifying republished popular applications injected with malicious code (i.e., repackaging). More specifically, we attempt to detect a new type of mobile malware with self-updating capabilities that was recently found on the official Google Android marketplace. Malware of this type cannot be detected using the standard signature approach or by applying regular static or dynamic analysis methods. The detection is performed based on the application's network traffic patterns only. For each application, a model representing its specific traffic pattern is learned locally (i.e., on the device). Semi-supervised machine learning methods are used for learning the normal behavioral patterns and for detecting deviations from the application's expected behavior. These methods were implemented and evaluated on Android devices. The evaluation experiments demonstrate that: (1) various applications have specific network traffic patterns, and certain application categories can be distinguished by their network patterns; (2) different levels of deviation from normal behavior can be detected accurately; (3) in the case of self-updating malware, the original (benign) and infected versions of an application have different and distinguishable network traffic patterns that, in most cases, can be detected within a few minutes after the malware is executed, while presenting a very low false alarm rate; and (4) local learning is feasible and has a low performance overhead on mobile devices.
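The general pattern of learning an application's normal traffic and flagging deviations can be sketched as follows. The traffic features, the IsolationForest detector, and the numbers below are illustrative stand-ins; they are not the paper's own semi-supervised learners or feature set.

```python
# Hedged sketch: learn one app's "normal" traffic pattern, then flag deviations.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(5)
# Hypothetical per-minute features for one app: [bytes sent, bytes received, connections].
normal_traffic = rng.normal(loc=[2000, 15000, 4], scale=[300, 2000, 1], size=(500, 3))

model = IsolationForest(random_state=0).fit(normal_traffic)   # trained on normal behavior only

new_windows = np.array([
    [2100, 14500, 4],        # resembles the app's usual behavior
    [90000, 500, 60],        # e.g. an injected component exfiltrating data
])
print(model.predict(new_windows))   # 1 = normal, -1 = anomalous deviation
```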
- Published
- 2014
43. Volatile memory analysis using the MinHash method for efficient and secured detection of malware in private cloud
- Author
-
Yuval Elovici, Aviad Cohen, Omri Lahav, Lior Rokach, and Nir Nissim
- Subjects
General Computer Science ,business.industry ,Computer science ,020206 networking & telecommunications ,Cloud computing ,02 engineering and technology ,MinHash ,computer.software_genre ,Cloud computing architecture ,Virtual machine ,0202 electrical engineering, electronic engineering, information engineering ,Operating system ,Ransomware ,Malware ,020201 artificial intelligence & image processing ,State (computer science) ,business ,Law ,computer ,Volatile memory - Abstract
Today, most organizations employ cloud computing environments both for computational reasons and for storing their critical files and data. Virtual servers are an example of widely used virtual resources provided by cloud computing architecture. Therefore, virtual servers are considered an attractive target for cyber-attackers, who launch their attacks using malware such as the well-known remote access trojans (RATs) and more modern malware such as ransomware and cryptojacking. Existing security solutions implemented on virtual servers fail to detect this newly created malware (zero-day attacks). In fact, by the time the security solution is updated, the organization has likely already been attacked. In this study, we present a dedicated framework aimed at the trusted and secure detection of newly created and unknown instances of malware on virtual machines in an organization's private cloud. We took volatile memory dumps from a virtual machine (VM) in a secure and trusted manner and analyzed all of the data within the memory dumps using the MinHash method; MinHash is well suited for the accurate detection of malware in VMs based on efficient volatile memory dump comparisons. The proposed framework was evaluated in a comprehensive set of experiments of increasing difficulty in which we also measured the detection performance of different classifiers (both similarity-based and machine learning-based classifiers), using collections of real-world, professional, notorious malware and legitimate applications. The evaluation results show that our framework can detect the anomalous state of a virtual server, as well as known, new, and unknown malware, with very high TPRs (100% for ransomware and RATs) and very low FPRs (1.8% for ransomware and no FPR for RATs). We also show how the methodology's performance can be improved in terms of required time and storage space, saving more than 86% of these resources. Finally, we demonstrate the generalization capabilities and practicality of our methodology by using transfer learning and learning from just one virtual server in order to detect unknown malware on a different virtual server.
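MinHash itself is easy to demonstrate: it compresses a large set into a short signature from which the Jaccard similarity of two sets can be estimated, which is why it suits efficient dump-to-dump comparison. How memory pages are actually extracted and fingerprinted in the paper is not shown; the page sets below are invented.

```python
# Minimal MinHash: estimate Jaccard similarity between two sets of memory-page fingerprints.
import hashlib

def minhash_signature(items, num_hashes=128):
    """One MinHash value per seeded hash function: the minimum hash over the set."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.sha1(f"{seed}:{item}".encode()).hexdigest(), 16)
            for item in items))
    return sig

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

clean_dump    = {f"page_{i}" for i in range(1000)}
infected_dump = {f"page_{i}" for i in range(900)} | {f"evil_{i}" for i in range(100)}

sim = estimated_jaccard(minhash_signature(clean_dump), minhash_signature(infected_dump))
print(f"estimated similarity: {sim:.2f}   (true Jaccard is 900/1100, about 0.82)")
```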
- Published
- 2019
44. User identity verification via mouse dynamics
- Author
-
Alon Schclar, Robert Moskovitch, Yuval Elovici, Lior Rokach, and Clint Feher
- Subjects
Password ,Information Systems and Management ,business.industry ,Computer science ,Aggregate (data warehouse) ,Pointing device ,Computer security ,computer.software_genre ,Computer Science Applications ,Theoretical Computer Science ,Artificial Intelligence ,Control and Systems Engineering ,Human–computer interaction ,Identity theft ,Identity (object-oriented programming) ,Smart card ,business ,computer ,Software ,Hacker - Abstract
Identity theft is a crime in which hackers perpetrate fraudulent activity under stolen identities by using credentials, such as passwords and smartcards, unlawfully obtained from legitimate users, or by using logged-on computers that are left unattended. User verification methods provide a security layer in addition to the username and password by continuously validating the identity of logged-on users based on their physiological and behavioral characteristics. We introduce a novel method that continuously verifies users according to characteristics of their interaction with the mouse. The contribution of this work is threefold. First, user verification is derived from the classification results of each individual mouse action, in contrast to methods that aggregate mouse actions. Second, we propose a hierarchy of mouse actions from which the features are extracted. Third, we introduce new features to characterize mouse activity, which are used in conjunction with features proposed in previous work. The proposed algorithm outperforms current state-of-the-art methods by achieving higher verification accuracy while reducing the response time of the system.
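The per-action idea can be illustrated with a toy feature extractor for a single mouse-move action. The three features below (duration, mean speed, path efficiency) are generic examples and do not reproduce the paper's feature hierarchy.

```python
# Toy per-action feature extraction for mouse dynamics.
import numpy as np

# One mouse-move action as (t_seconds, x, y) samples.
action = np.array([(0.00, 100, 100), (0.05, 130, 110), (0.10, 170, 125),
                   (0.15, 220, 150), (0.20, 300, 200)], dtype=float)

t, x, y = action[:, 0], action[:, 1], action[:, 2]
step_len = np.hypot(np.diff(x), np.diff(y))
path_len = step_len.sum()
straight = np.hypot(x[-1] - x[0], y[-1] - y[0])

features = {
    "duration_s":      t[-1] - t[0],
    "mean_speed":      path_len / (t[-1] - t[0]),   # pixels per second
    "path_efficiency": straight / path_len,         # 1.0 means a perfectly straight move
}
print(features)   # such per-action vectors would feed a per-action verification classifier
```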
- Published
- 2012
45. Limiting disclosure of sensitive data in sequential releases of databases
- Author
-
Tamir Tassa, Bracha Shapira, Lior Rokach, Erez Shmueli, and Raz Wasserstein
- Subjects
Information Systems and Management ,Database ,business.industry ,Generalization ,Computer science ,Privacy laws of the United States ,Usability ,Context (language use) ,Data publishing ,computer.software_genre ,Field (computer science) ,Computer Science Applications ,Theoretical Computer Science ,Artificial Intelligence ,Control and Systems Engineering ,Table (database) ,business ,computer ,Software - Abstract
Privacy Preserving Data Publishing (PPDP) is a research field that deals with the development of methods that enable publishing data while minimizing distortion, so as to maintain usability on the one hand and respect privacy on the other. Sequential release is a data publishing scenario in which multiple releases of the same underlying table are published over a period of time. A violation of privacy, in this case, may emerge from any one of the releases, or as a result of joining information from different releases. Similarly to [37], our privacy definitions limit the ability of an adversary who combines information from all releases to link values of the quasi-identifiers to sensitive values. We extend the framework considered in [37] in three ways: we allow a greater number of releases, we consider the more flexible local recoding model of "cell generalization" (as opposed to the global recoding model of "cut generalization" in [37]), and we include the case where records may be added to the underlying table from time to time. Our extension of the framework also requires modifying the manner in which privacy is evaluated. We show that while [37] based their privacy evaluation on the notion of the Match Join between the releases, that notion is no longer suitable for the extended framework considered here. We define more restrictive types of join between the published releases (the Full Match Join and the Kernel Match Join) that are more suitable for privacy evaluation in this context. We then present a top-down algorithm for anonymizing sequential releases in the cell generalization model that is based on our modified privacy evaluations. Our theoretical study is followed by experimentation that demonstrates a staggering improvement in utility due to the adoption of the cell generalization model, and exemplifies the correction in the privacy evaluation offered by using the Full or Kernel Match Joins instead of the Match Join.
- Published
- 2012
46. Privacy-preserving data mining: A feature set partitioning approach
- Author
-
Oded Maimon, Nissim Matatov, and Lior Rokach
- Subjects
Information Systems and Management ,business.industry ,Computer science ,k-anonymity ,Machine learning ,computer.software_genre ,Multi-objective optimization ,Computer Science Applications ,Theoretical Computer Science ,Privacy preserving ,Artificial Intelligence ,Control and Systems Engineering ,Genetic algorithm ,Artificial intelligence ,Data mining ,Tuple ,business ,Feature set ,computer ,Classifier (UML) ,Software - Abstract
In privacy-preserving data mining (PPDM), a widely used method for achieving data mining goals while preserving privacy is based on k-anonymity. This method, which protects subject-specific sensitive data by anonymizing it before it is released for data mining, demands that every tuple in the released table be indistinguishable from no fewer than k subjects. The most common approach for achieving compliance with k-anonymity is to replace certain values with less specific but semantically consistent values. In this paper we propose a different approach for achieving k-anonymity by partitioning the original dataset into several projections such that each of them adheres to k-anonymity. Moreover, any attempt to rejoin the projections results in a table that still complies with k-anonymity. A classifier is trained on each projection, and subsequently an unlabelled instance is classified by combining the classifications of all classifiers. Guided by classification accuracy and k-anonymity constraints, the proposed data mining privacy by decomposition (DMPD) algorithm uses a genetic algorithm to search for an optimal feature set partitioning. Ten separate datasets were evaluated with DMPD in order to compare its classification performance with other k-anonymity-based methods. The results suggest that DMPD performs better than existing k-anonymity-based algorithms and that there is no need to apply domain-dependent knowledge. Using multiobjective optimization methods, we also examine the tradeoff between the two conflicting objectives in PPDM: privacy and predictive performance.
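A small helper makes the constraint concrete: a projection (a chosen subset of quasi-identifier columns) is k-anonymous if every value combination appears at least k times. The table and columns below are invented, and the genetic search over partitionings is not shown.

```python
# Helper sketch: check whether a projection of a table satisfies k-anonymity.
from collections import Counter

table = [
    {"age": 34, "zip": "6520", "disease": "flu"},
    {"age": 34, "zip": "6520", "disease": "cold"},
    {"age": 34, "zip": "6520", "disease": "flu"},
    {"age": 51, "zip": "8410", "disease": "asthma"},
    {"age": 51, "zip": "8410", "disease": "flu"},
]

def is_k_anonymous(rows, columns, k):
    """True if every combination of values on the given columns occurs >= k times."""
    counts = Counter(tuple(r[c] for c in columns) for r in rows)
    return all(n >= k for n in counts.values())

print(is_k_anonymous(table, ["age", "zip"], k=2))   # True: each (age, zip) group has >= 2 rows
print(is_k_anonymous(table, ["age", "zip"], k=3))   # False: the (51, "8410") group has only 2 rows
```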
- Published
- 2010
47. Troika – An improved stacking schema for classification tasks
- Author
-
Eitan Menahem, Lior Rokach, and Yuval Elovici
- Subjects
Information Systems and Management ,Computer science ,business.industry ,Stacking ,Pattern recognition ,Machine learning ,computer.software_genre ,Computer Science Applications ,Theoretical Computer Science ,Random subspace method ,ComputingMethodologies_PATTERNRECOGNITION ,Artificial Intelligence ,Control and Systems Engineering ,Schema (psychology) ,Artificial intelligence ,business ,Classifier (UML) ,computer ,Software ,Cascading classifiers - Abstract
Stacking is a general ensemble method in which a number of base classifiers are combined using one meta-classifier which learns from their outputs. Such an approach provides certain advantages: simplicity; performance that is similar to that of the best base classifier; and the capability of combining classifiers induced by different inducers. The disadvantage of stacking is that on multiclass problems it seems to perform worse than other meta-learning approaches. In this paper we present Troika, a new stacking method for improving ensemble classifiers. The new scheme is built from three layers of combining classifiers. The new method was tested on various datasets, and the results indicate the superiority of the proposed method over legacy ensemble schemes, Stacking and StackingC, especially when the classification task consists of more than two classes.
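For readers unfamiliar with the baseline, a minimal two-level stacking example is sketched below using scikit-learn: base classifiers at level 0 and a meta-classifier at level 1. This is the classic scheme Troika improves on; the three-layer Troika architecture itself is not reproduced.

```python
# Minimal two-level stacking: base classifiers combined by a meta-classifier.
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(random_state=0)),
                ("nb", GaussianNB())],                 # level-0 base classifiers
    final_estimator=LogisticRegression(max_iter=1000), # level-1 meta-classifier
    cv=5,                                              # out-of-fold predictions feed the meta-level
)
print("stacking accuracy:", cross_val_score(stack, X, y, cv=5).mean())
```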
- Published
- 2009
48. Taxonomy for characterizing ensemble methods in classification tasks: A review and annotated bibliography
- Author
-
Lior Rokach
- Subjects
Statistics and Probability ,Dependency (UML) ,Point (typography) ,business.industry ,Applied Mathematics ,Survey sampling ,Linear discriminant analysis ,Machine learning ,computer.software_genre ,Ensemble learning ,Computational Mathematics ,ComputingMethodologies_PATTERNRECOGNITION ,Computational Theory and Mathematics ,Taxonomy (general) ,Pattern recognition (psychology) ,Selection (linguistics) ,Artificial intelligence ,business ,computer ,Mathematics - Abstract
Ensemble methodology, which builds a classification model by integrating multiple classifiers, can be used for improving prediction performance. Researchers from various disciplines such as statistics, pattern recognition, and machine learning have seriously explored the use of ensemble methodology. This paper presents an updated survey of ensemble methods in classification tasks, while introducing a new taxonomy for characterizing them. The new taxonomy, presented from the algorithm designer's point of view, is based on five dimensions: inducer, combiner, diversity, size, and members' dependency. We also propose several selection criteria, presented from the practitioner's point of view, for choosing the most suitable ensemble method.
- Published
- 2009
49. Collective-agreement-based pruning of ensembles
- Author
-
Lior Rokach
- Subjects
Statistics and Probability ,business.industry ,Applied Mathematics ,Statistical computation ,Linear discriminant analysis ,Machine learning ,computer.software_genre ,Ensemble learning ,Computational Mathematics ,ComputingMethodologies_PATTERNRECOGNITION ,Computational Theory and Mathematics ,Redundancy (engineering) ,Collective agreement ,Artificial intelligence ,business ,computer ,Mathematics - Abstract
Ensemble methods combine several individual pattern classifiers in order to achieve better classification. The challenge is to choose the minimal number of classifiers that achieves the best performance. An ensemble that contains too many members might incur large storage requirements and even reduce classification performance. The goal of ensemble pruning is to identify a subset of ensemble members that performs at least as well as the original ensemble and discard all other members. In this paper, we introduce the Collective-Agreement-based Pruning (CAP) method. Rather than ranking individual members, CAP ranks subsets by considering the individual predictive ability of each member along with the degree of redundancy among them. Subsets whose members highly agree with the class while having low inter-agreement are preferred.
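The trade-off between agreement with the class and agreement among members can be illustrated with a greedy selection in the same spirit: each step keeps the member that agrees most with the labels while overlapping least with the members already kept. The scoring below is an assumption for illustration, not the paper's exact CAP criterion.

```python
# Greedy pruning sketch: reward agreement with the class, penalize inter-agreement.
import numpy as np

rng = np.random.default_rng(3)
y = rng.integers(0, 2, size=300)                       # true labels
# Simulated predictions of 12 ensemble members of varying quality.
members = [np.where(rng.random(300) < acc, y, 1 - y)
           for acc in np.linspace(0.60, 0.85, 12)]

def cap_like_score(candidate, selected, y, alpha=1.0):
    relevance = (candidate == y).mean()                # agreement with the class
    if not selected:
        return relevance
    redundancy = np.mean([(candidate == s).mean() for s in selected])
    return relevance - alpha * redundancy              # penalize agreement with kept members

selected = []
for _ in range(4):                                     # keep a 4-member sub-ensemble
    best = max((m for m in members if not any(m is s for s in selected)),
               key=lambda m: cap_like_score(m, selected, y))
    selected.append(best)
print("accuracies of the kept members:", [round((m == y).mean(), 2) for m in selected])
```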
- Published
- 2009
50. Improving malware detection by applying multi-inducer ensemble
- Author
-
Yuval Elovici, Lior Rokach, Eitan Menahem, and Asaf Shabtai
- Subjects
Statistics and Probability ,Exploit ,business.industry ,Computer science ,Applied Mathematics ,Decision theory ,Decision tree ,Machine learning ,computer.software_genre ,Execution time ,Computational Mathematics ,Naive Bayes classifier ,Task (computing) ,ComputingMethodologies_PATTERNRECOGNITION ,Software ,Computational Theory and Mathematics ,Malware ,Artificial intelligence ,Data mining ,business ,computer - Abstract
Detection of malicious software (malware) using machine learning methods has been explored extensively to enable fast detection of newly released malware. The performance of these classifiers depends on the induction algorithms being used. In order to benefit from multiple different classifiers and exploit their strengths, we suggest using an ensemble method that combines the results of the individual classifiers into one final result to achieve overall higher detection accuracy. In this paper we evaluate several combining methods using five different base inducers (C4.5 Decision Tree, Naive Bayes, KNN, VFI and OneR) on five malware datasets. The main goal is to find the best combining method for the task of detecting malicious files in terms of accuracy, AUC and execution time.
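One of the simpler combining methods such a study compares is majority voting over heterogeneous inducers, sketched below with scikit-learn. The dataset and the three inducers here are generic stand-ins, not the paper's malware corpora or its five base inducers.

```python
# Quick sketch: combine several different inducers by majority voting.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
vote = VotingClassifier(estimators=[
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("nb", GaussianNB()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
], voting="hard")                                   # hard = majority vote over the inducers
print("voting ensemble accuracy:", cross_val_score(vote, X, y, cv=5).mean())
```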
- Published
- 2009