1. Advanced machine learning techniques for online and data stream learning
- Author
- Nguyen, Thi Thu Thuy
- Subjects
- Big data, Machine learning, Online learning, Data stream learning
- Abstract
Big data is the inevitable result of our knowledge-intensive world, in which practically everything is monitored and measured. According to the 2014 report of the International Data Corporation (IDC), the amount of information created, captured or replicated exceeded available storage for the first time in 2007. The digital universe is doubling in size every two years and will multiply 10-fold between 2013 and 2020, from 4.4 trillion gigabytes to 44 trillion gigabytes. Besides the massive volume, the velocity of data is another concern: in our information society, data very often arrive as streams that grow continuously and rapidly over time. Examples of such data are easily found in real-world applications such as network traffic, sensor networks, web searches, stock market systems and social media. Mining big data can yield great value and benefit people in every aspect of life, including social communication, business management and scientific research. Big data presents a new world of opportunities as well as challenges, which must be handled responsibly while maintaining adaptability, scalability and efficiency. From the perspective of machine learning, the main approach is based on algorithmic improvement. Traditional offline machine learning techniques suffer from many restrictions, such as the limited storage available for holding the whole training set and the inability to handle real-time data and respond instantly. To overcome these challenges, online methods offer predictive models that can be trained on the fly as each new data point arrives and are ready to give predictions at any time, using a single observation or a small set of observations and then discarding them permanently before the next observations are used.
This typical one-pass, throw-away style of stream learning requires online methods to preserve as much information extracted from past instances as possible while still learning current instances effectively. They are also expected to better handle dynamically evolving environments with concept drifts (changes in the data distribution), which are widely encountered in stream contexts. Furthermore, potentially extremely skewed classes in a large number of daily applications, such as accident diagnosis of real-time traffic surveillance, fault detection in online banking and intrusion monitoring, can significantly hinder the classification performance of online methods. There is a need for more effective online frameworks that offer rich information and high flexibility in adapting to concept changes, class imbalances and other advanced tasks (if needed) arising in real-life data stream mining. To address the research challenges mentioned above, we first proposed novel online supervised Bayesian classifiers based on variational inference (VI) for multivariate Gaussians (Minibatch-VIGO and VIGO), which outperformed recent and well-known methods on a wide range of datasets (reducing the mistake rate by at least 5.5% compared with the other benchmarks). From the same theoretical background, a lossless online classifier (OVIG) is presented, which is guaranteed to produce the same prediction model as its offline counterpart regardless of the incremental training order. Through an application to movie genre classification, two strategies for dealing with high-dimensional data were suggested: random-projection-based and stacking-based ensembles. Exploiting the flexibility of the variational inference mechanism, we also developed advanced techniques to effectively tackle dynamic streaming data with concept drifts.
They include VIGOd (online VI with a built-in concept drift detector for multivariate Gaussians) and VIGOw (online VI weighted for multivariate Gaussians), which are almost 20 times faster than the most accurate adaptive benchmark method. Next, we introduced OCSB (online cost-sensitive learning and sampling for Bayesian classifiers), a new imbalanced learning strategy that combines cost-sensitive learning and intermediate sampling. Applied to Minibatch-VIGO, OCSB significantly increases the recognition of rare instances and the overall accuracy (by over 25% and 16%, respectively). Furthermore, we generalize our online multi-class classifiers (each instance has one label) to online multi-label classifiers (each instance can have many labels), and our online VI-based multivariate Gaussians (VIGO) (data of each class described by a single multivariate Gaussian) to online VI-based mixtures of Gaussians (VIMGO) (data of each class described by a number of Gaussians). For the former (multi-label learning), one more concept drift adaptation technique, which uses a decay factor to weight the importance of instances according to their age, is suggested and tested. At the same time, a new dynamic intermediate sampling technique (DIS) is developed to handle the new challenges of online imbalanced learning in multi-label scenarios. Although they must approximate a much larger number of Gaussians in mixture models, the latter (VIMGO) and its adaptive version VIMGOw are optimized to run effectively. VIMGOw is combined with a simplified version of OCSB to obtain iVIMGOw. Applications of iVIMGOw to network intrusion detection and credit card fraud detection, using the recent UNSW-NB15 network data and real-world credit card data respectively, show that it can accurately recognize attacks across a wide range of frequencies. Finally, it is worth mentioning that all our proposed classifiers are second-order generative methods with rich information.
They not only explore the underlying structure of data effectively but can also serve as a base framework for further advanced stream learning tasks such as cost-sensitive learning, active learning and semi-supervised learning.
- Published
- 2019