23 results on '"YASUKO MATSUBARA"'
Search Results
2. Non-Linear Mining of Social Activities in Tensor Streams
- Author
Yasushi Sakurai, Yasuko Matsubara, Koki Kawabata, and Takato Honda
- Subjects
Nonlinear system; Series (mathematics); Event (computing); Computer science; Scalability; Tensor; Data mining; Space (commercial competition); Time series; computer.software_genre; computer
- Abstract
Given a large time-evolving event series such as Google web-search logs, which are collected according to various aspects, i.e., timestamps, locations and keywords, how accurately can we forecast their future activities? How can we reveal significant patterns that enable long-term forecasting from such complex tensor streams? In this paper, we propose a streaming method, namely, CubeCast, that is designed to capture basic trends and seasonality in tensor streams and extract temporal and multi-dimensional relationships between such dynamics. Our proposed method has the following properties: (a) it is effective: it finds both trends and seasonality and summarizes their dynamics into a simultaneous non-linear latent space; (b) it is automatic: it automatically recognizes and models such structural patterns without any parameter tuning or prior information; (c) it is scalable: it incrementally and adaptively detects shifting points of patterns for a semi-infinite collection of tensor streams. Extensive experiments that we conducted on real datasets demonstrate that our algorithm can effectively and efficiently find meaningful patterns for generating future values, and outperforms the state-of-the-art algorithms for time series forecasting in terms of forecasting accuracy and computational time.
- Published
- 2020
3. Nonlinear Dynamics of Information Diffusion in Social Networks
- Author
Yasuko Matsubara, Christos Faloutsos, Lei Li, Yasushi Sakurai, and B. Aditya Prakash
- Subjects
Reverse engineering; Generality; Unification; Computer Networks and Communications; Computer science; Event (computing); media_common.quotation_subject; 02 engineering and technology; computer.software_genre; Popularity; New media; 020204 information systems; 0202 electrical engineering, electronic engineering, information engineering; Topological graph theory; 020201 artificial intelligence & image processing; Quality (business); Data mining; computer; media_common
- Abstract
The recent explosion in the adoption of search engines and new media such as blogs and Twitter has facilitated the faster propagation of news and rumors. How quickly does a piece of news spread over these media? How does its popularity diminish over time? Does the rising and falling pattern follow a simple universal law? In this article, we propose SpikeM, a concise yet flexible analytical model of the rise and fall patterns of information diffusion. Our model has the following advantages. First, unification power: it explains earlier empirical observations and generalizes theoretical models including the SI and SIR models. We provide the threshold of the take-off versus die-out conditions for SpikeM and discuss the generality of our model by applying it to an arbitrary graph topology. Second, practicality: it matches the observed behavior of diverse sets of real data. Third, parsimony: it requires only a handful of parameters. Fourth, usefulness: it makes it possible to perform analytic tasks such as forecasting, spotting anomalies, and interpretation by reverse engineering the system parameters of interest (quality of news, number of interested bloggers, etc.). We also introduce an efficient and effective algorithm for the real-time monitoring of information diffusion, namely SpikeStream, which identifies multiple diffusion patterns in a large collection of online event streams. Extensive experiments on real datasets demonstrate that SpikeM accurately and succinctly describes all patterns of the rise and fall spikes in social networks.
- Published
- 2017
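The abstract above says the proposed model generalizes the SI and SIR epidemic models. As a point of reference only, the following is a minimal sketch of the plain discrete-time SI model (susceptible to infected, no recovery). It is not SpikeM: it reproduces only the rise of a spike, not the fall. The function name, parameter names and the values used are illustrative assumptions.

```python
def simulate_si(n_total, beta, n_steps, infected0=1.0):
    """Discrete-time SI model: I(t+1) = I(t) + beta * S(t) * I(t), where S = N - I.

    n_total   -- population size N (hypothetical number of bloggers)
    beta      -- infection strength per susceptible/infected pair
    infected0 -- number informed at time 0
    """
    infected = [float(infected0)]
    for _ in range(n_steps):
        i = infected[-1]
        s = n_total - i  # people who have not yet picked up the news
        # Clamp at N to guard against floating-point overshoot.
        infected.append(min(float(n_total), i + beta * s * i))
    return infected


if __name__ == "__main__":
    curve = simulate_si(n_total=1000, beta=0.0005, n_steps=40)
    # Print every fifth value to show the S-shaped rise.
    print([round(x, 1) for x in curve[::5]])
```

In this plain SI form the infected count only grows and then saturates, so modeling the decay of popularity needs extra terms beyond what is shown here.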
4. Automatic Sequential Pattern Mining in Data Streams
- Author
Yasuko Matsubara, Yasushi Sakurai, and Koki Kawabata
- Subjects
Data stream; Event (computing); Data stream mining; Computer science; Volume (computing); 02 engineering and technology; computer.software_genre; Set (abstract data type); 020204 information systems; Scalability; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; Data mining; Streaming algorithm; computer
- Abstract
Given a large volume of multi-dimensional data streams, such as that produced by IoT applications, finance and online web-click logs, how can we discover typical patterns and compress them into compact models? In addition, how can we incrementally distinguish multiple patterns while considering the information obtained from a pattern found in a streaming setting? In this paper, we propose a streaming algorithm, namely StreamScope, that is designed to find intuitive patterns efficiently from event streams evolving over time. Our proposed method has the following properties: (a) it is effective: it operates on semi-infinite collections of co-evolving streams and summarizes all the streams into a set of multiple discrete segments grouped by their similarities. (b) it is automatic: it automatically and incrementally recognizes such patterns and generates models for each of them if necessary; (c) it is scalable: the complexity of our method does not depend on the length of the data streams. Our extensive experiments on real data streams demonstrate that StreamScope can find meaningful patterns and achieve great improvements in terms of computational time and memory space over its full batch method competitors.
- Published
- 2019
5. Multi-aspect Mining of Complex Sensor Sequences
- Author
Yasuko Matsubara, Mutsumi Abe, Yasushi Sakurai, Takato Honda, and Ryo Neyama
- Subjects
Computer science; Feature extraction; 02 engineering and technology; computer.software_genre; Sensor fusion; Automatic summarization; 020204 information systems; Outlier; Scalability; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; Segmentation; Data mining; computer
- Abstract
In recent years, a massive amount of time-stamped sensor data has been generated and collected by many Internet of Things (IoT) applications, such as advanced automobiles and health care devices. Given such a large collection of complex sensor sequences, which consists of multiple attributes (e.g., sensor, user, timestamp), how can we automatically find important dynamic time-series patterns and the points of variation? How can we summarize all the complex sensor sequences, and achieve a meaningful segmentation? Also, can we see any hidden user-specific differences and outliers? In this paper we present CUBEMARKER, an efficient and effective method for capturing such multi-aspect features in sensor sequences. CUBEMARKER performs multi-way summarization for all attributes, namely, sensors, users, and time, and specifically it extracts multi-aspect features, such as important time-series patterns (i.e., time-aspect features) and hidden groups of users (i.e., user-aspect features), in complex sensor sequences. Our proposed method has the following advantages: (a) It is effective: it extracts multi-aspect features from complex sensor sequences and enables the efficient and effective analysis of complicated datasets; (b) It is automatic: it requires no prior training and no parameter tuning; (c) It is scalable: our method is carefully designed to be linear as regards dataset size and applicable to a large number of sensor sequences. Extensive experiments on real datasets show that CUBEMARKER is effective in that it can capture meaningful patterns for various real-world datasets, such as those obtained from smart factories, human activities, and automobiles. CUBEMARKER consistently outperforms the best state-of-the-art methods in terms of both accuracy and execution speed.
- Published
- 2019
6. Dynamic Modeling and Forecasting of Time-evolving Data Streams
- Author
Yasuko Matsubara and Yasushi Sakurai
- Subjects
Data stream; Data stream mining; Differential equation; Computer science; 02 engineering and technology; computer.software_genre; System dynamics; Current (stream); 020204 information systems; Scalability; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; Data mining; computer
- Abstract
Given a large, semi-infinite collection of co-evolving data sequences (e.g., IoT/sensor streams), which contains multiple distinct dynamic time-series patterns, our aim is to incrementally monitor current dynamic patterns and forecast future behavior. We present an intuitive model, namely OrbitMap, which provides a good summary of time-series evolution in streams. We also propose a scalable and effective algorithm for fitting and forecasting time-series data streams. Our method is designed as a dynamic, interactive and flexible system, and is based on latent non-linear differential equations. Our proposed method has the following advantages: (a) It is effective: it captures important time-evolving patterns in data streams and enables real-time, long-range forecasting; (b) It is general: our model is general and practical and can be applied to various types of time-evolving data streams; (c) It is scalable: our algorithm does not depend on data size, and thus is applicable to very large sequences. Extensive experiments on real datasets demonstrate that OrbitMap makes long-range forecasts, and consistently outperforms the best existing state-of-the-art methods as regards accuracy and execution speed.
- Published
- 2019
7. Ecosystem on the Web: non-linear mining and forecasting of co-evolving online activities
- Author
Yasuko Matsubara, Christos Faloutsos, and Yasushi Sakurai
- Subjects
Computer Networks and Communications; business.industry; Computer science; Volume (computing); 02 engineering and technology; Machine learning; computer.software_genre; World Wide Web; Nonlinear system; Hardware and Architecture; Dynamics (music); 020204 information systems; Scalability; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; Ecosystem; Artificial intelligence; Dynamical system (definition); business; computer; Software
- Abstract
Given a large collection of co-evolving online activities, such as searches for the keywords "Xbox", "PlayStation" and "Wii", how can we find patterns and rules? Are these keywords related? If so, are they competing against each other? Can we forecast the volume of user activity for the coming month? We conjecture that online activities compete for user attention in the same way that species in an ecosystem compete for food. We present EcoWeb (i.e., Ecosystem on the Web), which is an intuitive model designed as a non-linear dynamical system for mining large-scale co-evolving online activities. Our second contribution is a novel, parameter-free, and scalable fitting algorithm, EcoWeb-Fit, that estimates the parameters of EcoWeb. Extensive experiments on real data show that EcoWeb is effective, in that it can capture long-range dynamics and meaningful patterns such as seasonalities, and practical, in that it can provide accurate long-range forecasts. EcoWeb consistently outperforms existing methods in terms of both accuracy and execution speed.
- Published
- 2016
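The abstract above compares competing online activities to species competing for food. One classic way to write such competition down is the competitive Lotka-Volterra system, dP_i/dt = r_i P_i (1 - sum_j a_ij P_j / K_i). The sketch below integrates that textbook system with a simple Euler step. Whether EcoWeb uses exactly this form is not stated in the abstract, and every name and parameter value here is an illustrative assumption.

```python
def simulate_competition(p0, r, K, a, dt=0.1, n_steps=2000):
    """Euler integration of a competitive Lotka-Volterra system.

    p0 -- initial popularity of each activity (e.g., keyword search volume)
    r  -- intrinsic growth rate of each activity
    K  -- carrying capacity of each activity
    a  -- a[i][j] is how strongly activity j suppresses activity i (a[i][i] = 1)
    """
    p = list(p0)
    history = [tuple(p)]
    n = len(p)
    for _ in range(n_steps):
        nxt = []
        for i in range(n):
            # Total competitive pressure felt by activity i.
            pressure = sum(a[i][j] * p[j] for j in range(n))
            growth = r[i] * p[i] * (1.0 - pressure / K[i])
            nxt.append(max(0.0, p[i] + dt * growth))  # volumes cannot go negative
        p = nxt
        history.append(tuple(p))
    return history


if __name__ == "__main__":
    hist = simulate_competition(
        p0=[10.0, 5.0], r=[1.0, 1.0], K=[100.0, 100.0],
        a=[[1.0, 0.5], [0.5, 1.0]],
    )
    print(hist[-1])  # both activities settle at a shared equilibrium
```

With weak mutual competition (a[0][1] * a[1][0] < 1) the two activities coexist. Making the off-diagonal entries larger than 1 instead lets one activity drive the other out.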
8. Automatic Mining of Large IoT Sensor Tensor
- Author
Yasuko Matsubara, Takato Honda, and Yasushi Sakurai
- Subjects
Series (mathematics); Computer science; business.industry; 02 engineering and technology; computer.software_genre; 020204 information systems; Tensor (intrinsic definition); Scalability; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; Data mining; Time series; Hidden Markov model; Internet of Things; business; Representation (mathematics); computer
- Abstract
Given a large collection of multiple time-evolving sensor sequences, how can we capture the transitions of time series patterns? How can we find individual differences between different sequences? In this paper we present CUBEMARKER, an effective method for capturing multi-aspect features of sensor sequences, which provides a compact and powerful representation of sensor behavior. Our second contribution is a novel, scalable, and parameter-free algorithm. CUBEMARKER performs two-way mining for all attributes. Specifically, it discovers multi-aspect time series patterns (human motion, smart factory, etc.) and groups of patterns simultaneously. Extensive experiments on real datasets show that CUBEMARKER is effective in that it can capture meaningful patterns for various real sensor datasets.
- Published
- 2018
9. Nonlinear Time-series Mining of Social Influence
- Author
Yasushi Sakurai, Yasuko Matsubara, and Thinh Minh Do
- Subjects
Nonlinear system; Series (mathematics); Computer science; 020204 information systems; Online search; 0202 electrical engineering, electronic engineering, information engineering; Volume (computing); 020201 artificial intelligence & image processing; 02 engineering and technology; Data mining; Duration (project management); computer.software_genre; computer
- Abstract
Given a large collection of time-evolving online user activities, such as Google Search queries for multiple keywords of various categories (celebrities, events, diseases, etc.), which consist of \(d\) keywords/activities, for \(l\) countries/locations of duration \(n\), how can we find patterns and rules? For example, assume that we have the online search volume for “Harry Potter”, “Barack Obama”, and “Amazon”, for 232 countries/territories, from 2004 to 2015, which includes external shocks, sudden changes of search volume, and more. How do we go about capturing nonlinear evolutions of local activities and forecasting future patterns? In this paper, we present \(\varDelta\)-SPOT, a unifying analytical nonlinear model for analyzing large-scale web search data, which is sensemaking, automatic, scalable, and free of parameters. \(\varDelta\)-SPOT can also forecast long-range future dynamics of the keywords/queries. We use the Google Search, Twitter, and MemeTracker datasets for extensive experiments, which show that our method outperforms other effective methods of nonlinear mining in terms of accuracy in both fitting and forecasting.
- Published
- 2018
10. Data Stream Analysis of Online Activities
- Author
Yasuko Matsubara, Koki Kawabata, and Yasushi Sakurai
- Subjects
Data stream mining; Event (computing); Computer science; Volume (computing); 02 engineering and technology; computer.software_genre; Data modeling; Set (abstract data type); 020204 information systems; Scalability; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; Data mining; Hidden Markov model; computer; Streaming algorithm
- Abstract
Given a large volume of multiple data streams, such as online web-click logs and sensor data, how can we discover typical patterns and compress them into compact models? How can we incrementally distinguish multiple patterns while considering the information obtained from a pattern found in a streaming setting? In this paper, we propose a streaming algorithm, namely STREAMSCOPE, that can find intuitive patterns efficiently from event streams evolving over time. Our method has the following properties: (a) Effective: It operates on semi-infinite data streams and summarizes all the streams into a set of multiple discrete segments grouped by their similarities. (b) Automatic: It automatically and incrementally recognizes such patterns and generates models for each of them if necessary. (c) Scalable: The complexity of our method does not depend on the length of the input streams. Our experiments on real datasets demonstrate that STREAMSCOPE can find meaningful patterns and achieve great improvements in terms of computational time and memory space over its full batch method competitors.
- Published
- 2018
11. Non-linear Time-series Analysis of Social Influence
- Author
Yasushi Sakurai, Yasuko Matsubara, and Thinh Minh Do
- Subjects
Nonlinear system; Information retrieval; General Computer Science; Computer science; 020204 information systems; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; 02 engineering and technology; Data mining; Time series; computer.software_genre; computer; Social influence
- Published
- 2016
12. Regime Shifts in Streams
- Author
Yasuko Matsubara and Yasushi Sakurai
- Subjects
Dynamical systems theory; Data stream mining; Computer science; Event (computing); Real-time computing; 02 engineering and technology; STREAMS; computer.software_genre; 020204 information systems; Scalability; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; Data mining; Dynamical system (definition); computer
- Abstract
Given a large, online stream of multiple co-evolving event sequences, such as sensor data and Web-click logs, that contains various types of non-linear dynamic evolving patterns of different durations, how can we efficiently and effectively capture important patterns? How do we go about forecasting long-term future events? In this paper, we present REGIMECAST, an efficient and effective method for forecasting co-evolving data streams. REGIMECAST is designed as an adaptive non-linear dynamical system, which is inspired by the concept of "regime shifts" in natural dynamical systems. Our method has the following properties: (a) Effective: it operates on large data streams, captures important patterns and performs long-term forecasting; (b) Adaptive: it automatically and incrementally recognizes the latent trends and dynamic evolution patterns (i.e., regimes) that are unknown in advance; (c) Scalable: it is fast and the computation cost does not depend on the length of data streams; (d) Any-time: it provides a response at any time and generates long-range future events. Extensive experiments on real datasets demonstrate that REGIMECAST does indeed make long-range forecasts, and it outperforms state-of-the-art competitors as regards accuracy and speed.
- Published
- 2016
13. Non-Linear Mining of Competing Local Activities
- Author
Christos Faloutsos, Yasushi Sakurai, and Yasuko Matsubara
- Subjects
Computer science; business.industry; 02 engineering and technology; Machine learning; computer.software_genre; Automatic summarization; Competition (economics); 020204 information systems; Online search; Scalability; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; Spike (software development); Artificial intelligence; Duration (project management); Representation (mathematics); business; computer
- Abstract
Given a large collection of time-evolving activities, such as Google search queries, which consist of d keywords/activities for m locations of duration n, how can we analyze temporal patterns and relationships among all these activities and find location-specific trends? How do we go about capturing non-linear evolutions of local activities and forecasting future patterns? For example, assume that we have the online search volume for multiple keywords, e.g., "Nokia/Nexus/Kindle" or "CNN/BBC" for 236 countries/territories, from 2004 to 2015. Our goal is to analyze a large collection of multi-evolving activities, and specifically, to answer the following questions: (a) Is there any sign of interaction/competition between two different keywords? If so, who competes with whom? (b) In which country is the competition strong? (c) Are there any seasonal/annual activities? (d) How can we automatically detect important world-wide (or local) events? We present COMPCUBE, a unifying non-linear model, which provides a compact and powerful representation of co-evolving activities; and also a novel fitting algorithm, COMPCUBE-FIT, which is parameter-free and scalable. Our method captures the following important patterns: (B)asic trends, i.e., non-linear dynamics of co-evolving activities, signs of (C)ompetition and latent interaction, e.g., Nokia vs. Nexus, (S)easonality, e.g., a Christmas spike for iPod in the U.S. and Europe, and (D)eltas, e.g., unrepeated local events such as the U.S. election in 2008. Thanks to its concise but effective summarization, COMPCUBE can also forecast long-range future activities. Extensive experiments on real datasets demonstrate that COMPCUBE consistently outperforms the best state-of-the-art methods in terms of both accuracy and execution speed.
- Published
- 2016
14. D-Search: an efficient and exact search algorithm for large distribution sets
- Author
Yasuko Matsubara, Masatoshi Yoshikawa, and Yasushi Sakurai
- Subjects
Distribution sets; Kullback–Leibler divergence; Speedup; Nearest neighbor search; Singular value decomposition; Likelihood; computer.software_genre; Human-Computer Interaction; Full table scan; Reduction (complexity); KL divergence; Artificial Intelligence; Hardware and Architecture; Search algorithm; Outlier; Search cost; Data mining; computer; Software; Information Systems; Mathematics
- Abstract
Distribution data naturally arise in countless domains, such as meteorology, biology, geology, industry and economics. However, relatively little attention has been paid to data mining for large distribution sets. Given n distributions of multiple categories and a query distribution Q, we want to find similar clouds (i.e., distributions) to discover patterns, rules and outlier clouds. For example, consider the numerical case of sales of items, where, for each item sold, we record the unit price and quantity; then, each customer is represented as a distribution of 2-d points (one for each item he/she bought). We want to find similar users, e.g., for market segmentation or anomaly/fraud detection. We propose to address this problem and present D-Search, which includes fast and effective algorithms for similarity search in large distribution datasets. Our main contributions are (1) approximate KL divergence, which can speed up cloud-similarity computations, (2) multistep sequential scan, which efficiently prunes a significant number of search candidates and leads to a direct reduction in the search cost. We also introduce an extended version of D-Search: (3) time-series distribution mining, which finds similar subsequences in time-series distribution datasets. Extensive experiments on real multidimensional datasets show that our solution achieves a wall clock time up to 2,300 times faster than the naive implementation without sacrificing accuracy.
- Published
- 2011
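The abstract above measures similarity between distributions with KL divergence and compares against a naive implementation. The sketch below shows roughly what such a naive baseline could look like: an exact full scan that ranks candidate histograms by symmetric KL divergence to the query. It does not implement D-Search's approximation or its multistep pruning. The choice of the symmetric form, the smoothing constant `eps` and all names are assumptions made for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists.

    eps is a small smoothing constant (an assumption) that avoids log(0)
    when q has empty buckets.
    """
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Symmetrized divergence: KL(p || q) + KL(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


def nearest_distribution(query, candidates):
    """Full scan: index of the candidate closest to the query."""
    return min(range(len(candidates)),
               key=lambda i: symmetric_kl(query, candidates[i]))


if __name__ == "__main__":
    query = [0.5, 0.3, 0.2]
    candidates = [
        [0.10, 0.10, 0.80],
        [0.45, 0.35, 0.20],
        [0.33, 0.33, 0.34],
    ]
    print(nearest_distribution(query, candidates))
```

The scan costs one divergence computation per candidate, which is the cost that approximation and pruning techniques aim to reduce.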
15. Mining and Forecasting of Big Time-series Data
- Author
Yasushi Sakurai, Yasuko Matsubara, and Christos Faloutsos
- Subjects
business.industry; Computer science; Nearest neighbor search; Anomaly detection; Segmentation; Artificial intelligence; Data mining; Time series; Machine learning; computer.software_genre; business; computer; Automatic summarization
- Abstract
Given a large collection of time series, such as web-click logs, electronic medical records and motion capture sensors, how can we efficiently and effectively find typical patterns? How can we statistically summarize all the sequences, and achieve a meaningful segmentation? What are the major tools for forecasting and outlier detection? Time-series data analysis is becoming increasingly important, thanks to the decreasing cost of hardware and the increasing on-line processing capability. The objective of this tutorial is to provide a concise and intuitive overview of the most important tools that can help us find patterns in large-scale time-series sequences. We review the state of the art in four related fields: (1) similarity search and pattern discovery, (2) linear modeling and summarization, (3) non-linear modeling and forecasting, and (4) the extension of time-series mining and tensor analysis. The emphasis of the tutorial is to provide the intuition behind these powerful tools, which is usually lost in the technical literature, as well as to introduce case studies that illustrate their practical use.
- Published
- 2015
16. The Web as a Jungle
- Author
Christos Faloutsos, Yasuko Matsubara, and Yasushi Sakurai
- Subjects
Computer science; business.industry; Volume (computing); Machine learning; computer.software_genre; World Wide Web; Dynamics (music); Scalability; Jungle; Artificial intelligence; Non linear dynamical systems; Dynamical system (definition); business; computer
- Abstract
Given a large collection of co-evolving online activities, such as searches for the keywords "Xbox", "PlayStation" and "Wii", how can we find patterns and rules? Are these keywords related? If so, are they competing against each other? Can we forecast the volume of user activity for the coming month? We conjecture that online activities compete for user attention in the same way that species in an ecosystem compete for food. We present ECOWEB (i.e., Ecosystem on the Web), which is an intuitive model designed as a non-linear dynamical system for mining large-scale co-evolving online activities. Our second contribution is a novel, parameter-free, and scalable fitting algorithm, ECOWEB-FIT, that estimates the parameters of ECOWEB. Extensive experiments on real data show that ECOWEB is effective, in that it can capture long-range dynamics and meaningful patterns such as seasonalities, and practical, in that it can provide accurate long-range forecasts. ECOWEB consistently outperforms existing methods in terms of both accuracy and execution speed.
- Published
- 2015
17. Fast and Exact Monitoring of Co-Evolving Data Streams
- Author
Naonori Ueda, Yasushi Sakurai, Masatoshi Yoshikawa, and Yasuko Matsubara
- Subjects
Data stream mining; Computer science; Markov model; Viterbi algorithm; computer.software_genre; symbols.namesake; Exact algorithm; Subsequence; Outlier; symbols; Algorithm design; Forward algorithm; Data mining; Hidden Markov model; computer
- Abstract
Given a huge stream of multiple co-evolving sequences, such as motion capture and web-click logs, how can we find meaningful patterns and spot anomalies? Our aim is to monitor data streams statistically, and find subsequences that have the characteristics of a given hidden Markov model (HMM). For example, consider an online web-click stream, where massive amounts of access logs of millions of users are continuously generated every second. So how can we find meaningful building blocks and typical access patterns such as weekday/weekend patterns, and also detect anomalies and intrusions? In this paper, we propose StreamScan, a fast and exact algorithm for monitoring multiple co-evolving data streams. Our method has the following advantages: (a) it is effective, leading to novel discoveries and surprising outliers, (b) it is exact, and we theoretically prove that StreamScan guarantees the exactness of the output, (c) it is fast, and requires O(1) time and space per time-tick. Our experiments on 67GB of real data illustrate that StreamScan does indeed detect the qualifying subsequence patterns correctly and that it can offer great improvements in speed (up to 479,000 times) over its competitors.
- Published
- 2014
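The abstract above describes monitoring a stream against a given HMM with constant time and space per time-tick. To illustrate what "constant per tick" can mean, here is a sketch of the standard scaled HMM forward recursion run incrementally: each new observation updates a fixed-size belief vector, so the cost per tick depends on the number of hidden states and not on how much of the stream has been seen. This is the textbook forward algorithm, not StreamScan, and it does not find subsequence boundaries. The class name and the toy parameters are assumptions.

```python
import math


class ForwardFilter:
    """Incremental (scaled) forward algorithm for a discrete-output HMM."""

    def __init__(self, init, trans, emit):
        self.init = init      # init[j]: probability of starting in state j
        self.trans = trans    # trans[i][j]: probability of moving i -> j
        self.emit = emit      # emit[j][s]: probability state j emits symbol s
        self.belief = None    # P(state | observations so far)
        self.log_likelihood = 0.0

    def update(self, symbol):
        """Consume one observation; work is O(k^2) for k hidden states."""
        k = len(self.init)
        if self.belief is None:
            prior = self.init
        else:
            prior = [sum(self.belief[i] * self.trans[i][j] for i in range(k))
                     for j in range(k)]
        unnorm = [prior[j] * self.emit[j][symbol] for j in range(k)]
        scale = sum(unnorm)
        if scale == 0.0:
            raise ValueError("observation has zero probability under the model")
        self.belief = [u / scale for u in unnorm]
        self.log_likelihood += math.log(scale)  # running log P(observations)
        return self.belief


if __name__ == "__main__":
    f = ForwardFilter(
        init=[0.5, 0.5],
        trans=[[0.9, 0.1], [0.1, 0.9]],
        emit=[[0.8, 0.2], [0.2, 0.8]],
    )
    for sym in [0, 0, 1, 1, 1]:
        print(f.update(sym), f.log_likelihood)
```

Only the belief vector and a running log-likelihood are stored, so memory stays fixed however long the stream runs.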
18. FUNNEL
- Author
Christos Faloutsos, Yasushi Sakurai, Yasuko Matsubara, and Willem G. van Panhuis
- Subjects
business.product_category; Computer science; Disease; Seasonality; medicine.disease; computer.software_genre; Missing data; Measles; Pandemic; Outlier; medicine; Data mining; Funnel; Scale (map); business; computer
- Abstract
Given a large collection of epidemiological data consisting of the count of d contagious diseases for l locations of duration n, how can we find patterns, rules and outliers? For example, Project Tycho provides open access to infection counts for U.S. states from 1888 to 2013, for 56 contagious diseases (e.g., measles, influenza), which include missing values, possible recording errors, sudden spikes (or dives) of infections, etc. So how can we find a combined model, for all these diseases, locations, and time-ticks? In this paper, we present FUNNEL, a unifying analytical model for large scale epidemiological data, as well as a novel fitting algorithm, FUNNELFIT, which solves the above problem. Our method has the following properties: (a) Sense-making: it detects important patterns of epidemics, such as periodicities, the appearance of vaccines, external shock events, and more; (b) Parameter-free: our modeling framework frees the user from providing parameter values; (c) Scalable: FUNNELFIT is carefully designed to be linear on the input size; (d) General: our model is general and practical, and can be applied to various types of epidemics, including computer-virus propagation, as well as human diseases. Extensive experiments on real data demonstrate that FUNNELFIT does indeed discover important properties of epidemics: (P1) disease seasonality, e.g., influenza spikes in January, Lyme disease spikes in July and the absence of yearly periodicity for gonorrhea; (P2) disease reduction effect, e.g., the appearance of vaccines; (P3) local/state-level sensitivity, e.g., many measles cases in NY; (P4) external shock events, e.g., historical flu pandemics; (P5) incongruous values, i.e., data reporting errors.
- Published
- 2014
19. AutoPlait
- Author
Yasushi Sakurai, Yasuko Matsubara, and Christos Faloutsos
- Subjects
Computer science; Scalability; Segmentation; Data mining; Time series; Precision and recall; computer.software_genre; computer
- Abstract
Given a large collection of co-evolving multiple time-series, which contains an unknown number of patterns of different durations, how can we efficiently and effectively find typical patterns and the points of variation? How can we statistically summarize all the sequences, and achieve a meaningful segmentation? In this paper we present AutoPlait, a fully automatic mining algorithm for co-evolving time sequences. Our method has the following properties: (a) effectiveness: it operates on large collections of time-series, and finds similar segment groups that agree with human intuition; (b) scalability: it is linear with the input size, and thus scales up very well; and (c) AutoPlait is parameter-free, and requires no user intervention, no prior training, and no parameter tuning. Extensive experiments on 67GB of real datasets demonstrate that AutoPlait does indeed detect meaningful patterns correctly, and it outperforms state-of-the-art competitors as regards accuracy and speed: AutoPlait achieves near-perfect, over 95% precision and recall, and it is up to 472 times faster than its competitors.
- Published
- 2014
20. F-Trail: Finding Patterns in Taxi Trajectories
- Author
Lei Li, Christos Faloutsos, Evangelos E. Papalexakis, Yasushi Sakurai, Yasuko Matsubara, and David Lo
- Subjects
Computer science; Outlier; Path (graph theory); Search engine indexing; Trip length; Probability density function; Data mining; computer.software_genre; Cluster analysis; computer
- Abstract
Given a large number of taxi trajectories, we would like to find interesting and unexpected patterns from the data. How can we summarize the major trends, and how can we spot anomalies? The analysis of trajectories has been an issue of considerable interest with many applications such as tracking trails of migrating animals and predicting the path of hurricanes. Several recent works propose methods on clustering and indexing trajectories data. However, these approaches are not especially well suited to pattern discovery with respect to the dynamics of social and economic behavior. To further analyze a huge collection of taxi trajectories, we develop a novel method, called F-Trail, which allows us to find meaningful patterns and anomalies. Our approach has the following advantages: (a) it is fast, and scales linearly on the input size, (b) it is effective, leading to novel discoveries, and surprising outliers. We demonstrate the effectiveness of our approach, by performing experiments on real taxi trajectories. In fact, F-Trail does produce concise, informative and interesting patterns.
- Published
- 2013
- Full Text
- View/download PDF
21. Fast mining and forecasting of complex time-stamped events
- Author
-
Yasuko Matsubara, Christos Faloutsos, Yasushi Sakurai, Tomoharu Iwata, and Masatoshi Yoshikawa
- Subjects
Topic model ,Computer science ,business.industry ,Scalability ,Artificial intelligence ,business ,Machine learning ,computer.software_genre ,Automatic summarization ,computer ,Task (project management) - Abstract
Given huge collections of time-evolving events such as web-click logs, which consist of multiple attributes (e.g., URL, userID, timestamp), how do we find patterns and trends? How do we go about capturing daily patterns and forecasting future events? We need two properties: (a) effectiveness, that is, the patterns should help us understand the data, discover groups, and enable forecasting, and (b) scalability, that is, the method should be linear with the data size. We introduce TriMine, which performs three-way mining for all three attributes, namely, URLs, users, and time. Specifically TriMine discovers hidden topics, groups of URLs, and groups of users, simultaneously. Thanks to its concise but effective summarization, it makes it possible to accomplish the most challenging and important task, namely, to forecast future events. Extensive experiments on real datasets demonstrate that TriMine discovers meaningful topics and makes long-range forecasts, which are notoriously difficult to achieve. In fact, TriMine consistently outperforms the best state-of-the-art existing methods in terms of accuracy and execution speed (up to 74x faster).
- Published
- 2012
- Full Text
- View/download PDF
22. Statistical modeling of large distribution sets
- Author
-
Masatoshi Yoshikawa, Yasuko Matsubara, and Yasushi Sakurai
- Subjects
Computer science ,business.industry ,Data management ,Context (language use) ,Statistical model ,computer.software_genre ,Data type ,Hierarchical database model ,Anomaly detection ,Data mining ,Cluster analysis ,business ,computer ,Categorical variable - Abstract
In this paper we deal with a ubiquitous problem in data management: hierarchical model estimation for large distribution sets. This particular problem arises in many applications. Classification, top-k query processing, clustering and outlier detection are just a few possible applications. Our aim is to continuously and incrementally estimate the model parameters of 'typical' distributions that describe the characteristics of a database. Our approach to model estimation can handle arbitrary types of data (e.g., categorical and numerical data) in databases, incrementally, quickly, and with little resource consumption. Moreover, this paper proposes not only incremental algorithms for model fitting, but also a modeling framework in which the learning approach recognizes hierarchical groups, each of whose distributions has similar characteristics, and separately updates the model parameters of each group without scanning all the distributions in the database. Thus, it can provide a response, i.e., the parameters of typical distribution models, with an arbitrary level of granularity, at any time. Just as importantly, we demonstrate the utility of our approach by showing how it can be applied to two specific problems that arise in the context of data management.
- Published
- 2010
- Full Text
- View/download PDF
23. Scalable Algorithms for Distribution Search
- Author
-
Yasuko Matsubara, Masatoshi Yoshikawa, and Yasushi Sakurai
- Subjects
Full table scan ,Reduction (complexity) ,Kullback–Leibler divergence ,Speedup ,Market segmentation ,Nearest neighbor search ,Outlier ,Search cost ,Data mining ,computer.software_genre ,computer ,Mathematics - Abstract
Distribution data naturally arise in countless domains, such as meteorology, biology, geology, industry and economics. However, relatively little attention has been paid to data mining for large distribution sets. Given n distributions of multiple categories and a query distribution Q, we want to find similar clouds (i.e., distributions), to discover patterns, rules and outlier clouds. For example, consider the numerical case of sales of items, where, for each item sold, we record the unit price and quantity; then, each customer is represented as a distribution of 2-d points (one for each item he/she bought). We want to find similar users, e.g., for market segmentation, anomaly/fraud detection. We propose to address this problem and present D-Search, which includes fast and effective algorithms for similarity search in large distribution datasets. Our main contributions are (1) approximate KL divergence, which can speed up cloud-similarity computations, (2) multi-step sequential scan, which efficiently prunes a significant number of search candidates and leads to a direct reduction in the search cost. We also introduce an extended version of D-Search: (3) time-series distribution mining, which finds similar subsequences in time-series distribution datasets. Extensive experiments on real multi-dimensional datasets show that our solution achieves up to 2,300 times faster wall-clock time over the naive implementation while it does not sacrifice accuracy.
- Published
- 2009
- Full Text
- View/download PDF
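Illustrative note on record 23: the abstract describes similarity search over distributions using KL divergence. The sketch below is not D-Search itself (it implements neither the approximate KL divergence nor the multi-step sequential scan); it only shows, under stated assumptions, the naive baseline such a method would speed up: a full scan that ranks candidate histograms by symmetric KL divergence to a query. The function names, the smoothing constant `eps`, and the toy data are all hypothetical.

```python
import math


def normalize(counts):
    """Turn non-negative bucket counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    return [c / total for c in counts]


def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) for two discrete distributions of equal length.

    eps is an assumed smoothing constant that avoids log(0) when a
    bucket of q is empty; terms with p_i == 0 contribute nothing.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of buckets")
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def symmetric_kl(p, q):
    """Symmetrized divergence: D_KL(P || Q) + D_KL(Q || P)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


def nearest(query, candidates):
    """Naive full scan: name of the candidate closest to the query."""
    return min(candidates, key=lambda name: symmetric_kl(query, candidates[name]))


# Toy data (hypothetical): purchase histograms over three price buckets.
customers = {
    "a": normalize([8, 1, 1]),
    "b": normalize([1, 1, 8]),
    "c": normalize([3, 4, 3]),
}
query = normalize([7, 2, 1])
closest = nearest(query, customers)
```

The full scan costs one divergence computation per stored distribution, which is the cost that pruning and approximation techniques of the kind named in the abstract aim to reduce.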