229 results on '"Evangelos E. Papalexakis"'
Search Results
202. Mining Actionable Information from Security Forums: The Case of Malicious IP Addresses
- Author
-
Konstantinos Pelechrinis, Andre Castro, Joobin Gharibshah, Evangelos E. Papalexakis, Tai Ching Li, and Michalis Faloutsos
- Subjects
Focus (computing) ,Data collection ,Computer science ,020206 networking & telecommunications ,02 engineering and technology ,Blacklist ,Constructed language ,World Wide Web ,Identification (information) ,0202 electrical engineering, electronic engineering, information engineering ,Key (cryptography) ,Feature (machine learning) ,020201 artificial intelligence & image processing ,Hacker
- Abstract
The goal of this work is to systematically extract information from hacker forums, whose content is generally unstructured: the text of a post does not necessarily follow any writing rules. By contrast, many security initiatives and commercial entities harness readily available public information, but they seem to focus on structured sources. Here, we focus on the problem of identifying malicious IP addresses among the IP addresses reported in the forums. We develop a method to automate the identification of malicious IP addresses, with the design goal of being independent of external sources. A key novelty is that we use a matrix decomposition method to extract latent features of the behavioral information of the users, which we combine with textual information from the related posts. A key design feature of our technique is that it can be readily applied to forums in different languages, since it does not require a sophisticated natural language processing approach. In particular, our solution only needs a small number of keywords in the new language plus the users' behavior captured by specific features. We also develop a tool to automate data collection from security forums. Using our tool, we collect approximately 600K posts from three different forums. Our method exhibits high classification accuracy, and the precision of identifying malicious IPs in posts is greater than 88% in all three forums. We argue that our method can provide significantly more information: we find up to three times more potentially malicious IP addresses compared to the reference blacklist VirusTotal. As cyber-wars become more intense, early access to useful information becomes more imperative to remove the hackers' first-mover advantage, and our work is a solid step in this direction.
- Published
- 2019
- Full Text
- View/download PDF
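The pipeline in entry 202 (latent user-behavior features from a matrix decomposition, combined with simple keyword features, fed to a classifier) can be sketched roughly as follows. All data, dimensions, and keyword counts here are hypothetical placeholders, not the paper's actual features or forums:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_users, n_behavior = 200, 12

# hypothetical user-by-behavior matrix (e.g., post counts, threads started, ...)
B = rng.poisson(2.0, size=(n_users, n_behavior)).astype(float)

# latent behavioral features via truncated SVD (one matrix-decomposition choice)
U, s, Vt = np.linalg.svd(B, full_matrices=False)
latent = U[:, :4] * s[:4]                 # 4 latent features per user

# hypothetical per-user counts for a small, language-specific keyword list
keywords = rng.poisson(0.5, size=(n_users, 3)).astype(float)

X = np.hstack([latent, keywords])         # behavior + text features combined
y = rng.integers(0, 2, n_users)           # synthetic labels: 1 = malicious IP report

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
probs = clf.predict_proba(X)[:, 1]        # maliciousness scores
```

The paper operates on posts rather than users and uses its own feature set; this sketch only shows the shape of the combination.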
203. Embedding Lexical Features via Tensor Decomposition for Small Sample Humor Recognition
- Author
-
Xiaojuan Ma, Andrew Cattle, Evangelos E. Papalexakis, and Zhenjie Zhao
- Subjects
Computer science ,business.industry ,media_common.quotation_subject ,020206 networking & telecommunications ,Small sample ,02 engineering and technology ,computer.software_genre ,Pun ,SemEval ,Ranking (information retrieval) ,Task (project management) ,Simple (abstract algebra) ,020204 information systems ,Tensor (intrinsic definition) ,0202 electrical engineering, electronic engineering, information engineering ,Embedding ,Artificial intelligence ,business ,computer ,Natural language processing ,media_common
- Abstract
We propose a novel tensor embedding method that can effectively extract lexical features for humor recognition. Specifically, we use word-word co-occurrence to encode the contextual content of documents, and then decompose the tensor to get corresponding vector representations. We show that this simple method can capture features of lexical humor effectively for continuous humor recognition. In particular, we achieve a distance of 0.887 on a global humor ranking task, comparable to the top-performing systems from SemEval 2017 Task 6B (Potash et al., 2017) but without the need for any external training corpus. In addition, we further show that this approach is also beneficial for small sample humor recognition tasks through a semi-supervised label propagation procedure, which achieves about 0.7 accuracy on the 16000 One-Liners (Mihalcea and Strapparava, 2005) and Pun of the Day (Yang et al., 2015) humor classification datasets using only 10% of known labels.
- Published
- 2019
- Full Text
- View/download PDF
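The tensor-embedding idea in entry 203 can be illustrated in miniature: build a word-by-word-by-document co-occurrence tensor, then factor it with a minimal CP-ALS loop so that rows of the first factor serve as lexical embeddings. The corpus, co-occurrence window, and rank below are toy choices, not the paper's setup:

```python
import numpy as np

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
vocab = sorted({w for d in docs for w in d})
idx = {w: i for i, w in enumerate(vocab)}

# word x word x document co-occurrence tensor (adjacent-word windows)
T = np.zeros((len(vocab), len(vocab), len(docs)))
for d, doc in enumerate(docs):
    for a, b in zip(doc, doc[1:]):
        T[idx[a], idx[b], d] += 1
        T[idx[b], idx[a], d] += 1

# minimal CP-ALS: T[i,j,k] ~ sum_r A[i,r] B[j,r] C[k,r]
rng = np.random.default_rng(0)
R = 2
A, B, C = (rng.standard_normal((n, R)) for n in T.shape)

def kr(P, Q):  # column-wise Khatri-Rao product
    return (P[:, None, :] * Q[None, :, :]).reshape(-1, P.shape[1])

I, J, K = T.shape
for _ in range(50):
    A = T.reshape(I, J * K) @ np.linalg.pinv(kr(B, C).T)
    B = np.moveaxis(T, 1, 0).reshape(J, I * K) @ np.linalg.pinv(kr(A, C).T)
    C = np.moveaxis(T, 2, 0).reshape(K, I * J) @ np.linalg.pinv(kr(A, B).T)

word_vecs = A    # rows are lexical embeddings usable as humor features
```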
204. t-PNE: Tensor-Based Predictable Node Embeddings
- Author
-
Ekta Gujral, Sarah S. Lam, Evangelos E. Papalexakis, Danai Koutra, and Saba A. Al-Sayouri
- Subjects
Theoretical computer science ,Computer science ,Network structure ,02 engineering and technology ,ENCODE ,k-nearest neighbors algorithm ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Embedding ,Adjacency list ,Graph (abstract data type) ,020201 artificial intelligence & image processing ,Adjacency matrix ,Vector space
- Abstract
Graph representations have grown increasingly popular in recent years. Existing embedding approaches explicitly encode network structure. Despite their good performance in downstream tasks (e.g., node classification), there is still room for improvement in aspects such as effectiveness. In this paper, we propose t-PNE, a method that addresses this limitation. Contrary to baseline methods, which generally learn explicit node representations from the adjacency matrix alone, t-PNE avails itself of a multi-view information graph (the adjacency matrix represents the first view, and a nearest-neighbor adjacency, computed over the node features, the second) in order to learn explicit and implicit node representations using the Canonical Polyadic (a.k.a. CP) decomposition. We argue that the implicit and explicit mapping from a higher-dimensional to a lower-dimensional vector space is the key to learning more useful and highly predictable representations. Extensive experiments show that t-PNE drastically outperforms baseline methods, by up to 158.6% with respect to Micro-F1, on several multi-label classification problems.
- Published
- 2018
- Full Text
- View/download PDF
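The multi-view tensor that t-PNE decomposes can be constructed as in the sketch below: the graph's adjacency matrix as one frontal slice and a nearest-neighbor adjacency over node features as the other. The graph, features, and sizes are synthetic stand-ins:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
n = 30

# view 1: (synthetic) adjacency matrix of the graph itself
A = (rng.random((n, n)) < 0.1).astype(float)
A = np.maximum(A, A.T)
np.fill_diagonal(A, 0)

# view 2: nearest-neighbor adjacency computed over node features
feats = rng.standard_normal((n, 5))
Knn = kneighbors_graph(feats, n_neighbors=3).toarray()
Knn = np.maximum(Knn, Knn.T)

# stack the two views as frontal slices of a third-order tensor;
# a CP decomposition of T would then yield the node embeddings
T = np.stack([A, Knn], axis=2)
```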
205. Semi-supervised Content-Based Detection of Misinformation via Tensor Embeddings
- Author
-
Sara Abdali, Gisel Bastidas Guacho, Evangelos E. Papalexakis, and Neil Shah
- Subjects
FOS: Computer and information sciences ,Computer science ,Feature extraction ,Machine Learning (stat.ML) ,02 engineering and technology ,Semi-supervised learning ,Machine learning ,computer.software_genre ,Belief propagation ,Statistics - Applications ,Machine Learning (cs.LG) ,Statistics - Machine Learning ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Tensor decomposition ,Applications (stat.AP) ,Misinformation ,Tensor ,Social and Information Networks (cs.SI) ,business.industry ,Computer Science - Social and Information Networks ,Support vector machine ,Computer Science - Learning ,020201 artificial intelligence & image processing ,Artificial intelligence ,business ,computer ,Classifier (UML)
- Abstract
Fake news may be intentionally created to promote economic, political, and social interests, and can lead to negative impacts on human beliefs and decisions. Hence, detection of fake news is an emerging problem that has become extremely prevalent during the last few years. Most existing works on this topic focus on manual feature extraction and supervised classification models that leverage a large number of labeled (fake or real) articles. In contrast, we focus on content-based detection of fake news articles, while assuming that we have a small amount of labels, made available by manual fact-checkers or automated sources. We argue this is a more realistic setting in the presence of massive amounts of content, most of which cannot be easily fact-checked. To that end, we represent collections of news articles as multi-dimensional tensors, leverage tensor decomposition to derive concise article embeddings that capture spatial/contextual information about each news article, and use those embeddings to create an article-by-article graph on which we propagate limited labels. Results on three real-world datasets show that our method performs on par with or better than existing fully supervised models, in that we achieve better detection accuracy using fewer labels. In particular, our proposed method achieves 75.43% accuracy using only 30% of the labels of a public dataset, while an SVM-based classifier achieves 67.43%. Furthermore, our method achieves 70.92% accuracy on a large dataset using only 2% of the labels.
- Published
- 2018
- Full Text
- View/download PDF
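The semi-supervised pipeline of entry 205 (article embeddings, a nearest-neighbor article graph, then propagation of scarce labels) can be sketched with scikit-learn's `LabelSpreading` as a graph-propagation stand-in for the paper's belief-propagation step. Embeddings and labels below are synthetic:

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(0)

# stand-in article embeddings (e.g., from a tensor decomposition): two classes
emb = np.vstack([rng.normal(0, 0.5, (100, 8)),
                 rng.normal(3, 0.5, (100, 8))])
y_true = np.array([0] * 100 + [1] * 100)  # 0 = real, 1 = fake

# keep only a small fraction of labels, mimicking scarce fact-checks
y = np.full(200, -1)                      # -1 marks unlabeled articles
known = rng.choice(200, 20, replace=False)
y[known] = y_true[known]

# propagate labels over a kNN graph built on the embeddings
model = LabelSpreading(kernel="knn", n_neighbors=7).fit(emb, y)
pred = model.transduction_
```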
206. t-PINE: Tensor-based Predictable and Interpretable Node Embeddings
- Author
-
Evangelos E. Papalexakis, Saba A. Al-Sayouri, Danai Koutra, Ekta Gujral, and Sarah S. Lam
- Subjects
FOS: Computer and information sciences ,Theoretical computer science ,Computer science ,Communication ,Machine Learning (stat.ML) ,02 engineering and technology ,Machine Learning (cs.LG) ,Computer Science Applications ,k-nearest neighbors algorithm ,Visualization ,Human-Computer Interaction ,Computer Science - Learning ,Statistics - Machine Learning ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Media Technology ,Adjacency list ,Graph (abstract data type) ,020201 artificial intelligence & image processing ,Adjacency matrix ,Feature learning ,Information Systems ,Interpretability ,Vector space
- Abstract
Graph representations have grown increasingly popular in recent years. Existing representation learning approaches explicitly encode network structure. Despite their good performance in downstream tasks (e.g., node classification, link prediction), there is still room for improvement in aspects such as efficacy, visualization, and interpretability. In this paper, we propose t-PINE, a method that addresses these limitations. Contrary to baseline methods, which generally learn explicit graph representations from the adjacency matrix alone, t-PINE avails itself of a multi-view information graph (the adjacency matrix represents the first view, and a nearest-neighbor adjacency, computed over the node features, the second) in order to learn explicit and implicit node representations using the Canonical Polyadic (a.k.a. CP) decomposition. We argue that the implicit and explicit mapping from a higher-dimensional to a lower-dimensional vector space is the key to learning more useful, highly predictable, and gracefully interpretable representations. Good interpretable representations provide guidance for understanding how each view contributes to the representation learning process, and help us exclude unrelated dimensions. Extensive experiments show that t-PINE drastically outperforms baseline methods, by up to 351.5% with respect to Micro-F1, on several multi-label classification problems, while offering high visualization and interpretability utility.
- Published
- 2018
207. Network Anomaly Detection Using Co-clustering.
- Author
-
Evangelos E. Papalexakis, Alex Beutel, and Peter Steenkiste
- Published
- 2018
- Full Text
- View/download PDF
208. Balancing interpretability and predictive accuracy for unsupervised tensor mining
- Author
-
Evangelos E. Papalexakis and Ishmam Zabir
- Subjects
FOS: Computer and information sciences ,Rank (linear algebra) ,Computer science ,Heuristic (computer science) ,010401 analytical chemistry ,Machine Learning (stat.ML) ,02 engineering and technology ,computer.software_genre ,01 natural sciences ,Field (computer science) ,Machine Learning (cs.LG) ,0104 chemical sciences ,Data modeling ,Computer Science - Learning ,020401 chemical engineering ,Statistics - Machine Learning ,Tensor (intrinsic definition) ,Decomposition (computer science) ,Tensor ,Data mining ,0204 chemical engineering ,computer ,Interpretability
- Abstract
The PARAFAC tensor decomposition has enjoyed increasing success in exploratory multi-aspect data mining scenarios. A major challenge remains the estimation of the number of latent factors (i.e., the rank) of the decomposition, which is known to yield high-quality, interpretable results. AutoTen, a previously proposed automated tensor mining method, leverages a well-known quality heuristic from the field of chemometrics, the Core Consistency Diagnostic (CORCONDIA), to automatically determine the rank for the PARAFAC decomposition. In this work, building upon AutoTen, we set out to explore the trade-off between 1) the interpretability of the results (as expressed by CORCONDIA) and 2) the predictive accuracy of the decomposition, towards improving rank estimation quality. Our preliminary results indicate that striking a good balance in that trade-off yields high-quality rank estimation, a step towards achieving unsupervised tensor mining.
- Published
- 2017
- Full Text
- View/download PDF
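The CORCONDIA heuristic that entry 208 builds on can be computed directly: fit the least-squares Tucker core given fixed CP factors and measure its deviation from the ideal superdiagonal core (100 means perfect consistency). Below is a small sketch of the diagnostic itself, checked on a synthetic exact-rank tensor; it is not the AutoTen algorithm:

```python
import numpy as np

def corcondia(X, A, B, C):
    """Core Consistency Diagnostic for a CP model with factors A, B, C."""
    R = A.shape[1]
    # least-squares Tucker core G given the fixed CP factors
    M = np.kron(np.kron(A, B), C)                    # (I*J*K) x R^3
    g, *_ = np.linalg.lstsq(M, X.reshape(-1), rcond=None)
    G = g.reshape(R, R, R)
    ideal = np.zeros((R, R, R))
    for r in range(R):
        ideal[r, r, r] = 1.0                         # superdiagonal target
    return 100.0 * (1.0 - ((G - ideal) ** 2).sum() / R)

rng = np.random.default_rng(0)
A, B, C = (rng.standard_normal((5, 2)) for _ in range(3))
X = np.einsum('ir,jr,kr->ijk', A, B, C)              # exact rank-2 tensor
score = corcondia(X, A, B, C)                        # ~100 at the true rank
```

In practice the diagnostic is evaluated on factors fitted at several candidate ranks; a sharp drop in the score flags overestimated rank.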
209. Tensor Decomposition for Signal Processing and Machine Learning
- Author
-
Kejun Huang, Evangelos E. Papalexakis, Christos Faloutsos, Xiao Fu, Nicholas D. Sidiropoulos, and Lieven De Lathauwer
- Subjects
Topic model ,FOS: Computer and information sciences ,Computer Science - Machine Learning ,Theoretical computer science ,Rank (linear algebra) ,Machine Learning (stat.ML) ,02 engineering and technology ,Machine learning ,computer.software_genre ,Matrix decomposition ,Machine Learning (cs.LG) ,Statistics - Machine Learning ,0202 electrical engineering, electronic engineering, information engineering ,Multilinear subspace learning ,FOS: Mathematics ,Mathematics - Numerical Analysis ,Tensor ,Electrical and Electronic Engineering ,Mathematics ,Signal processing ,SISTA ,business.industry ,020206 networking & telecommunications ,Numerical Analysis (math.NA) ,Signal Processing ,Data analysis ,Identifiability ,020201 artificial intelligence & image processing ,Artificial intelligence ,business ,computer
- Abstract
Tensors, or multi-way arrays, are functions of three or more indices (i, j, k, ...), similar to matrices (two-way arrays), which are functions of two indices (r, c) for (row, column). Tensors have a rich history, stretching over almost a century and touching upon numerous disciplines, but they have only recently become ubiquitous in signal and data analytics, at the confluence of signal processing, statistics, data mining, and machine learning. This overview article aims to provide a good starting point for researchers and practitioners interested in learning about and working with tensors. As such, it focuses on fundamentals and motivation (using various application examples), aiming to strike an appropriate balance of breadth and depth that will enable someone who has taken first graduate courses in matrix algebra and probability to get started doing research and/or developing tensor algorithms and software. Some background in applied optimization is useful but not strictly required. The material covered includes tensor rank and rank decomposition; basic tensor factorization models and their relationships and properties (including fairly good coverage of identifiability); broad coverage of algorithms ranging from alternating optimization to stochastic gradient; statistical performance analysis; and applications ranging from source separation to collaborative filtering, mixture and topic modeling, classification, and multilinear subspace learning.
- Published
- 2017
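The rank decomposition at the heart of the overview in entry 209 can be illustrated with a minimal alternating-least-squares CP fit in NumPy. Sizes are toy; production implementations exploit sparsity, line search, and proper stopping criteria:

```python
import numpy as np

def cp_als(X, R, n_iter=300, seed=0):
    """Fit X[i,j,k] ~ sum_r A[i,r] B[j,r] C[k,r] by alternating least squares."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A, B, C = (rng.standard_normal((n, R)) for n in (I, J, K))
    # column-wise Khatri-Rao product
    kr = lambda P, Q: (P[:, None, :] * Q[None, :, :]).reshape(-1, P.shape[1])
    for _ in range(n_iter):
        A = X.reshape(I, J * K) @ np.linalg.pinv(kr(B, C).T)
        B = np.moveaxis(X, 1, 0).reshape(J, I * K) @ np.linalg.pinv(kr(A, C).T)
        C = np.moveaxis(X, 2, 0).reshape(K, I * J) @ np.linalg.pinv(kr(A, B).T)
    return A, B, C

rng = np.random.default_rng(1)
At, Bt, Ct = (rng.standard_normal((n, 2)) for n in (8, 7, 6))
X = np.einsum('ir,jr,kr->ijk', At, Bt, Ct)           # noiseless rank-2 tensor
A, B, C = cp_als(X, R=2)
err = np.linalg.norm(X - np.einsum('ir,jr,kr->ijk', A, B, C)) / np.linalg.norm(X)
```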
210. Data mining based on random forest model to predict the California ISO day-ahead market prices
- Author
-
Mahdi Kohansal, Hamed Mohsenian-Rad, Evangelos E. Papalexakis, and Ashkan Sadeghi-Mobarakeh
- Subjects
020209 energy ,02 engineering and technology ,computer.software_genre ,Ensemble learning ,Random forest ,Data modeling ,Order (exchange) ,Market data ,Value (economics) ,0202 electrical engineering, electronic engineering, information engineering ,Market price ,Economics ,Electricity market ,Data mining ,computer
- Abstract
In this paper, an ensemble learning model, namely the random forest (RF) model, is used to predict both the exact values and the class labels of the 24 hourly prices in the California Independent System Operator (CAISO) day-ahead electricity market. The focus is on predicting the prices for the Pacific Gas and Electric Company (PG&E) default load aggregation point (DLAP). Several effective features, such as the historical hourly prices at different locations, calendar data, and new ancillary service requirements, are engineered, and the model is trained in order to capture the best relations between the features and the target electricity price variables. Insightful case studies are conducted on CAISO market data from January 1, 2014 to February 28, 2016. It is observed that the proposed data mining approach provides promising results both in predicting the exact value and in classifying the prices as low, medium, and high.
- Published
- 2017
- Full Text
- View/download PDF
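The general shape of the setup in entry 210 (a random forest regressing hourly prices on lagged prices and calendar features) can be sketched on synthetic data. The features, lags, and price process below are illustrative placeholders, far simpler than the paper's engineered feature set:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
hours = np.arange(24 * 200)                       # ~200 synthetic days
price = 30 + 10 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 2, hours.size)

# features: hour of day plus prices 24h and 48h earlier (day-ahead setting)
lag24, lag48 = price[24:-24], price[:-48]
X = np.column_stack([hours[48:] % 24, lag24, lag48])
y = price[48:]

split = len(y) - 24                               # hold out the last day
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X[:split], y[:split])
pred = rf.predict(X[split:])
mae = np.abs(pred - y[split:]).mean()             # error on the held-out day
```

Classifying prices as low/medium/high, as in the paper, would swap in a `RandomForestClassifier` over binned targets.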
211. SPARTan: Scalable PARAFAC2 for Large & Sparse Data
- Author
-
Richard Vuduc, Ioakeim Perros, Fei Wang, Jimeng Sun, Evangelos E. Papalexakis, Elizabeth Searles, and Michael Thompson
- Subjects
FOS: Computer and information sciences ,Computer science ,02 engineering and technology ,Machine learning ,computer.software_genre ,Machine Learning (cs.LG) ,Set (abstract data type) ,020204 information systems ,Tensor (intrinsic definition) ,FOS: Mathematics ,0202 electrical engineering, electronic engineering, information engineering ,Tensor ,Spartan ,Sparse matrix ,Structure (mathematical logic) ,business.industry ,Process (computing) ,Computer Science - Numerical Analysis ,Numerical Analysis (math.NA) ,Computer Science - Learning ,Scalability ,Unsupervised learning ,020201 artificial intelligence & image processing ,Data mining ,Artificial intelligence ,business ,computer
- Abstract
In exploratory tensor mining, a common problem is how to analyze a set of variables across a set of subjects whose observations do not align naturally. For example, when modeling medical features across a set of patients, the number and duration of treatments may vary widely in time, meaning there is no meaningful way to align their clinical records across time points for analysis purposes. To handle such data, the state-of-the-art tensor model is the so-called PARAFAC2, which yields interpretable and robust output and can naturally handle sparse data. However, its main limitation up to now has been the lack of efficient algorithms that can handle large-scale datasets. In this work, we fill this gap by developing a scalable method to compute the PARAFAC2 decomposition of large and sparse datasets, called SPARTan. Our method exploits special structure within PARAFAC2, leading to a novel algorithmic reformulation that is both faster (in absolute time) and more memory-efficient than prior work. We evaluate SPARTan on both synthetic and real datasets, showing 22X performance gains over the best previous implementation and also handling larger problem instances for which the baseline fails. Furthermore, we are able to apply SPARTan to the mining of temporally-evolving phenotypes on data taken from real and medically complex pediatric patients. The clinical meaningfulness of the phenotypes identified in this process, as well as their temporal evolution over time for several patients, have been endorsed by clinical experts.
- Published
- 2017
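What makes PARAFAC2 (entry 211) suitable for unaligned observations is its model structure: each subject k gets its own factor U_k = Q_k H with orthonormal Q_k, so the slices X_k may have different heights while the cross-product U_kᵀU_k = HᵀH stays constant across subjects. A toy construction on synthetic data illustrates the constraint (this is the model, not the SPARTan algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
R = 3
V = rng.standard_normal((6, R))          # factor shared by all subjects
H = rng.standard_normal((R, R))          # shared cross-product factor

slices, grams = [], []
for n_k in (5, 9, 7):                    # ragged: each subject has its own length
    Q, _ = np.linalg.qr(rng.standard_normal((n_k, R)))
    U_k = Q @ H                          # subject-specific factor
    S_k = np.diag(rng.standard_normal(R))
    slices.append(U_k @ S_k @ V.T)       # X_k = U_k S_k V^T
    grams.append(U_k.T @ U_k)            # invariant: equals H^T H for every k
```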
212. Parallel Randomly Compressed Cubes : A scalable distributed architecture for big tensor decomposition
- Author
-
Christos Faloutsos, Nicholas D. Sidiropoulos, and Evangelos E. Papalexakis
- Subjects
Rank (linear algebra) ,Computer science ,business.industry ,Applied Mathematics ,Replica ,Computation ,Big data ,Parallel computing ,Parallel processing (DSP implementation) ,Signal Processing ,Identifiability ,Tensor ,Electrical and Electronic Engineering ,business ,Massively parallel
- Abstract
This article combines a tutorial on state-of-the-art tensor decomposition as it relates to big data analytics with original research on parallel and distributed computation of low-rank decomposition for big tensors, and a concise primer on Hadoop/MapReduce. A novel architecture for parallel and distributed computation of low-rank tensor decomposition, especially well suited for big tensors, is proposed. The new architecture is based on parallel processing of a set of randomly compressed, reduced-size replicas of the big tensor. Each replica is independently decomposed, and the results are joined via a master linear equation per tensor mode. The approach enables massive parallelism with guaranteed identifiability properties: if the big tensor is of low rank and the system parameters are appropriately chosen, then the rank-one factors of the big tensor will indeed be recovered from the analysis of the reduced-size replicas. Furthermore, the architecture affords memory/storage and complexity gains for a big tensor of low rank. No sparsity is required in the tensor or the underlying latent factors, although such sparsity can be exploited to improve memory, storage, and computational savings.
- Published
- 2014
- Full Text
- View/download PDF
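The key fact the replica architecture of entry 212 relies on can be verified numerically: randomly compressing a low-rank tensor along each mode yields a small replica whose CP factors are the compressed versions of the original factors. Sizes below are toy:

```python
import numpy as np

rng = np.random.default_rng(0)
I = J = K = 30; F = 3; L = 8             # big tensor size, rank, replica size
A, B, C = (rng.standard_normal((n, F)) for n in (I, J, K))
X = np.einsum('ir,jr,kr->ijk', A, B, C)  # low-rank "big" tensor

U, V, W = (rng.standard_normal((n, L)) for n in (I, J, K))
Y = np.einsum('ijk,il,jm,kn->lmn', X, U, V, W)      # randomly compressed replica

# the replica is itself rank-F, with factors U^T A, V^T B, W^T C
Y_cp = np.einsum('ir,jr,kr->ijk', U.T @ A, V.T @ B, W.T @ C)
```

Each replica can therefore be decomposed independently, and the master linear equations (U^T A = A_replica, etc.) recover the big tensor's factors.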
213. Network Anomaly Detection Using Co-clustering.
- Author
-
Evangelos E. Papalexakis, Alex Beutel, and Peter Steenkiste
- Published
- 2014
- Full Text
- View/download PDF
214. From $K$-Means to Higher-Way Co-Clustering: Multilinear Decomposition With Sparse Latent Factors
- Author
-
Nicholas D. Sidiropoulos, Evangelos E. Papalexakis, and Rasmus Bro
- Subjects
Multilinear map ,business.industry ,k-means clustering ,Pattern recognition ,Missing data ,Multilinear principal component analysis ,Biclustering ,ComputingMethodologies_PATTERNRECOGNITION ,Signal Processing ,Unsupervised learning ,Artificial intelligence ,Electrical and Electronic Engineering ,Cluster analysis ,business ,Algorithm ,Sparse matrix ,Mathematics
- Abstract
Co-clustering is a generalization of unsupervised clustering that has recently drawn renewed attention, driven by emerging data mining applications in diverse areas. Whereas clustering groups entire columns of a data matrix, co-clustering groups columns over select rows only, i.e., it simultaneously groups rows and columns. The concept generalizes to data “boxes” and higher-way tensors, for simultaneous grouping along multiple modes. Various co-clustering formulations have been proposed, but no workhorse analogous to K-means has emerged. This paper starts from K-means and shows how co-clustering can be formulated as a constrained multilinear decomposition with sparse latent factors. For three- and higher-way data, uniqueness of the multilinear decomposition implies that, unlike matrix co-clustering, it is possible to unravel a large number of possibly overlapping co-clusters. A basic multi-way co-clustering algorithm is proposed that exploits multilinearity using Lasso-type coordinate updates. Various line search schemes are then introduced to speed up convergence, and suitable modifications are proposed to deal with missing values. The imposition of latent sparsity pays a collateral dividend: it turns out that sequentially extracting one co-cluster at a time is almost optimal, hence the approach scales well for large datasets. The resulting algorithms are benchmarked against the state-of-the-art in pertinent simulations and applied to measured data, including the ENRON e-mail corpus.
- Published
- 2013
- Full Text
- View/download PDF
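The sparse-latent-factor idea in entry 214 can be sketched for a single matrix co-cluster: alternating rank-1 updates with soft-thresholding (a Lasso-type step) zero out rows and columns outside the co-cluster, so the supports of the factors reveal the block. This is a minimal sketch on planted, noiseless data, not the paper's full multi-way algorithm:

```python
import numpy as np

def soft(x, lam):
    """Soft-thresholding operator (the Lasso-type update)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# matrix with a planted co-cluster: rows 0-3 and columns 0-4
X = np.zeros((10, 12))
X[:4, :5] = 1.0

u, v = np.ones(10), np.ones(12)
for _ in range(20):                      # alternating sparse rank-1 updates
    u = soft(X @ v, 1.0)
    u /= np.linalg.norm(u) or 1.0
    v = soft(X.T @ u, 1.0)
    v /= np.linalg.norm(v) or 1.0

rows = np.flatnonzero(u)                 # recovered co-cluster row support
cols = np.flatnonzero(v)                 # recovered co-cluster column support
```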
215. Efficient and Distributed Algorithms for Large-Scale Generalized Canonical Correlations Analysis
- Author
-
Kejun Huang, Tom M. Mitchell, Christos Faloutsos, Xiao Fu, Evangelos E. Papalexakis, Partha Pratim Talukdar, Hyun Ah Song, and Nicholas D. Sidiropoulos
- Subjects
Theoretical computer science ,Computer science ,Approximation algorithm ,020206 networking & telecommunications ,02 engineering and technology ,Matrix (mathematics) ,Quadratic equation ,Square root ,Factorization ,Dimension (vector space) ,Distributed algorithm ,020204 information systems ,Generalized canonical correlation ,0202 electrical engineering, electronic engineering, information engineering ,Algorithm design ,Cubic function ,Sparse matrix
- Abstract
Generalized canonical correlation analysis (GCCA) aims at extracting common structure from multiple 'views', i.e., high-dimensional matrices representing the same objects in different feature domains – an extension of classical two-view CCA. Existing (G)CCA algorithms have serious scalability issues, since they involve square root factorization of the correlation matrices of the views. The memory and computational complexity associated with this step grow as a quadratic and cubic function of the problem dimension (the number of samples / features), respectively. To circumvent such difficulties, we propose a GCCA algorithm whose memory and computational costs scale linearly in the problem dimension and the number of nonzero data elements, respectively. Consequently, the proposed algorithm can easily handle very large sparse views whose sample and feature dimensions both exceed 100,000 – while the current approaches can only handle thousands of features / samples. Our second contribution is a distributed algorithm for GCCA, which computes the canonical components of different views in parallel and thus can further reduce the runtime significantly (by ≥ 30% in experiments) if multiple cores are available. Judiciously designed synthetic and real-data experiments using a multilingual dataset are employed to showcase the effectiveness of the proposed algorithms.
- Published
- 2016
- Full Text
- View/download PDF
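For contrast with entry 215's scalable approach, the classical MAX-VAR GCCA formulation (whose factorizations of the view correlation matrices cause the scalability problem) can be sketched densely: the common components G are the top eigenvectors of the sum of the views' projection matrices. The data here are synthetic, and this small dense version is exactly what the paper's algorithm avoids doing at scale:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 2
Z = rng.standard_normal((n, k))                  # shared latent structure
views = [Z @ rng.standard_normal((k, d)) + 0.01 * rng.standard_normal((n, d))
         for d in (10, 15, 12)]

# MAX-VAR GCCA: G = top-k eigenvectors of sum_i X_i (X_i^T X_i)^+ X_i^T
M = sum(X @ np.linalg.pinv(X) for X in views)    # X X^+ projects onto col span
vals, vecs = np.linalg.eigh(M)
G = vecs[:, -k:]                                 # common canonical components
```

With low noise, the top eigenvalues approach the number of views, confirming that G captures the structure shared by all views.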
216. Unsupervised Tensor Mining for Big Data Practitioners
- Author
-
Christos Faloutsos and Evangelos E. Papalexakis
- Subjects
Information Systems and Management ,business.industry ,Computer science ,Big data ,02 engineering and technology ,Data science ,Computer Science Applications ,Machine Learning ,Fully automated ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Leverage (statistics) ,Data Mining ,020201 artificial intelligence & image processing ,Computer Simulation ,Tensor ,Timestamp ,business ,Information Systems
- Abstract
Multiaspect data are ubiquitous in modern Big Data applications. For instance, different aspects of a social network are the different types of communication between people, the timestamp of each interaction, and the location associated with each individual. How can we jointly model all those aspects and leverage the additional information they introduce into our analysis? Tensors, which are multidimensional extensions of matrices, are a principled and mathematically sound way of modeling such multiaspect data. In this article, our goal is to popularize tensors and tensor decompositions among Big Data practitioners by demonstrating their effectiveness, outlining challenges that pertain to their application in Big Data scenarios, and presenting our recent work that tackles those challenges. We view this work as a step toward a fully automated, unsupervised tensor mining tool that can be easily and broadly adopted by practitioners in academia and industry.
- Published
- 2016
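The multiaspect modeling described in entry 216 is mechanical to set up: each aspect indexes one mode of a tensor. A toy (sender, receiver, channel) interaction log, with names and channels invented for illustration:

```python
import numpy as np

# toy multiaspect interaction log: (sender, receiver, channel)
events = [("alice", "bob", "email"), ("alice", "bob", "chat"),
          ("bob", "carol", "email"), ("alice", "bob", "email")]

users = sorted({u for s, r, _ in events for u in (s, r)})
channels = sorted({c for _, _, c in events})
ui = {u: i for i, u in enumerate(users)}
ci = {c: i for i, c in enumerate(channels)}

T = np.zeros((len(users), len(users), len(channels)))
for s, r, c in events:
    T[ui[s], ui[r], ci[c]] += 1          # count interactions per aspect triple
```

A tensor decomposition of `T` would then jointly summarize who talks to whom over which channel; adding a time aspect simply adds a fourth mode.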
217. Good-Enough Brain Model: Challenges, Algorithms, and Discoveries in Multisubject Experiments
- Author
-
Alona Fyshe, Evangelos E. Papalexakis, Christos Faloutsos, Partha Pratim Talukdar, Nicholas D. Sidiropoulos, and Tom M. Mitchell
- Subjects
Information Systems and Management ,Brain model ,Computer science ,business.industry ,Brain activity and meditation ,Big data ,Human brain ,Stimulus (physiology) ,Computer Science Applications ,medicine.anatomical_structure ,Noun ,Outlier ,medicine ,Habituation ,business ,Neuroscience ,Algorithm ,Information Systems
- Abstract
Given a simple noun such as apple, and a question such as "Is it edible?," what processes take place in the human brain? More specifically, given the stimulus, what are the interactions between (groups of) neurons (also known as functional connectivity), and how can we automatically infer those interactions given measurements of the brain activity? Furthermore, how does this connectivity differ across different human subjects? In this work, we show that this problem, even though it originates from the field of neuroscience, can benefit from big data techniques; we present a simple, novel good-enough brain model, or GeBM for short, and a novel algorithm, Sparse-SysId, which are able to effectively model the dynamics of the neuron interactions and infer the functional connectivity. Moreover, GeBM is able to simulate basic psychological phenomena such as habituation and priming (whose definitions we provide in the main text). We evaluate GeBM using real brain data. GeBM produces brain activity patterns that are strikingly similar to the real ones, and the inferred functional connectivity is able to provide neuroscientific insights toward a better understanding of the way neurons interact with each other, as well as to detect regularities and outliers in multisubject brain activity measurements.
- Published
- 2016
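One generic way to read the sparse system identification idea in entry 217: fit a linear dynamical model x_{t+1} ≈ A x_t with an L1 penalty per row, so the inferred connectivity matrix A is sparse. This is a hedged stand-in on synthetic dynamics, not the paper's Sparse-SysId algorithm or GeBM itself:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, T = 5, 400
A = np.zeros((n, n))
A[0, 1], A[1, 2], A[3, 3] = 0.5, -0.4, 0.6       # sparse true connectivity

# simulate noise-driven linear dynamics x_{t+1} = A x_t + w_t
X = np.zeros((n, T))
X[:, 0] = rng.standard_normal(n)
for t in range(T - 1):
    X[:, t + 1] = A @ X[:, t] + rng.standard_normal(n)

# sparse system identification: one Lasso regression per state dimension
A_hat = np.vstack([Lasso(alpha=0.01, fit_intercept=False)
                   .fit(X[:, :-1].T, X[i, 1:]).coef_
                   for i in range(n)])
```

The nonzero pattern of `A_hat` is the inferred functional connectivity.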
218. Coclustering-a useful tool for chemometrics
- Author
-
Rasmus Bro, Evrim Acar, Nicholas D. Sidiropoulos, and Evangelos E. Papalexakis
- Subjects
business.industry ,Applied Mathematics ,Rank (computer programming) ,computer.software_genre ,Machine learning ,Analytical Chemistry ,Chemometrics ,Principal component analysis ,Data mining ,Artificial intelligence ,Cluster analysis ,business ,computer ,Relevant information ,Mathematics - Abstract
Nowadays, chemometric applications in biology can readily deal with tens of thousands of variables, for instance, inomics and environmental analysis. Other areas of chemometrics also deal with distilling relevant information inhighly information-rich data sets. Traditional tools such as the principal component analysis or hierarchical cluster-ing are often not optimal for providing succinct and accurate information from high rank data sets. A relatively littleknownapproach thathas shownsignificant potential inother areas ofresearch iscoclustering, where adata matrix issimultaneously clustered in its rows and columns (objects and variables usually).Coclustering is the tool of choice when only a subset of variables is related to a specific grouping among objects.Hence, coclustering allows a select number of objects to share a particular behavior on a select number of variables.Inthispaper,wedescribethebasicsofcoclusteringandusethreedifferentexampledatasetstoshowtheadvantagesand shortcomings of coclustering. Copyright © 2012 John Wiley & Sons, Ltd.Keywords: clustering; coclustering; L1 norm; sparsity
- Published
- 2012
- Full Text
- View/download PDF
219. Power-Hop: A Pervasive Observation for Real Complex Networks
- Author
-
Konstantinos Pelechrinis, Bryan Hooi, Evangelos E. Papalexakis, and Christos Faloutsos
- Subjects
Proteomics ,Correlation dimension ,Computer and Information Sciences ,Computer science ,Social Sciences ,Geometry ,lcsh:Medicine ,Network topology ,01 natural sciences ,Biochemistry ,010305 fluids & plasmas ,Sociology ,0103 physical sciences ,Physicists ,Humans ,Computer Networks ,010306 general physics ,Protein Interactions ,lcsh:Science ,Network model ,Discrete mathematics ,Internet ,Multidisciplinary ,Social network ,business.industry ,Scale-free network ,lcsh:R ,Biology and Life Sciences ,Proteins ,Social Support ,Complex network ,Models, Theoretical ,Degree distribution ,Professions ,Fractals ,Social Networks ,Metric (mathematics) ,Physical Sciences ,People and Places ,Protein Interaction Networks ,Population Groupings ,lcsh:Q ,business ,Scale-Free Networks ,Biological network ,Network Analysis ,Mathematics ,Network analysis ,Research Article
- Abstract
Complex networks have been shown to exhibit universal properties, with one of the most consistent patterns being the scale-free degree distribution, but are there regularities obeyed by the r-hop neighborhood in real networks? We answer this question by identifying another power-law pattern that describes the relationship between the fractions of node pairs C(r) within r hops and the hop count r. This scale-free distribution is pervasive and describes a large variety of networks, ranging from social and urban to technological and biological networks. In particular, inspired by the definition of the fractal correlation dimension D2 on a point-set, we consider the hop-count r to be the underlying distance metric between two vertices of the network, and we examine the scaling of C(r) with r. We find that this relationship follows a power-law in real networks within the range 2 ≤ r ≤ d, where d is the effective diameter of the network, that is, the 90-th percentile distance. We term this relationship power-hop and the corresponding power-law exponent the power-hop exponent h. We provide theoretical justification for this pattern under existing successful network models, and we analyze a large set of real and synthetic network datasets, showing the pervasiveness of the power-hop.
- Published
- 2016
220. The Anatomy of American Football: Evidence from 7 Years of NFL Game Data
- Author
-
Konstantinos Pelechrinis, Evangelos E. Papalexakis, and Kimmo Eriksson
- Subjects
Computer science ,Physiology ,Social Sciences ,Action Potentials ,lcsh:Medicine ,Football ,01 natural sciences ,Coaching ,010104 statistics & probability ,Mathematical and Statistical Techniques ,Models ,Medicine and Health Sciences ,Psychology ,050207 economics ,lcsh:Science ,Multidisciplinary ,Anthropometry ,Applied Mathematics ,05 social sciences ,Statistical ,Sports Science ,Electrophysiology ,Physical Sciences ,Body Composition ,Probability distribution ,Engineering and Technology ,Team Behavior ,Games ,Game theory ,Statistics (Mathematics) ,Research Article ,Sports ,Statistical Distributions ,General Science & Technology ,American football ,Neurophysiology ,League ,Athletic Performance ,Research and Analysis Methods ,Membrane Potential ,Game Theory ,0502 economics and business ,Humans ,Engines ,0101 mathematics ,Statistical Methods ,Behavior ,Actuarial science ,Models, Statistical ,business.industry ,Mechanical Engineering ,lcsh:R ,Offensive ,Biology and Life Sciences ,Probability Theory ,United States ,Collective Human Behavior ,Athletes ,Recreation ,lcsh:Q ,business ,Mathematics ,Forecasting ,Neuroscience - Abstract
How much does a fumble affect the probability of winning an American football game? How balanced should your offense be in order to increase the probability of winning by 10%? These are questions for which the coaching staff of National Football League teams have a clear qualitative answer. Turnovers are costly; turn the ball over several times and you will certainly lose. Nevertheless, what does "several" mean? How "certain" is certainly? In this study, we collected play-by-play data from the past 7 NFL seasons (2009-2015) and built a descriptive model for the probability of winning a game. Even though our model incorporates only simple box score statistics, such as total offensive yards and number of turnovers, its overall cross-validation accuracy is 84%. Furthermore, we combine this descriptive model with a statistical bootstrap module to build FPM (short for Football Prediction Matchup) for predicting future match-ups. The contribution of FPM lies in its simplicity and transparency, which nevertheless do not sacrifice the system's performance. In particular, our evaluations indicate that our prediction engine performs on par with current state-of-the-art systems (e.g., ESPN's FPI and Microsoft's Cortana). The latter are typically proprietary, but based on their publicly described components they are significantly more complicated than FPM. Moreover, their proprietary nature does not allow a head-to-head comparison of the systems' core elements, but it should be evident that the features incorporated in FPM capture a large percentage of the observed variance in NFL games.
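A descriptive win-probability model of this flavor can be sketched as logistic regression on box-score differentials. The toy games, feature choices, and scaling below are made up for illustration; this is not the authors' actual FPM model:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Plain SGD logistic regression; w[0] is the intercept."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(a * b for a, b in zip(w[1:], xi)))
            g = p - yi  # gradient of log-loss w.r.t. the logit
            w[0] -= lr * g
            for j, a in enumerate(xi):
                w[j + 1] -= lr * g * a
    return w

def win_prob(w, x):
    """Estimated probability that the team wins, given its differentials."""
    return sigmoid(w[0] + sum(a * b for a, b in zip(w[1:], x)))

# toy games: features = (yard differential / 100, turnover differential)
X = [(1.2, -2), (0.5, -1), (-0.3, 1), (-1.0, 2), (0.8, 0), (-0.6, 0)]
y = [1, 1, 0, 0, 1, 0]   # 1 = win
w = fit_logistic(X, y)
```

A bootstrap module in the spirit of FPM would then resample each team's per-game statistics and feed the simulated differentials through `win_prob` to score future match-ups.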
- Published
- 2016
221. HaTen2: Billion-scale tensor decompositions
- Author
-
Christos Faloutsos, U Kang, Inah Jeon, and Evangelos E. Papalexakis
- Subjects
Theoretical computer science ,Scale (ratio) ,Computer science ,Algorithm design ,Tensor ,Data mining ,computer.software_genre ,computer ,Matrix decomposition ,Data modeling - Abstract
How can we find useful patterns and anomalies in large scale real-world data with multiple attributes? For example, network intrusion logs, with (source-ip, target-ip, port-number, timestamp)? Tensors are suitable for modeling these multi-dimensional data, and widely used for the analysis of social networks, web data, network traffic, and in many other settings. However, current tensor decomposition methods do not scale to tensors with millions or billions of rows, columns, and 'fibers', which often appear in real datasets. In this paper, we propose HaTen2, a scalable distributed suite of tensor decomposition algorithms running on the MapReduce platform. By carefully reordering the operations, and exploiting the sparsity of real world tensors, HaTen2 dramatically reduces the intermediate data and the number of jobs. As a result, using HaTen2, we analyze big real-world tensors that cannot be handled by the current state of the art, and discover hidden concepts.
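The kernel whose intermediate-data blowup such systems must avoid is the matricized-tensor times Khatri-Rao product (MTTKRP) inside the decomposition's alternating updates. A minimal single-machine sketch of the sparsity-exploiting idea, on a tiny hand-made tensor (not HaTen2's MapReduce implementation):

```python
def mttkrp_mode1(coo, B, C, I, F):
    """Mode-1 MTTKRP for a sparse 3-way tensor in COO form:
    M[i][f] = sum over nonzeros (i, j, k, v) of v * B[j][f] * C[k][f].
    Only the nonzeros are touched, so intermediate data stays sparse --
    the kind of trick HaTen2 scales up across MapReduce jobs."""
    M = [[0.0] * F for _ in range(I)]
    for i, j, k, v in coo:
        for f in range(F):
            M[i][f] += v * B[j][f] * C[k][f]
    return M

# toy 3x3x3 tensor with 4 nonzeros, and rank-2 factor matrices B, C
coo = [(0, 0, 0, 1.0), (1, 1, 0, 2.0), (2, 2, 1, 3.0), (0, 2, 2, 0.5)]
F = 2
B = [[1.0, 0.5], [0.2, 1.0], [0.3, 0.7]]
C = [[1.0, 0.1], [0.4, 0.9], [0.6, 0.2]]
M = mttkrp_mode1(coo, B, C, I=3, F=F)
```

Materializing the dense Khatri-Rao product B ⊙ C first would cost J·K·F entries per mode; iterating over nonzeros costs only nnz·F, which is what makes billion-scale inputs tractable.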
- Published
- 2015
- Full Text
- View/download PDF
222. A parallel algorithm for big tensor decomposition using randomly compressed cubes (PARACOMP)
- Author
-
Evangelos E. Papalexakis, Nikos D. Sidiropoulos, and Christos Faloutsos
- Subjects
Discrete mathematics ,Set (abstract data type) ,Theoretical computer science ,Rank (linear algebra) ,Parallel processing (DSP implementation) ,Computer science ,Parallel algorithm ,Multilinear subspace learning ,Identifiability ,Tensor ,Linear equation - Abstract
A parallel algorithm for low-rank tensor decomposition that is especially well-suited for big tensors is proposed. The new algorithm is based on parallel processing of a set of randomly compressed, reduced-size 'replicas' of the big tensor. Each replica is independently decomposed, and the results are joined via a master linear equation per tensor mode. The approach enables massive parallelism with guaranteed identifiability properties: if the big tensor has low rank and the system parameters are appropriately chosen, then the rank-one factors of the big tensor will be exactly recovered from the analysis of the reduced-size replicas. The proposed algorithm is proven to yield memory/storage and complexity gains of order up to IJ/F for a big tensor of size I × J × K of rank F with F ≤ I ≤ J ≤ K.
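The compression step can be illustrated directly: each replica is the big tensor multiplied along every mode by a random matrix, so a P × Q × S replica stands in for the full I × J × K tensor. A small stand-alone sketch of that step only (the independent decompositions and the master linear systems that stitch the factors back together are omitted):

```python
import random

def compress(X, U, V, W):
    """Random multilinear compression of a 3-way tensor:
    Y[p][q][s] = sum_ijk U[p][i] * V[q][j] * W[s][k] * X[i][j][k].
    Each random (U, V, W) triple yields one reduced-size replica."""
    I, J, K = len(X), len(X[0]), len(X[0][0])
    P, Q, S = len(U), len(V), len(W)
    Y = [[[0.0] * S for _ in range(Q)] for _ in range(P)]
    for p in range(P):
        for q in range(Q):
            for s in range(S):
                Y[p][q][s] = sum(
                    U[p][i] * V[q][j] * W[s][k] * X[i][j][k]
                    for i in range(I) for j in range(J) for k in range(K))
    return Y

def randmat(r, c):
    return [[random.gauss(0, 1) for _ in range(c)] for _ in range(r)]

random.seed(0)
I = J = K = 4          # the "big" tensor (tiny here, for illustration)
P = Q = S = 2          # compressed replica dimensions
X = [[[random.random() for _ in range(K)] for _ in range(J)] for _ in range(I)]
replicas = [compress(X, randmat(P, I), randmat(Q, J), randmat(S, K))
            for _ in range(3)]
```

In PARACOMP the replicas are decomposed fully in parallel; identifiability requires enough replicas with appropriately sized random matrices, conditions spelled out in the paper.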
- Published
- 2014
- Full Text
- View/download PDF
223. Predicting Code-switching in Multilingual Communication for Immigrant Communities
- Author
-
A. Seza Doğruöz, Evangelos E. Papalexakis, and Dong Nguyen
- Subjects
Online discussion ,computational social sciences ,Computer science ,Turkish ,social media ,media_common.quotation_subject ,Lt3 ,Immigration ,Multilingualism ,code-switching ,02 engineering and technology ,Languages and Literatures ,IR-94058 ,0202 electrical engineering, electronic engineering, information engineering ,Social media ,natural language processing ,Natural Language Processing ,media_common ,060201 languages & linguistics ,EWI-25499 ,METIS-309773 ,06 humanities and the arts ,Code-switching ,computational sociolinguistics ,Linguistics ,language.human_language ,0602 languages and literature ,language ,020201 artificial intelligence & image processing ,Dutch ,Social Media ,Host (network) - Abstract
Immigrant communities host multilingual speakers who switch across languages and cultures in their daily communication practices. Although there are in-depth linguistic descriptions of code-switching across different multilingual communication settings, there is a need for automatic prediction of code-switching in large datasets. We use emoticons and multi-word expressions as novel features to predict code-switching in a large online discussion forum for the Turkish-Dutch immigrant community in the Netherlands. Our results indicate that multi-word expressions are powerful features to predict code-switching.
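The two novel feature families, emoticons and multi-word expressions, reduce to simple counting at extraction time. The lexicons below are hypothetical placeholders (the paper's actual emoticon and MWE lists differ), and a real pipeline would feed these counts into a trained classifier:

```python
import re

# hypothetical feature lists -- illustrative only, not the paper's lexicons
EMOTICONS = [":)", ":(", ":D", ";)", ":P"]
MWES = ["hadi bakalım", "come on"]   # example multi-word expressions

def features(post):
    """Count emoticon and multi-word-expression occurrences in a post --
    the two feature families used to predict code-switching."""
    text = post.lower()
    return {
        "n_emoticons": sum(post.count(e) for e in EMOTICONS),
        "n_mwes": sum(text.count(m) for m in MWES),
        "n_tokens": len(re.findall(r"\w+", text)),
    }

f = features("hadi bakalım, see you tomorrow :) :)")
```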
- Published
- 2014
- Full Text
- View/download PDF
224. TensorSplat: Spotting Latent Anomalies in Time
- Author
-
Christos Faloutsos, Danai Koutra, and Evangelos E. Papalexakis
- Subjects
Network security ,business.industry ,Computer science ,Graph theory ,Spotting ,Variety (linguistics) ,computer.software_genre ,Matrix decomposition ,Large networks ,Multiple time dimensions ,Decomposition method (constraint satisfaction) ,Data mining ,business ,computer ,Computer Science::Databases - Abstract
How can we spot anomalies in large, time-evolving graphs? When we have multi-aspect data, e.g. who published which paper on which conference and on what year, how can we combine this information in order to obtain good summaries thereof and unravel hidden anomalies and patterns? Such multi-aspect data, including time-evolving graphs, can be successfully modeled using tensors. In this paper, we show that when we have multiple dimensions in the dataset, tensor analysis is a powerful and promising tool. Our method TENSORSPLAT, at the heart of which lies the PARAFAC decomposition method, can give good insights into the large networks of interest today, and contributes to spotting micro-clusters, changes and, in general, anomalies. We report extensive experiments on a variety of datasets (co-authorship network, time-evolving DBLP network, computer network and Facebook wall posts) and show how tensors can prove useful in detecting "strange" behaviors.
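The setting can be made concrete with a deliberately crude stand-in: model network events as (src, dst, time) triples, summarize each time slice, and flag slices that deviate. TensorSplat itself inspects PARAFAC factor vectors rather than raw per-slice counts; the z-score proxy below only illustrates the "anomalous time slice" idea on synthetic events:

```python
import math
from collections import defaultdict

def time_slice_scores(events):
    """events: (src, dst, t) triples. Score each timestep's activity
    by its z-score -- a crude proxy for inspecting the temporal factor
    of a PARAFAC decomposition of the (src, dst, time) tensor."""
    counts = defaultdict(int)
    for _, _, t in events:
        counts[t] += 1
    ts = sorted(counts)
    vals = [counts[t] for t in ts]
    mu = sum(vals) / len(vals)
    sd = math.sqrt(sum((v - mu) ** 2 for v in vals) / len(vals)) or 1.0
    return {t: (counts[t] - mu) / sd for t in ts}

# steady traffic on two edges, with a synthetic burst at t = 3
events = ([(0, 1, t) for t in range(6)] +
          [(1, 2, t) for t in range(6)] +
          [(i, 9, 3) for i in range(10)])    # anomalous burst
scores = time_slice_scores(events)
anomalous = max(scores, key=scores.get)
```

The tensor view earns its keep beyond this toy: PARAFAC jointly localizes *which* sources and destinations drive the burst, not just *when* it happens.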
- Published
- 2012
- Full Text
- View/download PDF
225. Translation invariant word embeddings
- Author
-
Kejun Huang, Christos Faloutsos, Tom M. Mitchell, Matt Gardner, Evangelos E. Papalexakis, Xiao Fu, Nikos D. Sidiropoulos, and Partha Pratim Talukdar
- Subjects
Theoretical computer science ,business.industry ,Computer science ,Artificial intelligence ,Invariant (mathematics) ,computer.software_genre ,business ,computer ,Natural language processing - Abstract
This work focuses on the task of finding latent vector representations of the words in a corpus. In particular, we address the issue of what to do when there are multiple languages in the corpus. Prior work has, among other techniques, used canonical correlation analysis to project pre-trained vectors in two languages into a common space. We propose a simple and scalable method that is inspired by the notion that the learned vector representations should be invariant to translation between languages. We show empirically that our method outperforms prior work on multilingual tasks, matches the performance of prior work on monolingual tasks, and scales linearly with the size of the input data (and thus the number of languages being embedded).
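The underlying notion of mapping one language's vectors onto another's via a seed translation dictionary can be sketched as a closed-form least-squares fit. This 2-D toy (hand-made vectors, a rotation as the "translation") illustrates that idea only; it is not the paper's scalable multilingual method:

```python
def learn_translation_map(src, tgt):
    """Least-squares linear map A with A @ src[i] ≈ tgt[i] (2-D case):
    A = T S^T (S S^T)^{-1}, fit over a seed translation dictionary."""
    SSt = [[0.0, 0.0], [0.0, 0.0]]   # S S^T accumulator
    TSt = [[0.0, 0.0], [0.0, 0.0]]   # T S^T accumulator
    for s, t in zip(src, tgt):
        for a in range(2):
            for b in range(2):
                SSt[a][b] += s[a] * s[b]
                TSt[a][b] += t[a] * s[b]
    det = SSt[0][0] * SSt[1][1] - SSt[0][1] * SSt[1][0]
    inv = [[SSt[1][1] / det, -SSt[0][1] / det],
           [-SSt[1][0] / det, SSt[0][0] / det]]
    return [[sum(TSt[a][k] * inv[k][b] for k in range(2)) for b in range(2)]
            for a in range(2)]

# toy "dictionary": the target space is the source rotated by 90 degrees
src = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
tgt = [(0.0, 1.0), (-1.0, 0.0), (-1.0, 1.0)]
A = learn_translation_map(src, tgt)   # recovers the rotation exactly
```

With consistent dictionary pairs the fit recovers the ground-truth map exactly; with noisy real embeddings it returns the best linear approximation, and translation invariance then asks that learned representations be stable under such maps.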
226. NetSpot: Spotting significant anomalous regions on dynamic networks
- Author
-
Misael Mongiovì, Petko Bogdanov, Razvan Ranca, Evangelos E. Papalexakis, Christos Faloutsos, and Ambuj K. Singh
227. Proceedings of the 2022 SIAM International Conference on Data Mining, SDM 2022, Alexandria, VA, USA, April 28-30, 2022
- Author
-
Arindam Banerjee 0001, Zhi-Hua Zhou, Evangelos E. Papalexakis, and Matteo Riondato
- Published
- 2022
- Full Text
- View/download PDF
228. The Anatomy of American Football: Evidence from 7 Years of NFL Game Data.
- Author
-
Pelechrinis K and Papalexakis E
- Subjects
- Humans, United States, Anthropometry methods, Athletes statistics & numerical data, Athletic Performance physiology, Body Composition, Football physiology, Models, Statistical
- Abstract
How much does a fumble affect the probability of winning an American football game? How balanced should your offense be in order to increase the probability of winning by 10%? These are questions for which the coaching staff of National Football League teams have a clear qualitative answer. Turnovers are costly; turn the ball over several times and you will certainly lose. Nevertheless, what does "several" mean? How "certain" is certainly? In this study, we collected play-by-play data from the past 7 NFL seasons (2009-2015) and built a descriptive model for the probability of winning a game. Even though our model incorporates only simple box score statistics, such as total offensive yards and number of turnovers, its overall cross-validation accuracy is 84%. Furthermore, we combine this descriptive model with a statistical bootstrap module to build FPM (short for Football Prediction Matchup) for predicting future match-ups. The contribution of FPM lies in its simplicity and transparency, which nevertheless do not sacrifice the system's performance. In particular, our evaluations indicate that our prediction engine performs on par with current state-of-the-art systems (e.g., ESPN's FPI and Microsoft's Cortana). The latter are typically proprietary, but based on their publicly described components they are significantly more complicated than FPM. Moreover, their proprietary nature does not allow a head-to-head comparison of the systems' core elements, but it should be evident that the features incorporated in FPM capture a large percentage of the observed variance in NFL games. Competing Interests: The authors have declared that no competing interests exist.
- Published
- 2016
- Full Text
- View/download PDF
229. Power-Hop: A Pervasive Observation for Real Complex Networks.
- Author
-
Papalexakis E, Hooi B, Pelechrinis K, and Faloutsos C
- Subjects
- Humans, Models, Theoretical, Social Support
- Abstract
Complex networks have been shown to exhibit universal properties, with one of the most consistent patterns being the scale-free degree distribution, but are there regularities obeyed by the r-hop neighborhood in real networks? We answer this question by identifying another power-law pattern that describes the relationship between the fractions of node pairs C(r) within r hops and the hop count r. This scale-free distribution is pervasive and describes a large variety of networks, ranging from social and urban to technological and biological networks. In particular, inspired by the definition of the fractal correlation dimension D2 on a point-set, we consider the hop-count r to be the underlying distance metric between two vertices of the network, and we examine the scaling of C(r) with r. We find that this relationship follows a power-law in real networks within the range 2 ≤ r ≤ d, where d is the effective diameter of the network, that is, the 90-th percentile distance. We term this relationship the power-hop and the corresponding power-law exponent the power-hop exponent h. We provide theoretical justification for this pattern under established network models, and by analyzing a large set of real and synthetic network datasets we show the pervasiveness of the power-hop.
- Published
- 2016
- Full Text
- View/download PDF