153 results for "James, Stephen A"
Search Results
2. Angle-based joint and individual variation explained
- Author
-
Meilei Jiang, Qing Feng, James Stephen Marron, and Jan Hannig
- Subjects
Computer and information sciences, Statistics and Probability, Normalization (statistics), Numerical Analysis, Computer science, Machine Learning (stat.ML), Machine learning, Data structure, Exploratory data analysis, Software, Disparate system, Linear algebra, Segmentation, Data mining, Artificial intelligence, Data integration - Abstract
Integrative analysis of disparate data blocks measured on a common set of experimental subjects is a major challenge in modern data analysis. This data structure naturally motivates the simultaneous exploration of the joint and individual variation within each data block, resulting in new insights. For instance, there is a strong desire to integrate the multiple genomic data sets in The Cancer Genome Atlas to characterize the common and also the unique aspects of cancer genetics and cell biology for each source. In this paper we introduce Angle-Based Joint and Individual Variation Explained (AJIVE), which captures both joint and individual variation within each data block. This is a major improvement over earlier approaches to this challenge in terms of a new conceptual understanding, much better adaptation to data heterogeneity, and a fast linear algebra computation. Important mathematical contributions are the use of score subspaces as the principal descriptors of variation structure and the use of perturbation theory as the guide for variation segmentation. This leads to an exploratory data analysis method which is insensitive to the heterogeneity among data blocks and does not require separate normalization. An application to cancer data reveals different behaviors of each type of signal in characterizing tumor subtypes. An application to a mortality data set reveals interesting historical lessons. Software and data are available at GitHub.
- Published
- 2018
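The score-subspace idea in the AJIVE abstract above can be illustrated with principal angles. Below is a minimal sketch, not the authors' implementation: build rank-2 score subspaces of two data blocks by SVD and inspect the principal angles between them, where near-zero angles suggest joint variation. All data and ranks are made up for illustration.

```python
import numpy as np
from scipy.linalg import subspace_angles

rng = np.random.default_rng(0)
n = 100                                   # common set of subjects
joint = rng.normal(size=(n, 1))           # score of a shared (joint) signal
X1 = joint @ rng.normal(size=(1, 50)) + 0.1 * rng.normal(size=(n, 50))
X2 = joint @ rng.normal(size=(1, 30)) + 0.1 * rng.normal(size=(n, 30))

def score_subspace(X, r):
    """Leading left singular vectors: the rank-r score subspace of a block."""
    U, _, _ = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    return U[:, :r]

# Principal angles (radians) between the two score subspaces;
# an angle near zero indicates a direction shared across blocks.
angles = subspace_angles(score_subspace(X1, 2), score_subspace(X2, 2))
print(np.degrees(angles))
```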
3. Fast Algorithms for Large-Scale Generalized Distance Weighted Discrimination
- Author
-
James Stephen Marron, Defeng Sun, Kim-Chuan Toh, and Xin Yee Lam
- Subjects
Statistics and Probability, Scale (ratio), Computer science, Support vector machine, Range (statistics), Discrete Mathematics and Combinatorics, Statistical analysis, Algorithm - Abstract
High-dimension-low-sample size statistical analysis is important in a wide range of applications. In such situations, the highly appealing discrimination method, support vector machine, can be impr...
- Published
- 2018
4. A survey of high dimension low sample size asymptotics
- Author
-
Yi-Hui Zhou, James Stephen Marron, Makoto Aoshima, Haipeng Shen, Dan Shen, and Kazuyoshi Yata
- Subjects
Statistics and Probability, Mathematical statistics, Dimension (vector space), Sample size determination, Statistical physics, Smoothing, Mathematics - Abstract
Peter Hall’s work illuminated many aspects of statistical thought, some of which are very well known including the bootstrap and smoothing. However, he also explored many other lesser known aspects of mathematical statistics. This is a survey of one of those areas, initiated by a seminal paper in 2005, on high dimension low sample size asymptotics. An interesting characteristic of that first paper, and of many of the following papers, is that they contain deep and insightful concepts which are frequently surprising and counter-intuitive, yet have mathematical underpinnings which tend to be direct and not difficult to prove.
- Published
- 2018
5. Comments on: Data science, big data and statistics
- Author
-
James Stephen Marron
- Subjects
Statistics and Probability, Computer science, Big data, Data science - Published
- 2019
6. Eigenvalue significance testing for genetic association
- Author
-
Fred A. Wright, Yi-Hui Zhou, and James Stephen Marron
- Subjects
Statistics and Probability, Mathematical optimization, General Immunology and Microbiology, Applied Mathematics, Null (mathematics), Asymptotic distribution, Contrast (statistics), General Medicine, Population stratification, General Biochemistry, Genetics and Molecular Biology, Permutation, Covariate, General Agricultural and Biological Sciences, Spurious relationship, Eigenvalues and eigenvectors, Mathematics - Abstract
Genotype eigenvectors are widely used as covariates for control of spurious stratification in genetic association. Significance testing for the accompanying eigenvalues has typically been based on a standard Tracy-Widom limiting distribution for the largest eigenvalue, derived under white-noise assumptions. It is known that even modest local correlation among markers inflates the largest eigenvalues, even in the absence of true stratification. In addition, a few sample eigenvalues may be extreme, creating further complications in accurate testing. We explore several methods to identify appropriate null eigenvalue thresholds, while remaining sensitive to eigenvalues corresponding to population stratification. We introduce a novel block permutation approach, designed to produce an appropriate null eigenvalue distribution by eliminating long-range genomic correlation while preserving local correlation. We also propose a fast approach based on eigenvalue distribution modeling, using a simple fit criterion and the general Marcenko-Pastur equation under a simple discrete eigenvalue model. Block permutation and the model-based approach work well for pure simulations and for data resampled from the 1000 Genomes project. In contrast, we find that the standard approach of computing an "effective" number of markers does not perform well. The performance of the methods is also demonstrated for a motivating example from the International Cystic Fibrosis Consortium.
- Published
- 2017
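The block permutation idea in the abstract above can be sketched directly: independently permute individuals within contiguous marker blocks, which destroys long-range (stratification-like) correlation while preserving local correlation, then recompute the leading eigenvalue to build a null distribution. A minimal sketch; the genotype matrix G, the block size, and the use of only the top eigenvalue are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

def top_eig(G):
    """Largest eigenvalue of the standardized genotype covariance."""
    Z = (G - G.mean(0)) / (G.std(0) + 1e-12)
    return np.linalg.eigvalsh(Z @ Z.T / Z.shape[1])[-1]

def block_permutation_null(G, block=100, n_perm=200, seed=0):
    """Null distribution of the top eigenvalue via block permutation."""
    rng = np.random.default_rng(seed)
    n, p = G.shape
    null = np.empty(n_perm)
    for b in range(n_perm):
        Gp = G.copy()
        for s in range(0, p, block):
            # Permute individuals within this marker block only,
            # breaking correlation across blocks but keeping local LD.
            Gp[:, s:s + block] = Gp[rng.permutation(n), s:s + block]
        null[b] = top_eig(Gp)
    return null

# p-value for the observed top eigenvalue:
# pval = (1 + np.sum(null >= top_eig(G))) / (1 + n_perm)
```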
7. Computation of ancestry scores with mixed families and unrelated individuals
- Author
-
Fred A. Wright, James Stephen Marron, and Yi-Hui Zhou
- Subjects
Statistics and Probability, General Immunology and Microbiology, Covariance matrix, Applied Mathematics, Robustness (evolution), General Medicine, Population stratification, General Biochemistry, Genetics and Molecular Biology, Resampling, Principal component analysis, Statistics, 1000 Genomes Project, General Agricultural and Biological Sciences, Eigenvalues and eigenvectors, Mathematics, Genetic association - Abstract
The issue of robustness to family relationships in computing genotype ancestry scores such as eigenvector projections has received increased attention in genetic association, and is particularly challenging when sets of both unrelated individuals and closely related family members are included. The current standard is to compute loadings (left singular vectors) using unrelated individuals and to compute projected scores for remaining family members. However, projected ancestry scores from this approach suffer from shrinkage toward zero. We consider two main novel strategies: (i) matrix substitution based on decomposition of a target family-orthogonalized covariance matrix, and (ii) using family-averaged data to obtain loadings. We illustrate the performance via simulations, including resampling from 1000 Genomes Project data, and analysis of a cystic fibrosis dataset. The matrix substitution approach has similar performance to the current standard, but is simple and uses only a genotype covariance matrix, while the family-average method shows superior performance. Our approaches are accompanied by novel ancillary approaches that provide considerable insight, including individual-specific eigenvalue scree plots.
- Published
- 2017
8. Statistical significance for hierarchical clustering
- Author
-
Yufeng Liu, Patrick K. Kimes, James Stephen Marron, and David N. Hayes
- Subjects
Statistics and Probability, Fuzzy clustering, Correlation clustering, Machine learning, General Biochemistry, Genetics and Molecular Biology, Consensus clustering, Cluster analysis, Mathematics, Brown clustering, General Immunology and Microbiology, Applied Mathematics, Constrained clustering, General Medicine, Hierarchical clustering, Canopy clustering algorithm, Data mining, Artificial intelligence, General Agricultural and Biological Sciences - Abstract
Cluster analysis has proved to be an invaluable tool for the exploratory and unsupervised analysis of high-dimensional datasets. Among methods for clustering, hierarchical approaches have enjoyed substantial popularity in genomics and other fields for their ability to simultaneously uncover multiple layers of clustering structure. A critical and challenging question in cluster analysis is whether the identified clusters represent important underlying structure or are artifacts of natural sampling variation. Few approaches have been proposed for addressing this problem in the context of hierarchical clustering, for which the problem is further complicated by the natural tree structure of the partition, and the multiplicity of tests required to parse the layers of nested clusters. In this article, we propose a Monte Carlo based approach for testing statistical significance in hierarchical clustering which addresses these issues. The approach is implemented as a sequential testing procedure guaranteeing control of the family-wise error rate. Theoretical justification is provided for our approach, and its power to detect true clustering structure is illustrated through several simulation studies and applications to two cancer gene expression datasets.
- Published
- 2017
9. Principal Nested Spheres for Time-Warped Functional Data Analysis
- Author
-
James Stephen Marron, Qunqun Yu, and Xiaosun Lu
- Subjects
Statistics and Probability, Dynamic time warping, Mathematical analysis, Functional data analysis, Geometry, Domain (mathematical analysis), Manifold, Principal component analysis, Metric (mathematics), Tangent space, Discrete Mathematics and Combinatorics, Image warping, Mathematics - Abstract
There are often two important types of variation in functional data: the horizontal (or phase) variation and the vertical (or amplitude) variation. These two types of variation have been appropriately separated and modeled through a domain warping method (or curve registration) based on the Fisher–Rao metric. This article focuses on the analysis of the horizontal variation, captured by the domain warping functions. The square-root velocity function representation transforms the manifold of the warping functions to a Hilbert sphere. Motivated by recent results on manifold analogs of principal component analysis, we propose to analyze the horizontal variation via a principal nested spheres approach. Compared with earlier approaches, such as approximating tangent plane principal component analysis, this is seen to be an efficient and interpretable approach to decompose the horizontal variation in both simulated and real data examples.
- Published
- 2017
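The key map mentioned in the abstract above is concrete: the square-root velocity function ψ = sqrt(γ′) sends a warping function γ (increasing, with γ(0) = 0 and γ(1) = 1) to the unit Hilbert sphere, because ∫ψ² = γ(1) − γ(0) = 1. A minimal numerical check of that fact; the principal nested spheres analysis itself is not sketched here.

```python
import numpy as np
from scipy.integrate import trapezoid

t = np.linspace(0.0, 1.0, 501)

def srvf_of_warp(gamma, t):
    """Square-root velocity of a warping function: psi = sqrt(gamma')."""
    dgamma = np.gradient(gamma, t)
    return np.sqrt(np.maximum(dgamma, 0.0))

# A valid warp: strictly increasing and fixing both endpoints.
gamma = t + 0.1 * np.sin(2 * np.pi * t)
psi = srvf_of_warp(gamma, t)

# The squared L2 norm is ~1, so psi lies on the unit Hilbert sphere.
print(trapezoid(psi**2, t))
```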
10. Bump hunting by topological data analysis
- Author
-
Giseon Heo, Peter T. Kim, James Stephen Marron, Stephen Rush, and Max Sommerfeld
- Subjects
Statistics and Probability, Persistent homology, Kernel density estimation, Zero (complex analysis), Data set, Topological data analysis, Statistical physics, Mathematics, Bump hunting - Abstract
A topological data analysis approach is taken to the challenging problem of finding and validating the statistical significance of local modes in a data set. As with the SIgnificance of the ZERo (S ...
- Published
- 2017
11. Modeling seasonality and serial dependence of electricity price curves with warping functional autoregressive dynamics
- Author
-
James Stephen Marron, Jiejie Zhang, and Ying Chen
- Subjects
Statistics and Probability, Electricity price, Seasonal functional time series, warping function, Econometrics, Image warping, Karcher mean, Mathematics, Seasonality, Amplitude, Autoregressive model, Modeling and Simulation, Electricity, Smoothing, Serial dependence - Abstract
Electricity prices are high dimensional, serially dependent and have seasonal variations. We propose a Warping Functional AutoRegressive (WFAR) model that simultaneously accounts for the cross time-dependence and seasonal variations of the large dimensional data. In particular, electricity price curves are obtained by smoothing over the 24 discrete hourly prices on each day. In the functional domain, seasonal phase variations are separated from level amplitude changes in a warping process with the Fisher–Rao distance metric, and the aligned (season-adjusted) electricity price curves are modeled in the functional autoregression framework. In a real application, the WFAR model provides superior out-of-sample forecast accuracy in both a normal functioning market, Nord Pool, and an extreme situation, the California market. The forecast performance as well as the relative accuracy improvement are stable for different markets and different time periods.
- Published
- 2019
12. Direction-Projection-Permutation for High-Dimensional Hypothesis Tests
- Author
-
Susan Wei, Lindsay B. Wichers, James Stephen Marron, and Chihoon Lee
- Subjects
Statistics and Probability, Binary number, Pattern recognition, Linear classifier, High dimensional, Machine learning, Sample size determination, Resampling, Discrete Mathematics and Combinatorics, Artificial intelligence, Classifier (UML), Interpretability, Statistical hypothesis testing, Mathematics - Abstract
High-dimensional low sample size (HDLSS) data are becoming increasingly common in statistical applications. When the data can be partitioned into two classes, a basic task is to construct a classifier that can assign objects to the correct class. Binary linear classifiers have been shown to be especially useful in HDLSS settings and preferable to more complicated classifiers because of their ease of interpretability. We propose a computational tool called direction-projection-permutation (DiProPerm), which rigorously assesses whether a binary linear classifier is detecting statistically significant differences between two high-dimensional distributions. The basic idea behind DiProPerm involves working directly with the one-dimensional projections of the data induced by the binary linear classifier. Theoretical properties of DiProPerm are studied under the HDLSS asymptotic regime whereby dimension diverges to infinity while sample size remains fixed. We show that certain variations of DiProPerm are consistent ...
- Published
- 2016
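A minimal sketch of the DiProPerm recipe from the abstract above, using the mean-difference direction and an absolute two-sample t-statistic as the projected-data summary; other directions and summaries are possible, and this is not the authors' reference implementation. Note that the direction is recomputed for every permuted labeling.

```python
import numpy as np

def diproperm(X, y, n_perm=1000, seed=0):
    """Direction-Projection-Permutation test for two-class HDLSS data.

    X : (n, d) data matrix; y : (n,) labels in {0, 1}.
    Direction: mean difference.  Summary: absolute two-sample t-statistic.
    """
    rng = np.random.default_rng(seed)

    def stat(labels):
        w = X[labels == 1].mean(0) - X[labels == 0].mean(0)
        w /= np.linalg.norm(w) + 1e-12                    # direction
        p1, p0 = X[labels == 1] @ w, X[labels == 0] @ w   # projection
        se = np.sqrt(p1.var(ddof=1) / len(p1) + p0.var(ddof=1) / len(p0))
        return abs(p1.mean() - p0.mean()) / (se + 1e-12)

    observed = stat(y)
    perms = np.array([stat(rng.permutation(y)) for _ in range(n_perm)])
    pval = (1 + np.sum(perms >= observed)) / (1 + n_perm)
    return observed, pval
```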
13. Visualization of robust L1PCA
- Author
-
Yi-Hui Zhou and James Stephen Marron
- Subjects
Statistics and Probability, Computer science, Pattern recognition, Visualization, Robustness (computer science), Outlier, Artificial intelligence - Published
- 2016
14. A note on automatic data transformation
- Author
-
Qing Feng, Jan Hannig, and James Stephen Marron
- Subjects
Computer and information sciences, Statistics and Probability, Heteroscedasticity, Logarithm, Computer science, Gaussian, Data transformation (statistics), Methodology (stat.ME), Transformation (function), Skewness, Test statistic, Marginal distribution, Algorithm - Abstract
Modern data analysis frequently involves variables with highly non-Gaussian marginal distributions. However, commonly used analysis methods are most effective with roughly Gaussian data. This paper introduces an automatic transformation that improves the closeness of distributions to normality. For each variable, a new family of parametrizations of the shifted logarithm transformation is proposed, which is unique in treating the data as real valued and in allowing transformation for both left and right skewness within the single family. This also allows an automatic selection of the parameter value (which is crucial for high-dimensional data with many variables to transform) by minimizing the Anderson–Darling test statistic of the transformed data. An application to image features extracted from melanoma microscopy slides demonstrates the utility of the proposed transformation in addressing data with excessive skewness, heteroscedasticity and influential observations.
- Published
- 2016
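A minimal sketch of the selection idea described above: apply a shifted-log transformation over a grid of shift parameters and keep the parameter minimizing the Anderson–Darling normality statistic. The one-sided parametrization below is a simplification of the paper's two-sided family, and the grid of candidate shifts is an illustrative assumption.

```python
import numpy as np
from scipy.stats import anderson

def shifted_log(x, beta):
    """Shifted logarithm for right-skewed data; beta > 0 controls the shift."""
    return np.log(x - x.min() + beta)

def auto_transform(x, betas=np.logspace(-3, 3, 61)):
    """Pick the shift minimizing the Anderson-Darling normality statistic."""
    best_beta, best_stat = None, np.inf
    for b in betas:
        a = anderson(shifted_log(x, b), dist='norm').statistic
        if a < best_stat:
            best_beta, best_stat = b, a
    return shifted_log(x, best_beta), best_beta

rng = np.random.default_rng(1)
x = rng.lognormal(size=500)        # strongly right-skewed example data
xt, beta = auto_transform(x)       # xt is much closer to Gaussian
```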
15. Data Science vs. Statistics: Two Cultures?
- Author
-
James Stephen Marron and Iain Carmichael
- Subjects
Computer and information sciences, Statistics and Probability, Computer science, Other Statistics (stat.OT), Computation, Exploratory analysis, Data science, Term (time), Computational Theory and Mathematics, Statistics - Abstract
Data science is the business of learning from data, which is traditionally the business of statistics. Data science, however, is often understood as a broader, task-driven and computationally-oriented version of statistics. Both the term data science and the broader idea it conveys have origins in statistics and are a reaction to a narrower view of data analysis. Expanding upon the views of a number of statisticians, this paper encourages a big-tent view of data analysis. We examine how evolving approaches to modern data analysis relate to the existing discipline of statistics (e.g. exploratory analysis, machine learning, reproducibility, computation, communication and the role of theory). Finally, we discuss what these trends mean for the future of statistics by highlighting promising directions for communication, education and research.
- Published
- 2017
16. The limits of multiplexing
- Author
-
James Stephen Marron, Dirk P. Dittmer, and Dan Shen
- Subjects
Statistics and Probability, Physics, Small volume, Solution hybridization, Multiplex, Bioinformatics, Algorithm, Multiplexing, Upper and lower bounds - Abstract
We were motivated by three novel technologies, which exemplify a new design paradigm in high-throughput genomics: NanoString™; DNA-mediated Annealing, Selection, extension, and Ligation (DASL™); and multiplex real-time quantitative polymerase chain reaction (QPCR). All three are solution-hybridization based, and all three employ 10–1000 DNA sequence probes in a small volume, each probe specific for a particular sequence in a different human gene. NanoString™ uses 50-mer probes; DASL and multiplex QPCR use ∼20-mer probes. Assuming a 1-nM probe concentration in a 1 μL volume, there are 10⁻⁹ × 10⁻⁹ × 6.23 × 10²³ or 6.23 × 10⁵ molecules of each probe present in the reaction, compared to 10–1000 target molecules. Excess probe drives the sensitivity of the reaction. We are interested in the limits of multiplexing, i.e., the probability that in such a design a particular probe would bind to any other, sequence-related probe rather than the intended, specific target. If this were to happen with appreciable frequency, it would result in much reduced sensitivity and potential failure of this design. We established upper and lower bounds for the probability that in a multiplex assay at least one probe would bind to another sequence-related probe rather than its cognate target. These bounds are reassuring, because for reasonable degrees of multiplexing (10³ probes) the probability of such an event is practically negligible. As the degree of multiplexing increases to ∼10⁶ probes, our theoretical boundaries gain practical importance and establish a principal upper limit for the use of highly multiplexed solution-based assays vis-à-vis solid-support anchored designs. WIREs Comput Stat 2015, 7:394–399. doi: 10.1002/wics.1364
- Published
- 2015
17. Nested nonnegative cone analysis
- Author
-
Shu Lu, James Stephen Marron, and Lingsong Zhang
- Subjects
Computer and information sciences, Statistics and Probability, Discrete mathematics, Rank (linear algebra), Applied Mathematics, Linear subspace, Machine Learning (cs.LG), Non-negative matrix factorization, Methodology (stat.ME), Computational Mathematics, Matrix (mathematics), Computational Theory and Mathematics, Factorization, Principal component analysis, Singular value decomposition, Algorithm, Mathematics, Interpretability - Abstract
Motivated by the analysis of nonnegative data objects, a novel Nested Nonnegative Cone Analysis (NNCA) approach is proposed to overcome some drawbacks of existing methods. The application of traditional PCA/SVD methods to nonnegative data often causes the approximation matrix to leave the nonnegative cone, which leads to non-interpretable and sometimes nonsensical results. The nonnegative matrix factorization (NMF) approach overcomes this issue; however, the NMF approximation matrices suffer from several drawbacks: 1) the factorization may not be unique, 2) the resulting approximation matrix at a specific rank may not be unique, and 3) the subspaces spanned by the approximation matrices at different ranks may not be nested. These drawbacks cause trouble in determining the number of components and in multi-scale (in rank) interpretability. The NNCA approach proposed in this paper naturally generates a nested structure, and is shown to be unique at each rank. Simulations are used in this paper to illustrate the drawbacks of the traditional methods, and the usefulness of the NNCA method.
- Published
- 2015
18. Non-linear Hypothesis Testing of Geometric Object Properties of Shapes Applied to Hippocampi
- Author
-
Fred Godtliebsen, Stephen M. Pizer, James Stephen Marron, and Jörn Schulz
- Subjects
Statistics and Probability, Boundary (topology), Topology, Resampling, Euclidean geometry, Sensitivity (control systems), Representation (mathematics), Mathematics, Statistical hypothesis testing, Applied Mathematics, Pattern recognition, Condensed Matter Physics, Object (computer science), Nonlinear system, Modeling and Simulation, Geometry and Topology, Computer Vision and Pattern Recognition, Artificial intelligence - Abstract
This paper presents a novel method to test mean differences of geometric object properties (GOPs). The method is designed for data whose representations include both Euclidean and non-Euclidean elements. It is based on advanced statistical analysis methods such as backward means on spheres. We develop a suitable permutation test to find global and simultaneously individual morphological differences between two populations based on the GOPs. To demonstrate the sensitivity of the method, an analysis exploring differences between hippocampi of first-episode schizophrenics and controls is presented. Each hippocampus is represented by a discrete skeletal representation (s-rep). We investigate important model properties using the statistics of populations. These properties are highlighted by the s-rep model that allows accurate capture of the object interior and boundary while, by design, being suitable for statistical analysis of populations of objects. By supporting non-Euclidean GOPs such as direction vectors, the proposed hypothesis test is novel in the study of morphological shape differences. Suitable difference measures are proposed for each GOP. Both global and simultaneous GOP analyses showed statistically significant differences between the first-episode schizophrenics and controls.
- Published
- 2015
19. Least squares sieve estimation of mixture distributions with boundary effects
- Author
-
Ling Wang, Guang Guo, James Stephen Marron, Mihee Lee, Haipeng Shen, and Peter Hall
- Subjects
Statistics and Probability, Mean squared error, Sieve estimator, Estimator, M-estimator, Empirical distribution function, Sieve, Statistics, Mixture distribution, Applied mathematics, Probability distribution, Mathematics - Abstract
In this study, we propose two types of sieve estimators, based on least squares (LS), for probability distributions that are mixtures of a finite number of discrete atoms and a continuous distribution under the framework of measurement error models. This research is motivated by the maximum likelihood (ML) sieve estimator developed in Lee et al. (2013). We obtain two types of LS sieve estimators through minimizing the distance between the empirical distribution/characteristic functions and the model distribution/characteristic functions. The LS estimators outperform the ML sieve estimator in several aspects: (1) they need much less computational time; (2) they give smaller integrated mean squared error; (3) the characteristic function based LS estimator is more robust against mis-specification of the error distribution. We also use roughness penalization to improve the smoothness of the resulting estimators and reduce the estimation variance. As an application of our proposed LS estimators, we use the Framingham Heart Study data to investigate the distribution of genetic effects on body mass index. Finally, asymptotic properties of the LS estimators are investigated.
- Published
- 2015
20. Analysis of Rotational Deformations From Directional Data
- Author
-
Sungkyu Jung, Jörn Schulz, Stephen M. Pizer, James Stephen Marron, Stephan Huckemann, and Michael R. Pierrynowski
- Subjects
Statistics and Probability, Unit sphere, Computer science, Directional statistics, Bending, Object (computer science), Image (mathematics), Simple (abstract algebra), Discrete Mathematics and Combinatorics, Computer vision, Artificial intelligence, Algorithm, Circle of a sphere - Abstract
This article discusses a novel framework to analyze rotational deformations of real three-dimensional objects. The rotational deformations such as twisting or bending have been observed as the major variation in some medical applications, where the features of the deformed three-dimensional objects are directional data. We propose modeling and estimation of the global deformations in terms of generalized rotations of directions. The proposed method can be cast as a generalized small circle fitting on the unit sphere. We also discuss the estimation of descriptors for more complex deformations composed of two simple deformations. The proposed method can be used for a number of different three-dimensional object models. Two analyses of three-dimensional object data are presented in detail: one using skeletal representations in medical image analysis and the other from biomechanical gait analysis of the knee joint. Supplementary materials for this article are available online.
- Published
- 2015
21. Distance-weighted discrimination
- Author
-
James Stephen Marron
- Subjects
Statistics and Probability, Optimization problem, Computer science, Property (programming), Direction vector, Machine learning, Support vector machine, Data visualization, Class differences, Key (cryptography), Generalizability theory, Artificial intelligence - Abstract
Distance-weighted discrimination is a discrimination (classification) method. Like the popular support vector machine, it is rooted in optimization; however, the underlying optimization problem is modified to give better generalizability, particularly in high dimensions. The two key ideas are that distance-weighted discrimination directly targets the data piling problem and also correctly handles unknown, unbalanced subclasses in the data. A useful property of distance-weighted discrimination, beyond just good classification performance, is that it provides a direction vector in high-dimensional data space with several purposes, including indication of driving phenomena behind class differences, data visualization, and batch adjustment tasks. WIREs Comput Stat 2015, 7:109-114. doi: 10.1002/wics.1345
- Published
- 2015
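The modified optimization mentioned above can be written compactly: instead of maximizing the minimum margin as the support vector machine does, DWD minimizes the sum of reciprocal margins plus a penalty on slack variables. A minimal convex-programming sketch using cvxpy; this is an illustrative formulation, not the reference second-order cone programming solver, and the penalty constant C is an arbitrary choice.

```python
import cvxpy as cp
import numpy as np

def dwd_direction(X, y, C=100.0):
    """Distance-weighted discrimination, sketch formulation.

    X : (n, d) data; y : (n,) labels in {-1, +1}.
    minimize  sum(1/r) + C*sum(xi)
    subject to  r = y*(X w + b) + xi,  xi >= 0,  ||w||_2 <= 1
    """
    n, d = X.shape
    w, b = cp.Variable(d), cp.Variable()
    xi = cp.Variable(n, nonneg=True)
    r = cp.multiply(y, X @ w + b) + xi          # residual margins (must be > 0)
    objective = cp.Minimize(cp.sum(cp.inv_pos(r)) + C * cp.sum(xi))
    cp.Problem(objective, [cp.norm(w, 2) <= 1]).solve()
    return w.value, b.value
```

The returned w is the DWD-style direction vector; projecting data onto it supports the visualization and batch-adjustment uses described in the abstract.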
22. Distance weighted discrimination of face images for gender classification
- Author
-
Eduardo García-Portugués, James Stephen Marron, Daniel Peña, and Mónica Benito
- Subjects
Statistics and Probability, Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Feature extraction, Context (language use), 62H30, 62H35, Statistics - Applications, Methodology (stat.ME), Dimension (vector space), Pattern recognition, Linear discriminant analysis, Support vector machine, Discriminant, Sample size determination, Principal component analysis, Artificial intelligence - Abstract
We illustrate the advantages of distance weighted discrimination for classification and feature extraction in a High Dimension Low Sample Size (HDLSS) situation. The HDLSS context is a gender classification problem of face images in which the dimension of the data is several orders of magnitude larger than the sample size. We compare distance weighted discrimination with Fisher's linear discriminant, support vector machines, and principal component analysis by exploring their classification interpretation through insightful visuanimations and by examining the classifiers' discriminant errors. This analysis enables us to make new contributions to the understanding of the drivers of human discrimination between males and females.
- Published
- 2017
23. Object-Oriented Data Analysis of Cell Images
- Author
-
Perry D. Haaland, James Stephen Marron, and Xiaosun Lu
- Subjects
Statistics and Probability, Set (abstract data type), Object-oriented programming, Information retrieval, Point (typography), Process (engineering), Container (abstract data type), Statistical analysis, Data objects, Algorithm, Mathematics, Image (mathematics) - Abstract
This article discusses a study of cell images in cell culture biology from an object-oriented point of view. The motivation of this research is to develop a statistical approach to cell image analysis that better supports the automated development of stem cell growth media. A major hurdle in this process is the need for human expertise, based on studying cells under the microscope, to make decisions about the next step of the cell culture process. We aim to use digital imaging technology coupled with statistical analysis to tackle this important problem. The discussion in this article highlights a common critical issue: choice of data objects. Instead of conventionally treating either the individual cells or the wells (a container in which the cells are grown) as data objects, a new type of data object is proposed: the union of a well with its corresponding set of cells. The image data analysis suggests that the cell–well unions can be a better choice of data objects than the cells or the wells al...
- Published
- 2014
24. Asymptotics of hierarchical clustering for growing dimension
- Author
-
Petro Borysov, James Stephen Marron, and Jan Hannig
- Subjects
Statistics and Probability, Clustering high-dimensional data, Numerical Analysis, Single-linkage clustering, Correlation clustering, Hierarchical clustering, Determining the number of clusters in a data set, CURE data clustering algorithm, Consensus clustering, Data mining, Cluster analysis, Mathematics - Abstract
Modern day science presents many challenges to data analysts. Advances in data collection provide very large data sets, both in the number of observations and in the number of dimensions. In many areas of data analysis an informative task is to find natural separations of the data into homogeneous groups, i.e., clusters. In this paper we study the asymptotic behavior of hierarchical clustering in situations where both sample size and dimension grow to infinity. We derive explicit signal vs. noise boundaries between different types of clustering behaviors. We also show that the clustering behavior within the boundaries is the same across a wide spectrum of asymptotic settings.
- Published
- 2014
25. Tree-Oriented Analysis of Brain Artery Structure
- Author
-
Elizabeth Bullitt, Sean Skwerer, Vic Patrangenaru, Stephan Huckemann, Ipek Oguz, Megan Owen, James Stephen Marron, Ezra Miller, and Scott Provan
- Subjects
Statistics and Probability, Phylogenetic tree, Structure (category theory), Pattern recognition, Condensed Matter Physics, Magnetic resonance angiography, Data set, Combinatorics, Data point, Modeling and Simulation, Pairwise comparison, Geometry and Topology, Computer Vision and Pattern Recognition, Multidimensional scaling, Tree (set theory), Artificial intelligence, Applied Mathematics, Mathematics - Abstract
Statistical analysis of magnetic resonance angiography (MRA) brain artery trees is performed using two methods for mapping brain artery trees to points in phylogenetic treespace: cortical landmark correspondence and descendant correspondence. The differences in end-results based on these mappings are highlighted to emphasize the importance of correspondence in tree-oriented data analysis. Representation of brain artery systems as points in phylogenetic treespace, a mathematical space developed in (Billera et al., Adv. Appl. Math. 27:733–767, 2001), facilitates this analysis. The phylogenetic treespace is a rich setting for tree-oriented data analysis. The Fréchet sample mean or an approximation is reported. Multidimensional scaling is used to explore structure in the data set based on pairwise distances between data points. This analysis of MRA data shows a statistically significant effect of age and sex on brain artery structure. Variation in the proximity of brain arteries to the cortical surface results in strong statistical difference between sexes and statistically significant age effect. That particular observation is possible with cortical correspondence but did not show up in the descendant correspondence.
- Published
- 2014
26. Discussion: A Spatial Modeling Approach for Linguistic Object Data: Analysing Dialect Sound Variations Across Great Britain, by Shahin Tavakoli et al
- Author
-
James Stephen Marron
- Subjects
Statistics and Probability, Geography, Computer science, Object (philosophy), Sound (geography), Linguistics - Published
- 2019
27. dr4pl: A Stable Convergence Algorithm for the 4 Parameter Logistic Model
- Author
-
Hyowon An, James Stephen Marron, Aubrey Bailey, Justin T. Landis, and Dirk P. Dittmer
- Subjects
Statistics and Probability, Numerical Analysis, Convergence (routing), Applied mathematics, Logistic regression, Mathematics - Published
- 2019
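This entry carries no abstract, but the four-parameter logistic (4PL) model itself is standard: f(x) = d + (a − d)/(1 + (x/c)^b), with asymptotes a and d, inflection point c, and slope b. A minimal fitting sketch with scipy; dr4pl is an R package, and its stabilized convergence algorithm is not reproduced here. All data below are simulated for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def four_pl(x, a, b, c, d):
    """4-parameter logistic: asymptotes a and d, slope b, inflection point c."""
    return d + (a - d) / (1.0 + (x / c) ** b)

# Simulated dose-response data (illustrative only).
rng = np.random.default_rng(2)
dose = np.logspace(-2, 2, 20)
resp = four_pl(dose, 1.0, 1.5, 3.0, 0.1) + 0.02 * rng.normal(size=dose.size)

# Naive starting values; poor choices here are exactly what can make
# plain least-squares fits of the 4PL model fail to converge.
popt, _ = curve_fit(four_pl, dose, resp, p0=[1.0, 1.0, 1.0, 0.0], maxfev=10000)
```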
28. Multiresolution anomaly detection method for fractional Gaussian noise
- Author
-
Lingsong Zhang, James Stephen Marron, and Zhengyuan Zhu
- Subjects
Statistics and Probability, Series (mathematics), Detector, Power (physics), Gaussian noise, Outlier, Anomaly detection, Network intrusion detection, Data mining, Algorithm, Mathematics - Abstract
Driven by network intrusion detection, we propose a MultiResolution Anomaly Detection (MRAD) method, which effectively utilizes the multiscale properties of Internet features and network anomalies. In this paper, several theoretical properties of the MRAD method are explored. A major new result is the mathematical formulation of the notion that a two-scaled MRAD method has larger power than the average power of the detection method based on the given two scales. A test threshold is also developed. Comparisons between the MRAD method and other classical outlier detectors in time series are reported as well.
- Published
- 2013
29. Backwards Principal Component Analysis and Principal Nested Relations
- Author
-
James Stephen Marron and James Damon
- Subjects
Statistics and Probability, Sequence, Pure mathematics, Geodesic, Applied Mathematics, Principal (computer security), Mathematical analysis, Analogy, Condensed Matter Physics, Linear subspace, Modeling and Simulation, Principal component analysis, Geometry and Topology, Computer Vision and Pattern Recognition, Principal geodesic analysis, Mathematics - Abstract
In non-Euclidean data spaces represented by manifolds (or more generally stratified spaces), analogs of principal component analysis can be more easily developed using a backwards approach. There has been a gradual evolution in the application of this idea from using increasing geodesic subspaces of submanifolds in analogy with PCA to using a "backward sequence" of a decreasing family of subspaces. We provide a version of the backwards approach by using a "nested sequence of relations" which define the decreasing sequences of subspaces which need not be geodesic. Because these are naturally inductively added in a backward sequence, they are frequently more tractable and overcome difficulties with using geodesics.
- Published
- 2013
30. Multiclass Distance-Weighted Discrimination
- Author
-
James Stephen Marron, Ying Du, Yufeng Liu, Charles M. Perou, Michael J. Todd, Hanwen Huang, and D. Neil Hayes
- Subjects
Statistics and Probability, Computer science, Fisher consistency, Binary number, Pattern recognition, Machine learning, Multiclass classification, Support vector machine, Discrete Mathematics and Combinatorics, Artificial intelligence, MATLAB, Classifier (UML) - Abstract
In this article, we extend the binary distance-weighted discrimination (DWD) to the multiclass case. In addition to the usual extensions that combine several binary DWD classifiers, we propose a global multiclass DWD (MDWD) that finds a single classifier that considers all classes at once. Our theoretical results show that MDWD is Fisher consistent, even in the particularly challenging case when there is no dominating class, that is, a class with probability bigger than 0.5. The performance of different multiclass DWD methods is assessed through simulation studies and application to real microarray datasets. Comparison with the support vector machines is also provided. MATLAB implementation of the proposed methods is given in the online supplementary materials.
- Published
- 2013
31. Comparison of binary discrimination methods for high dimension low sample size data
- Author
-
James Stephen Marron and A. Bolivar-Cime
- Subjects
Statistics and Probability, Clustering high-dimensional data, Numerical Analysis, Asymptotic analysis, Gaussian, Centroid, Context (language use), Naive Bayes classifier, Dimension (vector space), Sample size determination, Statistics, Applied mathematics, Mathematics - Abstract
A comparison of some binary discrimination methods is done in the high dimension low sample size context for Gaussian data with common diagonal covariance matrix. In particular we obtain results about the asymptotic behavior of the methods Support Vector Machine, Mean Difference (i.e. Centroid Rule), Distance Weighted Discrimination, Maximal Data Piling and Naive Bayes when the dimension d of the data sets tends to infinity and the sample sizes of the classes are fixed. It is concluded that, under appropriate conditions, the first four methods are asymptotically equivalent, but the Naive Bayes method can have a different asymptotic behavior when d tends to infinity.
- Published
- 2013
32. The Statistics and Mathematics of High Dimension Low Sample Size Asymptotics
- Author
-
Dan Shen, James Stephen Marron, Haipeng Shen, and Hongtu Zhu
- Subjects
Statistics and Probability, Population, Asymptotic distribution, Sample (statistics), Covariance, Dimension (vector space), Sample size determination, Statistics, Principal component analysis, High-dimensional statistics, Mathematics - Abstract
The aim of this paper is to establish several deep theoretical properties of principal component analysis for multiple-component spike covariance models. Our new results reveal an asymptotic conical structure in critical sample eigendirections under the spike models with distinguishable (or indistinguishable) eigenvalues, when the sample size and/or the number of variables (or dimension) tend to infinity. The consistency of the sample eigenvectors relative to their population counterparts is determined by the ratio between the dimension and the product of the sample size with the spike size. When this ratio converges to a nonzero constant, the sample eigenvector converges to a cone, with a certain angle to its corresponding population eigenvector. In the High Dimension, Low Sample Size case, the angle between the sample eigenvector and its population counterpart converges to a limiting distribution. Several generalizations of the multi-spike covariance models are also explored, and additional theoretical results are presented.
- Published
- 2016
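A small simulation, assuming a single-spike covariance, illustrates the result described above: the angle between the leading sample and population eigenvectors is driven by the ratio of the dimension to the product of sample size and spike size. All parameter choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

def leading_angle(n, d, spike):
    """Angle (degrees) between sample and population leading eigenvectors,
    single-spike model: covariance = I + (spike - 1) * e1 e1^T."""
    v = np.zeros(d); v[0] = 1.0                 # population eigenvector
    X = rng.normal(size=(n, d))
    X[:, 0] *= np.sqrt(spike)                   # inject the spike
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    cos = abs(Vt[0] @ v)
    return np.degrees(np.arccos(min(cos, 1.0)))

# HDLSS regime: sample size fixed, dimension growing.
for d in [100, 1000, 10000]:
    print(d, leading_angle(n=20, d=d, spike=d**0.8))
```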
33. 'Virus hunting' using radial distance weighted discrimination
- Author
-
Jie Xiong, Dirk P. Dittmer, and James Stephen Marron
- Subjects
Statistics and Probability, nonlinear classification, high-dimension low-sample size data analysis, Computer and information sciences, Computer science, Boundary (topology), Statistics - Applications, Quantitative Biology - Quantitative Methods, Data set, Support vector machine, Nonlinear system, Kernel (linear algebra), Binary classification, Modeling and Simulation, Biological sciences, Classifier (linguistics), Generalizability theory, DNA sequencing, Algorithm, Virus hunting - Abstract
Motivated by the challenge of using DNA-seq data to identify viruses in human blood samples, we propose a novel classification algorithm called "Radial Distance Weighted Discrimination" (or Radial DWD). This classifier is designed for binary classification, assuming one class is surrounded by the other class in very diverse radial directions, which is seen to be typical for our virus detection data. This separation of the two classes in multiple radial directions naturally motivates the development of Radial DWD. While classical machine learning methods such as the Support Vector Machine and linear Distance Weighted Discrimination can sometimes give reasonable answers for a given data set, their generalizability is severely compromised because of the linear separating boundary. Radial DWD addresses this challenge by using a more appropriate (in this particular case) spherical separating boundary. Simulations show that for appropriate radial contexts, this gives much better generalizability than linear methods, and also much better than conventional kernel based (nonlinear) Support Vector Machines, because the latter methods essentially use much of the information in the data for determining the shape of the separating boundary. The effectiveness of Radial DWD is demonstrated for real virus detection.
- Published
- 2016
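The geometry described above, one class surrounding the other in many radial directions, can be illustrated with a toy spherical-boundary classifier: estimate a center from the inner class and classify by radial distance against a threshold. This is only a caricature of Radial DWD, which chooses the boundary by a DWD-type criterion; the median-based threshold below is an arbitrary illustrative rule.

```python
import numpy as np

class SphericalBoundaryClassifier:
    """Toy radial classifier: inner class (0) inside a sphere, outer class (1) outside."""

    def fit(self, X, y):
        inner = X[y == 0]
        self.center_ = inner.mean(axis=0)
        r_in = np.linalg.norm(inner - self.center_, axis=1)
        r_out = np.linalg.norm(X[y == 1] - self.center_, axis=1)
        # Threshold halfway between typical inner and outer radii.
        self.radius_ = 0.5 * (np.median(r_in) + np.median(r_out))
        return self

    def predict(self, X):
        r = np.linalg.norm(X - self.center_, axis=1)
        return (r > self.radius_).astype(int)
```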
34. Long-range dependence analysis of Internet traffic
- Author
-
James Stephen Marron, Michele Trovero, F. D. Smith, Zhengyuan Zhu, Long Le, Cheolwoo Park, Vladas Pipiras, Richard Smith, Félix Hernández-Campos, and Juhyun Park
- Subjects
Statistics and Probability, Hurst exponent, Series (mathematics), Computer science, Context (language use), Internet traffic, Dependence analysis, Range (mathematics), Statistical analysis, Data mining, Scaling - Abstract
Long-range-dependent time series are endemic in the statistical analysis of Internet traffic. The Hurst parameter provides a good summary of important self-similar scaling properties. We compare a number of different Hurst parameter estimation methods and some important variations. This is done in the context of a wide range of simulated, laboratory-generated, and real data sets. Important differences between the methods are highlighted. Deep insights are revealed on how well the laboratory data mimic the real data. Non-stationarities, which are local in time, are seen to be central issues and lead to both conceptual and practical recommendations.
- Published
- 2010
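One classical Hurst-parameter estimator of the kind compared in the paper above is the aggregated-variance method: for a long-range-dependent series, the variance of block means of size m scales like m^(2H−2), so H can be read off a log-log regression. A minimal sketch of that single estimator; the block-size grid is an illustrative choice.

```python
import numpy as np

def hurst_aggregated_variance(x, block_sizes=None):
    """Aggregated-variance Hurst estimate: Var(block means) ~ m^(2H - 2)."""
    n = len(x)
    if block_sizes is None:
        block_sizes = np.unique(np.logspace(1, np.log10(n // 10), 20).astype(int))
    log_m, log_v = [], []
    for m in block_sizes:
        k = n // m
        means = x[:k * m].reshape(k, m).mean(axis=1)   # non-overlapping block means
        if len(means) > 1 and means.var() > 0:
            log_m.append(np.log(m))
            log_v.append(np.log(means.var()))
    slope = np.polyfit(log_m, log_v, 1)[0]
    return 1.0 + slope / 2.0                           # slope = 2H - 2
```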
35. Biclustering via Sparse Singular Value Decomposition
- Author
-
James Stephen Marron, Jianhua Z. Huang, Mihee Lee, and Haipeng Shen
- Subjects
Statistics and Probability, Lung Neoplasms, Databases, Factual, Iterative method, Regularization (mathematics), General Biochemistry, Genetics and Molecular Biology, Biclustering, Matrix (mathematics), Singular value decomposition, Cluster Analysis, Humans, Nutritional Physiological Phenomena, Mathematics, General Immunology and Microbiology, Applied Mathematics, Dimensionality reduction, Linear model, Pattern recognition, General Medicine, Principal component analysis, Linear Models, Artificial intelligence, General Agricultural and Biological Sciences, Algorithms - Abstract
Sparse singular value decomposition (SSVD) is proposed as a new exploratory analysis tool for biclustering or identifying interpretable row-column associations within high-dimensional data matrices. SSVD seeks a low-rank, checkerboard structured matrix approximation to data matrices. The desired checkerboard structure is achieved by forcing both the left- and right-singular vectors to be sparse, that is, having many zero entries. By interpreting singular vectors as regression coefficient vectors for certain linear regressions, sparsity-inducing regularization penalties are imposed to the least squares regression to produce sparse singular vectors. An efficient iterative algorithm is proposed for computing the sparse singular vectors, along with some discussion of penalty parameter selection. A lung cancer microarray dataset and a food nutrition dataset are used to illustrate SSVD as a biclustering method. SSVD is also compared with some existing biclustering methods using simulated datasets.
- Published
- 2010
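The iterative algorithm described above alternates between the two singular vectors, soft-thresholding each update to induce sparsity. A minimal rank-one sketch with fixed thresholds; the paper's adaptive penalty-parameter selection is omitted here.

```python
import numpy as np

def soft(z, lam):
    """Soft-thresholding operator."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ssvd_rank1(X, lam_u=0.1, lam_v=0.1, n_iter=100):
    """Sparse rank-one SVD layer via alternating soft-thresholded updates."""
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    u, v = U[:, 0], Vt[0]                  # dense SVD as the starting point
    for _ in range(n_iter):
        v = soft(X.T @ u, lam_v)           # sparse right singular vector
        v /= np.linalg.norm(v) + 1e-12
        u = soft(X @ v, lam_u)             # sparse left singular vector
        u /= np.linalg.norm(u) + 1e-12
    d = u @ X @ v                          # singular value of the sparse pair
    return u, d, v

# The nonzero patterns of u and v pick out a row-column "checkerboard" block,
# i.e., one bicluster; subtract d * np.outer(u, v) and repeat for more layers.
```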
36. The maximal data piling direction for discrimination
- Author
-
Jeongyoun Ahn and James Stephen Marron
- Subjects
Statistics and Probability, Class (set theory), Applied Mathematics, General Mathematics, Linear discriminant analysis, Direction vector, Agricultural and Biological Sciences (miscellaneous), Support vector machine, Projection (mathematics), Discriminant, Sample size determination, Statistics, General Agricultural and Biological Sciences, Algorithm, Subspace topology, Mathematics - Abstract
We study a discriminant direction vector that generally exists only in high-dimension, low sample size settings. Projections of data onto this direction vector take on only two distinct values, one for each class. There exist infinitely many such directions in the subspace generated by the data; but the maximal data piling vector has the longest distance between the projections. This paper investigates mathematical properties and classification performance of this discrimination method.
- Published
- 2010
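A hedged sketch of the maximal data piling direction, assuming the pseudo-inverse form in which the total-scatter covariance is applied to the class-mean difference; consult the paper for the precise statement and conditions.

```python
import numpy as np

def mdp_direction(X, y):
    """Maximal data piling direction (sketch): pinv(total scatter) @ mean difference.

    Assumed form for illustration; X is (n, d) with d >> n, y in {0, 1}.
    """
    Xc = X - X.mean(axis=0)                    # center at the grand mean
    S = Xc.T @ Xc / X.shape[0]                 # total sample covariance (d x d)
    diff = X[y == 1].mean(0) - X[y == 0].mean(0)
    w = np.linalg.pinv(S) @ diff
    return w / (np.linalg.norm(w) + 1e-12)

# In HDLSS data, the projections X @ w collapse (numerically) onto just
# two distinct values, one per class -- the "data piling" phenomenon.
```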
37. Robust centroid based classification with minimum error rates for high dimension, low sample size data
- Author
-
James Stephen Marron, Jiancheng Jiang, and Xuejun Jiang
- Subjects
Statistics and Probability, Applied Mathematics, Centroid, Word error rate, Boundary (topology), Direction vector, Linear discriminant analysis, Statistical classification, Dimension (vector space), Sample size determination, Statistics, Algorithm, Mathematics - Abstract
A new method of statistical classification (discrimination) is proposed. The method is most effective for high dimension, low sample size data. It uses a robust mean difference as the direction vector and locates the classification boundary by minimizing the error rates. Asymptotic results for assessment and comparison to several popular methods are obtained using a type of asymptotics in which the sample size stays finite while the dimension grows to infinity. The value of the proposed approach is demonstrated by simulations. Real data examples are used to illustrate the performance of different classification methods.
- Published
- 2009
38. Statistical Significance of Clustering for High-Dimension, Low-Sample Size Data
- Author
-
Andrew B. Nobel, Yufeng Liu, James Stephen Marron, and David N. Hayes
- Subjects
Statistics and Probability, Clustering high-dimensional data, Fuzzy clustering, Correlation clustering, Determining the number of clusters in a data set, CURE data clustering algorithm, Statistics, Consensus clustering, Data mining, Cluster analysis, k-medians clustering, Mathematics - Abstract
Clustering methods provide a powerful tool for the exploratory analysis of high-dimension, low–sample size (HDLSS) data sets, such as gene expression microarray data. A fundamental statistical issue in clustering is which clusters are “really there,” as opposed to being artifacts of the natural sampling variation. We propose SigClust as a simple and natural approach to this fundamental statistical problem. In particular, we define a cluster as data coming from a single Gaussian distribution and formulate the problem of assessing statistical significance of clustering as a testing procedure. This Gaussian null assumption allows direct formulation of p values that effectively quantify the significance of a given clustering. HDLSS covariance estimation for SigClust is achieved by a combination of invariance principles, together with a factor analysis model. The properties of SigClust are studied. Simulated examples, as well as an application to a real cancer microarray data set, show that the proposed method...
- Published
- 2008
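A minimal sketch of the SigClust logic described above: the test statistic is the 2-means cluster index (within-cluster sum of squares over total sum of squares), and its null distribution is simulated from a single Gaussian whose diagonal covariance matches the data eigenvalues, using the rotation invariance of the index. The paper's factor-analysis covariance estimation for HDLSS data is omitted here.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_index(X):
    """2-means cluster index: within-cluster SS / total SS (small = strong clustering)."""
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    total = ((X - X.mean(0)) ** 2).sum()
    return km.inertia_ / total

def sigclust(X, n_sim=100, seed=0):
    """Monte Carlo p-value for clustering against a single-Gaussian null."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    eigvals = np.maximum(np.linalg.eigvalsh(np.cov(X.T)), 0.0)
    observed = cluster_index(X)
    null = np.array([
        cluster_index(rng.normal(size=(n, d)) * np.sqrt(eigvals))
        for _ in range(n_sim)
    ])
    # Small cluster index = strong clustering, so count null values at least
    # as extreme (as small) as the observed one.
    return observed, (1 + np.sum(null <= observed)) / (1 + n_sim)
```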
39. Distance-Weighted Discrimination
- Author
-
James Stephen Marron, Michael J. Todd, and Jeongyoun Ahn
- Subjects
Statistics and Probability, Survey sampling, Linear discriminant analysis, Machine learning, Support vector machine, Range (mathematics), Sample size determination, Margin (machine learning), Econometrics, Generalizability theory, Artificial intelligence, Marginal distribution, Mathematics - Abstract
High-dimension low–sample size statistical analysis is becoming increasingly important in a wide range of applied contexts. In such situations, the popular support vector machine suffers from "data piling" at the margin, which can diminish generalizability. This leads naturally to the development of distance-weighted discrimination, which is based on second-order cone programming, a modern computationally intensive optimization method.
- Published
- 2007
40. The high-dimension, low-sample-size geometric representation holds under mild conditions
- Author
-
Jeongyoun Ahn, Keith M. Muller, James Stephen Marron, and Yueh-Yun Chi
- Subjects
Statistics and Probability, Pure mathematics, Simplex, Series (mathematics), Covariance matrix, Applied Mathematics, General Mathematics, Linear discriminant analysis, Agricultural and Biological Sciences (miscellaneous), Dimension (vector space), Sample size determination, Principal component analysis, Statistics, General Agricultural and Biological Sciences, Mathematics, Curse of dimensionality - Abstract
High-dimension, low-sample-size datasets have different geometrical properties from those of traditional low-dimensional data. In their asymptotic study regarding increasing dimensionality with a fixed sample size, Hall et al. (2005) showed that each data vector is approximately located on the vertices of a regular simplex in a high-dimensional space. A perhaps unappealing aspect of their result is the underlying assumption which requires the variables, viewed as a time series, to be almost independent. We establish an equivalent geometric representation under much milder conditions using asymptotic properties of sample covariance matrices. We discuss implications of the results, such as the use of principal component analysis in a high-dimensional space, extension to the case of nonindependent samples and also the binary classification problem.
- Published
- 2007
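The geometric representation described above is easy to see numerically: for fixed n and growing d, i.i.d. standard Gaussian data vectors become nearly equidistant, with pairwise distances concentrating around sqrt(2d), i.e., the points sit near the vertices of a regular simplex. A quick illustration:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(4)
n = 10                                   # sample size stays fixed
for d in [100, 10_000, 1_000_000]:       # dimension grows
    X = rng.normal(size=(n, d))
    dist = pdist(X) / np.sqrt(2 * d)     # normalized pairwise distances
    # Both extremes approach 1 as d grows: an (approximate) regular simplex.
    print(d, round(dist.min(), 3), round(dist.max(), 3))
```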
41. Functional Data Analysis of Amplitude and Phase Variation
- Author
-
James O. Ramsay, James Stephen Marron, Anuj Srivastava, and Laura M. Sangalli
- Subjects
FOS: Computer and information sciences ,Statistics and Probability ,Dynamic time warping ,Optimization problem ,Computer science ,Function space ,General Mathematics ,Functional data analysis ,warping ,alignment ,Data structure ,Methodology (stat.ME) ,Amplitude ,registration ,dynamic time warping ,Fisher–Rao metric ,Data structure alignment ,Principal component analysis ,elastic metric ,Statistics, Probability and Uncertainty ,Algorithm ,Statistics - Methodology - Abstract
The abundance of functional observations in scientific endeavors has led to a significant development in tools for functional data analysis (FDA). This kind of data comes with several challenges: infinite-dimensionality of function spaces, observation noise, and so on. However, there is another interesting phenomenon that creates problems in FDA. Functional data often come with lateral displacements/deformations in curves, a phenomenon which is different from the height or amplitude variability and is termed phase variation. The presence of phase variability often artificially inflates data variance, blurs underlying data structures, and distorts principal components. While the separation and/or removal of phase from amplitude data is desirable, this is a difficult problem. In particular, a commonly used alignment procedure, based on minimizing the $\mathbb{L}^2$ norm between functions, does not provide satisfactory results. In this paper we motivate the importance of dealing with the phase variability and summarize several current ideas for separating phase and amplitude components. These approaches differ in the following: (1) the definition and mathematical representation of phase variability, (2) the objective functions that are used in functional data alignment, and (3) the algorithmic tools for solving estimation/optimization problems. We use simple examples to illustrate various approaches and to provide useful contrast between them., Published at http://dx.doi.org/10.1214/15-STS524 in the Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org)
- Published
- 2015
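
A toy example makes the amplitude/phase distinction in the entry above concrete: two identical bumps that differ only by a lateral shift. The sketch below is deliberately naive, aligning by a plain integer shift that minimizes the L2 distance rather than by the nonlinear warping (e.g., under the Fisher-Rao metric) surveyed in the paper, but it shows how ignoring phase flattens the cross-sectional mean.

```python
# Phase variation distorting the cross-sectional mean (toy sketch).
import numpy as np

t = np.linspace(0, 1, 200)
bump = lambda c: np.exp(-200 * (t - c) ** 2)
f1, f2 = bump(0.4), bump(0.6)          # same amplitude, different phase

naive_mean = 0.5 * (f1 + f2)
print(f"true peak height 1.00, naive mean peak {naive_mean.max():.2f}")

# Align f2 to f1 by the integer shift minimizing the L2 distance, then average.
shift = min(range(-80, 81), key=lambda s: np.sum((np.roll(f2, s) - f1) ** 2))
aligned_mean = 0.5 * (f1 + np.roll(f2, shift))
print(f"aligned mean peak {aligned_mean.max():.2f}")   # recovers ~1.00
```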
42. High dimension low sample size asymptotics of robust PCA
- Author
-
Yi-Hui Zhou and James Stephen Marron
- Subjects
Statistics and Probability ,education.field_of_study ,spike model ,business.industry ,Population ,Multivariate normal distribution ,Pattern recognition ,robustness ,Covariance ,Dimension (vector space) ,Sample size determination ,Outlier ,Principal component analysis ,Statistics ,parasitic diseases ,spherical PCA ,Artificial intelligence ,Statistics, Probability and Uncertainty ,education ,business ,Eigenvalues and eigenvectors ,Mathematics - Abstract
Conventional principal component analysis is highly susceptible to outliers. In particular, a single sufficiently outlying data point can draw the leading principal component toward itself. In this paper, we study the effects of outliers for high dimension and low sample size data, using asymptotics. The non-robust nature of conventional principal component analysis is verified through inconsistency under multivariate Gaussian assumptions with a single spike in the covariance structure, in the presence of a contaminating outlier. In the same setting, the robust method of spherical principal components is consistent with the population eigenvector for the spike model, even in the presence of contamination.
- Published
- 2015
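
The contrast studied in the entry above can be seen in a few lines. In the sketch below, a single gross outlier is planted in a spiked-covariance Gaussian sample: ordinary PCA's leading component chases the outlier, while spherical PCA (each robustly centered observation projected to the unit sphere before the eigen-analysis) stays close to the true spike direction. The setup and names are illustrative, not the paper's simulation design.

```python
# Ordinary PCA vs. spherical PCA under a single outlier (sketch).
import numpy as np

rng = np.random.default_rng(0)
d, n = 200, 30
v = np.zeros(d)
v[0] = 1.0                                  # true spike direction
X = rng.normal(size=(n, d)) + 5.0 * rng.normal(size=(n, 1)) * v
X[0] = 50.0 * rng.normal(size=d)            # one contaminating outlier

def leading_pc(Y):
    Yc = Y - Y.mean(axis=0)
    return np.linalg.svd(Yc, full_matrices=False)[2][0]

def spherical_pc(Y):
    Yc = Y - np.median(Y, axis=0)           # robust centering
    S = Yc / np.linalg.norm(Yc, axis=1, keepdims=True)
    return np.linalg.svd(S, full_matrices=False)[2][0]

print(f"|<PC1, v>| ordinary : {abs(leading_pc(X) @ v):.3f}")    # near 0
print(f"|<PC1, v>| spherical: {abs(spherical_pc(X) @ v):.3f}")  # much closer to 1
```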
43. LASS: a tool for the local analysis of self-similarity
- Author
-
Stilian Stoev, Cheolwoo Park, George Michailidis, James Stephen Marron, and Murad S. Taqqu
- Subjects
Statistics and Probability ,Hurst exponent ,Stationary process ,Series (mathematics) ,Self-similarity ,Applied Mathematics ,Internet traffic ,Data structure ,Computational Mathematics ,Wavelet ,Computational Theory and Mathematics ,Local analysis ,Calculus ,Algorithm ,Mathematics - Abstract
The Hurst parameter H characterizes the degree of long-range dependence (and asymptotic self-similarity) in stationary time series. Many methods have been developed for the estimation of H from data. In practice, however, the classical estimation techniques can be severely affected by non-stationary artifacts in the time series. In fact, the assumption that the data can be modeled by a stationary process with a single Hurst exponent H may be unrealistic. This work focuses on practical issues associated with the detection of long-range dependence in Internet traffic data and proposes two tools that can be used to address some of these issues. The first is an animation tool which is used to visualize the local dependence structure. The second is a statistical tool for the local analysis of self-similarity (LASS). The LASS tool is designed to handle time series that have long-range dependence and are long enough that some parts are essentially stationary, while others exhibit non-stationarity, which is either deterministic or stochastic in nature. The tool exploits wavelets to analyze the local dependence structure in the data over a set of windows. It can be used to visualize local deviations from self-similar, long-range dependence scaling and to provide reliable local estimates of the Hurst exponents. The tool, which is illustrated by using a trace of Internet traffic measurements, can also be applied to economic time series. In addition, a median-based wavelet spectrum is introduced. It yields robust local or global estimates of the Hurst parameter that are less susceptible to local non-stationarity. The software tools are freely available and their use is described in an appendix.
- Published
- 2006
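
The core computation behind a tool like the LASS described above can be sketched compactly: slide a window along the series and, within each window, regress the log2 variance of wavelet detail coefficients on scale to get a local Hurst estimate. The code below is an illustration, not the authors' freely available software; it requires the PyWavelets package, the names are made up, and white noise (H = 0.5) is used as a sanity check because exact fractional Gaussian noise simulation is beyond this sketch.

```python
# Window-by-window wavelet-based Hurst estimation (illustrative sketch).
import numpy as np
import pywt

def local_hurst(x, window=1024, levels=6, wavelet="db3"):
    estimates = []
    for start in range(0, len(x) - window + 1, window):
        coeffs = pywt.wavedec(x[start:start + window], wavelet, level=levels)
        # coeffs = [cA_L, cD_L, ..., cD_1]; level-j details sit at coeffs[-j]
        scales = np.arange(1, levels + 1)
        logvar = np.array([np.log2(np.var(coeffs[-j])) for j in scales])
        slope = np.polyfit(scales, logvar, 1)[0]
        estimates.append((slope + 1.0) / 2.0)   # for fGn: slope = 2H - 1
    return np.array(estimates)

x = np.random.default_rng(0).normal(size=8192)
print(local_hurst(x))   # each window should give roughly H = 0.5
```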
44. SiZer for jump detection
- Author
-
James Stephen Marron and C. S. Kim
- Subjects
Statistics and Probability ,Quantitative Biology::Biomolecules ,business.product_category ,Basis (linear algebra) ,Statistics ,Jump ,Step detection ,Statistical physics ,Funnel ,Statistics, Probability and Uncertainty ,business ,Mathematics ,Nonparametric regression - Abstract
SiZer is an exploratory graphical method for finding structure in data. When the structure is a jump in the underlying curve, a ‘jump funnel’ is created in the SiZer map. The shape of this funnel is analyzed. The result is the basis of a proposed variation of SiZer that is specially tuned to finding jumps.
- Published
- 2006
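
The "jump funnel" described above can be seen in a stripped-down SiZer: estimate the derivative of a kernel smooth over a grid of locations and bandwidths, and flag where it is significantly positive. The sketch below uses simple pointwise normal bounds with the noise level assumed known, a substantial simplification of real SiZer inference; at a jump, the flagged region persists across bandwidths and narrows as the bandwidth shrinks.

```python
# A stripped-down SiZer-style significance map around a jump (sketch).
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 400, 0.2
x = np.linspace(0, 1, n)
y = (x > 0.5).astype(float) + rng.normal(0, sigma, n)   # jump at x = 0.5

def smooth_deriv(x0, h):
    """Kernel-weighted local-linear slope at x0 and a rough standard error."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)
    xc = x - x0
    xbar, ybar = np.average(xc, weights=w), np.average(y, weights=w)
    sxx = np.sum(w * (xc - xbar) ** 2)
    slope = np.sum(w * (xc - xbar) * (y - ybar)) / sxx
    se = sigma * np.sqrt(np.sum((w * (xc - xbar)) ** 2)) / sxx
    return slope, se

for h in (0.02, 0.05, 0.10):
    sig = []
    for x0 in np.linspace(0.05, 0.95, 91):
        slope, se = smooth_deriv(x0, h)
        if slope > 2 * se:                  # significantly increasing
            sig.append(x0)
    if sig:
        print(f"h={h:.2f}: significant increase near [{min(sig):.2f}, {max(sig):.2f}]")
```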
45. SiZer for smoothing splines
- Author
-
Jin-Ting Zhang and James Stephen Marron
- Subjects
Statistics and Probability ,Mathematical optimization ,Functional data analysis ,Scatterplot smoothing ,Data structure ,Nonparametric regression ,Computational Mathematics ,Smoothing spline ,Statistical inference ,Statistics, Probability and Uncertainty ,Additive smoothing ,Algorithm ,Smoothing ,ComputingMethodologies_COMPUTERGRAPHICS ,Mathematics - Abstract
Smoothing splines are an attractive method for scatterplot smoothing. The SiZer approach to statistical inference is adapted to this smoothing method, yielding a procedure named SiZerSS. This allows quick and sure inference as to “which features in the smooth are really there” as opposed to “which are due to sampling artifacts”, when using smoothing splines for data analysis. Applications of SiZerSS to mode, linearity, quadraticity and monotonicity tests are illustrated using a real data example. Some small-scale simulations are presented to demonstrate that SiZerSS and SiZerLL (the original local linear version of SiZer) often give similar performance in exploring data structure, but they cannot replace each other completely.
- Published
- 2005
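
The multi-smoothing-level view underlying SiZerSS can be illustrated with scipy's smoothing splines: sweep the smoothing parameter and watch which features persist. The sketch below omits SiZerSS's significance machinery entirely and simply counts local maxima of the spline fit at several smoothing levels; the parameter values are arbitrary.

```python
# Feature counts across smoothing-spline smoothing levels (sketch).
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 300)
y = np.sin(4 * np.pi * x) + rng.normal(0, 0.3, 300)   # truth has 2 local maxima

for s in (1.0, 30.0, 300.0):                 # spline smoothing parameter
    spl = UnivariateSpline(x, y, s=s)
    d = spl.derivative()(x)
    n_max = int(np.sum((d[:-1] > 0) & (d[1:] <= 0)))  # + to - sign changes
    print(f"s={s:>6}: {n_max} local maxima in the smooth")
```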
46. Geometric Representation of High Dimension, Low Sample Size Data
- Author
-
Peter Hall, James Stephen Marron, and Amnon Neeman
- Subjects
Statistics and Probability ,Discrete mathematics ,Simplex ,Dimension (vector space) ,Sample size determination ,Statistics ,Structure (category theory) ,High-dimensional statistics ,Statistics, Probability and Uncertainty ,Type (model theory) ,Data structure ,Randomness ,Mathematics - Abstract
High dimension, low sample size data are emerging in various areas of science. We find a common structure underlying many such data sets by using a non-standard type of asymptotics: the dimension tends to ∞ while the sample size is fixed. Our analysis shows a tendency for the data to lie deterministically at the vertices of a regular simplex. Essentially all the randomness in the data appears only as a random rotation of this simplex. This geometric representation is used to obtain several new statistical insights.
- Published
- 2005
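
The regular-simplex picture described in the entry above is straightforward to verify numerically in the i.i.d. Gaussian setting: as the dimension d grows with the sample size fixed, every pairwise distance between the data vectors approaches sqrt(2d). The following short simulation checks this.

```python
# Distance concentration toward a regular simplex as d grows (sketch).
import numpy as np

rng = np.random.default_rng(0)
n = 4                                        # fixed sample size
for d in (100, 10_000, 1_000_000):
    X = rng.normal(size=(n, d))
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    dist = np.array([np.linalg.norm(X[i] - X[j]) for i, j in pairs])
    print(f"d={d:>9}: distances / sqrt(2d) = {np.round(dist / np.sqrt(2 * d), 3)}")
```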
47. EXTREMAL DEPENDENCE: INTERNET TRAFFIC APPLICATIONS
- Author
-
Félix Hernández-Campos, Kevin Jeffay, Cheolwoo Park, Sidney I. Resnick, and James Stephen Marron
- Subjects
Statistics and Probability ,business.industry ,Applied Mathematics ,Internet traffic ,Measure (mathematics) ,File size ,Probability theory ,Heavy-tailed distribution ,Modeling and Simulation ,Statistics ,The Internet ,Statistical physics ,business ,Value (mathematics) ,Independence (probability theory) ,Mathematics - Abstract
For bivariate heavy-tailed data, the extremes may carry distinctive dependence information not seen in moderate values. For example, a large value in one component may help cause a large value in the other. This is the idea behind the notion of extremal dependence. We discuss ways to detect and measure extremal dependence. We apply the techniques discussed to Internet data and conclude that for files transferred, file size and throughput (the inferred rate at which the file is transferred) exhibit extremal independence.
- Published
- 2005
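
One simple empirical measure in the spirit of the analysis above is the conditional probability that one variable is extreme given that the other is, at an upper quantile q. Under asymptotic (extremal) independence this probability tends to 0 as q approaches 1, while under extremal dependence it stays bounded away from 0. The data below are synthetic placeholders for file size and throughput, generated independently so the estimate should shrink as q grows.

```python
# Empirical conditional tail probability as a dependence diagnostic (sketch).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
size = rng.pareto(1.5, n)                 # heavy-tailed stand-in: file sizes
throughput = rng.pareto(1.5, n)           # independent stand-in: throughputs

for q in (0.95, 0.99, 0.999):
    xq, yq = np.quantile(size, q), np.quantile(throughput, q)
    both = np.mean((size > xq) & (throughput > yq))
    print(f"q={q}: P(throughput extreme | size extreme) = {both / (1 - q):.4f}")
```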
48. Dependent SiZer: Goodness-of-Fit Tests for Time Series Models
- Author
-
Cheolwoo Park, James Stephen Marron, and Vitaliana Rondonotti
- Subjects
Statistics and Probability ,Series (mathematics) ,Null model ,Autocovariance ,symbols.namesake ,Goodness of fit ,Gaussian noise ,Burstiness ,Statistics ,Statistical inference ,symbols ,Range (statistics) ,Applied mathematics ,Statistics, Probability and Uncertainty ,Mathematics - Abstract
In this paper, we extend SiZer (SIgnificant ZERo crossing of the derivatives) to dependent data for the purpose of goodness-of-fit tests for time series models. Dependent SiZer compares the observed data with a specific null model being tested by adjusting the statistical inference using an assumed autocovariance function. This new approach uses a SiZer-type visualization to flag statistically significant differences between the data and a given null model. The power of this approach is demonstrated through some examples of time series of Internet traffic data. It is seen that such time series can have even more burstiness than is predicted by the popular long-range dependent Fractional Gaussian Noise model.
- Published
- 2004
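
The logic of the goodness-of-fit test above can be caricatured in a few lines: smooth the observed series, simulate many series from the null model (whose autocovariance is assumed known), and flag where the observed smooth escapes the simulated envelope. The sketch below substitutes an AR(1) null for the fractional Gaussian noise null discussed in the paper and uses a crude pointwise envelope; everything here is illustrative.

```python
# Null-model envelope test in the spirit of Dependent SiZer (sketch).
import numpy as np

rng = np.random.default_rng(0)
n, h, phi = 500, 15, 0.6

def simulate_ar1(phi, n, rng):
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal()
    return x

def kernel_smooth(y, h):
    k = np.exp(-0.5 * (np.arange(-3 * h, 3 * h + 1) / h) ** 2)
    return np.convolve(y, k / k.sum(), mode="same")

# "Observed" data: the AR(1) null plus a bump the null cannot explain.
y = simulate_ar1(phi, n, rng)
y[200:260] += 2.0

smooth_obs = kernel_smooth(y, h)
null = np.array([kernel_smooth(simulate_ar1(phi, n, rng), h)
                 for _ in range(200)])
lo, hi = np.quantile(null, [0.005, 0.995], axis=0)   # pointwise 99% envelope

flags = np.flatnonzero((smooth_obs < lo) | (smooth_obs > hi))
print("flagged indices:", flags)          # should land roughly in 200-260
```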
49. SiZer for length biased, censored density and hazard estimation
- Author
-
Jacobo de Uña-Álvarez and James Stephen Marron
- Subjects
Statistics and Probability ,Applied Mathematics ,Statistics ,Hazard ratio ,Econometrics ,Range (statistics) ,Statistical inference ,Unbiased Estimation ,Density estimation ,Statistics, Probability and Uncertainty ,Hazard estimation ,Mathematics - Abstract
Length-biased data require appropriate modification of smooth estimates of the density and hazard rate. The SiZer map is a visual device for summarizing a range of useful statistical inferences about the shape of the density and the hazard rate curves. In this paper, SiZer is adapted to account correctly for the length-bias adjustment. Both the censored and uncensored cases are treated.
- Published
- 2004
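
The length-bias correction that the adapted SiZer above builds on is simple to state for the uncensored density case: under length-biased sampling the observed density is proportional to x f(x), so a kernel estimate of f reweights each observation by 1/x_i, normalized. The sketch below checks this on a length-biased Exp(1) sample; the weighted KDE and the sampling scheme are generic textbook constructions, not the paper's estimator verbatim.

```python
# Weighted KDE correcting for length-biased sampling (sketch).
import numpy as np

rng = np.random.default_rng(0)

# Draw a length-biased sample from Exp(1): accept x with probability ~ x.
raw = rng.exponential(1.0, 200_000)
keep = rng.uniform(0, raw.max(), raw.size) < raw
x = rng.choice(raw[keep], 2000, replace=False)

w = (1.0 / x) / np.sum(1.0 / x)           # length-bias correction weights

def weighted_kde(grid, x, w, h):
    u = (grid[:, None] - x[None, :]) / h
    return (w * np.exp(-0.5 * u**2) / (h * np.sqrt(2 * np.pi))).sum(axis=1)

grid = np.linspace(0.5, 4.0, 8)           # interior points, away from x = 0
est = weighted_kde(grid, x, w, 0.2)
print("estimate:", np.round(est, 3))
print("true f  :", np.round(np.exp(-grid), 3))   # Exp(1) density
```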
50. Semi-parametric multivariate modelling when the marginals are the same
- Author
-
James Stephen Marron, Victor Pérez-Abreu, and Miguel Nakamura
- Subjects
Statistics and Probability ,Numerical Analysis ,Multivariate statistics ,Environmental data ,Nonparametric statistics ,Measures of correlation ,Multivariate normal distribution ,Semiparametric model ,Multivariate characteristic function ,Econometrics ,Statistics::Methodology ,Applied mathematics ,Statistics, Probability and Uncertainty ,Marginal distribution ,Infinitely divisible random vectors ,Mathematics ,Curse of dimensionality ,Interpolation - Abstract
A model is developed for multivariate distributions which have nearly the same marginals, up to shift and scale. This model, based on “interpolation” of characteristic functions, gives a new notion of “correlation”. It allows straightforward nonparametric estimation of the common marginal distribution, which avoids the “curse of dimensionality” present when nonparametrically estimating the full multivariate distribution. The method is illustrated with environmental monitoring network data, where multivariate modelling with common marginals is often appropriate.
- Published
- 2003
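
A small illustration of the modelling premise above: when the coordinates of a multivariate sample share one marginal up to shift and scale, each coordinate can be standardized and everything pooled into a single n-times-d sample for estimating the common marginal, sidestepping full multivariate density estimation and its curse of dimensionality. The characteristic-function "interpolation" that defines the paper's dependence notion is not reproduced here; the code only demonstrates the pooling step on synthetic data.

```python
# Pooled estimation of a common marginal after per-coordinate standardization.
import numpy as np

rng = np.random.default_rng(0)
n, d = 300, 8

# Correlated Gaussian coordinates, then coordinate-specific shift and scale.
cov = 0.5 * np.eye(d) + 0.5               # unit variances, 0.5 correlations
Z = rng.multivariate_normal(np.zeros(d), cov, size=n)
X = Z * rng.uniform(1, 3, d) + rng.uniform(-2, 2, d)

# Standardize each coordinate, then pool: n * d draws from one marginal.
pooled = ((X - X.mean(axis=0)) / X.std(axis=0)).ravel()
print(f"pooled sample size: {pooled.size}")
print(f"3rd/4th moments (standard normal: 0 and 3): "
      f"{np.mean(pooled**3):.3f}, {np.mean(pooled**4):.3f}")
```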