497 results for "Witten, Daniela"
Search Results
2. Discussion of 'Data fission: splitting a single data point'
- Author
-
Neufeld, Anna, Dharamshi, Ameer, Gao, Lucy L., Witten, Daniela, and Bien, Jacob
- Subjects
Statistics - Methodology
- Abstract
Leiner et al. [2023] introduce an important generalization of sample splitting, which they call data fission. They consider two cases of data fission: P1 fission and P2 fission. While P1 fission is extremely useful and easy to use, Leiner et al. [2023] provide P1 fission operations only for the Gaussian and the Poisson distributions. They provide little guidance on how to apply P2 fission operations in practice, leaving the reader unsure of how to apply data fission outside of the Gaussian and Poisson settings. In this discussion, we describe how our own work provides P1 fission operations in a wide variety of families and offers insight into when P1 fission is possible. We also provide guidance on how to actually apply P2 fission in practice, with a special focus on logistic regression. Finally, we interpret P2 fission as a remedy for distributional misspecification when carrying out P1 fission operations., Comment: 18 pages, 1 figure
- Published
- 2024
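As a concrete illustration of the Gaussian P1 fission operation discussed in this entry, the sketch below (mine, not the authors' code; the function name and the default tuning parameter τ are illustrative) splits X ~ N(μ, 1) into two independent pieces using external Gaussian noise:

```python
# Minimal sketch of Gaussian P1 fission: for X ~ N(mu, 1) and independent
# Z ~ N(0, 1), the pair f(X) = X + tau*Z and g(X) = X - Z/tau is jointly
# Gaussian with zero covariance, hence independent.
import numpy as np

rng = np.random.default_rng(0)

def gaussian_fission(x, tau=1.0, rng=rng):
    """Split N(mu, 1) observations into two independent Gaussian pieces."""
    z = rng.standard_normal(np.shape(x))
    return x + tau * z, x - z / tau

mu = 3.0
x = mu + rng.standard_normal(100_000)
f, g = gaussian_fission(x)

# Empirically: f and g are uncorrelated, and both are centered at mu.
corr = np.corrcoef(f, g)[0, 1]
```

Both pieces carry information about μ, so one can be used for model selection and the other held out for inference.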
3. Infer-and-widen versus split-and-condition: two tales of selective inference
- Author
-
Perry, Ronan, Xu, Zichun, McGough, Olivia, and Witten, Daniela
- Subjects
Statistics - Methodology
- Abstract
Recent attention has focused on the development of methods for post-selection inference. However, the connections between these methods, and the extent to which one might be preferred to another, remain unclear. In this paper, we classify existing methods for post-selection inference into one of two frameworks: infer-and-widen or split-and-condition. The infer-and-widen framework produces confidence intervals whose midpoints are biased due to selection, and must be wide enough to account for this bias. By contrast, split-and-condition directly adjusts the intervals' midpoints to account for selection. We compare the two frameworks in three vignettes: the winner's curse, maximal contrasts, and inference after the lasso. Our results are striking: in each of these examples, a split-and-condition strategy leads to confidence intervals that are much narrower than the state-of-the-art infer-and-widen proposal, when methods are tuned to yield identical selection events. Furthermore, even an "oracle" infer-and-widen confidence interval -- the narrowest possible interval that could be theoretically attained via infer-and-widen -- is not necessarily narrower than a feasible split-and-condition method. Taken together, these results point to split-and-condition as the most promising framework for post-selection inference in real-world settings.
- Published
- 2024
4. Inference on the proportion of variance explained in principal component analysis
- Author
-
Perry, Ronan, Panigrahi, Snigdha, Bien, Jacob, and Witten, Daniela
- Subjects
Statistics - Methodology
- Abstract
Principal component analysis (PCA) is a longstanding and well-studied approach for dimension reduction. It rests upon the assumption that the underlying signal in the data has low rank, and thus can be well-summarized using a small number of dimensions. The output of PCA is typically represented using a scree plot, which displays the proportion of variance explained (PVE) by each principal component. While the PVE is extensively reported in routine data analyses, to the best of our knowledge the notion of inference on the PVE remains unexplored. In this paper, we consider inference on the PVE. We first introduce a new population quantity for the PVE with respect to an unknown matrix mean. Critically, our interest lies in the PVE of the sample principal components (as opposed to unobserved population principal components); thus, the population PVE that we introduce is defined conditional on the sample singular vectors. We show that it is possible to conduct inference, in the sense of confidence intervals, p-values, and point estimates, on this population quantity. Furthermore, we can conduct valid inference on the PVE of a subset of the principal components, even when the subset is selected using a data-driven approach such as the elbow rule. We demonstrate the proposed approach in simulation and in an application to a gene expression dataset.
- Published
- 2024
5. Negative binomial count splitting for single-cell RNA sequencing data
- Author
-
Neufeld, Anna, Popp, Joshua, Gao, Lucy L., Battle, Alexis, and Witten, Daniela
- Subjects
Statistics - Methodology, Statistics - Applications
- Abstract
The analysis of single-cell RNA sequencing (scRNA-seq) data often involves fitting a latent variable model to learn a low-dimensional representation for the cells. Validating such a model poses a major challenge. If we could sequence the same set of cells twice, we could use one dataset to fit a latent variable model and the other to validate it. In reality, we cannot sequence the same set of cells twice. Poisson count splitting was recently proposed as a way to work backwards from a single observed Poisson data matrix to obtain independent Poisson training and test matrices that could have arisen from two independent sequencing experiments conducted on the same set of cells. However, the Poisson count splitting approach requires that the original data are exactly Poisson distributed: in the presence of any overdispersion, the resulting training and test datasets are not independent. In this paper, we introduce negative binomial count splitting, which extends Poisson count splitting to the more flexible negative binomial setting. Given an $n \times p$ dataset from a negative binomial distribution, we use Dirichlet-multinomial sampling to create two or more independent $n \times p$ negative binomial datasets. We show that this procedure outperforms Poisson count splitting in simulation, and apply it to validate clusters of kidney cells from a human fetal cell atlas.
- Published
- 2023
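The Dirichlet-multinomial splitting step described in this abstract can be sketched as follows; this is my reading of the recipe as stated (the equal fold parameters b/K and the per-entry loop are illustrative assumptions, not the authors' implementation):

```python
# Minimal sketch of negative binomial count splitting: conditional on an
# observed count X with NB size parameter b, draw fold proportions from a
# Dirichlet(b/K, ..., b/K) and allocate X multinomially. The K folds sum
# to X and, per the paper, form independent negative binomial datasets.
import numpy as np

rng = np.random.default_rng(1)

def nb_split(X, b, K=2, rng=rng):
    """Split a matrix of NB(size=b) counts into K folds that sum to X."""
    X = np.asarray(X)
    folds = np.zeros((K,) + X.shape, dtype=np.int64)
    alpha = np.full(K, b / K)
    for idx in np.ndindex(X.shape):
        p = rng.dirichlet(alpha)  # entry-specific Dirichlet-multinomial draw
        folds[(slice(None),) + idx] = rng.multinomial(X[idx], p)
    return folds

b, mu = 5.0, 10.0
# NumPy's negative_binomial(n, p) has mean n*(1-p)/p; p = b/(b+mu) gives mean mu.
X = rng.negative_binomial(n=b, p=b / (b + mu), size=(200, 30))
train, test = nb_split(X, b=b, K=2)
```

One fold can then be used to fit a latent variable model and the other to validate it, as in the abstract's motivating scRNA-seq setting.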
6. Revisiting inference after prediction
- Author
-
Motwani, Keshav and Witten, Daniela
- Subjects
Statistics - Machine Learning, Computer Science - Machine Learning
- Abstract
Recent work has focused on the very common practice of prediction-based inference: that is, (i) using a pre-trained machine learning model to predict an unobserved response variable, and then (ii) conducting inference on the association between that predicted response and some covariates. As pointed out by Wang et al. (2020), applying a standard inferential approach in (ii) does not accurately quantify the association between the unobserved (as opposed to the predicted) response and the covariates. In recent work, Wang et al. (2020) and Angelopoulos et al. (2023) propose corrections to step (ii) in order to enable valid inference on the association between the unobserved response and the covariates. Here, we show that the method proposed by Angelopoulos et al. (2023) successfully controls the type 1 error rate and provides confidence intervals with correct nominal coverage, regardless of the quality of the pre-trained machine learning model used to predict the unobserved response. However, the method proposed by Wang et al. (2020) provides valid inference only under very strong conditions that rarely hold in practice: for instance, if the machine learning model perfectly estimates the true regression function in the study population of interest.
- Published
- 2023
7. Generalized Data Thinning Using Sufficient Statistics
- Author
-
Dharamshi, Ameer, Neufeld, Anna, Motwani, Keshav, Gao, Lucy L., Witten, Daniela, and Bien, Jacob
- Subjects
Statistics - Methodology, Mathematics - Statistics Theory, Statistics - Machine Learning
- Abstract
Our goal is to develop a general strategy to decompose a random variable $X$ into multiple independent random variables, without sacrificing any information about unknown parameters. A recent paper showed that for some well-known natural exponential families, $X$ can be "thinned" into independent random variables $X^{(1)}, \ldots, X^{(K)}$, such that $X = \sum_{k=1}^K X^{(k)}$. These independent random variables can then be used for various model validation and inference tasks, including in contexts where traditional sample splitting fails. In this paper, we generalize their procedure by relaxing this summation requirement and simply asking that some known function of the independent random variables exactly reconstruct $X$. This generalization of the procedure serves two purposes. First, it greatly expands the families of distributions for which thinning can be performed. Second, it unifies sample splitting and data thinning, which on the surface seem to be very different, as applications of the same principle. This shared principle is sufficiency. We use this insight to perform generalized thinning operations for a diverse set of families.
- Published
- 2023
8. Data thinning for convolution-closed distributions
- Author
-
Neufeld, Anna, Dharamshi, Ameer, Gao, Lucy L., and Witten, Daniela
- Subjects
Statistics - Methodology, Statistics - Machine Learning
- Abstract
We propose data thinning, an approach for splitting an observation into two or more independent parts that sum to the original observation, and that follow the same distribution as the original observation, up to a (known) scaling of a parameter. This very general proposal is applicable to any convolution-closed distribution, a class that includes the Gaussian, Poisson, negative binomial, gamma, and binomial distributions, among others. Data thinning has a number of applications to model selection, evaluation, and inference. For instance, cross-validation via data thinning provides an attractive alternative to the usual approach of cross-validation via sample splitting, especially in settings in which the latter is not applicable. In simulations and in an application to single-cell RNA-sequencing data, we show that data thinning can be used to validate the results of unsupervised learning approaches, such as k-means clustering and principal components analysis, for which traditional sample splitting is unattractive or unavailable.
- Published
- 2023
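In the Poisson case (the simplest convolution-closed family), data thinning reduces to binomial splitting. A minimal sketch (my own, with an illustrative thinning fraction ε):

```python
# Minimal sketch of Poisson data thinning: if X ~ Poisson(lam), then
# X1 | X ~ Binomial(X, eps) and X2 = X - X1 are independent, with
# X1 ~ Poisson(eps * lam) and X2 ~ Poisson((1 - eps) * lam).
import numpy as np

rng = np.random.default_rng(2)

def poisson_thin(X, eps=0.5, rng=rng):
    """Split Poisson counts into two independent Poisson pieces summing to X."""
    X = np.asarray(X)
    X1 = rng.binomial(X, eps)
    return X1, X - X1

lam = 8.0
X = rng.poisson(lam, size=100_000)
X1, X2 = poisson_thin(X, eps=0.3)  # e.g., a 30/70 train/test allocation
corr = np.corrcoef(X1, X2)[0, 1]   # near zero, consistent with independence
```

Because the two pieces are independent yet sum to the original data, they can play the roles of training and test sets even when sample splitting is unavailable (e.g., for clustering or PCA).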
9. Inferring independent sets of Gaussian variables after thresholding correlations
- Author
-
Saha, Arkajyoti, Witten, Daniela, and Bien, Jacob
- Subjects
Statistics - Methodology, Statistics - Machine Learning
- Abstract
We consider testing whether a set of Gaussian variables, selected from the data, is independent of the remaining variables. We assume that this set is selected via a very simple approach that is commonly used across scientific disciplines: we select a set of variables for which the correlation with all variables outside the set falls below some threshold. Unlike other settings in selective inference, failure to account for the selection step leads, in this setting, to excessively conservative (as opposed to anti-conservative) results. Our proposed test properly accounts for the fact that the set of variables is selected from the data, and thus is not overly conservative. To develop our test, we condition on the event that the selection resulted in the set of variables in question. To achieve computational tractability, we develop a new characterization of the conditioning event in terms of the canonical correlation between the groups of random variables. In simulation studies and in the analysis of gene co-expression networks, we show that our approach has much higher power than a "naive" approach that ignores the effect of selection., Comment: 33 pages, 5 figures, 6 figure files (due to subfigures)
- Published
- 2022
10. Inference after latent variable estimation for single-cell RNA sequencing data
- Author
-
Neufeld, Anna, Gao, Lucy L., Popp, Joshua, Battle, Alexis, and Witten, Daniela
- Subjects
Statistics - Methodology, Statistics - Applications
- Abstract
In the analysis of single-cell RNA sequencing data, researchers often characterize the variation between cells by estimating a latent variable, such as cell type or pseudotime, representing some aspect of the individual cell's state. They then test each gene for association with the estimated latent variable. If the same data are used for both of these steps, then standard methods for computing p-values and confidence intervals in the second step will fail to achieve statistical guarantees such as Type 1 error control. Furthermore, approaches such as sample splitting that can be applied to solve similar problems in other settings are not applicable in this context. In this paper, we introduce count splitting, a flexible framework that allows us to carry out valid inference in this setting, for virtually any latent variable estimation technique and inference approach, under a Poisson assumption. We demonstrate the Type 1 error control and power of count splitting in a simulation study, and apply count splitting to a dataset of pluripotent stem cells differentiating to cardiomyocytes., Comment: 43 pages, 7 figures
- Published
- 2022
11. Controlling costs: Feature selection on a budget.
- Author
-
Yu, Guo, Witten, Daniela, and Bien, Jacob
- Subjects
feature cost, feature selection, multiple knockoffs, weighted false discovery proportion
- Abstract
The traditional framework for feature selection treats all features as costing the same amount. However, in reality, a scientist often has considerable discretion regarding which variables to measure, and the decision involves a tradeoff between model accuracy and cost (where cost can refer to money, time, difficulty or intrusiveness). In particular, unnecessarily including an expensive feature in a model is worse than unnecessarily including a cheap feature. We propose a procedure, which we call cheap knockoffs, for performing feature selection in a cost-conscious manner. The key idea behind our method is to force higher cost features to compete with more knockoffs than cheaper features. We derive an upper bound on the weighted false discovery proportion associated with this procedure, which corresponds to the fraction of the feature cost that is wasted on unimportant features. We prove that this bound holds simultaneously with high probability over a path of selected variable sets of increasing size. A user may thus select a set of features based, for example, on the overall budget, while knowing that no more than a particular fraction of feature cost is wasted. We investigate, through simulation and a biomedical application, the practical importance of incorporating cost considerations into the feature selection process.
- Published
- 2022
12. Selective inference for k-means clustering
- Author
-
Chen, Yiqun T. and Witten, Daniela M.
- Subjects
Statistics - Methodology, Statistics - Machine Learning
- Abstract
We consider the problem of testing for a difference in means between clusters of observations identified via k-means clustering. In this setting, classical hypothesis tests lead to an inflated Type I error rate. To overcome this problem, we take a selective inference approach. We propose a finite-sample p-value that controls the selective Type I error for a test of the difference in means between a pair of clusters obtained using k-means clustering, and show that it can be efficiently computed. We apply our proposal in simulation, and on hand-written digits data and single-cell RNA-sequencing data.
- Published
- 2022
13. Unsupervised Learning
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, and Taylor, Jonathan (Allen, G., De Veaux, R., and Nugent, R., Series Editors)
- Published
- 2023
14. Linear Model Selection and Regularization
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, and Taylor, Jonathan (Allen, G., De Veaux, R., and Nugent, R., Series Editors)
- Published
- 2023
15. Introduction
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, and Taylor, Jonathan (Allen, G., De Veaux, R., and Nugent, R., Series Editors)
- Published
- 2023
16. Resampling Methods
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, and Taylor, Jonathan (Allen, G., De Veaux, R., and Nugent, R., Series Editors)
- Published
- 2023
17. Support Vector Machines
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, and Taylor, Jonathan (Allen, G., De Veaux, R., and Nugent, R., Series Editors)
- Published
- 2023
18. Tree-Based Methods
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, and Taylor, Jonathan (Allen, G., De Veaux, R., and Nugent, R., Series Editors)
- Published
- 2023
19. Moving Beyond Linearity
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, and Taylor, Jonathan (Allen, G., De Veaux, R., and Nugent, R., Series Editors)
- Published
- 2023
20. Deep Learning
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, and Taylor, Jonathan (Allen, G., De Veaux, R., and Nugent, R., Series Editors)
- Published
- 2023
21. Multiple Testing
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, and Taylor, Jonathan (Allen, G., De Veaux, R., and Nugent, R., Series Editors)
- Published
- 2023
22. Classification
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, and Taylor, Jonathan (Allen, G., De Veaux, R., and Nugent, R., Series Editors)
- Published
- 2023
23. Statistical Learning
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, and Taylor, Jonathan (Allen, G., De Veaux, R., and Nugent, R., Series Editors)
- Published
- 2023
24. Survival Analysis and Censored Data
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, and Taylor, Jonathan (Allen, G., De Veaux, R., and Nugent, R., Series Editors)
- Published
- 2023
25. Linear Regression
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, and Taylor, Jonathan (Allen, G., De Veaux, R., and Nugent, R., Series Editors)
- Published
- 2023
26. More powerful selective inference for the graph fused lasso
- Author
-
Chen, Yiqun T., Jewell, Sean W., and Witten, Daniela M.
- Subjects
Statistics - Methodology
- Abstract
The graph fused lasso -- which includes as a special case the one-dimensional fused lasso -- is widely used to reconstruct signals that are piecewise constant on a graph, meaning that nodes connected by an edge tend to have identical values. We consider testing for a difference in the means of two connected components estimated using the graph fused lasso. A naive procedure such as a z-test for a difference in means will not control the selective Type I error, since the hypothesis that we are testing is itself a function of the data. In this work, we propose a new test for this task that controls the selective Type I error, and conditions on less information than existing approaches, leading to substantially higher power. We illustrate our approach in simulation and on datasets of drug overdose death rates and teenage birth rates in the contiguous United States. Our approach yields more discoveries on both datasets.
- Published
- 2021
27. Tree-Values: selective inference for regression trees
- Author
-
Neufeld, Anna C., Gao, Lucy L., and Witten, Daniela M.
- Subjects
Statistics - Methodology, Statistics - Machine Learning
- Abstract
We consider conducting inference on the output of the Classification and Regression Tree (CART) [Breiman et al., 1984] algorithm. A naive approach to inference that does not account for the fact that the tree was estimated from the data will not achieve standard guarantees, such as Type 1 error rate control and nominal coverage. Thus, we propose a selective inference framework for conducting inference on a fitted CART tree. In a nutshell, we condition on the fact that the tree was estimated from the data. We propose a test for the difference in the mean response between a pair of terminal nodes that controls the selective Type 1 error rate, and a confidence interval for the mean response within a single terminal node that attains the nominal selective coverage. Efficient algorithms for computing the necessary conditioning sets are provided. We apply these methods in simulation and to a dataset involving the association between portion control interventions and caloric intake.
- Published
- 2021
28. Quantifying uncertainty in spikes estimated from calcium imaging data
- Author
-
Chen, Yiqun T., Jewell, Sean W., and Witten, Daniela M.
- Subjects
Statistics - Methodology, Statistics - Applications
- Abstract
In recent years, a number of methods have been proposed to estimate the times at which a neuron spikes on the basis of calcium imaging data. However, quantifying the uncertainty associated with these estimated spikes remains an open problem. We consider a simple and well-studied model for calcium imaging data, which states that calcium decays exponentially in the absence of a spike, and instantaneously increases when a spike occurs. We wish to test the null hypothesis that the neuron did not spike -- i.e., that there was no increase in calcium -- at a particular timepoint at which a spike was estimated. In this setting, classical hypothesis tests lead to inflated Type I error, because the spike was estimated on the same data used for testing. To overcome this problem, we propose a selective inference approach. We describe an efficient algorithm to compute finite-sample p-values that control selective Type I error, and confidence intervals with correct selective coverage, for spikes estimated using a recent proposal from the literature. We apply our proposal in simulation and on calcium imaging data from the spikefinder challenge., Comment: 52 pages, 12 Figures
- Published
- 2021
29. Selective Inference for Hierarchical Clustering
- Author
-
Gao, Lucy L., Bien, Jacob, and Witten, Daniela
- Subjects
Statistics - Methodology, Statistics - Machine Learning
- Abstract
Classical tests for a difference in means control the type I error rate when the groups are defined a priori. However, when the groups are instead defined via clustering, then applying a classical test yields an extremely inflated type I error rate. Notably, this problem persists even if two separate and independent data sets are used to define the groups and to test for a difference in their means. To address this problem, in this paper, we propose a selective inference approach to test for a difference in means between two clusters. Our procedure controls the selective type I error rate by accounting for the fact that the choice of null hypothesis was made based on the data. We describe how to efficiently compute exact p-values for clusters obtained using agglomerative hierarchical clustering with many commonly-used linkages. We apply our method to simulated data and to single-cell RNA-sequencing data., Comment: Final accepted version
- Published
- 2020
30. Exponential inequalities for dependent V-statistics via random Fourier features
- Author
-
Shen, Yandi, Han, Fang, and Witten, Daniela
- Subjects
Mathematics - Statistics Theory, Mathematics - Probability
- Abstract
We establish exponential inequalities for a class of V-statistics under strong mixing conditions. Our theory is developed via a novel kernel expansion based on random Fourier features and the use of a probabilistic method. This type of expansion is new and useful for handling many notorious classes of kernels., Comment: This is the first part of the arxiv preprint (arXiv:1902.02761), and is to appear in Electronic Journal of Probability (EJP). The second part of the arxiv preprint will be submitted to a statistical journal
- Published
- 2020
31. Testing for a Change in Mean After Changepoint Detection
- Author
-
Jewell, Sean, Fearnhead, Paul, and Witten, Daniela
- Subjects
Statistics - Methodology
- Abstract
While many methods are available to detect structural changes in a time series, few procedures are available to quantify the uncertainty of these estimates post-detection. In this work, we fill this gap by proposing a new framework to test the null hypothesis that there is no change in mean around an estimated changepoint. We further show that it is possible to efficiently carry out this framework in the case of changepoints estimated by binary segmentation and its variants, $\ell_{0}$ segmentation, or the fused lasso. Our setup allows us to condition on much less information than existing approaches, which yields higher powered tests. We apply our proposals in a simulation study and on a dataset of chromosomal guanine-cytosine content. These approaches are freely available in the R package ChangepointInference at https://jewellsean.github.io/changepoint-inference/., Comment: Main text: 28 pages, 5 figures. Supplementary Materials: 15 pages, 4 figures
- Published
- 2019
32. Controlling Costs: Feature Selection on a Budget
- Author
-
Yu, Guo, Witten, Daniela, and Bien, Jacob
- Subjects
Statistics - Methodology
- Abstract
The traditional framework for feature selection treats all features as costing the same amount. However, in reality, a scientist often has considerable discretion regarding which variables to measure, and the decision involves a tradeoff between model accuracy and cost (where cost can refer to money, time, difficulty, or intrusiveness). In particular, unnecessarily including an expensive feature in a model is worse than unnecessarily including a cheap feature. We propose a procedure, which we call cheap knockoffs, for performing feature selection in a cost-conscious manner. The key idea behind our method is to force higher cost features to compete with more knockoffs than cheaper features. We derive an upper bound on the weighted false discovery proportion associated with this procedure, which corresponds to the fraction of the feature cost that is wasted on unimportant features. We prove that this bound holds simultaneously with high probability over a path of selected variable sets of increasing size. A user may thus select a set of features based, for example, on the overall budget, while knowing that no more than a particular fraction of feature cost is wasted. We investigate, through simulation and a biomedical application, the practical importance of incorporating cost considerations into the feature selection process.
- Published
- 2019
33. Machine learning techniques for mitoses classification.
- Author
-
Nofallah, Shima, Mehta, Sachin, Mercan, Ezgi, Knezevich, Stevan, May, Caitlin, Weaver, Donald, Witten, Daniela, Elmore, Joann, and Shapiro, Linda
- Subjects
Convolutional neural networks, Machine learning, Melanoma, Mitoses, Pathology, Breast Neoplasms, Female, Humans, Machine Learning, Mitosis, Neural Networks, Computer
- Abstract
BACKGROUND: Pathologists analyze biopsy material at both the cellular and structural level to determine diagnosis and cancer stage. Mitotic figures are surrogate biomarkers of cellular proliferation that can provide prognostic information; thus, their precise detection is an important factor for clinical care. Convolutional Neural Networks (CNNs) have shown remarkable performance on several recognition tasks. Utilizing CNNs for mitosis classification may aid pathologists to improve the detection accuracy. METHODS: We studied two state-of-the-art CNN-based models, ESPNet and DenseNet, for mitosis classification on six whole slide images of skin biopsies and compared their quantitative performance in terms of sensitivity, specificity, and F-score. We used raw RGB images of mitosis and non-mitosis samples with their corresponding labels as training input. In order to compare with other work, we studied the performance of these classifiers and two other architectures, ResNet and ShuffleNet, on the publicly available MITOS breast biopsy dataset and compared the performance of all four in terms of precision, recall, and F-score (which are standard for this data set), architecture, training time and inference time. RESULTS: The ESPNet and DenseNet results on our primary melanoma dataset had a sensitivity of 0.976 and 0.968, and a specificity of 0.987 and 0.995, respectively, with F-scores of 0.968 and 0.976, respectively. On the MITOS dataset, ESPNet and DenseNet showed a sensitivity of 0.866 and 0.916, and a specificity of 0.973 and 0.980, respectively. The MITOS results using DenseNet had a precision of 0.939, recall of 0.916, and F-score of 0.927. The best published result on MITOS (Saha et al. 2018) reported precision of 0.92, recall of 0.88, and F-score of 0.90. In our architecture comparisons on MITOS, we found that DenseNet beats the others in terms of F-score (DenseNet 0.927, ESPNet 0.890, ResNet 0.865, ShuffleNet 0.847) and especially recall (DenseNet 0.916, ESPNet 0.866, ResNet 0.807, ShuffleNet 0.753), while ResNet and ESPNet have much faster inference times (ResNet 6 s, ESPNet 8 s, DenseNet 31 s). ResNet is faster than ESPNet, but ESPNet has a higher F-score and recall than ResNet, making it a good compromise solution. CONCLUSION: We studied several state-of-the-art CNNs for detecting mitotic figures in whole slide biopsy images. We evaluated two CNNs on a melanoma cancer dataset and then compared four CNNs on a public breast cancer data set, using the same methodology on both. Our methodology and architecture for mitosis finding in both melanoma and breast cancer whole slide images has been thoroughly tested and is likely to be useful for finding mitoses in any whole slide biopsy images.
- Published
- 2021
34. Testing for Association in Multi-View Network Data
- Author
-
Gao, Lucy L., Witten, Daniela, and Bien, Jacob
- Subjects
Statistics - Methodology, Statistics - Machine Learning
- Abstract
In this paper, we consider data consisting of multiple networks, each comprised of a different edge set on a common set of nodes. Many models have been proposed for the analysis of such multi-view network data under the assumption that the data views are closely related. In this paper, we provide tools for evaluating this assumption. In particular, we ask: given two networks that each follow a stochastic block model, is there an association between the latent community memberships of the nodes in the two networks? To answer this question, we extend the stochastic block model for a single network view to the two-view setting, and develop a new hypothesis test for the null hypothesis that the latent community memberships in the two data views are independent. We apply our test to protein-protein interaction data from the HINT database (Das and Yu, 2012). We find evidence of a weak association between the latent community memberships of proteins defined with respect to binary interaction data and the latent community memberships of proteins defined with respect to co-complex association data. We also extend this proposal to the setting of a network with node covariates.
- Published
- 2019
35. Optimal estimation of variance in nonparametric regression with random design
- Author
-
Shen, Yandi, Gao, Chao, Witten, Daniela, and Han, Fang
- Subjects
Mathematics - Statistics Theory - Abstract
Consider the heteroscedastic nonparametric regression model with random design \begin{align*} Y_i = f(X_i) + V^{1/2}(X_i)\varepsilon_i, \quad i=1,2,\ldots,n, \end{align*} with $f(\cdot)$ and $V(\cdot)$ $\alpha$- and $\beta$-H\"older smooth, respectively. We show that the minimax rate of estimating $V(\cdot)$ under both local and global squared risks is of the order \begin{align*} n^{-\frac{8\alpha\beta}{4\alpha\beta + 2\alpha + \beta}} \vee n^{-\frac{2\beta}{2\beta+1}}, \end{align*} where $a\vee b := \max\{a,b\}$ for any two real numbers $a,b$. This result extends the fixed design rate $n^{-4\alpha} \vee n^{-2\beta/(2\beta+1)}$ derived in Wang et al. [2008] in a non-trivial manner, as indicated by the appearances of both $\alpha$ and $\beta$ in the first term. In the special case of constant variance, we show that the minimax rate is $n^{-8\alpha/(4\alpha+1)}\vee n^{-1}$ for variance estimation, which further implies the same rate for quadratic functional estimation and thus unifies the minimax rate under the nonparametric regression model with the rates under the density model and the white noise model. To achieve the minimax rate, we develop a U-statistic-based local polynomial estimator and a matching lower bound, constructed over a specified family of distributions for both $\varepsilon_i$ and $X_i$., Comment: to appear in the Annals of Statistics
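Since the displayed rate is a maximum of two vanishing terms, the effective exponent is the minimum of the two exponents; a quick numeric sketch (with `a`, `b` standing in for $\alpha$, $\beta$):

```python
# Minimax rate: n^{-e1} "v" n^{-e2}, with e1 = 8ab/(4ab + 2a + b) and
# e2 = 2b/(2b + 1), where a, b are the Holder exponents of f and V.
# Because "v" keeps the larger (slower-vanishing) rate, the effective
# exponent of the rate is min(e1, e2).
def rate_exponent(a: float, b: float) -> float:
    e1 = 8 * a * b / (4 * a * b + 2 * a + b)
    e2 = 2 * b / (2 * b + 1)
    return min(e1, e2)

print(rate_exponent(0.5, 0.5))   # second term dominates here
print(rate_exponent(1.0, 1.0))
```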
- Published
- 2019
36. Modeling microbial abundances and dysbiosis with beta-binomial regression
- Author
-
Martin, Bryan D., Witten, Daniela, and Willis, Amy D.
- Subjects
Statistics - Methodology - Abstract
Using a sample from a population to estimate the proportion of the population with a certain category label is a broadly important problem. In the context of microbiome studies, this problem arises when researchers wish to use a sample from a population of microbes to estimate the population proportion of a particular taxon, known as the taxon's relative abundance. In this paper, we propose a beta-binomial model for this task. Like existing models, our model allows for a taxon's relative abundance to be associated with covariates of interest. However, unlike existing models, our proposal also allows for the overdispersion in the taxon's counts to be associated with covariates of interest. We exploit this model in order to propose tests not only for differential relative abundance, but also for differential variability. The latter is particularly valuable in light of speculation that dysbiosis, the perturbation from a normal microbiome that can occur in certain disease conditions, may manifest as a loss of stability, or increase in variability, of the counts associated with each taxon. We demonstrate the performance of our proposed model using a simulation study and an application to soil microbial data.
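A minimal sketch of why the beta-binomial accommodates overdispersion relative to the binomial (the covariate links for the mean and dispersion proposed in the paper are omitted; all numbers below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n_trials, n_samples = 100, 50_000

# Beta-binomial draws: p_i ~ Beta(a, b), count_i ~ Binomial(n_trials, p_i).
a, b = 2.0, 8.0                        # mean relative abundance a / (a + b) = 0.2
p = rng.beta(a, b, size=n_samples)
counts = rng.binomial(n_trials, p)

# A plain Binomial(n_trials, 0.2) would have variance n * p * (1 - p) = 16;
# the beta-binomial inflates this by roughly 1 + (n - 1) / (a + b + 1) = 10.
binom_var = n_trials * 0.2 * 0.8
print(counts.var() > 1.5 * binom_var)  # overdispersion is clearly visible
```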
- Published
- 2019
37. Tail behavior of dependent V-statistics and its applications
- Author
-
Shen, Yandi, Han, Fang, and Witten, Daniela
- Subjects
Mathematics - Statistics Theory, Mathematics - Probability - Abstract
We establish exponential inequalities and Cramér-type moderate deviation theorems for a class of V-statistics under strong mixing conditions. Our theory is developed via a kernel expansion based on random Fourier features. This type of expansion is new and useful for handling many notoriously difficult classes of kernels. While the developed theory has a number of applications, we apply it to lasso-type semiparametric regression estimation and high-dimensional multiple hypothesis testing.
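The random Fourier feature expansion referred to here can be illustrated on the Gaussian kernel, following the classical Rahimi-Recht construction (a sketch of the expansion itself, not the paper's mixing-sequence setting):

```python
import numpy as np

rng = np.random.default_rng(2)
d, D = 3, 20_000                      # input dimension, number of random features

# Random Fourier features for the Gaussian kernel k(x, y) = exp(-||x - y||^2 / 2):
# z(x) = sqrt(2/D) * cos(W x + u), with rows of W drawn N(0, I) and
# u ~ Uniform[0, 2*pi]; then z(x) . z(y) approximates k(x, y).
W = rng.standard_normal((D, d))
u = rng.uniform(0.0, 2.0 * np.pi, size=D)

def z(x):
    return np.sqrt(2.0 / D) * np.cos(W @ x + u)

x, y = rng.standard_normal(d), rng.standard_normal(d)
exact = np.exp(-np.sum((x - y) ** 2) / 2.0)
approx = z(x) @ z(y)
print(abs(exact - approx) < 0.05)
```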
- Published
- 2019
38. Are Clusterings of Multiple Data Views Independent?
- Author
-
Gao, Lucy L., Bien, Jacob, and Witten, Daniela
- Subjects
Statistics - Methodology, Statistics - Machine Learning - Abstract
In the Pioneer 100 (P100) Wellness Project (Price and others, 2017), multiple types of data are collected on a single set of healthy participants at multiple timepoints in order to characterize and optimize wellness. One way to do this is to identify clusters, or subgroups, among the participants, and then to tailor personalized health recommendations to each subgroup. It is tempting to cluster the participants using all of the data types and timepoints, in order to fully exploit the available information. However, clustering the participants based on multiple data views implicitly assumes that a single underlying clustering of the participants is shared across all data views. If this assumption does not hold, then clustering the participants using multiple data views may lead to spurious results. In this paper, we seek to evaluate the assumption that there is some underlying relationship among the clusterings from the different data views, by asking the question: are the clusters within each data view dependent or independent? We develop a new test for answering this question, which we then apply to clinical, proteomic, and metabolomic data, across two distinct timepoints, from the P100 study. We find that while the subgroups of the participants defined with respect to any single data type seem to be dependent across time, the clustering among the participants based on one data type (e.g. proteomic data) appears not to be associated with the clustering based on another data type (e.g. clinical data)., Comment: 20 pages, 4 figures, 1 table (main text); 15 pages, 9 figures (supplement)
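When the two cluster labelings are treated as observed, a naive permutation test of association looks as follows; note that the paper's test goes further, accounting for the fact that the clusterings are themselves estimated from the data (a toy sketch with simulated labels):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
# Toy: two 3-cluster labelings; view 2 agrees with view 1 on ~70% of samples.
c1 = rng.integers(0, 3, size=n)
c2 = np.where(rng.random(n) < 0.7, c1, rng.integers(0, 3, size=n))

def agree(a, b):
    return np.mean(a == b)            # simple association statistic

observed = agree(c1, c2)

# Permutation null: shuffling one labeling breaks any dependence between views.
null = [agree(c1, rng.permutation(c2)) for _ in range(999)]
pval = (1 + sum(s >= observed for s in null)) / 1000
print(pval < 0.05)                    # dependence between the views is detected
```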
- Published
- 2019
39. The Multiple Random Dot Product Graph Model
- Author
-
Nielsen, Agnes Martine and Witten, Daniela
- Subjects
Statistics - Methodology - Abstract
Data in the form of graphs, or networks, arise naturally in a number of contexts; examples include social networks and biological networks. We are often faced with the availability of multiple graphs on a single set of nodes. In this article, we propose the multiple random dot product graph model for this setting. Our proposed model leads naturally to an optimization problem, which we solve using an efficient alternating minimization approach. We further use this model as the basis for a new test for the hypothesis that the graphs come from a single distribution, versus the alternative that they are drawn from different distributions. We evaluate the performance of both the fitting algorithm and the hypothesis test in several simulation settings, and demonstrate empirical improvement over existing approaches. We apply these new approaches to a Wikipedia data set and a C. elegans data set.
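A single random dot product graph can be simulated, and its edge-probability matrix recovered via a rank-d spectral decomposition, in a few lines (a sketch of the single-graph building block only, not the authors' multiple-graph model or its alternating minimization):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 400, 2

# Random dot product graph: latent positions X give edge probabilities P = X X^T.
X = rng.uniform(0.2, 0.6, size=(n, d))       # entries chosen so P stays in (0, 1)
P = X @ X.T
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1)
A = A + A.T                                  # symmetric adjacency, no self-loops

# Adjacency spectral embedding: the top-d eigenpairs recover X up to rotation.
vals, vecs = np.linalg.eigh(A)
Xhat = vecs[:, -d:] * np.sqrt(np.abs(vals[-d:]))
Phat = Xhat @ Xhat.T
print(np.linalg.norm(Phat - P) / np.linalg.norm(P) < 0.3)
```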
- Published
- 2018
40. Robust Sparse Reduced Rank Regression in High Dimensions
- Author
-
Tan, Kean Ming, Sun, Qiang, and Witten, Daniela
- Subjects
Statistics - Machine Learning, Computer Science - Machine Learning - Abstract
We propose robust sparse reduced rank regression for analyzing large and complex high-dimensional data with heavy-tailed random noise. The proposed method is based on a convex relaxation of a rank- and sparsity-constrained non-convex optimization problem, which is then solved using the alternating direction method of multipliers algorithm. We establish non-asymptotic estimation error bounds under both Frobenius and nuclear norms in the high-dimensional setting. This is a major contribution over existing results in reduced rank regression, which mainly focus on rank selection and prediction consistency. Our theoretical results quantify the tradeoff between heavy-tailedness of the random noise and statistical bias. For random noise with bounded $(1+\delta)$th moment with $\delta \in (0,1)$, the rate of convergence is a function of $\delta$, and is slower than the sub-Gaussian-type deviation bounds; for random noise with bounded second moment, we obtain a rate of convergence as if sub-Gaussian noise were assumed. Furthermore, the transition between the two regimes is smooth. We illustrate the performance of the proposed method via extensive numerical studies and a data application., Comment: This is a replacement of a previous article titled "Distributionally Robust Reduced Rank Regression and Principal Component Analysis in High Dimensions"
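The key subproblem inside ADMM for nuclear-norm-penalized objectives of this kind is singular value soft-thresholding, the proximal operator of the nuclear norm; a minimal sketch (illustrative, not the authors' full algorithm):

```python
import numpy as np

def svt(M, tau):
    """Singular value soft-thresholding: the prox of tau * (nuclear norm) at M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(5)
low_rank = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 8))  # rank 3
noisy = low_rank + 0.1 * rng.standard_normal((8, 8))                  # full rank

# Thresholding above the noise level zeroes the small singular values,
# returning a low-rank estimate.
denoised = svt(noisy, tau=1.0)
print(np.linalg.matrix_rank(denoised) <= 3)
```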
- Published
- 2018
41. Tree-Based Methods
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, Allen, G., Series Editor, De Veaux, R., Series Editor, and Nugent, R., Series Editor
- Published
- 2021
- Full Text
- View/download PDF
42. Support Vector Machines
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, Allen, G., Series Editor, De Veaux, R., Series Editor, and Nugent, R., Series Editor
- Published
- 2021
- Full Text
- View/download PDF
43. Survival Analysis and Censored Data
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, Allen, G., Series Editor, De Veaux, R., Series Editor, and Nugent, R., Series Editor
- Published
- 2021
- Full Text
- View/download PDF
44. Linear Model Selection and Regularization
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, Allen, G., Series Editor, De Veaux, R., Series Editor, and Nugent, R., Series Editor
- Published
- 2021
- Full Text
- View/download PDF
45. Multiple Testing
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, Allen, G., Series Editor, De Veaux, R., Series Editor, and Nugent, R., Series Editor
- Published
- 2021
- Full Text
- View/download PDF
46. Moving Beyond Linearity
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, Allen, G., Series Editor, De Veaux, R., Series Editor, and Nugent, R., Series Editor
- Published
- 2021
- Full Text
- View/download PDF
47. Classification
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, Allen, G., Series Editor, De Veaux, R., Series Editor, and Nugent, R., Series Editor
- Published
- 2021
- Full Text
- View/download PDF
48. Resampling Methods
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, Allen, G., Series Editor, De Veaux, R., Series Editor, and Nugent, R., Series Editor
- Published
- 2021
- Full Text
- View/download PDF
49. Statistical Learning
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, Allen, G., Series Editor, De Veaux, R., Series Editor, and Nugent, R., Series Editor
- Published
- 2021
- Full Text
- View/download PDF
50. Linear Regression
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, Allen, G., Series Editor, De Veaux, R., Series Editor, and Nugent, R., Series Editor
- Published
- 2021
- Full Text
- View/download PDF