497 results for "Witten, Daniela"
Search Results
2. Discussion of 'Data fission: splitting a single data point'
- Author
-
Neufeld, Anna, Dharamshi, Ameer, Gao, Lucy L., Witten, Daniela, and Bien, Jacob
- Subjects
Statistics - Methodology
- Abstract
Leiner et al. [2023] introduce an important generalization of sample splitting, which they call data fission. They consider two cases of data fission: P1 fission and P2 fission. While P1 fission is extremely useful and easy to use, Leiner et al. [2023] provide P1 fission operations only for the Gaussian and the Poisson distributions. They provide little guidance on how to apply P2 fission operations in practice, leaving the reader unsure of how to apply data fission outside of the Gaussian and Poisson settings. In this discussion, we describe how our own work provides P1 fission operations in a wide variety of families and offers insight into when P1 fission is possible. We also provide guidance on how to actually apply P2 fission in practice, with a special focus on logistic regression. Finally, we interpret P2 fission as a remedy for distributional misspecification when carrying out P1 fission operations., Comment: 18 pages, 1 figure
- Published
- 2024
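As a concrete illustration of the Gaussian P1 fission operation discussed in this entry, the sketch below (mine, not the authors' code; the function name and the default tuning parameter τ are illustrative) splits X ~ N(μ, 1) into two independent pieces using external Gaussian noise:

```python
# Minimal sketch of Gaussian P1 fission: for X ~ N(mu, 1) and independent
# Z ~ N(0, 1), the pair f(X) = X + tau*Z and g(X) = X - Z/tau is jointly
# Gaussian with zero covariance, hence independent.
import numpy as np

rng = np.random.default_rng(0)

def gaussian_fission(x, tau=1.0, rng=rng):
    """Split N(mu, 1) observations into two independent Gaussian pieces."""
    z = rng.standard_normal(np.shape(x))
    return x + tau * z, x - z / tau

mu = 3.0
x = mu + rng.standard_normal(100_000)
f, g = gaussian_fission(x)

# Empirically: f and g are uncorrelated, and both are centered at mu.
corr = np.corrcoef(f, g)[0, 1]
```

Both pieces carry information about μ, so one can be used for model selection and the other held out for inference.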
3. Infer-and-widen versus split-and-condition: two tales of selective inference
- Author
-
Perry, Ronan, Xu, Zichun, McGough, Olivia, and Witten, Daniela
- Subjects
Statistics - Methodology
- Abstract
Recent attention has focused on the development of methods for post-selection inference. However, the connections between these methods, and the extent to which one might be preferred to another, remain unclear. In this paper, we classify existing methods for post-selection inference into one of two frameworks: infer-and-widen or split-and-condition. The infer-and-widen framework produces confidence intervals whose midpoints are biased due to selection, and must be wide enough to account for this bias. By contrast, split-and-condition directly adjusts the intervals' midpoints to account for selection. We compare the two frameworks in three vignettes: the winner's curse, maximal contrasts, and inference after the lasso. Our results are striking: in each of these examples, a split-and-condition strategy leads to confidence intervals that are much narrower than the state-of-the-art infer-and-widen proposal, when methods are tuned to yield identical selection events. Furthermore, even an "oracle" infer-and-widen confidence interval -- the narrowest possible interval that could be theoretically attained via infer-and-widen -- is not necessarily narrower than a feasible split-and-condition method. Taken together, these results point to split-and-condition as the most promising framework for post-selection inference in real-world settings.
- Published
- 2024
4. Inference on the proportion of variance explained in principal component analysis
- Author
-
Perry, Ronan, Panigrahi, Snigdha, Bien, Jacob, and Witten, Daniela
- Subjects
Statistics - Methodology
- Abstract
Principal component analysis (PCA) is a longstanding and well-studied approach for dimension reduction. It rests upon the assumption that the underlying signal in the data has low rank, and thus can be well-summarized using a small number of dimensions. The output of PCA is typically represented using a scree plot, which displays the proportion of variance explained (PVE) by each principal component. While the PVE is extensively reported in routine data analyses, to the best of our knowledge the notion of inference on the PVE remains unexplored. In this paper, we consider inference on the PVE. We first introduce a new population quantity for the PVE with respect to an unknown matrix mean. Critically, our interest lies in the PVE of the sample principal components (as opposed to unobserved population principal components); thus, the population PVE that we introduce is defined conditional on the sample singular vectors. We show that it is possible to conduct inference, in the sense of confidence intervals, p-values, and point estimates, on this population quantity. Furthermore, we can conduct valid inference on the PVE of a subset of the principal components, even when the subset is selected using a data-driven approach such as the elbow rule. We demonstrate the proposed approach in simulation and in an application to a gene expression dataset.
- Published
- 2024
5. Negative binomial count splitting for single-cell RNA sequencing data
- Author
-
Neufeld, Anna, Popp, Joshua, Gao, Lucy L., Battle, Alexis, and Witten, Daniela
- Subjects
Statistics - Methodology, Statistics - Applications
- Abstract
The analysis of single-cell RNA sequencing (scRNA-seq) data often involves fitting a latent variable model to learn a low-dimensional representation for the cells. Validating such a model poses a major challenge. If we could sequence the same set of cells twice, we could use one dataset to fit a latent variable model and the other to validate it. In reality, we cannot sequence the same set of cells twice. Poisson count splitting was recently proposed as a way to work backwards from a single observed Poisson data matrix to obtain independent Poisson training and test matrices that could have arisen from two independent sequencing experiments conducted on the same set of cells. However, the Poisson count splitting approach requires that the original data are exactly Poisson distributed: in the presence of any overdispersion, the resulting training and test datasets are not independent. In this paper, we introduce negative binomial count splitting, which extends Poisson count splitting to the more flexible negative binomial setting. Given an $n \times p$ dataset from a negative binomial distribution, we use Dirichlet-multinomial sampling to create two or more independent $n \times p$ negative binomial datasets. We show that this procedure outperforms Poisson count splitting in simulation, and apply it to validate clusters of kidney cells from a human fetal cell atlas.
- Published
- 2023
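The Dirichlet-multinomial splitting step described in this abstract can be sketched as follows; this is my reading of the recipe as stated (the equal fold parameters b/K and the per-entry loop are illustrative assumptions, not the authors' implementation):

```python
# Minimal sketch of negative binomial count splitting: conditional on an
# observed count X with NB size parameter b, draw fold proportions from a
# Dirichlet(b/K, ..., b/K) and allocate X multinomially. The K folds sum
# to X and, per the paper, form independent negative binomial datasets.
import numpy as np

rng = np.random.default_rng(1)

def nb_split(X, b, K=2, rng=rng):
    """Split a matrix of NB(size=b) counts into K folds that sum to X."""
    X = np.asarray(X)
    folds = np.zeros((K,) + X.shape, dtype=np.int64)
    alpha = np.full(K, b / K)
    for idx in np.ndindex(X.shape):
        p = rng.dirichlet(alpha)  # entry-specific Dirichlet-multinomial draw
        folds[(slice(None),) + idx] = rng.multinomial(X[idx], p)
    return folds

b, mu = 5.0, 10.0
# NumPy's negative_binomial(n, p) has mean n*(1-p)/p; p = b/(b+mu) gives mean mu.
X = rng.negative_binomial(n=b, p=b / (b + mu), size=(200, 30))
train, test = nb_split(X, b=b, K=2)
```

One fold can then be used to fit a latent variable model and the other to validate it, as in the abstract's motivating scRNA-seq setting.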
6. Revisiting inference after prediction
- Author
-
Motwani, Keshav and Witten, Daniela
- Subjects
Statistics - Machine Learning, Computer Science - Machine Learning
- Abstract
Recent work has focused on the very common practice of prediction-based inference: that is, (i) using a pre-trained machine learning model to predict an unobserved response variable, and then (ii) conducting inference on the association between that predicted response and some covariates. As pointed out by Wang et al. (2020), applying a standard inferential approach in (ii) does not accurately quantify the association between the unobserved (as opposed to the predicted) response and the covariates. In recent work, Wang et al. (2020) and Angelopoulos et al. (2023) propose corrections to step (ii) in order to enable valid inference on the association between the unobserved response and the covariates. Here, we show that the method proposed by Angelopoulos et al. (2023) successfully controls the type 1 error rate and provides confidence intervals with correct nominal coverage, regardless of the quality of the pre-trained machine learning model used to predict the unobserved response. However, the method proposed by Wang et al. (2020) provides valid inference only under very strong conditions that rarely hold in practice: for instance, if the machine learning model perfectly estimates the true regression function in the study population of interest.
- Published
- 2023
7. Generalized Data Thinning Using Sufficient Statistics
- Author
-
Dharamshi, Ameer, Neufeld, Anna, Motwani, Keshav, Gao, Lucy L., Witten, Daniela, and Bien, Jacob
- Subjects
Statistics - Methodology, Mathematics - Statistics Theory, Statistics - Machine Learning
- Abstract
Our goal is to develop a general strategy to decompose a random variable $X$ into multiple independent random variables, without sacrificing any information about unknown parameters. A recent paper showed that for some well-known natural exponential families, $X$ can be "thinned" into independent random variables $X^{(1)}, \ldots, X^{(K)}$, such that $X = \sum_{k=1}^K X^{(k)}$. These independent random variables can then be used for various model validation and inference tasks, including in contexts where traditional sample splitting fails. In this paper, we generalize their procedure by relaxing this summation requirement and simply asking that some known function of the independent random variables exactly reconstruct $X$. This generalization of the procedure serves two purposes. First, it greatly expands the families of distributions for which thinning can be performed. Second, it unifies sample splitting and data thinning, which on the surface seem to be very different, as applications of the same principle. This shared principle is sufficiency. We use this insight to perform generalized thinning operations for a diverse set of families.
- Published
- 2023
8. Data thinning for convolution-closed distributions
- Author
-
Neufeld, Anna, Dharamshi, Ameer, Gao, Lucy L., and Witten, Daniela
- Subjects
Statistics - Methodology, Statistics - Machine Learning
- Abstract
We propose data thinning, an approach for splitting an observation into two or more independent parts that sum to the original observation, and that follow the same distribution as the original observation, up to a (known) scaling of a parameter. This very general proposal is applicable to any convolution-closed distribution, a class that includes the Gaussian, Poisson, negative binomial, gamma, and binomial distributions, among others. Data thinning has a number of applications to model selection, evaluation, and inference. For instance, cross-validation via data thinning provides an attractive alternative to the usual approach of cross-validation via sample splitting, especially in settings in which the latter is not applicable. In simulations and in an application to single-cell RNA-sequencing data, we show that data thinning can be used to validate the results of unsupervised learning approaches, such as k-means clustering and principal components analysis, for which traditional sample splitting is unattractive or unavailable.
- Published
- 2023
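In the Poisson case (the simplest convolution-closed family), data thinning reduces to binomial splitting. A minimal sketch (my own, with an illustrative thinning fraction ε):

```python
# Minimal sketch of Poisson data thinning: if X ~ Poisson(lam), then
# X1 | X ~ Binomial(X, eps) and X2 = X - X1 are independent, with
# X1 ~ Poisson(eps * lam) and X2 ~ Poisson((1 - eps) * lam).
import numpy as np

rng = np.random.default_rng(2)

def poisson_thin(X, eps=0.5, rng=rng):
    """Split Poisson counts into two independent Poisson pieces summing to X."""
    X = np.asarray(X)
    X1 = rng.binomial(X, eps)
    return X1, X - X1

lam = 8.0
X = rng.poisson(lam, size=100_000)
X1, X2 = poisson_thin(X, eps=0.3)  # e.g., a 30/70 train/test allocation
corr = np.corrcoef(X1, X2)[0, 1]   # near zero, consistent with independence
```

Because the two pieces are independent yet sum to the original data, they can play the roles of training and test sets even when sample splitting is unavailable (e.g., for clustering or PCA).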
9. Inferring independent sets of Gaussian variables after thresholding correlations
- Author
-
Saha, Arkajyoti, Witten, Daniela, and Bien, Jacob
- Subjects
Statistics - Methodology, Statistics - Machine Learning
- Abstract
We consider testing whether a set of Gaussian variables, selected from the data, is independent of the remaining variables. We assume that this set is selected via a very simple approach that is commonly used across scientific disciplines: we select a set of variables for which the correlation with all variables outside the set falls below some threshold. Unlike other settings in selective inference, failure to account for the selection step leads, in this setting, to excessively conservative (as opposed to anti-conservative) results. Our proposed test properly accounts for the fact that the set of variables is selected from the data, and thus is not overly conservative. To develop our test, we condition on the event that the selection resulted in the set of variables in question. To achieve computational tractability, we develop a new characterization of the conditioning event in terms of the canonical correlation between the groups of random variables. In simulation studies and in the analysis of gene co-expression networks, we show that our approach has much higher power than a "naive" approach that ignores the effect of selection., Comment: 33 pages, 5 figures, 6 figure files (due to subfigures)
- Published
- 2022
10. Inference after latent variable estimation for single-cell RNA sequencing data
- Author
-
Neufeld, Anna, Gao, Lucy L., Popp, Joshua, Battle, Alexis, and Witten, Daniela
- Subjects
Statistics - Methodology, Statistics - Applications
- Abstract
In the analysis of single-cell RNA sequencing data, researchers often characterize the variation between cells by estimating a latent variable, such as cell type or pseudotime, representing some aspect of the individual cell's state. They then test each gene for association with the estimated latent variable. If the same data are used for both of these steps, then standard methods for computing p-values and confidence intervals in the second step will fail to achieve statistical guarantees such as Type 1 error control. Furthermore, approaches such as sample splitting that can be applied to solve similar problems in other settings are not applicable in this context. In this paper, we introduce count splitting, a flexible framework that allows us to carry out valid inference in this setting, for virtually any latent variable estimation technique and inference approach, under a Poisson assumption. We demonstrate the Type 1 error control and power of count splitting in a simulation study, and apply count splitting to a dataset of pluripotent stem cells differentiating to cardiomyocytes., Comment: 43 pages, 7 figures
- Published
- 2022
11. Controlling costs: Feature selection on a budget.
- Author
-
Yu, Guo, Witten, Daniela, and Bien, Jacob
- Subjects
feature cost, feature selection, multiple knockoffs, weighted false discovery proportion
- Abstract
The traditional framework for feature selection treats all features as costing the same amount. However, in reality, a scientist often has considerable discretion regarding which variables to measure, and the decision involves a tradeoff between model accuracy and cost (where cost can refer to money, time, difficulty or intrusiveness). In particular, unnecessarily including an expensive feature in a model is worse than unnecessarily including a cheap feature. We propose a procedure, which we call cheap knockoffs, for performing feature selection in a cost-conscious manner. The key idea behind our method is to force higher cost features to compete with more knockoffs than cheaper features. We derive an upper bound on the weighted false discovery proportion associated with this procedure, which corresponds to the fraction of the feature cost that is wasted on unimportant features. We prove that this bound holds simultaneously with high probability over a path of selected variable sets of increasing size. A user may thus select a set of features based, for example, on the overall budget, while knowing that no more than a particular fraction of feature cost is wasted. We investigate, through simulation and a biomedical application, the practical importance of incorporating cost considerations into the feature selection process.
- Published
- 2022
12. Selective inference for k-means clustering
- Author
-
Chen, Yiqun T. and Witten, Daniela M.
- Subjects
Statistics - Methodology, Statistics - Machine Learning
- Abstract
We consider the problem of testing for a difference in means between clusters of observations identified via k-means clustering. In this setting, classical hypothesis tests lead to an inflated Type I error rate. To overcome this problem, we take a selective inference approach. We propose a finite-sample p-value that controls the selective Type I error for a test of the difference in means between a pair of clusters obtained using k-means clustering, and show that it can be efficiently computed. We apply our proposal in simulation, and on hand-written digits data and single-cell RNA-sequencing data.
- Published
- 2022
13. Unsupervised Learning
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, and Taylor, Jonathan (Allen, G., De Veaux, R., and Nugent, R., Series Editors)
- Published
- 2023
14. Linear Model Selection and Regularization
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, and Taylor, Jonathan (Allen, G., De Veaux, R., and Nugent, R., Series Editors)
- Published
- 2023
15. Introduction
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, and Taylor, Jonathan (Allen, G., De Veaux, R., and Nugent, R., Series Editors)
- Published
- 2023
16. Resampling Methods
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, and Taylor, Jonathan (Allen, G., De Veaux, R., and Nugent, R., Series Editors)
- Published
- 2023
17. Support Vector Machines
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, and Taylor, Jonathan (Allen, G., De Veaux, R., and Nugent, R., Series Editors)
- Published
- 2023
18. Tree-Based Methods
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, and Taylor, Jonathan (Allen, G., De Veaux, R., and Nugent, R., Series Editors)
- Published
- 2023
19. Moving Beyond Linearity
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, and Taylor, Jonathan (Allen, G., De Veaux, R., and Nugent, R., Series Editors)
- Published
- 2023
20. Deep Learning
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, and Taylor, Jonathan (Allen, G., De Veaux, R., and Nugent, R., Series Editors)
- Published
- 2023
21. Multiple Testing
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, and Taylor, Jonathan (Allen, G., De Veaux, R., and Nugent, R., Series Editors)
- Published
- 2023
22. Classification
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, and Taylor, Jonathan (Allen, G., De Veaux, R., and Nugent, R., Series Editors)
- Published
- 2023
23. Statistical Learning
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, and Taylor, Jonathan (Allen, G., De Veaux, R., and Nugent, R., Series Editors)
- Published
- 2023
24. Survival Analysis and Censored Data
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, and Taylor, Jonathan (Allen, G., De Veaux, R., and Nugent, R., Series Editors)
- Published
- 2023
25. Linear Regression
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, and Taylor, Jonathan (Allen, G., De Veaux, R., and Nugent, R., Series Editors)
- Published
- 2023
26. More powerful selective inference for the graph fused lasso
- Author
-
Chen, Yiqun T., Jewell, Sean W., and Witten, Daniela M.
- Subjects
Statistics - Methodology
- Abstract
The graph fused lasso -- which includes as a special case the one-dimensional fused lasso -- is widely used to reconstruct signals that are piecewise constant on a graph, meaning that nodes connected by an edge tend to have identical values. We consider testing for a difference in the means of two connected components estimated using the graph fused lasso. A naive procedure such as a z-test for a difference in means will not control the selective Type I error, since the hypothesis that we are testing is itself a function of the data. In this work, we propose a new test for this task that controls the selective Type I error, and conditions on less information than existing approaches, leading to substantially higher power. We illustrate our approach in simulation and on datasets of drug overdose death rates and teenage birth rates in the contiguous United States. Our approach yields more discoveries on both datasets.
- Published
- 2021
27. Tree-Values: selective inference for regression trees
- Author
-
Neufeld, Anna C., Gao, Lucy L., and Witten, Daniela M.
- Subjects
Statistics - Methodology, Statistics - Machine Learning
- Abstract
We consider conducting inference on the output of the Classification and Regression Tree (CART) [Breiman et al., 1984] algorithm. A naive approach to inference that does not account for the fact that the tree was estimated from the data will not achieve standard guarantees, such as Type 1 error rate control and nominal coverage. Thus, we propose a selective inference framework for conducting inference on a fitted CART tree. In a nutshell, we condition on the fact that the tree was estimated from the data. We propose a test for the difference in the mean response between a pair of terminal nodes that controls the selective Type 1 error rate, and a confidence interval for the mean response within a single terminal node that attains the nominal selective coverage. Efficient algorithms for computing the necessary conditioning sets are provided. We apply these methods in simulation and to a dataset involving the association between portion control interventions and caloric intake.
- Published
- 2021
28. Quantifying uncertainty in spikes estimated from calcium imaging data
- Author
-
Chen, Yiqun T., Jewell, Sean W., and Witten, Daniela M.
- Subjects
Statistics - Methodology, Statistics - Applications
- Abstract
In recent years, a number of methods have been proposed to estimate the times at which a neuron spikes on the basis of calcium imaging data. However, quantifying the uncertainty associated with these estimated spikes remains an open problem. We consider a simple and well-studied model for calcium imaging data, which states that calcium decays exponentially in the absence of a spike, and instantaneously increases when a spike occurs. We wish to test the null hypothesis that the neuron did not spike -- i.e., that there was no increase in calcium -- at a particular timepoint at which a spike was estimated. In this setting, classical hypothesis tests lead to inflated Type I error, because the spike was estimated on the same data used for testing. To overcome this problem, we propose a selective inference approach. We describe an efficient algorithm to compute finite-sample p-values that control selective Type I error, and confidence intervals with correct selective coverage, for spikes estimated using a recent proposal from the literature. We apply our proposal in simulation and on calcium imaging data from the spikefinder challenge., Comment: 52 pages, 12 Figures
- Published
- 2021
29. Selective Inference for Hierarchical Clustering
- Author
-
Gao, Lucy L., Bien, Jacob, and Witten, Daniela
- Subjects
Statistics - Methodology, Statistics - Machine Learning
- Abstract
Classical tests for a difference in means control the type I error rate when the groups are defined a priori. However, when the groups are instead defined via clustering, then applying a classical test yields an extremely inflated type I error rate. Notably, this problem persists even if two separate and independent data sets are used to define the groups and to test for a difference in their means. To address this problem, in this paper, we propose a selective inference approach to test for a difference in means between two clusters. Our procedure controls the selective type I error rate by accounting for the fact that the choice of null hypothesis was made based on the data. We describe how to efficiently compute exact p-values for clusters obtained using agglomerative hierarchical clustering with many commonly-used linkages. We apply our method to simulated data and to single-cell RNA-sequencing data., Comment: Final accepted version
- Published
- 2020
30. Exponential inequalities for dependent V-statistics via random Fourier features
- Author
-
Shen, Yandi, Han, Fang, and Witten, Daniela
- Subjects
Mathematics - Statistics Theory, Mathematics - Probability
- Abstract
We establish exponential inequalities for a class of V-statistics under strong mixing conditions. Our theory is developed via a novel kernel expansion based on random Fourier features and the use of a probabilistic method. This type of expansion is new and useful for handling many notorious classes of kernels., Comment: This is the first part of the arxiv preprint (arXiv:1902.02761), and is to appear in Electronic Journal of Probability (EJP). The second part of the arxiv preprint will be submitted to a statistical journal
- Published
- 2020
31. Testing for a Change in Mean After Changepoint Detection
- Author
-
Jewell, Sean, Fearnhead, Paul, and Witten, Daniela
- Subjects
Statistics - Methodology
- Abstract
While many methods are available to detect structural changes in a time series, few procedures are available to quantify the uncertainty of these estimates post-detection. In this work, we fill this gap by proposing a new framework to test the null hypothesis that there is no change in mean around an estimated changepoint. We further show that it is possible to efficiently carry out this framework in the case of changepoints estimated by binary segmentation and its variants, $\ell_{0}$ segmentation, or the fused lasso. Our setup allows us to condition on much less information than existing approaches, which yields higher powered tests. We apply our proposals in a simulation study and on a dataset of chromosomal guanine-cytosine content. These approaches are freely available in the R package ChangepointInference at https://jewellsean.github.io/changepoint-inference/., Comment: Main text: 28 pages, 5 figures. Supplementary Materials: 15 pages, 4 figures
- Published
- 2019
32. Controlling Costs: Feature Selection on a Budget
- Author
-
Yu, Guo, Witten, Daniela, and Bien, Jacob
- Subjects
Statistics - Methodology
- Abstract
The traditional framework for feature selection treats all features as costing the same amount. However, in reality, a scientist often has considerable discretion regarding which variables to measure, and the decision involves a tradeoff between model accuracy and cost (where cost can refer to money, time, difficulty, or intrusiveness). In particular, unnecessarily including an expensive feature in a model is worse than unnecessarily including a cheap feature. We propose a procedure, which we call cheap knockoffs, for performing feature selection in a cost-conscious manner. The key idea behind our method is to force higher cost features to compete with more knockoffs than cheaper features. We derive an upper bound on the weighted false discovery proportion associated with this procedure, which corresponds to the fraction of the feature cost that is wasted on unimportant features. We prove that this bound holds simultaneously with high probability over a path of selected variable sets of increasing size. A user may thus select a set of features based, for example, on the overall budget, while knowing that no more than a particular fraction of feature cost is wasted. We investigate, through simulation and a biomedical application, the practical importance of incorporating cost considerations into the feature selection process.
- Published
- 2019
33. Machine learning techniques for mitoses classification.
- Author
-
Nofallah, Shima, Mehta, Sachin, Mercan, Ezgi, Knezevich, Stevan, May, Caitlin, Weaver, Donald, Witten, Daniela, Elmore, Joann, and Shapiro, Linda
- Subjects
Convolutional neural networks, Machine learning, Melanoma, Mitoses, Pathology, Breast Neoplasms, Female, Humans, Machine Learning, Mitosis, Neural Networks, Computer
- Abstract
BACKGROUND: Pathologists analyze biopsy material at both the cellular and structural level to determine diagnosis and cancer stage. Mitotic figures are surrogate biomarkers of cellular proliferation that can provide prognostic information; thus, their precise detection is an important factor for clinical care. Convolutional Neural Networks (CNNs) have shown remarkable performance on several recognition tasks. Utilizing CNNs for mitosis classification may aid pathologists to improve the detection accuracy. METHODS: We studied two state-of-the-art CNN-based models, ESPNet and DenseNet, for mitosis classification on six whole slide images of skin biopsies and compared their quantitative performance in terms of sensitivity, specificity, and F-score. We used raw RGB images of mitosis and non-mitosis samples with their corresponding labels as training input. In order to compare with other work, we studied the performance of these classifiers and two other architectures, ResNet and ShuffleNet, on the publicly available MITOS breast biopsy dataset and compared the performance of all four in terms of precision, recall, and F-score (which are standard for this data set), architecture, training time and inference time. RESULTS: The ESPNet and DenseNet results on our primary melanoma dataset had a sensitivity of 0.976 and 0.968, and a specificity of 0.987 and 0.995, respectively, with F-scores of 0.968 and 0.976, respectively. On the MITOS dataset, ESPNet and DenseNet showed a sensitivity of 0.866 and 0.916, and a specificity of 0.973 and 0.980, respectively. The MITOS results using DenseNet had a precision of 0.939, recall of 0.916, and F-score of 0.927. The best published result on MITOS (Saha et al. 2018) reported precision of 0.92, recall of 0.88, and F-score of 0.90. In our architecture comparisons on MITOS, we found that DenseNet beats the others in terms of F-score (DenseNet 0.927, ESPNet 0.890, ResNet 0.865, ShuffleNet 0.847) and especially recall (DenseNet 0.916, ESPNet 0.866, ResNet 0.807, ShuffleNet 0.753), while ResNet and ESPNet have much faster inference times (ResNet 6 s, ESPNet 8 s, DenseNet 31 s). ResNet is faster than ESPNet, but ESPNet has a higher F-score and recall than ResNet, making it a good compromise solution. CONCLUSION: We studied several state-of-the-art CNNs for detecting mitotic figures in whole slide biopsy images. We evaluated two CNNs on a melanoma cancer dataset and then compared four CNNs on a public breast cancer data set, using the same methodology on both. Our methodology and architecture for mitosis finding in both melanoma and breast cancer whole slide images has been thoroughly tested and is likely to be useful for finding mitoses in any whole slide biopsy images.
- Published
- 2021
34. Testing for Association in Multi-View Network Data
- Author
-
Gao, Lucy L., Witten, Daniela, and Bien, Jacob
- Subjects
Statistics - Methodology, Statistics - Machine Learning
- Abstract
In this paper, we consider data consisting of multiple networks, each comprised of a different edge set on a common set of nodes. Many models have been proposed for the analysis of such multi-view network data under the assumption that the data views are closely related. In this paper, we provide tools for evaluating this assumption. In particular, we ask: given two networks that each follow a stochastic block model, is there an association between the latent community memberships of the nodes in the two networks? To answer this question, we extend the stochastic block model for a single network view to the two-view setting, and develop a new hypothesis test for the null hypothesis that the latent community memberships in the two data views are independent. We apply our test to protein-protein interaction data from the HINT database (Das and Yu, 2012). We find evidence of a weak association between the latent community memberships of proteins defined with respect to binary interaction data and the latent community memberships of proteins defined with respect to co-complex association data. We also extend this proposal to the setting of a network with node covariates.
- Published
- 2019
35. Optimal estimation of variance in nonparametric regression with random design
- Author
-
Shen, Yandi, Gao, Chao, Witten, Daniela, and Han, Fang
- Subjects
Mathematics - Statistics Theory - Abstract
Consider the heteroscedastic nonparametric regression model with random design \begin{align*} Y_i = f(X_i) + V^{1/2}(X_i)\varepsilon_i, \quad i=1,2,\ldots,n, \end{align*} with $f(\cdot)$ and $V(\cdot)$ $\alpha$- and $\beta$-H\"older smooth, respectively. We show that the minimax rate of estimating $V(\cdot)$ under both local and global squared risks is of the order \begin{align*} n^{-\frac{8\alpha\beta}{4\alpha\beta + 2\alpha + \beta}} \vee n^{-\frac{2\beta}{2\beta+1}}, \end{align*} where $a\vee b := \max\{a,b\}$ for any two real numbers $a,b$. This result extends the fixed design rate $n^{-4\alpha} \vee n^{-2\beta/(2\beta+1)}$ derived in Wang et al. [2008] in a non-trivial manner, as indicated by the appearances of both $\alpha$ and $\beta$ in the first term. In the special case of constant variance, we show that the minimax rate is $n^{-8\alpha/(4\alpha+1)}\vee n^{-1}$ for variance estimation, which further implies the same rate for quadratic functional estimation and thus unifies the minimax rate under the nonparametric regression model with the rates under the density model and the white noise model. To achieve the minimax rate, we develop a U-statistic-based local polynomial estimator and a matching lower bound, constructed over a specified family of distributions for both $\varepsilon_i$ and $X_i$., Comment: to appear in the Annals of Statistics
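Since the displayed rate is a maximum of two vanishing terms, the effective exponent is the minimum of the two exponents; a quick numeric sketch (with `a`, `b` standing in for $\alpha$, $\beta$):

```python
# Minimax rate: n^{-e1} "v" n^{-e2}, with e1 = 8ab/(4ab + 2a + b) and
# e2 = 2b/(2b + 1), where a, b are the Holder exponents of f and V.
# Because "v" keeps the larger (slower-vanishing) rate, the effective
# exponent of the rate is min(e1, e2).
def rate_exponent(a: float, b: float) -> float:
    e1 = 8 * a * b / (4 * a * b + 2 * a + b)
    e2 = 2 * b / (2 * b + 1)
    return min(e1, e2)

print(rate_exponent(0.5, 0.5))   # second term dominates here
print(rate_exponent(1.0, 1.0))
```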
- Published
- 2019
36. Modeling microbial abundances and dysbiosis with beta-binomial regression
- Author
-
Martin, Bryan D., Witten, Daniela, and Willis, Amy D.
- Subjects
Statistics - Methodology - Abstract
Using a sample from a population to estimate the proportion of the population with a certain category label is a broadly important problem. In the context of microbiome studies, this problem arises when researchers wish to use a sample from a population of microbes to estimate the population proportion of a particular taxon, known as the taxon's relative abundance. In this paper, we propose a beta-binomial model for this task. Like existing models, our model allows for a taxon's relative abundance to be associated with covariates of interest. However, unlike existing models, our proposal also allows for the overdispersion in the taxon's counts to be associated with covariates of interest. We exploit this model in order to propose tests not only for differential relative abundance, but also for differential variability. The latter is particularly valuable in light of speculation that dysbiosis, the perturbation from a normal microbiome that can occur in certain disease conditions, may manifest as a loss of stability, or increase in variability, of the counts associated with each taxon. We demonstrate the performance of our proposed model using a simulation study and an application to soil microbial data.
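A minimal sketch of why the beta-binomial accommodates overdispersion relative to the binomial (the covariate links for the mean and dispersion proposed in the paper are omitted; all numbers below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n_trials, n_samples = 100, 50_000

# Beta-binomial draws: p_i ~ Beta(a, b), count_i ~ Binomial(n_trials, p_i).
a, b = 2.0, 8.0                        # mean relative abundance a / (a + b) = 0.2
p = rng.beta(a, b, size=n_samples)
counts = rng.binomial(n_trials, p)

# A plain Binomial(n_trials, 0.2) would have variance n * p * (1 - p) = 16;
# the beta-binomial inflates this by roughly 1 + (n - 1) / (a + b + 1) = 10.
binom_var = n_trials * 0.2 * 0.8
print(counts.var() > 1.5 * binom_var)  # overdispersion is clearly visible
```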
- Published
- 2019
37. Tail behavior of dependent V-statistics and its applications
- Author
-
Shen, Yandi, Han, Fang, and Witten, Daniela
- Subjects
Mathematics - Statistics Theory, Mathematics - Probability - Abstract
We establish exponential inequalities and Cramér-type moderate deviation theorems for a class of V-statistics under strong mixing conditions. Our theory is developed via a kernel expansion based on random Fourier features. This type of expansion is new and useful for handling many notoriously difficult classes of kernels. While the developed theory has a number of applications, we apply it to lasso-type semiparametric regression estimation and high-dimensional multiple hypothesis testing.
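The random Fourier feature expansion referred to here can be illustrated on the Gaussian kernel, following the classical Rahimi-Recht construction (a sketch of the expansion itself, not the paper's mixing-sequence setting):

```python
import numpy as np

rng = np.random.default_rng(2)
d, D = 3, 20_000                      # input dimension, number of random features

# Random Fourier features for the Gaussian kernel k(x, y) = exp(-||x - y||^2 / 2):
# z(x) = sqrt(2/D) * cos(W x + u), with rows of W drawn N(0, I) and
# u ~ Uniform[0, 2*pi]; then z(x) . z(y) approximates k(x, y).
W = rng.standard_normal((D, d))
u = rng.uniform(0.0, 2.0 * np.pi, size=D)

def z(x):
    return np.sqrt(2.0 / D) * np.cos(W @ x + u)

x, y = rng.standard_normal(d), rng.standard_normal(d)
exact = np.exp(-np.sum((x - y) ** 2) / 2.0)
approx = z(x) @ z(y)
print(abs(exact - approx) < 0.05)
```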
- Published
- 2019
38. Are Clusterings of Multiple Data Views Independent?
- Author
-
Gao, Lucy L., Bien, Jacob, and Witten, Daniela
- Subjects
Statistics - Methodology, Statistics - Machine Learning - Abstract
In the Pioneer 100 (P100) Wellness Project (Price and others, 2017), multiple types of data are collected on a single set of healthy participants at multiple timepoints in order to characterize and optimize wellness. One way to do this is to identify clusters, or subgroups, among the participants, and then to tailor personalized health recommendations to each subgroup. It is tempting to cluster the participants using all of the data types and timepoints, in order to fully exploit the available information. However, clustering the participants based on multiple data views implicitly assumes that a single underlying clustering of the participants is shared across all data views. If this assumption does not hold, then clustering the participants using multiple data views may lead to spurious results. In this paper, we seek to evaluate the assumption that there is some underlying relationship among the clusterings from the different data views, by asking the question: are the clusters within each data view dependent or independent? We develop a new test for answering this question, which we then apply to clinical, proteomic, and metabolomic data, across two distinct timepoints, from the P100 study. We find that while the subgroups of the participants defined with respect to any single data type seem to be dependent across time, the clustering among the participants based on one data type (e.g. proteomic data) appears not to be associated with the clustering based on another data type (e.g. clinical data)., Comment: 20 pages, 4 figures, 1 table (main text); 15 pages, 9 figures (supplement)
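When the two cluster labelings are treated as observed, a naive permutation test of association looks as follows; note that the paper's test goes further, accounting for the fact that the clusterings are themselves estimated from the data (a toy sketch with simulated labels):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
# Toy: two 3-cluster labelings; view 2 agrees with view 1 on ~70% of samples.
c1 = rng.integers(0, 3, size=n)
c2 = np.where(rng.random(n) < 0.7, c1, rng.integers(0, 3, size=n))

def agree(a, b):
    return np.mean(a == b)            # simple association statistic

observed = agree(c1, c2)

# Permutation null: shuffling one labeling breaks any dependence between views.
null = [agree(c1, rng.permutation(c2)) for _ in range(999)]
pval = (1 + sum(s >= observed for s in null)) / 1000
print(pval < 0.05)                    # dependence between the views is detected
```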
- Published
- 2019
39. The Multiple Random Dot Product Graph Model
- Author
-
Nielsen, Agnes Martine and Witten, Daniela
- Subjects
Statistics - Methodology - Abstract
Data in the form of graphs, or networks, arise naturally in a number of contexts; examples include social networks and biological networks. We are often faced with the availability of multiple graphs on a single set of nodes. In this article, we propose the multiple random dot product graph model for this setting. Our proposed model leads naturally to an optimization problem, which we solve using an efficient alternating minimization approach. We further use this model as the basis for a new test for the hypothesis that the graphs come from a single distribution, versus the alternative that they are drawn from different distributions. We evaluate the performance of both the fitting algorithm and the hypothesis test in several simulation settings, and demonstrate empirical improvement over existing approaches. We apply these new approaches to a Wikipedia data set and a C. elegans data set.
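A single random dot product graph can be simulated, and its edge-probability matrix recovered via a rank-d spectral decomposition, in a few lines (a sketch of the single-graph building block only, not the authors' multiple-graph model or its alternating minimization):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 400, 2

# Random dot product graph: latent positions X give edge probabilities P = X X^T.
X = rng.uniform(0.2, 0.6, size=(n, d))       # entries chosen so P stays in (0, 1)
P = X @ X.T
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1)
A = A + A.T                                  # symmetric adjacency, no self-loops

# Adjacency spectral embedding: the top-d eigenpairs recover X up to rotation.
vals, vecs = np.linalg.eigh(A)
Xhat = vecs[:, -d:] * np.sqrt(np.abs(vals[-d:]))
Phat = Xhat @ Xhat.T
print(np.linalg.norm(Phat - P) / np.linalg.norm(P) < 0.3)
```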
- Published
- 2018
40. Robust Sparse Reduced Rank Regression in High Dimensions
- Author
-
Tan, Kean Ming, Sun, Qiang, and Witten, Daniela
- Subjects
Statistics - Machine Learning, Computer Science - Machine Learning - Abstract
We propose robust sparse reduced rank regression for analyzing large and complex high-dimensional data with heavy-tailed random noise. The proposed method is based on a convex relaxation of a rank- and sparsity-constrained non-convex optimization problem, which is then solved using the alternating direction method of multipliers algorithm. We establish non-asymptotic estimation error bounds under both Frobenius and nuclear norms in the high-dimensional setting. This is a major contribution over existing results in reduced rank regression, which mainly focus on rank selection and prediction consistency. Our theoretical results quantify the tradeoff between heavy-tailedness of the random noise and statistical bias. For random noise with bounded $(1+\delta)$th moment with $\delta \in (0,1)$, the rate of convergence is a function of $\delta$, and is slower than the sub-Gaussian-type deviation bounds; for random noise with bounded second moment, we obtain a rate of convergence as if sub-Gaussian noise were assumed. Furthermore, the transition between the two regimes is smooth. We illustrate the performance of the proposed method via extensive numerical studies and a data application., Comment: This is a replacement of a previous article titled "Distributionally Robust Reduced Rank Regression and Principal Component Analysis in High Dimensions"
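The key subproblem inside ADMM for nuclear-norm-penalized objectives of this kind is singular value soft-thresholding, the proximal operator of the nuclear norm; a minimal sketch (illustrative, not the authors' full algorithm):

```python
import numpy as np

def svt(M, tau):
    """Singular value soft-thresholding: the prox of tau * (nuclear norm) at M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(5)
low_rank = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 8))  # rank 3
noisy = low_rank + 0.1 * rng.standard_normal((8, 8))                  # full rank

# Thresholding above the noise level zeroes the small singular values,
# returning a low-rank estimate.
denoised = svt(noisy, tau=1.0)
print(np.linalg.matrix_rank(denoised) <= 3)
```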
- Published
- 2018
41. Tree-Based Methods
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, Allen, G., Series Editor, De Veaux, R., Series Editor, and Nugent, R., Series Editor
- Published
- 2021
- Full Text
- View/download PDF
42. Support Vector Machines
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, Allen, G., Series Editor, De Veaux, R., Series Editor, and Nugent, R., Series Editor
- Published
- 2021
- Full Text
- View/download PDF
43. Survival Analysis and Censored Data
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, Allen, G., Series Editor, De Veaux, R., Series Editor, and Nugent, R., Series Editor
- Published
- 2021
- Full Text
- View/download PDF
44. Linear Model Selection and Regularization
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, Allen, G., Series Editor, De Veaux, R., Series Editor, and Nugent, R., Series Editor
- Published
- 2021
- Full Text
- View/download PDF
45. Multiple Testing
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, Allen, G., Series Editor, De Veaux, R., Series Editor, and Nugent, R., Series Editor
- Published
- 2021
- Full Text
- View/download PDF
46. Moving Beyond Linearity
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, Allen, G., Series Editor, De Veaux, R., Series Editor, and Nugent, R., Series Editor
- Published
- 2021
- Full Text
- View/download PDF
47. Classification
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, Allen, G., Series Editor, De Veaux, R., Series Editor, and Nugent, R., Series Editor
- Published
- 2021
- Full Text
- View/download PDF
48. Resampling Methods
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, Allen, G., Series Editor, De Veaux, R., Series Editor, and Nugent, R., Series Editor
- Published
- 2021
- Full Text
- View/download PDF
49. Statistical Learning
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, Allen, G., Series Editor, De Veaux, R., Series Editor, and Nugent, R., Series Editor
- Published
- 2021
- Full Text
- View/download PDF
50. Linear Regression
- Author
-
James, Gareth, Witten, Daniela, Hastie, Trevor, Tibshirani, Robert, Allen, G., Series Editor, De Veaux, R., Series Editor, and Nugent, R., Series Editor
- Published
- 2021
- Full Text
- View/download PDF