101 results on "van der Laan, Mark J."
Search Results
2. Efficient nonparametric inference on the effects of stochastic interventions under two‐phase sampling, with applications to vaccine efficacy trials.
- Author
- Hejazi, Nima S., van der Laan, Mark J., Janes, Holly E., Gilbert, Peter B., and Benkeser, David C.
- Subjects
- VACCINE effectiveness, VACCINE trials, HIV, BIOMARKERS, OPEN source software
- Abstract
The advent and subsequent widespread availability of preventive vaccines have altered the course of public health over the past century. Despite this success, effective vaccines to prevent many high‐burden diseases, including human immunodeficiency virus (HIV), have been slow to develop. Vaccine development can be aided by the identification of immune response markers that serve as effective surrogates for clinically significant infection or disease endpoints. However, measuring immune response marker activity is often costly, which has motivated the use of two‐phase sampling for immune response evaluation in clinical trials of preventive vaccines. In such trials, the measurement of immunological markers is performed on a subset of trial participants, where enrollment in this second phase is potentially contingent on the observed study outcome and other participant‐level information. We propose nonparametric methodology for efficiently estimating a counterfactual parameter that quantifies the impact of a given immune response marker on the subsequent probability of infection. Along the way, we fill in theoretical gaps pertaining to the asymptotic behavior of nonparametric efficient estimators in the context of two‐phase sampling, including a multiple robustness property enjoyed by our estimators. Techniques for constructing confidence intervals and hypothesis tests are presented, and an open source software implementation of the methodology, the txshift R package, is introduced. We illustrate the proposed techniques using data from a recent preventive HIV vaccine efficacy trial. [ABSTRACT FROM AUTHOR]
- Published
- 2021
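A reader's note on entry 2: the counterfactual parameter in question is, as we understand it, the mean outcome under a stochastic shift of the immune response marker. In notation of our own choosing (Y the infection outcome, A the marker, W baseline covariates, delta a shift), the usual g-computation identification of a shift-intervention parameter takes the form

```latex
% Notation ours, not necessarily the paper's: psi(delta) is the mean
% counterfactual outcome had each participant's marker A been shifted
% by delta.
\[
  \psi(\delta) \;=\; \int \mathbb{E}\,[\,Y \mid A = a + \delta,\; W = w\,]\;
  dP_{A,W}(a, w).
\]
```

Two-phase sampling enters through inverse sampling weights for the second-phase measurement of A; the txshift package named in the abstract implements the resulting estimators.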
3. Adaptive sequential surveillance with network and temporal dependence.
- Author
- Malenica, Ivana, Coyle, Jeremy R, van der Laan, Mark J, and Petersen, Maya L
- Subjects
- TIME-varying networks, COVID-19 pandemic, COMMUNICABLE diseases, COLLEGE environment, LATENT variables
- Abstract
Strategic test allocation is important for control of both emerging and existing pandemics (eg, COVID-19, HIV). It supports effective epidemic control by (1) reducing transmission via identifying cases and (2) tracking outbreak dynamics to inform targeted interventions. However, infectious disease surveillance presents unique statistical challenges. For instance, the true outcome of interest (positive infection status) is often a latent variable. In addition, presence of both network and temporal dependence reduces data to a single observation. In this work, we study an adaptive sequential design, which allows for unspecified dependence among individuals and across time. Our causal parameter is the mean latent outcome we would have obtained, if, starting at time t given the observed past, we had carried out a stochastic intervention that maximizes the outcome under a resource constraint. The key strength of the method is that we do not have to model network and time dependence: a short-term performance Online Super Learner is used to select among dependence models and randomization schemes. The proposed strategy learns the optimal choice of testing over time while adapting to the current state of the outbreak and learning across samples, through time, or both. We demonstrate the superior performance of the proposed strategy in an agent-based simulation modeling a residential university environment during the COVID-19 pandemic. [ABSTRACT FROM AUTHOR]
- Published
- 2024
4. Collaborative-controlled LASSO for constructing propensity score-based estimators in high-dimensional data.
- Author
- Ju, Cheng, van der Laan, Mark J., Wyss, Richard, Franklin, Jessica M., Schneeweiss, Sebastian, and Häggström, Jenny
- Subjects
- CAUSAL models, PROPENSITY score matching, COLLABORATIVE learning, NONSTEROIDAL anti-inflammatory agents, ELECTRONIC health records
- Abstract
Propensity score-based estimators are increasingly used for causal inference in observational studies. However, model selection for propensity score estimation in high-dimensional data has received little attention. In these settings, propensity score models have traditionally been selected based on the goodness-of-fit for the treatment mechanism itself, without consideration of the causal parameter of interest. Collaborative minimum loss-based estimation is a novel methodology for causal inference that takes into account information on the causal parameter of interest when selecting a propensity score model. This "collaborative learning" considers variable associations with both treatment and outcome when selecting a propensity score model in order to minimize a bias-variance tradeoff in the estimated treatment effect. In this study, we introduce a novel approach for collaborative model selection when using the LASSO estimator for propensity score estimation in high-dimensional covariate settings. To demonstrate the importance of selecting the propensity score model collaboratively, we designed quasi-experiments based on a real electronic healthcare database, where only the potential outcomes were manually generated, and the treatment and baseline covariates remained unchanged. Results showed that the collaborative minimum loss-based estimation algorithm outperformed other competing estimators for both point estimation and confidence interval coverage. In addition, the propensity score model selected by collaborative minimum loss-based estimation could be applied to other propensity score-based estimators, which also resulted in substantive improvement for both point estimation and confidence interval coverage. We illustrate the discussed concepts through an empirical example comparing the effects of non-selective nonsteroidal anti-inflammatory drugs with selective COX-2 inhibitors on gastrointestinal complications in a population of Medicare beneficiaries. [ABSTRACT FROM AUTHOR]
- Published
- 2019
5. Estimation of time‐specific intervention effects on continuously distributed time‐to‐event outcomes by targeted maximum likelihood estimation.
- Author
- Rytgaard, Helene C. W., Eriksson, Frank, and van der Laan, Mark J.
- Subjects
- MAXIMUM likelihood statistics, NONPARAMETRIC estimation, CAUSAL models, CAUSAL inference, STATISTICAL models, POISSON regression, FOLLICULAR lymphoma
- Abstract
This work considers targeted maximum likelihood estimation (TMLE) of treatment effects on absolute risk and survival probabilities in classical time‐to‐event settings characterized by right‐censoring and competing risks. TMLE is a general methodology combining flexible ensemble learning and semiparametric efficiency theory in a two‐step procedure for substitution estimation of causal parameters. We specialize and extend the continuous‐time TMLE methods for competing risks settings, proposing a targeting algorithm that iteratively updates cause‐specific hazards to solve the efficient influence curve equation for the target parameter. As part of the work, we further detail and implement the recently proposed highly adaptive lasso estimator for continuous‐time conditional hazards with L1‐penalized Poisson regression. The resulting estimation procedure benefits from relying solely on very mild nonparametric restrictions on the statistical model, thus providing a novel tool for machine‐learning‐based semiparametric causal inference for continuous‐time time‐to‐event data. We apply the methods to a publicly available dataset on follicular cell lymphoma where subjects are followed over time until disease relapse or death without relapse. The data display important time‐varying effects that can be captured by the highly adaptive lasso. In our simulations that are designed to imitate the data, we compare our methods to a similar approach based on random survival forests and to the discrete‐time TMLE. [ABSTRACT FROM AUTHOR]
- Published
- 2023
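For readers unfamiliar with the competing risks quantities targeted in entry 5, the standard definitions (our notation, not necessarily the paper's) connect the cause-specific hazards to the event-free survival function and the absolute risk of cause j:

```latex
\[
  S(t \mid x) = \exp\Bigl(-\sum_{j}\int_0^t \lambda_j(s \mid x)\,ds\Bigr),
  \qquad
  F_j(t \mid x) = \int_0^t S(s^{-} \mid x)\,\lambda_j(s \mid x)\,ds.
\]
```

The targeting step the abstract describes iteratively updates the estimated cause-specific hazards so that the plug-in estimate of the absolute risk solves the efficient influence curve equation.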
6. Doubly robust inference for targeted minimum loss-based estimation in randomized trials with missing outcome data.
- Author
- Díaz, Iván and van der Laan, Mark J.
- Abstract
Missing outcome data is a crucial threat to the validity of treatment effect estimates from randomized trials. The outcome distributions of participants with missing and observed data are often different, which increases bias. Causal inference methods may aid in reducing the bias and improving efficiency by incorporating baseline variables into the analysis. In particular, doubly robust estimators incorporate 2 nuisance parameters: the outcome regression and the missingness mechanism (ie, the probability of missingness conditional on treatment assignment and baseline variables), to adjust for differences in the observed and unobserved groups that can be explained by observed covariates. To consistently estimate the treatment effect, one of these nuisance parameters must be consistently estimated. Traditionally, nuisance parameters are estimated using parametric models, which often precludes consistency, particularly in moderate to high dimensions. Recent research on missing data has focused on data-adaptive estimation to help achieve consistency, but the large sample properties of such methods are poorly understood. In this article, we discuss a doubly robust estimator that is consistent and asymptotically normal under data-adaptive estimation of the nuisance parameters. We provide a formula for an asymptotically exact confidence interval under minimal assumptions. We show that our proposed estimator has smaller finite-sample bias compared to standard doubly robust estimators. We present a simulation study demonstrating the enhanced performance of our estimators in terms of bias, efficiency, and coverage of the confidence intervals. We present the results of an illustrative example: a randomized, double-blind phase 2/3 trial of antiretroviral therapy in HIV-infected persons. [ABSTRACT FROM AUTHOR]
- Published
- 2017
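A minimal sketch of the generic doubly robust (AIPW) construction described in entry 6, for the mean outcome in one arm of a randomized trial with outcomes missing at random. This illustrates the estimator class, not the authors' exact procedure; all variable names are hypothetical, and the parametric glm fits stand in for the data-adaptive nuisance estimators the paper studies.

```r
## Sketch only: y = outcome (NA if missing), a = randomized treatment (0/1),
## w = data.frame of baseline covariates -- hypothetical names throughout.
dr_mean <- function(y, a, w, trt = 1) {
  r <- as.numeric(!is.na(y))                 # observed-outcome indicator
  # Outcome regression E[Y | A = trt, W], fit among observed outcomes
  fit_q <- glm(y ~ ., data = data.frame(y = y, w),
               subset = (r == 1 & a == trt), family = gaussian())
  qhat <- predict(fit_q, newdata = w)
  # Missingness mechanism P(R = 1 | A, W)
  fit_g <- glm(r ~ ., data = data.frame(r = r, a = a, w), family = binomial())
  ghat <- predict(fit_g, type = "response")
  pa <- mean(a == trt)                       # treatment probability, known by design
  y0 <- ifelse(r == 1, y, 0)                 # filled-in values get zero weight
  # AIPW: plug-in prediction plus inverse-weighted residual correction
  mean(qhat + (a == trt) * r / (pa * ghat) * (y0 - qhat))
}
```

The estimate is consistent if either the outcome regression or the missingness model is consistently estimated, which is the double robustness the abstract refers to.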
7. Evaluating the impact of treating the optimal subgroup.
- Author
- Luedtke, Alexander R. and van der Laan, Mark J.
- Subjects
- TREATMENT effectiveness, PROGNOSIS, DATA, INDIVIDUALIZED medicine, NONPARAMETRIC statistics, EXPERIMENTAL design
- Abstract
Suppose we have a binary treatment used to influence an outcome. Given data from an observational or controlled study, we wish to determine whether or not there exists some subset of observed covariates in which the treatment is more effective than the standard practice of no treatment. Furthermore, we wish to quantify the improvement in population mean outcome that will be seen if this subgroup receives treatment and the rest of the population remains untreated. We show that this problem is surprisingly challenging given how often it is an (at least implicit) study objective. Blindly applying standard techniques fails to yield any apparent asymptotic results, while using existing techniques to confront the non-regularity does not necessarily help at distributions where there is no treatment effect. Here, we describe an approach to estimate the impact of treating the subgroup which benefits from treatment that is valid in a nonparametric model and is able to deal with the case where there is no treatment effect. The approach is a slight modification of an approach that recently appeared in the individualized medicine literature. [ABSTRACT FROM AUTHOR]
- Published
- 2017
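The estimand in entry 7 can be written compactly (our notation): with conditional average treatment effect B(v) and optimal subgroup rule d*(v), the impact of treating the optimal subgroup is

```latex
\[
  B(v) = \mathbb{E}[Y(1) - Y(0) \mid V = v], \qquad
  d^{*}(v) = \mathbb{1}\{B(v) > 0\},
\]
\[
  \Psi = \mathbb{E}\bigl[d^{*}(V)\,B(V)\bigr]
       = \mathbb{E}\bigl[\max\{B(V),\, 0\}\bigr].
\]
```

The max makes this parameter non-differentiable wherever B(v) = 0, which is exactly the non-regularity at distributions with no treatment effect that the abstract highlights.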
8. Efficient targeted learning of heterogeneous treatment effects for multiple subgroups.
- Author
- Wei, Waverly, Petersen, Maya, van der Laan, Mark J, Zheng, Zeyu, Wu, Chong, and Wang, Jingshen
- Subjects
- DISEASE risk factors, MEDICAL sciences
- Abstract
In biomedical science, analyzing treatment effect heterogeneity plays an essential role in assisting personalized medicine. The main goals of analyzing treatment effect heterogeneity include estimating treatment effects in clinically relevant subgroups and predicting whether a patient subpopulation might benefit from a particular treatment. Conventional approaches often evaluate the subgroup treatment effects via parametric modeling and can thus be susceptible to model mis‐specifications. In this paper, we take a model‐free semiparametric perspective and aim to efficiently evaluate the heterogeneous treatment effects of multiple subgroups simultaneously under the one‐step targeted maximum‐likelihood estimation (TMLE) framework. When the number of subgroups is large, we further expand this path of research by looking at a variation of the one‐step TMLE that is robust to the presence of small estimated propensity scores in finite samples. From our simulations, our method demonstrates substantial finite sample improvements compared to conventional methods. In a case study, our method unveils the potential treatment effect heterogeneity of rs12916‐T allele (a proxy for statin usage) in decreasing Alzheimer's disease risk. [ABSTRACT FROM AUTHOR]
- Published
- 2023
9. Defining and estimating effects in cluster randomized trials: A methods comparison.
- Author
- Benitez, Alejandra, Petersen, Maya L., van der Laan, Mark J., Santos, Nicole, Butrick, Elizabeth, Walker, Dilys, Ghosh, Rakesh, Otieno, Phelgona, Waiswa, Peter, and Balzer, Laura B.
- Subjects
- CLUSTER randomized controlled trials, MAXIMUM likelihood statistics, GENERALIZED estimating equations, PREMATURE labor, COMMUNITIES
- Abstract
Across research disciplines, cluster randomized trials (CRTs) are commonly implemented to evaluate interventions delivered to groups of participants, such as communities and clinics. Despite advances in the design and analysis of CRTs, several challenges remain. First, there are many possible ways to specify the causal effect of interest (eg, at the individual‐level or at the cluster‐level). Second, the theoretical and practical performance of common methods for CRT analysis remain poorly understood. Here, we present a general framework to formally define an array of causal effects in terms of summary measures of counterfactual outcomes. Next, we provide a comprehensive overview of CRT estimators, including the t‐test, generalized estimating equations (GEE), augmented‐GEE, and targeted maximum likelihood estimation (TMLE). Using finite sample simulations, we illustrate the practical performance of these estimators for different causal effects and when, as commonly occurs, there are limited numbers of clusters of different sizes. Finally, our application to data from the Preterm Birth Initiative (PTBi) study demonstrates the real‐world impact of varying cluster sizes and targeting effects at the cluster‐level or at the individual‐level. Specifically, the relative effect of the PTBi intervention was 0.81 at the cluster‐level, corresponding to a 19% reduction in outcome incidence, and was 0.66 at the individual‐level, corresponding to a 34% reduction in outcome risk. Given its flexibility to estimate a variety of user‐specified effects and ability to adaptively adjust for covariates for precision gains while maintaining Type‐I error control, we conclude TMLE is a promising tool for CRT analysis. [ABSTRACT FROM AUTHOR]
- Published
- 2023
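The cluster-level versus individual-level distinction in entry 9 comes down to how clusters of unequal size are weighted. A toy illustration (column names hypothetical):

```r
## d: one row per participant, with columns cluster (id) and y (outcome).
cluster_vs_individual <- function(d) {
  cluster_means <- tapply(d$y, d$cluster, mean)
  c(cluster_level    = mean(cluster_means), # every cluster weighted equally
    individual_level = mean(d$y))           # every participant weighted equally
}
```

With equal cluster sizes the two targets coincide; with unequal sizes they can differ substantially, which is how the PTBi analysis can report a 19% reduction at the cluster level but a 34% reduction at the individual level.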
10. Measuring changes in transmission of neglected tropical diseases, malaria, and enteric pathogens from quantitative antibody levels.
- Author
- Arnold, Benjamin F., van der Laan, Mark J., Hubbard, Alan E., Steel, Cathy, Kubofcik, Joseph, Hamlin, Katy L., Moss, Delynn M., Nutman, Thomas B., Priest, Jeffrey W., and Lammie, Patrick J.
- Subjects
- TROPICAL medicine, MALARIA, PATHOGENIC microorganisms, IMMUNOGLOBULINS, IMMUNOLOGY
- Abstract
Background: Serological antibody levels are a sensitive marker of pathogen exposure, and advances in multiplex assays have created enormous potential for large-scale, integrated infectious disease surveillance. Most methods to analyze antibody measurements reduce quantitative antibody levels to seropositive and seronegative groups, but this can be difficult for many pathogens and may provide lower resolution information than quantitative levels. Analysis methods have predominantly maintained a single disease focus, yet integrated surveillance platforms would benefit from methodologies that work across diverse pathogens included in multiplex assays. Methods/Principal findings: We developed an approach to measure changes in transmission from quantitative antibody levels that can be applied to diverse pathogens of global importance. We compared age-dependent immunoglobulin G curves in repeated cross-sectional surveys between populations with differences in transmission for multiple pathogens, including: lymphatic filariasis (Wuchereria bancrofti) measured before and after mass drug administration on Mauke, Cook Islands, malaria (Plasmodium falciparum) before and after a combined insecticide and mass drug administration intervention in the Garki project, Nigeria, and enteric protozoans (Cryptosporidium parvum, Giardia intestinalis, Entamoeba histolytica), bacteria (enterotoxigenic Escherichia coli, Salmonella spp.), and viruses (norovirus groups I and II) in children living in Haiti and the USA. Age-dependent antibody curves fit with ensemble machine learning followed a characteristic shape across pathogens that aligned with predictions from basic mechanisms of humoral immunity. Differences in pathogen transmission led to shifts in fitted antibody curves that were remarkably consistent across pathogens, assays, and populations. Mean antibody levels correlated strongly with traditional measures of transmission intensity, such as the entomological inoculation rate for P. falciparum (Spearman’s rho = 0.75). In both high- and low transmission settings, mean antibody curves revealed changes in population mean antibody levels that were masked by seroprevalence measures because changes took place above or below the seropositivity cutoff. Conclusions/Significance: Age-dependent antibody curves and summary means provided a robust and sensitive measure of changes in transmission, with greatest sensitivity among young children. The method generalizes to pathogens that can be measured in high-throughput, multiplex serological assays, and scales to surveillance activities that require high spatiotemporal resolution. Our results suggest quantitative antibody levels will be particularly useful to measure differences in exposure for pathogens that elicit a transient antibody response or for monitoring populations with very high- or very low transmission, when seroprevalence is less informative. The approach represents a new opportunity to conduct integrated serological surveillance for neglected tropical diseases, malaria, and other infectious diseases with well-defined antigen targets. [ABSTRACT FROM AUTHOR]
- Published
- 2017
11. Adaptive pre-specification in randomized trials with and without pair-matching.
- Author
- Balzer, Laura B., van der Laan, Mark J., and Petersen, Maya L.
- Subjects
- HIV prevention, CLINICAL trials, EXPERIMENTAL design, PROBABILITY theory, RESEARCH funding
- Abstract
In randomized trials, adjustment for measured covariates during the analysis can reduce variance and increase power. To avoid misleading inference, the analysis plan must be pre-specified. However, it is often unclear a priori which baseline covariates (if any) should be adjusted for in the analysis. Consider, for example, the Sustainable East Africa Research in Community Health (SEARCH) trial for HIV prevention and treatment. There are 16 matched pairs of communities and many potential adjustment variables, including region, HIV prevalence, male circumcision coverage, and measures of community-level viral load. In this paper, we propose a rigorous procedure to data-adaptively select the adjustment set, which maximizes the efficiency of the analysis. Specifically, we use cross-validation to select from a pre-specified library the candidate targeted maximum likelihood estimator (TMLE) that minimizes the estimated variance. For further gains in precision, we also propose a collaborative procedure for estimating the known exposure mechanism. Our small sample simulations demonstrate the promise of the methodology to maximize study power, while maintaining nominal confidence interval coverage. We show how our procedure can be tailored to the scientific question (intervention effect for the study sample vs. for the target population) and study design (pair-matched or not). Copyright © 2016 John Wiley & Sons, Ltd. [ABSTRACT FROM AUTHOR]
- Published
- 2016
12. Consistent causal effect estimation under dual misspecification and implications for confounder selection procedures.
- Author
- Gruber, Susan and van der Laan, Mark J.
- Subjects
- CAUSAL models, ESTIMATION theory, MAXIMUM likelihood statistics, LINEAR statistical models, MATHEMATICAL models, ATTRIBUTION (Social psychology), PROBABILITY theory, RESEARCH funding, STATISTICS, DATA analysis, STATISTICAL models
- Abstract
In a previously published article in this journal, Vansteelandt et al. [Stat Methods Med Res. Epub ahead of print 12 November 2010. DOI: 10.1177/0962280210387717] address confounder selection in the context of causal effect estimation in observational studies. They discuss several selection strategies and propose a procedure whose performance is guided by the quality of the exposure effect estimator. The authors note that when a particular linearity condition is met, consistent estimation of the target parameter can be achieved even under dual misspecification of models for the association of confounders with exposure and outcome, and demonstrate the performance of their procedure relative to other estimators when this condition holds. Our earlier published work on collaborative targeted minimum loss-based learning provides a general theoretical framework for effective confounder selection that explains the findings of Vansteelandt et al. and underscores the appropriateness of their suggestions that a confounder selection procedure should be concerned with directly targeting the quality of the estimate and that desirable estimators produce valid confidence intervals and are robust to dual misspecification. [ABSTRACT FROM AUTHOR]
- Published
- 2015
13. Nonparametric inverse‐probability‐weighted estimators based on the highly adaptive lasso.
- Author
- Ertefaie, Ashkan, Hejazi, Nima S., and van der Laan, Mark J.
- Subjects
- STATISTICAL models, CAUSAL inference, STATISTICAL weighting, SELECTION bias (Statistics), ESTIMATES
- Abstract
Inverse‐probability‐weighted estimators are the oldest and potentially most commonly used class of procedures for the estimation of causal effects. By adjusting for selection biases via a weighting mechanism, these procedures estimate an effect of interest by constructing a pseudopopulation in which selection biases are eliminated. Despite their ease of use, these estimators require the correct specification of a model for the weighting mechanism, are known to be inefficient, and suffer from the curse of dimensionality. We propose a class of nonparametric inverse‐probability‐weighted estimators in which the weighting mechanism is estimated via undersmoothing of the highly adaptive lasso, a nonparametric regression function proven to converge at a nearly $n^{-1/3}$ rate to the true weighting mechanism. We demonstrate that our estimators are asymptotically linear with variance converging to the nonparametric efficiency bound. Unlike doubly robust estimators, our procedures require neither derivation of the efficient influence function nor specification of the conditional outcome model. Our theoretical developments have broad implications for the construction of efficient inverse‐probability‐weighted estimators in large statistical models and a variety of problem settings. We assess the practical performance of our estimators in simulation studies and demonstrate use of our proposed methodology with data from a large‐scale epidemiologic study. [ABSTRACT FROM AUTHOR]
- Published
- 2023
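The estimator class in entry 13 is the classical IPW form; what is new is how the weights are fit. In our notation, for a point treatment value a:

```latex
\[
  \psi_n \;=\; \frac{1}{n}\sum_{i=1}^{n}
  \frac{\mathbb{1}\{A_i = a\}}{g_n(W_i)}\, Y_i,
  \qquad g_n(w) \approx P(A = a \mid W = w),
\]
```

where g_n is the highly adaptive lasso fit with its L1 bound deliberately relaxed (undersmoothed) beyond the cross-validated choice; this undersmoothing is what drives the estimator to asymptotic linearity and the efficiency bound without any outcome model.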
14. Discussion of Identification, Estimation and Approximation of Risk under Interventions that Depend on the Natural Value of Treatment Using Observational Data, by Jessica Young, Miguel Hernán, and James Robins.
- Author
- van der Laan, Mark J., Luedtke, Alexander R., and Díaz, Iván
- Subjects
- STOCHASTIC processes, APPROXIMATION theory
- Abstract
Young, Hernán, and Robins consider the mean outcome under a dynamic intervention that may rely on the natural value of treatment. They first identify this value with a statistical target parameter, and then show that this statistical target parameter can also be identified with a causal parameter which gives the mean outcome under a stochastic intervention. The authors then describe estimation strategies for these quantities. Here we augment the authors' insightful discussion by sharing our experiences in situations where two causal questions lead to the same statistical estimand, or the newer problem that arises in the study of data adaptive parameters, where two statistical estimands can lead to the same estimation problem. Given a statistical estimation problem, we encourage others to always use a robust estimation framework where the data generating distribution truly belongs to the statistical model. We close with a discussion of a framework which has these properties. [ABSTRACT FROM AUTHOR]
- Published
- 2014
15. Cross-Validated Loss-based Covariance Matrix Estimator Selection in High Dimensions.
- Author
- Boileau, Philippe, Hejazi, Nima S., van der Laan, Mark J., and Dudoit, Sandrine
- Subjects
- COVARIANCE matrices, MATRIX functions, SAMPLE size (Statistics)
- Abstract
The covariance matrix plays a fundamental role in many modern exploratory and inferential statistical procedures, including dimensionality reduction, hypothesis testing, and regression. In low-dimensional regimes, where the number of observations far exceeds the number of variables, the optimality of the sample covariance matrix as an estimator of this parameter is well-established. High-dimensional regimes do not admit such a convenience. Thus, a variety of estimators have been derived to overcome the shortcomings of the canonical estimator in such settings. Yet, selecting an optimal estimator from among the plethora available remains an open challenge. Using the framework of cross-validated loss-based estimation, we develop the theoretical underpinnings of just such an estimator selection procedure. We propose a general class of loss functions for covariance matrix estimation and establish accompanying finite-sample risk bounds and conditions for the asymptotic optimality of the cross-validation selector. In simulation studies, we demonstrate the optimality of our proposed selector in moderate sample sizes and across diverse data-generating processes. The practical benefits of our procedure are highlighted in a dimensionality reduction application to single-cell transcriptome sequencing data. Supplementary materials for this article are available online. [ABSTRACT FROM AUTHOR]
- Published
- 2023
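A simplified sketch of cross-validated loss-based selection among candidate covariance estimators, in the spirit of entry 15 (not the paper's full class of losses or candidates; the shrinkage grid is an arbitrary illustrative choice):

```r
## Candidates: linear shrinkage of the sample covariance toward its
## diagonal, indexed by a hypothetical grid of shrinkage weights.
cv_cov_select <- function(x, alphas = seq(0, 1, by = 0.1), V = 5) {
  n <- nrow(x)
  folds <- sample(rep(1:V, length.out = n))
  risk <- sapply(alphas, function(alpha) {
    mean(sapply(1:V, function(v) {
      s_train <- cov(x[folds != v, , drop = FALSE])
      s_test  <- cov(x[folds == v, , drop = FALSE])
      est <- (1 - alpha) * s_train + alpha * diag(diag(s_train))
      sum((est - s_test)^2)   # Frobenius-type loss against held-out data
    }))
  })
  alphas[which.min(risk)]
}
```

Here the held-out sample covariance serves as the unbiased target in a Frobenius-type loss, so the cross-validated risk ranks candidates correctly up to an additive constant; the paper develops the general theory justifying selectors of this form.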
16. A generalization of moderated statistics to data adaptive semiparametric estimation in high-dimensional biology.
- Author
- Hejazi, Nima S, Boileau, Philippe, van der Laan, Mark J, and Hubbard, Alan E
- Subjects
- COMPUTATIONAL biology, BIOAVAILABILITY, BIOLOGY, GENERALIZATION, PARAMETRIC modeling, SMOKING statistics, BONFERRONI correction
- Abstract
The widespread availability of high-dimensional biological data has made the simultaneous screening of many biological characteristics a central problem in computational and high-dimensional biology. As the dimensionality of datasets continues to grow, so too does the complexity of identifying biomarkers linked to exposure patterns. The statistical analysis of such data often relies upon parametric modeling assumptions motivated by convenience, inviting opportunities for model misspecification. While estimation frameworks incorporating flexible, data adaptive regression strategies can mitigate this, their standard variance estimators are often unstable in high-dimensional settings, resulting in inflated Type-I error even after standard multiple testing corrections. We adapt a shrinkage approach compatible with parametric modeling strategies to semiparametric variance estimators of a family of efficient, asymptotically linear estimators of causal effects, defined by counterfactual exposure contrasts. Augmenting the inferential stability of these estimators in high-dimensional settings yields a data adaptive approach for robustly uncovering stable causal associations, even when sample sizes are limited. Our generalized variance estimator is evaluated against appropriate alternatives in numerical experiments, and an open source R/Bioconductor package, biotmle, is introduced. The proposal is demonstrated in an analysis of high-dimensional DNA methylation data from an observational study on the epigenetic effects of tobacco smoking. [ABSTRACT FROM AUTHOR]
- Published
- 2023
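The shrinkage in entry 16 adapts the familiar moderated-statistics (limma-style) variance update to influence-curve-based variance estimators. In the standard form (our notation), the estimator-specific variance estimate for biomarker j is shrunk toward a pooled value:

```latex
\[
  \tilde{\sigma}_j^2 \;=\; \frac{d_0\,\sigma_0^2 + d_j\,\hat{\sigma}_j^2}{d_0 + d_j},
\]
```

with the prior quantities (d_0, sigma_0^2) estimated by pooling across all biomarkers. The moderated variances then replace the raw influence-curve variances in the test statistics, stabilizing inference when the sample size is small relative to the dimension.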
17. Subsemble: an ensemble method for combining subset-specific algorithm fits.
- Author
- Sapp, Stephanie, van der Laan, Mark J., and Canny, John
- Subjects
- SET theory, STATISTICAL ensembles, COMPUTER simulation, MACHINE learning, BIG data
- Abstract
Ensemble methods using the same underlying algorithm trained on different subsets of observations have recently received increased attention as practical prediction tools for massive data sets. We propose Subsemble: a general subset ensemble prediction method, which can be used for small, moderate, or large data sets. Subsemble partitions the full data set into subsets of observations, fits a specified underlying algorithm on each subset, and uses a clever form of V-fold cross-validation to output a prediction function that combines the subset-specific fits. We give an oracle result that provides a theoretical performance guarantee for Subsemble. Through simulations, we demonstrate that Subsemble can be a beneficial tool for small- to moderate-sized data sets, and often has better prediction performance than the underlying algorithm fit just once on the full data set. We also describe how to include Subsemble as a candidate in a SuperLearner library, providing a practical way to evaluate the performance of Subsemble relative to the underlying algorithm fit just once on the full data set. [ABSTRACT FROM AUTHOR]
- Published
- 2014
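A toy version of the Subsemble recipe in entry 17: partition rows into J subsets, fit the same learner on each, and learn combination weights by V-fold cross-validation. The sketch below uses lm() for both the subset fits and the weight learning for self-containment; it is not the reference implementation.

```r
## x: data.frame of predictors, y: numeric outcome (hypothetical inputs).
subsemble_toy <- function(x, y, J = 3, V = 5) {
  n <- nrow(x)
  part  <- sample(rep(1:J, length.out = n))   # subset assignment
  folds <- sample(rep(1:V, length.out = n))   # CV fold assignment
  z <- matrix(NA_real_, n, J)                 # CV predictions per subset fit
  for (v in 1:V) {
    for (j in 1:J) {
      tr  <- which(part == j & folds != v)
      fit <- lm(y ~ ., data = data.frame(y = y[tr], x[tr, , drop = FALSE]))
      z[folds == v, j] <- predict(fit, newdata = x[folds == v, , drop = FALSE])
    }
  }
  weights <- coef(lm(y ~ z - 1))              # combine subset-specific fits
  fits <- lapply(1:J, function(j)
    lm(y ~ ., data = data.frame(y = y[part == j], x[part == j, , drop = FALSE])))
  list(weights = weights, fits = fits)
}
```

The oracle result in the paper guarantees that such a cross-validated combination performs, up to a typically negligible term, as well as the best possible combination of the subset-specific fits.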
18. An Application of Targeted Maximum Likelihood Estimation to the Meta-Analysis of Safety Data.
- Author
- Gruber, Susan and van der Laan, Mark J.
- Subjects
- META-analysis, CLINICAL trials, MAXIMUM likelihood statistics, SIMULATION methods & models, MORTALITY
- Abstract
Safety analysis to estimate the effect of a treatment on an adverse event poses a challenging statistical problem even in randomized controlled trials because these events are typically rare, so studies originally powered for efficacy are underpowered for safety outcomes. A meta-analysis of data pooled across multiple studies may increase power, but missingness in the outcome or failed randomization can introduce bias. This article illustrates how targeted maximum likelihood estimation (TMLE) can be applied in a meta-analysis to reduce bias in causal effect estimates, and compares performance with other estimators in the literature. A simulation study in which missingness in the outcome is at random or completely at random highlights the differences in estimators with respect to the potential gains in bias and efficiency. Risk difference, relative risk, and odds ratio of the effect of treatment on 30-day mortality are estimated from data from eight randomized controlled trials. When an outcome event is rare there may be little opportunity to improve efficiency, and associations between covariates and the outcome may be hard to detect. TMLE attempts to exploit the available information to either meet or exceed the performance of a less sophisticated estimator. [ABSTRACT FROM AUTHOR]
- Published
- 2013
19. Comparison of Targeted Maximum Likelihood and Shrinkage Estimators of Parameters in Gene Networks.
- Author
- Geeven, Geert, van der Laan, Mark J., and de Gunst, Mathisca C.M.
- Subjects
- GENE regulatory networks, GENE expression, GENETIC regulation, GENES, NUCLEOTIDE sequence, GENETIC engineering
- Abstract
Gene regulatory networks, in which edges between nodes describe interactions between transcription factors (TFs) and their target genes, model regulatory interactions that determine the cell-type and condition-specific expression of genes. Regression methods can be used to identify TF-target gene interactions from gene expression and DNA sequence data. The response variable, i.e., observed gene expression, is modeled as a function of many predictor variables simultaneously. In practice, it is generally not possible to select a single model that clearly achieves the best fit to the observed experimental data, and the selected models typically contain overlapping sets of predictor variables. Moreover, parameters that represent the marginal effect of the individual predictors are not always present. In this paper, we use the statistical framework of estimation of variable importance to define variable importance as a parameter of interest and study two different estimators of this parameter in the context of gene regulatory networks. On yeast data we show that the resulting parameter has a biologically appealing interpretation. We apply the proposed methodology to mammalian gene expression data to gain insight into the temporal activity of TFs that underlie gene expression changes in F11 cells in response to Forskolin stimulation. [ABSTRACT FROM AUTHOR]
- Published
- 2012
20. Repeated Measures Semiparametric Regression Using Targeted Maximum Likelihood Methodology with Application to Transcription Factor Activity Discovery.
- Author
- Tuglus, Catherine and van der Laan, Mark J.
- Subjects
- ESTIMATION theory, TRANSCRIPTION factors, SIMULATION methods & models, STATISTICAL correlation, CELL cycle, NUCLEOTIDES, GENETIC polymorphisms
- Abstract
In longitudinal and repeated measures data analysis, often the goal is to determine the effect of a treatment or aspect on a particular outcome (e.g., disease progression). We consider a semiparametric repeated measures regression model, where the parametric component models effect of the variable of interest and any modification by other covariates. The expectation of this parametric component over the other covariates is a measure of variable importance. Here, we present a targeted maximum likelihood estimator of the finite dimensional regression parameter, which is easily estimated using standard software for generalized estimating equations. The targeted maximum likelihood method provides double robust and locally efficient estimates of the variable importance parameters and inference based on the influence curve. We demonstrate these properties through simulation under correct and incorrect model specification, and apply our method in practice to estimating the activity of transcription factor (TF) over cell cycle in yeast. We specifically target the importance of SWI4, SWI6, MBP1, MCM1, ACE2, FKH2, NDD1, and SWI5. The semiparametric model allows us to determine the importance of a TF at specific time points by specifying time indicators as potential effect modifiers of the TF. Our results are promising, showing significant importance trends during the expected time periods. This methodology can also be used as a variable importance analysis tool to assess the effect of a large number of variables such as gene expressions or single nucleotide polymorphisms. [ABSTRACT FROM AUTHOR]
- Published
- 2011
21. Using Regression Models to Analyze Randomized Trials: Asymptotically Valid Hypothesis Tests Despite Incorrectly Specified Models.
- Author
- Rosenblum, Michael and van der Laan, Mark J.
- Subjects
- REGRESSION analysis, CAUSATION (Philosophy), CLINICAL trials, MATHEMATICAL models, STATISTICAL hypothesis testing
- Abstract
Regression models are often used to test for cause-effect relationships from data collected in randomized trials or experiments. This practice has deservedly come under heavy scrutiny, because commonly used models such as linear and logistic regression will often not capture the actual relationships between variables, and incorrectly specified models potentially lead to incorrect conclusions. In this article, we focus on hypothesis tests of whether the treatment given in a randomized trial has any effect on the mean of the primary outcome, within strata of baseline variables such as age, sex, and health status. Our primary concern is ensuring that such hypothesis tests have correct type I error for large samples. Our main result is that for a surprisingly large class of commonly used regression models, standard regression-based hypothesis tests (but using robust variance estimators) are guaranteed to have correct type I error for large samples, even when the models are incorrectly specified. To the best of our knowledge, this robustness of such model-based hypothesis tests to incorrectly specified models was previously unknown for Poisson regression models and for other commonly used models we consider. Our results have practical implications for understanding the reliability of commonly used, model-based tests for analyzing randomized trials. [ABSTRACT FROM AUTHOR]
- Published
- 2009
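The practice entry 21 justifies is easy to state in code: fit the (possibly misspecified) working regression, but base the test of the treatment coefficient on a robust sandwich variance. A sketch, assuming the sandwich and lmtest packages and a hypothetical trial data frame dat with columns y, trt, age, and sex:

```r
library(sandwich)   # heteroskedasticity-robust (sandwich) variance
library(lmtest)     # coefficient tests with a user-supplied vcov

# Working model -- possibly misspecified; per the abstract, the test of
# 'trt' below retains correct large-sample type I error under randomization.
fit <- glm(y ~ trt + age + sex, family = poisson(), data = dat)
coeftest(fit, vcov = vcovHC(fit, type = "HC0"))
```

The paper's point is that the robust variance estimator, not correctness of the model, is what rescues the type I error of such tests.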
22. Modified FDR Controlling Procedure for Multi-Stage Analyses.
- Author
- Tuglus, Catherine and van der Laan, Mark J.
- Subjects
- REGRESSION analysis, MATHEMATICAL statistics, STATISTICAL hypothesis testing, ERROR rates, GENOMICS, MOLECULAR genetics
- Abstract
Multiple testing has become an integral component in genomic analyses involving microarray experiments where a large number of hypotheses are tested simultaneously. However, before applying more computationally intensive methods, it is often desirable to complete an initial truncation of the variable set using a simpler and faster supervised method such as univariate regression. Once such a truncation is completed, multiple testing methods applied to any subsequent analysis no longer control the appropriate Type I error rates. Here we propose a modified marginal Benjamini & Hochberg step-up FDR controlling procedure for multi-stage analyses (FDR-MSA), which correctly controls Type I error in terms of the entire variable set when only a subset of the initial set of variables is tested. The method is presented with respect to a variable importance application. As the initial subset size increases, we observe convergence to the standard Benjamini & Hochberg step-up FDR controlling multiple testing procedures. We demonstrate the power and Type I error control through simulation and application to the Golub Leukemia data from 1999. [ABSTRACT FROM AUTHOR]
- Published
- 2009
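A sketch of the FDR-MSA idea as we read entry 22: after a first-stage screen reduces M variables to a tested subset, apply the Benjamini-Hochberg step-up comparison with thresholds scaled by the full count M rather than the subset size, so that control is with respect to the entire variable set. (Equivalently, pad the untested variables with p-values of 1 and run standard BH.) Illustrative only, not the authors' code:

```r
fdr_msa <- function(p_subset, M, q = 0.05) {
  m   <- length(p_subset)
  ord <- order(p_subset)
  # BH step-up, but with i/M (not i/m) thresholds
  passed <- which(p_subset[ord] <= (seq_len(m) / M) * q)
  if (length(passed) == 0) return(integer(0))
  ord[seq_len(max(passed))]   # indices into p_subset that are rejected
}
```

As the subset grows to the full set, the rule reduces to standard BH, matching the convergence noted in the abstract.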
23. Analyzing sequentially randomized trials based on causal effect models for realistic individualized treatment rules.
- Author
- Bembom, Oliver and van der Laan, Mark J.
- Abstract
In this paper, we argue that causal effect models for realistic individualized treatment rules represent an attractive tool for analyzing sequentially randomized trials. Unlike a number of methods proposed previously, this approach does not rely on the assumption that intermediate outcomes are discrete or that models for the distributions of these intermediate outcomes given the observed past are correctly specified. In addition, it generalizes the methodology for performing pairwise comparisons between individualized treatment rules by allowing the user to posit a marginal structural model for all candidate treatment rules simultaneously. This is particularly useful if the number of such rules is large, in which case an approach based on individual pairwise comparisons would be likely to suffer from too much sampling variability to provide an informative answer. In addition, such causal effect models represent an interesting alternative to methods previously proposed for selecting an optimal individualized treatment rule in that they immediately give the user a sense of how the optimal outcome is estimated to change in the neighborhood of the identified optimum. We discuss an inverse-probability-of-treatment-weighted (IPTW) estimator for these causal effect models, which is straightforward to implement using standard statistical software, and develop an approach for constructing valid asymptotic confidence intervals based on the influence curve of this estimator. The methodology is illustrated in two simulation studies that are intended to mimic an HIV/AIDS trial. Copyright © 2008 John Wiley & Sons, Ltd. [ABSTRACT FROM AUTHOR]
- Published
- 2008
24. Supervised Distance Matrices.
- Author
- Pollard, Katherine S. and van der Laan, Mark J.
- Subjects
- CONCEPTS, MATRICES (Mathematics), REGRESSION analysis, GENE expression, DATA analysis
- Abstract
We introduce a novel statistical concept, called a supervised distance matrix, which quantifies pairwise similarity between variables in terms of their association with an outcome. Supervised distance matrices are derived in two stages. First, the observed data is transformed based on particular working models for association. Examples of transformations include residuals or influence curves from regression models. In the second stage, a choice of distance measure is used to compute all pairwise distances between variables in the transformed data. We present consistent estimators of the resulting distance matrix, including an inverse probability of censoring weighted estimator for use with right-censored outcomes. Supervised distance matrices can be used with standard (unsupervised) clustering algorithms to identify groups of similarly predictive variables and to discover subpopulations of related samples. This approach is illustrated using simulations and an analysis of gene expression data with a censored survival outcome. The proposed methods are widely applicable in genomics and other fields where high-dimensional data is collected on each subject. [ABSTRACT FROM AUTHOR]
- Published
- 2008
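A bare-bones illustration of the two-stage construction in entry 24. Stage one transforms the data through a working association model (here a deliberately simple per-variable regression contribution; the paper's examples include residuals and influence curves), and stage two computes pairwise distances between the transformed variables:

```r
supervised_dist <- function(x, y) {
  # Stage 1: transform each variable by its modeled association with y
  # (illustrative working-model choice, not the paper's only option)
  z <- apply(x, 2, function(xj) coef(lm(y ~ xj))[2] * xj)
  # Stage 2: pairwise distances between variables (columns of z)
  dist(t(z))
}
# e.g., hclust(supervised_dist(expr, outcome)) to cluster similarly
# predictive genes (object names hypothetical)
```

The result is a distance matrix usable with any standard clustering algorithm, so that variables group by how similarly they predict the outcome rather than by raw correlation.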
25. Statistical Methods for Analyzing Sequentially Randomized Trials.
- Author
- Bembom, Oliver and van der Laan, Mark J.
- Subjects
- THERAPEUTICS, RANDOMIZED controlled trials, MEDICAL experimentation on humans, MEDICAL research, STATISTICS
- Abstract
In this issue of the Journal, Thall et al. present the results of a clinical trial that makes use of sequential randomization, a novel trial design that allows the investigator to study adaptive treatment strategies. Our aim is to complement this groundbreaking work by reviewing the current state of the art of statistical methods available for such analyses. Using the data collected by Thall et al. as an example, we focus on two different approaches for estimating the success rates of different adaptive treatment strategies of interest. By emphasizing the intuitive appeal and straightforward implementation of these methods and illustrating the striking findings to which these methods can lead, we hope to convince the reader that this novel trial design provides a rich source of information that is made readily accessible through current analytical approaches. [ABSTRACT FROM AUTHOR]
- Published
- 2007
26. Super Learner.
- Author
- Van der Laan, Mark J., Polley, Eric C., and Hubbard, Alan E.
- Subjects
- MACHINE learning, ARTIFICIAL intelligence, FORECASTING, ALGORITHMS, REGRESSION analysis, ANALYSIS of covariance
- Abstract
When trying to learn a model for the prediction of an outcome given a set of covariates, a statistician has many estimation procedures in their toolbox. A few examples of these candidate learners are: least squares, least angle regression, random forests, and spline regression. Previous articles (van der Laan and Dudoit (2003); van der Laan et al. (2006); Sinisi et al. (2007)) theoretically validated the use of cross-validation to select an optimal learner among many candidate learners. Motivated by this use of cross-validation, we propose a new prediction method for creating a weighted combination of many candidate learners to build the super learner. This article proposes a fast algorithm for constructing a super learner in prediction which uses V-fold cross-validation to select weights to combine an initial set of candidate learners. In addition, this paper contains a practical demonstration of the adaptivity of this so-called super learner to various true data generating distributions. This approach for construction of a super learner generalizes to any parameter which can be defined as a minimizer of a loss function. [ABSTRACT FROM AUTHOR]
- Published
- 2007
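A minimal discrete super learner in the spirit of entry 26: estimate each candidate's risk by V-fold cross-validation and keep the best. The full method instead learns a convex weighted combination of the candidates; the two glm formulas and column names here are arbitrary placeholders, and the SuperLearner R package provides the real implementation.

```r
discrete_sl <- function(dat, V = 10) {
  candidates <- list(                      # hypothetical candidate library
    main        = y ~ x1 + x2,
    interaction = y ~ x1 * x2 + I(x1^2)
  )
  folds <- sample(rep(1:V, length.out = nrow(dat)))
  risk <- sapply(candidates, function(fm) {
    mean(sapply(1:V, function(v) {
      fit <- glm(fm, data = dat[folds != v, ])
      # squared-error loss on the held-out fold
      mean((dat$y[folds == v] - predict(fit, newdata = dat[folds == v, ]))^2)
    }))
  })
  glm(candidates[[which.min(risk)]], data = dat)  # refit winner on all data
}
```

The oracle results cited in the abstract guarantee that this cross-validation selector performs asymptotically as well as the best candidate in the library.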
27. A Comparison of Methods to Control Type I Errors in Microarray Studies.
- Author
- Chen, Jinsong, Van der Laan, Mark J., Hubbard, Alan E., and Smith, Martyn T.
- Subjects
- ERROR rates, STATISTICAL bootstrapping, PERMUTATIONS, ALGEBRA, METHODOLOGY
- Abstract
Microarray studies often need to simultaneously examine thousands of genes to determine which are differentially expressed. One main challenge in those studies is to find suitable multiple testing procedures that provide accurate control of the error rates of interest and meanwhile are most powerful, that is, they return the longest list of truly interesting genes among competitors. Many multiple testing methods have been developed recently for microarray data analysis, especially resampling based methods, such as permutation methods, the null-centered and scaled bootstrap (NCSB) method, and the quantile transformed-bootstrap-distribution (QTBD) method. Each of these methods has its own merits and limitations. Theoretically permutation methods can fail to provide accurate control of Type I errors when the so-called subset pivotality condition is violated. The NCSB method does not suffer from that limitation, but an impractical number of bootstrap samples are often needed to get proper control of Type I errors. The newly developed QTBD method has the virtues of providing accurate control of Type I errors under few restrictions. However, the relative practical performance of the above three types of multiple testing methods remains unresolved. This paper compares the above three resampling based methods according to the control of family wise error rates (FWER) through data simulations. Results show that among the three resampling based methods, the QTBD method provides relatively accurate and powerful control in more general circumstances. [ABSTRACT FROM AUTHOR]
- Published
- 2007
28. Locally Efficient Estimation With Bivariate Right-Censored Data.
- Author
- Quale, Christopher M., van der Laan, Mark J., and Robins, James M.
- Subjects
- STATISTICS, ESTIMATION theory, SURVIVAL analysis (Biometry), MATHEMATICAL variables, ASYMPTOTES, ALGORITHMS
- Abstract
Estimation of the survival curve for independently right-censored bivariate failure time data is a problem that has been studied extensively over the past 20 years. In this article we propose a new class of estimators for the bivariate survivor function based on locally efficient (LE) estimation. The LE estimator takes bivariate estimators Fn and Gn of the distributions of the time variables (T1, T2) and the censoring variables (C1, C2), and maps them to the resulting estimator ŜLE. If Fn and Gn are appropriate consistent estimators of F and G, then ŜLE will be nonparametrically efficient (thus the term "locally efficient"). However, if either Fn or Gn (but not both) is not a consistent estimator of F or G, then ŜLE will still be consistent and asymptotically normally distributed. We propose a locally efficient estimator that uses a consistent, nonparametric estimator for G and allows the user to supply a lower-dimensional (semiparametric or parametric) working model for F. Because the estimator that we choose for G is consistent, the resulting LE estimator will always be consistent and asymptotically normal, and our simulation studies have indicated that using a lower-dimensional model for F gives excellent small-sample performance. In addition, our algorithm for calculation of the efficient influence curve at true distributions for F and G computes the efficiency bound for the model, which can be used to calculate relative efficiencies for any bivariate estimator. In this article we introduce the LE estimator for bivariate right-censored data, present an asymptotic result, present the results of simulation studies, and perform a brief data analysis illustrating the use of the LE estimator. [ABSTRACT FROM AUTHOR]
- Published
- 2006
29. Quantile-Function Based Null Distribution in Resampling Based Multiple Testing.
- Author
- Van der Laan, Mark J. and Hubbard, Alan E.
- Subjects
- STATISTICS, HYPOTHESIS, STATISTICAL sampling, DISTRIBUTION (Probability theory), MARGINAL distributions, MATHEMATICAL statistics
- Abstract
Simultaneously testing a collection of null hypotheses about a data generating distribution based on a sample of independent and identically distributed observations is a fundamental and important statistical problem involving many applications. Methods based on marginal null distributions (i.e., marginal p-values) are attractive since the marginal p-values can be based on a user supplied choice of marginal null distributions and they are computationally trivial, but they, by necessity, are known to either be conservative or to rely on assumptions about the dependence structure between the test-statistics. Re-sampling based multiple testing (Westfall and Young, 1993) involves sampling from a joint null distribution of the test-statistics, and controlling (possibly in a, for example, step-down fashion) the user supplied type-I error rate under this joint null distribution for the test-statistics. A generally asymptotically valid null distribution avoiding the need for the subset pivotality condition for the vector of test-statistics was proposed in Pollard and van der Laan (2003) for null hypotheses about general real valued parameters. This null distribution was generalized in Dudoit, van der Laan, and Pollard (2004) to general null hypotheses and test-statistics. In ongoing recent work, van der Laan and Hubbard (2005), we propose a new generally asymptotically valid null distribution for the test-statistics and a corresponding bootstrap estimate, whose marginal distributions are user supplied, and can thus be set equal to the (most powerful) marginal null distributions one would use in univariate testing to obtain a p-value. Previously proposed null distributions either relied on a restrictive subset pivotality condition (Westfall and Young) or did not guarantee this latter property (Dudoit, van der Laan, and Pollard, 2004). It is argued and illustrated that the resulting new re-sampling based multiple testing methods provide more accurate control of the desired Type-I error in finite samples and are more powerful. We establish formal results and investigate the practical performance of this methodology in a simulation and data analysis. [ABSTRACT FROM AUTHOR]
- Published
- 2006
30. Empirical Bayes and Resampling Based Multiple Testing Procedure Controlling Tail Probability of the Proportion of False Positives.
- Author
- Van der Laan, Mark J., Birkner, Merrill D., and Hubbard, Alan E.
- Subjects
- STATISTICS, DISTRIBUTION (Probability theory), MATHEMATICS, PROBABILITY theory, METHODOLOGY
- Abstract
Simultaneously testing a collection of null hypotheses about a data generating distribution based on a sample of independent and identically distributed observations is a fundamental and important statistical problem involving many applications. In this article we propose a new re-sampling based multiple testing procedure asymptotically controlling the probability that the proportion of false positives among the set of rejections exceeds q at level alpha, where q and alpha are user supplied numbers. The procedure involves 1) specifying a conditional distribution for a guessed set of true null hypotheses, given the data, which asymptotically is degenerate at the true set of null hypotheses, and 2) specifying a generally valid null distribution for the vector of test-statistics proposed in Pollard & van der Laan (2003), and generalized in our subsequent articles Dudoit, van der Laan, & Pollard (2004), van der Laan, Dudoit, & Pollard (2004), and van der Laan, Dudoit, & Pollard (2004b). Ingredient 1) is established by fitting the empirical Bayes two component mixture model (Efron (2001b)) to the data to obtain an upper bound for marginal posterior probabilities of the null being true, given the data. We establish the finite sample rationale behind our proposal, and prove that this new multiple testing procedure asymptotically controls the desired tail probability for the proportion of false positives under general data generating distributions. In addition, we provide simulation studies establishing that this method is generally more powerful in finite samples than our previously proposed augmentation multiple testing procedure (van der Laan, Dudoit, & Pollard (2004b)) and competing procedures from the literature. Finally, we illustrate our methodology with a data analysis. [ABSTRACT FROM PUBLISHER]
- Published
- 2005
31. Choice of a null distribution in resampling-based multiple testing.
- Author
- Pollard, Katherine S. and van der Laan, Mark J.
- Subjects
- RESAMPLING (Statistics), DISTRIBUTION (Probability theory), NONPARAMETRIC statistics, MATHEMATICS
- Abstract
This paper investigates different choices of test statistic null distribution for resampling-based multiple testing in the context of single parameter hypotheses. We show that the test statistic null distribution for strongly controlling type I error may be obtained by projecting the true test statistic distribution onto the space of mean zero distributions. For common choices of test statistics, this distribution is asymptotically multivariate normal with the covariance of the vector influence curve for the parameter estimator. Applying the ordinary non-parametric or model-based bootstrap to mean zero centered test statistics produces an estimated test statistic null distribution which provides asymptotic strong control. In contrast, the usual practice of obtaining an estimated test statistic null distribution via an estimated data null distribution (e.g. null restricted bootstrap) only provides an asymptotically correct test statistic null distribution if the covariance of the vector influence curve is the same under the chosen data null distribution as under the true data distribution. This condition is the formal analogue of the subset pivotality condition (Westfall and Young, Resampling-based Multiple Testing: Examples and Methods for p-value Adjustment, Wiley, New York, 1993). We demonstrate the use of our proposed ordinary bootstrap null distribution with a single-step multiple testing method which is equivalent to constructing an error-specific confidence region for the true parameter and checking if it contains the hypothesized value. We also study the two sample problem and show that the permutation method produces an asymptotically correct null distribution if (i) the sample sizes are equal or (ii) the populations have the same covariance structure. [Copyright Elsevier]
- Published
- 2004
- Full Text
- View/download PDF
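The abstract's central recommendation, applying the ordinary bootstrap to mean-zero-centered test statistics rather than resampling from an estimated null data distribution, can be sketched as follows for one-sample t-type statistics; this is a minimal illustration assuming i.i.d. rows, not the paper's full construction.

```python
import numpy as np

def bootstrap_null_tstats(X, n_boot=2000, seed=None):
    """Estimate the test-statistic null distribution by applying the ordinary
    nonparametric bootstrap to mean-zero-centered statistics, one t-type
    statistic per column of the (n, m) data matrix X."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    null = np.empty((n_boot, m))
    for b in range(n_boot):
        Xb = X[rng.integers(0, n, n)]        # resample rows with replacement
        centered = Xb.mean(0) - X.mean(0)    # center at the observed mean, not 0
        null[b] = np.sqrt(n) * centered / Xb.std(0, ddof=1)
    return null  # rows: bootstrap draws of the mean-zero null statistic vector
```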
32. Asymptotic Optimality of Likelihood-Based Cross-Validation.
- Author
-
van der Laan, Mark J., Dudoit, Sandrine, and Keles, Sunduz
- Abstract
Likelihood-based cross-validation is a statistical tool for selecting a density estimate based on n i.i.d. observations from the true density among a collection of candidate density estimators. General examples are the selection of a model indexing a maximum likelihood estimator, and the selection of a bandwidth indexing a nonparametric (e.g. kernel) density estimator. In this article, we establish a finite sample result for a general class of likelihood-based cross-validation procedures (as indexed by the type of sample splitting used, e.g. V-fold cross-validation). This result implies that the cross-validation selector performs asymptotically as well (with respect to the Kullback-Leibler distance to the true density) as a benchmark model selector which is optimal for each given dataset and depends on the true density. Crucial conditions of our theorem are that the size of the validation sample converges to infinity, which excludes leave-one-out cross-validation, and that the candidate density estimates are bounded away from zero and infinity. We illustrate these asymptotic results and the practical performance of likelihood-based cross-validation for the purpose of bandwidth selection with a simulation study. Moreover, we use likelihood-based cross-validation in the context of regulatory motif detection in DNA sequences. [ABSTRACT FROM AUTHOR]
- Published
- 2004
- Full Text
- View/download PDF
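A minimal sketch of likelihood-based V-fold cross-validation for kernel bandwidth selection, the abstract's second general example. The clipping step reflects the stated condition that candidate density estimates be bounded away from zero, and the bandwidth grid is an arbitrary illustrative choice.

```python
import numpy as np
from scipy.stats import gaussian_kde

def cv_log_likelihood(x, bandwidth, V=5, seed=0):
    """Average validation-sample log-likelihood of a kernel density estimate
    fit on the training folds; the CV selector maximizes this over candidates."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, V)
    score = 0.0
    for v in range(V):
        val = x[folds[v]]
        train = x[np.concatenate([folds[u] for u in range(V) if u != v])]
        kde = gaussian_kde(train, bw_method=bandwidth)
        score += np.log(np.clip(kde(val), 1e-300, None)).sum()
    return score / len(x)

# selector: bandwidth (factor) maximizing the cross-validated log-likelihood
x = np.random.default_rng(1).normal(size=200)
grid = [0.1, 0.2, 0.5, 1.0]
h_cv = max(grid, key=lambda h: cv_log_likelihood(x, h))
```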
33. Deletion/Substitution/Addition Algorithm in Learning with Applications in Genomics.
- Author
-
Sinisi, Sandra E. and Van der Laan, Mark J.
- Subjects
- *
GENOMICS , *MOLECULAR genetics , *GENE expression , *POLYNOMIALS , *ALGORITHMS - Abstract
van der Laan and Dudoit (2003) provide a road map for estimation and performance assessment where a parameter of interest is defined as the risk minimizer for a suitable loss function and candidate estimators are generated using a loss function. After briefly reviewing this approach, this article proposes a general deletion/substitution/addition algorithm for minimizing, over subsets of variables (e.g., basis functions), the empirical risk of subset-specific estimators of the parameter of interest. This algorithm provides us with a new class of loss-based cross-validated algorithms in prediction of univariate outcomes, which can be extended to handle multivariate outcomes, conditional density and hazard estimation, and censored outcomes such as survival. In the context of regression, using polynomial basis functions, we study the properties of the deletion/substitution/addition algorithm in simulations and apply the method to detect transcription factor binding sites in yeast gene expression experiments. [ABSTRACT FROM AUTHOR]
- Published
- 2004
- Full Text
- View/download PDF
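The deletion/substitution/addition moves can be sketched as a greedy sweep over subsets of basis-function indices that minimizes empirical risk. The least-squares risk and the single-sweep structure below are simplifying assumptions, and the cross-validation over subset sizes described in the article is omitted.

```python
import numpy as np

def empirical_risk(X_basis, y):
    """Least-squares empirical risk of the linear fit on the selected columns."""
    beta, *_ = np.linalg.lstsq(X_basis, y, rcond=None)
    return np.mean((y - X_basis @ beta) ** 2)

def dsa_step(X, y, current):
    """One deletion/substitution/addition sweep over basis-function indices.
    current: set of column indices of X.  Returns the best neighbouring subset
    (including the current one) by empirical risk."""
    p = X.shape[1]
    moves = [current]
    moves += [current - {j} for j in current]                       # deletions
    moves += [(current - {j}) | {k} for j in current
              for k in range(p) if k not in current]                # substitutions
    moves += [current | {k} for k in range(p) if k not in current]  # additions
    return min(moves,
               key=lambda s: empirical_risk(X[:, sorted(s)], y) if s else np.inf)

# iterate dsa_step until the subset stops changing, then select the subset
# size by cross-validated risk (not shown here).
```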
34. Augmentation Procedures for Control of the Generalized Family-Wise Error Rate and Tail Probabilities for the Proportion of False Positives.
- Author
-
Van der Laan, Mark J., Dudoit, Sandrine, and Pollard, Katherine S.
- Subjects
- *
PROBABILITY theory , *ERROR analysis in mathematics , *MATHEMATICAL statistics , *STATISTICS , *MATHEMATICS - Abstract
This article shows that any single-step or stepwise multiple testing procedure (asymptotically) controlling the family-wise error rate (FWER) can be augmented into procedures that (asymptotically) control tail probabilities for the number of false positives and the proportion of false positives among the rejected hypotheses. Specifically, given any procedure that (asymptotically) controls the FWER at level alpha, we propose simple augmentation procedures that provide (asymptotic) level-alpha control of: (i) the generalized family-wise error rate, i.e., the tail probability, gFWER(k), that the number of Type I errors exceeds a user-supplied integer k, and (ii) the tail probability, TPPFP(q), that the proportion of Type I errors among the rejected hypotheses exceeds a user-supplied value 0 < q < 1.
- Published
- 2004
- Full Text
- View/download PDF
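The augmentation idea in the abstract above is simple enough to state in a few lines: starting from the rejections of any FWER-controlling procedure, additionally reject the next most significant hypotheses. A hedged sketch, assuming adjusted p-values are given; for TPPFP(q), adding at most ⌊q·r0/(1−q)⌋ rejections keeps the proportion of added rejections among all rejections at or below q, which is the arithmetic behind the second function.

```python
import numpy as np

def augment_gfwer(adj_pvals, alpha, k):
    """Augmentation for gFWER(k) control: take the rejections of an
    FWER-controlling procedure at level alpha and additionally reject the
    k next most significant hypotheses."""
    order = np.argsort(adj_pvals)
    rejected = adj_pvals <= alpha
    extra = [j for j in order if not rejected[j]][:k]
    rejected = rejected.copy()
    rejected[extra] = True
    return rejected

def augment_tppfp(adj_pvals, alpha, q):
    """Augmentation for TPPFP(q): with r0 FWER rejections, also rejecting
    floor(q*r0/(1-q)) more keeps the proportion of extra rejections <= q."""
    r0 = int(np.sum(adj_pvals <= alpha))
    return augment_gfwer(adj_pvals, alpha, int(np.floor(q * r0 / (1 - q))))
```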
35. Multiple Testing. Part II. Step-Down Procedures for Control of the Family-Wise Error Rate.
- Author
-
Van der Laan, Mark J., Dudoit, Sandrine, and Pollard, Katherine S.
- Subjects
- *
MAXIMA & minima , *STATISTICAL bootstrapping , *DISTRIBUTION (Probability theory) , *MATHEMATICS , *STATISTICS - Abstract
The present article proposes two step-down multiple testing procedures for asymptotic control of the family-wise error rate (FWER): the first procedure is based on maxima of test statistics (step-down maxT), while the second relies on minima of unadjusted p-values (step-down minP). A key feature of our approach is the characterization and construction of a test statistics null distribution (rather than data generating null distribution) for deriving cut-offs (i.e., rejection regions) for these test statistics and the resulting adjusted p-values. For general null hypotheses, corresponding to submodels for the data generating distribution, we identify an asymptotic domination condition for a null distribution under which the step-down maxT and minP procedures asymptotically control the Type I error rate, for arbitrary data generating distributions, without the need for conditions such as subset pivotality. Inspired by this general characterization, we then propose as an explicit null distribution the asymptotic distribution of the vector of null value shifted and scaled test statistics. Step-down procedures based on consistent estimators of the null distribution are shown to also provide asymptotic control of the Type I error rate. A general bootstrap algorithm is supplied to conveniently obtain consistent estimators of the null distribution. [ABSTRACT FROM AUTHOR]
- Published
- 2004
- Full Text
- View/download PDF
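A compact sketch of step-down maxT adjusted p-values computed from draws of an estimated test-statistic null distribution (for example, a bootstrap estimator of the kind the abstract supplies); the monotonicity enforcement in the loop is the step-down ingredient.

```python
import numpy as np

def stepdown_maxT_adjp(t_obs, t_null):
    """Step-down maxT adjusted p-values.  t_obs: (m,) observed statistics;
    t_null: (B, m) draws from the (estimated) test-statistic null distribution."""
    m = t_obs.size
    order = np.argsort(-np.abs(t_obs))   # most to least significant
    adjp = np.empty(m)
    prev = 0.0
    for r, j in enumerate(order):
        # maximum over the statistics not yet handled at this step
        max_rest = np.abs(t_null[:, order[r:]]).max(axis=1)
        p = np.mean(max_rest >= abs(t_obs[j]))
        prev = max(prev, p)              # enforce monotone adjusted p-values
        adjp[j] = prev
    return adjp
```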
36. Multiple Testing. Part I. Single-Step Procedures for Control of General Type I Error Rates.
- Author
-
Dudoit, Sandrine, Van der Laan, Mark J., and Pollard, Katherine S.
- Subjects
- *
STATISTICS , *PROBABILITY theory , *SCALING (Social sciences) , *SOCIAL science methodology - Abstract
The present article proposes general single-step multiple testing procedures for controlling Type I error rates defined as arbitrary parameters of the distribution of the number of Type I errors, such as the generalized family-wise error rate. A key feature of our approach is the test statistics null distribution (rather than data generating null distribution) used to derive cut-offs (i.e., rejection regions) for these test statistics and the resulting adjusted p-values. For general null hypotheses, corresponding to submodels for the data generating distribution, we identify an asymptotic domination condition for a null distribution under which single-step common-quantile and common-cut-off procedures asymptotically control the Type I error rate, for arbitrary data generating distributions, without the need for conditions such as subset pivotality. Inspired by this general characterization of a null distribution, we then propose as an explicit null distribution the asymptotic distribution of the vector of null value shifted and scaled test statistics. In the special case of family-wise error rate (FWER) control, our method yields the single-step minP and maxT procedures, based on minima of unadjusted p-values and maxima of test statistics, respectively, with the important distinction in the choice of null distribution. Single-step procedures based on consistent estimators of the null distribution are shown to also provide asymptotic control of the Type I error rate. A general bootstrap algorithm is supplied to conveniently obtain consistent estimators of the null distribution. The special cases of t- and F-statistics are discussed in detail. The companion articles focus on step-down multiple testing procedures for control of the FWER (van der Laan et al., 2004b) and on augmentations of FWER-controlling methods to control error rates such as tail probabilities for the number of false positives and for the proportion of false positives among the rejected hypotheses… [ABSTRACT FROM AUTHOR]
- Published
- 2004
- Full Text
- View/download PDF
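For comparison with the step-down variant sketched above, the single-step common-cut-off (maxT) procedure compares each observed statistic against the distribution of the maximum over all null statistics; a minimal sketch, assuming a (B, m) matrix of null draws is available.

```python
import numpy as np

def singlestep_maxT_adjp(t_obs, t_null):
    """Single-step common-cut-off (maxT) adjusted p-values: the adjusted
    p-value of hypothesis j is P(max_k |Z_k| >= |t_j|) under the null draws."""
    max_null = np.abs(t_null).max(axis=1)   # (B,) maxima under the null
    return np.array([np.mean(max_null >= abs(t)) for t in t_obs])
```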
37. Nonlinearity in Demographic and Behavioral Determinants of Morbidity.
- Author
-
Norris, Jean C., Van der Laan, Mark J., Lane, Sylvia, Anderson, James N., and Block, Gladys
- Subjects
- *
DISEASES , *DEMOGRAPHY , *HUMAN behavior , *INCOME , *HEALTH - Abstract
Objective. To examine nonlinearity of determinants of morbidity in the United States. Data Sources. A secondary analysis of data on individuals with dietary data from the Cancer Epidemiology Supplement and National Health Interview Survey (NHIS) 1987, a cross‐sectional, stratified random sample of the U.S. population (n=22,080). Study Design. A statistical exploration using additive multiple regression models. Methods. A Morbidity Index (0–30 points), derived from 1987 National Health Interview Survey data, combines number of conditions, hospitalizations, sick days, doctor visits, and degree of disability. Behavioral (health habits) variables were added to multivariate models containing demographic terms, with Morbidity Index and Self‐assessed Health outcomes (n=17,612). Tables and graphs compare models of morbidity with self‐assessed health models, with and without behavioral terms. Graphs illustrate curvilinear relationships. Principal Findings. Morbidity and health are associated nonlinearly with age, race, education, and income, as well as alcohol, diet change, vitamin supplement use, body mass index (BMI), marital status/living arrangement, and smoking. Diet change and supplement use, education, income, race/ethnicity, and age relate differently to self‐assessed health status than to morbidity. Morbidity is strongly associated with income up to about $15,000 above poverty. Additional income predicts no further reduction in morbidity. Better health is strongly related to both higher income and education. After controlling for income, black race does not predict morbidity, but remains associated with lower self‐assessed health. Conclusions. Good health habits, as captured in these models, are associated with a 10–20‐year delay in onset and progression of morbidity. [ABSTRACT FROM AUTHOR]
- Published
- 2003
- Full Text
- View/download PDF
38. A NEW PARTITIONING AROUND MEDOIDS ALGORITHM.
- Author
-
Van Der Laan, Mark J., Pollard, Katherine S., and Bryan, Jennifer
- Subjects
- *
PARTITIONS (Mathematics) , *ALGORITHMS - Abstract
Kaufman and Rousseeuw (1990) proposed the clustering algorithm Partitioning Around Medoids (PAM), which maps a distance matrix into a specified number of clusters. A particularly nice property is that PAM allows clustering with respect to any specified distance metric. In addition, the medoids are robust representations of the cluster centers, which is particularly important in the common context that many elements do not belong well to any cluster. Based on our experience in clustering gene expression data, we have noticed that PAM does have problems recognizing relatively small clusters in situations where good partitions around medoids clearly exist. In this paper, we propose to partition around medoids by maximizing the "Average Silhouette" criterion defined by Kaufman and Rousseeuw (1990). We also propose a fast-to-compute approximation of "Average Silhouette". We implement these two new partitioning around medoids algorithms and illustrate their performance relative to existing partitioning methods in simulations. [ABSTRACT FROM AUTHOR]
- Published
- 2003
- Full Text
- View/download PDF
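A hedged sketch of the proposal in the abstract above: partition around medoids by greedily swapping medoids to maximize Average Silhouette rather than total within-cluster distance. The swap search below is an illustrative implementation (it assumes a symmetric distance matrix with zero diagonal and at least two clusters), not the authors' code, and it uses the exact rather than the fast approximate criterion.

```python
import numpy as np

def average_silhouette(D, labels):
    """Average silhouette width of a clustering, from a distance matrix D."""
    n = len(labels)
    sil = np.zeros(n)
    for i in range(n):
        same = labels == labels[i]
        a = D[i, same & (np.arange(n) != i)].mean() if same.sum() > 1 else 0.0
        b = min(D[i, labels == c].mean() for c in set(labels) if c != labels[i])
        sil[i] = (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return sil.mean()

def pam_max_silhouette(D, k, seed=None):
    """Greedy swap search for k medoids maximizing average silhouette
    (instead of PAM's total within-cluster distance criterion)."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = list(rng.choice(n, k, replace=False))
    best = average_silhouette(D, np.argmin(D[:, medoids], axis=1))
    improved = True
    while improved:
        improved = False
        for mi in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids.copy()
                trial[mi] = cand
                s = average_silhouette(D, np.argmin(D[:, trial], axis=1))
                if s > best:
                    best, medoids, improved = s, trial, True
    return medoids, np.argmin(D[:, medoids], axis=1), best
```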
39. SMOOTH ESTIMATION OF A MONOTONE DENSITY.
- Author
-
Van Der Vaart, Aad W. and Van Der Laan, Mark J.
- Subjects
- *
MONOTONIC functions , *KERNEL functions - Abstract
We investigate the interplay of smoothness and monotonicity assumptions when estimating a density from a sample of observations. The nonparametric maximum likelihood estimator of a decreasing density on the positive half line attains a rate of convergence of $n^{-1/3}$ at a fixed point t if the density has a negative derivative at t. The same rate is obtained by a kernel estimator of bandwidth $n^{-1/3}$, but the limit distributions are different. If the density is both differentiable at t and known to be monotone, then a third estimator is obtained by isotonization of a kernel estimator. We show that this again attains the rate of convergence $n^{-1/3}$, and compare the limit distributions of the three types of estimators. It is shown that both isotonization and smoothing lead to a more concentrated limit distribution and we study the dependence on the proportionality constant in the bandwidth. We also show that isotonization does not change the limit behaviour of a kernel estimator with a bandwidth larger than $n^{-1/3}$, in the case that the density is known to have more than one derivative. [ABSTRACT FROM AUTHOR]
- Published
- 2003
- Full Text
- View/download PDF
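The third estimator in the abstract, isotonization of a kernel estimator, can be sketched by running the pool-adjacent-violators algorithm over a kernel density estimate evaluated on a grid; the half-normal example data and the bandwidth choice are illustrative assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde

def pava_decreasing(y):
    """Pool-adjacent-violators: least-squares antitonic (decreasing) fit to y."""
    vals, wts = [], []
    for v in y:
        vals.append(float(v))
        wts.append(1)
        # merge blocks while the decreasing constraint is violated
        while len(vals) > 1 and vals[-2] < vals[-1]:
            w = wts[-2] + wts[-1]
            vals[-2:] = [(wts[-2] * vals[-2] + wts[-1] * vals[-1]) / w]
            wts[-2:] = [w]
    return np.repeat(vals, wts)

# isotonized kernel estimator of a decreasing density on [0, inf):
x = np.abs(np.random.default_rng(1).standard_normal(500))  # half-normal sample
grid = np.linspace(0, 3, 200)
kde = gaussian_kde(x, bw_method=len(x) ** (-1 / 3))  # bandwidth factor ~ n^(-1/3)
f_iso = pava_decreasing(kde(grid))                   # decreasing version on grid
```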
40. Locally Efficient Estimation of a Multivariate Survival Function in Longitudinal Studies.
- Author
-
Van Der Laan, Mark J., Hubbard, Alan E., and Robins, James M.
- Subjects
- *
MULTIVARIATE analysis , *PROBABILITY theory , *ESTIMATION theory , *COMPUTER assisted instruction , *MODEL cars (Toys) , *HAZARDS , *STATISTICS - Abstract
We consider estimation of the joint distribution of multivariate survival times T = (T1, ..., Tk), which are subject to right-censoring by a common censoring variable C. Two estimators are proposed: an initial inverse-probability-of-censoring weighted (IPCW) estimator and a one-step estimator. Both estimators incorporate information on time-independent and time-dependent prognostic factor (covariate) data. The IPCW estimator is consistent and asymptotically normal (CAN) under coarsening at random (CAR) and a correct specification of a model for the hazard of censoring given the past covariate and failure data. The one-step estimator is a locally efficient doubly robust estimator. That is, (i) it is CAN under the assumption of CAR and either (but not necessarily both) correct specification of a model for the hazard of censoring given the past or correct specification of a model for the conditional distribution of T given past failure and covariate information, and (ii) it is efficient when both these models are correctly specified. The proposed methodology does not require that the time variables T1, ..., Tk be ordered, although our methods cover this important special case. In particular, our estimators can be used to estimate the gap time distributions associated with an ordered series of events. The proposed methodology is an improvement over currently available approaches in a number of ways. Specifically, when censoring and failure are dependent because the hazard of censoring depends on both past failure and covariate history, our one-step estimator is the only estimator with the double-robustness property. When censoring can be assumed to be independent of the failure and covariate processes, our locally efficient one-step estimator, unlike the maximum likelihood estimator (MLE) of van der Laan but like the estimators of Dabrowska, Prentice and Cai, and Bickel, does not require smoothing and so will perform well in moderate size samples even if k is large, say 7 or 8; furthermore, unlike all previous estimators, our estimator exploits the information available in past covariate as well as failure history and so will be efficient (or nearly efficient) even when the components of T are highly dependent, whenever the specified model for the conditional distribution of T given past failure and covariate information is correct (or nearly correct). We examine the finite sample performance of our estimators in a simulation study. Finally, we apply our estimators to data on time to wound excision and time to wound infection in a population of burn victims. [ABSTRACT FROM AUTHOR]
- Published
- 2002
- Full Text
- View/download PDF
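The IPCW building block underlying the initial estimator can be illustrated in the univariate case: weight each observed failure by the inverse of its estimated probability of remaining uncensored. A minimal sketch, in which the censoring-survival function G_hat is assumed to come from a separately fitted censoring model (here with no covariates, unlike the paper's general construction):

```python
import numpy as np

def ipcw_survival(t_grid, time, delta, G_hat):
    """Inverse-probability-of-censoring-weighted estimate of S(t) = P(T > t).
    time: follow-up times; delta: 1 if the failure was observed; G_hat(u):
    estimated probability of remaining uncensored through u."""
    time, delta = np.asarray(time), np.asarray(delta)
    w = delta / np.maximum(G_hat(time), 1e-12)  # weight the observed failures
    return np.array([np.mean(w * (time > t)) for t in t_grid])
```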
41. The NPMLE for Doubly Censored Current Status Data.
- Author
-
van der Laan, Mark J. and Jewell, Nicholas P.
- Subjects
- *
ESTIMATION theory , *ALGORITHMS , *MATHEMATICAL models - Abstract
Examines the non-parametric maximum likelihood estimator for doubly censored current status data. Application of doubly censored current status data to AIDS partner studies; Distribution of interest in the current status model; Use of an iterative weighted pool-adjacent-violators algorithm to compute the estimator.
- Published
- 2001
- Full Text
- View/download PDF
42. The nonparametric maximum likelihood estimator in a class of doubly censored current status data models with application to partner studies.
- Author
-
Van Der Laan, Mark J. and Andrews, Chris
- Subjects
- *
ALGORITHMS , *HIV , *HETEROSEXUALS , *CONFIDENCE intervals , *STATISTICAL sampling - Abstract
The California Partners' Study is an ongoing investigation of heterosexual HIV-transmission in partners of infected index cases (Padian et al., 1987; Shiboski & Jewell, 1990). We consider a nonparametric maximum likelihood estimator of the distribution of the time T the partner is exposed to an infected index partner until infection based on doubly censored current status data. We provide a modified iterative weighted pool adjacent violator algorithm for computation of the nonparametric maximum likelihood estimator and show that the estimator converges at rate $n^{-1/3}$. In addition, we propose confidence intervals for smooth functionals of the distribution of T. Simulations show good performance of the algorithm and confidence intervals. We apply our methodology to the California Partners' Study and discuss the implications of our results for doubly censored current status data models with other known distributions of the origin. [ABSTRACT FROM PUBLISHER]
- Published
- 2000
- Full Text
- View/download PDF
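The plain current-status building block that the modified iterative weighted pool-adjacent-violators algorithm of this and the preceding entry repeatedly calls is the isotonic regression of the infection indicators on the monitoring times; a minimal sketch, with illustrative variable names.

```python
import numpy as np

def pava_increasing(y, w):
    """Weighted pool-adjacent-violators: least-squares nondecreasing fit to y."""
    vals, wts, cnt = [], [], []
    for v, wi in zip(y, w):
        vals.append(float(v)); wts.append(float(wi)); cnt.append(1)
        # merge blocks while the increasing constraint is violated
        while len(vals) > 1 and vals[-2] > vals[-1]:
            w2 = wts[-2] + wts[-1]
            vals[-2:] = [(wts[-2] * vals[-2] + wts[-1] * vals[-1]) / w2]
            wts[-2:] = [w2]
            cnt[-2:] = [cnt[-2] + cnt[-1]]
    return np.repeat(vals, cnt)

def current_status_npmle(c, delta):
    """NPMLE of F from current status data: isotonic regression of the
    infection indicators delta on the sorted monitoring times c."""
    order = np.argsort(c)
    return c[order], pava_increasing(delta[order], np.ones(len(c)))
```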
43. Locally efficient estimation of the survival distribution with right-censored data and covariates when collection of data is delayed.
- Author
-
VAN DER LAAN, MARK J. and HUBBARD, ALAN E.
- Subjects
- *
KAPLAN-Meier estimator , *NONPARAMETRIC estimation , *ASYMPTOTIC efficiencies , *PROPORTIONAL hazards models , *DISTRIBUTION (Probability theory) - Abstract
For many sources of survival data, there is a delay between the recording of vital status and its availability to the analyst, and the Kaplan-Meier estimator is typically inconsistent in these situations. In this paper we identify the optimal estimation problem. As a result of the curse of dimensionality, no globally efficient nonparametric estimator exists with a good practical performance at moderate sample sizes. Following the approach of Robins & Rotnitzky (1992), given a correctly specified model for the hazard of censoring conditional on the delay process and T, we propose a closed-form one-step estimator of the distribution of T whose asymptotic variance attains the efficiency bound, if we can correctly specify a lower-dimensional working model for the conditional distribution of T given the ascertainment process. The estimator remains consistent and asymptotically normal even if this latter submodel is misspecified. In particular, if we choose as working model independence between T and the ascertainment process, then the estimator is efficient when this holds and remains consistent and asymptotically linear otherwise. Moreover, we incorporate in our data structure a covariate process that is observed during the follow-up time and is reported with the same delays. We propose closed-form locally efficient estimators of the type described above which use all the data and allow for dependent censoring. [ABSTRACT FROM AUTHOR]
- Published
- 1998
- Full Text
- View/download PDF
44. Efficient estimation of the lifetime and disease onset distribution.
- Author
-
VAN DER LAAN, MARK J., JEWELL, NICHOLAS P., and PETERSON, DERICK R.
- Subjects
- *
ESTIMATION theory , *MATHEMATICAL statistics , *STATISTICAL correlation , *NONPARAMETRIC statistics , *STATISTICS - Abstract
We study efficient nonparametric maximum likelihood estimation of the distribution of onset and lifetime associated with an irreversible disease that is only detectable at sacrifice or death. We show that, if the onset distribution is continuous, then estimation of the lifetime distribution cannot be improved by using current status information on the time till onset. In this case the Kaplan-Meier estimator of the lifetime distribution is asymptotically efficient. The nonparametric maximum likelihood estimator tries to use current status information on the time till onset, but it is asymptotically equivalent to the Kaplan-Meier estimator, and it is outperformed by the latter in simulations. The nonparametric maximum likelihood estimator of the onset distribution is shown to be an iteratively reweighted least squares estimator which can be computed with the weighted pool-adjacent-violators algorithm. We show that it is unnecessary to estimate the weights iteratively since an initial estimate cannot be improved. This insight leads to a simple, explicit estimator which improves on the nonparametric maximum likelihood estimator of the onset distribution. The results are verified with simulations, and a data analysis example is provided. [ABSTRACT FROM AUTHOR]
- Published
- 1997
- Full Text
- View/download PDF
45. The Two-interval Line-segment Problem.
- Author
-
Van Der Laan, Mark J.
- Subjects
- *
ESTIMATION theory , *EQUATIONS , *STATISTICS - Abstract
ABSTRACT. In this paper we define and study the non-parametric maximum likelihood estimator (NPMLE) in the one-dimensional line-segment problem, where we observe line-segments on the real line through an interval with a gap which is smaller than the two remaining intervals. We define the self-consistency equations for the NPMLE and provide a quick algorithm for solving them. We prove supremum norm weak convergence to a Gaussian process and efficiency of the NPMLE. The problem has a geological application in the study of the lifespan of species. [ABSTRACT FROM AUTHOR]
- Published
- 1998
- Full Text
- View/download PDF
46. Locally Efficient Estimation With Current Status Data and Time-Dependent Covariates.
- Author
-
Van Der Laan, Mark J. and Robins, James M.
- Subjects
- *
FAILURE time data analysis , *STATISTICS - Abstract
Introduces an inverse probability of censoring weighted estimator of the distribution of a failure time variable T and of smooth functionals of this distribution that are guaranteed to be consistent and asymptotically normal. Inverse probability of censoring weighted estimators; Construction of a locally efficient one-step estimator.
- Published
- 1998
- Full Text
- View/download PDF
47. Causal effects in longitudinal studies: Definition and maximum likelihood estimation
- Author
-
Neugebauer, Romain and van der Laan, Mark J.
- Subjects
- *
LONGITUDINAL method , *METHODOLOGY , *ASSOCIATIONS, institutions, etc. - Abstract
Abstract: Recently, a nonparametric marginal structural model (NPMSM) approach to Causal Inference has been proposed [Neugebauer, R., van der Laan, M., 2006. Nonparametric causal effects based on marginal structural models. J. Statist. Plann. Inference (in press), http://www.sciencedirect.com/science/journal/03783758] as an appealing practical alternative to the original parametric MSM (PMSM) approach introduced by Robins [Robins, J., 1998a. Marginal structural models. In: 1997 Proceedings of the American Statistical Association, American Statistical Association, Alexandria, VA, pp. 1–10]. The new MSM-based causal inference methodology generalizes the concept of causal effects: the proposed nonparametric causal effects are interpreted as summary measures of the causal effects defined with PMSMs. In addition, causal inference with NPMSM does not rely on the assumed correct specification of a parametric MSM but instead defines causal effects based on a user-specified working causal model which can be willingly misspecified. The NPMSM approach was developed for studies with point treatment data or with longitudinal data where the outcome is not time-dependent (typically collected at the end of data collection). In this paper, we generalize this approach to longitudinal studies where the outcome is time-dependent, i.e. collected throughout the span of the studies, and address the subsequent estimation inconsistency which could easily arise from a hasty generalization of the algorithm for maximum likelihood estimation. More generally, we provide an overview of the multiple causal effect representations which have been developed based on MSMs in longitudinal studies. [Copyright © Elsevier]
- Published
- 2006
- Full Text
- View/download PDF
48. G-computation estimation for causal inference with complex longitudinal data
- Author
-
Neugebauer, Romain and van der Laan, Mark J.
- Subjects
- *
LONGITUDINAL method , *ALGORITHMS , *MONTE Carlo method , *ESTIMATION theory - Abstract
Abstract: In a companion paper, [Neugebauer, R., van der Laan, M.J., 2006b. Causal effects in longitudinal studies: definition and maximum likelihood estimation. Comput. Stat. Data Anal., this issue, doi:10.1016/j.csda.2006.06.013], we provided an overview of causal effect definition with marginal structural models (MSMs) in longitudinal studies. A parametric MSM (PMSM) and a non-parametric MSM (NPMSM) approach were described for the representation of causal effects in pooled or stratified analyses of treatment effects on time-dependent outcomes. Maximum likelihood estimation, also referred to as G-computation estimation, was detailed for these causal effects. In this paper, we develop new algorithms for the implementation of the G-computation estimators of both NPMSM and PMSM causal effects. Current algorithms rely on Monte Carlo simulation of all possible treatment-specific outcomes, also referred to as counterfactuals or potential outcomes. This task becomes computationally impracticable (a) in studies with a continuous treatment, and/or (b) in longitudinal studies with long follow-up with or without time-dependent outcomes. The proposed algorithms address this important computing limitation inherent to G-computation estimation in most longitudinal studies. Finally, practical considerations about the proposed algorithms lead to a further generalization of the definition of NPMSM causal effects in order to allow more reliable applications of these methodologies to a broader range of real-life studies. Results are illustrated with two simulation studies. [Copyright © Elsevier]
- Published
- 2006
- Full Text
- View/download PDF
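For orientation, the point-treatment special case of G-computation reduces to fitting an outcome regression and averaging its predictions with the treatment set to each level. The linear outcome model below is an illustrative assumption, and the paper's contribution, algorithms that avoid Monte Carlo enumeration of all counterfactuals in longitudinal settings, is not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def gcomp_point_treatment(W, A, Y, a):
    """G-computation for a point treatment: fit an outcome regression
    E[Y | A, W], then average its predictions with treatment set to a."""
    X = np.column_stack([A, W])
    model = LinearRegression().fit(X, Y)
    Xa = np.column_stack([np.full(len(A), a), W])
    return model.predict(Xa).mean()  # estimate of the counterfactual mean E[Y_a]

# causal contrast under the usual identification assumptions:
# ate = gcomp_point_treatment(W, A, Y, 1) - gcomp_point_treatment(W, A, Y, 0)
```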
49. A semiparametric model selection criterion with applications to the marginal structural model
- Author
-
Brookhart, M. Alan and van der Laan, Mark J.
- Subjects
- *
HIV-positive men , *ESTIMATION theory , *ECONOMETRIC models , *ALGORITHMS - Abstract
Abstract: Estimators of the parameter of interest in semiparametric models often depend on a guessed model for the nuisance parameter. The choice of the model for the nuisance parameter can affect both the finite sample bias and efficiency of the resulting estimator of the parameter of interest. In this paper we propose a finite sample criterion based on cross validation that can be used to select a nuisance parameter model from a list of candidate models. We show that the expected value of this criterion is minimized by the nuisance parameter model that yields the estimator of the parameter of interest with the smallest mean-squared error relative to the expected value of an initial consistent reference estimator. In a simulation study, we examine the performance of this criterion for selecting a model for a treatment mechanism in a marginal structural model (MSM) of point treatment data. For situations where all possible models cannot be evaluated, we outline a forward/backward model selection algorithm based on the cross validation criterion proposed in this paper and show how it can be used to select models for multiple nuisance parameters. Finally, we apply the forward model selection algorithm to a MSM analysis of the relationship between boiled water use and gastrointestinal illness in HIV positive men. [Copyright © Elsevier]
- Published
- 2006
- Full Text
- View/download PDF
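A schematic sketch of the cross-validation criterion described above: for each candidate nuisance-parameter model, compare the fold-specific estimates of the parameter of interest with a fixed initial consistent reference estimate. The fit_candidate signature and psi_ref are hypothetical placeholders standing in for the estimator at hand, not an API from the paper.

```python
import numpy as np

def cv_nuisance_criterion(data, fit_candidate, psi_ref, V=5, seed=0):
    """Cross-validated criterion for choosing a nuisance-parameter model:
    average squared distance between the candidate-based estimate of the
    target parameter (nuisance model fit on training folds, estimate formed
    on the validation fold) and a fixed reference estimate psi_ref."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    folds = np.array_split(idx, V)
    losses = []
    for v in range(V):
        train = np.concatenate([folds[u] for u in range(V) if u != v])
        psi_v = fit_candidate(data[train], data[folds[v]])  # hypothetical hook
        losses.append((psi_v - psi_ref) ** 2)
    return np.mean(losses)

# the candidate model with the smallest criterion value is selected:
# best = min(candidates, key=lambda m: cv_nuisance_criterion(data, m, psi_ref))
```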
50. Fitting of mixtures with unspecified number of components using cross validation distance estimate
- Author
-
Miloslavsky, Maja and van der Laan, Mark J.
- Subjects
- *
FUNCTIONALS , *PRINCIPAL components analysis - Abstract
Estimation of the number of mixture components k is an unsolved problem. Available methods for estimating k include bootstrapping the likelihood ratio test statistic and optimizing a variety of validity functionals. We investigate the minimization of distance between the fitted mixture model and the true density as a method for estimating k. The distances considered are Kullback–Leibler (KL) and L2. We estimate these distances using cross validation. A reliable estimate of k is obtained by voting over B estimates of k corresponding to B cross validation estimates of distance. This estimation method with KL distance is very similar to the Monte Carlo cross-validated likelihood method discussed by Smyth (Statist. Computing 10(1) (2000) 63). With focus on univariate normal mixtures, we present simulation studies that compare the cross validated distance method with Akaike's Information Criterion (AIC), Bayesian Information Criterion/Minimum Description Length (BIC/MDL), and Information Complexity (ICOMP). We also apply the cross validation estimate of distance approach, along with the AIC, BIC/MDL and ICOMP approaches, to data from an osteoporosis drug trial in order to find groups that differentially respond to treatment. In our closing remarks, we highlight the general applicability of our method to choose between any set of estimators of a particular parameter of interest, assuming the presence of an approximately unbiased estimator. [Copyright © Elsevier]
- Published
- 2003
- Full Text
- View/download PDF
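A minimal sketch of the cross-validated KL-distance approach for choosing the number of mixture components k: the held-out negative log-likelihood of a fitted k-component normal mixture estimates the KL distance to the truth up to a constant. The use of scikit-learn's GaussianMixture and a single fold split (rather than voting over B splits, as the abstract describes) are simplifying assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def cv_kl_score(x, k, V=5, seed=0):
    """Cross-validated negative log-likelihood of a k-component normal
    mixture: up to a constant, an estimate of KL(truth || fitted mixture)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, V)
    score = 0.0
    for v in range(V):
        train = np.concatenate([folds[u] for u in range(V) if u != v])
        gm = GaussianMixture(n_components=k, random_state=0)
        gm.fit(x[train].reshape(-1, 1))
        score -= gm.score_samples(x[folds[v]].reshape(-1, 1)).sum()
    return score / len(x)

# choose k by minimizing the cross-validated KL estimate:
x = np.concatenate([np.random.default_rng(2).normal(0, 1, 150),
                    np.random.default_rng(3).normal(4, 1, 150)])
k_hat = min(range(1, 6), key=lambda k: cv_kl_score(x, k))
```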