Vegetation/Ecosystem Modeling and Analysis Project (VEMAP) Phase 2 model experiments investigated the response of biogeochemical and dynamic global vegetation models (DGVMs) to differences in climate over the conterminous United States. This was accomplished by simulating ecosystem processes using historical climate and atmospheric CO2 records from 1895 to 1993. We evaluated the behavior of six models (Biome-BGC, Century, GTEC, LPJ, MC1, and TEM) by comparing simulated runoff in 13 watersheds to gauged streamflow from the Hydro-Climatic Data Network. Metrics used to assess the “goodness of fit” between simulated and observed values were: (1) Pearson's r to evaluate the overall data set, (2) Kendall's τ to gauge seasonality trends as derived from a time-series analysis of monthly runoff, and (3) three measures of absolute and relative error. We found small differences in performance among the six models over all watersheds. However, the models yielded highly divergent results depending upon the watershed analyzed. Performance of the ensemble of models in a watershed was positively correlated with observed streamflow: the wettest watersheds in this study were associated with the highest model correlations and largest absolute errors, and the driest watersheds were associated with the lowest correlations and smallest absolute errors. Mean relative error was small and nearly constant across watersheds. A bias estimator showed that the models tended to underestimate runoff in wet watersheds and overestimate runoff in dry watersheds.
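The goodness-of-fit metrics named above can be illustrated with a minimal sketch. The function names are hypothetical, and the three error measures shown (mean bias, mean absolute error, mean relative error) are assumed common choices, since the text does not specify which measures were used. Kendall's τ is computed here directly on paired values in its simplest (tau-a) form, which may differ from the time-series procedure applied in the study.

```python
import math
from itertools import combinations


def pearson_r(obs, sim):
    """Pearson correlation between observed and simulated runoff."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_s = sum(sim) / n
    cov = sum((o - mean_o) * (s - mean_s) for o, s in zip(obs, sim))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_s = sum((s - mean_s) ** 2 for s in sim)
    return cov / math.sqrt(var_o * var_s)


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)


def error_measures(obs, sim):
    """Assumed error measures; the study's exact definitions are not given here."""
    n = len(obs)
    diffs = [s - o for o, s in zip(obs, sim)]
    return {
        # Positive bias means the model overestimates runoff on average.
        "mean_bias": sum(diffs) / n,
        "mean_abs_error": sum(abs(d) for d in diffs) / n,
        # Relative error is undefined where observed runoff is zero.
        "mean_rel_error": sum(abs(d) / o for d, o in zip(diffs, obs)) / n,
    }
```

Under these definitions, a negative mean bias in a watershed would correspond to the underestimation reported for wet watersheds, and a positive one to the overestimation reported for dry watersheds.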
Analysis of long-term trends in runoff using a moving-average approach demonstrated the ability of the models to reproduce temporal variation in observed data, even though quantitative differences among models were large. Models relying on prescribed vegetation (Biome-BGC, Century, and TEM) outperformed the two DGVMs (LPJ and MC1); GTEC gave the poorest fit to observations due to the absence of an evaporation function and a snow routine. Across all 13 watersheds, TEM ranked highest in model performance. The validation results presented here suggest that improvements in the simulation of hydrologic processes in land-surface models will come, in part, from a more realistic representation of subgrid-scale soil moisture and from a more detailed understanding and representation of subsurface processes.
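The moving-average approach used for the long-term trend analysis can be sketched as follows. This is an illustration only: the function name is hypothetical, and the window length is an assumed parameter, since the text does not state the averaging period.

```python
def moving_average(values, window):
    """Simple moving average over a fixed-length window.

    Returns one mean per full window, so the output has
    len(values) - window + 1 entries.
    """
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]
```

Applying the same window to the observed and simulated annual runoff series gives two smoothed series whose temporal variation can then be compared.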