
Significance of changes in medium-range forecast scores

Authors :
Alan J. Geer
Source :
Tellus: Series A, Dynamic Meteorology and Oceanography, Vol 68, Iss 0, Pp 1-21 (2016)
Publication Year :
2016
Publisher :
Stockholm University Press, 2016.

Abstract

The impact of developments in weather forecasting is measured using forecast verification, but many developments, though useful, improve medium-range forecast scores by less than 0.5 %. Chaotic variability in the quality of individual forecasts is so large that it can be hard to achieve statistical significance when comparing these ‘smaller’ developments to a control. For example, with 60 separate forecasts and a 95 % confidence level, a change in the quality of the day-5 forecast must be larger than 1 % to be statistically significant under a Student's t-test. The first aim of this study is to illustrate the importance of significance testing in forecast verification, and to point out the surprisingly large sample sizes required to attain significance.

The second aim is to assess how reliable current approaches to significance testing are, following the suspicion that apparently significant results may in fact have been generated by chaotic variability. An independent realisation of the null hypothesis can be created by running a forecast experiment containing a purely numerical perturbation and comparing it to a control. With 1885 paired differences from about 2.5 yr of testing, an alternative significance test can be constructed that makes no statistical assumptions about the data. This test is used to check experimentally the validity of the normal statistical framework for forecast scores, and it shows that naive application of Student's t-test generates too many false positives (i.e. false rejections of the null hypothesis).

A known issue is temporal autocorrelation in forecast scores, which can be corrected by inflating the confidence interval, but typical inflation factors, such as those based on an AR(1) model, are not large enough, and they are themselves affected by sampling uncertainty. Further, the importance of statistical multiplicity has not been appreciated, and multiplicity becomes particularly dangerous when many experiments are compared together: across three forecast experiments, for example, there could be roughly a 1 in 2 chance of a false positive. However, when correctly adjusted for autocorrelation, and when the effects of multiplicity are properly treated using a Šidák correction, the t-test is a reliable way to establish the significance of changes in forecast scores.
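As a concrete illustration of the procedure the abstract describes, the sketch below (plain Python with NumPy/SciPy; the function names and synthetic data are illustrative assumptions, not the paper's code) applies a paired Student's t-test to control-minus-experiment score differences, inflates the critical value using a lag-1 autocorrelation estimate under an AR(1) assumption, and tightens the per-test significance level with a Šidák correction when several experiments are compared.

# A minimal sketch, assuming paired score differences (control minus
# experiment) for one score and lead time. Not the paper's code.
import numpy as np
from scipy import stats

def ar1_inflation(d):
    """Confidence-interval inflation factor from the lag-1 autocorrelation
    of the differences, assuming they follow an AR(1) process."""
    d = np.asarray(d, dtype=float)
    a = d - d.mean()
    r1 = np.sum(a[:-1] * a[1:]) / np.sum(a * a)   # lag-1 autocorrelation
    r1 = max(r1, 0.0)                             # ignore negative estimates
    return np.sqrt((1.0 + r1) / (1.0 - r1))       # widens the critical value

def paired_test(diffs, n_experiments=1, alpha=0.05):
    """Paired t-test of whether the mean score difference departs from zero.
    n_experiments > 1 applies a Sidak correction for multiplicity."""
    d = np.asarray(diffs, dtype=float)
    n = d.size
    t = d.mean() / (d.std(ddof=1) / np.sqrt(n))
    # Sidak-corrected per-test significance level for m comparisons
    alpha_sidak = 1.0 - (1.0 - alpha) ** (1.0 / n_experiments)
    # Critical value, inflated for AR(1) temporal autocorrelation
    t_crit = stats.t.ppf(1.0 - alpha_sidak / 2.0, df=n - 1) * ar1_inflation(d)
    return t, t_crit, abs(t) > t_crit

# Synthetic day-5 score differences: 60 paired forecasts with mild
# serial correlation (an MA(1) construction gives lag-1 correlation ~0.35)
rng = np.random.default_rng(0)
noise = rng.standard_normal(61)
diffs = 0.2 + noise[1:] + 0.4 * noise[:-1]
t, t_crit, significant = paired_test(diffs, n_experiments=3)
print(f"t = {t:.2f}, inflated critical value = {t_crit:.2f}, "
      f"significant: {significant}")

For m experiments tested at an overall level α, the Šidák-corrected per-test level is 1 − (1 − α)^(1/m); for m = 3 and α = 0.05 this is about 0.017, so each individual test must clear a noticeably stricter threshold than a naive per-experiment test at 0.05.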

Details

Language :
English
ISSN :
1600-0870
Volume :
68
Issue :
0
Database :
Directory of Open Access Journals
Journal :
Tellus: Series A, Dynamic Meteorology and Oceanography
Publication Type :
Academic Journal
Accession number :
edsdoj.9f90ec898f164a99bea2bb5af4ac437a
Document Type :
article
Full Text :
https://doi.org/10.3402/tellusa.v68.30229