Evaluating Models’ Local Decision Boundaries via Contrast Sets

Authors :: Qiang Ning; Ben Bogin; Sihao Chen; Hannaneh Hajishirzi; Ben Zhou; Eric Wallace; Phoebe Mulcaire; Dheeru Dua; Kevin Lin; Ananth Gottumukkala; Jonathan Berant; Sanjay Subramanian; Ally Zhang; Victoria Basmov; Noah A. Smith; Pradeep Dasigi; Nitish Gupta; Jiangming Liu; Daniel Khashabi; Matt Gardner; Sameer Singh; Gabriel Ilharco; Reut Tsarfaty; Yoav Artzi; Nelson F. Liu; Yanai Elazar
Source :: Findings of the Association for Computational Linguistics: EMNLP 2020, EMNLP (Findings)
Publisher :: Association for Computational Linguistics
Abstract: Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture the abilities a dataset is intended to test. We propose a more rigorous annotation paradigm for NLP that helps to close systematic gaps in the test data. In particular, after a dataset is constructed, we recommend that the dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets. Contrast sets provide a local view of a model’s decision boundary, which can be used to more accurately evaluate a model’s true linguistic capabilities. We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets (e.g., DROP reading comprehension, UD parsing, and IMDb sentiment analysis). Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets—up to 25% in some cases. We release our contrast sets as new evaluation benchmarks and encourage future dataset construction efforts to follow similar annotation processes.
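
Illustration (not from the paper): the abstract's central claim is that a model relying on a simple decision rule can score well on an original test set yet fail on minimally perturbed instances whose gold label has changed. The sketch below shows that effect on toy data. Everything in it is an assumption made for illustration: the `keyword_rule` classifier, the `accuracy` helper, and the eight example sentences are hypothetical and are not drawn from the released contrast sets or from any of the 10 datasets.

```python
from typing import Callable, List, Tuple

# (text, gold_label) pairs; labels are "pos" or "neg".
Example = Tuple[str, str]


def keyword_rule(text: str) -> str:
    """Hypothetical shortcut classifier: 'pos' iff the word 'great' appears."""
    return "pos" if "great" in text.lower() else "neg"


def accuracy(model: Callable[[str], str], examples: List[Example]) -> float:
    """Fraction of examples on which the model's prediction matches the gold label."""
    correct = sum(model(text) == gold for text, gold in examples)
    return correct / len(examples)


# Toy "original" test set (invented): the keyword happens to align with the label.
original_test: List[Example] = [
    ("A great film with a moving story.", "pos"),
    ("The acting was great throughout.", "pos"),
    ("A dull plot and flat characters.", "neg"),
    ("Boring and far too long.", "neg"),
]

# Toy contrast set (invented): each instance is a small, meaningful edit of the
# corresponding original instance that flips the gold label.
contrast_set: List[Example] = [
    ("Hardly a great film, despite a moving story.", "neg"),
    ("The acting was anything but great.", "neg"),
    ("A dull plot redeemed by brilliant characters.", "pos"),
    ("Never boring, and it flew by.", "pos"),
]

if __name__ == "__main__":
    print("original accuracy:", accuracy(keyword_rule, original_test))
    print("contrast accuracy:", accuracy(keyword_rule, contrast_set))
```

On this toy data the shortcut rule is correct on every original instance and wrong on every perturbed one, which is an exaggerated version of the gap the abstract reports. A real evaluation would substitute a trained model for `keyword_rule` and the released contrast sets for the invented sentences.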

Details

Language :: English
Database :: OpenAIRE
Journal :: Findings of the Association for Computational Linguistics: EMNLP 2020, EMNLP (Findings)
Accession number :: edsair.doi.dedup.....b920a7df716c795505051e9ad280a1b0
Full Text :: https://doi.org/10.18653/v1/2020.findings-emnlp.117