Evaluating NLP Models via Contrast Sets

Current NLP models are often “cheating” on supervised learning tasks by exploiting correlations that arise from the particularities of the dataset. Therefore they often fail to learn the original intent of the dataset creators. This paper argues that NLP models should be evaluated on Contrast Sets, which are hand-crafted perturbations by the dataset authors that capture their intent in a meaningful way.

Abstract:
Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture a dataset’s intended capabilities. We propose a new annotation paradigm for NLP that helps to close systematic gaps in the test data. In particular, after a dataset is constructed, we recommend that the dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets. Contrast sets provide a local view of a model’s decision boundary, which can be used to more accurately evaluate a model’s true linguistic capabilities. We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets (e.g., DROP reading comprehension, UD parsing, IMDb sentiment analysis). Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets—up to 25% in some cases. We release our contrast sets as new evaluation benchmarks and encourage future dataset construction efforts to follow similar annotation processes.

Authors: Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hanna Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, Ben Zhou

Links:
YouTube:
Twitter:
BitChute:
Minds:

source

What's Hot

Remaining Windsurf team and tech acquired by Cognition, makers of Devin: ‘We’re friends with Anthropic again’

Following YouTube, Meta announces crackdown on ‘unoriginal’ Facebook content

DeepMind’s Veo3 AI – The New King Is Here!

Evaluating NLP Models via Contrast Sets

Yannic Kilcher Live Stream

Imagination-Augmented Agents for Deep Reinforcement Learning

Learning model-based planning from scratch

Murujuga Rock Art in Australia Receives UNESCO World Heritage Status

‘Earth Room’ Caretaker Dies at 70

Homeland Security Targets Chicago’s National Museum of Puerto Rican Arts & Culture

1,600-Year-Old Tomb of Mayan City’s Founding King Discovered in Belize

Remaining Windsurf team and tech acquired by Cognition, makers of Devin: ‘We’re friends with Anthropic again’

Following YouTube, Meta announces crackdown on ‘unoriginal’ Facebook content

DeepMind’s Veo3 AI – The New King Is Here!

What's Hot

Evaluating NLP Models via Contrast Sets

Related Posts

Subscribe to Updates