Natural language evaluation with humans in the loop and statistical estimators


Abstract/Contents

Abstract
In natural language tasks such as knowledge base population, text summarization, or open-response question answering, a significant challenge is simply evaluating the performance of automated systems because of the large diversity of possible outputs. Existing fully automatic methods for evaluating these systems rely on an incomplete set of annotated references, which leads to systematic biases against certain system improvements: in other words, genuinely good ideas are systematically discarded simply because of limitations in our evaluation methodology. As a result, human evaluation, which can be prohibitively expensive, has remained the most trusted mode of evaluation for these tasks. In this work, we show how the cost of incorporating human feedback can be decreased through the design of appropriate statistical estimators. First, we consider the "finite incompleteness" setting, where the output space is too large to exhaustively annotate but we may still expect significant overlap between the outputs of different systems. Naively combining annotations from different systems leads to a representation bias. Here, we show that the cost of obtaining human feedback can be significantly amortized by using a novel importance-reweighted estimator. We apply this estimator to design a new evaluation methodology for knowledge base population and empirically show that the cost of evaluating precision and recall within this framework can be reduced by a factor of 4. Next, we consider the "infinite incompleteness" setting, in which few, if any, systems ever produce identical output. Traditionally, the community has relied on similarity-based automatic metrics such as BLEU or ROUGE to compare the outputs produced by different systems. Unfortunately, these metrics have been shown to correlate poorly with human judgment and thus introduce bias into evaluation. We derive an unbiased estimator that optimally combines these automatic metrics with human feedback. Our theoretical results characterize the potential cost reduction solely in terms of the task's subjectivity, measured by inter-annotator variance, and the automatic metric's quality, measured by correlation with human judgments. On two popular natural language generation tasks, question answering and summarization, we empirically show that we can currently achieve at most a 7-13% reduction in cost, exposing fundamental limitations in debiasing current automatic metrics. Finally, we turn our attention to incompleteness in the training data, particularly in low-resource settings, where our machine learning systems simply have not seen enough training examples of a particular phenomenon to make accurate predictions about it at test time. To tackle this incarnation of incompleteness, we train a system "on-the-job" by requesting human feedback in real time while the model is deployed, economically filling in holes in the training data and thus resolving uncertainty in the model. Our key idea is to cast the problem as a stochastic game based on Bayesian decision theory, which allows us to balance latency, cost, and accuracy objectives in a principled way. When tested on three classification tasks (named-entity recognition, sentiment classification, and image classification), we obtain an order-of-magnitude reduction in cost compared to full human annotation, even when starting from zero training examples, while also boosting performance relative to a classical supervised model trained on the expert-provided labels.
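Illustrative note (not part of the original abstract): the second contribution describes an estimator that combines a cheap automatic metric with a small sample of human judgments. As a rough sketch only, and not the thesis's exact construction, a control-variate style estimator along these lines is shown below in Python; the function and variable names are hypothetical, and the data in the usage example are synthetic.

```python
# Hedged sketch (assumption, not the thesis's exact estimator): combine a cheap
# automatic metric g(x) with a small sample of human judgments f(x) using the
# metric as a control variate. Assumes g is available for every system output,
# while f is collected only on a random subsample of outputs.
import random


def control_variate_estimate(human_scores, metric_scores_on_sample, metric_mean_all):
    """Estimate the mean human score, using the automatic metric to reduce variance.

    human_scores:            f(x_i) on the annotated subsample
    metric_scores_on_sample: g(x_i) on the same subsample
    metric_mean_all:         mean of g over *all* outputs (cheap to compute)
    """
    n = len(human_scores)
    mean_f = sum(human_scores) / n
    mean_g = sum(metric_scores_on_sample) / n
    # Estimate the coefficient alpha = Cov(f, g) / Var(g) from the subsample.
    cov_fg = sum((f - mean_f) * (g - mean_g)
                 for f, g in zip(human_scores, metric_scores_on_sample)) / n
    var_g = sum((g - mean_g) ** 2 for g in metric_scores_on_sample) / n
    alpha = cov_fg / var_g if var_g > 0 else 0.0
    # Correct the human-only mean by how far the sampled metric scores deviate
    # from the metric's mean over the full output set.
    return mean_f - alpha * (mean_g - metric_mean_all)


if __name__ == "__main__":
    random.seed(0)
    # Toy, synthetic data for illustration only.
    all_metric = [random.random() for _ in range(10000)]
    sample_idx = random.sample(range(10000), 200)
    sample_metric = [all_metric[i] for i in sample_idx]
    sample_human = [min(1.0, max(0.0, g + random.gauss(0, 0.1))) for g in sample_metric]
    print(control_variate_estimate(sample_human, sample_metric,
                                   sum(all_metric) / len(all_metric)))
```

The variance reduction of such an estimator grows with the correlation between the metric and human judgment, which is consistent with the abstract's point that weakly correlated metrics like BLEU or ROUGE yield only modest cost savings.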

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place California
Place [Stanford, California]
Publisher [Stanford University]
Copyright date 2018; ©2018
Publication date 2018
Issuance monographic
Language English

Creators/Contributors

Author Chaganty, Arun Tejasvi
Degree supervisor Liang, Percy
Thesis advisor Liang, Percy
Thesis advisor Bernstein, Michael S., 1984-
Thesis advisor Manning, Christopher D.
Degree committee member Bernstein, Michael S., 1984-
Degree committee member Manning, Christopher D.
Associated with Stanford University, Computer Science Department.

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Arun Tejasvi Chaganty.
Note Submitted to the Computer Science Department.
Thesis Thesis (Ph.D.), Stanford University, 2018.
Location electronic resource

Access conditions

Copyright
© 2018 by Arun Tejasvi Chaganty
License
This work is licensed under a Creative Commons Attribution 3.0 Unported license (CC BY).
