Natural language evaluation with humans in the loop and statistical estimators


Abstract/Contents

Abstract
In natural language tasks such as knowledge base population, text summarization, or open-response question answering, a significant challenge is simply evaluating the performance of automated systems because of the large diversity of possible outputs. Existing fully automatic methods for evaluating these systems rely on an incomplete set of annotated references, which leads to systematic biases against certain system improvements: in other words, genuinely good ideas are systematically discarded simply because of limitations in our evaluation methodology. As a result, human evaluation, which can be prohibitively expensive, has remained the most trusted mode of evaluation for these tasks. In this work, we show how the cost of incorporating human feedback can be decreased through the design of appropriate statistical estimators. First, we consider the "finite incompleteness" setting, where the output space is too large to exhaustively annotate but we may still expect significant overlap between the outputs of different systems. Naively combining annotations from different systems leads to a representation bias. Here, we show that the cost of obtaining human feedback can be significantly amortized by using a novel importance-reweighted estimator. We apply this estimator to design a new evaluation methodology for knowledge base population and empirically show that the cost of evaluating precision and recall within this framework can be reduced by a factor of 4. Next, we consider the "infinite incompleteness" setting, in which few, if any, systems ever produce identical output. Traditionally, the community has relied on similarity-based automatic metrics such as BLEU or ROUGE to compare the outputs produced by different systems. Unfortunately, these metrics have been shown to correlate poorly with human judgment and thus introduce bias into evaluation. We derive an unbiased estimator that optimally combines these automatic metrics with human feedback. Our theoretical results characterize the potential cost reduction solely in terms of the task's subjectivity, measured by inter-annotator variance, and the automatic metric's quality, measured by correlation with human judgments. On two popular natural language generation tasks, question answering and summarization, we empirically show that we can currently achieve at most a 7-13% reduction in cost, exposing fundamental limitations in debiasing current automatic metrics. Finally, we turn our attention to incompleteness in the training data, particularly in low-resource settings, where our machine learning systems simply have not seen enough training examples of a particular phenomenon to make accurate predictions about it at test time. To tackle this incarnation of incompleteness, we train a system "on-the-job" by requesting human feedback in real time while the model is deployed, economically filling in holes in the training data and thus resolving uncertainty in the model. Our key idea is to cast the problem as a stochastic game based on Bayesian decision theory, which allows us to balance latency, cost, and accuracy objectives in a principled way. When tested on three classification tasks (named-entity recognition, sentiment classification, and image classification), we obtain an order-of-magnitude reduction in cost compared to full human annotation, even when starting from zero training examples, while also boosting performance relative to a classical supervised model trained on the expert-provided labels.
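Illustrative note (not part of the original abstract): the second contribution describes an estimator that combines a cheap automatic metric with a small sample of human judgments. As a rough sketch only, and not the thesis's exact construction, a control-variate style estimator along these lines is shown below in Python; the function and variable names are hypothetical, and the data in the usage example are synthetic.

```python
# Hedged sketch (assumption, not the thesis's exact estimator): combine a cheap
# automatic metric g(x) with a small sample of human judgments f(x) using the
# metric as a control variate. Assumes g is available for every system output,
# while f is collected only on a random subsample of outputs.
import random


def control_variate_estimate(human_scores, metric_scores_on_sample, metric_mean_all):
    """Estimate the mean human score, using the automatic metric to reduce variance.

    human_scores:            f(x_i) on the annotated subsample
    metric_scores_on_sample: g(x_i) on the same subsample
    metric_mean_all:         mean of g over *all* outputs (cheap to compute)
    """
    n = len(human_scores)
    mean_f = sum(human_scores) / n
    mean_g = sum(metric_scores_on_sample) / n
    # Estimate the coefficient alpha = Cov(f, g) / Var(g) from the subsample.
    cov_fg = sum((f - mean_f) * (g - mean_g)
                 for f, g in zip(human_scores, metric_scores_on_sample)) / n
    var_g = sum((g - mean_g) ** 2 for g in metric_scores_on_sample) / n
    alpha = cov_fg / var_g if var_g > 0 else 0.0
    # Correct the human-only mean by how far the sampled metric scores deviate
    # from the metric's mean over the full output set.
    return mean_f - alpha * (mean_g - metric_mean_all)


if __name__ == "__main__":
    random.seed(0)
    # Toy, synthetic data for illustration only.
    all_metric = [random.random() for _ in range(10000)]
    sample_idx = random.sample(range(10000), 200)
    sample_metric = [all_metric[i] for i in sample_idx]
    sample_human = [min(1.0, max(0.0, g + random.gauss(0, 0.1))) for g in sample_metric]
    print(control_variate_estimate(sample_human, sample_metric,
                                   sum(all_metric) / len(all_metric)))
```

The variance reduction of such an estimator grows with the correlation between the metric and human judgment, which is consistent with the abstract's point that weakly correlated metrics like BLEU or ROUGE yield only modest cost savings.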

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place California
Place [Stanford, California]
Publisher [Stanford University]
Copyright date 2018; ©2018
Publication date 2018
Issuance monographic
Language English

Creators/Contributors

Author Chaganty, Arun Tejasvi
Degree supervisor Liang, Percy
Thesis advisor Liang, Percy
Thesis advisor Bernstein, Michael S., 1984-
Thesis advisor Manning, Christopher D.
Degree committee member Bernstein, Michael S., 1984-
Degree committee member Manning, Christopher D.
Associated with Stanford University, Computer Science Department.

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Arun Tejasvi Chaganty.
Note Submitted to the Computer Science Department.
Thesis Thesis (Ph.D.), Stanford University, 2018.
Location electronic resource

Access conditions

Copyright
© 2018 by Arun Tejasvi Chaganty
License
This work is licensed under a Creative Commons Attribution 3.0 Unported license (CC BY).
