Natural language interfaces for semi-structured web pages

Abstract/Contents

Abstract
Question answering (QA) systems take natural language questions and then compute answers based on a knowledge source. This dissertation focuses on improving QA systems along two axes. First, instead of operating on knowledge sources with a fixed schema such as a database, we propose to use web pages, which contain a large amount of up-to-date open-domain information (high BREADTH), as the knowledge source. Second, we want the QA system to understand more complex questions and perform different types of multistep reasoning to compute the answer (high DEPTH). Unlike most previous work on retrieval-based QA (which operates on open-domain unstructured text but targets only factoid questions) and knowledge-based QA (which can handle compositional questions but only on knowledge sources with fixed schemata), we aim to address the two axes simultaneously. One important aspect of web pages is that they are semi-structured: they contain structural constructs such as tables and template-generated product listings, but the schemata of such structures are not known in advance by the QA system. To explore the semi-structured nature of web pages, we first investigate the task of extracting a list of entities from a web page based on a natural language specification (e.g., from "(What are) hiking trails near Baltimore", extract the trail names from a table column). Then, to increase the complexity of the questions, we study the task of answering complex questions on open-domain semi-structured web tables using question-answer pairs as supervision (e.g., answering "Where did the last 1st place finish occur?" in an athlete's statistics table). To handle compositional questions with different types of operations, we frame the task as learning a semantic parser, which maps questions into compositional logical forms that can be executed to get the answer. Our semantic parser can answer complex questions on unseen web tables and achieves an accuracy of 43.7% on the dataset.
Overall, we show that while the unknown schema of the tables (increased BREADTH) and the complexity of the questions (increased DEPTH) lead to an exploding search space of logical forms, our proposed methods control the search space to a manageable size, enabling us to train a QA system that can operate on open-domain web pages. The resulting QA system can potentially enable virtual assistants, search engines, and other similar products to handle a much wider range of user utterances.

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place California
Place [Stanford, California]
Publisher [Stanford University]
Copyright date 2019; ©2019
Publication date 2019; 2019
Issuance monographic
Language English

Creators/Contributors

Author Pasupat, Panupong
Degree supervisor Liang, Percy
Thesis advisor Liang, Percy
Thesis advisor Jurafsky, Dan, 1962-
Thesis advisor Manning, Christopher D
Degree committee member Jurafsky, Dan, 1962-
Degree committee member Manning, Christopher D
Associated with Stanford University, Computer Science Department.

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Panupong Pasupat.
Note Submitted to the Computer Science Department.
Thesis Thesis Ph.D. Stanford University 2019.
Location electronic resource

Access conditions

Copyright
© 2019 by Panupong Pasupat
License
This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
