Question answering over the semantic web with neural semantic parsing
- The Semantic Web adds meaning to the vast amount of information on the web, enabling machines not only to understand the data but also to reason, infer, and make logical connections across diverse domains. For example, Schema.org creates a standardized vocabulary that enables webmasters to embed structured data on their web pages for applications like search engines; Wikidata, the largest open knowledge graph with data about various entities, concepts, and facts, acts as central storage for the structured data of its Wikimedia sister projects, including Wikipedia. However, accessing the Semantic Web is challenging for average users: unlike traditional web browsing, navigating the Semantic Web requires expertise in ontologies and query languages. As a result, the Semantic Web remains confined to technical experts, researchers, and specialized applications, limiting its accessibility to a broader population. To bridge the gap between user intents and structured data, semantic parsing has been used to build question-answering systems. Semantic parsing is the task of converting natural language into logical forms, such as queries in a query language, which can be executed to extract information. Traditional approaches to building such systems require substantial amounts of manually annotated data, often making them impractical and costly to deploy at scale. This thesis introduces a novel synthesis-based methodology to generate large, high-quality training datasets from the schema and data values of a given database or knowledge graph. This approach significantly reduces the cost of building a question-answering agent for the Semantic Web. To provide full coverage of the query space, we propose a comprehensive template system to synthesize a sample of all possible queries that can be composed with the basic database algebra operators given a schema.
The template set is built on natural language grammar and captures a wide variety of phrasings, including different sentence purposes and parts of speech, which minimizes the manual annotation effort and the reliance on paraphrasing. We created the Schema2QA benchmark by crawling Schema.org data from websites in 6 domains, and we show that the semantic parsers trained with our synthetic data and few-shot human paraphrases achieve 64% to 75% accuracy on these domains, outperforming major commercial assistants on these long-tail complex questions by at least 18% while maintaining comparable accuracy on the more popular questions. To further reduce the manual effort, we add a neural property annotator and a filtered neural paraphraser to the pipeline, enabling a question-answering agent to be bootstrapped from the database schema fully automatically. We show that the automatically generated parsers achieve an average of 62.9% accuracy, only 6.4% lower than models trained with expert annotations and human paraphrase data, surpassing the state-of-the-art zero-shot models. To answer open-domain questions, we build a question-answering agent based on the information from Wikidata, the largest open knowledge base for world knowledge. To tackle the scale and sparsity of information in Wikidata, we propose a simplified abstract query representation and an entity-linking approach with failure recovery to improve the performance of the model. Evaluated on WikiWebQuestions, a dataset we created that is the first large Wikidata semantic parsing dataset with real-world human-written questions, our approach, WikiSP, achieves 69% answer accuracy on the dev set and 59% on the test set, establishing a strong baseline for the benchmark. In addition, we show that we can pair WikiSP with GPT-3 to provide a combination of verifiable results and qualified guesses that can provide useful answers to 97% of the questions.
In summary, this thesis proposes a novel synthesis-based methodology for question-answering semantic parsing. By leveraging the schema and data values within databases and knowledge bases, our methodology can substantially reduce the cost of building question-answering agents, thereby paving the way for a broader population to access the Semantic Web with natural language.
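The template-based synthesis idea described in the abstract can be illustrated with a toy sketch. This is not the thesis's template system: the schema, the phrase templates, and the `synthesize` function are hypothetical, and SQL-like strings stand in for whatever logical form the actual parser targets. It only shows how question/query training pairs could be generated from a schema and its data values without manual annotation.

```python
# Hypothetical toy schema: a table name plus, for each property, a
# natural-language phrase template and a few sample data values.
SCHEMA = {
    "table": "restaurant",
    "properties": {
        "cuisine": {"phrase": "that serve {value} food", "values": ["Italian", "Thai"]},
        "city": {"phrase": "in {value}", "values": ["Palo Alto"]},
    },
}


def synthesize(schema):
    """Yield (question, query) pairs, one per property/value filter."""
    table = schema["table"]
    for prop, info in schema["properties"].items():
        for value in info["values"]:
            # Fill the natural-language template with the data value.
            question = f"show me {table}s {info['phrase'].format(value=value)}"
            # Build the matching selection query (SQL-like stand-in).
            query = f"SELECT * FROM {table} WHERE {prop} = '{value}'"
            yield question, query


pairs = list(synthesize(SCHEMA))
for question, query in pairs:
    print(question, "->", query)
```

A fuller system along these lines would also compose operators such as projection, join, and aggregation, and would vary the phrasing by sentence type and part of speech, as the abstract describes.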
|Type of resource
|electronic resource; remote; computer; online resource
|1 online resource.
|Degree committee member
|Stanford University, School of Engineering
|Stanford University, Computer Science Department
|Statement of responsibility
|Submitted to the Computer Science Department.
|Thesis Ph.D. Stanford University 2023.
- © 2023 by Silei Xu
- This work is licensed under a Creative Commons Attribution Share Alike 3.0 Unported license (CC BY-SA).