Exploiting structured data for robust and adaptable natural language representations

Abstract/Contents

Abstract
The foundations of many recent machine learning successes are natural language representations pretrained over vast amounts of unstructured text. Over the past several decades, natural language representations have been trained on increasingly large datasets, with the most recent representations trained on over one trillion tokens. However, despite this immense scale, existing representations continue to face long-standing challenges, such as capturing rare, or long-tail, knowledge and adapting to natural language feedback. A key bottleneck is that current representations rely on memorizing knowledge from unstructured data, and thus are ultimately limited by the knowledge present in that data. Unstructured data contains few facts about many entities (people, places, or things) and little domain-specific data, such as goal-oriented conversations. In this thesis, we exploit a largely untapped and carefully curated resource, structured data, to improve natural language representations. Structured data includes knowledge graphs and item collections (e.g., playlists) that contain rich relationships between entities, such as the birthplace of an artist, all versions of a single song, or all songs by the same artist. These relationships can be challenging to learn from unstructured data, where they may occur infrequently or not at all. Yet structured data comes with limitations: humans communicate in unstructured natural language, not structured queries, and structured data can also be incomplete and noisy. Motivated by the complementary knowledge in unstructured and structured data, we present three techniques that combine structured data with unstructured data for training natural language representations. Our techniques span the three main components of a machine learning pipeline: the training data, the model architecture, and the training objective.
First, with TalkTheWalk, we use structured data to generate unstructured training data for conversational recommendation systems. By training a conversational music recommendation system over the synthetic data, we demonstrate how structured data can help improve adaptability over standard recommendation baselines. Next, with Bootleg, we introduce a Transformer-based architecture that leverages structured data to learn key reasoning patterns from unstructured text for named entity disambiguation. We demonstrate that learning these reasoning patterns leads to significant improvements in disambiguating entities that rarely or never occur in text, and we discuss our results applying Bootleg to a production assistant task at a major technology company. Finally, with TABi, we add structured data as supervision in a contrastive loss function to improve robustness, while using more general-purpose models. We validate that TABi not only improves rare entity retrieval, but also performs strongly in settings with incomplete and noisy structured data. The three techniques introduced in this thesis (TalkTheWalk, Bootleg, and TABi) demonstrate that training approaches that combine structured data with unstructured data can enable more robust and adaptable natural language representations.

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place California
Place [Stanford, California]
Publisher [Stanford University]
Copyright date ©2023
Publication date 2023
Issuance monographic
Language English

Creators/Contributors

Author Leszczynski, Megan Eileen
Degree supervisor Ré, Christopher
Thesis advisor Ré, Christopher
Thesis advisor Lam, Monica S
Thesis advisor Manning, Christopher D
Degree committee member Lam, Monica S
Degree committee member Manning, Christopher D
Associated with Stanford University, School of Engineering
Associated with Stanford University, Computer Science Department

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Megan Eileen Leszczynski.
Note Submitted to the Computer Science Department.
Thesis Thesis Ph.D. Stanford University 2023.
Location https://purl.stanford.edu/mw196yh2577

Access conditions

Copyright
© 2023 by Megan Eileen Leszczynski
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC).
