Exploiting structured data for robust and adaptable natural language representations
Abstract/Contents
- Abstract
- The foundations of many recent machine learning successes are natural language representations pretrained over vast amounts of unstructured text. Over the past several decades, natural language representations have been trained on increasingly large datasets, with the most recent representations trained on over one trillion tokens. However, despite this immense scale, existing representations continue to face long-standing challenges, such as capturing rare, or long-tail, knowledge and adapting to natural language feedback. A key bottleneck is that current representations rely on memorizing knowledge in unstructured data, and thus are ultimately limited by the knowledge present in unstructured data. Unstructured data has limited facts about many entities (people, places, or things), as well as limited domain-specific data, like goal-oriented conversations. In this thesis, we exploit a largely untapped and carefully curated resource, structured data, to improve natural language representations. Structured data includes knowledge graphs and item collections (e.g., playlists) that contain rich relationships between entities, such as the birthplace of an artist, all versions of a single song, or all songs by the same artist. These relationships can be challenging to learn from unstructured data, as they may occur infrequently, or may not even exist, in unstructured data. Yet, structured data comes with limitations: humans communicate in unstructured natural language, not structured queries, and structured data can also be incomplete and noisy. Motivated by the complementary knowledge in unstructured and structured data, we present three techniques that combine structured data with unstructured data for training natural language representations. Our techniques span the three main components of a machine learning pipeline: the training data, the model architecture, and the training objective.
First, with TalkTheWalk, we use structured data to generate unstructured training data for conversational recommendation systems. By training a conversational music recommendation system over the synthetic data, we demonstrate how structured data can help improve adaptability over standard recommendation baselines. Next, with Bootleg, we introduce a Transformer-based architecture that leverages structured data to learn key reasoning patterns from unstructured text for named entity disambiguation. We demonstrate that learning these reasoning patterns leads to significant lift on disambiguating entities that rarely or never occur in text, and we discuss our results applying Bootleg to a production assistant task at a major technology company. Finally, with TABi, we add structured data as supervision in a contrastive loss function to improve robustness, while using more general-purpose models. We validate that TABi not only improves rare entity retrieval, but also performs strongly in settings with incomplete and noisy structured data. The three techniques introduced in this thesis (TalkTheWalk, Bootleg, and TABi) demonstrate that training approaches that combine structured data with unstructured data can enable more robust and adaptable natural language representations.
Description
Type of resource | text |
---|---|
Form | electronic resource; remote; computer; online resource |
Extent | 1 online resource. |
Place | California |
Place | [Stanford, California] |
Publisher | [Stanford University] |
Copyright date | 2023; ©2023 |
Publication date | 2023 |
Issuance | monographic |
Language | English |
Creators/Contributors
Author | Leszczynski, Megan Eileen |
---|---|
Degree supervisor | Ré, Christopher |
Thesis advisor | Ré, Christopher |
Thesis advisor | Lam, Monica S |
Thesis advisor | Manning, Christopher D |
Degree committee member | Lam, Monica S |
Degree committee member | Manning, Christopher D |
Associated with | Stanford University, School of Engineering |
Associated with | Stanford University, Computer Science Department |
Subjects
Genre | Theses |
---|---|
Genre | Text |
Bibliographic information
Statement of responsibility | Megan Eileen Leszczynski. |
---|---|
Note | Submitted to the Computer Science Department. |
Thesis | Thesis Ph.D. Stanford University 2023. |
Location | https://purl.stanford.edu/mw196yh2577 |
Access conditions
- Copyright
- © 2023 by Megan Eileen Leszczynski
- License
- This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).