Exploiting structured data for robust and adaptable natural language representations

Leszczynski, Megan Eileen

Abstract/Contents

Abstract: The foundations of many recent machine learning successes are natural language representations pretrained over vast amounts of unstructured text. Over the past several decades, natural language representations have been trained on increasingly large datasets, with the most recent representations trained on over one trillion tokens. However, despite this immense scale, existing representations continue to face long-standing challenges, such as capturing rare, or long-tail, knowledge and adapting to natural language feedback. A key bottleneck is that current representations rely on memorizing knowledge in unstructured data, and thus are ultimately limited by the knowledge present in unstructured data. Unstructured data has limited facts about many entities (people, places, or things), as well as limited domain-specific data, like goal-oriented conversations. In this thesis, we exploit a largely untapped and carefully curated resource, structured data, to improve natural language representations. Structured data includes knowledge graphs and item collections (e.g., playlists) that contain rich relationships between entities, such as the birthplace of an artist, all versions of a single song, or all songs by the same artist. These relationships can be challenging to learn from unstructured data, as they may occur infrequently, or may not even exist, in unstructured data. Yet structured data comes with limitations: humans communicate in unstructured natural language, not structured queries, and structured data can also be incomplete and noisy. Motivated by the complementary knowledge in unstructured and structured data, we present three techniques that combine structured data with unstructured data for training natural language representations. Our techniques span the three main components of a machine learning pipeline: the training data, the model architecture, and the training objective.
First, with TalkTheWalk, we use structured data to generate unstructured training data for conversational recommendation systems. By training a conversational music recommendation system over the synthetic data, we demonstrate how structured data can help improve adaptability over standard recommendation baselines. Next, with Bootleg, we introduce a Transformer-based architecture that leverages structured data to learn key reasoning patterns from unstructured text for named entity disambiguation. We demonstrate that learning these reasoning patterns leads to a significant lift in disambiguating entities that rarely or never occur in text, and we discuss our results applying Bootleg to a production assistant task at a major technology company. Finally, with TABi, we add structured data as supervision in a contrastive loss function to improve robustness, while using more general-purpose models. We validate that TABi not only improves rare entity retrieval, but also performs strongly in settings with incomplete and noisy structured data. The three techniques introduced in this thesis (TalkTheWalk, Bootleg, and TABi) demonstrate that training approaches that combine structured data with unstructured data can enable more robust and adaptable natural language representations.

Description

Type of resource	text
Form	electronic resource; remote; computer; online resource
Extent	1 online resource.
Place	California
Place	[Stanford, California]
Publisher	[Stanford University]
Copyright date	2023; ©2023
Publication date	2023
Issuance	monographic
Language	English

Creators/Contributors

Author	Leszczynski, Megan Eileen
Degree supervisor	Ré, Christopher
Thesis advisor	Ré, Christopher
Thesis advisor	Lam, Monica S
Thesis advisor	Manning, Christopher D
Degree committee member	Lam, Monica S
Degree committee member	Manning, Christopher D
Associated with	Stanford University, School of Engineering
Associated with	Stanford University, Computer Science Department

Subjects

Genre	Theses
Genre	Text

Bibliographic information

Statement of responsibility	Megan Eileen Leszczynski.
Note	Submitted to the Computer Science Department.
Thesis	Thesis Ph.D. Stanford University 2023.
Location	https://purl.stanford.edu/mw196yh2577

Access conditions

License: This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
