Fusing multimodal knowledge in language models


Abstract/Contents

Abstract
Language models, such as GPT-4, can generate textual responses to user queries. They are used across various tasks, including question answering, translation, summarization, and personal assistance. However, to create more versatile AI assistants, these models need to handle more diverse and complex tasks involving domain or visual knowledge, such as answering medical questions and explaining or generating images. This necessity motivates the development of models that can access and leverage diverse knowledge sources beyond text, such as databases and images. In this thesis, we aim to develop language models capable of using multimodal knowledge, encompassing text, knowledge graphs, and images, to address various user queries. Text provides broad and contextually rich knowledge, knowledge graphs often supply structured domain knowledge, and images facilitate various visual applications. This thesis consists of five chapters. The first chapter introduces methods for language models to efficiently learn knowledge from textual data. Specifically, we train language models on sequences of multiple related documents, encouraging them to learn and reason about knowledge with long-range dependencies. This approach yields strong performance on complex long-context and multi-step reasoning tasks. In the second chapter, we introduce methods that enable language models to harness knowledge graph information. Specifically, we develop a new model architecture, a hybrid of language models and graph neural networks, along with a training objective that fuses text and knowledge graph representations. This method demonstrates strong performance on tasks involving domain knowledge, such as medical question answering. In the third chapter, to empower language models to use and generate visual content alongside textual information, we design a unified multimodal model capable of encoding, retrieving, and decoding interleaved sequences of text and images. The model employs a retriever to fetch textual or visual knowledge and integrates it into a multimodal Transformer that encodes and decodes both text and images using token representations. Finally, in the fourth and fifth chapters, we demonstrate the application of textual, structured, and visual knowledge fusion techniques to solve practical healthcare tasks, including clinical trial outcome prediction and multimodal medical question answering. In summary, this thesis builds models capable of comprehending and generating multimodal content, spanning text, knowledge graphs, and images.
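
To illustrate the kind of text-and-knowledge-graph fusion described in the second chapter, the following is a minimal, hypothetical PyTorch sketch. It is not the thesis code: the module names, dimensions, the dense-adjacency GNN layer, and the cross-attention fusion choice are all assumptions made for illustration only.

    # Illustrative sketch (hypothetical): encode text with a Transformer,
    # encode a knowledge-graph subgraph with a simple GNN, then fuse the two.
    import torch
    import torch.nn as nn

    class SimpleGNNLayer(nn.Module):
        """One round of message passing over a dense adjacency matrix."""
        def __init__(self, dim):
            super().__init__()
            self.linear = nn.Linear(dim, dim)

        def forward(self, node_states, adj):
            # Aggregate neighbor states, then transform.
            messages = adj @ node_states                 # (num_nodes, dim)
            return torch.relu(self.linear(messages + node_states))

    class TextKGFusionModel(nn.Module):
        """Encodes text tokens and KG nodes, then fuses the two modalities."""
        def __init__(self, vocab_size=1000, num_entities=500, dim=64):
            super().__init__()
            self.token_emb = nn.Embedding(vocab_size, dim)
            self.text_encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
                num_layers=2,
            )
            self.entity_emb = nn.Embedding(num_entities, dim)
            self.gnn = SimpleGNNLayer(dim)
            # Cross-attention lets text representations attend to KG node states.
            self.fusion = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

        def forward(self, token_ids, entity_ids, adj):
            text = self.text_encoder(self.token_emb(token_ids))      # (B, T, dim)
            nodes = self.gnn(self.entity_emb(entity_ids), adj)       # (N, dim)
            nodes = nodes.unsqueeze(0).expand(text.size(0), -1, -1)  # (B, N, dim)
            fused, _ = self.fusion(query=text, key=nodes, value=nodes)
            return fused  # joint text-and-KG representations

    # Toy usage: a batch of 2 sequences of 8 tokens and a 5-node subgraph.
    model = TextKGFusionModel()
    tokens = torch.randint(0, 1000, (2, 8))
    entities = torch.randint(0, 500, (5,))
    adj = torch.eye(5)
    print(model(tokens, entities, adj).shape)  # torch.Size([2, 8, 64])

The actual thesis architecture and training objective differ; this sketch only conveys the general idea of combining a language-model encoder with a graph neural network over knowledge-graph entities.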

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place California
Place [Stanford, California]
Publisher [Stanford University]
Copyright date ©2024
Publication date 2024
Issuance monographic
Language English

Creators/Contributors

Author Yasunaga, Michihiro
Degree supervisor Leskovec, Jurij
Degree supervisor Liang, Percy
Thesis advisor Leskovec, Jurij
Thesis advisor Liang, Percy
Thesis advisor Manning, Christopher D
Degree committee member Manning, Christopher D
Associated with Stanford University, School of Engineering
Associated with Stanford University, Computer Science Department

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Michihiro Yasunaga.
Note Submitted to the Computer Science Department.
Thesis Ph.D., Stanford University, 2024.
Location https://purl.stanford.edu/dz688yd5162

Access conditions

Copyright
© 2024 by Michihiro Yasunaga
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC).
