Fusing multimodal knowledge in language models


Abstract/Contents

Abstract
Language models, such as GPT-4, can generate textual responses to user queries. They are used across various tasks, including question answering, translation, summarization, and personal assistance. However, to create more versatile AI assistants, these models need to handle more diverse and complex tasks involving domain or visual knowledge, such as answering medical questions and explaining or generating images. This necessity motivates the development of models that can access and leverage diverse knowledge sources beyond text, such as databases and images. In this thesis, we aim to develop language models capable of using multimodal knowledge, encompassing text, knowledge graphs, and images, to address various user queries. Text provides broad and contextually rich knowledge, knowledge graphs often supply structured domain knowledge, and images facilitate various visual applications. This thesis consists of five chapters. The first chapter introduces methods for language models to efficiently learn knowledge from textual data. Specifically, we train language models on sequences of multiple related documents, encouraging them to learn and reason about knowledge with long-range dependencies. This approach yields strong performance on complex long-context and multi-step reasoning tasks. In the second chapter, we introduce methods that enable language models to harness knowledge graph information. Specifically, we develop a new model architecture, a hybrid of language models and graph neural networks, along with a training objective that fuses text and knowledge graph representations. This method demonstrates strong performance on tasks involving domain knowledge, such as medical question answering. In the third chapter, to empower language models to use and generate visual content alongside textual information, we design a unified multimodal model capable of encoding, retrieving, and decoding interleaved sequences of text and images. The model employs a retriever to fetch textual or visual knowledge and integrates it into a multimodal Transformer that encodes and decodes both text and images using token representations. Finally, in the fourth and fifth chapters, we demonstrate the application of textual, structured, and visual knowledge fusion techniques to solve practical healthcare tasks, including clinical trial outcome prediction and multimodal medical question answering. In summary, this thesis builds models capable of comprehending and generating multimodal content, spanning text, knowledge graphs, and images.
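
To illustrate the kind of text-and-knowledge-graph fusion described in the second chapter, the following is a minimal, hypothetical PyTorch sketch. It is not the thesis code: the module names, dimensions, the dense-adjacency GNN layer, and the cross-attention fusion choice are all assumptions made for illustration only.

    # Illustrative sketch (hypothetical): encode text with a Transformer,
    # encode a knowledge-graph subgraph with a simple GNN, then fuse the two.
    import torch
    import torch.nn as nn

    class SimpleGNNLayer(nn.Module):
        """One round of message passing over a dense adjacency matrix."""
        def __init__(self, dim):
            super().__init__()
            self.linear = nn.Linear(dim, dim)

        def forward(self, node_states, adj):
            # Aggregate neighbor states, then transform.
            messages = adj @ node_states                 # (num_nodes, dim)
            return torch.relu(self.linear(messages + node_states))

    class TextKGFusionModel(nn.Module):
        """Encodes text tokens and KG nodes, then fuses the two modalities."""
        def __init__(self, vocab_size=1000, num_entities=500, dim=64):
            super().__init__()
            self.token_emb = nn.Embedding(vocab_size, dim)
            self.text_encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
                num_layers=2,
            )
            self.entity_emb = nn.Embedding(num_entities, dim)
            self.gnn = SimpleGNNLayer(dim)
            # Cross-attention lets text representations attend to KG node states.
            self.fusion = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

        def forward(self, token_ids, entity_ids, adj):
            text = self.text_encoder(self.token_emb(token_ids))      # (B, T, dim)
            nodes = self.gnn(self.entity_emb(entity_ids), adj)       # (N, dim)
            nodes = nodes.unsqueeze(0).expand(text.size(0), -1, -1)  # (B, N, dim)
            fused, _ = self.fusion(query=text, key=nodes, value=nodes)
            return fused  # joint text-and-KG representations

    # Toy usage: a batch of 2 sequences of 8 tokens and a 5-node subgraph.
    model = TextKGFusionModel()
    tokens = torch.randint(0, 1000, (2, 8))
    entities = torch.randint(0, 500, (5,))
    adj = torch.eye(5)
    print(model(tokens, entities, adj).shape)  # torch.Size([2, 8, 64])

The actual thesis architecture and training objective differ; this sketch only conveys the general idea of combining a language-model encoder with a graph neural network over knowledge-graph entities.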

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place California
Place [Stanford, California]
Publisher [Stanford University]
Copyright date ©2024
Publication date 2024
Issuance monographic
Language English

Creators/Contributors

Author Yasunaga, Michihiro
Degree supervisor Leskovec, Jurij
Degree supervisor Liang, Percy
Thesis advisor Leskovec, Jurij
Thesis advisor Liang, Percy
Thesis advisor Manning, Christopher D
Degree committee member Manning, Christopher D
Associated with Stanford University, School of Engineering
Associated with Stanford University, Computer Science Department

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Michihiro Yasunaga.
Note Submitted to the Computer Science Department.
Thesis Ph.D., Stanford University, 2024.
Location https://purl.stanford.edu/dz688yd5162

Access conditions

Copyright
© 2024 by Michihiro Yasunaga
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC).
