Fusing multimodal knowledge in language models
Abstract/Contents
- Abstract
- Language models, such as GPT-4, can generate textual responses to user queries and are used across various tasks, including question answering, translation, summarization, and personal assistance. However, to create more versatile AI assistants, these models need to handle more diverse and complex tasks involving domain or visual knowledge, such as answering medical questions and explaining or generating images. This necessity motivates the development of models that can access and leverage diverse knowledge sources beyond text, such as databases and images. In this thesis, we aim to develop language models capable of using multimodal knowledge, encompassing text, knowledge graphs, and images, to address various user queries. Text provides broad and contextually rich knowledge, knowledge graphs often supply structured domain knowledge, and images facilitate various visual applications. This thesis consists of five chapters. The first chapter introduces methods for language models to efficiently learn knowledge from textual data. Specifically, we train language models on sequences of multiple related documents, encouraging them to learn and reason about knowledge with long-range dependencies. This approach yields strong performance on complex long-context and multi-step reasoning tasks. In the second chapter, we introduce methods that enable language models to harness knowledge graph information. Specifically, we develop a new model architecture, a hybrid of language models and graph neural networks, along with a training objective that fuses text and knowledge graph representations. This method demonstrates strong performance on tasks involving domain knowledge, such as medical question answering. In the third chapter, to empower language models to use and generate visual content alongside textual information, we design a unified multimodal model capable of encoding, retrieving, and decoding interleaved sequences of text and images. The model employs a retriever to fetch textual or visual knowledge and integrates it into a multimodal Transformer that encodes and decodes both text and images using token representations. Finally, in the fourth and fifth chapters, we demonstrate the application of textual, structured, and visual knowledge fusion techniques to solve practical healthcare tasks, including clinical trial outcome prediction and multimodal medical question answering. In summary, this thesis builds models capable of comprehending and generating multimodal content, spanning text, knowledge graphs, and images.
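As a reading aid for the second chapter's text–knowledge-graph fusion, the sketch below is a minimal, hypothetical PyTorch rendering of the general idea: a Transformer encodes the text, a small graph neural network encodes a knowledge-graph neighborhood, and cross-attention fuses the two representations. Every module, dimension, and name here is an illustrative assumption, not the architecture developed in the thesis.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of text-KG fusion: a Transformer encoder for text,
# a simple message-passing GNN for a KG neighborhood, and a cross-attention
# layer that lets text tokens attend to KG nodes. All sizes are illustrative.

class SimpleGNNLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, node_feats, adj):
        # Mean-aggregate neighbor features (adj: [n, n] 0/1 matrix), then transform.
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)
        agg = adj @ node_feats / deg
        return torch.relu(self.linear(node_feats + agg))

class TextKGFusion(nn.Module):
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.text_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.gnn = SimpleGNNLayer(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, token_ids, node_feats, adj):
        text = self.text_enc(self.tok_emb(token_ids))      # [b, seq, dim]
        nodes = self.gnn(node_feats, adj).unsqueeze(0)     # [1, n, dim]
        nodes = nodes.expand(text.size(0), -1, -1)         # share KG across batch
        fused, _ = self.cross_attn(text, nodes, nodes)     # text attends to KG
        return text + fused                                # residual fusion

# Toy usage: 2 sequences of 8 tokens, a 5-node KG neighborhood.
model = TextKGFusion()
out = model(
    torch.randint(0, 1000, (2, 8)),
    torch.randn(5, 64),
    (torch.rand(5, 5) > 0.5).float(),
)
print(out.shape)  # torch.Size([2, 8, 64])
```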
Description
Type of resource | text
---|---|
Form | electronic resource; remote; computer; online resource
Extent | 1 online resource.
Place | California
Place | [Stanford, California]
Publisher | [Stanford University]
Copyright date | ©2024
Publication date | 2024
Issuance | monographic
Language | English
Creators/Contributors
Author | Yasunaga, Michihiro |
---|---|
Degree supervisor | Leskovec, Jurij |
Degree supervisor | Liang, Percy |
Thesis advisor | Leskovec, Jurij |
Thesis advisor | Liang, Percy |
Thesis advisor | Manning, Christopher D |
Degree committee member | Manning, Christopher D |
Associated with | Stanford University, School of Engineering |
Associated with | Stanford University, Computer Science Department |
Subjects
Genre | Theses |
---|---|
Genre | Text |
Bibliographic information
Statement of responsibility | Michihiro Yasunaga. |
---|---|
Note | Submitted to the Computer Science Department. |
Thesis | Thesis (Ph.D.)--Stanford University, 2024.
Location | https://purl.stanford.edu/dz688yd5162 |
Access conditions
- Copyright
- © 2024 by Michihiro Yasunaga
- License
- This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).