Compositional representation learning: from reasoning to synthesis
Abstract/Contents
- Abstract
- The world we live in is inherently compositional: just as a sentence is built from phrases and words, a visual scene comprises a collection of interacting objects and entities, which in turn are derived from their parts. This compositionality plays a pivotal role in our ability to understand the world, organize acquired knowledge into a rich set of concepts, and readily adapt them to novel situations and environments. Indeed, it is considered one of the fundamental building blocks of human intelligence. What does compositionality mean in the context of machine intelligence? How can we encourage neural networks to develop a structured understanding of their surroundings? And how can we apply this knowledge to improve performance on downstream tasks in vision and language? These are the key questions explored in this dissertation. We discuss methods and mechanisms that encourage multimodal neural networks to learn compositional scene representations, and in turn to operate over them in a compositional manner, which we leverage for two downstream goals: multimodal reasoning, where we introduce models that can draw a sequence of inferences about visual scenes so as to answer textual questions about them; and visual synthesis, where a model conversely generates pictures depicting multi-object scenes from scratch. We fulfill these aims by incorporating a graphical structure into neural networks, consisting of nodes and edges meant to capture, respectively, the objects within the scene and the relations among them.
We demonstrate how these graph-based structural priors endow neural networks with several desirable properties, including: data efficiency, achieved by decomposing a given task into a series of subtasks, each of which can be learned more easily; generalization, where a model can recombine known concepts in novel ways; controllability, where modifying individual components of the model's latent representation selectively induces the intended modifications in its output; and interpretability of the computational process the model follows, whether to create an image or to reason over one. Throughout this work, we study the interplay and analogies between synthesis and reasoning, and show how, with the right inductive biases incorporated, the former capability can foster the latter.
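The abstract describes scene representations built from nodes (objects) and edges (relations). As a purely illustrative sketch of such a structure, and not the dissertation's actual neural models, a minimal scene graph might look like this (all names here, `SceneGraph`, `add_object`, `relate`, are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    """Toy scene-graph container: nodes are objects with attributes,
    edges are (subject, relation, object) triples."""
    nodes: dict = field(default_factory=dict)   # object_id -> attribute dict
    edges: list = field(default_factory=list)   # (subject_id, relation, object_id)

    def add_object(self, obj_id, **attrs):
        self.nodes[obj_id] = attrs

    def relate(self, subj, relation, obj):
        self.edges.append((subj, relation, obj))

    def neighbors(self, obj_id):
        """Return the ids of objects directly related to obj_id, in either direction."""
        out = set()
        for s, _, o in self.edges:
            if s == obj_id:
                out.add(o)
            elif o == obj_id:
                out.add(s)
        return out

# Example: a two-object scene with one spatial relation
scene = SceneGraph()
scene.add_object("cube", color="red", size="small")
scene.add_object("sphere", color="blue", size="large")
scene.relate("cube", "left_of", "sphere")
print(sorted(scene.neighbors("cube")))  # ['sphere']
```

In the neural setting the dissertation describes, nodes and edges would carry learned vector representations rather than symbolic attributes; this sketch only conveys the underlying graphical structure.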
Description
Type of resource | text |
---|---|
Form | electronic resource; remote; computer; online resource |
Extent | 1 online resource. |
Place | California |
Place | [Stanford, California] |
Publisher | [Stanford University] |
Copyright date | ©2023
Publication date | 2023
Issuance | monographic |
Language | English |
Creators/Contributors
Author | Arad, Dor |
---|---|
Degree supervisor | Bernstein, Michael S, 1984- |
Degree supervisor | McClelland, James L |
Thesis advisor | Bernstein, Michael S, 1984- |
Thesis advisor | McClelland, James L |
Thesis advisor | Guibas, Leonidas J |
Thesis advisor | Leskovec, Jurij |
Thesis advisor | Liang, Percy |
Degree committee member | Guibas, Leonidas J |
Degree committee member | Leskovec, Jurij |
Degree committee member | Liang, Percy |
Associated with | Stanford University, School of Engineering |
Associated with | Stanford University, Computer Science Department |
Subjects
Genre | Theses |
---|---|
Genre | Text |
Bibliographic information
Statement of responsibility | Drew A. Hudson (Dor Arad). |
---|---|
Note | Submitted to the Computer Science Department. |
Thesis | Thesis (Ph.D.)--Stanford University, 2023.
Location | https://purl.stanford.edu/fp269yy9833 |
Access conditions
- Copyright
- © 2023 by Dor Arad
- License
- This work is licensed under a Creative Commons Attribution 3.0 Unported license (CC BY).