Compositional representation learning: from reasoning to synthesis


Abstract/Contents

Abstract
The world we live in is inherently compositional: just as a sentence is built from phrases and words, a visual scene comprises a collection of interacting objects and entities, which in turn derive from the sum of their parts. This compositionality plays a pivotal role in our ability to understand the world, organize acquired knowledge into a rich set of concepts, and readily adapt them to novel situations and environments; indeed, it is considered one of the fundamental building blocks of human intelligence. What does compositionality mean in the context of machine intelligence? How can we encourage neural networks to develop a structured understanding of our surroundings? And how can we apply this knowledge to improve downstream tasks in vision and language? These are the key questions explored in this dissertation. We discuss ways and mechanisms to encourage multimodal neural networks to learn compositional scene representations, and in turn to operate over them compositionally, which we leverage for two downstream goals: multimodal reasoning, where we introduce models that can draw a sequence of inferences about visual scenes so as to answer textual questions about them; and visual synthesis, where a model can inversely generate pictures depicting multi-object scenes from scratch. We fulfill these aims by incorporating a graphical structure into neural networks, consisting of nodes and edges meant to capture, respectively, the objects within the scene and the relations among them.
We demonstrate how these graph-based structural priors endow neural networks with multiple desirable properties, including: data efficiency, achieved by decomposing a given task into a series of subtasks, each of which can be learned more easily; generalization, where a model can recombine known concepts in novel ways; controllability, where modifying particular components of the model's latent representation selectively induces the intended changes in its output; and interpretability of the computational process the model follows, whether to create an image or to reason over it. Throughout this work, we study the interplay and analogies between synthesis and reasoning, and show how, with the right inductive biases incorporated, the former capability can foster the latter.

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place [Stanford, California]
Publisher [Stanford University]
Copyright date ©2023
Publication date 2023
Issuance monographic
Language English

Creators/Contributors

Author Arad, Dor
Degree supervisor Bernstein, Michael S, 1984-
Degree supervisor McClelland, James L
Thesis advisor Bernstein, Michael S, 1984-
Thesis advisor McClelland, James L
Thesis advisor Guibas, Leonidas J
Thesis advisor Leskovec, Jurij
Thesis advisor Liang, Percy
Degree committee member Guibas, Leonidas J
Degree committee member Leskovec, Jurij
Degree committee member Liang, Percy
Associated with Stanford University, School of Engineering
Associated with Stanford University, Computer Science Department

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Drew A. Hudson (Dor Arad).
Note Submitted to the Computer Science Department.
Thesis Thesis (Ph.D.)--Stanford University, 2023.
Location https://purl.stanford.edu/fp269yy9833

Access conditions

Copyright
© 2023 by Dor Arad
License
This work is licensed under a Creative Commons Attribution 3.0 Unported license (CC BY).
