Compositional visual intelligence

Johnson, Justin

Abstract/Contents

Abstract: The field of computer vision has made enormous progress in the last few years, largely due to convolutional neural networks. Despite success on traditional computer vision tasks, our systems are still a long way from the general visual intelligence of people. An important facet of visual intelligence is composition: understanding of the whole derives from an understanding of the parts. To achieve the goal of compositional visual intelligence, we must explore new computer vision tasks, create new datasets, and develop new models that exploit compositionality. In this dissertation I will discuss my work on three different computer vision tasks involving language, where embracing compositionality helps us build systems with richer visual intelligence. I will first discuss image captioning: traditional systems generate short sentences describing images, but by decomposing images into regions and descriptions into phrases, we can generate two types of richer descriptions: dense captions and paragraphs. Second, I will discuss visual question answering: existing datasets consist primarily of short, simple questions; to study more complex questions requiring compositional reasoning, we introduce a new benchmark dataset where existing methods fall short. We then propose an explicitly compositional model for visual question answering that internally converts questions to functional programs, and executes these programs by composing neural modules. Third, I will discuss text-to-image: existing systems can retrieve or generate simple images of a single object conditioned on text descriptions, but struggle with more complex descriptions. By replacing freeform natural language with compositional scene graphs of objects and relationships, we can retrieve and generate complex images containing multiple objects.

Description

Type of resource	text
Form	electronic resource; remote; computer; online resource
Extent	1 online resource.
Place	California
Place	[Stanford, California]
Publisher	[Stanford University]
Copyright date	2018; ©2018
Publication date	2018; 2018
Issuance	monographic
Language	English

Creators/Contributors

Author	Johnson, Justin
Degree supervisor	Li, Fei Fei, 1976-
Thesis advisor	Li, Fei Fei, 1976-
Thesis advisor	Goodman, Noah
Thesis advisor	Ré, Christopher
Degree committee member	Goodman, Noah
Degree committee member	Ré, Christopher
Associated with	Stanford University, Computer Science Department.

Subjects

Genre	Theses
Genre	Text

Bibliographic information

Statement of responsibility	Justin Johnson.
Note	Submitted to the Computer Science Department.
Thesis	Thesis Ph.D. Stanford University 2018.
Location	electronic resource

Access conditions

License: This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
