Connecting images and natural language

Abstract

A long-standing goal in the field of artificial intelligence is to develop agents that can perceive and understand the rich visual world around us and that can communicate with us about it in natural language. Significant strides have been made towards this goal over the last few years due to simultaneous advances in computing infrastructure, data gathering, and algorithms. The progress has been especially rapid in visual recognition, where computers can now classify images into categories with a performance that rivals that of humans, or even surpasses it in some cases, such as classifying breeds of dogs. However, despite much encouraging progress, most of the advances in visual recognition still take place in the context of assigning one or a few discrete labels to an image (e.g. person, boat, keyboard, etc.). In this dissertation we develop models and techniques that allow us to connect the domain of visual data and the domain of natural language utterances, enabling translation between elements of the two domains. In particular, we first introduce a model that embeds both images and sentences into a common multimodal embedding space. This space allows us to identify images that depict an arbitrary sentence description and, conversely, to identify sentences that describe any image. Second, we develop an image captioning model that takes an image and directly generates a sentence description, without being constrained to a finite collection of human-written sentences to choose from. Lastly, we describe a model that can take an image and both localize and describe all of its salient parts. We demonstrate that this model can also be run in reverse, taking an arbitrary description (e.g. white tennis shoes) and efficiently localizing the described concept in a large collection of images. We argue that these models, the techniques they take advantage of internally, and the interactions they enable are a stepping stone towards artificial intelligence, and that connecting images and natural language offers many practical benefits and immediately valuable applications. From the modeling perspective, instead of designing and staging explicit algorithms to process images and sentences in complex processing pipelines, our contribution lies in the design of hybrid convolutional and recurrent neural network architectures that connect visual data and natural language utterances with a single network. Therefore, the computational processing of images and sentences, and the structure of the multimodal embeddings that associate them, emerges automatically during the process of optimizing a loss function with respect to the network's parameters over training datasets of images and their captions. This approach enjoys many of the benefits of neural networks, including the use of simple, homogeneous computations that are easy to parallelize on hardware, and strong performance due to end-to-end training that formulates the problem as a single optimization problem in which all components of the model share the same end objective. We show that our models advance the state of the art on tasks that require joint processing of images and natural language, and that we can design the architectures in ways that facilitate interpretable visual inspection of the network's predictions.
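
To make the first contribution concrete, here is a minimal sketch of a joint image-sentence embedding trained with a max-margin ranking loss. This is not the thesis's implementation: the PyTorch framework, the bag-of-words sentence encoder, and all names and dimensions (JointEmbedding, ranking_loss, margin=0.2) are illustrative assumptions; the thesis combines CNN image features with a learned sentence encoder in this same spirit.

```python
# Hypothetical sketch: joint image-sentence embedding with a bidirectional
# max-margin ranking loss. All layer sizes and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Maps CNN image features and sentences into one shared space."""
    def __init__(self, img_feat_dim=4096, vocab_size=10000, embed_dim=512):
        super().__init__()
        # Linear projection from precomputed CNN features (e.g. a 4096-d
        # fc7 vector) into the shared embedding space.
        self.img_proj = nn.Linear(img_feat_dim, embed_dim)
        # Toy sentence encoder: the mean of learned word embeddings.
        self.word_emb = nn.Embedding(vocab_size, embed_dim)

    def forward(self, img_feats, sent_tokens):
        v = F.normalize(self.img_proj(img_feats), dim=1)                # (B, D)
        s = F.normalize(self.word_emb(sent_tokens).mean(dim=1), dim=1)  # (B, D)
        return v, s

def ranking_loss(v, s, margin=0.2):
    """Bidirectional hinge loss: each matched image-sentence pair (the
    diagonal) must outscore every mismatched pair by at least `margin`."""
    scores = v @ s.t()                                  # cosine similarities
    pos = scores.diag().view(-1, 1)
    cost_s = (margin + scores - pos).clamp(min=0)       # image vs. wrong sentences
    cost_v = (margin + scores - pos.t()).clamp(min=0)   # sentence vs. wrong images
    mask = torch.eye(scores.size(0), dtype=torch.bool)  # zero out matched pairs
    cost_s = cost_s.masked_fill(mask, 0)
    cost_v = cost_v.masked_fill(mask, 0)
    return (cost_s.sum() + cost_v.sum()) / scores.size(0)
```

Once trained, retrieving images for a query sentence (or sentences for a query image) reduces to a nearest-neighbor search in the shared space, which is what makes the bidirectional search described in the abstract efficient.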
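
The captioning model can likewise be sketched as a CNN image feature conditioning a recurrent language model that emits one word per step. Again a hedged illustration, not the thesis's exact architecture: the single-layer LSTM, the greedy decoder, and every size here are assumptions.

```python
# Hypothetical sketch: an image feature initializes an LSTM language model
# that generates a caption word by word. Sizes and names are assumptions.
import torch
import torch.nn as nn

class Captioner(nn.Module):
    def __init__(self, img_feat_dim=4096, vocab_size=10000, hidden=512):
        super().__init__()
        self.img_to_h0 = nn.Linear(img_feat_dim, hidden)  # image sets the initial state
        self.word_emb = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, img_feats, captions):
        # Training with teacher forcing: predict word t+1 from the image
        # and the ground-truth words up to t.
        h0 = torch.tanh(self.img_to_h0(img_feats)).unsqueeze(0)  # (1, B, H)
        c0 = torch.zeros_like(h0)
        hs, _ = self.lstm(self.word_emb(captions), (h0, c0))
        return self.out(hs)                                      # (B, T, vocab) logits

    @torch.no_grad()
    def greedy_decode(self, img_feats, start_id, max_len=20):
        # Generation: feed the argmax word back in at each step. A real
        # decoder would also stop at an end-of-sentence token.
        h = torch.tanh(self.img_to_h0(img_feats)).unsqueeze(0)
        c = torch.zeros_like(h)
        word = torch.full((img_feats.size(0),), start_id, dtype=torch.long)
        words = []
        for _ in range(max_len):
            out, (h, c) = self.lstm(self.word_emb(word).unsqueeze(1), (h, c))
            word = self.out(out.squeeze(1)).argmax(dim=-1)
            words.append(word)
        return torch.stack(words, dim=1)                         # (B, max_len)
```

Because the whole pipeline is differentiable, the image projection, word embeddings, and recurrent weights are all fit jointly by minimizing a single cross-entropy loss over image-caption pairs, which is the end-to-end property the abstract emphasizes.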

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2016
Issuance monographic
Language English

Creators/Contributors

Associated with Karpathy, Andrej
Associated with Stanford University, Department of Computer Science.
Primary advisor Li, Fei Fei, 1976-
Thesis advisor Liang, Percy
Thesis advisor Manning, Christopher D

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Andrej Karpathy.
Note Submitted to the Department of Computer Science.
Thesis Thesis (Ph.D.)--Stanford University, 2016.
Location electronic resource

Access conditions

Copyright
© 2016 by Andrej Karpathy
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
