Connecting images and natural language

Abstract

A long-standing goal in the field of artificial intelligence is to develop agents that can perceive and understand the rich visual world around us and that can communicate with us about it in natural language. Significant strides have been made towards this goal over the last few years due to simultaneous advances in computing infrastructure, data gathering, and algorithms. The progress has been especially rapid in visual recognition, where computers can now classify images into categories with a performance that rivals that of humans, or even surpasses it in some cases, such as classifying breeds of dogs. However, despite much encouraging progress, most of the advances in visual recognition still take place in the context of assigning one or a few discrete labels to an image (e.g. person, boat, keyboard, etc.). In this dissertation we develop models and techniques that allow us to connect the domain of visual data and the domain of natural language utterances, enabling translation between elements of the two domains. In particular, we first introduce a model that embeds both images and sentences into a common multimodal embedding space. This space allows us to identify images that depict an arbitrary sentence description and, conversely, to identify sentences that describe any image. Second, we develop an image captioning model that takes an image and directly generates a sentence description, without being constrained to a finite collection of human-written sentences to choose from. Lastly, we describe a model that can take an image and both localize and describe all of its salient parts. We demonstrate that this model can also be run in reverse, taking an arbitrary description (e.g. white tennis shoes) and efficiently localizing the described concept in a large collection of images. We argue that these models, the techniques they take advantage of internally, and the interactions they enable are a stepping stone towards artificial intelligence, and that connecting images and natural language offers many practical benefits and immediately valuable applications. From the modeling perspective, instead of designing and staging explicit algorithms to process images and sentences in complex processing pipelines, our contribution lies in the design of hybrid convolutional and recurrent neural network architectures that connect visual data and natural language utterances with a single network. Therefore, the computational processing of images and sentences, and the structure of the multimodal embeddings that associate them, emerges automatically during the process of optimizing a loss function with respect to the network's parameters over training datasets of images and their captions. This approach enjoys many of the benefits of neural networks, including the use of simple, homogeneous computations that are easy to parallelize on hardware, and strong performance due to end-to-end training that formulates the problem as a single optimization problem in which all components of the model share the same end objective. We show that our models advance the state of the art on tasks that require joint processing of images and natural language, and that we can design the architectures in ways that facilitate interpretable visual inspection of the network's predictions.
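
To make the first contribution concrete, here is a minimal sketch of a joint image-sentence embedding trained with a max-margin ranking loss. This is not the thesis's implementation: the PyTorch framework, the bag-of-words sentence encoder, and all names and dimensions (JointEmbedding, ranking_loss, margin=0.2) are illustrative assumptions; the thesis combines CNN image features with a learned sentence encoder in this same spirit.

```python
# Hypothetical sketch: joint image-sentence embedding with a bidirectional
# max-margin ranking loss. All layer sizes and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Maps CNN image features and sentences into one shared space."""
    def __init__(self, img_feat_dim=4096, vocab_size=10000, embed_dim=512):
        super().__init__()
        # Linear projection from precomputed CNN features (e.g. a 4096-d
        # fc7 vector) into the shared embedding space.
        self.img_proj = nn.Linear(img_feat_dim, embed_dim)
        # Toy sentence encoder: the mean of learned word embeddings.
        self.word_emb = nn.Embedding(vocab_size, embed_dim)

    def forward(self, img_feats, sent_tokens):
        v = F.normalize(self.img_proj(img_feats), dim=1)                # (B, D)
        s = F.normalize(self.word_emb(sent_tokens).mean(dim=1), dim=1)  # (B, D)
        return v, s

def ranking_loss(v, s, margin=0.2):
    """Bidirectional hinge loss: each matched image-sentence pair (the
    diagonal) must outscore every mismatched pair by at least `margin`."""
    scores = v @ s.t()                                  # cosine similarities
    pos = scores.diag().view(-1, 1)
    cost_s = (margin + scores - pos).clamp(min=0)       # image vs. wrong sentences
    cost_v = (margin + scores - pos.t()).clamp(min=0)   # sentence vs. wrong images
    mask = torch.eye(scores.size(0), dtype=torch.bool)  # zero out matched pairs
    cost_s = cost_s.masked_fill(mask, 0)
    cost_v = cost_v.masked_fill(mask, 0)
    return (cost_s.sum() + cost_v.sum()) / scores.size(0)
```

Once trained, retrieving images for a query sentence (or sentences for a query image) reduces to a nearest-neighbor search in the shared space, which is what makes the bidirectional search described in the abstract efficient.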
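
The captioning model can likewise be sketched as a CNN image feature conditioning a recurrent language model that emits one word per step. Again a hedged illustration, not the thesis's exact architecture: the single-layer LSTM, the greedy decoder, and every size here are assumptions.

```python
# Hypothetical sketch: an image feature initializes an LSTM language model
# that generates a caption word by word. Sizes and names are assumptions.
import torch
import torch.nn as nn

class Captioner(nn.Module):
    def __init__(self, img_feat_dim=4096, vocab_size=10000, hidden=512):
        super().__init__()
        self.img_to_h0 = nn.Linear(img_feat_dim, hidden)  # image sets the initial state
        self.word_emb = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, img_feats, captions):
        # Training with teacher forcing: predict word t+1 from the image
        # and the ground-truth words up to t.
        h0 = torch.tanh(self.img_to_h0(img_feats)).unsqueeze(0)  # (1, B, H)
        c0 = torch.zeros_like(h0)
        hs, _ = self.lstm(self.word_emb(captions), (h0, c0))
        return self.out(hs)                                      # (B, T, vocab) logits

    @torch.no_grad()
    def greedy_decode(self, img_feats, start_id, max_len=20):
        # Generation: feed the argmax word back in at each step. A real
        # decoder would also stop at an end-of-sentence token.
        h = torch.tanh(self.img_to_h0(img_feats)).unsqueeze(0)
        c = torch.zeros_like(h)
        word = torch.full((img_feats.size(0),), start_id, dtype=torch.long)
        words = []
        for _ in range(max_len):
            out, (h, c) = self.lstm(self.word_emb(word).unsqueeze(1), (h, c))
            word = self.out(out.squeeze(1)).argmax(dim=-1)
            words.append(word)
        return torch.stack(words, dim=1)                         # (B, max_len)
```

Because the whole pipeline is differentiable, the image projection, word embeddings, and recurrent weights are all fit jointly by minimizing a single cross-entropy loss over image-caption pairs, which is the end-to-end property the abstract emphasizes.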

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2016
Issuance monographic
Language English

Creators/Contributors

Associated with Karpathy, Andrej
Associated with Stanford University, Department of Computer Science.
Primary advisor Li, Fei Fei, 1976-
Thesis advisor Liang, Percy
Thesis advisor Manning, Christopher D

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Andrej Karpathy.
Note Submitted to the Department of Computer Science.
Thesis Thesis (Ph.D.)--Stanford University, 2016.
Location electronic resource

Access conditions

Copyright
© 2016 by Andrej Karpathy
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
