Visual intelligence through human learning

Abstract/Contents

Abstract
At the core of human development is the ability to adapt to new, previously unseen stimuli. We comprehend new situations as compositions of previously seen information and ask one another for clarification when we encounter new concepts. Yet this ability to go beyond the confines of training data remains an open challenge for artificial intelligence. My research designs visual intelligence that can reason over new compositions and acquire new concepts by interacting with people. I draw on ideas from human learning, both human cognition and human interaction, to deliver representations, training frameworks, models, evaluation protocols, and interactions for computer vision. My dissertation will explore some of the challenges associated with existing vision methods and present the following two lines of work.

Drawing on human cognition, I will introduce scene graphs, a cognitively grounded, compositional visual representation. With the scene graph representation, I will show that models can learn from a finite set of situations outlined in their training data and still recognize new compositions of previously seen concepts. I will build scene graph models that can recognize new visual relationships with as few as 10 labels per relationship. Finally, I will demonstrate how scene graphs can be used to improve core computer vision tasks such as action recognition, improving over existing baselines with as few as 5 training examples. Since our introduction of scene graphs, the computer vision community has developed hundreds of scene graph models and used scene graphs to achieve state-of-the-art results across multiple core tasks, including object localization, captioning, image generation, question answering, 3D understanding, and spatio-temporal action recognition.

Drawing on human interaction, I will introduce a framework for socially situated learning. This framework pushes agents beyond traditional active learning paradigms and enables learning from human interactions in social environments. Using this framework, I will design a real-world deployment of a socially situated agent; our agent learns to acquire new concepts by asking people on social media targeted questions about the contents of the photos they upload. By interacting with over 230K people across 8 months, our agent learns to recognize hundreds of new concepts. Finally, to promote pro-social human-computer interactions, I will demonstrate the importance of choosing appropriate metaphors to describe intelligent systems.

Together, this dissertation exhibits the benefits of drawing on ideas from human learning to develop better visual intelligence. My research connects ideas from cognitive science and social psychology with advances in computer vision, natural language processing, machine learning, and human-computer interaction.
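As an illustrative aside not drawn from the dissertation itself: a scene graph encodes an image as object nodes (optionally carrying attributes) connected by relationship edges, so that each edge reads as a subject-predicate-object triplet such as (person, riding, horse). The Python sketch below is a minimal, hypothetical rendering of that idea; the class names, fields, and example scene are assumptions for illustration, not the representation or code used in the thesis.

    from dataclasses import dataclass, field

    # Hypothetical sketch: object nodes with optional attributes, and
    # relationship edges forming (subject, predicate, object) triplets.
    @dataclass
    class SceneObject:
        name: str                                             # e.g. "person", "horse"
        attributes: list[str] = field(default_factory=list)   # e.g. ["smiling"]

    @dataclass
    class Relationship:
        subject: SceneObject
        predicate: str                                         # e.g. "riding", "next to"
        obj: SceneObject

    # Scene graph for "a smiling person riding a brown horse"
    person = SceneObject("person", ["smiling"])
    horse = SceneObject("horse", ["brown"])
    scene_graph = [Relationship(person, "riding", horse)]

    for rel in scene_graph:
        print(f"({rel.subject.name}, {rel.predicate}, {rel.obj.name})")

Because the representation factors a scene into reusable objects and predicates, a model can in principle recognize a composition it has never seen in training (say, "dog riding surfboard") from components it has seen separately.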

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place California
Place [Stanford, California]
Publisher [Stanford University]
Copyright date ©2021
Publication date 2021
Issuance monographic
Language English

Creators/Contributors

Author Krishna, Ranjay
Degree supervisor Bernstein, Michael S., 1984-
Degree supervisor Li, Fei-Fei, 1976-
Thesis advisor Bernstein, Michael S., 1984-
Thesis advisor Li, Fei-Fei, 1976-
Thesis advisor Agrawala, Maneesh
Thesis advisor Manning, Christopher D.
Degree committee member Agrawala, Maneesh
Degree committee member Manning, Christopher D.
Associated with Stanford University, Computer Science Department

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Ranjay Krishna.
Note Submitted to the Computer Science Department.
Thesis Ph.D. Stanford University 2021.
Location https://purl.stanford.edu/df658ht9106

Access conditions

Copyright
© 2021 by Ranjay Krishna
License
This work is licensed under a Creative Commons Attribution 3.0 Unported license (CC BY).
