Understanding human actions in still images

Yao, Bangpeng; Stanford University, Computer Science Department.

Abstract/Contents

Abstract: Many human actions, such as "playing violin" and "taking a photo", can be well described by still images, because of the specific spatial relationship between humans and objects, as well as the specific human and object poses involved in these actions. Recognizing human actions in still images will potentially provide useful information in image indexing and visual search, since a large proportion of available images contain people. Progress on action recognition is also beneficial to object and scene understanding, given the frequent human-object and human-scene interactions. Further, as video processing algorithms often rely on some form of initialization from individual video frames, understanding human actions in still images will help recognize human actions in videos. However, understanding human actions in still images is a challenging task, because of the large appearance and pose variation in both humans and objects even for the same action. In the first part of this thesis, we treat action understanding as an image classification task, where the goal is to correctly assign a class label such as "playing violin" or "reading book" to each human. Compared with traditional vision tasks such as object recognition, we show that it is critical to utilize detailed and structured visual information for action classification. To this end, we extract dense and structured visual descriptors for image representation, and propose to combine randomization and discrimination for image classification. The performance of our classification system can be further improved by integrating with other high-level features such as action attributes and objects.
The second part of this thesis aims at having a deeper understanding of human actions. Considering the specific types of human-object interactions for each action, we first propose a conditional random field model which allows objects and human poses to serve as context of each other, and hence mutually improve each other's recognition results. Then, we move on to discover object functionality in a weakly supervised setting. For example, given a set of images containing human-violin interactions, where a human is either playing violin or holding a violin but not playing, our method builds a model of "playing violin" that corresponds to the functionality of the object, and clusters the input images accordingly. Finally, we summarize our work and show our vision and preliminary results of how our work can benefit some new vision tasks, including fine-grained object recognition, video event categorization, and social role understanding.

Description

Type of resource	text
Form	electronic; electronic resource; remote
Extent	1 online resource.
Publication date	2013
Issuance	monographic
Language	English

Creators/Contributors

Associated with	Yao, Bangpeng
Associated with	Stanford University, Computer Science Department.
Primary advisor	Li, Fei Fei, 1976-
Thesis advisor	Li, Fei Fei, 1976-
Thesis advisor	Koller, Daphne
Thesis advisor	Liang, Percy
Thesis advisor	Savarese, Silvio

Subjects

Genre	Theses

Bibliographic information

Statement of responsibility	Bangpeng Yao.
Note	Submitted to the Department of Computer Science.
Thesis	Thesis (Ph.D.)--Stanford University, 2013.
Location	electronic resource

Access conditions

License: This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
