Understanding human actions in still images



Many human actions, such as "playing violin" and "taking a photo", can be well described by still images, because of the specific spatial relationship between humans and objects, as well as the specific human and object poses involved in these actions. Recognizing human actions in still images will potentially provide useful information in image indexing and visual search, since a large proportion of available images contain people. Progress on action recognition is also beneficial to object and scene understanding, given the frequent human-object and human-scene interactions. Further, as video processing algorithms often rely on some form of initialization from individual video frames, understanding human actions in still images will help recognize human actions in videos. However, understanding human actions in still images is a challenging task, because of the large appearance and pose variation in both humans and objects even for the same action.
In the first part of this thesis, we treat action understanding as an image classification task, where the goal is to correctly assign a class label such as "playing violin" or "reading book" to each human. Compared with traditional vision tasks such as object recognition, we show that it is critical to utilize detailed and structured visual information for action classification. To this end, we extract dense and structured visual descriptors for image representation, and propose to combine randomization and discrimination for image classification. The performance of our classification system can be further improved by integrating with other high-level features such as action attributes and objects.
The second part of this thesis aims at having a deeper understanding of human actions. Considering the specific types of human-object interactions for each action, we first propose a conditional random field model which allows objects and human poses to serve as context of each other, and hence mutually improve each other's recognition results. Then, we move on to discover object functionality in a weakly supervised setting. For example, given a set of images containing human-violin interactions, where a human is either playing violin or holding a violin but not playing, our method builds a model of "playing violin" that corresponds to the functionality of the object, and clusters the input images accordingly. Finally, we summarize our work and show our vision and preliminary results of how our work can benefit some new vision tasks, including fine-grained object recognition, video event categorization, and social role understanding.


Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2013
Issuance monographic
Language English


Associated with Yao, Bangpeng
Associated with Stanford University, Computer Science Department.
Primary advisor Li, Fei Fei, 1976-
Thesis advisor Li, Fei Fei, 1976-
Thesis advisor Koller, Daphne
Thesis advisor Liang, Percy
Thesis advisor Savarese, Silvio


Genre Theses

Bibliographic information

Statement of responsibility Bangpeng Yao.
Note Submitted to the Department of Computer Science.
Thesis Thesis (Ph.D.)--Stanford University, 2013.
Location electronic resource

Access conditions

© 2013 by Bangpeng Yao
This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
