Latent variable models for visual activity understanding



One of the important goals of computer vision is to categorize and understand human actions in images and video. The ability to solve this problem automatically opens the door to a host of impactful applications such as search and retrieval, surveillance, medical research, automatic annotation, and human-computer interfaces. Approaches to action recognition have involved increasingly rich modeling with hidden variables, which are often required to capture the properties of, and interactions between, components in a scene that distinguish one action from another. Learning such models accurately and efficiently poses significant computational challenges.

In this work, we address two related problems: developing computational methods to accurately and efficiently learn models with latent variables from data, and constructing models of sufficient richness to solve high-level human action recognition tasks. To address the first problem, we turn to Self-Paced Learning, which is designed to avoid bad local minima while learning latent variable models. We show that we can use Self-Paced Learning in combination with data with varying levels of annotation to achieve superior performance.

To address the second problem, we propose a latent variable model that explicitly represents human pose, object trajectories, and the interactions between them in video sequences. Since labeling all of these components in training data is onerous and such labels are not available at test time, the model uses latent variables and takes advantage of data with varying levels of annotation. It also takes advantage of recent progress in both the quality of combined video and depth sensors and the accuracy of pose trackers based on these measurements. With these technologies in hand, we are able to leverage accurate pose trajectories in our model without any additional annotation or human intervention.
By combining a pose-aware action model with successful discriminative techniques in a single joint model, we are able to recognize complex, fine-grained human actions involving the manipulation of objects in realistic action sequences. For our adaptation of Self-Paced Learning to diversely and noisily labeled datasets, we demonstrate that we can improve on the results of a state-of-the-art still-image action recognition technique by augmenting a labeled dataset with unannotated images gathered from the internet. Furthermore, to showcase both the ability of our human action model to capture complex human actions and the efficacy of our learning approach, we introduce a novel Cooking Action Dataset and show that our model outperforms existing state-of-the-art techniques.
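The Self-Paced Learning strategy referenced in the abstract alternates between selecting "easy" training examples (those with low loss under the current model) and refitting on the selected subset, with a threshold that is annealed so harder examples are gradually admitted. A minimal sketch of that loop follows; the function names, toy regression setting, and hyperparameters are illustrative, not the thesis's actual latent-variable formulation:

```python
import numpy as np

def self_paced_learning(X, y, fit, loss, k_init=2.0, k_decay=0.5, iters=5):
    """Self-paced learning loop (sketch): alternate between selecting
    'easy' examples (per-example loss below 1/k) and refitting on them.
    Decaying k raises the threshold 1/k, so harder examples are admitted
    to the training set over successive iterations."""
    k = k_init
    model = fit(X, y)                       # initial fit on all data
    for _ in range(iters):
        losses = loss(model, X, y)          # per-example losses
        v = losses < 1.0 / k                # binary selection of easy examples
        if v.any():
            model = fit(X[v], y[v])         # refit on the selected subset only
        k *= k_decay                        # anneal: admit harder examples next round
    return model

# Toy usage (illustrative): linear regression with a few 'hard' outliers
# that self-paced selection keeps out of the fit.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=100)
y[:5] += 10.0                               # corrupt a few labels

fit = lambda A, b: np.linalg.lstsq(A, b, rcond=None)[0]
loss = lambda w, A, b: (A @ w - b) ** 2
w = self_paced_learning(X, y, fit, loss)
```

In this toy run the corrupted examples have large squared residuals, so they stay below the selection threshold's reach and the refit recovers weights close to `w_true`; in the thesis setting the same alternation is applied to latent-variable models, where avoiding early commitment to hard examples helps sidestep bad local minima.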


Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2015
Issuance monographic
Language English


Associated with Packer, Benjamin
Associated with Stanford University, Department of Computer Science.
Primary advisor Koller, Daphne
Thesis advisor Koller, Daphne
Thesis advisor Li, Fei Fei, 1976-
Thesis advisor Ng, Andrew Y, 1976-


Genre Theses

Bibliographic information

Statement of responsibility Benjamin Packer.
Note Submitted to the Department of Computer Science.
Thesis Thesis (Ph.D.)--Stanford University, 2015.
Location electronic resource

Access conditions

© 2015 by Benjamin Duman Packer
This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
