Latent variable models for visual activity understanding


Abstract/Contents

Abstract
One of the important goals of computer vision is to categorize and understand human actions in images and video. The ability to automatically solve this problem opens the door to a host of impactful applications such as search and retrieval, surveillance, medical research, automatic annotation, and human-computer interfaces. Approaches to action recognition have involved increasingly rich modeling with hidden variables, which are often required to capture the properties of and interactions between components in a scene that distinguish one action from another. Learning models with hidden variables accurately and efficiently can be a difficult problem that poses great computational challenges.

In this work, we address two related problems: developing computational methods to accurately and efficiently learn models with latent variables from data, and constructing models of sufficient richness to solve high-level human action recognition tasks. To address the first problem, we turn to Self-Paced Learning, which is designed to avoid bad local minima while learning latent variable models. We show that we can use Self-Paced Learning in combination with data with varying levels of annotation to achieve superior levels of performance. To address the second problem, we propose a latent variable model that explicitly represents human pose, object trajectories, and the interactions between them in video sequences. Since labeling all of these components in training data is onerous and such labels are not available in test data, the model uses latent variables and takes advantage of data with varying levels of annotation. It also takes advantage of recent progress in both the quality of combined video and depth sensors and the accuracy of pose trackers based on these measurements. With these technologies in hand, we are able to leverage accurate pose trajectories in our model without the need for any additional annotation or human intervention. By combining a pose-aware action model with successful discriminative techniques in a single joint model, we are able to recognize complex, fine-grained human actions involving the manipulation of objects in realistic action sequences.

For our adaptation of Self-Paced Learning to diversely and noisily labeled datasets, we demonstrate that we can improve on the results of a state-of-the-art still-image action recognition technique by augmenting a labeled dataset with images gathered from the internet without any annotation. Furthermore, to showcase both the ability of our human action model to capture complex human actions and the efficacy of our learning approach, we introduce a novel Cooking Action Dataset and show that our model outperforms existing state-of-the-art techniques.
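For context, the Self-Paced Learning objective referred to in the abstract is commonly written as below; this is a sketch of the standard formulation from Kumar, Packer, and Koller (NIPS 2010), and the thesis's adaptation to diversely and noisily labeled data may differ in detail. Here w denotes the model parameters, v_i in [0,1] is a weight selecting the i-th training example, f(x_i, y_i; w) is the per-example loss, r(w) is a regularizer, and K controls how "easy" an example must be (loss below 1/K) to be selected.

```latex
% Standard self-paced learning objective (sketch, following Kumar, Packer & Koller, 2010):
% jointly optimize model parameters w and per-example weights v.
\begin{equation*}
  (\mathbf{w}_{t+1}, \mathbf{v}_{t+1}) =
  \operatorname*{argmin}_{\mathbf{w},\; \mathbf{v} \in [0,1]^n}
  \; r(\mathbf{w})
  + \sum_{i=1}^{n} v_i \, f(\mathbf{x}_i, y_i; \mathbf{w})
  - \frac{1}{K} \sum_{i=1}^{n} v_i
\end{equation*}
% At the optimum, v_i = 1 exactly when f(x_i, y_i; w) < 1/K, i.e. only
% sufficiently "easy" examples are used. K is annealed downward across
% iterations so that progressively harder examples are included, until the
% objective coincides with the standard latent-variable training objective.
```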

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2015
Issuance monographic
Language English

Creators/Contributors

Associated with Packer, Benjamin
Associated with Stanford University, Department of Computer Science.
Primary advisor Koller, Daphne
Thesis advisor Li, Fei Fei, 1976-
Thesis advisor Ng, Andrew Y, 1976-

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Benjamin Packer.
Note Submitted to the Department of Computer Science.
Thesis Thesis (Ph.D.)--Stanford University, 2015.
Location electronic resource

Access conditions

Copyright
© 2015 by Benjamin Duman Packer
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License (CC BY-NC 3.0).
