Towards comprehensive action understanding in videos

Abstract/Contents

Abstract
An enormous number of videos are created, shared, and watched daily. In this ocean of videos, human actions and activities are often the focal point. We want machines to understand human actions in videos because this is essential to various applications, including but not limited to healthcare, security systems, and human-robot interaction. For these applications to be realized, action understanding must go beyond simply answering "what is the action" and become more comprehensive. An intelligent agent should be able to know "who/where is the actor", "what/where is the object", "what interaction is happening between the actor and the object", "when does an action start and end", and more. Achieving comprehensive action understanding is non-trivial because the need for data and labels grows combinatorially when solving multiple problems at once, not to mention that video data and labels are expensive to collect, store, and consume. Therefore, to obtain comprehensive action understanding, we need not only to perform multiple tasks but also to ensure data efficiency. In this dissertation, we discuss three questions toward realizing data-efficient and comprehensive action understanding. How can we reduce the need for data and labels? How can we perform multiple tasks without combinatorial growth of data? How can we solve new problems efficiently once some other problems have been solved? For the first question, our work on few-shot video classification and semi-supervised temporal action proposals introduces video-specific techniques and strategies for learning with less supervision. For the second question, a study on actor-action segmentation demonstrates how knowledge disentanglement avoids enumerating all combinations of categories from subtasks. For the third question, we propose constructing compositional representations from human-object relationships in videos; such representations lead to better generalizability in action recognition models.

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place California
Place [Stanford, California]
Publisher [Stanford University]
Copyright date 2021; ©2021
Publication date 2021; 2021
Issuance monographic
Language English

Creators/Contributors

Author Ji, Jingwei
Degree supervisor Li, Fei Fei, 1976-
Degree supervisor Niebles, Juan
Thesis advisor Li, Fei Fei, 1976-
Thesis advisor Niebles, Juan
Thesis advisor Guibas, Leonidas J
Thesis advisor Savarese, Silvio
Thesis advisor Yeung, Serena
Degree committee member Guibas, Leonidas J
Degree committee member Savarese, Silvio
Degree committee member Yeung, Serena
Associated with Stanford University, Department of Electrical Engineering

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Jingwei Ji.
Note Submitted to the Department of Electrical Engineering.
Thesis Thesis Ph.D. Stanford University 2021.
Location https://purl.stanford.edu/wc099nh9969

Access conditions

Copyright
© 2021 by Jingwei Ji
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC).
