Towards comprehensive action understanding in videos

Ji, Jingwei

Abstract/Contents

Abstract: An enormous number of videos are created, shared, and watched daily. In this ocean of videos, human actions and activities are often the focal point. We want machines to understand human actions in videos because this is essential to various applications, including but not limited to healthcare, security systems, and human-robot interaction. For these applications to be realized, action understanding must go beyond simply answering "what is the action" and become more comprehensive. An intelligent agent should be able to know "who/where is the actor", "what/where is the object", "what interaction is happening between the actor and the object", "when does an action start and end", and more. Achieving comprehensive action understanding is non-trivial because the need for data and labels grows combinatorially when solving multiple problems at once, and video data and labels are already expensive to collect, store, and consume. Therefore, to obtain comprehensive action understanding, we must not only perform multiple tasks but also ensure data efficiency. In this dissertation, we discuss three questions toward realizing data-efficient and comprehensive action understanding: How can we reduce the need for data and labels? How can we perform multiple tasks without combinatorial growth of data? How can we solve new problems efficiently once some other problems have been solved? For the first question, our work on few-shot video classification and semi-supervised temporal action proposals introduces video-specific techniques and strategies for learning with less supervision. For the second question, a study on actor-action segmentation demonstrates how knowledge disentanglement avoids enumerating all combinations of categories from subtasks. For the third question, we propose constructing compositional representations from human-object relationships in videos; such representations lead to better generalizability in action recognition models.

Description

Type of resource	text
Form	electronic resource; remote; computer; online resource
Extent	1 online resource.
Place	California
Place	[Stanford, California]
Publisher	[Stanford University]
Copyright date	2021; ©2021
Publication date	2021
Issuance	monographic
Language	English

Creators/Contributors

Author	Ji, Jingwei
Degree supervisor	Li, Fei Fei, 1976-
Degree supervisor	Niebles, Juan
Thesis advisor	Li, Fei Fei, 1976-
Thesis advisor	Niebles, Juan
Thesis advisor	Guibas, Leonidas J
Thesis advisor	Savarese, Silvio
Thesis advisor	Yeung, Serena
Degree committee member	Guibas, Leonidas J
Degree committee member	Savarese, Silvio
Degree committee member	Yeung, Serena
Associated with	Stanford University, Department of Electrical Engineering

Subjects

Genre	Theses
Genre	Text

Bibliographic information

Statement of responsibility	Jingwei Ji.
Note	Submitted to the Department of Electrical Engineering.
Thesis	Thesis Ph.D. Stanford University 2021.
Location	https://purl.stanford.edu/wc099nh9969

Access conditions

License: This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
