Human-centric video understanding with weak supervision

Ramanathan, Vignesh; Stanford University, Department of Electrical Engineering.

Abstract/Contents

Abstract: A large fraction of videos, such as entertainment, sports and surveillance videos, are centered around people. We need efficient ways to index such content, i.e., to understand and describe people: Who are they? What are their roles? What are their actions and intentions? One major challenge is that training computer vision models for these tasks typically requires extensive spatial and temporal annotations. Such annotations are often very expensive and difficult to collect at the scale of thousands of videos. We could handle this problem by learning from weakly labeled videos, which are readily available and cheaper to collect. However, in such videos the person labels are not spatially or temporally localized. In this thesis, we present models which can learn from weakly labeled videos by automatically aligning the labels with the right people in the video to identify their (i) names, (ii) social roles and (iii) actions.
In the first part of this thesis, we consider the problem of identifying the names of people in weakly labeled videos. In particular, we deal with one widely available source of weakly labeled videos: TV episodes. These videos are accompanied only by TV scripts, which provide a noisy description of the characters appearing in different parts of the episodes. The descriptions are often not well aligned with the video, making the task more challenging. Further, people in the script are mentioned not only by name but also by pronouns such as "he" and "she" and nominals such as "doctor" and "teacher". This adds to the ambiguity in aligning human mentions in the script with their actual appearance in the video. We address these problems by proposing a joint optimization framework for resolving name references in the text (coreference resolution) and name assignments in video. This joint model leads to better performance in both tasks and is evaluated on a dataset of 19 TV episodes.
The second part of this thesis tackles the problem of identifying the social roles of people in weakly labeled videos. People play very specific roles in social events such as weddings, birthdays and award ceremonies. In the absence of names associated with the people, a natural way to describe them is through the roles they play. We provide a graphical model which can automatically cluster people appearing in different videos of a social event into social roles. We explore different person-specific and inter-person interaction features which are informative about the role of the person. We evaluate the proposed model on a dataset of 4 different event classes against various standard baseline methods.
Lastly, we consider the problem of identifying the actions and events associated with people in weakly labeled videos. This is perhaps the most generic label which can be used to describe people in all videos. As a first step, we describe a method for learning video embeddings which can be used as a good feature representation for action recognition models. We learn a temporal embedding using only the implicit weak label that videos are sequences of temporally and semantically coherent images. These embeddings are able to capture semantic context, which results in better performance for a wide variety of standard tasks in video. Next, we describe an event recognition model which can learn from natural language descriptions of training videos. Finally, we describe an attention-based model for identifying actions in multi-person events like basketball. In addition to identifying the action, the model also localizes the actor responsible for the action. We achieve this without using any explicit actor localization information during training or testing, resulting in a weakly supervised setting. We also provide a new basketball video dataset for training and evaluation.

Description

Type of resource	text
Form	electronic; electronic resource; remote
Extent	1 online resource.
Publication date	2016
Issuance	monographic
Language	English

Creators/Contributors

Associated with	Ramanathan, Vignesh
Associated with	Stanford University, Department of Electrical Engineering.
Primary advisor	Li, Fei Fei, 1976-
Thesis advisor	Li, Fei Fei, 1976-
Thesis advisor	Girod, Bernd
Thesis advisor	Wetzstein, Gordon

Subjects

Genre	Theses

Bibliographic information

Statement of responsibility	Vignesh Ramanathan.
Note	Submitted to the Department of Electrical Engineering.
Thesis	Thesis (Ph.D.)--Stanford University, 2016.
Location	electronic resource

Access conditions

License: This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
