Deep learning for understanding dynamic visual data

Abstract/Contents

Abstract
Teaching machines to interpret visual observations of our dynamic world as humans do is a central topic in artificial intelligence. The goal is to process various types of visual data and generate symbolic or numerical descriptions, similar to human understanding, that support the decision making of autonomous agents. Compared to an individual visual snapshot, a dynamic visual data sequence accumulates more relevant information over time and allows motion information to be leveraged, potentially enabling better descriptions. The recent success of deep learning inspires us to use deep neural networks to analyze the complex patterns of dynamic visual data, in contrast to traditional approaches that rely on hand-crafted spatiotemporal descriptors. Unlike previous deep learning methods, this thesis argues that correspondences of positions across frames are the dynamic component of visual data and should be modeled by the deep network architecture. We discuss design philosophies for such architectures in terms of selecting correspondence candidates, learning representations from those candidates, and deploying the networks in various applications. Accordingly, we present four deep learning methods for processing and understanding dynamic visual data; the processed modalities cover two or more frames of 2D RGB images or 3D point clouds. We first introduce FlowNet3D, a deep neural network that estimates scene flow between point clouds at consecutive timestamps in an end-to-end fashion. Our method lets points in one point cloud find correspondence candidates in another point cloud in order to learn the true correspondences, and it shows clear advantages on existing benchmarks. We then present CPNet and MeteorNet, two deep learning backbone architectures that learn representations for RGB videos and 3D point cloud sequences, respectively. Both methods effectively learn temporal relations by proposing and aggregating correspondence candidates, and we showcase their leading performance on tasks including action recognition, semantic segmentation, and scene flow estimation. We also describe KeyPose, a deep learning architecture for estimating 3D keypoint locations of objects from stereo RGB images, along with a new dataset for studying transparent objects. Through extensive experiments, we demonstrate that estimating 3D object poses by modeling correspondences in stereo images has advantages over depth-based methods. This thesis concludes with a discussion of other potential application domains and directions for future research.
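
To make the correspondence-candidate idea concrete, the following is a minimal NumPy sketch (not the thesis implementation; all function names and parameters here are illustrative): each point in the first cloud gathers nearby points in the second cloud as candidates, and a crude mean over candidate offsets stands in for the learned aggregation of a flow-embedding-style layer.

import numpy as np

def correspondence_candidates(p1, p2, radius, k):
    """Return indices (N1, k) of the k nearest p2 points per p1 point,
    plus a mask marking candidates that fall within `radius`."""
    d = np.linalg.norm(p1[:, None, :] - p2[None, :, :], axis=-1)  # (N1, N2)
    idx = np.argsort(d, axis=1)[:, :k]                            # (N1, k)
    mask = np.take_along_axis(d, idx, axis=1) < radius            # (N1, k)
    return idx, mask

def crude_flow(p1, p2, idx, mask):
    """Average offset to valid candidates -- a hand-crafted stand-in for
    the learned aggregation; points with no candidate get zero flow."""
    offsets = p2[idx] - p1[:, None, :]                # (N1, k, 3)
    w = mask[..., None].astype(float)
    return (offsets * w).sum(1) / np.maximum(w.sum(1), 1e-8)

rng = np.random.default_rng(0)
p1 = rng.uniform(size=(256, 3))
p2 = p1 + np.array([0.05, 0.0, 0.0])   # ground-truth flow: +0.05 along x
idx, mask = correspondence_candidates(p1, p2, radius=0.1, k=4)
print(crude_flow(p1, p2, idx, mask).mean(axis=0))  # roughly [0.05, 0, 0]

In the methods described above, the aggregation over candidates is learned by the network rather than hand-crafted; this sketch only illustrates where the candidates come from.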

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource
Place [Stanford, California]
Publisher [Stanford University]
Copyright date ©2019
Publication date 2019
Issuance monographic
Language English

Creators/Contributors

Author Liu, Xingyu (Researcher in artificial intelligence)
Degree supervisor Bohg, Jeannette, 1981-
Degree committee member Finn, Chelsea
Degree committee member Yeung, Serena
Associated with Stanford University, Department of Electrical Engineering.

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Xingyu Liu
Note Submitted to the Department of Electrical Engineering
Thesis Thesis (Ph.D.)--Stanford University, 2019.
Location electronic resource

Access conditions

Copyright
© 2019 by Xingyu Liu
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
