Deep learning for understanding dynamic visual data

Abstract/Contents

Abstract
Teaching machines to interpret visual observations of our dynamic world as humans do is a central topic in artificial intelligence. The goal is to process various types of visual data and generate symbolic or numerical descriptions, similar to human understanding, that support the decision making of autonomous agents. Compared to an individual visual snapshot, a dynamic visual data sequence accumulates more relevant information over time and allows motion information to be leveraged, potentially enabling better descriptions. The recent success of deep learning inspires us to use deep neural networks to analyze the complex patterns of dynamic visual data, in contrast to traditional approaches that rely on hand-crafted spatiotemporal descriptors. Unlike previous deep learning methods, this thesis argues that correspondences of positions across frames are the dynamic component of visual data and should be modeled by the deep network architecture. We discuss design philosophies for such architectures in terms of selecting correspondence candidates, learning representations from those candidates, and deploying the networks in various applications. Accordingly, we present four deep learning methods for processing and understanding dynamic visual data; the processed modalities cover two or more frames of 2D RGB images or 3D point clouds. We first introduce FlowNet3D, a deep neural network that estimates scene flow between point clouds at consecutive timestamps in an end-to-end fashion. Our method lets points in one point cloud find correspondence candidates in another point cloud in order to learn the true correspondences, and it shows clear advantages on existing benchmarks. We then present CPNet and MeteorNet, two deep learning backbone architectures that learn representations for RGB videos and 3D point cloud sequences, respectively. Both methods effectively learn temporal relations by proposing and aggregating correspondence candidates, and we showcase their leading performance on tasks including action recognition, semantic segmentation, and scene flow estimation. We also describe KeyPose, a deep learning architecture for estimating 3D keypoint locations of objects from stereo RGB images, along with a new dataset for studying transparent objects. Through extensive experiments, we demonstrate that estimating 3D object poses by modeling correspondences in stereo images has advantages over depth-based methods. This thesis concludes with a discussion of other potential application domains and directions for future research.
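
To make the correspondence-candidate idea concrete, the following is a minimal NumPy sketch (not the thesis implementation; all function names and parameters here are illustrative): each point in the first cloud gathers nearby points in the second cloud as candidates, and a crude mean over candidate offsets stands in for the learned aggregation of a flow-embedding-style layer.

import numpy as np

def correspondence_candidates(p1, p2, radius, k):
    """Return indices (N1, k) of the k nearest p2 points per p1 point,
    plus a mask marking candidates that fall within `radius`."""
    d = np.linalg.norm(p1[:, None, :] - p2[None, :, :], axis=-1)  # (N1, N2)
    idx = np.argsort(d, axis=1)[:, :k]                            # (N1, k)
    mask = np.take_along_axis(d, idx, axis=1) < radius            # (N1, k)
    return idx, mask

def crude_flow(p1, p2, idx, mask):
    """Average offset to valid candidates -- a hand-crafted stand-in for
    the learned aggregation; points with no candidate get zero flow."""
    offsets = p2[idx] - p1[:, None, :]                # (N1, k, 3)
    w = mask[..., None].astype(float)
    return (offsets * w).sum(1) / np.maximum(w.sum(1), 1e-8)

rng = np.random.default_rng(0)
p1 = rng.uniform(size=(256, 3))
p2 = p1 + np.array([0.05, 0.0, 0.0])   # ground-truth flow: +0.05 along x
idx, mask = correspondence_candidates(p1, p2, radius=0.1, k=4)
print(crude_flow(p1, p2, idx, mask).mean(axis=0))  # roughly [0.05, 0, 0]

In the methods described above, the aggregation over candidates is learned by the network rather than hand-crafted; this sketch only illustrates where the candidates come from.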

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource
Place [Stanford, California]
Publisher [Stanford University]
Copyright date ©2019
Publication date 2019
Issuance monographic
Language English

Creators/Contributors

Author Liu, Xingyu (Researcher in artificial intelligence)
Degree supervisor Bohg, Jeannette, 1981-
Degree committee member Finn, Chelsea
Degree committee member Yeung, Serena
Associated with Stanford University, Department of Electrical Engineering.

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Xingyu Liu
Note Submitted to the Department of Electrical Engineering
Thesis Thesis (Ph.D.)--Stanford University, 2019.
Location electronic resource

Access conditions

Copyright
© 2019 by Xingyu Liu
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
