Learning to generate and differentiate 3D objects using geometry & language


Abstract/Contents

Abstract
The physical world surrounding us is extremely complex, with a myriad of unexplained phenomena that seem at times mysterious or even magical. In our quest to understand, analyze, and in the end improve our interactions with our surroundings, we decompose this complex world into tangible entities we call objects. From Plato's ancient Theory of Forms to the modern rules of Object-Oriented Programming, objects, with their associated classes and abstractions, have been a pillar of analysis and philosophy. At the same time, human intelligence flourishes and demonstrates much of its elegance in another human construct: that of natural languages. Humans have developed their languages to communicate efficiently with each other about almost anything conceivable: from never-seen imaginative scenarios to pragmatic nuisances regarding their surrounding objects. My vision and motivation behind this thesis lie in bridging (a modest bit) the gap between these two constructs, language and object entities, in modern-day computers via learning algorithms. In this way, this thesis aims to contribute a step forward in the advancement of Artificial Intelligence by introducing to the research community smarter, latent, and oftentimes multi-modal representations of 3D objects that enhance a machine's capacity to reason about such objects, with (or without) the aid of language. Specifically, this thesis aims to introduce new methods and new problems at the intersection of the computer science sub-fields of 3D Vision and Computational Linguistics. It dedicates roughly the first half of its contents to establishing several novel (deep) Generative Neural Networks that can generate, reconstruct, or represent common three-dimensional objects (e.g., a 3D point cloud of a chair).
These networks give rise to object representations that can improve some of the machines' object-oriented analytical capacities: e.g., to better classify the objects of a collection, or to generate novel object instances by combining a priori known object parts or by meaningful "latent" interpolations among specified objects. The second half of the thesis taps into these object representations to introduce new problems and machine-learning-based solutions for discriminative object-centric language comprehension ("listening") and language production ("speaking"). In this way, the second half complements and extends the first part of the thesis by exploring multi-modal, language-aware object representations that enable a machine to listen or speak about object properties similarly to humans. In summary, the three most salient contributions of this thesis are the following. First, it introduces the first Generative Adversarial Network concerning the shape of everyday objects captured via 3D point clouds, along with appropriate (and widely adopted) evaluation metrics. Second, it introduces the problem of, and deep-learning-based solutions for, comprehending or generating linguistic references concerning the shape of common objects in contrastive contexts, i.e., talking about how a chair is different from two similar ones. Last, it explores a less controlled and harder scenario of object-based reference in the wild. Namely, it introduces the problem of, and methods for, language comprehension concerning properties of real-world objects residing inside real-world 3D scenes; e.g., it builds machines that can understand language concerning, say, the texture of an object or its spatial arrangement. During the journey it took to establish these contributions, we published and explored some highly relevant ideas, parts of which will be used to make a more complete exposition. In short, these papers concern two high-level concepts.
First, the creation of "latent spaces" that are aware of the part-based structure of 3D objects, e.g., the legs vs. the back of a chair. Second, the creation of latent spaces that exploit known correspondences among objects of a collection, e.g., dense pointwise mappings, which can enhance the capacity of the latent representation to capture geometric shape differences among objects. As we show with the primary works presented in this thesis, object-centric referential language contains a significant amount of part-based and fine-grained shape understanding, naturally calling for conceptually deep object learning and justifying the ongoing need for the development of many types of Generative Networks to capture it fully.

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place California
Place [Stanford, California]
Publisher [Stanford University]
Copyright date 2021; ©2021
Publication date 2021; 2021
Issuance monographic
Language English

Creators/Contributors

Author Achlioptas, Panagiotis
Degree supervisor Guibas, Leonidas J
Thesis advisor Guibas, Leonidas J
Thesis advisor Ermon, Stefano
Thesis advisor Savarese, Silvio
Degree committee member Ermon, Stefano
Degree committee member Savarese, Silvio
Associated with Stanford University, Computer Science Department

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Panagiotis (Panos) Achlioptas.
Note Submitted to the Computer Science Department.
Thesis Thesis Ph.D. Stanford University 2021.
Location https://purl.stanford.edu/sr155wq1248

Access conditions

Copyright
© 2021 by Panagiotis Achlioptas
