Investigations of factors that affect unsupervised learning of 3D object representations
- Humans have a remarkable ability to recognize objects across transformations that present very different retinal stimuli, such as changes in size, illumination, and rotations in space. Such identity-preserving image transformations (DiCarlo, Zoccolan, & Rust, 2012) put extraordinary pressure on our visual system because the computations needed to assign vastly different 2D images of an object to the same identity are non-trivial. However, both behavioral (Biederman & Cooper, 1991a, 1991b; Fiser & Biederman, 1995; Potter, 1976; Thorpe, Fize, & Marlot, 1996) and neural (Hung, Kreiman, Poggio, & DiCarlo, 2005) evidence suggests that the visual system solves this problem accurately and rapidly. While rotations in the image plane preserve the visible features, rotations in depth may reveal new features of an opaque object and thus present the most difficult transformation for the visual system to resolve, because the 2D image resulting from a rotation in depth may not be recoverable from the original image. Thus, understanding how people achieve viewpoint invariance, the ability to recognize objects from different views and rotations, is key to understanding the visual object recognition system. There is a general consensus that learning is an important component of developing viewpoint-invariant object recognition (Logothetis and Pauls, 1992; Tarr and Pinker, 1989). Many studies show that learning can occur in an unsupervised way simply from viewing example images of new objects (Edelman and Bulthoff, 1992; Tarr and Pinker, 1989). Two major theories of how the visual system achieves viewpoint invariance -- 3D-based theories (Biederman, 1987) and view-based theories (Ullman and Basri, 1989) -- recognize the importance of learning in achieving viewpoint-invariant object recognition. However, they differ in what information is used during learning and what representation is consequently built.
For example, view-based theories consider spatial and temporal continuities the necessary glue for linking multiple views of an object during unsupervised learning, whereas 3D-based theories consider feature information to be more important. They also differ on whether the object representation built after learning is 3D-based or view-based. To address these gaps in the published literature, I examined two core questions: What kind of spatial and temporal information in the visual input during unsupervised learning is critical for achieving viewpoint-invariant recognition? And what kind of object representation is generated during the learning process? In Chapter 1, I present a theoretical overview of the issues. Section 1 reviews theories and computational models of viewpoint-invariant recognition, with a focus on the debate between 3D-based theories and view-based theories; Section 2 reviews psychophysical and neural evidence supporting each theory; and Section 3 discusses the predictions of the learning mechanisms of each competing theory. Chapter 2 presents results from a series of experiments that investigated which spatio-temporal information in the visual input during unsupervised learning is key for learning the 3D structure of novel objects. Chapter 3 presents data from a series of experiments that examine how the format of the visual information during unsupervised learning affects learning the 3D structure of novel objects. Finally, in Chapter 4, I discuss the theoretical implications of the findings presented in Chapters 2 and 3, and propose a new framework based on these results.
|Stanford University, Department of Psychology.
|Wagner, Anthony David
|Wandell, Brian A
|Statement of responsibility
|Submitted to the Department of Psychology.
|Thesis (Ph.D.)--Stanford University, 2016.
- © 2016 by Moqian Tian
- This work is licensed under a Creative Commons Attribution 3.0 Unported license (CC BY).