Multimodal object representation learning in haptic, auditory, and visual domains


Abstract/Contents

Abstract
Humans frequently use all their senses to understand and interact with their environments. Our multi-modal mental priors of how objects and materials respond to physical interactions enable us to succeed in many of our everyday tasks. For example, to find a glass in the back of a dark, cluttered cabinet, we heavily rely on our senses of touch and hearing as well as our prior knowledge of how a glass feels and sounds. This observation about human behavior motivates us to develop effective ways of modeling the multi-modal signals of vision, haptics, and audio. Such models have applications for robotics as well as for Augmented and Virtual Reality (AR/VR). Given that our real-life experiences are multi-modal, effective AR/VR environments should be multi-modal as well. With the commercialization of several AR/VR devices over the past few decades, a variety of applications in areas such as e-commerce, gaming, education, and medicine have emerged. However, current AR/VR environments lack rich multi-modal sensory responses, which reduces their realism.

For a model to efficiently render appropriate multi-modal signals in response to user interactions, it needs to encode this data in low-dimensional representations. This motivates us to develop effective ways of learning representations of these different modalities, which is a challenging goal. From a modeling standpoint, visual cues of an object and its haptic and auditory feedback are heterogeneous, requiring domain-specific knowledge to design the appropriate perceptual module for each. Furthermore, these representations should ideally be either task-agnostic or easily generalizable to new tasks and scenarios, since collecting a new dataset per task or object is expensive and impossible to scale. This motivates us to explore physically interpretable and object-aware representations.

In this dissertation, we demonstrate how learning-based, object-aware approaches can be used to learn appropriate representations in different modalities. In the first part, we focus on the modality of touch and use deep learning-based methods for haptic texture rendering. We present a learned action-conditional model for haptic textures that takes data from a vision-based tactile sensor (GelSight) and a user's action as input. This model predicts an induced acceleration that is used to provide haptic vibration feedback to a user to induce the sensation of a virtual texture. We show that our model outperforms previous state-of-the-art methods. In the second part of this thesis, we explore processing audio signals. We develop a fully differentiable model for rendering and identification of impact sounds called DiffImpact. DiffImpact models the impact sounds that rigid objects make upon contact by extracting physically interpretable parameters of force profile, modal response, background noise, and environmental response from their audio spectrograms. Lastly, we demonstrate the utility of self-supervised object-aware representations based on Slot Attention for downstream robotic applications, specifically sample-efficient visuomotor control in multi-object scenes. We conclude this dissertation by discussing future directions to extend Slot Attention-based representations to include the modalities of touch and audio, as well as the autonomous collection of a multi-modal object dataset.

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place [Stanford, California]
Publisher [Stanford University]
Copyright date ©2022
Publication date 2022
Issuance monographic
Language English

Creators/Contributors

Author Heravi, Negin
Degree supervisor Bohg, Jeannette, 1981-
Degree supervisor Okamura, Allison
Thesis advisor Bohg, Jeannette, 1981-
Thesis advisor Okamura, Allison
Thesis advisor Culbertson, Heather
Degree committee member Culbertson, Heather
Associated with Stanford University, Department of Mechanical Engineering

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Negin Heravi.
Note Submitted to the Department of Mechanical Engineering.
Thesis Thesis (Ph.D.), Stanford University, 2022.
Location https://purl.stanford.edu/sj589ft0971

Access conditions

Copyright
© 2022 by Negin Heravi
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
