Multimodal object representation learning in haptic, auditory, and visual domains

Heravi, Negin

Abstract/Contents

Abstract: Humans frequently use all their senses to understand and interact with their environments. Our multi-modal mental priors of how objects and materials respond to physical interactions enable us to succeed in many of our everyday tasks. For example, to find a glass in the back of a dark, cluttered cabinet, we rely heavily on our senses of touch and hearing as well as our prior knowledge of how a glass feels and sounds. This observation about human behavior motivates us to develop effective ways of modeling the multi-modal signals of vision, haptics, and audio. Such models have applications for robotics as well as for Augmented and Virtual Reality (AR/VR). Given that our real-life experiences are multi-modal, effective AR/VR environments should be multi-modal as well. With the commercialization of several AR/VR devices over the past few decades, a variety of applications in areas such as e-commerce, gaming, education, and medicine have emerged. However, current AR/VR environments lack rich multi-modal sensory responses, which reduces the realism of these environments. For a model to efficiently render appropriate multi-modal signals in response to user interactions, it needs to encode this data in low-dimensional representations. This motivates us to develop effective ways of learning representations of these different modalities, which is a challenging goal. From a modeling standpoint, visual cues of an object and its haptic and auditory feedback are heterogeneous, requiring domain-specific knowledge to design the appropriate perceptual modules for each. Furthermore, these representations should ideally be either task-agnostic or easily generalizable to new tasks and scenarios, since collecting a new dataset per task or object is expensive and impossible to scale. This motivates us to explore physically interpretable and object-aware representations.
In this dissertation, we demonstrate how object-aware learning-based representations can be used for learning appropriate representations in different modalities. In the first part, we focus on the modality of touch and use deep-learning-based methods for haptic texture rendering. We present a learned action-conditional model for haptic textures that uses data from a vision-based tactile sensor (GelSight) and a user's action as input. This model predicts an induced acceleration that is used to provide haptic vibration feedback to a user to induce the sensation of a virtual texture. We show that our model outperforms previous state-of-the-art methods. In the second part of this thesis, we explore processing audio signals. We develop a fully differentiable model for rendering and identification of impact sounds called DiffImpact. DiffImpact models impact sounds that rigid objects make upon contact by extracting physically interpretable parameters of force profile, modal response, background noise, and environmental response from their audio spectrograms. Lastly, we demonstrate the utility of self-supervised object-aware representations based on Slot Attention for downstream robotic applications, especially sample-efficient visuomotor control in multi-object scenes. We conclude this dissertation by discussing future directions to extend Slot Attention-based representations to include modalities of touch and audio as well as autonomous collection of a multi-modal object dataset.

Description

Type of resource	text
Form	electronic resource; remote; computer; online resource
Extent	1 online resource.
Place	California
Place	[Stanford, California]
Publisher	[Stanford University]
Copyright date	2022; ©2022
Publication date	2022; 2022
Issuance	monographic
Language	English

Creators/Contributors

Author	Heravi, Negin
Degree supervisor	Bohg, Jeannette, 1981-
Degree supervisor	Okamura, Allison
Thesis advisor	Bohg, Jeannette, 1981-
Thesis advisor	Okamura, Allison
Thesis advisor	Heather Culbertson, PhD
Degree committee member	Heather Culbertson, PhD
Associated with	Stanford University, Department of Mechanical Engineering

Subjects

Genre	Theses
Genre	Text

Bibliographic information

Statement of responsibility	Negin Heravi.
Note	Submitted to the Department of Mechanical Engineering.
Thesis	Thesis Ph.D. Stanford University 2022.
Location	https://purl.stanford.edu/sj589ft0971

Access conditions

License: This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
