Generating content for AI, and AI-generated content: synthetic data for and from machine learning
- Over the past decade, deep learning has demonstrated remarkable capabilities, with a strong reliance on large-scale data. This period saw scaling laws take effect in large language models, diffusion models, and segmentation models, which reached performance approaching human level by harnessing large-scale real-world datasets. In embodied AI, however, collecting real-world data is challenging and costly because of the fragility and limited speed of robots. In contrast, synthetic data and simulated environments offer an efficient and cost-effective means to train robotic agents and to generate precise supervision. Synthetic data therefore emerges as a promising source of scalable, high-quality training data for embodied AI agents. Because the ultimate goal is for agents to generalize to real-world scenarios, the primary challenge is to capture real-world distributions accurately, bridging the simulation-to-reality gap (Sim2Real gap). Addressing this challenge involves two key aspects: generating synthetic data and using it effectively. In this dissertation, I present two lines of work, one contributing to each of these facets. The first line of work tackles the generation of synthetic data. I introduce two methods aimed at producing synthetic data that align with real-world distributions. First, I propose iGibson, a simulation environment for developing robotic solutions to interactive tasks in large-scale realistic scenes. iGibson builds simulated scenes that are interactive replicas of 3D-scanned real-world homes, aligning the distribution of objects, layouts, and interactability with that of the real world. To further facilitate the training of more robust embodied agents, iGibson offers domain randomization procedures for materials (both visual appearance and dynamics properties) and object shapes while respecting the distribution of object placements and preserving interactability.
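As a rough illustration of the domain randomization idea described above, the following is a minimal Python sketch under stated assumptions. All names, parameter ranges, and the scene representation (`SceneObject`, `VISUAL_RANGES`, `DYNAMICS_RANGES`, `randomize_scene`) are hypothetical and are not iGibson's API. The sketch only shows the general principle: material parameters and object shapes are resampled while each object's placement stays fixed. Keeping the object category fixed is used here as a simple stand-in for preserving interactability.

```python
import random
from dataclasses import dataclass

# Hypothetical randomization ranges (assumed values, not iGibson's actual parameters).
VISUAL_RANGES = {"roughness": (0.1, 0.9), "metallic": (0.0, 0.5)}
DYNAMICS_RANGES = {"friction": (0.3, 1.2), "mass_scale": (0.5, 2.0)}


@dataclass
class SceneObject:
    category: str    # kept fixed, as a proxy for keeping the object's interactions
    position: tuple  # kept fixed, so the placement distribution is respected
    model_id: str    # shape variant; resampled within the same category
    params: dict     # visual and dynamics material parameters


def randomize_object(obj, model_pool, rng):
    """Return a randomized copy of obj with the same category and placement,
    a new shape drawn from the same category, and new material parameters."""
    ranges = {**VISUAL_RANGES, **DYNAMICS_RANGES}
    params = {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}
    model_id = rng.choice(model_pool[obj.category])
    return SceneObject(obj.category, obj.position, model_id, params)


def randomize_scene(objects, model_pool, seed=0):
    """Produce one randomized variant of a scene; the same seed gives the same variant."""
    rng = random.Random(seed)
    return [randomize_object(obj, model_pool, rng) for obj in objects]
```

For example, one could call `randomize_scene(scene, pool, seed=episode)` at the start of each training episode, so that an agent sees a new combination of appearances, dynamics, and shapes in an otherwise unchanged layout.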
Second, I propose GINA-3D, a generative model that uses real-world driving data from camera and LiDAR sensors to create photo-realistic 3D implicit neural assets of diverse vehicles and pedestrians. Unlike iGibson, which requires significant human effort to build simulation environments, GINA-3D leverages generative models that learn directly to generate synthetic assets matching the distribution of real-world observations. I present evidence that GINA-3D achieves state-of-the-art quality and diversity in both the visual appearance and the geometry of the generated assets. The second line of work focuses on using synthetic data effectively: by designing robust learning algorithms and models, we can obtain models that generalize better to real-world scenarios. First, I introduce Taskonomy, which leverages synthetic data to build a computational taxonomy and a map of the task space. By exploiting task relationships, I show evidence that we can use at least an order of magnitude less data than fully supervised methods, and that the resulting models generalize well to real-world datasets. Then, I introduce Situational Fusion, which trains an agent to fuse a large set of visual representations corresponding to diverse visual perception abilities. This approach significantly improves performance in novel environments over a real-world-pretrained baseline and other fusion methods. Finally, I introduce ACID, an action-conditional implicit visual dynamics model for deformable object manipulation. By leveraging large-scale synthetic data and unique supervision for geometry and dynamics, I can apply the simulation-trained ACID model directly to real-world objects and successfully manipulate them into target configurations. To summarize, this dissertation studies how synthetic data and simulated environments offer promising solutions for embodied AI. The crux of the presented works lies in the careful generation of synthetic data and its effective use.
Through strategies such as simulation environments, generative models, task transfer learning, visuomotor policy training, and visual dynamics modeling, we can bridge the gap between simulated training and real-world application, paving the way for more capable and versatile embodied AI agents.
|Type of resource
|electronic resource; remote; computer; online resource
|1 online resource.
|Degree committee member
|Stanford University, School of Engineering
|Stanford University, Computer Science Department
|Statement of responsibility
|Submitted to the Computer Science Department.
|Thesis Ph.D. Stanford University 2023.
- © 2023 by Bokui Shen
- This work is licensed under a Creative Commons Attribution 3.0 Unported license (CC BY).