Resource-efficient execution of deep learning computations


Abstract/Contents

Abstract
Deep learning models have enabled state-of-the-art results across a broad range of applications. Training these models, however, is extremely time- and resource-intensive, in the extreme case taking weeks on clusters with thousands of expensive accelerators. As Moore's Law slows down, numerous parallel accelerators have been introduced to meet this new computational demand. This dissertation shows how model- and hardware-aware optimizations in software systems can help intelligently navigate this heterogeneity. In particular, it demonstrates how careful automated scheduling of computation across levels of the software stack can be used to perform distributed training and resource allocation more efficiently.

In the first part of this dissertation, we study pipelining, a technique commonly used as a performance optimization in various systems, as a way to perform more efficient distributed model training, both for models with small training footprints and for those with training footprints larger than the memory capacity of a single GPU. For certain types of models, pipeline parallelism can facilitate model training with lower communication overhead than previous methods. We introduce new strategies for pipeline parallelism with different tradeoffs among training throughput, memory footprint, and weight update semantics; these outperform existing methods in certain settings. Pipeline parallelism can also be used in conjunction with other forms of parallelism, creating a richer search space of parallelization strategies. By partitioning the training graph across accelerators in a model-aware way, pipeline parallelism combined with data parallelism can be up to 5x faster than data parallelism in isolation. We also use a principled combination of pipeline parallelism, tensor model parallelism, and data parallelism to efficiently scale training to language models with a trillion parameters on 3072 A100 GPUs (an aggregate throughput of 502 petaFLOP/s, or 52% of peak device throughput).

In the second part of this dissertation, we show how heterogeneous compute resources (e.g., different GPU generations like NVIDIA K80 and V100 GPUs) in a shared cluster, either in a private deployment or in the public cloud, should be partitioned among multiple users to optimize objectives specified over one or more training jobs. By formulating existing policies as optimization problems over the allocation, and then using a concept we call effective throughput, policies can be extended to be heterogeneity-aware. A policy-agnostic scheduling mechanism then helps realize the heterogeneity-aware allocations returned by these policies in practice. Using these heterogeneity-aware policies, we can improve various scheduling objectives, such as average job completion time, makespan, or cloud resource cost, by up to 3.5x. Towards the end of this dissertation, we also show how the dynamic pricing of spot instances can be plugged into this heterogeneity-aware policy framework to optimize cost objectives in the public cloud; this can help reduce cost compared to using more expensive on-demand instances alone.
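To make the pipeline-parallel scheduling tradeoffs mentioned in the abstract concrete, the following is a minimal Python sketch of a one-forward-one-backward (1F1B) schedule, one well-known point in the strategy space the abstract alludes to. This is an illustrative assumption, not the dissertation's actual implementation: the function name, arguments, and ("F", m)/("B", m) work-item encoding are invented for this example.

# A minimal, illustrative sketch of a 1F1B pipeline-parallel schedule.
# Not the dissertation's implementation; all names here are assumptions.

def one_f_one_b(num_stages: int, num_microbatches: int, stage: int):
    """Ordered (phase, microbatch) work items run by one pipeline stage."""
    # Warmup: stages closer to the front run extra forward passes so the
    # pipeline fills up.
    warmup = min(num_stages - stage - 1, num_microbatches)
    order = [("F", m) for m in range(warmup)]
    next_fwd = warmup
    next_bwd = 0
    # Steady state: alternate one forward pass with one backward pass,
    # which bounds in-flight activations (and hence memory) per stage.
    while next_fwd < num_microbatches:
        order.append(("F", next_fwd))
        next_fwd += 1
        order.append(("B", next_bwd))
        next_bwd += 1
    # Cooldown: drain the remaining backward passes.
    while next_bwd < num_microbatches:
        order.append(("B", next_bwd))
        next_bwd += 1
    return order

# Example: 4 pipeline stages, 8 microbatches per batch.
for stage in range(4):
    print(f"stage {stage}:", one_f_one_b(num_stages=4, num_microbatches=8, stage=stage))

Different schedules occupy different points on the tradeoff curve the abstract describes: flushing the pipeline after every batch preserves exact weight update semantics at the cost of idle time, while schedules that keep the pipeline full trade some of that exactness for higher throughput and a bounded memory footprint.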
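The "effective throughput" concept from the second part can likewise be illustrated with a small sketch: a job's effective throughput under a heterogeneous allocation is the time-weighted sum of its throughput on each accelerator type. All job names and throughput numbers below are invented for illustration, not measurements from the dissertation.

# A minimal sketch of effective throughput under heterogeneous allocation.
# All numbers are illustrative assumptions.

# throughputs[job][gpu_type]: training throughput (samples/sec) on one GPU.
throughputs = {
    "job_a": {"K80": 100.0, "V100": 800.0},  # scales well on newer GPUs
    "job_b": {"K80": 120.0, "V100": 240.0},  # benefits far less from a V100
}

def effective_throughput(job: str, allocation: dict) -> float:
    """allocation[gpu_type]: fraction of wall-clock time `job` runs on it."""
    return sum(frac * throughputs[job][t] for t, frac in allocation.items())

# With one K80 and one V100 shared by two jobs, a heterogeneity-aware policy
# gives the V100 to job_a (8x speedup) and the K80 to job_b (2x speedup);
# a heterogeneity-agnostic policy splits each GPU's time 50/50 across jobs.
aware = {"job_a": {"V100": 1.0}, "job_b": {"K80": 1.0}}
agnostic = {j: {"K80": 0.5, "V100": 0.5} for j in throughputs}

for name, alloc in [("aware", aware), ("agnostic", agnostic)]:
    per_job = {j: effective_throughput(j, alloc[j]) for j in throughputs}
    print(name, per_job, "total:", sum(per_job.values()))

A heterogeneity-aware policy then maximizes an objective over these effective throughputs subject to capacity constraints (each GPU's time fractions sum to at most 1), and a policy-agnostic mechanism time-shares the cluster to realize the resulting fractional allocation in practice.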

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place [Stanford, California]
Publisher [Stanford University]
Copyright date ©2021
Publication date 2021
Issuance monographic
Language English

Creators/Contributors

Author Narayanan, Deepak
Degree supervisor Zaharia, Matei
Thesis advisor Zaharia, Matei
Thesis advisor Fatahalian, Kayvon
Thesis advisor Ré, Christopher
Degree committee member Fatahalian, Kayvon
Degree committee member Ré, Christopher
Associated with Stanford University, Computer Science Department

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Deepak Narayanan.
Note Submitted to the Computer Science Department.
Thesis Ph.D., Stanford University, 2021.
Location https://purl.stanford.edu/qx792hd7022

Access conditions

Copyright
© 2021 by Deepak Narayanan
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
