Hardware and software techniques for scalable thousand-core systems

Sanchez Martin, Daniel; Stanford University, Department of Electrical Engineering

Abstract/Contents

Abstract: Computer architecture is at a critical juncture. Single-thread performance has stopped scaling due to technology limitations and complexity constraints. Manufacturers now rely on multicore processors to scale performance efficiently, and parallel architectures, once rare, are now pervasive across all domains. To keep performance on an exponential curve, the number of cores is expected to increase exponentially, reaching thousands of cores in the next decade. However, achieving efficient thousand-core systems will require significant innovation across the software-hardware stack.

At a high level, two main issues hinder multicore scalability. First, hardware resources must scale efficiently, even as some of them are shared among thousands of threads. In particular, the memory hierarchy is hard to scale in several ways: caches spend considerable energy and latency to implement associative lookups, making them inefficient; conventional cache coherence techniques are prohibitively expensive beyond a few tens of cores; and caches cannot be easily shared among multiple threads or processes. Ideally, software should be able to configure these shared resources to provide good overall performance and quality of service (QoS) guarantees under all possible sharing scenarios. Second, software needs to use these parallel architectures efficiently without burdening the programmer with the complexities of large-scale parallelism. To expose ample parallelism, applications will need to be divided into fine-grain tasks of a few thousand instructions each, and scheduled dynamically in a manner that addresses the three major difficulties of fine-grain parallelism: locality, load imbalance, and excessive overheads.

The focus of this dissertation is to enable efficient, scalable and easy-to-use multicore systems with thousands of cores. To this end, we present contributions that address both hardware and software scalability bottlenecks. While the overarching goal of these techniques is to enable thousand-core systems, they also improve current systems with tens of cores.

On the hardware side, we present three techniques that, together, enable scalable cache hierarchies that can be shared efficiently. First, ZCache is a cache design that provides high associativity at low cost (e.g., 64-way associativity with the latency, energy and area of a 4-way cache) and is characterized by simple and accurate workload-independent analytical models. We use the high associativity and analytical models of ZCache to develop two techniques that address the scalability problems of shared resources in the cache hierarchy. Vantage implements scalable and efficient fine-grain cache partitioning, which enables hundreds of threads to share caches in a controlled fashion, providing configurability, isolation and QoS guarantees. SCD is a coherence directory that scales to thousands of cores efficiently and causes negligible directory-induced invalidations with minimal overprovisioning, enabling efficient cache coherence with QoS guarantees in large-scale multicores.

On the software side, our contributions enable efficient and scalable dynamic runtimes and schedulers for a wide range of applications and programming models. First, we develop a runtime system that uses high-level information from the programming model about parallelism, locality, and heterogeneity to perform scheduling dynamically and at fine granularity to avoid load imbalance. This runtime can schedule applications with complex dependencies (such as streaming workloads) efficiently and with a bounded memory footprint, and outperforms previous schedulers (both static and dynamic) on a wide variety of applications. Unfortunately, dynamic fine-grain runtimes and schedulers are hard to scale beyond tens of threads due to communication and synchronization overheads. We present a combined hardware-software approach to scale these schedulers efficiently. We design ADM, a hardware messaging technique tailored to the needs of scheduling and control applications, and use it to build scalable and efficient hardware-accelerated schedulers that match or outperform hardware-only schedulers and retain the flexibility of software schedulers.

Description

Type of resource	text
Form	electronic; electronic resource; remote
Extent	1 online resource.
Publication date	2012
Issuance	monographic
Language	English

Creators/Contributors

Associated with	Sanchez Martin, Daniel
Associated with	Stanford University, Department of Electrical Engineering
Primary advisor	Kozyrakis, Christoforos, 1974-
Thesis advisor	Kozyrakis, Christoforos, 1974-
Thesis advisor	Dally, William
Thesis advisor	Olukotun, Oyekunle Ayinde

Subjects

Genre	Theses

Bibliographic information

Statement of responsibility	Daniel Sanchez Martin.
Note	Submitted to the Department of Electrical Engineering.
Thesis	Thesis (Ph.D.)--Stanford University, 2012.
Location	electronic resource

Access conditions

License: This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
