Optimizing processor architectures for warehouse-scale computers


Abstract/Contents

Abstract
Our society is becoming increasingly integrated with and reliant upon services hosted in large-scale datacenters. These services touch the lives of billions of people, in large part because they offer unprecedented, near-instantaneous access to information, accelerate communication and transactions, influence knowledge and sentiment, and have promoted significant economic growth. As more services and features come online, the complexity and amount of data that they generate and process are continually increasing, and as a result the demands on software systems have never been greater. In contrast, hardware scaling is not keeping up. The end of Dennard scaling near the turn of the century quickly pushed chips against the power wall, causing frequency and single-threaded performance scaling to plateau. Process scaling, or Moore's Law, is also in decline, as evidenced by a slowdown in cost scaling over the past several years. With these slowdowns, we no longer enjoy the exponential performance benefits that device scaling enabled for decades, and increasing instruction and data working sets further challenge on-chip capabilities. As a result, we are at a watershed moment where future performance sustainability will be driven more by computer architecture than by process technology.

This dissertation is about improving computer performance in spite of slowed hardware scaling and accelerated software scaling. It focuses on server-class, general-purpose CPUs, which are the workhorses of the warehouse-scale computers that host the majority of our online activities today. To this end, we present contributions that identify and measure performance challenges in important workloads today, provide immediate short-term optimizations that alleviate a portion of these challenges, and, finally, offer long-term optimizations, methodologies, and strategies for sustaining performance scaling into the future. We do this with a focus on both the instruction and data latency bottlenecks, which constitute the majority of CPU stalls.

We first present a detailed microarchitecture and memory subsystem analysis of Google's Web Search, one of the largest and most popular services in the world today. This study shows that stalls from memory latency present an opportunity to more than double performance. It also quantifies significant differences between the hardware performance of large-scale workloads like search and that of the traditional software benchmarks which have historically driven CPU design. We evaluate two opportunities to readjust the memory hierarchy to better support search: a rebalancing of on-chip cache and compute resources, and the introduction of a latency-optimized L4 cache to target shared heap accesses. These optimizations combine to yield between 27% and 38% performance improvement.

We next focus on the CPU instruction front-end, which accounts for as much as one third of all CPU stalls. We show that, in a large fleet, instruction cache misses are caused by a long tail of millions of unique instructions, which suggests the need for larger caches. However, recognizing that on-chip storage is not scaling and that larger caches increase access latencies, we instead address instruction availability via prefetching. Specifically, we propose a profile-driven software code prefetcher that can eliminate up to 96% of instruction cache misses with very little execution overhead.
Finally, we consider the CPU data back-end, which is the largest single contributor to stalls. Data working set sizes have long outpaced cache capacities, and so data prefetching is well studied. However, despite decades of research, only simple prefetcher designs are present in modern systems today, primarily because prefetcher proposals do not adequately balance generality and cost. One key reason for this is that relatively little is known about the dominant memory access patterns of important workloads. To that end, we show that access patterns can be extracted directly from programs via dataflow analysis instead of being estimated by indirect methods, and we propose dataflow-based memory pattern analysis tools that can reveal the capabilities and limitations of current prefetchers as well as guide future prefetcher designs. We evaluate the accuracy and timeliness of a dataflow-informed prefetcher and show that it consistently outperforms hardware prefetchers many times over while providing a much better design point in the landscape of generality and cost.

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place California
Place [Stanford, California]
Publisher [Stanford University]
Copyright date 2019; ©2019
Publication date 2019
Issuance monographic
Language English

Creators/Contributors

Author Ayers, Grant Edward
Degree supervisor Kozyrakis, Christoforos, 1974-
Thesis advisor Kozyrakis, Christoforos, 1974-
Thesis advisor Olukotun, Oyekunle Ayinde
Thesis advisor Ousterhout, John K
Degree committee member Olukotun, Oyekunle Ayinde
Degree committee member Ousterhout, John K
Associated with Stanford University, Computer Science Department.

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Grant Edward Ayers.
Note Submitted to the Computer Science Department.
Thesis Thesis (Ph.D.)--Stanford University, 2019.
Location electronic resource

Access conditions

Copyright
© 2019 by Grant Edward Ayers
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
