Optimizing processor architectures for warehouse-scale computers


Abstract/Contents

Abstract
Our society is becoming increasingly integrated with and reliant upon services hosted in large-scale datacenters. These services touch the lives of billions of people, in large part because they offer unprecedented, near-instantaneous access to information, accelerate communication and transactions, influence knowledge and sentiment, and have promoted significant economic growth. As more services and features come online, the complexity and amount of data that they generate and process are continually increasing, and as a result the demands on software systems have never been greater. In contrast, hardware scaling is not keeping up. The end of Dennard scaling near the turn of the century quickly pushed chips against the power wall, causing frequency and single-threaded performance scaling to plateau. Process scaling, or Moore's Law, is also in decline, as evidenced by a slowdown in cost scaling over the past several years. With these slowdowns, we no longer enjoy the exponential performance benefits that device scaling enabled for decades, and increasing instruction and data working sets further challenge on-chip capabilities. As a result, we are at a watershed moment where future performance sustainability will be driven more by computer architecture than by process technology.

This dissertation is about improving computer performance in spite of slowed hardware scaling and accelerated software scaling. It focuses on server-class, general-purpose CPUs, which are the workhorses of the warehouse-scale computers that host the majority of our online activities today. To this end, we present contributions that identify and measure performance challenges in important workloads today, provide immediate short-term optimizations that alleviate a portion of these challenges, and, finally, offer long-term optimizations, methodologies, and strategies for sustaining performance scaling into the future. We do this with a focus on both the instruction and data latency bottlenecks, which constitute the majority of CPU stalls.

We first present a detailed microarchitecture and memory subsystem analysis of Google's Web Search, one of the largest and most popular services in the world today. This study shows that stalls from memory latency present an opportunity to more than double performance. It also quantifies significant differences between the hardware performance of large-scale workloads like search and that of the traditional software benchmarks which have historically driven CPU design. We evaluate two opportunities to readjust the memory hierarchy to better support search: a rebalancing of on-chip cache and compute resources, and the introduction of a latency-optimized L4 cache to target shared heap accesses. These optimizations combine to yield between 27% and 38% performance improvement.

We next focus on the CPU instruction front-end, which accounts for as much as one third of all CPU stalls. We show that, in a large fleet, instruction cache misses are caused by a long tail of millions of unique instructions, which suggests the need for larger caches. However, recognizing that on-chip storage is not scaling and that larger caches increase access latencies, we instead address instruction availability via prefetching. Specifically, we propose a profile-driven software code prefetcher that can eliminate up to 96% of instruction cache misses with very little execution overhead.
Finally, we consider the CPU data back-end, which is the largest single contributor to stalls. Data working set sizes have long outpaced cache capacities, and so data prefetching is well studied. However, despite decades of research, only simple prefetcher designs are present in modern systems today, primarily because prefetcher proposals do not adequately balance generality and cost. One key reason for this is that relatively little is known about the dominant memory access patterns of important workloads. To that end, we show that access patterns can be extracted directly from programs via dataflow analysis instead of being estimated by indirect methods, and we propose dataflow-based memory pattern analysis tools that can reveal the capabilities and limitations of current prefetchers as well as guide future prefetcher designs. We evaluate the accuracy and timeliness of a dataflow-informed prefetcher and show that it consistently outperforms hardware prefetchers many times over while providing a much better design point in the landscape of generality and cost.

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place California
Place [Stanford, California]
Publisher [Stanford University]
Copyright date 2019; ©2019
Publication date 2019
Issuance monographic
Language English

Creators/Contributors

Author Ayers, Grant Edward
Degree supervisor Kozyrakis, Christoforos, 1974-
Thesis advisor Kozyrakis, Christoforos, 1974-
Thesis advisor Olukotun, Oyekunle Ayinde
Thesis advisor Ousterhout, John K
Degree committee member Olukotun, Oyekunle Ayinde
Degree committee member Ousterhout, John K
Associated with Stanford University, Computer Science Department.

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Grant Edward Ayers.
Note Submitted to the Computer Science Department.
Thesis Thesis (Ph.D.)--Stanford University, 2019.
Location electronic resource

Access conditions

Copyright
© 2019 by Grant Edward Ayers
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
