Alleviating the variability and communication overheads of irregular parallelism for many-core chips


Abstract
Technology trends and architectural developments continue to increase the number of cores available on multicore chips. Some applications take easy advantage of the added concurrency, but important application domains struggle to do so. Such irregular applications, often involving sparse datasets or graph processing, have time-varying, hard-to-predict parallelism. They resist static parallelization and instead require dynamic load-balancing layers. These task-management layers necessarily rely more heavily on communication and synchronization than regular applications do, a problem only accentuated by growing on-chip communication delays. As core counts reach into the hundreds, this synchronization must be handled more carefully to avoid bottlenecks and load imbalances.

We explore solutions to these issues along two dimensions. On the software front, we design a task-stealing runtime layer for irregular parallel applications on hundred-core chips. By setting up fast and efficient ways to share information, the runtime embraces the varying relationship between available parallelism and core count, adapting dynamically to both abundance and scarcity of work. When tested with representative sparse-data and graph-processing applications on 128 cores, runtime overhead is reduced by 60%, while simultaneously achieving 15% faster execution and 29% lower system utilization. Unlike prior solutions, this requires no hardware assistance.

On the hardware front, we address the 'latency-multiplying' effects of relying on coherent caches for inter-core communication and synchronization. Our proposed techniques enable efficient synchronization and data sharing by reducing the number and complexity of the cache-coherence transactions required. Using ISA hints to express the preferred placement of both data accesses and atomic operations, faster and more efficient communication patterns can result. Our solution is compatible with existing architectures, requires only minimal code changes, and presents few practical implementation barriers. We show experimentally on 128 cores that these tools yield many-fold speedups on benchmarks and cut the optimized task-stealing runtime's remaining overhead, for a 70% speedup to runtime stealing and an 18% speedup to actual applications, along with increased energy efficiency.

The overall result is to significantly improve the viability of irregular parallelism on many-core chips: we minimize waste and enable faster on-chip communication by implementing smarter, more context-aware software task management and by enabling more efficient communication patterns at the cache-coherence level.
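The task-stealing scheme the abstract describes can be illustrated with a minimal sketch. This is not the thesis's actual runtime: the names (`Worker`, `push`, `pop`, `steal`, `run`) and the single-threaded round-robin simulation are illustrative assumptions, chosen only to show the core idea that each core works LIFO on its own queue while idle cores steal FIFO from a victim's queue.

```python
from collections import deque
import random

class Worker:
    """One core's task queue: the owner pushes/pops at the back (LIFO),
    while idle cores steal from the front (FIFO), reducing contention."""
    def __init__(self, wid):
        self.wid = wid
        self.tasks = deque()

    def push(self, task):
        self.tasks.append(task)

    def pop(self):
        # Owner takes the most recently pushed task.
        return self.tasks.pop() if self.tasks else None

    def steal(self):
        # Thief takes the oldest task from the opposite end.
        return self.tasks.popleft() if self.tasks else None

def run(workers):
    """Drain all queues, round-robin: a worker with no local work
    steals from a randomly chosen non-empty victim."""
    done = []
    rng = random.Random(0)  # seeded for a deterministic demo
    while True:
        progressed = False
        for w in workers:
            task = w.pop()
            if task is None:
                victims = [v for v in workers if v is not w and v.tasks]
                if victims:
                    task = rng.choice(victims).steal()
            if task is not None:
                done.append(task)
                progressed = True
        if not progressed:
            return done
```

For example, placing eight tasks on one of four workers and calling `run` drains all queues, with the three initially idle workers acquiring their work entirely by stealing. A real runtime on a hundred-core chip would face the issues the abstract highlights and this sketch ignores: concurrent deque access, victim selection under contention, and the coherence traffic each steal generates.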

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2016
Issuance monographic
Language English

Creators/Contributors

Associated with Moreno, Camilo A
Associated with Stanford University, Department of Electrical Engineering.
Primary advisor Kozyrakis, Christoforos, 1974-
Thesis advisor Horowitz, Mark
Thesis advisor Olukotun, Oyekunle Ayinde

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Camilo A. Moreno.
Note Submitted to the Department of Electrical Engineering.
Thesis Thesis (Ph.D.)--Stanford University, 2016.
Location electronic resource

Access conditions

Copyright
© 2016 by Camilo Andres Moreno
License
This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
