Alleviating the variability and communication overheads of irregular parallelism for many-core chips


Abstract
Technology trends and architectural developments continue to increase the number of cores available on multicore chips. Some applications take easy advantage of the added concurrency, but important application domains struggle to do so. Such irregular applications, often involving sparse datasets or graph processing, have time-varying, hard-to-predict parallelism. They resist static parallelization and instead require dynamic load-balancing layers. These task-management layers necessarily rely more heavily on communication and synchronization than regular applications do, a problem only accentuated by growing on-chip communication delays. As core counts reach into the hundreds, this synchronization must be handled more carefully to avoid bottlenecks and load imbalances.

We explore solutions to these issues along two dimensions. On the software front, we design a task-stealing runtime layer for irregular parallel applications on hundred-core chips. By setting up fast and efficient ways to share information, the runtime embraces the varying relationship between available parallelism and core count, adapting dynamically to both abundance and scarcity of work. When tested with representative sparse-data and graph-processing applications on 128 cores, runtime overhead is reduced by 60%, while simultaneously achieving 15% faster execution and 29% lower system utilization. Unlike prior solutions, this requires no hardware assistance.

On the hardware front, we address the 'latency-multiplying' effects of relying on coherent caches for inter-core communication and synchronization. Our proposed techniques enable efficient synchronization and data sharing by reducing the number and complexity of the cache-coherence transactions required. Using ISA hints to express the preferred placement of both data accesses and atomic operations, faster and more efficient communication patterns can result. Our solution is compatible with existing architectures, requires only minimal code changes, and presents few practical implementation barriers. We show experimentally on 128 cores that these tools yield many-fold speedups on benchmarks and cut the optimized task-stealing runtime's remaining overhead, for a 70% speedup to runtime stealing and an 18% speedup to actual applications, along with increased energy efficiency.

The overall result is to significantly improve the viability of irregular parallelism on many-core chips: we minimize waste and enable faster on-chip communication by implementing smarter, more context-aware software task management and by enabling more efficient communication patterns at the cache-coherence level.
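The task-stealing scheme the abstract describes can be illustrated with a minimal sketch. This is not the thesis's actual runtime: the names (`Worker`, `push`, `pop`, `steal`, `run`) and the single-threaded round-robin simulation are illustrative assumptions, chosen only to show the core idea that each core works LIFO on its own queue while idle cores steal FIFO from a victim's queue.

```python
from collections import deque
import random

class Worker:
    """One core's task queue: the owner pushes/pops at the back (LIFO),
    while idle cores steal from the front (FIFO), reducing contention."""
    def __init__(self, wid):
        self.wid = wid
        self.tasks = deque()

    def push(self, task):
        self.tasks.append(task)

    def pop(self):
        # Owner takes the most recently pushed task.
        return self.tasks.pop() if self.tasks else None

    def steal(self):
        # Thief takes the oldest task from the opposite end.
        return self.tasks.popleft() if self.tasks else None

def run(workers):
    """Drain all queues, round-robin: a worker with no local work
    steals from a randomly chosen non-empty victim."""
    done = []
    rng = random.Random(0)  # seeded for a deterministic demo
    while True:
        progressed = False
        for w in workers:
            task = w.pop()
            if task is None:
                victims = [v for v in workers if v is not w and v.tasks]
                if victims:
                    task = rng.choice(victims).steal()
            if task is not None:
                done.append(task)
                progressed = True
        if not progressed:
            return done
```

For example, placing eight tasks on one of four workers and calling `run` drains all queues, with the three initially idle workers acquiring their work entirely by stealing. A real runtime on a hundred-core chip would face the issues the abstract highlights and this sketch ignores: concurrent deque access, victim selection under contention, and the coherence traffic each steal generates.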

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2016
Issuance monographic
Language English

Creators/Contributors

Associated with Moreno, Camilo A
Associated with Stanford University, Department of Electrical Engineering.
Primary advisor Kozyrakis, Christoforos, 1974-
Thesis advisor Horowitz, Mark
Thesis advisor Olukotun, Oyekunle Ayinde

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Camilo A. Moreno.
Note Submitted to the Department of Electrical Engineering.
Thesis Thesis (Ph.D.)--Stanford University, 2016.
Location electronic resource

Access conditions

Copyright
© 2016 by Camilo Andres Moreno
License
This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
