Characterizing and improving graph algorithm performance on multicore systems

Rodia, Nicole Celeste

Characterizing and improving graph algorithm performance on multicore systems

<a href="https://embed.stanford.edu/iframe/?url=https%3A%2F%2Fpurl.stanford.edu%2Fhn324sv2962" class="su-underline">Show Content</a>

Abstract/Contents

Abstract: The rise of big data analytics has contributed to the growing popularity and scale of graph datasets, positioning graph analysis as an important research area. Graph analysis is an essential tool in many domains, including the physical and social sciences, healthcare, business intelligence, and cybersecurity. The increasing scale of graph analysis problems, with graphs containing millions or billions of vertices and edges, has made parallel and distributed graph algorithms essential for effective analysis of these large datasets. At the same time, modern multicore systems have been scaling to higher core counts, with dozens of complex cores in a single system. At first glance, it would seem that graph algorithms can leverage data-level parallelism across graph vertices and edges to utilize this large number of cores to quickly process large datasets. In fact, on multicore systems, graph algorithms are typically inefficient and perform poorly. The real-world informatics graphs used for today's big data analytics are derived from online social networks, web page links, genomics data, and the like. These networks possess fundamental properties that differ from traditional graphs like trees or meshes, resulting in different execution characteristics. We study the factors behind this lack of performance and demonstrate software and hardware techniques that improve performance. First, we analyze the perfor- mance characteristics of a core set of graph analysis algorithms across several infor- matics, physical, and synthetic graph datasets using a multicore microarchitectural simulator. Our characterization indicates that poor performance is due to several fac- tors, including irregular data access patterns, load imbalance, high communication- to-computation ratio, and ineffective caching techniques. To investigate the potential for caching to improve graph algorithm performance, we study the algorithms' data locality. Cache miss rates are an unreliable metric for data locality because they are heavily influenced by dataset size, cache size, and replacement policy. Thus, we use cache-independent locality analysis techniques, including reuse distance and a probability-based locality score, to analyze data locality in graph algorithms. Based on our analysis of data locality, we find that LRU-based cache replacement policies do not provide good performance for the data access patterns characteristic of graph algorithms. Further, we show that data access patterns correlate with algorithm characteristics, graph dataset structure, and vertex degree. These insights indicate that utilization of algorithm- and dataset-specific locality information paired with an improved cache replacement policy could significantly improve graph algorithm performance. Second, we employ our knowledge of real-world graph properties to redesign the algorithm for detecting strongly connected components (SCCs) in a directed graph, a fundamental graph analysis algorithm used in many scientific and engineering do- mains. Traditional approaches in parallel SCC detection show limited performance and poor scaling behavior when applied to large real-world graph instances. We investigate the shortcomings of the conventional approach and propose a series of ex- tensions that account for the fundamental properties of real-world graphs, particularly the small-world property. Our scalable implementation offers excellent performance on diverse small-world graphs resulting in a factor of 5 to 29 times parallel speedup over an optimal sequential algorithm on 16 cores and 32 hardware threads. Third, we propose a new cache replacement policy based on our observations of data locality in graph algorithms. The Graph Priority Insertion Policy (GPIP) uses per-data-structure software priority hints to improve last-level cache hit rates by maintaining data with higher locality in the cache. This policy provides an average reduction in misses per thousand instruction (MPKI) of 3% over least-recently used (LRU) replacement. Overall, our contributions serve to expand understanding of the characteristics of graph algorithms and improve graph algorithm performance through both software and hardware means.

Description

Type of resource	text
Form	electronic resource; remote; computer; online resource
Extent	1 online resource.
Place	California
Place	[Stanford, California]
Publisher	[Stanford University]
Copyright date	2019; ©2019
Publication date	2019; 2019
Issuance	monographic
Language	English

Creators/Contributors

Author	Rodia, Nicole Celeste
Degree supervisor	Olukotun, Oyekunle Ayinde
Thesis advisor	Olukotun, Oyekunle Ayinde
Thesis advisor	Kozyrakis, Christoforos, 1974-
Thesis advisor	Rosenblum, Mendel
Degree committee member	Kozyrakis, Christoforos, 1974-
Degree committee member	Rosenblum, Mendel
Associated with	Stanford University, Department of Electrical Engineering.

Subjects

Genre	Theses
Genre	Text

Bibliographic information

Statement of responsibility	Nicole Celeste Rodia.
Note	Submitted to the Department of Electrical Engineering.
Thesis	Thesis Ph.D. Stanford University 2019.
Location	electronic resource

Access conditions

License: This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).

Also listed in

View in SearchWorks

Loading usage metrics...