Massive-scale processing of record-oriented and graph data


Abstract/Contents

Abstract
Many data-driven applications perform computations on large volumes of data that do not fit on a single computer. These applications typically must use parallel, shared-nothing distributed software systems to perform their computations. This thesis addresses challenges in large-scale distributed data processing, with a particular focus on two primary areas: (i) theoretical foundations for understanding the costs of distribution; and (ii) processing large-scale graph data.

The first part of this thesis presents a theoretical framework for the MapReduce system, used to analyze the cost of distribution in different problem domains and to evaluate the "goodness" of different algorithms. We identify a fundamental tradeoff between the parallelism and communication costs of algorithms. We first study the setting in which computations are constrained to a single round of MapReduce. In this setting, we capture the cost of distributing a problem by deriving a lower-bound curve on the communication cost of any algorithm that solves the problem at different parallelism levels. We derive lower-bound curves for several problems, and prove that existing or new one-round algorithms solving these problems are optimal, i.e., they incur the minimum possible communication cost at each parallelism level. We then show that by allowing multiple rounds of MapReduce computation, we can solve problems more efficiently than any possible one-round algorithm.

The second part of this thesis addresses challenges in systems for processing large-scale graph data, with the goal of making graph computation more efficient and easier to program and debug. We focus on systems modeled after Google's Pregel framework for large-scale distributed graph processing. We begin by describing an open-source version of Pregel we developed, called GPS (for Graph Processing System). We then describe new static and dynamic schemes for partitioning graphs across machines, and present experimental results on the performance effects of different partitioning schemes. Next, we describe a set of algorithmic optimizations that address commonly appearing inefficiencies in algorithms programmed on Pregel-like systems. Because it can be very difficult to debug programs in Pregel-like systems, we developed a new replay-style debugger called Graft. In addition, we defined and implemented a set of high-level parallelizable graph primitives, called HelP (for High-level Primitives), as an alternative to programming graph algorithms using the low-level vertex-centric functions of existing systems. HelP primitives capture several commonly appearing operations in large-scale graph computations. We motivate and describe Graft and HelP using real-world applications and algorithms.
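To make the one-round lower-bound technique summarized above concrete, the following is a minimal sketch of one standard form of the argument, using the reducer-size notation (q, g(q), replication rate r) from the published work on MapReduce cost bounds related to this thesis; it is an illustration, not necessarily the exact formulation used in the thesis itself.

% q      : reducer size, an upper bound on how many inputs any one reducer
%          receives (smaller q means higher parallelism)
% g(q)   : the maximum number of outputs a single reducer can cover from q inputs
% |I|,|O|: the total numbers of inputs and outputs of the problem
% r      : replication rate, i.e., total communication divided by |I|
%
% Every output must be covered by some reducer, so for reducer sizes q_1, ..., q_p:
\[
  \sum_{i=1}^{p} g(q_i) \;\ge\; |O|.
\]
% If g(x)/x is nondecreasing, then g(q_i) \le q_i \, g(q)/q for each q_i \le q,
% and summing yields the lower-bound curve relating communication to parallelism:
\[
  r \;=\; \frac{1}{|I|}\sum_{i=1}^{p} q_i \;\ge\; \frac{q\,|O|}{g(q)\,|I|}.
\]

The vertex-centric programming model that the second part of the thesis builds on (and that the HelP primitives abstract away from) can likewise be illustrated with a toy, single-machine superstep loop. The sketch below computes connected components in the "adopt the smallest vertex id you have seen" style; all names in it are illustrative assumptions and do not reflect the GPS, Graft, or HelP APIs.

from collections import defaultdict

def connected_components(adjacency):
    """Toy synchronous (Pregel-style) connected components: each vertex
    repeatedly adopts the smallest vertex id it has seen and forwards it."""
    value = {v: v for v in adjacency}  # every vertex starts with its own id

    # Superstep 0: every vertex announces its id to its neighbors.
    outbox = defaultdict(list)
    for v in adjacency:
        for u in adjacency[v]:
            outbox[u].append(value[v])

    # Later supersteps: only vertices that received messages run "compute".
    while outbox:
        inbox, outbox = outbox, defaultdict(list)
        for v, messages in inbox.items():
            smallest = min(messages)
            if smallest < value[v]:
                value[v] = smallest
                for u in adjacency[v]:   # forward the new minimum
                    outbox[u].append(smallest)
            # else: the vertex effectively votes to halt (sends nothing)

    return value

if __name__ == "__main__":
    graph = {1: [2], 2: [1, 3], 3: [2],  # component {1, 2, 3}
             4: [5], 5: [4],             # component {4, 5}
             6: []}                      # isolated vertex
    print(connected_components(graph))   # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4, 6: 6}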

Description

Type of resource: text
Form: electronic; electronic resource; remote
Extent: 1 online resource.
Publication date: 2015
Issuance: monographic
Language: English

Creators/Contributors

Associated with: Salihoglu, Semih
Associated with: Stanford University, Department of Computer Science.
Primary advisor: Ré, Christopher
Primary advisor: Widom, Jennifer
Thesis advisor: Ullman, Jeffrey D., 1942-

Subjects

Genre: Theses

Bibliographic information

Statement of responsibility: Semih Salihoglu.
Note: Submitted to the Department of Computer Science.
Thesis: Thesis (Ph.D.)--Stanford University, 2015.
Location: electronic resource

Access conditions

Copyright
© 2015 by Semih Salihoglu
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC).
