Provenance in data-oriented workflows

Abstract/Contents

Abstract
Data-processing tasks are commonly managed using data-oriented workflows, in which input data sets are processed by a graph of transformations to produce output data. In data-oriented workflows, it can be useful to track data provenance (sometimes called lineage), which describes where data came from and how it has been manipulated and combined. We begin by giving a new general definition of provenance, introducing the notions of correctness, precision, and minimality. We then: (1) describe a wrapper-based approach for capturing provenance in workflows in which all transformations are either map or reduce functions; (2) describe a provenance-based approach for selectively refreshing one or more elements in the output data, i.e., computing the latest values of particular output elements based on modified input data; (3) show how logical provenance, i.e., provenance information stored at the transformation level, can often capture precise provenance relationships in a compact fashion; and (4) describe our prototype system, Panda (for Provenance And Data), which supports refresh in data-oriented workflows as well as debugging and drill-down using logical provenance. Overall, our work provides a comprehensive foundation, a set of algorithms, and a prototype system for provenance in data-oriented workflows.
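
As an illustration of item (1), the following is a minimal sketch of what wrapper-based provenance capture for map and reduce functions could look like. It is not taken from the thesis or from Panda: the names wrap_map and wrap_reduce, the (value, provenance) pair representation, and the word-count workflow are all assumptions made for this example.

```python
from collections import defaultdict


def wrap_map(map_fn):
    """Hypothetical wrapper: run a one-to-many map function and tag each
    output element with the id of the single input element it came from."""
    def wrapped(records):
        annotated = []
        for rec_id, value in records:
            for result in map_fn(value):
                annotated.append((result, frozenset([rec_id])))
        return annotated
    return wrapped


def wrap_reduce(key_fn, reduce_fn):
    """Hypothetical wrapper: group annotated elements by key, reduce each
    group, and tag the output with the union of the group's provenance."""
    def wrapped(annotated):
        groups = defaultdict(list)
        for value, prov in annotated:
            groups[key_fn(value)].append((value, prov))
        outputs = []
        for key, members in groups.items():
            values = [v for v, _ in members]
            prov = frozenset().union(*(p for _, p in members))
            outputs.append((reduce_fn(key, values), prov))
        return outputs
    return wrapped


# Assumed example workflow: word count over two input documents.
inputs = [("d1", "a b a"), ("d2", "b c")]
split_words = wrap_map(lambda line: [(word, 1) for word in line.split()])
count_words = wrap_reduce(
    key_fn=lambda pair: pair[0],
    reduce_fn=lambda key, pairs: (key, sum(n for _, n in pairs)),
)

results = count_words(split_words(inputs))
counts = {word: n for (word, n), _ in results}
provenance = {word: prov for (word, _), prov in results}
```

In this sketch, provenance["b"] contains both "d1" and "d2", so a change to either document marks only the outputs whose provenance mentions it as candidates for refresh, which is the intuition behind the selective refresh described in item (2).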

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2012
Issuance monographic
Language English

Creators/Contributors

Associated with Ikeda, Robert Michael
Associated with Stanford University, Computer Science Department
Primary advisor Widom, Jennifer
Thesis advisor Widom, Jennifer
Thesis advisor Das Sarma, Anish
Thesis advisor Garcia-Molina, Hector
Advisor Das Sarma, Anish
Advisor Garcia-Molina, Hector

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Robert Ikeda.
Note Submitted to the Department of Computer Science.
Thesis Thesis (Ph.D.)--Stanford University, 2012.
Location electronic resource

Access conditions

Copyright
© 2012 by Robert Michael Ikeda
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC).
