Provenance in data-oriented workflows
- Data-processing tasks are commonly managed using data-oriented workflows, in which input data sets are processed by a graph of transformations to produce output data. In data-oriented workflows, it can be useful to track data provenance (also sometimes called lineage), which describes where data came from and how it has been manipulated and combined. We begin by giving a new general definition of provenance, introducing the notions of correctness, precision, and minimality. We then: (1) Describe a wrapper-based approach for capturing provenance in workflows in which all transformations are either map or reduce functions; (2) Describe a provenance-based approach for selectively refreshing one or more elements in the output data, i.e., computing the latest values of particular output elements based on modified input data; (3) Show how logical provenance, i.e., provenance information stored at the transformation level, can often capture precise provenance relationships in a compact fashion; (4) Describe our prototype system called Panda (for Provenance And Data) that supports refresh in data-oriented workflows, as well as debugging and drill-down using logical provenance. Overall, our work provides a comprehensive foundation, set of algorithms, and prototype system for provenance in data-oriented workflows.
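The wrapper-based capture idea in item (1) can be illustrated with a minimal sketch. This is not Panda's implementation: the element IDs, the `wrap_map` and `wrap_reduce` helpers, and the `trace_back` function are all hypothetical names introduced here, and the sketch assumes a simple linear workflow in which each map output derives from exactly one input element and each reduce output derives from every input element sharing its key.

```python
from collections import defaultdict

# Hypothetical representation (an assumption of this sketch):
# each element is an (element_id, value) pair, and a provenance table
# maps an output element ID to the list of input IDs it derives from.


def wrap_map(fn, name):
    """Wrap a one-to-many map function so it also records provenance."""
    def wrapped(inputs):
        outputs, prov = [], {}
        for in_id, value in inputs:
            for k, out_value in enumerate(fn(value)):
                out_id = f"{name}:{in_id}:{k}"
                outputs.append((out_id, out_value))
                prov[out_id] = [in_id]  # one contributing input per output
        return outputs, prov
    return wrapped


def wrap_reduce(key_fn, fn, name):
    """Wrap a reduce function; each output derives from its whole key group."""
    def wrapped(inputs):
        groups = defaultdict(list)
        for in_id, value in inputs:
            groups[key_fn(value)].append((in_id, value))
        outputs, prov = [], {}
        for key, members in groups.items():
            out_id = f"{name}:{key}"
            outputs.append((out_id, fn(key, [v for _, v in members])))
            prov[out_id] = [i for i, _ in members]
        return outputs, prov
    return wrapped


def trace_back(out_id, provs):
    """Follow provenance tables from the last transformation to the first."""
    frontier = [out_id]
    for prov in reversed(provs):
        frontier = [src for eid in frontier for src in prov[eid]]
    return frontier


# Toy word-count workflow: split lines into words, then count per word.
lines = [("in0", "a b"), ("in1", "b c")]
split = wrap_map(lambda line: line.split(), "split")
count = wrap_reduce(lambda w: w, lambda k, vs: (k, len(vs)), "count")

words, p_split = split(lines)
counts, p_count = count(words)
sources_of_b = trace_back("count:b", [p_split, p_count])
```

Here `sources_of_b` lists the input lines that contributed to the count for `b`. Tracing backward in this way is also one plausible starting point for the selective refresh described in item (2), since it identifies which input elements a single output element depends on.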
|Type of resource
|electronic; electronic resource; remote
|1 online resource.
|Ikeda, Robert Michael
|Stanford University, Computer Science Department
|Das Sarma, Anish
|Statement of responsibility
|Submitted to the Department of Computer Science.
|Thesis (Ph.D.)--Stanford University, 2012.
- © 2012 by Robert Michael Ikeda
- This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).