Interactive systems for data transformation and assessment

Kandel, Sean; Stanford University, Department of Computer Science.

Abstract/Contents

Abstract: In spite of advances in technologies for processing and visualizing data, analysts still spend an inordinate amount of time diagnosing data quality issues and manipulating data into a usable form. This process often constitutes the most tedious and time-consuming aspect of analysis. This dissertation contributes novel techniques for coupling automated routines with interactive interfaces to enable more rapid data transformation and quality assessment. In this dissertation, we first present an interview study with enterprise data analysts. We characterize the process of industrial data analysis, document how organizational features of an enterprise impact analysis, describe recurring pain points, and discuss design implications for visual analysis tools. Next we introduce Wrangler, an interactive system for creating data transformation scripts. Wrangler combines direct manipulation of visualized data with automatic inference of relevant transforms, enabling analysts to iteratively explore the space of applicable operations and preview their effects. We present user study results showing that Wrangler significantly reduces specification time and promotes the use of robust, auditable transforms instead of manual editing. Underlying the Wrangler interface is a declarative data transformation language that supports generation of executable code for a variety of runtime platforms. For large data sets, an analyst can build and test a script on a sample of data before applying the script to the entire data set. Often, errors or other anomalies will appear in the data set that did not appear in the sample. We introduce and evaluate two methods to aid more rapid debugging of large-scale transformation scripts. Surprise-based anomaly detection applies a model to classify output records as exceptions. Rule-based transform disambiguation generates example records to help analysts refine transformation scripts before applying them.
After transforming a data set, an analyst often inspects the result for other data quality issues. We present Profiler, a visual analytic tool for assessing data quality issues. We present Profiler's architecture, including modular components for custom data types, anomaly detection routines and summary visualizations. The system contributes novel methods for integrated statistical and visual analysis, automatic view suggestion, and scalable visual summaries that support real-time interaction entirely in the browser with millions of data points. Taken together, this dissertation contributes novel methods for integrating automated routines with interaction and visualization techniques to improve the efficiency and scale at which data analysts can work.

Description

Type of resource	text
Form	electronic; electronic resource; remote
Extent	1 online resource.
Publication date	2013
Issuance	monographic
Language	English

Creators/Contributors

Associated with	Kandel, Sean
Associated with	Stanford University, Department of Computer Science.
Primary advisor	Heer, Jeffrey Michael
Thesis advisor	Heer, Jeffrey Michael
Thesis advisor	Hanrahan, P. M. (Patrick Matthew)
Thesis advisor	Paepcke, Andreas

Subjects

Genre	Theses

Bibliographic information

Statement of responsibility	Sean Kandel.
Note	Submitted to the Department of Computer Science.
Thesis	Ph.D. Stanford University 2013
Location	electronic resource

Access conditions

License: This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
