Interactive systems for data transformation and assessment

Abstract/Contents

Abstract
In spite of advances in technologies for processing and visualizing data, analysts still spend an inordinate amount of time diagnosing data quality issues and manipulating data into a usable form. This process often constitutes the most tedious and time-consuming aspect of analysis. This dissertation contributes novel techniques for coupling automated routines with interactive interfaces to enable more rapid data transformation and quality assessment. In this dissertation, we first present an interview study with enterprise data analysts. We characterize the process of industrial data analysis, document how organizational features of an enterprise impact analysis, describe recurring pain points, and discuss design implications for visual analysis tools. Next, we introduce Wrangler, an interactive system for creating data transformation scripts. Wrangler combines direct manipulation of visualized data with automatic inference of relevant transforms, enabling analysts to iteratively explore the space of applicable operations and preview their effects. We present user study results showing that Wrangler significantly reduces specification time and promotes the use of robust, auditable transforms instead of manual editing. Underlying the Wrangler interface is a declarative data transformation language that supports generation of executable code for a variety of runtime platforms. For large data sets, an analyst can build and test a script on a sample of data before applying the script to the entire data set. Often, errors or other anomalies appear in the full data set that did not appear in the sample. We introduce and evaluate two methods to aid more rapid debugging of large-scale transformation scripts. Surprise-based anomaly detection applies a model to classify output records as exceptions. Rule-based transform disambiguation generates example records to help analysts refine transformation scripts before applying them.
After transforming a data set, an analyst often inspects the result for other data quality issues. We present Profiler, a visual analytic tool for assessing data quality issues. We describe Profiler's architecture, including modular components for custom data types, anomaly detection routines, and summary visualizations. The system contributes novel methods for integrated statistical and visual analysis, automatic view suggestion, and scalable visual summaries that support real-time interaction entirely in the browser with millions of data points. Taken together, this dissertation contributes novel methods for integrating automated routines with interaction and visualization techniques to improve the efficiency and scale at which data analysts can work.

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2013
Issuance monographic
Language English

Creators/Contributors

Associated with Kandel, Sean
Associated with Stanford University, Department of Computer Science.
Primary advisor Heer, Jeffrey Michael
Thesis advisor Heer, Jeffrey Michael
Thesis advisor Hanrahan, P. M. (Patrick Matthew)
Thesis advisor Paepcke, Andreas
Advisor Hanrahan, P. M. (Patrick Matthew)
Advisor Paepcke, Andreas

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Sean Kandel.
Note Submitted to the Department of Computer Science.
Thesis Ph.D. Stanford University 2013
Location electronic resource

Access conditions

Copyright
© 2013 by Sean Kandel
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
