Interactive systems for data transformation and assessment
Abstract/Contents
- Abstract
- In spite of advances in technologies for processing and visualizing data, analysts still spend an inordinate amount of time diagnosing data quality issues and manipulating data into a usable form. This process often constitutes the most tedious and time-consuming aspect of analysis. This dissertation contributes novel techniques for coupling automated routines with interactive interfaces to enable more rapid data transformation and quality assessment. In this dissertation, we first present an interview study with enterprise data analysts. We characterize the process of industrial data analysis, document how organizational features of an enterprise impact analysis, describe recurring pain points, and discuss design implications for visual analysis tools. Next, we introduce Wrangler, an interactive system for creating data transformation scripts. Wrangler combines direct manipulation of visualized data with automatic inference of relevant transforms, enabling analysts to iteratively explore the space of applicable operations and preview their effects. We present user study results showing that Wrangler significantly reduces specification time and promotes the use of robust, auditable transforms instead of manual editing. Underlying the Wrangler interface is a declarative data transformation language that supports generation of executable code for a variety of runtime platforms. For large data sets, an analyst can build and test a script on a sample of data before applying the script to the entire data set. Often, errors or other anomalies appear in the full data set that did not appear in the sample. We introduce and evaluate two methods to aid more rapid debugging of large-scale transformation scripts. Surprise-based anomaly detection applies a model to classify output records as exceptions. Rule-based transform disambiguation generates example records to help analysts refine transformation scripts before applying them.
After transforming a data set, an analyst often inspects the result for other data quality issues. We present Profiler, a visual analytic tool for assessing data quality issues. We describe Profiler's architecture, including modular components for custom data types, anomaly detection routines, and summary visualizations. The system contributes novel methods for integrated statistical and visual analysis, automatic view suggestion, and scalable visual summaries that support real-time interaction entirely in the browser with millions of data points. Taken together, this dissertation contributes novel methods for integrating automated routines with interaction and visualization techniques to improve the efficiency and scale at which data analysts can work.
Description
Type of resource | text |
---|---|
Form | electronic; electronic resource; remote |
Extent | 1 online resource. |
Publication date | 2013 |
Issuance | monographic |
Language | English |
Creators/Contributors
Associated with | Kandel, Sean | |
---|---|---|
Associated with | Stanford University, Department of Computer Science. | |
Primary advisor | Heer, Jeffrey Michael | |
Thesis advisor | Heer, Jeffrey Michael | |
Thesis advisor | Hanrahan, P. M. (Patrick Matthew) | |
Thesis advisor | Paepcke, Andreas | |
Advisor | Hanrahan, P. M. (Patrick Matthew) | |
Advisor | Paepcke, Andreas |
Subjects
Genre | Theses |
---|
Bibliographic information
Statement of responsibility | Sean Kandel. |
---|---|
Note | Submitted to the Department of Computer Science. |
Thesis | Ph.D. Stanford University 2013 |
Location | electronic resource |
Access conditions
- Copyright
- © 2013 by Sean Kandel
- License
- This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC).