Reliable data quality assessment


Abstract/Contents

Abstract
Modern applications are increasingly powered by data. Sources often contain erroneous data that conflicts with data obtained from other sources, so denoising and integrating data from several sources is crucial for most applications. An important step in this integration is determining the quality of each data source. Our thesis is that we can reliably assess data quality in two settings: (i) when we have a small number of data sources with high inter-source overlap, and (ii) when we have a large number of data sources and a small amount of ground truth. These settings are primarily motivated by two applications: crowdsourcing and knowledge base construction.

We first study the crowdsourcing setting; evaluating workers is a critical aspect of any crowdsourcing system. We devise techniques for evaluating workers by finding confidence intervals on their error rates, focusing on making these intervals as tight as possible. Our techniques work under very general scenarios, such as when not all workers have attempted every task (a fairly common situation in practice), when tasks have non-boolean responses, and when workers have different biases for positive and negative tasks. We demonstrate both the conciseness and the accuracy of our confidence intervals by testing them under a variety of conditions and on multiple real-world datasets.

Next, motivated by knowledge base construction, we revisit data fusion, i.e., the problem of integrating noisy data from multiple sources by estimating the source accuracies, and show that a simple logistic-regression-based model can capture many existing approaches to data fusion. This allows us to put data fusion on a solid statistical footing and obtain solutions with rigorous theoretical guarantees. We introduce SLiMFast, a framework that converts data fusion into a learning and inference problem over discriminative probabilistic models. Our framework lets us extend data fusion to take into account domain-specific features that are indicative of the accuracy of data sources, and to design data fusion approaches that yield source accuracy estimates with 5x lower error than competing baselines. We also design an optimizer to automatically select the best algorithm for learning the model's parameters, and experimentally show that it chooses the best algorithm in almost all cases.

Finally, we look at the problem of trading off data quality for cost. We focus on selection queries involving UDF predicates that are expensive, either in monetary cost or in latency. We provide a family of techniques for approximately processing these queries at low cost while satisfying user-specified precision and recall constraints. Our techniques apply to a variety of scenarios: when selection probabilities of tuples are available beforehand, when this information is available but noisy, and when no such prior information is available. We also generalize our techniques to more complex queries. We test our techniques on real datasets and show that they achieve savings of up to 80% in UDF evaluations while incurring only a small reduction in accuracy.
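
To make the crowdsourcing contribution concrete, the sketch below computes a confidence interval for a single worker's error rate from tasks with known answers, using the standard Wilson score interval. The interval construction, function name, and numbers are illustrative assumptions only; the thesis develops tighter intervals that also cover workers who attempt different subsets of tasks, non-boolean responses, and asymmetric biases.

    # Hypothetical sketch: a Wilson score confidence interval for one worker's
    # error rate, estimated from tasks with known gold answers. This is a
    # textbook binomial interval used only for illustration; it is not the
    # interval construction developed in the thesis.
    import math

    def wilson_interval(errors, attempts, z=1.96):
        """Return a (lower, upper) confidence interval for an error rate."""
        if attempts == 0:
            return (0.0, 1.0)  # no evidence: the error rate is unconstrained
        p_hat = errors / attempts
        denom = 1 + z * z / attempts
        center = (p_hat + z * z / (2 * attempts)) / denom
        margin = (z / denom) * math.sqrt(
            p_hat * (1 - p_hat) / attempts + z * z / (4 * attempts * attempts)
        )
        return (max(0.0, center - margin), min(1.0, center + margin))

    # Example: a worker who got 12 of 100 gold tasks wrong.
    print(wilson_interval(12, 100))  # approximately (0.07, 0.20)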
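
The data fusion part argues that a simple logistic-regression-based model captures many existing fusion approaches. The sketch below is a minimal, assumed encoding of that idea: each claim becomes a row of per-source indicator features, a small amount of ground truth supplies labels, and the learned weights act as source reliability scores. The feature encoding, the toy data, and the use of scikit-learn are illustrative choices, not the SLiMFast model itself.

    # Minimal sketch (assumptions, not SLiMFast itself): treat data fusion as
    # logistic regression over per-source indicator features. Each row is one
    # claim; feature j is 1 if source j asserts the claim and 0 otherwise;
    # the label records whether the claim is actually true.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    n_sources = 4
    # Toy data: rows are claims, columns mark the sources asserting each claim.
    X = np.array([
        [1, 1, 0, 0],
        [0, 1, 1, 0],
        [1, 0, 0, 1],
        [0, 0, 1, 1],
        [1, 1, 1, 0],
        [0, 1, 0, 1],
    ])
    y = np.array([1, 1, 0, 0, 1, 0])  # small amount of ground truth

    model = LogisticRegression().fit(X, y)
    # Larger weights indicate sources whose claims are more often correct.
    print(dict(zip([f"source_{j}" for j in range(n_sources)], model.coef_[0])))
    # Estimated probability that a new claim asserted by sources 0 and 2 is true:
    print(model.predict_proba([[1, 0, 1, 0]])[0, 1])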
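
For the cost/quality trade-off, one of the scenarios described is having (possibly noisy) selection probabilities for tuples ahead of time. A plausible way to picture this, sketched below with assumed placeholder thresholds, is to accept or reject tuples whose prior probability is conclusive and pay for the expensive UDF only on the uncertain remainder; the thesis instead derives the decision rule from the user-specified precision and recall constraints.

    # Hedged sketch of one scenario from the cost/quality trade-off chapter:
    # given an (assumed) prior selection probability per tuple, skip the
    # expensive UDF on clearly-accept and clearly-reject tuples and evaluate
    # it only in the uncertain middle band. The thresholds are placeholders.

    def approximate_select(tuples, prob, expensive_udf,
                           accept_above=0.9, reject_below=0.1):
        """Return tuples predicted to satisfy the UDF predicate, evaluating
        the UDF only where the prior probability is inconclusive."""
        selected = []
        for t in tuples:
            p = prob(t)
            if p >= accept_above:
                selected.append(t)        # accept without paying for the UDF
            elif p <= reject_below:
                continue                  # reject without paying for the UDF
            elif expensive_udf(t):
                selected.append(t)        # only here do we pay the UDF cost
        return selected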

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2016
Issuance monographic
Language English

Creators/Contributors

Associated with Joglekar, Manas
Associated with Stanford University, Department of Computer Science.
Primary advisor Garcia-Molina, Hector
Thesis advisor Ré, Christopher
Thesis advisor Ullman, Jeffrey D., 1942-

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Manas Joglekar.
Note Submitted to the Department of Computer Science.
Thesis Thesis (Ph.D.)--Stanford University, 2016.
Location electronic resource

Access conditions

Copyright
© 2016 by Manas Rajendra Joglekar
License
This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
