Dynamic strategies for crowdsourced data management

Abstract/Contents

Abstract
As the world becomes ever more connected to the Internet, crowdsourcing marketplaces such as Amazon Mechanical Turk give us a mechanism for the large-scale inclusion of humans in computational workflows. However, many crowdworkers make mistakes and disagree with one another, some workers are malicious and contribute only spam, and the crowd can often be both slow and expensive. Despite these challenges, in this thesis we develop new algorithms that allow us to use the crowd effectively while still ensuring quick, low-cost, and accurate results. First, we consider the commonly encountered labeling or filtering problem, where we use the crowd to label or filter items in a dataset. We describe CrowdDQS (Crowd Dynamic Question Selection), a general-purpose system we developed that can reduce the cost of labeling by up to 6 times in practice by dynamically issuing questions to workers and automatically detecting and blocking poor workers. Next, we consider the maximum problem, where we are presented with a set of records, each with an unknown intrinsic score, and our goal is to use the crowd to find the record with the highest score. We develop hybrid strategies that judiciously combine a ratings interface and a comparisons interface to find the maximum more efficiently than typical single-interface strategies. Finally, we consider the problem of using the crowd to cluster together similar records or to perform entity resolution (ER). We significantly reduce the cost of pairwise crowd clustering approaches by soliciting attribute labels on records from the crowd, and then asking for pairwise judgments only between records with similar sets of attribute labels. We describe strategies that allow us to finely control the accuracy of our results while still maintaining significant cost reductions.

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2017
Issuance monographic
Language English

Creators/Contributors

Associated with Khan, Asif R
Associated with Stanford University, Department of Electrical Engineering.
Primary advisor Garcia-Molina, Hector
Thesis advisor Garcia-Molina, Hector
Thesis advisor Mitra, Subhasish
Thesis advisor Ré, Christopher

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Asif R. Khan.
Note Submitted to the Department of Electrical Engineering.
Thesis Thesis (Ph.D.)--Stanford University, 2017.
Location electronic resource

Access conditions

Copyright
© 2017 by Asif Khan
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
