Combining algorithms and humans for large-scale data integration


The number of enterprises that exploit the data they collect in their decision-making processes has been growing rapidly. A fundamental component in achieving this goal is data integration, the process of combining and unifying data from multiple sources. Through data integration, an enterprise can find the underlying types of entities and entity attributes in the collected data, and map the data to those types and attributes. Enterprises expect constant improvements in the quality of the data integration outcome. At the same time, the volume and diversity of the collected data increase at a fast pace, as enterprises decide to monitor and collect data from more of their operations or from external sources. To satisfy the quality-improvement requirements, human involvement in data integration has been proposed and applied in recent years: the main idea is to use human intelligence in tasks that require an understanding of data semantics or involve data containing images, video, or natural language. In this thesis, we study approaches for different data integration tasks that combine machine algorithms and human intelligence. In particular, we focus on three data integration tasks: entity resolution, top-k item detection, and data aggregation. In entity resolution, for each underlying entity in a set of records (e.g., people's face images), the goal is to create one cluster containing only the records that refer to that entity (e.g., the same person). In top-k item detection, the input is a set of records (e.g., restaurant images) and the goal is to find the top-k records based on some criterion (e.g., how nice the restaurant looks in the image). In data aggregation, the goal is to create a summary of some input content (e.g., a 5-minute summary of a 2-hour movie, or a summary of the main points in the reviews for a movie or product). The approaches we develop provide solutions to a spectrum of challenges that hybrid machine-human data-integration approaches face.
Furthermore, our approaches provide significant performance, accuracy, and monetary-cost gains compared to state-of-the-art alternatives. The gains come from efficiently solving a number of fundamental problems for hybrid data-integration approaches: a) combining human errors with machine-algorithm errors, b) selecting the best questions to ask humans, c) selecting between different interfaces for human questions, d) managing the resources available for human questions, e) preprocessing data before human curation is applied, and f) creating a context that describes the whole dataset to assist humans in answering questions accurately.


Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place California
Place [Stanford, California]
Publisher [Stanford University]
Copyright date 2018; ©2018
Publication date 2018
Issuance monographic
Language English


Author Verroios, Vasilis
Degree supervisor Garcia-Molina, Hector
Thesis advisor Garcia-Molina, Hector
Thesis advisor Bailis, Peter
Thesis advisor Ré, Christopher
Degree committee member Bailis, Peter
Degree committee member Ré, Christopher
Associated with Stanford University, Computer Science Department.


Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Vasilis Verroios.
Note Submitted to the Department of Computer Science.
Thesis Thesis Ph.D. Stanford University 2018.
Location electronic resource

Access conditions

© 2018 by Vasileios Verroios
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC).
