Combining algorithms and humans for large-scale data integration

Verroios, Vasilis

Abstract/Contents

Abstract: The number of enterprises that exploit the data they collect in their decision-making processes has been growing rapidly. A fundamental component in achieving this goal is data integration, the process of combining and unifying data from multiple sources. Through data integration, an enterprise can find the underlying types of entities and entity attributes in the collected data, and map the data to those types and attributes. Enterprises expect constant improvements in the quality of the data integration outcome. At the same time, the volume and diversity of the collected data increase at a fast pace, as enterprises decide to monitor and collect data from more of their operations or from external sources. To satisfy the quality-improvement requirements, human involvement in data integration has been proposed and applied in recent years: the main idea is to use human intelligence in tasks that require an understanding of data semantics or involve data containing images, video, or natural language. In this thesis, we study approaches for different data integration tasks that combine machine algorithms and human intelligence. In particular, we focus on three data integration tasks: entity resolution, top-k item detection, and data aggregation. In entity resolution, for each underlying entity in a set of records (e.g., people's face images), the goal is to create one cluster containing only the records that refer to that entity (e.g., the same person). In top-k item detection, the input is a set of records (e.g., restaurant images) and the goal is to find the top-k records based on some criteria (e.g., how nice the restaurant looks in the image). In data aggregation, the goal is to create a summary of some input content (e.g., a 5-minute summary of a 2-hour movie, or a summary of the main points in the reviews for a movie or product).
The approaches we develop provide solutions to a spectrum of challenges that hybrid machine-human data-integration approaches face. Furthermore, our approaches provide significant performance, accuracy, and monetary-cost gains compared to state-of-the-art alternatives. The gains come from efficiently solving a number of fundamental problems for hybrid data-integration approaches: a) combining human errors with machine algorithm errors, b) selecting the best questions to ask humans, c) selecting between different interfaces for human questions, d) managing the resources available for human questions, e) preprocessing data before human curation is applied, and f) creating a context describing the whole dataset, to assist humans in accurately answering questions.

Description

Type of resource	text
Form	electronic resource; remote; computer; online resource
Extent	1 online resource.
Place	California
Place	[Stanford, California]
Publisher	[Stanford University]
Copyright date	2018; ©2018
Publication date	2018
Issuance	monographic
Language	English

Creators/Contributors

Author	Verroios, Vasilis
Degree supervisor	Garcia-Molina, Hector
Thesis advisor	Garcia-Molina, Hector
Thesis advisor	Bailis, Peter
Thesis advisor	Ré, Christopher
Degree committee member	Bailis, Peter
Degree committee member	Ré, Christopher
Associated with	Stanford University, Computer Science Department.

Subjects

Genre	Theses
Genre	Text

Bibliographic information

Statement of responsibility	Vasilis Verroios.
Note	Submitted to the Department of Computer Science.
Thesis	Thesis Ph.D. Stanford University 2018.
Location	electronic resource

Access conditions

License: This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
