Combining algorithms and humans for large-scale data integration
Abstract/Contents
- Abstract
- The number of enterprises that exploit the data they collect in their decision-making processes has been growing rapidly. A fundamental component in achieving this goal is data integration, the process of combining and unifying data from multiple sources. Through data integration, an enterprise can find the underlying types of entities and entity attributes in the collected data, and map the data to those types and attributes. Enterprises expect constant improvements in the quality of the data integration outcome. At the same time, the volume and diversity of the collected data increase at a fast pace, as enterprises decide to monitor and collect data from more of their operations or from other external sources. To satisfy the quality-improvement requirements, human involvement in data integration has been proposed and applied in recent years: the main idea is to use human intelligence in tasks that require the understanding of data semantics or involve data containing images, video, or natural language. In this thesis, we study approaches for different data integration tasks that combine machine algorithms and human intelligence. In particular, we focus on three data integration tasks: entity resolution, top-k item detection, and data aggregation. In entity resolution, for each underlying entity in a set of records (e.g., people's face images), the goal is to create one cluster containing only the records that refer to the same entity (e.g., same person). In top-k item detection, the input is a set of records (e.g., restaurant images) and the goal is to find the top-k records based on some criteria (e.g., how nice the restaurant looks in the image). In data aggregation, the goal is to create a summary of some input content (e.g., a 5-minute summary of a 2-hour movie or a summary of the main points in the reviews for a movie or product). The approaches we develop provide solutions to a spectrum of challenges that hybrid machine-human data-integration approaches face.
Furthermore, our approaches provide significant performance, accuracy, and monetary-cost gains compared to state-of-the-art alternatives. The gains come from efficiently solving a number of fundamental problems for hybrid data-integration approaches: a) combining human errors with machine algorithm errors, b) selecting the best questions to ask humans, c) selecting between different interfaces for human questions, d) managing the resources available for human questions, e) preprocessing data before human curation is applied, and f) creating a context describing the whole dataset, to assist humans in accurately answering questions.
Description
Type of resource | text |
---|---|
Form | electronic resource; remote; computer; online resource |
Extent | 1 online resource. |
Place | California |
Place | [Stanford, California] |
Publisher | [Stanford University] |
Copyright date | 2018; ©2018 |
Publication date | 2018 |
Issuance | monographic |
Language | English |
Creators/Contributors
Author | Verroios, Vasilis |
---|---|
Degree supervisor | Garcia-Molina, Hector |
Thesis advisor | Garcia-Molina, Hector |
Thesis advisor | Bailis, Peter |
Thesis advisor | Ré, Christopher |
Degree committee member | Bailis, Peter |
Degree committee member | Ré, Christopher |
Associated with | Stanford University, Computer Science Department. |
Subjects
Genre | Theses |
---|---|
Genre | Text |
Bibliographic information
Statement of responsibility | Vasilis Verroios. |
---|---|
Note | Submitted to the Department of Computer Science. |
Thesis | Thesis Ph.D. Stanford University 2018. |
Location | electronic resource |
Access conditions
- Copyright
- © 2018 by Vasileios Verroios
- License
- This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC).