Matching and unifying records in a distributed system

Abstract/Contents

Abstract
Entity resolution (ER) is the process of identifying records in a database that refer to the same real-world entity. We consider several aspects of the entity resolution problem: 1) We present multiple methods for efficiently distributing the ER workload across multiple processors and provide guidelines on when to use each method. 2) We explore blocking on multiple criteria at once to reduce processing time while minimizing the loss of accuracy. We propose an efficient technique for performing multiple blocking when the full record set is too large to fit in memory and must therefore be stored on disk. 3) We present new locality-sensitive hashing schemes based on minhash for new data types, including maps, sets with weighted values, and composite data types. These hashing schemes can be used in conjunction with blocking techniques for entity resolution. 4) We describe ER on data with confidences and provide efficient methods for handling the confidences, introducing the concepts of confidence thresholds and domination. 5) We explore how the results of ER algorithms are evaluated against a gold-standard result, and propose a new, configurable distance measure that can be adapted intuitively to any given ER application.

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2010
Issuance monographic
Language English

Creators/Contributors

Associated with Menestrina, David Michael
Associated with Stanford University, Computer Science Department
Primary advisor Garcia-Molina, Hector
Thesis advisor Garcia-Molina, Hector
Thesis advisor Ullman, Jeffrey D, 1942-
Thesis advisor Widom, Jennifer

Subjects

Genre Theses

Bibliographic information

Statement of responsibility David Menestrina.
Note Submitted to the Department of Computer Science.
Thesis Thesis (Ph.D.)--Stanford University, 2010.
Location electronic resource

Access conditions

Copyright
© 2010 by David Michael Menestrina
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
