Matching and unifying records in a distributed system
Abstract/Contents
- Abstract
- Entity resolution (ER) is the process of identifying records in a database that refer to the same real-world entity. We consider several aspects of the entity resolution problem: 1) We present multiple methods for efficiently distributing the ER workload across multiple processors and provide guidelines on when to use each method. 2) We explore blocking on multiple criteria at once to reduce processing time while minimizing the loss of accuracy. We propose an efficient technique for performing multiple blocking when the full record set is too large to fit in memory and must therefore be stored on disk. 3) We present new locality-sensitive hashing schemes based on minhash for new data types, including maps, sets with weighted values, and composite data types. These hashing schemes can be used in conjunction with blocking techniques for entity resolution. 4) We describe ER on data with confidences and provide efficient methods for handling the confidences, introducing the concepts of confidence thresholds and domination. 5) We explore how the results of ER algorithms are evaluated against a gold-standard result, and propose a new, configurable distance measure that provides an intuitive way to adapt the measure to any given ER application.
Description
Type of resource | text |
---|---|
Form | electronic; electronic resource; remote |
Extent | 1 online resource. |
Publication date | 2010 |
Issuance | monographic |
Language | English |
Creators/Contributors
Associated with | Menestrina, David Michael |
---|---|
Associated with | Stanford University, Computer Science Department |
Primary advisor | Garcia-Molina, Hector |
Thesis advisor | Garcia-Molina, Hector |
Thesis advisor | Ullman, Jeffrey D, 1942- |
Thesis advisor | Widom, Jennifer |
Advisor | Ullman, Jeffrey D, 1942- |
Advisor | Widom, Jennifer |
Subjects
Genre | Theses |
---|---|
Bibliographic information
Statement of responsibility | David Menestrina. |
---|---|
Note | Submitted to the Department of Computer Science. |
Thesis | Thesis (Ph.D.)--Stanford University, 2010. |
Location | electronic resource |
Access conditions
- Copyright
- © 2010 by David Michael Menestrina
- License
- This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).