Locality sensitive hashing for real time and distributed computation

Placeholder Show Content

Abstract/Contents

Abstract
Similarity Search (SS) involves retrieving objects from a data set similar to a query object. In most applications, objects are represented by feature vectors in a high dimensional space. In this setting, Locality sensitive hashing(LSH) has emerged as the method of choice to implement large-scale SS. LSH uses random projections to insert data into a number of hash tables. SS queries are answered by performing a lookup operation in these hash tables. Although LSH is widely used in data mining and machine learning applications, it requires a large number of hash tables in order to guarantee accurate search results. This makes it difficult to implement LSH in high speed real time applications (due to slow query processing) and distributed systems (due to network load), since each query lookup in a hash bucket requires a network call. In this thesis we describe new techniques to implement SS via LSH in these settings. For Real Time SS, we describe a variant of LSH, Ternary Locality Sensitive Hashing (TLSH), for implementing SS on TCAMs (Ternary Content Addressable Memories), a ternary associative memory popular in networking equipments like routers and switches. Our approach critically exploits the wild-card matching capability of TCAMs. Our TLSH-based TCAM implementation has near-linear space complexity and answers SS queries in a single lookup, thus offering an exponential improvement over existing LSH solutions, which have a run time that grows as a fractional power of the data set size and sub quadratic space complexity. This enables efficient implementation of SS in real-time, at the cost of using a hardware primitive, the TCAM. Our solution works around known LSH lower bounds, as well as the (more generic) cell probe model lower bounds for near neighbor and nearest neighbor problems. The simulations and experiments we perform indicate that our proposed architecture can handle millions of SS queries per second. Distributed SS: We design a new implementation of LSH, called Layered LSH, for distributed (Key, Value) based frameworks. Layered LSH offers an exponential improvement over the in the network cost per query incurred by simple distributed implementations of LSH. In contrast with simple implementations whose network cost increases with increase in desired quality of SS, we prove that network costs associated with Layered LSH are independent of the search quality. Our experiments using the MapReduce framework indicate that this improvement in network efficiency leads to large improvements in run-time, in scenarios where SS needs to be implemented with high accuracy on data sets containing tens of millions of data points. We also prove that despite being network efficient (achieved by co-locating points near points on the same machine), Layered LSH sends points that are a constant distance apart to different machines with high likelihood. This shows that Layered LSH hits the right tradeoff between network efficiency and load balance across machines.

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2012
Issuance monographic
Language English

Creators/Contributors

Associated with Shinde, Rajendra Bhimsen
Associated with Stanford University, Department of Computational and Mathematical Engineering.
Primary advisor Goel, Ashish
Thesis advisor Goel, Ashish
Thesis advisor Mahoney, Michael
Thesis advisor Ye, Yinyu
Advisor Mahoney, Michael
Advisor Ye, Yinyu

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Rajendra Bhimsen Shinde.
Note Submitted to the Department of Computational and Mathematical Engineering.
Thesis Ph.D. Stanford University 2012
Location electronic resource

Access conditions

Copyright
© 2012 by Rajendra Bhimsen Shinde
License
This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).

Also listed in

Loading usage metrics...