Locality-Sensitive Hashing (LSH)

Scala LSH project

Similarity Measure

Similarity measure is a real-valued function that quantifies the similarity between two objects. Although no single definition of a similarity measure exists, usually such measures are in some sense the inverse of distance metrics.

  • Cosine similarity
  • Euclidean similarity
  • Nucleotide similarity
  • Amino acid similarity
  • Hamming similarity
  • Jaccard similarity

Types of LSH



  1. Practical and Optimal LSH for Angular Distance
  2. Optimal Data-Dependent Hashing for Approximate Near Neighbors
  3. Beyond Locality Sensitive Hashing
  4. Original LSH algorithm (1999)
  5. Efficient Distributed Locality Sensitive Hashing
  6. Jaccard distance: Mining Massive Data Sets chapter#3
  7. Hamming norm A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. In Proc. of the 25th Intl. Conf. on Very Large Data Bases, VLDB(1999).
  8. Lp norms M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni Locality-Sensitive Hashing Scheme Based on p-Stable Distributions. In Proc. of the 20th ACM Annual http://people.csail.mit.edu/indyk/nips-nn.ps
  9. Cosine distance and Earth movers distance (EMD) M. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In Proc. of the 34th Annual ACM Symposium on Theory of Computing, STOC (2002).
  10. Very Sparse Random Projections Ping Li, T. Hastie and K. W. Church, 2006
  11. Similarity Estimation Techniques from Rounding Algorithms
  12. Random projection Random projection in dimensionality reduction: Applications to image and text data
  13. An Introduction to Sequence Similarity (“Homology”) Searching
  14. Efficient large-scale sequence comparison by locality-sensitive hashing

Finding Nearest Neighbors

Additional Reading

Issues for LSH

  1. SPARK-5992 Locality Sensitive Hashing (LSH) for Spark
  2. spark/pull/15148

Implement Locality Sensitive Hashing along with approximate nearest neighbors and approximate similarity join based on the design doc.

Detailed changes are as follows:

  1. Implement abstract LSH, LSHModel classes as Estimator-Model
  2. Implement approxNearestNeighbors and approxSimilarityJoin in the abstract S.Model
  3. Implement Random Projection as LSH subclass for Euclidean distance, Min a.h for Jaccard Distance
  4. Implement unit test utility methods including checkLshProperty, checkNearestNeighbor and checkSimilarityJoin

Things that will be implemented in a follow-up PR:

  • Bit Sampling for Hamming Distance, SignRandomProjection for Cosine Distance
  • PySpark Integration for the scala classes and methods.