Locality-Sensitive Hashing (LSH)
Similarity measure is a real-valued function that quantifies the similarity between two objects. Although no single definition of a similarity measure exists, usually such measures are in some sense the inverse of distance metrics.
- Cosine similarity
- Euclidean similarity
- Nucleotide similarity
- Amino acid similarity
- Hamming similarity
- Jaccard similarity
Types of LSH
- HyperplaneLSH for Cosine Distance
- Super-Bit Locality-Sensitive Hashing for Hamming distance
- Min Hash for Jaccard similarity
- Min-wise independent permutations
- Nilsimsa Hash
- Random projection
- tdebatty/java-LSH A Java implementation of Locality Sensitive Hashing (LSH) MinHash & Super-Bit
- apache/incubator-datafu a collection of libraries for working with large-scale data in Hadoop.
- marufaytekin/lsh-spark HyperplaneLSH for Spark
- soundcloud/cosine-lsh-join-spark Approximate Nearest Neighbors in Spark
- karlhigley/spark-neighbors Spark-based approximate nearest neighbor search using locality-sensitive hashing supports Hamming, Jaccard, Euclidean, and cosine distance.
- rholder/nilsimsa Nilsimsa locality-sensitive hashing algorithm in Java.
- chrisjmccormick/MinHash MinHash Tutorial with Python Code with example to mining documents similarity.
- barneygovan/lsh-scala A Locality-Sensitive Hashing Library for Scala with optional Redis storage.
- treadstone90/Locality-Sensitive-Hashing works only for the text and can support only Jaccard Similarity.
- richwhitjr/DistNN Distributed LSH Implementation in Scala.
- beckgael/Mean-Shift-LSH Distributed Nearest Neighbours Mean Shift with Locality Sensitive Hashing DNNMS-LSH. Scala/Spark implementation.
- ohtaman/LSH C++ implemented MinHash and SimHash.
- JorenSix/TarsosLSH A Java library implementing Locality-sensitive Hashing (LSH), a practical nearest neighbour search algorithm for multidimensional vectors that operates in sublinear time.
- Practical and Optimal LSH for Angular Distance
- Optimal Data-Dependent Hashing for Approximate Near Neighbors
- Beyond Locality Sensitive Hashing
- Original LSH algorithm (1999)
- Efficient Distributed Locality Sensitive Hashing
- Jaccard distance: Mining Massive Data Sets chapter#3
- Hamming norm A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. In Proc. of the 25th Intl. Conf. on Very Large Data Bases, VLDB(1999).
- Lp norms M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni Locality-Sensitive Hashing Scheme Based on p-Stable Distributions. In Proc. of the 20th ACM Annual http://people.csail.mit.edu/indyk/nips-nn.ps
- Cosine distance and Earth movers distance (EMD) M. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In Proc. of the 34th Annual ACM Symposium on Theory of Computing, STOC (2002).
- Very Sparse Random Projections Ping Li, T. Hastie and K. W. Church, 2006
- Similarity Estimation Techniques from Rounding Algorithms
- Random projection Random projection in dimensionality reduction: Applications to image and text data
- An Introduction to Sequence Similarity (“Homology”) Searching
- Efficient large-scale sequence comparison by locality-sensitive hashing
Finding Nearest Neighbors
Issues for LSH
Implement Locality Sensitive Hashing along with approximate nearest neighbors and approximate similarity join based on the design doc.
Detailed changes are as follows:
- Implement abstract LSH, LSHModel classes as Estimator-Model
- Implement approxNearestNeighbors and approxSimilarityJoin in the abstract S.Model
- Implement Random Projection as LSH subclass for Euclidean distance, Min a.h for Jaccard Distance
- Implement unit test utility methods including checkLshProperty, checkNearestNeighbor and checkSimilarityJoin
Things that will be implemented in a follow-up PR:
- Bit Sampling for Hamming Distance, SignRandomProjection for Cosine Distance
- PySpark Integration for the scala classes and methods.