Locality-Sensitive Hashing (LSH)
Scala LSH project
Similarity measure is a real-valued function that quantifies the similarity between two objects. Although no single definition of a similarity measure exists, usually such measures are in some sense the inverse of distance metrics.
- Cosine similarity
- Euclidean similarity
- Nucleotide similarity
- Amino acid similarity
- Hamming similarity
- Jaccard similarity
Types of LSH
- Practical and Optimal LSH for Angular Distance
- Optimal Data-Dependent Hashing for Approximate Near Neighbors
- Beyond Locality Sensitive Hashing
- Original LSH algorithm (1999)
- Efficient Distributed Locality Sensitive Hashing
- Jaccard distance: Mining Massive Data Sets chapter#3
- Hamming norm A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. In Proc. of the 25th Intl. Conf. on Very Large Data Bases, VLDB(1999).
- Lp norms M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni Locality-Sensitive Hashing Scheme Based on p-Stable Distributions. In Proc. of the 20th ACM Annual
- Cosine distance and Earth movers distance (EMD) M. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In Proc. of the 34th Annual ACM Symposium on Theory of Computing, STOC (2002).
- Very Sparse Random Projections Ping Li, T. Hastie and K. W. Church, 2006
- Similarity Estimation Techniques from Rounding
- Random projection Random projection in dimensionality reduction: Applications to image and text data
- An Introduction to Sequence Similarity (“Homology”) Searching
- Efficient large-scale sequence comparison by locality-sensitive hashing
Finding Nearest Neighbors
Issues for LSH
- SPARK-5992 Locality Sensitive Hashing (LSH) for Spark
Implement Locality Sensitive Hashing along with approximate nearest neighbors and approximate similarity join based on the design doc.
Detailed changes are as follows:
- Implement abstract LSH, LSHModel classes as Estimator-Model
- Implement approxNearestNeighbors and approxSimilarityJoin in the abstract S.Model
- Implement Random Projection as LSH subclass for Euclidean distance, Min a.h for Jaccard Distance
- Implement unit test utility methods including checkLshProperty, checkNearestNeighbor and checkSimilarityJoin
Things that will be implemented in a follow-up PR:
- Bit Sampling for Hamming Distance, SignRandomProjection for Cosine Distance
- PySpark Integration for the scala classes and methods.