math - cosine similarity, LSH, and random hyperplanes
I have read a few solutions for nearest-neighbor search in high dimensions using random hyperplanes, but I am still confused about how the buckets work. I have 100 million documents, each represented as a 100-dimensional vector, and 1 million queries. For each query, I need to find the nearest neighbor based on cosine similarity. The brute-force approach computes the cosine
similarity between the query and all 100 million documents and selects the ones whose value is close to 1. I am struggling with the concept of how random hyperplanes can put documents into buckets so that I don't have to compute the cosine
similarity 100 million times for each query.
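For context, the brute-force baseline the question describes can be sketched as follows in NumPy. This is only an illustration at a much smaller scale (10,000 documents instead of 100 million); the array names `docs` and `query` are my own:

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.standard_normal((10_000, 100))   # stand-in for the 100M document vectors
query = rng.standard_normal(100)            # one of the 1M query vectors

# Normalize once so that cosine similarity reduces to a plain dot product.
docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
query_unit = query / np.linalg.norm(query)

sims = docs_unit @ query_unit               # one cosine value per document
best = int(np.argmax(sims))                 # index of the nearest neighbor
```

The cost is one full pass over all documents per query, which is exactly what LSH is meant to avoid.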
Think of it geometrically. Imagine your data points in a high-dimensional space;
to build intuition, picture the same setup in 2D or 3D and create some random hyperplanes (just planes in a higher dimension).
These hyperplanes cut the data into partitions: every point falls on one side or the other of each hyperplane, so each point lands in exactly one partition, and nearby points tend to land in the same one (this is a rough approximation, not a guarantee).
The buckets are then populated according to the partitions formed by the hyperplanes: each point is hashed to the bucket matching its pattern of sides. As a result, every bucket contains far fewer points than the total size of the point set (because each of the partitions described above does).
Consequently, when you pose a query, you hash it the same way and, with the help of the buckets, check far fewer points than the total. That is the gain here: checking fewer points is faster than the brute-force approach, which checks all of them.
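The scheme above can be sketched as a minimal random-hyperplane LSH in NumPy. The number of hyperplanes (16), the dataset size, and the names `signature` and `buckets` are my own choices for illustration, not part of the question:

```python
import numpy as np

rng = np.random.default_rng(42)
dim, n_docs, n_planes = 100, 10_000, 16

docs = rng.standard_normal((n_docs, dim))
planes = rng.standard_normal((n_planes, dim))  # each row is a random hyperplane's normal vector

def signature(vectors):
    """Which side of each hyperplane a vector falls on, packed into one integer."""
    sides = (vectors @ planes.T) >= 0          # boolean matrix of shape (n, n_planes)
    return sides @ (1 << np.arange(n_planes))  # interpret each row of booleans as a bit pattern

# Populate the buckets: documents sharing a signature land in the same bucket.
buckets = {}
for i, sig in enumerate(signature(docs)):
    buckets.setdefault(int(sig), []).append(i)

def query_nn(q):
    """Check only the candidates in the query's bucket, not all n_docs points."""
    candidates = buckets.get(int(signature(q[None, :])[0]), [])
    if not candidates:
        return None
    q_unit = q / np.linalg.norm(q)
    cand = docs[candidates]
    sims = (cand / np.linalg.norm(cand, axis=1, keepdims=True)) @ q_unit
    return candidates[int(np.argmax(sims))]
```

In practice a single table like this can separate a query from its true nearest neighbor, so real implementations build several independent tables (several sets of random hyperplanes) and take the union of the matched buckets as the candidate set.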