Approximate nearest neighbor methods and vector models


https://www.slideshare.net/erikbern/approximate-nearest-neighbor-methods-and-vector-models-nyc-ml-meetup
ソース:spotifyエンジニアリングリーダースライド
  • Start with high dimensional data
  • Run dimension reduction to 10-1000 dims
  • Do stuff in a small dimensional space
  • Annoy


  • Buliding an Annoy index
  • start with the point set
  • split it in 2 halves
  • split again (until k items in each leaf, takes n/k memory instead n)
  • binary tree

  • Search
  • the closest isn't necessarily in the same leaf of the binary tree
  • 2 points that are really close may end up on different sides of split
  • → Go both sides of a split if it's close
  • Tricks
  • query structure
  • use priority queue to search all trees until we've k items
  • take union and remove dupliates
  • compute distance for remaining items
  • return NN items
  • pip install --user annoy
    pip install pynndescent