Static Vector Search
An idea popped into my head while checking out Pagefind, a search solution for statically generated sites.
Pagefind works by building an index of your site’s content at build time, then using that index to perform searches on the client side. Bandwidth is saved by splitting the index into chunks, so that only the portion relevant to a given query needs to be downloaded.
This inspired me to try building a static vector search solution. I have also been meaning to try running GPU-accelerated code in the browser. For large sites, parallelization and GPU acceleration could greatly improve the user experience, if they turn out to be feasible.
Sketch
There are quite a few Approximate Nearest Neighbor (ANN) algorithms, but the most popular seems to be HNSW (Hierarchical Navigable Small World), which is essentially a skip list in higher dimensions. Unfortunately, each successive layer strictly increases in size, with the bottom layer containing every point, so there is no way to save bandwidth by downloading only a portion of the index with the traditional algorithm.
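To make the bandwidth problem concrete, here is a rough illustration (my own sketch, not HNSW reference code) of expected layer sizes under HNSW's usual geometric level assignment, where a node appears in layer l with probability p^l:

```typescript
// Expected number of nodes present at each HNSW layer, assuming the
// standard geometric level assignment: a node reaches layer l with
// probability p^l (p is a tunable level probability).
function expectedLayerSizes(n: number, p: number): number[] {
  const sizes: number[] = [];
  for (let l = 0; ; l++) {
    const size = Math.round(n * Math.pow(p, l));
    if (size < 1) break;
    sizes.unshift(size); // top layer first
  }
  return sizes;
}

// For a 1M-point index with p = 1/16, layers grow ~16x per level down;
// the bottom layer holds all 1,000,000 points.
console.log(expectedLayerSizes(1_000_000, 1 / 16));
// [1, 15, 244, 3906, 62500, 1000000]
```

Because every search bottoms out in that full-size base layer, a client following the standard algorithm cannot avoid touching arbitrary parts of it.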
Idea: each node in a layer maps to its own fixed-size chunk in the next layer down, containing that node’s neighbourhood.
This allows us to prune the search space by fixing the size of each layer, akin to a B+ tree, but for higher dimensions. This is also amenable to parallelization, as all the nodes in a layer can be compared to the query in parallel.
There are also some build-time optimizations made possible by parallelization, but those will be fleshed out in the future.
Evaluation
I think the main metrics would be recall and precision compared to IVF-PQ, followed by latency and bandwidth usage.
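For reference, recall@k is straightforward to compute once you have brute-force ground truth; a quick sketch (the id-array interface is my own assumption):

```typescript
// Recall@k: of the true k nearest neighbours (found by brute force), what
// fraction did the approximate index return? Both arguments are arrays of
// document ids, best match first.
function recallAtK(approx: number[], exact: number[], k: number): number {
  const truth = new Set(exact.slice(0, k));
  let hits = 0;
  for (const id of approx.slice(0, k)) {
    if (truth.has(id)) hits++;
  }
  return hits / k;
}

console.log(recallAtK([3, 1, 7, 9], [1, 2, 3, 4], 4)); // 2 of 4 found: 0.5
```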
Does it even make sense?
tl;dr: not really.
Generating the embedding for your search query already requires a model or an external call. What is the point of a static vector search if you need a server to handle embedding generation? You might as well do the search on the server side too and return the results. Calling a managed external service doesn’t make sense either, since you would be shipping your API key to the client!
The alternative is to run the embedding model on the client. The smallest one I have seen is over 600 MB, which completely negates the bandwidth savings of the chunked index.
However, I believe a time will come when devices ship with models onboard. We are already seeing moves in this direction with Apple’s Foundation Models Framework and Google’s Built-in AI. This can serve as a proof of concept for when on-device models become more ubiquitous.