Static Vector Search

An idea popped into my head while checking out this search solution for statically generated sites: PageFind.

Pagefind works by building an index of your site’s content at build time, and then uses that index to perform searches on the client side. Data bandwidth is saved by using an ordered index, so that only a portion of the index needs to be downloaded, based on metadata.

This inspired me to try building a static vector search solution, and by extension, hybrid search. Big tech companies are going in the direction of offloading LLM computation into on-board devices. We see some direction of this with Apple’s Foundation Models Framework and Google’s Built-in AI in Google Chrome,(2). So in the same way PageFind fulfills the niche of a good search experience for static sites, there should also be a usecase for simple semantic queries, powered by small, on-device models. (Maybe user guides, documentation)

Sketch

There are quite a few Approximate Nearest Neighbor (ANN) algorithms, but the approaches generally are split into:

Pruning the graph search space (IVF, HNSW)
Compressing vector representations to make it cheaper to perform search (DISKANN)

The approaches are not mutually exclusive, but one can imagine that stacking them will introduce more approximations and therefore a lower recall.

Idea: In our case, the client-server situation lends very nicely to the memory-disk interaction in DISKANN. One can simply treat the resource requests to the server as very expensive disk reads. With that, the implementation falls straight into our laps. On load, the client would obtain a compressed, quantized version of the vector graph. With the query vector, we perform a beam search to obtain some top K document vectors, and then request the server for full uncompressed values of said vectors and perform reranking.

On the server side however, we need to take care of model versioning, since vector embeddings are not interchangable between models. This means that we need a new copy of the vector graph for every embedding model that we wish to support.

A proof of concept would simply be a naive implementation of the DISKANN algorithm as described above. We can also incorporate regular full-text search to achieve hybrid search directly. The aggregration and rerank can be a simple RRF scoring, although reranking with a small embedded LLM should also be trivial (at the point of writing Chrome already ships with a chat model but not an embedding one).

Ending notes

This is a good idea under the assumption that the browser ships with the models. If this does not happen however, a bandwidth-conscious hybrid search library is a non-sequitor. Small embedding models I have seen are ~600MB, which completely negates any claim of bounding bandwidth. Not to mention AI reranking after RRF, which requires another model.

However, I predict that we will reach a point where on-device models become more ubiquitous, which makes the idea eventually feasible. In the meantime, the user should be prompted for a request to download a small model instead of doing it silently.

References

https://github.com/tobi/qmd