vec0: Support constraints on `distance` column for pagination/custom thresholds #165

asg017 · 2025-01-09T23:51:27Z

KNN searches in vec0 virtual tables are currently quite limited: e.g. "get the 10 closest vectors to my query", and that's it. There's no room for pagination, custom distance thresholds, or anything else. Sure, there are hacks you could do (e.g. over-fetch 100 vectors and paginate yourself), but we can do better.

The solution: support custom "constraints" on the distance column.

For example, if you wanted to paginate through KNN queries 10 items at a time, your initial query would be:

select
  rowid,
  distance
from vec_items
where contents_embedding match ?
  and k = 10;

Which is a basic k=10 KNN query. If we assume the 10th item in that result list has a distance value of 0.21, then you should be able to retrieve the next 10 results like so:

select
  rowid,
  distance
from vec_items
where contents_embedding match ?
  and k = 10
  and distance > 0.21;

And say the 20th result has a distance of 0.49, you should be able to replace 0.21 in the above query with 0.49 and get the next 10 results, and so on and so on.
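To make the intended semantics concrete, here is a minimal pure-Python sketch of distance-cursor pagination over a brute-force scan. It does not use sqlite-vec at all: `knn`, `l2`, and the `items` dictionary are hypothetical stand-ins for the vec0 table and its KNN query, and the `min_distance` argument plays the role of the proposed `distance > ?` constraint.

```python
import heapq
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn(items, query, k, min_distance=None):
    """Brute-force KNN over {rowid: vector}.

    If min_distance is given, only rows with distance > min_distance are
    considered, mimicking `and distance > ?` in the proposed SQL.
    """
    scored = ((l2(vec, query), rowid) for rowid, vec in items.items())
    if min_distance is not None:
        scored = (pair for pair in scored if pair[0] > min_distance)
    # nsmallest returns the k closest (distance, rowid) pairs in ascending order.
    return [(rowid, dist) for dist, rowid in heapq.nsmallest(k, scored)]


# Toy data (an assumption for illustration): 25 one-dimensional vectors,
# so the row with rowid i sits at distance i from the query.
items = {i: [float(i)] for i in range(1, 26)}
query = [0.0]

page1 = knn(items, query, k=10)
# Feed the last distance of each page back in as the cursor for the next one.
page2 = knn(items, query, k=10, min_distance=page1[-1][1])
page3 = knn(items, query, k=10, min_distance=page2[-1][1])
```

One caveat in this sketch: because the cursor comparison is strict, rows that tie exactly with the last distance of a page would be skipped on the next page. A real implementation might need a tie-breaker such as rowid.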

Additionally, if you wanted to retrieve a custom threshold of vector distance, you should be able to do something like:

select
  rowid,
  distance
from vec_items
where contents_embedding match ?
  and k = 10
  and distance between 0.5 and 1.0;

With real-world embeddings, this might have limited use cases, since distance is extremely relative and doesn't always mean "relevance", but the point stands.
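As a rough illustration of what such a range constraint would return, the sketch below filters precomputed distances with inclusive bounds (SQL `BETWEEN` is inclusive on both ends) and then takes at most `k` of the closest survivors. The function name `knn_in_range` and the sample distances are made up for this example and are not part of sqlite-vec.

```python
import heapq


def knn_in_range(distances, k, lo, hi):
    """Return up to k (rowid, distance) pairs with lo <= distance <= hi.

    `distances` maps rowid -> precomputed distance to the query vector.
    """
    in_range = ((d, rowid) for rowid, d in distances.items() if lo <= d <= hi)
    return [(rowid, d) for d, rowid in heapq.nsmallest(k, in_range)]


# Hypothetical distances for five rows.
distances = {1: 0.12, 2: 0.50, 3: 0.73, 4: 1.00, 5: 1.40}
result = knn_in_range(distances, k=10, lo=0.5, hi=1.0)
```

Here `k` acts only as an upper bound: three rows fall inside the range, so three rows come back even though `k = 10`.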

The vec0 virtual table should recognize WHERE constraints on the distance column, and apply them during KNN queries.

Side-note: why not `OFFSET`?

On SQLite version 3.41+, you could do KNN queries like:

select
  rowid,
  distance
from vec_items
where contents_embedding match ?
limit 10;

So why not also support OFFSET for pagination?

For sqlite-vec, using OFFSET wouldn't be very performant. It would require us to essentially perform a KNN query where k = LIMIT + OFFSET, so larger values of OFFSET (i.e., more and more pages in) would require way more memory and compute to work.

However, cursor-based pagination with distance > X would be much faster. We can keep k=LIMIT, pre-filter items at query time, and keep things relatively fast. I think that with this approach, the KNN query at page=1 will take the same amount of time as the one at page=1000, since it'll be a brute-force scan anyway.
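The difference between the two strategies can be sketched in a few lines of Python. This is only a model of the argument, not sqlite-vec's implementation: `offset_page` and `cursor_page` are hypothetical helpers, and each returns the `k` it had to rank alongside the page so the two can be compared.

```python
import heapq


def offset_page(distances, limit, offset):
    """OFFSET-style paging: rank the first limit + offset rows, then drop the prefix."""
    k = limit + offset  # grows with every page
    top = heapq.nsmallest(k, ((d, rowid) for rowid, d in distances.items()))
    return [rowid for _, rowid in top[offset:]], k


def cursor_page(distances, limit, after):
    """Cursor-style paging: pre-filter on distance > after, then rank only `limit` rows."""
    k = limit  # constant regardless of page number
    candidates = ((d, rowid) for rowid, d in distances.items() if d > after)
    top = heapq.nsmallest(k, candidates)
    return [rowid for _, rowid in top], k


# Toy data (an assumption): 1000 rows with distinct distances 0.01 .. 10.00.
distances = {rowid: rowid / 100 for rowid in range(1, 1001)}

# Page 50 at 10 rows per page, fetched both ways.
rows_offset, k_offset = offset_page(distances, limit=10, offset=490)
rows_cursor, k_cursor = cursor_page(distances, limit=10, after=distances[490])
```

Both calls scan every row once, but the OFFSET variant has to keep 500 candidates ranked to produce 10 results, while the cursor variant never tracks more than 10.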

Another benefit of cursor vs OFFSET: Aaron Francis says so

asg017 mentioned this issue Jan 11, 2025

WIP: Support constaints on distance column in KNN queries, for pagination and range queries #166