vec0: Support constraints on distance column for pagination/custom thresholds #165

Open
asg017 opened this issue Jan 9, 2025 · 0 comments

KNN searches in vec0 virtual tables are currently quite limited: e.g. "get the 10 closest vectors to my query", and that's it. There's no room for pagination, custom distance thresholds, or anything else. Sure, there are hacks you could do (e.g. over-fetch 100 vectors and paginate yourself), but we can do better.

The solution - support custom "constraints" on the distance column.

For example, if you wanted to paginate through KNN queries 10 items at a time, your initial query would be:

select
  rowid,
  distance
from vec_items
where contents_embedding match ?
  and k = 10;

Which is a basic k=10 KNN query. If we assume the 10th item in that result list has a distance value of 0.21, then you should be able to retrieve the next 10 results like so:

select
  rowid,
  distance
from vec_items
where contents_embedding match ?
  and k = 10
  and distance > 0.21;

And say the 20th result has a distance of 0.49, you should be able to replace 0.21 in the above query with 0.49 and get the next 10 results, and so on and so on.
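To make the paging flow concrete, here is a minimal pure-Python sketch of it. It does not use sqlite-vec at all: `knn` and `paginate` are hypothetical helpers that stand in for the proposed `k = 10 and distance > ?` query, using a brute-force scan over an in-memory dict.

```python
import heapq
import math


def knn(vectors, query, k, min_distance=None):
    """Brute-force KNN: the k closest rows whose distance is > min_distance."""
    scored = []
    for rowid, vec in vectors.items():
        d = math.dist(vec, query)
        # Stand-in for the proposed `distance > ?` constraint.
        if min_distance is not None and d <= min_distance:
            continue
        scored.append((d, rowid))
    return heapq.nsmallest(k, scored)


def paginate(vectors, query, page_size):
    """Yield pages of (distance, rowid), using the last distance as the cursor."""
    cursor = None
    while True:
        page = knn(vectors, query, page_size, cursor)
        if not page:
            return
        yield page
        cursor = page[-1][0]  # distance of the last row on this page


# 25 toy vectors at distances 1.0 .. 25.0 from the origin.
vectors = {i: (float(i), 0.0) for i in range(1, 26)}
pages = list(paginate(vectors, (0.0, 0.0), 10))
```

One caveat with a strict `>` cursor: rows tied at exactly the cursor distance would be skipped, so a real implementation would probably want a tie-breaker such as rowid.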

Additionally, if you wanted to retrieve only vectors within a custom distance range, you should be able to do something like:

select
  rowid,
  distance
from vec_items
where contents_embedding match ?
  and k = 10
  and distance between 0.5 and 1.0;

With real-world embeddings, this might have limited use-cases since distance is extremely relative and doesn't always mean "relevance", but the point stands.

The vec0 virtual table should recognize WHERE constraints on the distance column, and apply them during KNN queries.
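As a rough illustration of what "applying constraints during the KNN query" could mean, the sketch below filters candidates by a list of `(operator, bound)` pairs while scanning, then keeps the `k` closest survivors. This is an assumption about the general shape, not sqlite-vec's implementation: `knn_with_constraints` and `OPS` are hypothetical names, and `between 0.5 and 1.0` is expressed as the equivalent pair `>= 0.5` and `<= 1.0`.

```python
import heapq
import math
import operator

# Comparison operators a distance constraint might use (hypothetical mapping).
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


def knn_with_constraints(vectors, query, k, constraints=()):
    """Brute-force KNN that only considers rows satisfying every constraint."""
    candidates = []
    for rowid, vec in vectors.items():
        d = math.dist(vec, query)
        # Pre-filter: drop the row unless all distance constraints hold.
        if all(OPS[op](d, bound) for op, bound in constraints):
            candidates.append((d, rowid))
    return heapq.nsmallest(k, candidates)


# Toy data: rows 1..5 sit at distances 0.25, 0.5, 0.75, 1.0, 1.25 from the origin.
vectors = {
    1: (0.25, 0.0),
    2: (0.5, 0.0),
    3: (0.75, 0.0),
    4: (1.0, 0.0),
    5: (1.25, 0.0),
}
in_range = knn_with_constraints(
    vectors, (0.0, 0.0), k=10, constraints=[(">=", 0.5), ("<=", 1.0)]
)
```

Note that `k` still caps the result: with a range constraint, fewer than `k` rows may come back if not enough rows fall inside the range.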

Side-note: why not OFFSET?

On SQLite version 3.41+, you could do KNN queries like:

select
  rowid,
  distance
from vec_items
where contents_embedding match ?
limit 10;

So why not also support OFFSET for pagination?

For sqlite-vec, using OFFSET wouldn't be very performant. It would require us to essentially perform a KNN query where k=LIMIT + OFFSET, where larger values of OFFSET (i.e., more and more pages in) would require way more memory and compute to work.

However, cursor-based pagination with distance > X would be much faster. We can keep k=LIMIT, pre-filter items at query time, and keep things relatively fast. I think that with this approach, the KNN query at page=1 will take about the same amount of time as the one at page=1000, since it'll be a brute-force scan anyway.
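The difference can be sketched with a bounded max-heap top-k, which is a common way to do brute-force KNN (an assumption here, not a description of sqlite-vec's internals). The hypothetical helpers below report the peak heap size: the OFFSET variant has to hold LIMIT + OFFSET candidates, while the cursor variant never holds more than LIMIT.

```python
import heapq


def top_k(distances, k, min_distance=None):
    """Return (k smallest distances, peak heap size) via a bounded max-heap."""
    heap = []  # negated distances, so heap[0] is the worst of the current best k
    peak = 0
    for d in distances:
        # Cursor pre-filter: skip anything at or before the cursor distance.
        if min_distance is not None and d <= min_distance:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -d)
        elif d < -heap[0]:
            heapq.heapreplace(heap, -d)
        peak = max(peak, len(heap))
    return sorted(-x for x in heap), peak


def page_with_offset(distances, limit, offset):
    """OFFSET-style paging: fetch limit + offset results, discard the first offset."""
    best, peak = top_k(distances, limit + offset)
    return best[offset:], peak


def page_with_cursor(distances, limit, cursor):
    """Cursor-style paging: keep k = limit and filter on distance > cursor."""
    return top_k(distances, limit, cursor)


# 1000 toy distances, scanned in reverse order; fetch "page 50" both ways.
distances = [float(i) for i in range(1000, 0, -1)]
offset_page, offset_peak = page_with_offset(distances, limit=10, offset=490)
cursor_page, cursor_peak = page_with_cursor(distances, limit=10, cursor=490.0)
```

Both calls return the same ten distances, but in this sketch the OFFSET variant peaks at 500 heap entries versus 10 for the cursor variant. Either way the scan still visits every row, which is why the cursor query should cost roughly the same on any page.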

Another benefit of cursor vs OFFSET: Aaron Francis says so
