KNN searches in vec0 virtual tables are currently quite limited: e.g. "get the 10 closest vectors to my query", and that's it. No room for pagination, custom distance thresholds, or anything else. Sure, there are hacks you could do (e.g. over-fetch 100 vectors and paginate yourself), but we can do better.
The solution: support custom "constraints" on the distance column.
For example, if you wanted to paginate through KNN queries 10 items at a time, your initial query would be:
select
  rowid,
  distance
from vec_items
where contents_embedding match ?
  and k = 10;
Which is a basic k=10 KNN query. If we assume the 10th item in that result list has a distance value of 0.21, then you should be able to retrieve the next 10 results like so:
select
  rowid,
  distance
from vec_items
where contents_embedding match ?
  and k = 10
  and distance > 0.21;
And say the 20th result has a distance of 0.49, you should be able to replace 0.21 in the above query with 0.49 and get the next 10 results, and so on and so on.
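To make the cursor idea concrete, here's a rough sketch in plain Python. It does not use sqlite-vec at all: `knn_page`, the `vectors` dict, and the 1-D sample data are made up for illustration, and the brute-force scan is only a stand-in for whatever vec0 would do internally.

```python
import heapq
import math


def knn_page(vectors, query, k, min_distance=None):
    """Brute-force KNN that only considers rows with distance > min_distance."""
    candidates = ((math.dist(vec, query), rowid) for rowid, vec in vectors.items())
    if min_distance is not None:
        # The "distance > ?" constraint, applied before picking the top k.
        candidates = (c for c in candidates if c[0] > min_distance)
    # nsmallest returns the k closest remaining rows, sorted by distance.
    return [(rowid, d) for d, rowid in heapq.nsmallest(k, candidates)]


# Hypothetical 1-D data: rowid i sits at position i, and the query sits at 0.
vectors = {i: (float(i),) for i in range(1, 31)}
query = (0.0,)

page1 = knn_page(vectors, query, k=10)
cursor = page1[-1][1]  # distance of the 10th result
page2 = knn_page(vectors, query, k=10, min_distance=cursor)
```

One caveat with this sketch: because the filter is a strict `>`, any rows that tie exactly with the cursor distance but didn't fit on the previous page would be skipped.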
Additionally, if you wanted to retrieve a custom threshold of vector distance, you should be able to do something like:
select
  rowid,
  distance
from vec_items
where contents_embedding match ?
  and k = 10
  and distance between 0.5 and 1.0;
With real-world embeddings, this might have limited use cases, since distance is extremely relative and doesn't always mean "relevance", but the point stands.
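As a rough illustration of the threshold semantics, here's a sketch using Python's built-in `sqlite3` module against a plain table. This is not vec0: the `items` table, its 1-D `x` column, and the `abs(x - ?)` distance are assumptions made up for the example. The only point is that `between` is inclusive on both ends and that the limit applies after the distance filter.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical stand-in table: one 1-D "vector" per row.
db.execute("create table items(id integer primary key, x real)")
db.executemany(
    "insert into items(id, x) values (?, ?)",
    [(i, i * 0.25) for i in range(1, 9)],  # x = 0.25, 0.5, ..., 2.0
)

query_point = 0.0
rows = db.execute(
    """
    select id, distance
    from (select id, abs(x - ?) as distance from items)
    where distance between 0.5 and 1.0
    order by distance
    limit 10
    """,
    (query_point,),
).fetchall()
```

Only three rows fall inside the range here, so fewer than 10 rows come back. A distance-constrained KNN query would presumably behave the same way when the range is narrow.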
The vec0 virtual table should recognize WHERE constraints on the distance column, and apply them during KNN queries.
Side-note: why not OFFSET?
On SQLite version 3.41+, you could do KNN queries like:
select
  rowid,
  distance
from vec_items
where contents_embedding match ?
limit 10;
So why not also support OFFSET for pagination?
For sqlite-vec, using OFFSET wouldn't be very performant. It would require us to essentially perform a KNN query where k=LIMIT + OFFSET, where larger values of OFFSET (i.e. more and more pages in) would require way more memory and compute to work.
However, cursor-based pagination with distance > X would be much faster. We can keep k=LIMIT, pre-filter items at query time, and keep things relatively fast. I think that with this approach, the initial KNN query at page=1 will take the same amount of time as page=1000, since it'll be a brute-force scan anyway.
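The difference can be sketched in a few lines of Python. Everything here is hypothetical (`page_with_offset`, `page_with_cursor`, the precomputed `distances` list) and has nothing to do with how sqlite-vec is actually implemented. It only shows how many rows each strategy has to keep track of to produce the same page.

```python
import heapq


def page_with_offset(distances, limit, offset):
    # Has to find the limit + offset closest rows, then throw away the first `offset`.
    held = limit + offset
    top = heapq.nsmallest(held, distances)
    return top[offset:], held


def page_with_cursor(distances, limit, after):
    # Pre-filters on the cursor, so it only ever needs the `limit` closest rows.
    top = heapq.nsmallest(limit, (d for d in distances if d > after))
    return top, limit


# Hypothetical precomputed distances for 1000 rows.
distances = [i / 100 for i in range(1, 1001)]

# Page 100 at 10 items per page.
offset_page, offset_held = page_with_offset(distances, limit=10, offset=990)

# In practice the cursor would be the last distance on the previous page.
cursor = sorted(distances)[989]
cursor_page, cursor_held = page_with_cursor(distances, limit=10, after=cursor)
```

Both calls return the same 10 distances, but the OFFSET version has to track 1000 rows to get there while the cursor version tracks 10. Both still scan every row.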
Another benefit of cursor over OFFSET: Aaron Francis says so.