
[FEATURE] Integrate Jvector engine as another vector engine of choice #2386

Open
sam-herman opened this issue Jan 14, 2025 · 14 comments · May be fixed by #2505
@sam-herman
Contributor

sam-herman commented Jan 14, 2025

Is your feature request related to a problem?
Currently k-NN plugin supports 3 engines, Nmslib, Faiss and Lucene.
In this change I would like to integrate JVector as another engine of choice.
There are a number of unique advantages to doing so:

  1. Disk ANN - JVector is capable of performing search without loading the entire index into RAM. This functionality is not available today through Lucene, and JVector provides it without native dependencies (FAISS) or the cumbersome JNI mechanism.
  2. Thread Safety - JVector is a thread-safe index that supports concurrent modifications and inserts with near-perfect scalability as you add cores. Lucene is not thread-safe; OpenSearch works around this with multiple segments, but then has to compact them, so insert performance still suffers (and I believe you can't read from a Lucene segment during construction).
  3. Quantized index construction - JVector can build the index with quantized vectors, saving memory; less memory means larger segments, fewer segments, and faster searches.
  4. Quantized Disk ANN - JVector supports DiskANN-style quantization with rerank. It's quite easy (in principle) to demonstrate that this makes a massive difference in performance for larger-than-memory indexes (in practice it takes days/weeks to insert enough vectors into Lucene to show this because of the single-threaded problem; that's the only hard part).
  5. PQ and BQ support - As part of (3), JVector supports PQ as well as the BQ that Lucene offers. This seems fairly rare (pgvector doesn't do PQ either) because (1) the code required to get high-performance ADC with SIMD is a bit involved, and (2) it requires a separate codebook, which Lucene isn't set up to easily accommodate. PQ at 64x compression gives you higher relevance than BQ at 32x.
  6. Fused ADC - Features that nobody else has, like Fused ADC, NVQ, and Anisotropic PQ.
  7. Compatibility - JVector is compatible with Cassandra, which makes it easier to transfer vector-encoded data from Cassandra to OpenSearch and vice versa.
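As a back-of-the-envelope illustration of the memory arithmetic behind points 3-5 (a sketch with illustrative numbers only — 768 dimensions, one-byte PQ codes — not JVector's actual defaults):

```java
public class CompressionMath {
    public static void main(String[] args) {
        int dims = 768;                      // illustrative embedding dimension
        int fullBytes = dims * Float.BYTES;  // float32 baseline: 3072 bytes

        int bqBytes = dims / 8;              // BQ keeps 1 bit per dimension -> 96 bytes
        int pqBytes = fullBytes / 64;        // PQ at 64x -> 48 one-byte subspace codes

        System.out.println("full=" + fullBytes + "B"
            + " bq=" + bqBytes + "B (" + (fullBytes / bqBytes) + "x)"
            + " pq=" + pqBytes + "B (" + (fullBytes / pqBytes) + "x)");
    }
}
```

At equal compression budgets, the vector that fits in 48 bytes under PQ would need 96 bytes under BQ, which is the memory-per-segment argument made above.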

What solution would you like?
Introduce JVector into K-NN plugin as another supported engine.

**Benchmarks**
Will be adding some benchmarks to illustrate the above advantages...

What alternatives have you considered?
NA

@jmazanec15
Member

Thanks @sam-herman. JVector looks very cool and would be a good addition. To add this, I think we need to clean up and formalize the engine abstractions/interfaces to make vector engines truly extensible. We have done this to some degree, but definitely need to go further.

@sam-herman
Contributor Author

@jmazanec15 can you share the plans around this? FYI, I will probably have a POC ready in about a week with JVector integrated in k-NN. Let me know the best issue/RFC to contribute the intended design to.

@jmazanec15
Member

There aren't any formal plans, but there is a trail of issues:

  1. [FEATURE] Is possible to add new KNNs using microservices? That broadens the KNNs that can be used as backends #1328
  2. Move FaissService and NmslibService in jni directory and make proper sub-gradle project #1540
  3. [FEATURE] Move Lucene Vector field and HNSW KNN Search as a first class feature in core  #1467
  4. [FEATURE] Support ScaNN/FastScan and re-ranking for faiss Engine #1347

I think the long-term goal is to provide an interface for new engines and a clean separation between the core k-NN plugin and the different engines. We started doing that in 2.17 with https://github.com/opensearch-project/k-NN/tree/main/src/main/java/org/opensearch/knn/index/engine. There's still work to do, but the idea is to get away from branching on engines.

@sam-herman
Contributor Author

sam-herman commented Jan 21, 2025

Thanks for sharing those.
For JVector, the immediate straightforward path I see is to add it as an engine that extends the JVMLibrary interface, the same as the Lucene KnnEngine does. Currently I don't see any substantial difference between the two in terms of plugin interface requirements, so my suggestion is to add JVector with the existing abstractions; this should be orthogonal to future refactoring of the interfaces.

@sam-herman
Contributor Author

@navneet1v @jmazanec15 I added more bullet points to the "added value" category in the description. I'm hoping to add the output of a few benchmarks to support those pretty soon, but hopefully for the time being they provide some qualitative measure.
Before adding the benchmarks I will provide a benchmark plan for your review; in the meantime, if you have any suggestions you think can help, let me know.

@navneet1v
Collaborator

@sam-herman thanks for providing the details. Just to be clear, I am not against adding JVector to the k-NN plugin; I think in the initial integration it should be an optional vector engine.

Please find my response below:

Disk ANN - JVector is capable of performing search without loading the entire index into RAM. This functionality is not available today through Lucene, and JVector provides it without native dependencies (FAISS) or the cumbersome JNI mechanism.

The k-NN plugin is very much involved/invested in FAISS and native code (BTW, Lucene is also coming up with a Faiss integration: https://github.com/apache/lucene/pull/14178/files). Even though JVector has these benefits, from a k-NN plugin standpoint Faiss already gives us all the different quantization options, including 32x-compression-based search. So JVector is not just a drop-in replacement.

Thread Safety - JVector is a thread-safe index that supports concurrent modifications and inserts with near-perfect scalability as you add cores. Lucene is not thread-safe; OpenSearch works around this with multiple segments, but then has to compact them, so insert performance still suffers (and I believe you can't read from a Lucene segment during construction).

This is really good, but Faiss/Lucene HNSW indices are created at the segment level, and segments become searchable only when a refresh/flush on the IndexReaders completes. We don't run into scenarios where we need to search during indexing, so this capability might not even be used.

Quantized index construction - JVector can build the index with quantized vectors, saving memory; less memory means larger segments, fewer segments, and faster searches.

In the k-NN plugin we currently support 1x, 2x, 4x, 8x, 16x, 32x and higher quantization levels. So quantization as a feature is present; I would like to know more about what JVector adds here.
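For concreteness, the binary end of those compression levels (32x) boils down to keeping one sign bit per float32 dimension and comparing codes by Hamming distance. A minimal sketch — not the plugin's actual quantization framework:

```java
public class BinaryQuantization {
    // Pack each dimension's sign bit: >= 0 -> 1, < 0 -> 0 (32x compression vs float32)
    static long[] quantize(float[] v) {
        long[] bits = new long[(v.length + 63) / 64];
        for (int i = 0; i < v.length; i++) {
            if (v[i] >= 0) bits[i >> 6] |= 1L << (i & 63);
        }
        return bits;
    }

    // Hamming distance is the cheap approximate distance between binary codes
    static int hamming(long[] a, long[] b) {
        int d = 0;
        for (int i = 0; i < a.length; i++) d += Long.bitCount(a[i] ^ b[i]);
        return d;
    }

    public static void main(String[] args) {
        float[] q = { 0.9f, -0.2f, 0.1f, -0.7f };
        float[] x = { 0.8f, -0.1f, -0.3f, -0.5f };
        // The two vectors disagree in sign only in dimension 2
        System.out.println(hamming(quantize(q), quantize(x))); // 1
    }
}
```

PQ differs from this in that it learns a codebook per subspace rather than taking raw sign bits, which is why it can reach 64x with better relevance.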

Quantized Disk ANN - JVector supports DiskANN-style quantization with rerank. It's quite easy (in principle) to demonstrate that this makes a massive difference in performance for larger-than-memory indexes (in practice it takes days/weeks to insert enough vectors into Lucene to show this because of the single-threaded problem; that's the only hard part).

In version 2.17 of the k-NN plugin we launched disk-based vector search support (https://opensearch.org/docs/latest/search-plugins/knn/disk-based-vector-search/), which has quantization and re-ranking support. I think we should compare the performance of k-NN disk-based vector search with JVector's DiskANN implementation. This will help us make an informed decision.
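The quantize-then-rerank pattern that both disk-based approaches share can be sketched as a toy in-memory version (sign agreement stands in for real PQ/BQ code distances; neither implementation works exactly like this):

```java
import java.util.Arrays;
import java.util.Comparator;

public class QuantizeRerank {
    static float l2(float[] a, float[] b) {
        float s = 0;
        for (int i = 0; i < a.length; i++) { float d = a[i] - b[i]; s += d * d; }
        return s;
    }

    // Crude "quantized" distance: count sign disagreements (stand-in for PQ/BQ codes)
    static int approxDist(float[] a, float[] b) {
        int d = 0;
        for (int i = 0; i < a.length; i++) if ((a[i] >= 0) != (b[i] >= 0)) d++;
        return d;
    }

    // Phase 1: rank everything by the cheap in-memory code distance.
    // Phase 2: rerank a small candidate pool with full-precision distances,
    // which on disk would mean a few random reads instead of a full scan.
    static int search(float[] query, float[][] vectors, int pool) {
        Integer[] ids = new Integer[vectors.length];
        for (int i = 0; i < ids.length; i++) ids[i] = i;
        Arrays.sort(ids, Comparator.comparingInt(i -> approxDist(query, vectors[i])));
        int best = ids[0];
        for (int i = 0; i < Math.min(pool, ids.length); i++)
            if (l2(query, vectors[ids[i]]) < l2(query, vectors[best])) best = ids[i];
        return best;
    }

    public static void main(String[] args) {
        float[][] data = { {1f, 1f}, {1f, 0.85f}, {-1f, 1f} };
        System.out.println(search(new float[]{1f, 0.9f}, data, 2)); // 1
    }
}
```

A benchmark comparison would mostly be measuring how good each system's phase-1 codes are and how few phase-2 reads it needs.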

PQ and BQ support - As part of (3), JVector supports PQ as well as the BQ that Lucene offers. This seems fairly rare (pgvector doesn't do PQ either) because (1) the code required to get high-performance ADC with SIMD is a bit involved, and (2) it requires a separate codebook, which Lucene isn't set up to easily accommodate. PQ at 64x compression gives you higher relevance than BQ at 32x.

With Faiss and the new quantization framework, support for both PQ and BQ is present in the k-NN plugin, but some techniques like ADC are still being worked on. @Vikasht34 and @jmazanec15 can add more here. This is where I think we can reuse some of the quantization techniques provided by JVector.

Fused ADC - Features that nobody else has, like Fused ADC, NVQ, and Anisotropic PQ.

@jmazanec15 and @Vikasht34 please comment on this.

Compatibility - JVector is compatible with Cassandra, which makes it easier to transfer vector-encoded data from Cassandra to OpenSearch and vice versa.

This one I want to understand more. Since vector indices in OpenSearch are stored at the segment level, what do you think a transfer would look like? And what does "vector-encoded data" mean here?

Can you please help answer the following questions:

  1. Does JVector support filtering?
  2. Does JVector support nested fields?
  3. Does PQ happen at the segment level or globally? And the same question for the codebook.

@sam-herman
Contributor Author

I think in initial integration this should be an Optional Vector engine.

The k-NN plugin is very much involved/invested in FAISS and native code (BTW, Lucene is also coming up with a Faiss integration: https://github.com/apache/lucene/pull/14178/files). Even though JVector has these benefits, from a k-NN plugin standpoint Faiss already gives us all the different quantization options, including 32x-compression-based search. So JVector is not just a drop-in replacement.

@navneet1v It sounds like the long-term approach would be to push it down through Lucene in a similar way to FAISS, which I'm open to.
In that case, let's prioritize the optional path and how it's going to happen.
Maybe as a first step I can create a new module, call it "extras": codecs that are added to the build only if a certain build flag is set, and which can then simply be injected as a replacement codec that overrides the existing mechanism when provided as a parameter.
I see two advantages for that:

  1. It adds minimal branching of if/then logic to the existing code (which is already non-trivial).
  2. It requires minimal investment in the k-NN plugin's backwards-compatibility plumbing and allows quick evaluation when we want to test something new. Most of the effort can later be focused on the right interface to push long-term (e.g. Lucene if Java, or FAISS if native).

What are your thoughts?

@sam-herman
Contributor Author

The k-NN plugin is very much involved/invested in FAISS and native code (BTW, Lucene is also coming up with a Faiss integration: https://github.com/apache/lucene/pull/14178/files). Even though JVector has these benefits, from a k-NN plugin standpoint Faiss already gives us all the different quantization options, including 32x-compression-based search. So JVector is not just a drop-in replacement.

@navneet1v I thought about this a little more, and I have a higher-level concern regarding this statement. I think the main issue is that not all users/developers want their k-NN functionality to depend on native libraries. I am also not aware that we made a project-wide decision to take such a strong dependency for k-NN functionality. In fact, this is a step backwards from the original tenet of a pure-Java, portable implementation that OpenSearch/Lucene has followed for many years.
Most interestingly, I can see that a solution was proposed to move the k-NN facade to core and keep native extensions in the k-NN plugin exactly for those reasons:
#1467

So to summarize my concerns with the above statement:

  1. Separation of concerns - The k-NN facade should be separated from the plugin to avoid a situation where the plugin dictates which k-NN codec extensions are available. This should be delegated to a codec extension per field format, the same way it's done elsewhere (e.g. Transport and Storage). Until very recently (and still so in core), Lucene was the only default codec for anything; anything else required a codec extension. This seems to break that rule.
  2. Impact of benchmarks - It seems there was already a local decision to commit the k-NN plugin to FAISS and NMSLIB and not add any formats besides the default Lucene. So whatever new benchmarks we come up with for JVector wouldn't really matter. There are public benchmarks for JVector we can obviously share, and we would love to come up with more if there is a reasonable justification for the ask other than "we want it to perform this much better than the existing local native dependency to be considered".
  3. Native dependencies - Including native dependencies in the build process creates a lot of complications. We shouldn't assume this is something everybody wants by default and rule out JVM-based engines on the assumption that the functionality already exists in native libraries.

@navneet1v
Collaborator

@navneet1v It sounds like the long-term approach would be to push it down through Lucene in a similar way to FAISS, which I'm open to.

I think the Faiss integration in Lucene is still in the sandbox, so it remains to be seen if this shift will happen.

I think in initial integration this should be an Optional Vector engine.

The k-NN plugin is very much involved/invested in FAISS and native code (BTW, Lucene is also coming up with a Faiss integration: https://github.com/apache/lucene/pull/14178/files). Even though JVector has these benefits, from a k-NN plugin standpoint Faiss already gives us all the different quantization options, including 32x-compression-based search. So JVector is not just a drop-in replacement.

@navneet1v It sounds like the long-term approach would be to push it down through Lucene in a similar way to FAISS, which I'm open to. In that case, let's prioritize the optional path and how it's going to happen. Maybe as a first step I can create a new module, call it "extras": codecs that are added to the build only if a certain build flag is set, and which can then simply be injected as a replacement codec that overrides the existing mechanism when provided as a parameter. I see two advantages for that:

  1. It adds minimal branching of if/then logic to the existing code (which is already non-trivial).
  2. It requires minimal investment in the k-NN plugin's backwards-compatibility plumbing and allows quick evaluation when we want to test something new. Most of the effort can later be focused on the right interface to push long-term (e.g. Lucene if Java, or FAISS if native).

What are your thoughts?

I like the idea of an "extras" module. As for the other things you mentioned around builds etc., I'll be able to comment once we have a POC available, but direction-wise I am aligned.

I do believe that just having the codec classes might not work on its own, but I don't want to bias your thinking here. If we can reduce the change radius to the codec, that will be pretty awesome. Please feel free to put up a small POC we can iterate on.

@navneet1v
Collaborator

@navneet1v I thought about this a little more, and I have a higher-level concern regarding this statement. I think the main issue is that not all users/developers want their k-NN functionality to depend on native libraries.

The reason I say k-NN is much more involved with JNI is that we have added more abstractions and extended Faiss's functionality. One example is the loading and writing layer that integrates Faiss with IndexInput. We have also added parent-child relationship capabilities on top of Faiss. Hence that statement.

I am also not aware that we made a project-wide decision to take such a strong dependency for k-NN functionality. In fact, this is a step backwards from the original tenet of a pure-Java, portable implementation that OpenSearch/Lucene has followed for many years.

From the inception of the k-NN plugin, back in Open Distro for Elasticsearch, it has used JNI code. Since k-NN functionality is part of the plugin and doesn't impact the min distribution, I think plugins are free to explore any language they want.

Most interestingly, I can see that a solution was proposed to move the k-NN facade to core and keep native extensions in the k-NN plugin exactly for those reasons:
#1467

Yes, this was an interesting GH issue created some time back, and we agree that we should move the vectors facade to core if core believes vector query is a core search feature. But at the same time, we are trying to make the OpenSearch core engine lightweight by moving its functionality to plugins, modules, and core libs. So it's somewhat conflicting.

@jmazanec15
Member

@sam-herman @navneet1v For JVector, why not implement it as a KnnVectorsFormat? Then we can hook it up as a per-field format and create the mapping integration like Lucene's in our codec (improving the abstractions along the way). This has come up in other places: #2414 (comment). I really think the long-term extension point will be this per-field vector format; it's just going to take some time to get there, and having our own codec makes it difficult to integrate with other features. It seems to me that if JVector has its own per-field KnnVectorsFormat, then we can (somewhat) easily wire it into the existing codec with no additional JVector codec needed, so it will be very lightweight.
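The per-field wiring described here can be sketched in plain Java. The `VectorsFormat` interface, the format names, and the registry below are hypothetical stand-ins for Lucene's `KnnVectorsFormat` and the real per-field codec hook, used only to show the dispatch shape:

```java
import java.util.Map;
import java.util.function.Supplier;

public class PerFieldDispatch {
    // Hypothetical stand-in; a real integration would subclass
    // org.apache.lucene.codecs.KnnVectorsFormat instead.
    interface VectorsFormat { String name(); }

    static final Map<String, Supplier<VectorsFormat>> REGISTRY = Map.of(
        "jvector", () -> () -> "JVectorFormat",
        "lucene",  () -> () -> "Lucene99HnswVectorsFormat"
    );

    // The codec asks, per field, which format to use;
    // unknown engines fall back to the Lucene default.
    static VectorsFormat formatForField(String engineForField) {
        return REGISTRY.getOrDefault(engineForField, REGISTRY.get("lucene")).get();
    }

    public static void main(String[] args) {
        System.out.println(formatForField("jvector").name()); // JVectorFormat
        System.out.println(formatForField("faiss").name());   // falls back to Lucene
    }
}
```

The appeal of this shape is that adding an engine means adding one registry entry and one format implementation, with no new codec and no if/then branches in the shared write path.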

In terms of optional, I think initially we should integrate it into the k-NN codec and make it experimental, not optional. This will let people test it and give us an idea whether it's something we should make optional over the long term, while leaving some wiggle room on long-term decisions as we collect more data. That being said, we should do our best to sandbox it so it doesn't cause any issues in non-experimental environments. As part of this, we need to ensure there aren't any controversial dependencies.

With Faiss and the new quantization framework, support for both PQ and BQ is present in the k-NN plugin, but some techniques like ADC are still being worked on. @Vikasht34 and @jmazanec15 can add more here. This is where I think we can reuse some of the quantization techniques provided by JVector.

Yes, we are looking to enhance faiss + BQ with ADC. Still working on proposal.

@sam-herman
Contributor Author

sam-herman commented Feb 12, 2025

In terms of optional, I think initially we should integrate it into the k-NN codec and make it experimental, not optional. This will let people test it and give us an idea whether it's something we should make optional over the long term, while leaving some wiggle room on long-term decisions as we collect more data. That being said, we should do our best to sandbox it so it doesn't cause any issues in non-experimental environments. As part of this, we need to ensure there aren't any controversial dependencies.

@jmazanec15 I have given this some thought; unfortunately, labeling jVector as experimental doesn't provide much value to us at the moment:

  1. If it is ignored in the build, as was previously suggested, it can fail silently.
  2. It puts the additional overhead on our team of constantly fixing the k-NN plugin AND making sure our own change is backwards compatible and passes the build with all new changes, while maintainers and committers are free to break it at any time.
  3. The k-NN plugin is already a lot more complex than it should be for a pure-Java codec, which makes point 2 even worse for us.

This is why I am strongly convinced that moving the k-NN facade/query/mappers into core is the best way forward.
The existing dependency on the k-NN plugin as gatekeeper for codecs is quite problematic, both from an extensibility standpoint and from a project-philosophy standpoint, for the reasons I enumerated above.
The case is made even stronger by this contribution currently being at a dead end, which reinforces the point.

I created this issue with core maintainers: opensearch-project/OpenSearch#17338 (EDIT: Fixed link here)

@navneet1v
Collaborator

@sam-herman Moving the k-NN facade/query/mappers etc. to core is something we all agree we should do. But there is an alternative path for this migration which I think we should explore:

  1. Refactor the k-NN facade/query/mappers etc. in the k-NN plugin so that any module/plugin can just extend those interfaces, rather than moving them directly to core. This will allow us to regression-test the BWC and speed up the creation of the interfaces. It will also unblock the JVector development.
  2. Once we have achieved 1, moving to core will be easy: we just need to port the code, which will be pretty clean and can be done in one shot.

This way the current vector engine functionality in OpenSearch is not broken, feature development can still happen in the k-NN plugin, and all consumers of OpenSearch can easily consume these changes.

@sam-herman
Contributor Author

sam-herman commented Feb 14, 2025

@sam-herman Moving the k-NN facade/query/mappers etc. to core is something we all agree we should do. But there is an alternative path for this migration which I think we should explore:

  1. Refactor the k-NN facade/query/mappers etc. in the k-NN plugin so that any module/plugin can just extend those interfaces, rather than moving them directly to core. This will allow us to regression-test the BWC and speed up the creation of the interfaces. It will also unblock the JVector development.
  2. Once we have achieved 1, moving to core will be easy: we just need to port the code, which will be pretty clean and can be done in one shot.

This way the current vector engine functionality in OpenSearch is not broken, feature development can still happen in the k-NN plugin, and all consumers of OpenSearch can easily consume these changes.

@navneet1v correct me if I misunderstood your suggestion; however, making the k-NN plugin extendable by other plugins seems to me like it may have the following issues:

  1. Duplication of effort with core responsibilities, especially since KnnVectorField is a first-class citizen in Lucene. The k-NN plugin is the only instance in this project where a plugin is needed to access a first-class Lucene data structure, which doesn't make sense.
  2. The k-NN plugin is quite bloated, and I don't know if it makes sense for a super-lightweight JVM library to depend on something with so many dependencies.
  3. The k-NN plugin is quite unstable due to issue 2, which makes development while extending it difficult. I would rather invest time stabilizing core than stabilizing native dependencies which might never be used.

I am still convinced that core extension is the right path forward; every other path, I'm afraid, would be a long detour making us work twice as much and twice as hard.
