Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

2386/integrate jvector knn engine #2505

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

sam-herman
Copy link
Contributor

Description

This change integrates jVector as a JVM based vector format into KNN plugin.
jVector introduces DiskANN search that is purely implemented in Java and doesn't require native dependencies.

Related Issues

Resolves #2386

Check List

  • [x ] New functionality includes testing.
  • [ x] New functionality has been documented.
  • [x ] API changes companion pull request created.
  • [x ] Commits are signed per the DCO using --signoff.
  • [x ] Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@sam-herman sam-herman force-pushed the 2386/integrate-jvector-knn-engine branch from 125e473 to f9c1120 Compare February 7, 2025 02:04
@navneet1v
Copy link
Collaborator

navneet1v commented Feb 7, 2025

@sam-herman I think based on our discussion we were thinking to move the JVector not part of the original build and code but to be vended out as a separate engine whose code will be residing in a separate module and also will be a custom build process.

The main reason for choosing this approach was:

  1. Adding engine directly in k-NN plugin is not scaleable.

What is the reason of choosing this approach? Are putting this as a draft PR to provide a context on interfaces we need to change?

@sam-herman
Copy link
Contributor Author

@sam-herman I think based on our discussion we were thinking to move the JVector not part of the original build and code but to be vended out as a separate engine whose code will be residing in a separate module and also will be a custom build process.

The main reason for choosing this approach was:

  1. Adding engine directly in k-NN plugin is not scaleable.

What is the reason of choosing this approach? Are putting this as a draft PR to provide a context on interfaces we need to change?

Hi @navneet1v totally agree with the scalability issues mentioned, if we keep adding random libraries it will get bloated. I think we also mentioned that we have two options:

  1. Provide benchmarks to show that it has some added value.
  2. With the absence of benchmark to make a point of added value, we'll make it part of a "extras" modules folder.

I am working on getting those benchmarks out next week, so how about we treat it as a draft for now to review the correctness of the integration?
Once the benchmarks are out we can decide to refactor to different module or keep as default?
Moreover, I made some sample files that I wanted the reviewer to consider as suggestions for abstraction that would make the project more concise. So I think it's fair to say we can treat it as a draft at the moment to review the approach.

@navneet1v
Copy link
Collaborator

So I think it's fair to say we can treat it as a draft at the moment to review the approach.

Thanks for the confirmation.

@@ -331,7 +335,7 @@ task windowsPatches(type:Exec) {
task cmakeJniLib(type:Exec) {
workingDir 'jni'
def args = []
args.add("cmake")
args.add("/opt/homebrew/bin/cmake")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will fix this one

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will move this to resources or as part of a README.md for jVector demonstration

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is a not used right now, but it illustrates a suggestion for a more concise abstraction that will also allow to select compound format based on engine.
It also more consistent with how suppliers and constructors are passed in OpenSearch core and avoids back and forth referenced between class to enum and back by being self contained.

Copy link
Member

@jmazanec15 jmazanec15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need updates to all of the gradle wrapper?

@@ -316,6 +316,10 @@ dependencies {
}
testFixturesImplementation "org.opensearch:common-utils:${version}"
implementation 'com.github.oshi:oshi-core:6.4.13'

implementation 'io.github.jbellis:jvector:4.0.0-beta.2-SNAPSHOT'
implementation 'org.agrona:agrona:1.20.0'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this used? I didnt see in the codec

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be runtimeOnly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a transient dependency of jvector, I had to include it because it doesn't resolve automatically. Can make it a runtimeOnly because it's not explicitly used.

// TODO: This needs to be moved under the same package name as the Lucene internal package name for {@link Lucene90CompoundReader}
// this way the internal package constants can be accessed directly and we can avoid duplicating them.
@Log4j2
public class JVectorCompoundFormat extends CompoundFormat {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remind me if this is still needed after our discuss? I remember that @navneet1v mentioned on slack that we dont need this. It seems like this is the only reason we'd need to create a custom codec for jvector as opposed to just the knnvectorsformat. So, Id like to get rid of it if possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can potentially be removed after RandomAccessReader for jVector is delegated to IndexInput.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEATURE] Integrate Jvector engine as another vector engine of choice
4 participants