Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Remote Vector Index Build] Introduce RemoteIndexBuilder skeleton #2525

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jed326
Copy link
Contributor

@jed326 jed326 commented Feb 13, 2025

Description

First PR for #2465

In order to review changes incrementally, this PR is scoped down to only the following:

  • Feature Flag
  • Index Setting
  • Repository name cluster setting
  • RemoteIndexBuilder skeleton with repository service and index settings wired through the Codec

I am keeping the vector upload changes in a separate follow-up PR as that will deserve it's own in-depth discussion.

Related Issues

Relates: #2465

Check List

  • New functionality includes testing.
    - [ ] New functionality has been documented.
    - [ ] API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
    - [ ] Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@jed326 jed326 force-pushed the remote-vector-dev branch 2 times, most recently from 3f39eba to 739d2ef Compare February 13, 2025 04:14
@jed326
Copy link
Contributor Author

jed326 commented Feb 13, 2025

Thanks @navneet1v , I also agree that a new writer isn't completely necessary as the underlying formats are not changing. Moreover, the remote build should be format agnostic anyways, so I've refactored NativeIndexWriter into an interface to be used either locally or remotely when performing flush or merge.

Copy link
Member

@kotwanikunal kotwanikunal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall @jed326 .

@Log4j2
public class NativeIndexWriter {
private static final Long CRC32_CHECKSUM_SANITY = 0xFFFFFFFF00000000L;
public interface NativeIndexWriter {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be worried about consumers here? abstract class would be not as clean but a good mitigation strategy.
Given that this is not marked as internal and we are changing the signature - code would be breaking.

Copy link
Contributor Author

@jed326 jed326 Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kotwanikunal , regarding class marking I think that's a concept specific to core as [1] it's only mentioned in the core repo and [2] there are no other tags in the k-nn plugin: https://github.com/opensearch-project/OpenSearch/blob/fc81a906da88d99c7ea09ffd9fec92ff81925cf6/CONTRIBUTING.md?plain=1#L174-L177. FWIW this class is/was not marked with @opensearch.api either.

Regarding consumers:

  1. I don't see any entry point for a custom NativeIndexWriter to be provided by a consumer so I'm not really sure how a consumer could instantiate it, and we've extracted all the public methods into the interface.
  2. We're targeting 3.0 for this change, so I'm fine with taking the potentially breaking change especially as it's a much cleaner implementation

Comment on lines +58 to +63
/**
* @return true if remote vector index build feature flag is enabled
*/
public static boolean isKNNRemoteVectorBuildEnabled() {
return Booleans.parseBooleanStrict(KNNSettings.state().getSettingValue(KNN_REMOTE_VECTOR_BUILD).toString(), false);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we will have this setting at both index and cluster level.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the LLD #2465 we landed on only using a cluster setting as we see this only as an opt-in feature so we want users to explicitly determine for each index if they want to incur the additional costs with building indices remotely.

@Override
public Optional<CodecServiceFactory> getCustomCodecServiceFactory(IndexSettings indexSettings) {
if (indexSettings.getValue(KNNSettings.IS_KNN_INDEX_SETTING)) {
return Optional.of(KNNCodecService::new);
return Optional.of((config) -> new KNNCodecService(config, new RemoteIndexBuilder(repositoriesServiceSupplier, indexSettings)));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this idea of passing index settings during codec creation to RemoteIndex builder rather than just using index name and getting index settings from cluster state from any part of the code.

This is good thinking.

nativeIndexWriterMockedStatic.when(() -> NativeIndexWriter.getWriter(fieldInfo, segmentWriteState, quantizationState))
.thenReturn(nativeIndexWriter);
nativeIndexWriterMockedStatic.when(
() -> NativeIndexWriter.getWriter(eq(fieldInfo), eq(segmentWriteState), eq(quantizationState), any(), any())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we avoid doing any() in the mocking, have proper mock objects.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, will make this change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually not totally sure how to match the Supplier<KNNVectorValues<?>> argument to getWriter() now without an any(). From what I can tell I don't think you can, so I would need to pull the supplier creation into it's own method in NativeEngines990KnnVectorsWriter to mock properly. WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to stick with the any() for now, I will figure out how to properly mock this in order to write the tests for the following PR as we will need to test the RemoteIndexBuilder functionalities there too.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I can tell I don't think you can

If you can't make its best to have a capture and then verify from strong unit tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, there's no functionality to be tested yet in this PR though so it would be hard to verify if any captor/matcher semantic is even working so I will follow-up with that on the next PR.


public KNNCodecService(CodecServiceConfig codecServiceConfig) {
public KNNCodecService(CodecServiceConfig codecServiceConfig, RemoteIndexBuilder remoteIndexBuilder) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit hesitant for the entire codec should have access to remote index builder, Ideally the access be limited to knn writer implementation. Any restrictions on KNNCodecService only hold the respository service supplier instance and then pass it to the writer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any restrictions on KNNCodecService only hold the respository service supplier instance and then pass it to the writer?

This is actually how I started in my first POC, but I had 2 main reasons for this. First, I also want to pass in the indexSettings so that I can check within the RemoteIndexBuilder per-index if the feature should actually be used. Second, in the future we may want to support GPU acceleration for search case in a similar fashion, so at that time we could re-use the same RemoteIndexBuilder (probably renamed) to perform the similar orchestration.

Ideally the access be limited to knn writer implementation.

I don't think I fully understand this -- if I were to pass just the repository service supplier down and only use it in the knn writer implementation isn't that the same concern?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, I also want to pass in the indexSettings so that I can check within the RemoteIndexBuilder per-index if the feature should actually be used

Why not pass in the IndexSettings separately rather than wrapping up in remoteIndexBuilder?

Second, in the future we may want to support GPU acceleration for search case in a similar fashion, so at that time we could re-use the same RemoteIndexBuilder (probably renamed) to perform the similar orchestration

IndexBuilding shouldn't have search logic, we can always have a separate class for search to have single responsibility

f I were to pass just the repository service supplier down and only use it in the knn writer implementation isn't that the same concern?

Not really, because no other part of code has access to buildIndex and cannot misuse it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not pass in the IndexSettings separately rather than wrapping up in remoteIndexBuilder?

First, the two are meant to be used together so it makes sense to logically bundle them together. Second, and more pragmatically, the knnCodecSupplier is a TriFunction currently and if we need to pass the repository supplier + settings then we need 4 arguments and there's no "QuadFunction" available. We could implement this functional interface of course but ideally that should go in opensearch core.

IndexBuilding shouldn't have search logic, we can always have a separate class for search to have single responsibility

Agreed, that's why I said we would probably rename it if we do see the need for the functionality of using repository service + settings in the future.

Not really, because no other part of code has access to buildIndex and cannot misuse it

But you could make the same argument that if repositoryservice is only meant for use during index build then it could be misused still?

}

@Override
public void flushIndex(KNNVectorValues<?> knnVectorValues, int totalLiveDocs) throws IOException {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think the responsibilities of submitting the vector builds and polling should be this class's responsibility, it will help keep the flexibility of the entire logic of trainsitioning into different states as the logic evolves. I think RemoteIndexBuilder should be renamed and should have an isolated responsibility of managing uploads and downloads.

thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking here was that the RemoteIndexBuilder would be responsible for all parts of "building" the index, so that includes uploading, polling, downloading, and then the RemoteNativeIndexWriter would be responsible for actually writing the index to the IndexOutput.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #2525 (comment) I think the implementation of build strategy should be responsible

remoteIndexBuilder.buildIndexRemotely(fieldInfo, knnVectorValuesSupplier, totalLiveDocs, segmentWriteState);
} catch (Exception e) {
log.warn("Failed to flush index remotely", e);
fallbackWriter.flushIndex(knnVectorValues, totalLiveDocs);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cancel API call before this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll finalize on all functionality related aspects in a future PR, but yes this most likely would be the place.

* @param knnVectorValues
* @throws IOException
*/
void flushIndex(KNNVectorValues<?> knnVectorValues, int totalLiveDocs) throws IOException;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the abstraction!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a look at the code and I was wondering why we need this on NativeIndexWriter level? We already have NativeIndexBuildStrategy and RemoteBuildStrategy should just be an extension of that, Any reason we didn't use it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. So the main problem is that I need the knnVectorValuesSupplier itself so that I can create multiple InputStreams in order to perform parallel uploads as the knnVectorValues iterator itself is sequential. However, the BuildIndexParams takes a concrete KNNVectorValues<?>, so I would need to refactor this class to instead take the supplier and that is a huge refactor.

Beyond that, I would still need to pass the RemoteIndexBuilder (or the repository service supplier + index settings into the NativeIndexWriter anyways to determine if/when I should use the RemoteBuildStrategy.

I actually do agree that long-term, refactoring this into a RemoteBuildStrategy may be a good idea, but I really want to avoid taking on large scale refactors in the middle of feature development.

nativeIndexWriterMockedStatic.when(() -> NativeIndexWriter.getWriter(fieldInfo, segmentWriteState, quantizationState))
.thenReturn(nativeIndexWriter);
nativeIndexWriterMockedStatic.when(
() -> NativeIndexWriter.getWriter(eq(fieldInfo), eq(segmentWriteState), eq(quantizationState), any(), any())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I can tell I don't think you can

If you can't make its best to have a capture and then verify from strong unit tests

*/
@AllArgsConstructor
@Log4j2
public class LocalNativeIndexWriter implements NativeIndexWriter {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am hoping nothing changed in this, do highlight if there are changes and I can take a deeper look

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep no changes, this was done by IntelliJ's "extract as interface" functionality

@Override
public void flushIndex(KNNVectorValues<?> knnVectorValues, int totalLiveDocs) throws IOException {
try {
remoteIndexBuilder.buildIndexRemotely(fieldInfo, knnVectorValuesSupplier, totalLiveDocs, segmentWriteState);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its missing flush and merge stats

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, metrics/stats will be taken in it's own PR, see the issue tracker in #2391

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants