-
Notifications
You must be signed in to change notification settings - Fork 140
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Remote Vector Index Build] Introduce RemoteIndexBuilder skeleton #2525
base: main
Are you sure you want to change the base?
Conversation
25981af
to
8040bae
Compare
...main/java/org/opensearch/knn/index/codec/KNN10010Codec/KNN10010PerFieldKnnVectorsFormat.java
Outdated
Show resolved
Hide resolved
...in/java/org/opensearch/knn/index/codec/KNN10010Codec/NativeEngines10010KnnVectorsWriter.java
Outdated
Show resolved
Hide resolved
3f39eba
to
739d2ef
Compare
Thanks @navneet1v , I also agree that a new writer isn't completely necessary as the underlying formats are not changing. Moreover, the remote build should be format agnostic anyways, so I've refactored |
739d2ef
to
38cb684
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall @jed326 .
@Log4j2 | ||
public class NativeIndexWriter { | ||
private static final Long CRC32_CHECKSUM_SANITY = 0xFFFFFFFF00000000L; | ||
public interface NativeIndexWriter { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we be worried about consumers here? abstract class would be not as clean but a good mitigation strategy.
Given that this is not marked as internal and we are changing the signature - code would be breaking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @kotwanikunal , regarding class marking I think that's a concept specific to core as [1] it's only mentioned in the core repo and [2] there are no other tags in the k-nn plugin: https://github.com/opensearch-project/OpenSearch/blob/fc81a906da88d99c7ea09ffd9fec92ff81925cf6/CONTRIBUTING.md?plain=1#L174-L177. FWIW this class is/was not marked with @opensearch.api
either.
Regarding consumers:
- I don't see any entry point for a custom
NativeIndexWriter
to be provided by a consumer so I'm not really sure how a consumer could instantiate it, and we've extracted all the public methods into the interface. - We're targeting 3.0 for this change, so I'm fine with taking the potentially breaking change especially as it's a much cleaner implementation
38cb684
to
105ae13
Compare
/** | ||
* @return true if remote vector index build feature flag is enabled | ||
*/ | ||
public static boolean isKNNRemoteVectorBuildEnabled() { | ||
return Booleans.parseBooleanStrict(KNNSettings.state().getSettingValue(KNN_REMOTE_VECTOR_BUILD).toString(), false); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we will have this setting at both index and cluster level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the LLD #2465 we landed on only using a cluster setting as we see this only as an opt-in feature so we want users to explicitly determine for each index if they want to incur the additional costs with building indices remotely.
src/main/java/org/opensearch/knn/index/codec/nativeindex/RemoteNativeIndexWriter.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/codec/nativeindex/NativeIndexWriter.java
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/codec/nativeindex/NativeIndexWriter.java
Outdated
Show resolved
Hide resolved
@Override | ||
public Optional<CodecServiceFactory> getCustomCodecServiceFactory(IndexSettings indexSettings) { | ||
if (indexSettings.getValue(KNNSettings.IS_KNN_INDEX_SETTING)) { | ||
return Optional.of(KNNCodecService::new); | ||
return Optional.of((config) -> new KNNCodecService(config, new RemoteIndexBuilder(repositoriesServiceSupplier, indexSettings))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this idea of passing index settings during codec creation to RemoteIndex builder rather than just using index name and getting index settings from cluster state from any part of the code.
This is good thinking.
nativeIndexWriterMockedStatic.when(() -> NativeIndexWriter.getWriter(fieldInfo, segmentWriteState, quantizationState)) | ||
.thenReturn(nativeIndexWriter); | ||
nativeIndexWriterMockedStatic.when( | ||
() -> NativeIndexWriter.getWriter(eq(fieldInfo), eq(segmentWriteState), eq(quantizationState), any(), any()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we avoid doing any()
in the mocking, have proper mock objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call, will make this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm actually not totally sure how to match the Supplier<KNNVectorValues<?>>
argument to getWriter()
now without an any()
. From what I can tell I don't think you can, so I would need to pull the supplier creation into it's own method in NativeEngines990KnnVectorsWriter
to mock properly. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to stick with the any()
for now, I will figure out how to properly mock this in order to write the tests for the following PR as we will need to test the RemoteIndexBuilder
functionalities there too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I can tell I don't think you can
If you can't make its best to have a capture and then verify from strong unit tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, there's no functionality to be tested yet in this PR though so it would be hard to verify if any captor/matcher semantic is even working so I will follow-up with that on the next PR.
src/main/java/org/opensearch/knn/common/featureflags/KNNFeatureFlags.java
Show resolved
Hide resolved
105ae13
to
e7f50da
Compare
…ate inteface Signed-off-by: Jay Deng <[email protected]>
e7f50da
to
4f8648c
Compare
|
||
public KNNCodecService(CodecServiceConfig codecServiceConfig) { | ||
public KNNCodecService(CodecServiceConfig codecServiceConfig, RemoteIndexBuilder remoteIndexBuilder) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a bit hesitant for the entire codec should have access to remote index builder, Ideally the access be limited to knn writer implementation. Any restrictions on KNNCodecService
only hold the respository service supplier instance and then pass it to the writer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any restrictions on KNNCodecService only hold the respository service supplier instance and then pass it to the writer?
This is actually how I started in my first POC, but I had 2 main reasons for this. First, I also want to pass in the indexSettings so that I can check within the RemoteIndexBuilder
per-index if the feature should actually be used. Second, in the future we may want to support GPU acceleration for search case in a similar fashion, so at that time we could re-use the same RemoteIndexBuilder
(probably renamed) to perform the similar orchestration.
Ideally the access be limited to knn writer implementation.
I don't think I fully understand this -- if I were to pass just the repository service supplier down and only use it in the knn writer implementation isn't that the same concern?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First, I also want to pass in the indexSettings so that I can check within the RemoteIndexBuilder per-index if the feature should actually be used
Why not pass in the IndexSettings separately rather than wrapping up in remoteIndexBuilder?
Second, in the future we may want to support GPU acceleration for search case in a similar fashion, so at that time we could re-use the same RemoteIndexBuilder (probably renamed) to perform the similar orchestration
IndexBuilding shouldn't have search logic, we can always have a separate class for search to have single responsibility
f I were to pass just the repository service supplier down and only use it in the knn writer implementation isn't that the same concern?
Not really, because no other part of code has access to buildIndex and cannot misuse it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not pass in the IndexSettings separately rather than wrapping up in remoteIndexBuilder?
First, the two are meant to be used together so it makes sense to logically bundle them together. Second, and more pragmatically, the knnCodecSupplier
is a TriFunction
currently and if we need to pass the repository supplier + settings then we need 4 arguments and there's no "QuadFunction" available. We could implement this functional interface of course but ideally that should go in opensearch core.
IndexBuilding shouldn't have search logic, we can always have a separate class for search to have single responsibility
Agreed, that's why I said we would probably rename it if we do see the need for the functionality of using repository service + settings in the future.
Not really, because no other part of code has access to buildIndex and cannot misuse it
But you could make the same argument that if repositoryservice is only meant for use during index build then it could be misused still?
} | ||
|
||
@Override | ||
public void flushIndex(KNNVectorValues<?> knnVectorValues, int totalLiveDocs) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do think the responsibilities of submitting the vector builds and polling should be this class's responsibility, it will help keep the flexibility of the entire logic of trainsitioning into different states as the logic evolves. I think RemoteIndexBuilder
should be renamed and should have an isolated responsibility of managing uploads and downloads.
thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thinking here was that the RemoteIndexBuilder
would be responsible for all parts of "building" the index, so that includes uploading, polling, downloading, and then the RemoteNativeIndexWriter
would be responsible for actually writing the index to the IndexOutput
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #2525 (comment) I think the implementation of build strategy should be responsible
remoteIndexBuilder.buildIndexRemotely(fieldInfo, knnVectorValuesSupplier, totalLiveDocs, segmentWriteState); | ||
} catch (Exception e) { | ||
log.warn("Failed to flush index remotely", e); | ||
fallbackWriter.flushIndex(knnVectorValues, totalLiveDocs); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cancel API call before this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll finalize on all functionality related aspects in a future PR, but yes this most likely would be the place.
* @param knnVectorValues | ||
* @throws IOException | ||
*/ | ||
void flushIndex(KNNVectorValues<?> knnVectorValues, int totalLiveDocs) throws IOException; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the abstraction!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a look at the code and I was wondering why we need this on NativeIndexWriter level? We already have NativeIndexBuildStrategy and RemoteBuildStrategy should just be an extension of that, Any reason we didn't use it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. So the main problem is that I need the knnVectorValuesSupplier
itself so that I can create multiple InputStreams
in order to perform parallel uploads as the knnVectorValues iterator itself is sequential. However, the BuildIndexParams
takes a concrete KNNVectorValues<?>
, so I would need to refactor this class to instead take the supplier and that is a huge refactor.
Beyond that, I would still need to pass the RemoteIndexBuilder
(or the repository service supplier + index settings into the NativeIndexWriter
anyways to determine if/when I should use the RemoteBuildStrategy
.
I actually do agree that long-term, refactoring this into a RemoteBuildStrategy
may be a good idea, but I really want to avoid taking on large scale refactors in the middle of feature development.
nativeIndexWriterMockedStatic.when(() -> NativeIndexWriter.getWriter(fieldInfo, segmentWriteState, quantizationState)) | ||
.thenReturn(nativeIndexWriter); | ||
nativeIndexWriterMockedStatic.when( | ||
() -> NativeIndexWriter.getWriter(eq(fieldInfo), eq(segmentWriteState), eq(quantizationState), any(), any()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I can tell I don't think you can
If you can't make its best to have a capture and then verify from strong unit tests
*/ | ||
@AllArgsConstructor | ||
@Log4j2 | ||
public class LocalNativeIndexWriter implements NativeIndexWriter { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am hoping nothing changed in this, do highlight if there are changes and I can take a deeper look
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep no changes, this was done by IntelliJ's "extract as interface" functionality
@Override | ||
public void flushIndex(KNNVectorValues<?> knnVectorValues, int totalLiveDocs) throws IOException { | ||
try { | ||
remoteIndexBuilder.buildIndexRemotely(fieldInfo, knnVectorValuesSupplier, totalLiveDocs, segmentWriteState); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its missing flush and merge stats
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, metrics/stats will be taken in it's own PR, see the issue tracker in #2391
a154f4d
to
4f8648c
Compare
Description
First PR for #2465
In order to review changes incrementally, this PR is scoped down to only the following:
I am keeping the vector upload changes in a separate follow-up PR as that will deserve it's own in-depth discussion.
Related Issues
Relates: #2465
Check List
- [ ] New functionality has been documented.- [ ] API changes companion pull request created.--signoff
.- [ ] Public documentation issue/PR created.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.