[FEATURE] Using KNN codec in conjunction with custom codec #2414

Open
bugmakerrrrrr opened this issue Jan 22, 2025 · 10 comments


@bugmakerrrrrr
Contributor

Is your feature request related to a problem?
One of our customers uses the following index settings, which combine the KNN codec with the zstd codec.

"settings": {
    "index.knn": true,
    "index.codec": "zstd"
 }

Everything works fine while the shard stays open. However, when the index is recovered after a node restart, the following exception is thrown.

java.lang.IllegalStateException: missing value for Lucene90StoredFieldsFormat.mode for segment: _3
        at org.apache.lucene.codecs.lucene90.Lucene90StoredFieldsFormat.fieldsReader(Lucene90StoredFieldsFormat.java:133) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
        at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:138) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
        at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:92) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
        at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:180) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
        at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:222) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
        at org.apache.lucene.index.IndexWriter.lambda$getReader$0(IndexWriter.java:536) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
        at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:138) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
        at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:598) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
        at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:112) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
        at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:91) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
        at org.opensearch.index.engine.InternalEngine.createReaderManager(InternalEngine.java:656) ~[opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.index.engine.InternalEngine.<init>(InternalEngine.java:322) ~[opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.index.engine.InternalEngine.<init>(InternalEngine.java:223) ~[opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:43) ~[opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:2362) ~[opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:2324) ~[opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:2294) ~[opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:630) ~[opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:115) ~[opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.action.ActionListener.completeWith(ActionListener.java:342) ~[opensearch-2.9.0.jar:2.9.0]
        ... 8 more

This is because when an index is created, we initialize the KNN codec through the parameterized constructor, so we get the correct delegate codec.

public Codec codec(String name) {
    return KNNCodecVersion.current().getKnnCodecSupplier().apply(super.codec(name), mapperService);
}

But when an index is recovering, the KNN codec is instantiated through its no-arg constructor, so the delegate codec is always the default one. That no longer matches the codec used at write time, hence the exception.

public KNN990Codec() {
    this(VERSION.getDefaultCodecDelegate(), VERSION.getPerFieldKnnVectorsFormat());
}

What solution would you like?
We can record the delegate codec name in a SegmentInfo attribute while writing the SegmentInfo data. Then, when opening the stored fields reader, we first read the delegate codec name from the SegmentInfo and use the matching codec to create the reader.

private final StoredFieldsFormat storedFieldsFormat = new StoredFieldsFormat() {
    @Override
    public StoredFieldsReader fieldsReader(Directory dir, SegmentInfo si, FieldInfos fn, IOContext context) throws IOException {
        // Resolve the codec that actually wrote this segment from the
        // SegmentInfo attribute; fall back to the configured delegate.
        String name = si.getAttribute(DELEGATE_KEY);
        if (name == null) {
            return delegate.storedFieldsFormat().fieldsReader(dir, si, fn, context);
        }
        return Codec.forName(name).storedFieldsFormat().fieldsReader(dir, si, fn, context);
    }

    @Override
    public StoredFieldsWriter fieldsWriter(Directory dir, SegmentInfo si, IOContext context) throws IOException {
        // The write path always goes through the configured delegate.
        return delegate.storedFieldsFormat().fieldsWriter(dir, si, context);
    }
};

Note: since we can't get a delegate for SegmentInfoFormat, we must use the default delegate codec's SegmentInfoFormat.

What alternatives have you considered?
If we don't want users to combine a custom codec with KNN, we can check the index.codec setting when fetching the codec service factory and throw an exception if it is neither default nor best_compression, instead of hitting unintended behavior at runtime.

public Optional<CodecServiceFactory> getCustomCodecServiceFactory(IndexSettings indexSettings) {
    if (indexSettings.getValue(KNNSettings.IS_KNN_INDEX_SETTING)) {
        return Optional.of(KNNCodecService::new);
    }
    return Optional.empty();
}
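For illustration, the check might slot into the factory like this (reading index.codec straight from the settings and the error message wording are assumptions, not existing plugin code):

public Optional<CodecServiceFactory> getCustomCodecServiceFactory(IndexSettings indexSettings) {
    if (indexSettings.getValue(KNNSettings.IS_KNN_INDEX_SETTING)) {
        // Assumption: "default" and "best_compression" are the only delegates
        // the KNN codec can reconstruct after a restart.
        String codecName = indexSettings.getSettings().get("index.codec", "default");
        if (!"default".equals(codecName) && !"best_compression".equals(codecName)) {
            throw new IllegalArgumentException(
                "index.codec [" + codecName + "] cannot be combined with index.knn=true");
        }
        return Optional.of(KNNCodecService::new);
    }
    return Optional.empty();
}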


@navneet1v
Collaborator

@bugmakerrrrrr thanks for raising the issue. I think @jmazanec15 and I discussed this some time back when he was proposing the derived source change, and this is exactly what we found: if the shard moves from one node to another, or you close and reopen the index, there will be problems.

Let me read your solution in more detail so I can comment further and check whether we are missing anything. If the solution works, I will be pretty excited.

@navneet1v
Collaborator

@bugmakerrrrrr I read your whole solution and I understand what you are trying to do. Writing the delegate codec name into the segment info will make the KNNCodec diverge further from the other OpenSearch default codecs. We already have a lot of differences there, with newer codecs (e.g. the one created for the star-tree index) not being compatible with the k-NN codec. I think we should look for a more elegant solution here.

Having said that, I am aligned on having consistent behavior by ensuring that we throw an exception when an unsupported custom codec is provided.

@jmazanec15 WDYT?

@bugmakerrrrrr
Contributor Author

@navneet1v Yes, I'm also concerned about compatibility issues if the restrictions on custom codecs are completely lifted. But in most cases where KNN is used, we only care about the stored fields format (for cost reasons). I'm wondering whether we can determine if a custom codec can be used with the KNN codec by checking that every format it provides, other than the stored fields format, is the same as the default delegate's.
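For illustration, such a check could look roughly like this (the method name and the exact set of compared formats are assumptions):

static boolean isCompatibleDelegate(Codec candidate, Codec defaultDelegate) {
    // Everything except the stored fields format must match the default
    // delegate, so the KNN-specific formats keep working unchanged.
    return candidate.postingsFormat().getClass() == defaultDelegate.postingsFormat().getClass()
        && candidate.docValuesFormat().getClass() == defaultDelegate.docValuesFormat().getClass()
        && candidate.normsFormat().getClass() == defaultDelegate.normsFormat().getClass()
        && candidate.pointsFormat().getClass() == defaultDelegate.pointsFormat().getClass()
        && candidate.compoundFormat().getClass() == defaultDelegate.compoundFormat().getClass()
        && candidate.knnVectorsFormat().getClass() == defaultDelegate.knnVectorsFormat().getClass();
}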

@navneet1v
Collaborator

@navneet1v Yes, I'm also concerned about compatibility issues if the restrictions on custom codecs are completely lifted. But in most cases where KNN is used, we only care about the stored fields format (for cost reasons). I'm wondering whether we can determine if a custom codec can be used with the KNN codec by checking that every format it provides, other than the stored fields format, is the same as the default delegate's.

Can you please elaborate on this? I didn't completely get your point here.

@jmazanec15
Member

@navneet1v @bugmakerrrrrr Sorry, missed this one. I think we do need to start writing this information to the segment info (both the actual and the delegate codec) - this approach makes sense. This will allow us to properly load via SPI. Otherwise, I'm not sure there is anything else we can do with a no-arg constructor.
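For illustration, recording both names could look roughly like this (the attribute keys are made up, not an agreed-on format):

void writeCodecAttributes(SegmentInfo si, Codec actual, Codec delegate) {
    // Hypothetical attribute keys; these are persisted in the .si file
    // alongside the rest of the SegmentInfo attributes and restored on read.
    si.putAttribute("knn.codec.actual", actual.getName());
    si.putAttribute("knn.codec.delegate", delegate.getName());
}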

@jmazanec15
Member

@bugmakerrrrrr Do you have a PoC? If so, could you share it via a draft PR?

@jmazanec15
Member

Note: since we can't get a delegate for SegmentInfoFormat, we must use the default delegate codec's SegmentInfoFormat.

I was thinking more about this - how would we write custom attributes, yet use the delegate format?

This approach does seem similar to how per-field attributes are used for per-field formats. I think for the long term, we should just implement a per-field knn vectors format and then, instead of writing the delegate's name to the segment info, create a codec that uses the delegate's name as an identifier. Our custom codec will ensure that the per-field knn vectors format is serialized as a field attribute during write, so any codec that implements per-field correctly will be able to read with this format. For the derived source feature, I think we would need to migrate the functionality to use your sole StoredFieldsVisitor approach, given there is no per-field StoredFieldsFormat.
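A hedged sketch of that per-field direction (the class name is illustrative, and the concrete format returned would really come from the field's mapping). Lucene's PerFieldKnnVectorsFormat records the chosen format's name as a field attribute at write time, so the reader can resolve it via SPI without any delegate name in the segment info:

public class IllustrativePerFieldKnnVectorsFormat extends PerFieldKnnVectorsFormat {
    @Override
    public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
        // Assumption: a real implementation would look up the field's k-NN
        // parameters from the mapping; a Lucene default is returned here
        // purely to keep the sketch self-contained.
        return new Lucene99HnswVectorsFormat();
    }
}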

@bugmakerrrrrr
Contributor Author

Note: since we can't get a delegate for SegmentInfoFormat, we must use the default delegate codec's SegmentInfoFormat.

I was thinking more about this - how would we write custom attributes, yet use the delegate format?

We should make sure the delegate's SegmentInfoFormat is the same as the default delegate's SegmentInfoFormat, and use the following custom SegmentInfoFormat to bake the delegate codec name into the segment attributes.

private final SegmentInfoFormat segmentInfoFormat = new SegmentInfoFormat() {
    @Override
    public SegmentInfo read(Directory dir, String segmentName, byte[] segmentID, IOContext context) throws IOException {
        // Reading always goes through the default format; the attribute map
        // (including DELEGATE_KEY) is restored as part of the SegmentInfo.
        return defaultSegmentInfoFormat.read(dir, segmentName, segmentID, context);
    }

    @Override
    public void write(Directory dir, SegmentInfo info, IOContext ioContext) throws IOException {
        // Record which delegate codec wrote this segment before persisting.
        assert delegate != VERSION.getDefaultCodecDelegate();
        info.putAttribute(DELEGATE_KEY, delegate.getName());
        defaultSegmentInfoFormat.write(dir, info, ioContext);
    }
};

This approach does seem similar to how per-field attributes are used for per-field formats. I think for the long term, we should just implement a per-field knn vectors format and then, instead of writing the delegate's name to the segment info, create a codec that uses the delegate's name as an identifier. Our custom codec will ensure that the per-field knn vectors format is serialized as a field attribute during write, so any codec that implements per-field correctly will be able to read with this format. For the derived source feature, I think we would need to migrate the functionality to use your sole StoredFieldsVisitor approach, given there is no per-field StoredFieldsFormat.

Indeed, I think your approach is neater and compatible with the derived source feature. I am willing to give it a try.

@bugmakerrrrrr
Contributor Author

@jmazanec15 I just realized that the KNN codec relies on a custom DocValuesFormat and CompoundFormat, and there is also no per-field CompoundFormat.
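For what it's worth, Lucene does ship a per-field hook for doc values, unlike CompoundFormat; a hedged sketch (the class name is illustrative):

public class IllustrativePerFieldDocValuesFormat extends PerFieldDocValuesFormat {
    @Override
    public DocValuesFormat getDocValuesFormatForField(String field) {
        // Assumption: return the KNN plugin's custom doc values format for
        // vector fields; the Lucene default is shown to keep this compilable.
        return DocValuesFormat.forName("Lucene90");
    }
}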

@jmazanec15
Member

Hmmm, I think we should see if we can get rid of the custom compound format. I think @navneet1v had an idea for it.
