[Enhancement] Make Merge in nativeEngine can Abort #2529

luyuncheng · 2025-02-14T14:26:55Z

Description

Consider the following scenario:

A long-running merge task is in progress on Node1#Index1#Shard1.
After the merge task has started, the shard begins relocating from Node1#Index1#Shard1 to Node2#Index1#Shard1.
At the finalize step, the source needs to call closeShard, but closing blocks until the merge task finishes, which can take a long time. The stack trace below shows where it waits.
The clusterApplierService thread waits for about N minutes, so the node is marked stale, and the master removes node1 from the cluster because it has not responded for too long.

"opensearch[datanode1][clusterApplierService#updateTask][T#1]" #41 daemon prio=5 os_prio=0 cpu=5183.70ms elapsed=93132.85s tid=0x00007f3f392509d0 nid=0x101 in Object.wait()  [0x00007f3f6ddfb000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait([email protected]/Native Method)
	- waiting on <no object reference available>
	at org.apache.lucene.index.IndexWriter.doWait(IndexWriter.java:5410)
	- locked <0x0000001022b0abe8> (a org.apache.lucene.index.IndexWriter)
	at org.apache.lucene.index.IndexWriter.abortMerges(IndexWriter.java:2721)
	- locked <0x0000001022b0abe8> (a org.apache.lucene.index.IndexWriter)
	at org.apache.lucene.index.IndexWriter.rollbackInternalNoCommit(IndexWriter.java:2469)
	- locked <0x0000001022b0abe8> (a org.apache.lucene.index.IndexWriter)
	at org.apache.lucene.index.IndexWriter.rollbackInternal(IndexWriter.java:2449)
	- locked <0x0000001022bae6d0> (a java.lang.Object)
	at org.apache.lucene.index.IndexWriter.rollback(IndexWriter.java:2441)
	at org.opensearch.index.engine.InternalEngine.closeNoLock(InternalEngine.java:2370)
	at org.opensearch.index.engine.Engine.close(Engine.java:2000)
	at org.opensearch.index.engine.Engine.flushAndClose(Engine.java:1987)
	at org.opensearch.index.shard.IndexShard.close(IndexShard.java:1907)
	- locked <0x0000001022b07ea0> (a java.lang.Object)
	at org.opensearch.index.IndexService.closeShard(IndexService.java:623)
	at org.opensearch.index.IndexService.removeShard(IndexService.java:599)
	- locked <0x0000001022a976a8> (a org.opensearch.index.IndexService)
	at org.opensearch.index.IndexService.close(IndexService.java:374)
	- locked <0x0000001022a976a8> (a org.opensearch.index.IndexService)
	at org.opensearch.indices.IndicesService.removeIndex(IndicesService.java:993)
	at org.opensearch.indices.cluster.IndicesClusterStateService.removeIndices(IndicesClusterStateService.java:446)
	at org.opensearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:287)
	- locked <0x000000100b7da520> (a org.opensearch.indices.cluster.IndicesClusterStateService)
	at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:606)
	at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:593)

Proposal

I think we can introduce an abort mechanism for long-running merge tasks, so that a merge can be aborted when closing the shard is requested.

We can introduce a KNNMergeHelper class to check whether the merge has been aborted. When building the graph, we can reuse faiss::InterruptCallback, an interrupt callback mechanism, to check whether the build should abort.

However, ConcurrentMergeScheduler#MergeThread is an internal class, so we cannot call it directly. Doing so throws: org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread is in unnamed module of loader 'app'

We can add this static method to OpenSearch Core, like OneMergeHelper.
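
To illustrate the kind of cooperative abort check proposed above, here is a minimal, self-contained sketch. It is not the code in this PR and uses no Lucene, OpenSearch, or faiss APIs: `AbortableBuildSketch`, `buildGraph`, `BuildAbortedException`, and `checkPeriod` are hypothetical names, and the `BooleanSupplier` stands in for whatever a helper such as KNNMergeHelper would expose to report that the merge was aborted. The loop body is a placeholder for the per-vector graph insertion.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch only; not the k-NN plugin's, Lucene's, or faiss's actual API. */
final class AbortableBuildSketch {

    /** Thrown when the build notices that the surrounding merge was aborted. */
    static class BuildAbortedException extends RuntimeException {
        BuildAbortedException(String message) {
            super(message);
        }
    }

    /**
     * Simulates inserting totalVectors vectors into a graph. Every checkPeriod
     * insertions the loop polls isMergeAborted and stops early if it returns true.
     * Returns the number of vectors inserted when the build completes.
     */
    static int buildGraph(int totalVectors, int checkPeriod, BooleanSupplier isMergeAborted) {
        int inserted = 0;
        for (int i = 0; i < totalVectors; i++) {
            // Poll periodically so the check itself stays cheap.
            if (i % checkPeriod == 0 && isMergeAborted.getAsBoolean()) {
                throw new BuildAbortedException("merge aborted after " + inserted + " vectors");
            }
            inserted++; // stand-in for the expensive per-vector graph insertion
        }
        return inserted;
    }

    public static void main(String[] args) {
        AtomicBoolean aborted = new AtomicBoolean(false);

        // Flag not set: the build runs to completion.
        System.out.println("inserted: " + buildGraph(1_000, 100, aborted::get));

        // Flag set (as a shard close would request): the build stops at the next check.
        aborted.set(true);
        try {
            buildGraph(1_000, 100, aborted::get);
        } catch (BuildAbortedException e) {
            System.out.println("stopped: " + e.getMessage());
        }
    }
}
```

The point of the sketch is that the thread closing the shard only has to wait until the next poll rather than until the whole graph is built. How often to poll is a trade-off between abort latency and per-vector overhead.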

Related Issues

Resolves #2530

Check List

New functionality includes testing.
New functionality has been documented.
API changes companion pull request created.
Commits are signed per the DCO using --signoff.
Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: luyuncheng <[email protected]>

jmazanec15 · 2025-02-14T18:01:00Z

@luyuncheng This is interesting, but can the faiss interrupt callback be set per graph, or would it apply to all graphs? In other words, would it cancel all graph builds happening on the node instead of just the one for the closed shard?

navneet1v · 2025-02-14T18:22:41Z

@luyuncheng thanks for creating the GH issue. I think we have seen this problem ourselves in a couple of places (ref: opensearch-project/OpenSearch#14828, which @kotwanikunal created), so seeing a solution to it is really great. I looked through the code and want to understand how it works. As I understand it, you check whether the merge was aborted only if the write somehow fails, and then swallow that exception.

Is this PR only about how to handle errors when the merge is aborted and the write to the directory fails because the shard has moved and is no longer present on the node? If so, note that in version 2.19 of the k-NN plugin we added support for writing the index using IndexInput/Output. So rather than checking whether the merge is aborted, could we check whether the IndexInput/Output is closed?

Make Merge Abortable

32e151a

Signed-off-by: luyuncheng <[email protected]>

luyuncheng requested review from heemin32, navneet1v, VijayanB, vamshin, jmazanec15, naveentatikonda, junqiu-lei, martin-gaievski, ryanbogan, shatejas, 0ctopus13prime and Vikasht34 as code owners February 14, 2025 14:26

luyuncheng mentioned this pull request Feb 14, 2025

[BUG]Merge in nativeEngine can not Abort #2530

Open
