[Enhancement] Provide an easy failover and fail back mechanisms for replication #1485

akolarkunnu · 2025-02-05T07:51:40Z

Is your feature request related to a problem?

Problem Statement:
The cross-cluster replication feature for OpenSearch allows you to replicate data from one OpenSearch cluster (the leader) to another (the follower). This is useful for a range of use cases such as multi-region replication for disaster recovery purposes.
However, the cross-cluster replication feature has functional gaps that make it less than ideal for multi-region disaster recovery.

The scenario:
Imagine we have an OpenSearch cluster running in region 1 and want to ensure availability in the event of an outage. Using cross-cluster replication, we replicate all of our data to a cluster in region 2. The “leader” cluster is in region 1 and the “follower” cluster is in region 2. If region 1 has an outage, we would like to switch our application over to the follower cluster in region 2 until the cluster in region 1 is available again, at which point we would reverse the process. We would also like to do this with minimal (if any) downtime and no data loss.
However, this is currently very hard to achieve for the following reasons:

Problem 1: When the leader goes down, the follower cannot be easily promoted to the leader.
Cross-cluster replication offers no quick or easy way to promote a “follower” cluster to a “leader” cluster when region 1 has an outage. Several manual steps are required, which adds workload at what is already a busy and stressful time for DevOps engineers.
If there were an easy way to promote the follower cluster to the leader role during an outage, downtime would be minimized.
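
For illustration, the manual promotion today largely comes down to stopping replication on every follower index, which turns each one into a regular writable index. The following is a minimal sketch, assuming the replication plugin's documented `_stop` endpoint. The `Request` tuple and the `failover_plan` helper are hypothetical names introduced here; actually sending the requests and redirecting application traffic are left out.

```python
from typing import NamedTuple


class Request(NamedTuple):
    """One REST call to issue against a cluster (hypothetical helper type)."""
    method: str
    path: str
    body: dict


def failover_plan(follower_indices):
    """Build the requests to send to the FOLLOWER cluster during failover.

    Stopping replication converts each follower index into a standard
    index that accepts writes. One request is needed per replicated index.
    """
    return [
        Request("POST", f"/_plugins/_replication/{index}/_stop", {})
        for index in follower_indices
    ]


if __name__ == "__main__":
    # Index names are examples only.
    for request in failover_plan(["orders", "customers"]):
        print(request.method, request.path, request.body)
```

Even in this simplified form, the per-index loop shows why the process becomes laborious for clusters with many replicated indices.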

Problem 2: The previous leader cannot easily become a follower
When a follower is promoted to a leader, the previous leader's indices will no longer be in sync with the new leader. Since cross-cluster replication does not allow replicating to pre-existing indices, the previous leader's indices would need to be deleted and replication from the new leader would need to start from scratch. For indices with large amounts of data this can be a lengthy process, which prolongs the period in which customers are at risk of losing availability.
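
As a rough sketch of what that reversal involves today, the former leader has to drop each stale index and then start replication from the new leader from scratch. The example assumes the replication plugin's documented `_start` endpoint and that a remote connection to the new leader already exists on the former leader under some alias. `Request`, `failback_plan`, and the alias value are hypothetical, and the `use_roles` field needed when the security plugin is enabled is omitted.

```python
from typing import NamedTuple


class Request(NamedTuple):
    """One REST call to issue against a cluster (hypothetical helper type)."""
    method: str
    path: str
    body: dict


def failback_plan(indices, new_leader_alias):
    """Build the requests to send to the FORMER leader so it becomes a follower.

    For every index: delete the stale local copy, then start replication
    from the new leader. The full data set is copied again each time.
    """
    plan = []
    for index in indices:
        # Replication cannot target a pre-existing index, so remove it first.
        plan.append(Request("DELETE", f"/{index}", {}))
        plan.append(
            Request(
                "PUT",
                f"/_plugins/_replication/{index}/_start",
                {"leader_alias": new_leader_alias, "leader_index": index},
            )
        )
    return plan


if __name__ == "__main__":
    # "region2" is an assumed connection alias pointing at the new leader.
    for request in failback_plan(["orders"], "region2"):
        print(request.method, request.path, request.body)
```

The `DELETE` followed by a fresh `_start` is the costly part: nothing already stored on the former leader is reused.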

Problem 3: The leader-follower relationship cannot easily be restored to the original state
As above, because replication cannot target pre-existing indices, the new follower would need to discard its indices and replicate from the new leader from scratch.

What solution would you like?
Provide easy failover and failback mechanisms for replication.

What alternatives have you considered?
Nothing

Do you have any additional context?
No

ankitkala · 2025-02-12T15:47:35Z

Thanks for the request @akolarkunnu.

Looking at the problems you called out, this doesn't really need active-active replication. An active-active setup takes writes on both clusters simultaneously and has to handle conflict resolution, which is a much harder problem to solve.

The current issue sounds more like "how to provide easy failover and failback mechanisms for replication".

Let me know if you have a proposal you'd like to contribute. I can help with guidance and reviews.

akolarkunnu · 2025-02-13T15:28:07Z

OK, thanks. I updated the title accordingly. I am trying to understand the current design and code. After that, I will come up with a proposal.

akolarkunnu added the enhancement (New feature or request) and untriaged labels Feb 5, 2025

akolarkunnu changed the title ~~[FEATURE] Active-Active OpenSearch Cross-Cluster Replication~~ Provide an easy failover and fail back mechanisms for replication Feb 13, 2025

akolarkunnu changed the title ~~Provide an easy failover and fail back mechanisms for replication~~ [Enhancement] Provide an easy failover and fail back mechanisms for replication Feb 14, 2025
