[Enhancement] Provide an easy failover and fail back mechanisms for replication #1485

Open
akolarkunnu opened this issue Feb 5, 2025 · 2 comments
Labels
enhancement New feature or request untriaged

Comments

@akolarkunnu

akolarkunnu commented Feb 5, 2025

Is your feature request related to a problem?

Problem Statement:
The cross-cluster replication feature for OpenSearch allows you to replicate data from one OpenSearch cluster (the leader) to another (the follower). This is useful for a range of use cases such as multi-region replication for disaster recovery purposes.
However, the cross-cluster replication feature has functionality gaps that make it less than ideal for multi-region disaster recovery.

The scenario:
Imagine we have an OpenSearch cluster running in region 1 and want to ensure availability in the event of an outage. Using cross-cluster replication, we replicate all of our data to a cluster in region 2: the “leader” cluster is in region 1 and the “follower” cluster is in region 2. If region 1 suffers an outage, we would like to switch our application over to the follower cluster in region 2 until the cluster in region 1 is available again, at which point we would reverse the process. We would also like to do this with minimal (if any) downtime and no data loss.
However, this is currently very hard to achieve for the following reasons:

Problem 1: When the leader goes down, the follower cannot be easily promoted to leader.
With cross-cluster replication, a “follower” cluster cannot be quickly or easily promoted to a “leader” cluster when region 1 suffers an outage. Several manual steps are required, adding to the workload at what will already be a very busy and stressful time for DevOps engineers.
If there were an easy way to quickly promote the follower cluster to the leader role during an outage, downtime would be minimized.
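
As a rough sketch of what the manual path looks like today: the replication plugin's stop-replication endpoint (`POST /_plugins/_replication/<index>/_stop`) ends replication and converts a follower index into a regular index, and it has to be called once per replicated index on the follower cluster. The helper below only builds the list of requests and sends nothing. The `Request` type, the function name, and the index names are hypothetical, chosen for illustration.

```python
from typing import List, NamedTuple, Optional


class Request(NamedTuple):
    """A REST call to be sent to a cluster (illustrative type, not a plugin API)."""
    method: str
    path: str
    body: Optional[dict]


def manual_failover_plan(follower_indices: List[str]) -> List[Request]:
    """Build the requests to send to the FOLLOWER cluster during a failover.

    One stop-replication call is needed per follower index. After the call,
    the index is a regular index and can accept writes.
    """
    plan = []
    for index in follower_indices:
        # Stop replication for this index; the endpoint takes an empty JSON body.
        plan.append(Request("POST", f"/_plugins/_replication/{index}/_stop", {}))
    return plan


if __name__ == "__main__":
    # Hypothetical index names.
    for request in manual_failover_plan(["orders", "customers"]):
        print(request.method, request.path)
```

The number of calls grows with the number of replicated indices, and redirecting application traffic to region 2 remains a separate manual step.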

Problem 2: The previous leader cannot easily become a follower
When a follower is promoted to leader, the previous leader's indices will no longer be in sync with the new leader. Because cross-cluster replication cannot replicate into pre-existing indices, the previous leader's indices would need to be deleted and replication from the new leader restarted from scratch. For indices with large amounts of data this can be a lengthy process, which increases the risk of a loss of availability for customers.

Problem 3: The leader-follower relationship cannot easily be restored to the original state
As above: because replication into pre-existing indices is not supported, the new follower would need to discard its indices and replicate from the new leader from scratch.
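
The re-seeding described in Problems 2 and 3 can be sketched as a request plan as well, under stated assumptions: each stale index is deleted on the cluster that is about to become the follower, and replication is then started with the plugin's start endpoint (`PUT /_plugins/_replication/<index>/_start`), whose body names the leader via `leader_alias` and `leader_index`. The connection alias, the index names, and the helper itself are hypothetical. The sketch also assumes that a remote-cluster connection with that alias is already configured and that the follower index keeps the leader index's name.

```python
from typing import List, NamedTuple, Optional


class Request(NamedTuple):
    """A REST call to be sent to a cluster (illustrative type, not a plugin API)."""
    method: str
    path: str
    body: Optional[dict]


def reseed_plan(indices: List[str], leader_alias: str,
                use_roles: Optional[dict] = None) -> List[Request]:
    """Build the requests to send to the cluster that will become the follower.

    For every index: delete the stale local copy, then start replication
    from scratch against the current leader.
    """
    plan = []
    for index in indices:
        # Replication cannot target a pre-existing index, so drop the stale copy.
        plan.append(Request("DELETE", f"/{index}", None))
        body = {"leader_alias": leader_alias, "leader_index": index}
        if use_roles is not None:
            # Only relevant when the security plugin is enabled.
            body["use_roles"] = use_roles
        plan.append(Request("PUT", f"/_plugins/_replication/{index}/_start", body))
    return plan


if __name__ == "__main__":
    # "region2" is a hypothetical connection alias pointing at the new leader.
    for request in reseed_plan(["orders"], leader_alias="region2"):
        print(request.method, request.path, request.body)
```

Every index is copied in full after the delete, which is where the long recovery time for large indices comes from. Restoring the original leader and follower roles repeats the same plan in the opposite direction.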

What solution would you like?
Provide easy failover and failback mechanisms for replication.

What alternatives have you considered?
Nothing

Do you have any additional context?
No

@akolarkunnu akolarkunnu added enhancement New feature or request untriaged labels Feb 5, 2025
@ankitkala
Member

ankitkala commented Feb 12, 2025

Thanks for the request @akolarkunnu.

Looking at the problems you called out, this doesn't really need active-active replication. An active-active setup takes writes on both clusters simultaneously and has to handle conflict resolution, which is a much harder problem to solve.

The current issue sounds more like "how to provide easy failover and failback mechanisms for replication".

Let me know if you have a proposal that you'd like to contribute. I can help with guidance and reviews.

@akolarkunnu akolarkunnu changed the title from "[FEATURE] Active-Active OpenSearch Cross-Cluster Replication" to "Provide an easy failover and fail back mechanisms for replication" Feb 13, 2025
@akolarkunnu
Author

OK, thanks. I have updated the title accordingly. I am trying to understand the current design and code; after that, I will come up with a proposal.

@akolarkunnu akolarkunnu changed the title from "Provide an easy failover and fail back mechanisms for replication" to "[Enhancement] Provide an easy failover and fail back mechanisms for replication" Feb 14, 2025