Locality Based Routing Support #1909

Open
tanujd11 opened this issue Sep 27, 2023 · 42 comments · May be fixed by #5299
Labels: area/api (API-related issues), kind/enhancement (New feature or request)

@tanujd11
Member

Description:
Implement locality-based routing support by default in EG. Now that we can have individual endpoints as backends to EG, can we support region/zone/subzone-based routing using EndpointSlice information, node labels, etc.?

@tanujd11 tanujd11 added the kind/enhancement New feature or request label Sep 27, 2023
@arkodg
Contributor

arkodg commented Sep 27, 2023

Hey @tanujd11, from a user perspective, can you share what you would like to happen on the data plane (from the gateway to multiple backend endpoints with different topology info)?

@arkodg
Contributor

arkodg commented Sep 27, 2023

I understand this is very useful for optimizing east-west traffic within a cluster; is that also the case for north-south?

@tanujd11
Member Author

I think an Envoy gateway running in us-east-1/us-east-1a should prefer backends in the same zone to avoid cross-zone traffic. I think this behaviour could be the default, since cross-zone communication is obviously costly. WDYT?

@arkodg
Contributor

arkodg commented Sep 27, 2023

thanks, here's something more to think about

@github-actions

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

@github-actions github-actions bot added the stale label Oct 28, 2023
@tanujd11 tanujd11 removed the stale label Oct 29, 2023
@tanujd11 tanujd11 self-assigned this Nov 2, 2023

github-actions bot commented Dec 2, 2023

This issue has been automatically marked as stale because it has not had activity in the last 30 days.


This issue has been automatically marked as stale because it has not had activity in the last 30 days.

@arkodg
Contributor

arkodg commented May 23, 2024

there's a new field in the Service spec (trafficDistribution.preferClose) https://kubernetes.io/docs/concepts/services-networking/service/#traffic-distribution that we could consider using to automate priority amongst endpoints within a Service

@github-actions github-actions bot removed the stale label May 23, 2024
@aoledk
Contributor

aoledk commented May 23, 2024

there's a new field in the Service spec (trafficDistribution.preferClose) https://kubernetes.io/docs/concepts/services-networking/service/#traffic-distribution that we could consider using to automate priority amongst endpoints within a Service

Could be an option once this new field is stable and the corresponding K8s version is widely adopted.

Before that, IMO it's better to do load balancing across endpoints in the cluster via Envoy's capabilities.

Currently EG implements locality weighted load balancing [1]: each BackendRef is translated to one LocalityLbEndpoints.

locality := &endpointv3.LocalityLbEndpoints{
	Locality: &corev3.Locality{
		Region: fmt.Sprintf("%s/backend/%d", clusterName, i),
	},
	LbEndpoints: endpoints,
	Priority:    0,
}
  
// Set locality weight
var weight uint32
if ds.Weight != nil {
	weight = *ds.Weight
} else {
	weight = 1
}

However, the endpoints inside one LocalityLbEndpoints may be running in different zones, so cross-zone cost can't be saved this way.


With Envoy's capabilities, either priority levels [2] or zone aware routing [3][4] can achieve the goal of saving cross-zone cost.

priority levels

  1. Each backend endpoint should be set with the correct zone, which can be retrieved from the EndpointSlice (inherited from the Node's topology.kubernetes.io/zone label).
  2. Envoy should be started with the --service-zone command-line option, which indicates the zone the Envoy Pod is running in.
  3. EG rearranges EDS resources for each Envoy: if the Envoy and the backend endpoint are in the same zone, the priority is 0; otherwise it is 1.
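
The three steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: `Endpoint`, `PrioritizedEndpoint`, and `assignPriorities` are hypothetical names using plain structs rather than the real go-control-plane types, and the fallback to priority 0 when no local endpoint exists is an added assumption, since Envoy expects priority levels to be used without gaps.

```go
package main

import "fmt"

// Endpoint is a hypothetical, simplified stand-in for a backend endpoint.
type Endpoint struct {
	Address string
	Zone    string // e.g. taken from the EndpointSlice zone field
}

// PrioritizedEndpoint pairs an endpoint with an Envoy priority level.
type PrioritizedEndpoint struct {
	Endpoint
	Priority uint32
}

// assignPriorities gives priority 0 to endpoints in the proxy's zone and
// priority 1 to all others. If the proxy zone is unknown, or no endpoint
// is in the proxy's zone, every endpoint stays at priority 0 so that
// priority levels are never skipped.
func assignPriorities(proxyZone string, eps []Endpoint) []PrioritizedEndpoint {
	hasLocal := false
	for _, ep := range eps {
		if proxyZone != "" && ep.Zone == proxyZone {
			hasLocal = true
			break
		}
	}
	out := make([]PrioritizedEndpoint, 0, len(eps))
	for _, ep := range eps {
		var prio uint32
		if hasLocal && ep.Zone != proxyZone {
			prio = 1
		}
		out = append(out, PrioritizedEndpoint{Endpoint: ep, Priority: prio})
	}
	return out
}

func main() {
	eps := []Endpoint{
		{Address: "10.0.1.5:8080", Zone: "us-east-1a"},
		{Address: "10.0.2.7:8080", Zone: "us-east-1b"},
	}
	for _, p := range assignPriorities("us-east-1a", eps) {
		fmt.Printf("%s zone=%s priority=%d\n", p.Address, p.Zone, p.Priority)
	}
}
```

Because the result depends on the proxy's own zone, the output differs per Envoy Pod, which is why this approach needs per-proxy EDS resources.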

zone aware routing

This approach is mutually exclusive with locality weighted load balancing, since in the case of locality aware LB, we rely on the management server to provide the locality weighting, rather than the Envoy-side heuristics used in zone aware routing.

  1. Each backend endpoint should be set with the correct zone, which can be retrieved from the EndpointSlice (inherited from the Node's topology.kubernetes.io/zone label).
  2. Envoy should be started with the --service-zone command-line option, whose value indicates the zone the Envoy Pod is running in.
  3. Envoy's bootstrap config should set cluster_manager.local_cluster_name, which indicates the fleet the Envoy Pod belongs to; in the implementation this would be the irKey.
  4. Add the cluster corresponding to cluster_manager.local_cluster_name to the CDS resources.
  5. Design a mechanism to discover the Envoy Pods that belong to cluster_manager.local_cluster_name as endpoints and add them to the EDS resources.
  6. Neither the Envoy cluster nor the backend cluster may be in panic mode [5].
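
For step 1, endpoints would need to be grouped into one locality per zone rather than one locality per BackendRef. A minimal sketch of that grouping, assuming hypothetical `Endpoint` and `Locality` structs in place of the real xDS types:

```go
package main

import (
	"fmt"
	"sort"
)

// Endpoint is a hypothetical, simplified stand-in for a backend endpoint.
type Endpoint struct {
	Address string
	Zone    string
}

// Locality is a hypothetical stand-in for one LocalityLbEndpoints entry.
type Locality struct {
	Zone      string
	Endpoints []Endpoint
}

// groupByZone buckets endpoints by zone and returns the localities
// sorted by zone name so the result is deterministic.
func groupByZone(eps []Endpoint) []Locality {
	byZone := make(map[string][]Endpoint)
	for _, ep := range eps {
		byZone[ep.Zone] = append(byZone[ep.Zone], ep)
	}
	zones := make([]string, 0, len(byZone))
	for z := range byZone {
		zones = append(zones, z)
	}
	sort.Strings(zones)
	out := make([]Locality, 0, len(zones))
	for _, z := range zones {
		out = append(out, Locality{Zone: z, Endpoints: byZone[z]})
	}
	return out
}

func main() {
	eps := []Endpoint{
		{Address: "10.0.2.7:8080", Zone: "us-east-1b"},
		{Address: "10.0.1.5:8080", Zone: "us-east-1a"},
		{Address: "10.0.2.9:8080", Zone: "us-east-1b"},
	}
	for _, l := range groupByZone(eps) {
		fmt.Printf("zone=%s endpoints=%d\n", l.Zone, len(l.Endpoints))
	}
}
```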

personal preference

Since steps 1 and 2 are required by both, and priority levels can work with the already implemented locality weighted load balancing while zone aware routing can't, priority levels appear easier to implement. However, they require EDS resources to be arranged per individual Envoy in the xds/cache module, whether EG does this itself or a new xDS Hook API, such as PostEndpointModify(ClusterLoadAssignment, Node), is created to let an extension server do it.

Footnotes

  1. https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/load_balancing/locality_weight

  2. https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/load_balancing/priority

  3. https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/load_balancing/zone_aware

  4. https://www.envoyproxy.io/docs/envoy/latest/faq/configuration/zone_aware_routing

  5. https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/load_balancing/panic_threshold#arch-overview-load-balancing-panic-threshold

@arkodg
Contributor

arkodg commented May 23, 2024

thanks for outlining the steps @aoledk ! we currently have #3055 open to get explicit priority per backendRef and program that into the xds cluster resource.

In the future, we can use this issue to make sure we track the auto priority work, the field in k8s preferClose could be the knob for users to say they want to opt in to this feature

@guydc
Contributor

guydc commented Jun 6, 2024

Hi @aoledk, regarding:

priority levels
[...]
EG rearranges EDS resources for each Envoy, if Envoy and Backend endpoint are in same zone, priority as 0, else 1.

Is this option viable? Can our XDS server produce different EDS for different envoy pods that are part of the same Envoy deployment?
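
One way to picture what per-pod EDS would require: the xDS cache would have to key its resources by more than the proxy fleet, for example by fleet plus zone. The sketch below is purely illustrative and says nothing about what EG's cache does today; `Node` and `snapshotKey` are hypothetical names, and it assumes each Envoy reports its zone (for instance via --service-zone).

```go
package main

import "fmt"

// Node is a hypothetical, simplified view of the identity an Envoy
// instance presents to the xDS server.
type Node struct {
	Cluster string // the proxy fleet, e.g. the irKey
	Zone    string // the zone reported by the proxy, may be empty
}

// snapshotKey returns the key under which EDS resources for this node
// would be stored. Proxies in the same fleet and zone share a key;
// proxies without a zone fall back to the fleet-wide key.
func snapshotKey(n Node) string {
	if n.Zone == "" {
		return n.Cluster
	}
	return n.Cluster + "/" + n.Zone
}

func main() {
	fmt.Println(snapshotKey(Node{Cluster: "default/eg", Zone: "us-east-1a"}))
	fmt.Println(snapshotKey(Node{Cluster: "default/eg"}))
}
```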


This issue has been automatically marked as stale because it has not had activity in the last 30 days.

@github-actions github-actions bot added the stale label Oct 23, 2024
@flyik

flyik commented Dec 4, 2024

Hi @aoledk! Are you still looking into implementing it yourself? If not, I’m interested in this feature and can work on bringing it to life.

@aoledk
Contributor

aoledk commented Dec 4, 2024

@flyik I've recently been busy bringing in EG, so you can go ahead.

@aoledk aoledk removed their assignment Dec 4, 2024
@aoledk
Contributor

aoledk commented Dec 4, 2024

@flyik I've unassigned myself, you can assign to yourself.

@github-actions github-actions bot removed the stale label Dec 5, 2024

github-actions bot commented Jan 4, 2025

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

@github-actions github-actions bot added the stale label Jan 4, 2025
@kahirokunn

keep

@github-actions github-actions bot removed the stale label Jan 9, 2025

github-actions bot commented Feb 8, 2025

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

@github-actions github-actions bot added the stale label Feb 8, 2025
@kahirokunn

keep

@github-actions github-actions bot removed the stale label Feb 8, 2025
@jukie
Contributor

jukie commented Feb 16, 2025

Is anyone working on this who would like to collaborate? I'm interested in support for both Traffic Distribution and Topology Aware Routing

@arkodg
Contributor

arkodg commented Feb 17, 2025

thanks @jukie assigning this one to you.
Rethinking #1909 (comment), we may need to make this opt in, using a field such as 'ZoneAwareRoutingEnabled' in the EnvoyProxy resource

@arkodg arkodg modified the milestones: Backlog, v1.4.0-rc.1 Feb 17, 2025
@arkodg arkodg added the area/api API-related issues label Feb 17, 2025
@jukie
Contributor

jukie commented Feb 17, 2025

@arkodg wouldn't opting in via the Kubernetes Service annotation (service.kubernetes.io/topology-mode: Auto) be a better place so that it can be configured per-service?

Updating getIREndpointsFromEndpointSlice() and DestinationEndpoint to include topology info from Kubernetes should be pretty simple, but a more complex topic is how individual Envoy instances should discover their own zone. Before I get started, any preferences on the approach there?

The approach I thought of was to include the --service-zone arg for the Envoy proxy deployment. It would default to --service-zone $SERVICE_ZONE and we'd inject this env var separately through an init container or mutating webhook. In either setup it'd fail open for an empty env var resulting in no zone being set and falling back to current logic.
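
The fail-open behaviour described above could look roughly like this when building the proxy's arguments. It is a sketch, not EG's actual code: `envoyArgs` is a hypothetical helper, and the `SERVICE_ZONE` variable name is taken from the comment above.

```go
package main

import (
	"fmt"
	"os"
)

// serviceZoneEnv is the env var name suggested in the comment above.
const serviceZoneEnv = "SERVICE_ZONE"

// envoyArgs returns a copy of base with --service-zone appended only
// when a zone is known. An empty zone leaves the arguments unchanged,
// so the proxy falls back to its current behaviour.
func envoyArgs(base []string, zone string) []string {
	args := append([]string{}, base...)
	if zone != "" {
		args = append(args, "--service-zone", zone)
	}
	return args
}

func main() {
	base := []string{"--service-cluster", "example-cluster"}
	fmt.Println(envoyArgs(base, os.Getenv(serviceZoneEnv)))
}
```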

@arkodg
Contributor

arkodg commented Feb 17, 2025

I think there are two parts here

  1. Enabling settings in envoy proxy to enable the zone aware routing feature
  2. Enabling zone aware routing per backend service using trafficDistribution.preferClose

I was referring to 1. being opt-in, in case setting those fields has an impact on common-case performance and might affect users not interested in the feature; if it doesn't, enabling it by default is fine.
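
For part 2, a per-service opt-in check might look like the sketch below. It assumes a hypothetical, trimmed-down `Service` struct instead of the real Kubernetes type, and it treats either `spec.trafficDistribution: PreferClose` or the `service.kubernetes.io/topology-mode: Auto` annotation mentioned earlier in the thread as an opt-in; which signals EG should actually honour is still an open question here.

```go
package main

import "fmt"

// Service is a hypothetical, trimmed-down view of a Kubernetes Service.
type Service struct {
	Annotations         map[string]string
	TrafficDistribution *string // mirrors spec.trafficDistribution
}

const topologyModeAnnotation = "service.kubernetes.io/topology-mode"

// wantsZoneAwareRouting reports whether the Service has opted in,
// either through trafficDistribution or the topology-mode annotation.
func wantsZoneAwareRouting(svc Service) bool {
	if svc.TrafficDistribution != nil && *svc.TrafficDistribution == "PreferClose" {
		return true
	}
	// Reading from a nil map is safe in Go and yields "".
	return svc.Annotations[topologyModeAnnotation] == "Auto"
}

func main() {
	preferClose := "PreferClose"
	fmt.Println(wantsZoneAwareRouting(Service{TrafficDistribution: &preferClose}))
	fmt.Println(wantsZoneAwareRouting(Service{}))
}
```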

@jukie jukie linked a pull request Feb 18, 2025 that will close this issue
@flyik

flyik commented Feb 18, 2025

Hi! Will there be an option to choose between Zone-Aware Routing and Locality-Weighted Load Balancing in the future? It looks like the approach with priorities has a lot to offer—starting with more than one "layer" for balancing (region/zone/subzone vs. just zone in the case of ZAR), the ability to manually set fallback locations, and the option to specify weights across them.

@jukie
Contributor

jukie commented Feb 19, 2025

@flyik yes, my PR is still very much WIP but the end result won't hardcode one or the other like it does currently.

@jukie
Contributor

jukie commented Feb 19, 2025

From the contributors call it sounds like there's a strong preference against adding a Mutating Webhook so I was curious if anyone has another suggestion for how to handle zone discovery from the envoy proxy instances. An initContainer is the other option I had in mind.

The challenge is that we can't use the downwardAPI to expose the underlying node's topology labels (for example, by injecting them as an env var), so we'll need something else to populate the --service-zone arg or bootstrap configuration.
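
As a rough idea of what an init container could do: look up the node's labels, extract the zone, and write it to a file on a shared volume for the proxy container to read. The sketch below covers only the extract-and-write part; the function names and file path are hypothetical, and the node labels are hard-coded where a real init container would fetch them from the Kubernetes API (the node name itself can be exposed through the downward API as spec.nodeName).

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// zoneLabel is the well-known Kubernetes node label for the zone.
const zoneLabel = "topology.kubernetes.io/zone"

// zoneFromLabels returns the node's zone, or "" if the label is absent.
func zoneFromLabels(labels map[string]string) string {
	return labels[zoneLabel]
}

// writeZoneFile stores the zone at path; an empty file means "unknown".
func writeZoneFile(labels map[string]string, path string) error {
	return os.WriteFile(path, []byte(zoneFromLabels(labels)), 0o644)
}

// readZoneFile is what the proxy container's entrypoint could use.
func readZoneFile(path string) (string, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(data)), nil
}

func main() {
	// Hard-coded for illustration; a real init container would query the API server.
	labels := map[string]string{zoneLabel: "us-east-1a"}
	path := filepath.Join(os.TempDir(), "service-zone")
	if err := writeZoneFile(labels, path); err != nil {
		fmt.Fprintln(os.Stderr, "write failed:", err)
		os.Exit(1)
	}
	zone, err := readZoneFile(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read failed:", err)
		os.Exit(1)
	}
	fmt.Println("zone:", zone)
}
```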

@arkodg
Contributor

arkodg commented Feb 19, 2025

thanks for digging into this @jukie, +1 to the init container approach of injecting an env var and reusing it in the proxy container

@jukie
Contributor

jukie commented Feb 19, 2025

Is anyone interested in owning that piece of work? It can happen separately from my PR and would amount to getting the envoy proxy instances launched with --service-zone=$SOME_ZONE or --service-zone=$(cat /some/shared/file/from/init).

@arkodg
Contributor

arkodg commented Feb 19, 2025

sure @jukie feel free to create a GH issue or subtask for it, and someone else from the community can pick that piece up
