Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Egress Connection Timeouts Design #63
base: main
Are you sure you want to change the base?
Egress Connection Timeouts Design #63
Changes from 5 commits
d4398fd
4484bf6
7e6141d
115f76b
1348474
90835a9
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some High-level comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
huh, I thought we were defaulting to 0s (== dynamic interval) ?
But we're dealing with different personas here? Setting a per-policy GC option will only be effective if the node-wide configuration is also tuned accordingly. So I don't think we can just declare this aspect out-of-scope.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, I confused GC Interval with Ct Report Interval.
CT GC interval is set dynamically and proportionally by taking into account the percent of entries deleted from the ct_table in each cycle.
I believe this would be sufficient for the initial implementation of this feature as it keeps the footprint of this change small. If a large number of ct_entries are being deleted in each interval on the GW Node due to low CT timeouts, then GC interval should happen more frequently.
I'm proposing this simpler approach for now to minimize the footprint of this change. However, I agree that a more comprehensive review of the GC interval mechanism could be beneficial down the road. Perhaps we can explore that as a follow-up enhancement to this feature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the context EGW connections, what would be the implication of a conntrack entry timing out via one of these timeouts?
Without a timeout on the socket level wouldn't we possibly see hanging connections?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To broaden a bit, I think we'd need to detail how the various mechanisms for enforcing timeouts would play out in practice - especially with regards to intended use cases.
For example, are we looking for timeout functionality from/in-parity with kernel sysctl settings (ex. ipv4 timeout) such that the connection is immediately terminated following the expiry?
That might help with making a decision.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want to keep the behavior consistent with current cilium implementation. Today ct timeouts are configurable at the cluster/node level and these timeouts are not passed up to the socket. This feature just looks to improve the granularity of the existing functionality.
CT timeouts are especially useful for the EGW node because they can free up ports from stale connections and prevent exhaustion. This is the primary motivator for the feature; so passing these timeouts up to socket level is not a goal of this feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe extending the map value will require a map migration. At that point, let's consider what additional values we should place into the map value - I've wanted to store the ifindex of the egress interface for a long time :).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, where can I get a list of proposed additional values to add for this map?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think such a list exists, it's something we can raise with @cilium/egress-gateway once work on an implementation has started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Migrating entry code could present some complexity with regard to upgrading Cilium. Possible alternative approach could be to enable the lookup code only if the feature is enabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, we can introduce a new cilium config flag to enable this feature.