Refactor the agent #684
Conversation
Force-pushed from 47bbb6e to 8b30652.
Assuming that there are instances of files that are just existing code that has been moved (but otherwise functionally unchanged), it would be helpful if you could point those out so we don't waste time reviewing them :)
Yay! Will we get deallocate calls from DRA? (And be able to do away with the reconciler!!!)
    &mut self,
    request: impl tonic::IntoRequest<super::PreferredAllocationRequest>,
) -> Result<tonic::Response<super::PreferredAllocationResponse>, tonic::Status> {
    self.inner.ready().await.map_err(|e| {
Can we use this `PreferredAllocation` concept to avoid the kubelet requesting shared resources? (Or maybe this isn't worth investing in if we move to DRA.)
Well, the preferred allocation is only a preference; there is no guarantee that the kubelet will follow it (especially if multiple resources are requested), so I don't think it's worth it (especially compared to DRA).
Some parts of the files are code that just got moved, but most of it required rework (hence the fact that I couldn't just copy/paste the unit tests).
Force-pushed from 8b30652 to 597839c.
Leaving some comments based on our first synchronous review
@@ -19,11 +19,46 @@ pub type InstanceList = ObjectList<Instance>;
#[derive(CustomResource, Deserialize, Serialize, Clone, Debug, JsonSchema)]
#[serde(rename_all = "camelCase")]
// group = API_NAMESPACE and version = API_VERSION
@diconico07 Do we need to add an annotation to declare this immutable? Can we add a TODO to support capacity changes?
I did not enforce immutability on the Instance as it is a purely internal object. Concerning the capacity changes, I added a TODO comment about it in the device plugin instance controller (where the change will be needed).
/// Real world implementation of the Discovery Handler Request
struct DHRequestImpl {
    endpoints: RwLock<Vec<watch::Receiver<Vec<Arc<DiscoveredDevice>>>>>,
Where can we document in code that we support multiple discovery handlers of the same type? I.e., if I have deployed 3 ONVIF discovery handlers, then for each Configuration applied, I create 3 connections.
I don't know if there is a right place to document this in code (maybe in the crate documentation of the discovery utils crate), but we may want to add a paragraph about this in the docs.
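For illustration, a minimal sketch of what such a note could look like as crate-level documentation in the discovery utils crate (placement and wording are assumptions, not something this PR adds):

```rust
//! # Discovery handler registration
//!
//! Several discovery handlers of the same name/kind may be registered at the
//! same time (e.g. three `onvif` handlers). For every Configuration that
//! targets that name, the agent sends the discovery request to each
//! registered endpoint, so three registered handlers means three connections
//! per Configuration.
```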
Force-pushed from e7a0872 to 83d3e7f.
This commit broadly refactors the agent to:
- use kube Controller construct
- take advantage of Server Side Apply
- prepare for resource split and CDI+DRA
- don't put everything under a util directory
- use a kube client closer to kube upstream
- update proto definitions for device plugins
- use kubelet pod resources monitoring interface rather than CRI to do slot reconciliation
- use CRD definition in Rust code to generate yaml file

Signed-off-by: Nicolas Belouin <[email protected]>
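The "take advantage of Server Side Apply" item above boils down to the agent applying objects through kube-rs apply patches. Here is a minimal, hedged sketch of what such a call can look like; the spec fields and the `akri-agent` field manager name are illustrative assumptions, not the PR's actual code:

```rust
use kube::api::{Api, Patch, PatchParams};
use kube::{Client, CustomResource, ResourceExt};
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

// Illustrative stand-in for the Akri Instance resource; not the real spec.
#[derive(CustomResource, Deserialize, Serialize, Clone, Debug, JsonSchema)]
#[kube(group = "akri.sh", version = "v0", kind = "Instance", namespaced)]
#[serde(rename_all = "camelCase")]
pub struct InstanceSpec {
    pub capacity: usize,
}

// Server Side Apply: send the fully specified intent and let the API server
// merge it, with "akri-agent" registered as the field manager owning the fields.
async fn apply_instance(client: Client, instance: &Instance) -> kube::Result<Instance> {
    let namespace = instance.namespace().unwrap_or_else(|| "default".into());
    let api: Api<Instance> = Api::namespaced(client, &namespace);
    let params = PatchParams::apply("akri-agent").force();
    api.patch(&instance.name_any(), &params, &Patch::Apply(instance)).await
}
```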
Force-pushed from 83d3e7f to d109b36.
I got about halfway through and am really liking how you've brought in the kube-rs controller updates. All I have left is reviewing the `discovery_configuration_controller`, though that is the bulk of the implementation. My main confusion so far is with why we are using Signal.
Signed-off-by: Nicolas Belouin <[email protected]>
Signed-off-by: Nicolas Belouin <[email protected]>
Force-pushed from 3671870 to 006f78a.
Signed-off-by: Nicolas Belouin <[email protected]>
Signed-off-by: Nicolas Belouin <[email protected]>
Force-pushed from da04fe8 to 2e3c3a5.
Signed-off-by: Nicolas Belouin <[email protected]>
Overall the implementation looks good. I had mainly nit comments. I really like the slot reconciliation changes. However, I wasn't able to successfully get the changes running with a debugEcho discovery handler to test them. It looks like an issue with kube_client resolving the CRD updates. I am running the agent and debugEcho DH locally and applying this config. The client errors with `.spec.capacity: field not declared in schema` and `.spec.cdiName: field not declared in schema`. Here is a snippet of the error from the agent:
[2024-06-27T22:23:46Z INFO kube_runtime::controller] reconciling object; object.ref=Configuration.v0.akri.sh/akri-debug-echo-foo.default object.reason=unknown
[2024-06-27T22:23:46Z ERROR kube_client::client::builder] failed with status 500 Internal Server Error
[2024-06-27T22:23:46Z WARN agent::util::discovery_configuration_controller] Error during reconciliation for Some("default")::akri-debug-echo-foo, retrying in 1s: Other(ApiError: failed to create typed patch object (/akri-debug-echo-foo-489660; akri.sh/v0, Kind=Instance): .spec.capacity: field not declared in schema: (ErrorResponse { status: "Failure", message: "failed to create typed patch object (/akri-debug-echo-foo-489660; akri.sh/v0, Kind=Instance): .spec.capacity: field not declared in schema", reason: "", code: 500 })
Caused by:
failed to create typed patch object (/akri-debug-echo-foo-489660; akri.sh/v0, Kind=Instance): .spec.capacity: field not declared in schema: )
[2024-06-27T22:23:47Z INFO kube_runtime::controller] reconciling object; object.ref=Configuration.v0.akri.sh/akri-debug-echo-foo.default object.reason=error policy requested retry
[2024-06-27T22:23:47Z ERROR kube_client::client::builder] failed with status 500 Internal Server Error
[2024-06-27T22:23:47Z WARN agent::util::discovery_configuration_controller] Error during reconciliation for Some("default")::akri-debug-echo-foo, retrying in 2s: Other(ApiError: failed to create typed patch object (/akri-debug-echo-foo-489660; akri.sh/v0, Kind=Instance): .spec.capacity: field not declared in schema: (ErrorResponse { status: "Failure", message: "failed to create typed patch object (/akri-debug-echo-foo-489660; akri.sh/v0, Kind=Instance): .spec.capacity: field not declared in schema", reason: "", code: 500 })
...
[2024-06-27T22:24:31Z WARN agent::util::discovery_configuration_controller] Error during reconciliation for Some("default")::akri-debug-echo-foo, retrying in 64s: Other(ApiError: the name of the object (akri-debug-echo-foo based on URL) was undeterminable: name must be provided: BadRequest (ErrorResponse { status: "Failure", message: "the name of the object (akri-debug-echo-foo based on URL) was undeterminable: name must be provided", reason: "BadRequest", code: 400 })
Caused by:
the name of the object (akri-debug-echo-foo based on URL) was undeterminable: name must be provided: BadRequest)
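For context on the error above: Server Side Apply validates the typed patch against the CRD schema stored in the cluster, so `.spec.capacity: field not declared in schema` suggests the cluster still has an Instance CRD that predates the new fields. Since this PR generates the CRD YAML from the Rust definition, a minimal sketch of that generation (with an illustrative, not complete, spec) looks roughly like:

```rust
use kube::{CustomResource, CustomResourceExt};
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

// Illustrative subset of the Instance spec, just to show that the new fields
// end up in the generated schema (the camelCase rename yields spec.cdiName).
#[derive(CustomResource, Deserialize, Serialize, Clone, Debug, JsonSchema)]
#[kube(group = "akri.sh", version = "v0", kind = "Instance", namespaced)]
#[serde(rename_all = "camelCase")]
pub struct InstanceSpec {
    pub capacity: usize,
    pub cdi_name: String,
}

fn main() {
    // Applying this regenerated YAML to the cluster updates the CRD, so the
    // agent's apply calls should no longer fail with "field not declared in schema".
    println!("{}", serde_yaml::to_string(&Instance::crd()).unwrap());
}
```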
let state = self.state.borrow();
let cdi_kind = state.get(kind)?;
let mut device = cdi_kind.devices.iter().find(|dev| dev.name == id)?.clone();
device.name = format!("{}-{}", kind, id);
Nit: this modification of the name should be put in its own function.
I prefer not to: this name modification is really only relevant for the in-memory/non-full-CDI approach we are currently using with device plugins, so it won't get reused, and extracting it would probably make the code harder to read from my point of view.
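Purely for illustration, the suggested extraction would be a small helper like the following (hypothetical, not part of the PR, since the formatting stays inline):

```rust
/// Hypothetical helper: namespaces a device id with its CDI kind so device
/// plugin device names stay unique across kinds (mirrors the inline format!).
fn device_plugin_device_name(kind: &str, id: &str) -> String {
    format!("{}-{}", kind, id)
}

// The call site would then read:
// device.name = device_plugin_device_name(kind, id);
```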
},
Ok(endpoint) = rec.recv() => {
    if endpoint.get_name() != self.handler_name {
        // We woke up for another kind of DH, let's get back to sleep
Can this happen? When would DH Foo be woken up by DH Bar?
I will rename `rec` and `endpoint` here to be more explicit. This can happen if another discovery handler registers itself: we wake all active requests whenever a new discovery handler registers, so that if it is a DH with the same name/kind as this request's, we also send the request to that newly registered endpoint. To simplify, I chose to have a single MQ to broadcast those registrations (which should be pretty rare anyway), rather than creating one per name/kind of DH.
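A minimal sketch of that single broadcast channel pattern, with illustrative names rather than the PR's actual types:

```rust
use tokio::sync::broadcast;

// Illustrative event carrying the name/kind of a newly registered discovery handler.
#[derive(Clone, Debug)]
struct NewEndpointEvent {
    handler_name: String,
}

async fn discovery_request_task(
    my_handler_name: String,
    mut registrations: broadcast::Receiver<NewEndpointEvent>,
) {
    loop {
        tokio::select! {
            // ...other arms in the real code: device updates, termination signal, etc.
            event = registrations.recv() => match event {
                Ok(event) if event.handler_name == my_handler_name => {
                    // A handler of our kind registered: also send this request to it.
                    println!("forwarding request to a new {} endpoint", event.handler_name);
                }
                // Woke up for another kind of DH: ignore and keep waiting.
                Ok(_) => {}
                // We lagged behind and missed some events: keep waiting.
                Err(broadcast::error::RecvError::Lagged(_)) => {}
                // Channel closed: no more registrations will ever come.
                Err(broadcast::error::RecvError::Closed) => break,
            },
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, _keepalive) = broadcast::channel(16);
    let task = tokio::spawn(discovery_request_task("onvif".into(), tx.subscribe()));
    // Every registration is broadcast once; each active request filters by name.
    tx.send(NewEndpointEvent { handler_name: "udev".into() }).unwrap();
    tx.send(NewEndpointEvent { handler_name: "onvif".into() }).unwrap();
    drop(tx); // closing the channel lets the request task exit
    task.await.unwrap();
}
```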
}

#[async_trait]
impl DiscoveryHandlerEndpoint for NetworkEndpoint {
Can this be moved to its own module, say `network_handler.rs`? It is unexpected to have it bundled in with the registration module.
It feels weird to me to have them separated, as the registration socket only cares about network handlers anyway. Maybe I should just rename the module `network_handler.rs`, as it basically defines the network handlers and their registration mechanism (akin to the `embedded_handlers.rs` module that also contains the registration of the active embedded handlers).
That makes sense. Maybe we do this in a follow-up PR so as to not add more changes.
Signed-off-by: Nicolas Belouin <[email protected]>
LGTM! Thank you @diconico07 for revamping the agent and the patience iterating on this.
/version minor
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
What this PR does / why we need it:
This PR broadly refactors the agent to:
- use kube Controller construct
- take advantage of Server Side Apply
- prepare for resource split and CDI+DRA
- don't put everything under a util directory
- use a kube client closer to kube upstream
- update proto definitions for device plugins
- use kubelet pod resources monitoring interface rather than CRI to do slot reconciliation
- use CRD definition in Rust code to generate yaml file

While a strict refactor would not change anything user-facing, in order to facilitate some elements, the capacity is added to the Instance object, as well as the CDI fully qualified device name.

This is still WIP for now, as most of the unit tests have not been ported to the new architecture.

Special notes for your reviewer:
Sorry for the huge PR.
If applicable:
- cargo fmt
- cargo build
- cargo clippy
- cargo test
- cargo doc