Refactor the agent #684
Conversation
Force-pushed from 47bbb6e to 8b30652.
Assuming that there are instances of files that are just existing code that has been moved (but otherwise functionally unchanged), it would be helpful if you could point those out so we don't waste time reviewing them :)
Yay! Will we get deallocate calls from DRA? (And be able to do away with the reconciler!!!)
    &mut self,
    request: impl tonic::IntoRequest<super::PreferredAllocationRequest>,
) -> Result<tonic::Response<super::PreferredAllocationResponse>, tonic::Status> {
    self.inner.ready().await.map_err(|e| {
Can we use this `PreferredAllocation` concept to avoid the kubelet requesting shared resources? (Or maybe this isn't worth investing in if we move to DRA.)
Well, the preferred allocation is only a preference; there is no guarantee that the kubelet will follow it (especially if multiple resources are requested), so I don't think it's worth it (especially compared to DRA).
Some parts of the files are code that just got moved, but most of it required rework (hence the fact that I couldn't just copy/paste the unit tests).
Force-pushed from 8b30652 to 597839c.
Leaving some comments based on our first synchronous review
@@ -19,11 +19,46 @@ pub type InstanceList = ObjectList<Instance>;
#[derive(CustomResource, Deserialize, Serialize, Clone, Debug, JsonSchema)]
#[serde(rename_all = "camelCase")]
// group = API_NAMESPACE and version = API_VERSION
@diconico07 Do we need to add an annotation to declare this immutable? Can we add a TODO to support capacity changes?
I did not enforce immutability on the Instance as it is a purely internal object. Concerning the capacity changes, I added a TODO comment about it in the device plugin instance controller (where the change will be needed).
/// Real world implementation of the Discovery Handler Request
struct DHRequestImpl {
    endpoints: RwLock<Vec<watch::Receiver<Vec<Arc<DiscoveredDevice>>>>>,
Where can we document in code that we support multiple discovery handlers of the same type? I.e., if I have deployed 3 ONVIF discovery handlers, then for each Configuration applied, I create 3 connections.
I don't know if there is a right place to document this in code (maybe in the crate documentation of the discovery utils crate), but we may want to add a paragraph about this in the docs.
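For illustration, a minimal sketch of what such a note could look like as crate-level documentation in the discovery utils crate (placement and wording are assumptions, not something this PR adds):

```rust
//! # Discovery handler registration
//!
//! Several discovery handlers of the same name/kind may be registered at the
//! same time (e.g. three `onvif` handlers). For every Configuration that
//! targets that name, the agent sends the discovery request to each
//! registered endpoint, so three registered handlers means three connections
//! per Configuration.
```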
Force-pushed from e7a0872 to 83d3e7f.
This commit broadly refactors the agent to:
- use kube Controller construct
- take advantage of Server Side Apply
- prepare for resource split and CDI+DRA
- don't put everything under a util directory
- use a kube client closer to kube upstream
- update proto definitions for device plugins
- use kubelet pod resources monitoring interface rather than CRI to do slot reconciliation
- use CRD definition in Rust code to generate yaml file

Signed-off-by: Nicolas Belouin <[email protected]>
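The "take advantage of Server Side Apply" item above boils down to the agent applying objects through kube-rs apply patches. Here is a minimal, hedged sketch of what such a call can look like; the spec fields and the `akri-agent` field manager name are illustrative assumptions, not the PR's actual code:

```rust
use kube::api::{Api, Patch, PatchParams};
use kube::{Client, CustomResource, ResourceExt};
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

// Illustrative stand-in for the Akri Instance resource; not the real spec.
#[derive(CustomResource, Deserialize, Serialize, Clone, Debug, JsonSchema)]
#[kube(group = "akri.sh", version = "v0", kind = "Instance", namespaced)]
#[serde(rename_all = "camelCase")]
pub struct InstanceSpec {
    pub capacity: usize,
}

// Server Side Apply: send the fully specified intent and let the API server
// merge it, with "akri-agent" registered as the field manager owning the fields.
async fn apply_instance(client: Client, instance: &Instance) -> kube::Result<Instance> {
    let namespace = instance.namespace().unwrap_or_else(|| "default".into());
    let api: Api<Instance> = Api::namespaced(client, &namespace);
    let params = PatchParams::apply("akri-agent").force();
    api.patch(&instance.name_any(), &params, &Patch::Apply(instance)).await
}
```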
Force-pushed from 83d3e7f to d109b36.
I got about halfway through and am really liking how you've brought in the kube-rs controller updates. All I have left is reviewing the `discovery_configuration_controller`, though that is the bulk of the implementation. My main confusion so far is with why we are using Signal.
Signed-off-by: Nicolas Belouin <[email protected]>
Signed-off-by: Nicolas Belouin <[email protected]>
Force-pushed from 3671870 to 006f78a.
Signed-off-by: Nicolas Belouin <[email protected]>
Signed-off-by: Nicolas Belouin <[email protected]>
Force-pushed from da04fe8 to 2e3c3a5.
Signed-off-by: Nicolas Belouin <[email protected]>
Overall the implementation looks good. I had mainly nit comments. I really like the slot reconciliation changes. However, I wasn't able to successfully get the changes running with a debugEcho discovery handler to test them. It looks like an issue with kube_client resolving the CRD updates. I am running the agent and debugEcho DH locally and applying this config. The client errors with `.spec.capacity: field not declared in schema` and `.spec.cdiName: field not declared in schema`. Here is a snippet of the error from the agent:
[2024-06-27T22:23:46Z INFO kube_runtime::controller] reconciling object; object.ref=Configuration.v0.akri.sh/akri-debug-echo-foo.default object.reason=unknown
[2024-06-27T22:23:46Z ERROR kube_client::client::builder] failed with status 500 Internal Server Error
[2024-06-27T22:23:46Z WARN agent::util::discovery_configuration_controller] Error during reconciliation for Some("default")::akri-debug-echo-foo, retrying in 1s: Other(ApiError: failed to create typed patch object (/akri-debug-echo-foo-489660; akri.sh/v0, Kind=Instance): .spec.capacity: field not declared in schema: (ErrorResponse { status: "Failure", message: "failed to create typed patch object (/akri-debug-echo-foo-489660; akri.sh/v0, Kind=Instance): .spec.capacity: field not declared in schema", reason: "", code: 500 })
Caused by:
failed to create typed patch object (/akri-debug-echo-foo-489660; akri.sh/v0, Kind=Instance): .spec.capacity: field not declared in schema: )
[2024-06-27T22:23:47Z INFO kube_runtime::controller] reconciling object; object.ref=Configuration.v0.akri.sh/akri-debug-echo-foo.default object.reason=error policy requested retry
[2024-06-27T22:23:47Z ERROR kube_client::client::builder] failed with status 500 Internal Server Error
[2024-06-27T22:23:47Z WARN agent::util::discovery_configuration_controller] Error during reconciliation for Some("default")::akri-debug-echo-foo, retrying in 2s: Other(ApiError: failed to create typed patch object (/akri-debug-echo-foo-489660; akri.sh/v0, Kind=Instance): .spec.capacity: field not declared in schema: (ErrorResponse { status: "Failure", message: "failed to create typed patch object (/akri-debug-echo-foo-489660; akri.sh/v0, Kind=Instance): .spec.capacity: field not declared in schema", reason: "", code: 500 })
...
[2024-06-27T22:24:31Z WARN agent::util::discovery_configuration_controller] Error during reconciliation for Some("default")::akri-debug-echo-foo, retrying in 64s: Other(ApiError: the name of the object (akri-debug-echo-foo based on URL) was undeterminable: name must be provided: BadRequest (ErrorResponse { status: "Failure", message: "the name of the object (akri-debug-echo-foo based on URL) was undeterminable: name must be provided", reason: "BadRequest", code: 400 })
Caused by:
the name of the object (akri-debug-echo-foo based on URL) was undeterminable: name must be provided: BadRequest)
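For context on the error above: Server Side Apply validates the typed patch against the CRD schema stored in the cluster, so `.spec.capacity: field not declared in schema` suggests the cluster still has an Instance CRD that predates the new fields. Since this PR generates the CRD YAML from the Rust definition, a minimal sketch of that generation (with an illustrative, not complete, spec) looks roughly like:

```rust
use kube::{CustomResource, CustomResourceExt};
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

// Illustrative subset of the Instance spec, just to show that the new fields
// end up in the generated schema (the camelCase rename yields spec.cdiName).
#[derive(CustomResource, Deserialize, Serialize, Clone, Debug, JsonSchema)]
#[kube(group = "akri.sh", version = "v0", kind = "Instance", namespaced)]
#[serde(rename_all = "camelCase")]
pub struct InstanceSpec {
    pub capacity: usize,
    pub cdi_name: String,
}

fn main() {
    // Applying this regenerated YAML to the cluster updates the CRD, so the
    // agent's apply calls should no longer fail with "field not declared in schema".
    println!("{}", serde_yaml::to_string(&Instance::crd()).unwrap());
}
```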
let state = self.state.borrow();
let cdi_kind = state.get(kind)?;
let mut device = cdi_kind.devices.iter().find(|dev| dev.name == id)?.clone();
device.name = format!("{}-{}", kind, id);
Nit: this modification of the name should be put in its own function.
I prefer not to: this name modification is really only relevant for the in-memory/non-full-CDI approach we are currently using with device plugins, so it won't get reused, and extracting it would probably make the code harder to read from my point of view.
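Purely for illustration, the suggested extraction would be a small helper like the following (hypothetical, not part of the PR, since the formatting stays inline):

```rust
/// Hypothetical helper: namespaces a device id with its CDI kind so device
/// plugin device names stay unique across kinds (mirrors the inline format!).
fn device_plugin_device_name(kind: &str, id: &str) -> String {
    format!("{}-{}", kind, id)
}

// The call site would then read:
// device.name = device_plugin_device_name(kind, id);
```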
},
Ok(endpoint) = rec.recv() => {
    if endpoint.get_name() != self.handler_name {
        // We woke up for another kind of DH, let's get back to sleep
Can this happen? When would DH Foo be woken up by DH Bar?
I will rename `rec` and `endpoint` here to be more explicit. This can happen if another discovery handler registers itself: we wake all active requests whenever a new discovery handler registers, so that if it is a DH with the same name/kind as this request's, we also send the request to that newly registered endpoint. To simplify, I chose to have a single MQ to broadcast those registrations (which should be pretty rare anyway), rather than creating one per name/kind of DH.
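A minimal sketch of that single broadcast channel pattern, with illustrative names rather than the PR's actual types:

```rust
use tokio::sync::broadcast;

// Illustrative event carrying the name/kind of a newly registered discovery handler.
#[derive(Clone, Debug)]
struct NewEndpointEvent {
    handler_name: String,
}

async fn discovery_request_task(
    my_handler_name: String,
    mut registrations: broadcast::Receiver<NewEndpointEvent>,
) {
    loop {
        tokio::select! {
            // ...other arms in the real code: device updates, termination signal, etc.
            event = registrations.recv() => match event {
                Ok(event) if event.handler_name == my_handler_name => {
                    // A handler of our kind registered: also send this request to it.
                    println!("forwarding request to a new {} endpoint", event.handler_name);
                }
                // Woke up for another kind of DH: ignore and keep waiting.
                Ok(_) => {}
                // We lagged behind and missed some events: keep waiting.
                Err(broadcast::error::RecvError::Lagged(_)) => {}
                // Channel closed: no more registrations will ever come.
                Err(broadcast::error::RecvError::Closed) => break,
            },
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, _keepalive) = broadcast::channel(16);
    let task = tokio::spawn(discovery_request_task("onvif".into(), tx.subscribe()));
    // Every registration is broadcast once; each active request filters by name.
    tx.send(NewEndpointEvent { handler_name: "udev".into() }).unwrap();
    tx.send(NewEndpointEvent { handler_name: "onvif".into() }).unwrap();
    drop(tx); // closing the channel lets the request task exit
    task.await.unwrap();
}
```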
}

#[async_trait]
impl DiscoveryHandlerEndpoint for NetworkEndpoint {
Can this be moved to its own module, say `network_handler.rs`? It is unexpected to have it bundled in with the registration module.
It feels weird to me to have them separated, as the registration socket only cares about network handlers anyway. Maybe I should just rename the module `network_handler.rs`, as it basically defines the network handlers and their registration mechanism (akin to the `embedded_handlers.rs` module that also contains the registration of the active embedded handlers).
That makes sense. Maybe we do this in a follow-up PR so as to not add more changes.
Signed-off-by: Nicolas Belouin <[email protected]>
LGTM! Thank you @diconico07 for revamping the agent and the patience iterating on this.
/version minor
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
What this PR does / why we need it:
This PR broadly refactors the agent to:
- use kube Controller construct
- take advantage of Server Side Apply
- prepare for resource split and CDI+DRA
- don't put everything under a util directory
- use a kube client closer to kube upstream
- update proto definitions for device plugins
- use kubelet pod resources monitoring interface rather than CRI to do slot reconciliation
- use CRD definition in Rust code to generate yaml file

While a strict refactor would not change anything user-facing, in order to facilitate some elements, the capacity is added to the Instance object, as well as the CDI fully qualified device name.

This is still WIP for now, as most of the unit tests have not been ported to the new architecture.

Special notes for your reviewer:
Sorry for the huge PR.
If applicable:
- cargo fmt
- cargo build
- cargo clippy
- cargo test
- cargo doc