Skip to content

Commit

Permalink
KEP-2170: Add Kubeflow Trainer Pipeline Framework Design
Browse files Browse the repository at this point in the history
Signed-off-by: Yuki Iwai <[email protected]>
  • Loading branch information
tenzen-y committed Feb 16, 2025
1 parent e7e35d1 commit 8a60919
Show file tree
Hide file tree
Showing 3 changed files with 63 additions and 0 deletions.
55 changes: 55 additions & 0 deletions docs/proposals/2170-kubeflow-training-v2/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -1692,6 +1692,61 @@ _Will be added after initial implementation for PyTorch._

_Will be added after initial implementation for PyTorch._

## Pipeline Framework

We introduce the framework as internal mechanism so that we can easily expand mechanism
for combination of Runtimes and TrainJob.

The framework is called as Kubeflow Trainer Pipeline Framework, and it has 4 phases as you can see the following
overview.

![Overview](./TrainerPipelineFrameworkOverview.drawio.svg)

As described in the following, each phase is basically executed step by step although `Startup Phase` is executed only once
during starting trainer-controller-manager:

- `Startup Phase`: Initialize internal components at once when the trainer-controller-manager starts.
- `PreExecution Phase`: This phase is executed as a part of admission validating webhooks triggered by TrainJob is created and updated.
- `Build Phase`: This phase is executed to build child Kubernetes resources and deploy those to the cluster.
- `PostExecution Phase`: This phase is executed after the `Build Phase`.

As you can see in the diagram, each phase has 2 types of APIs, `Internal API` and `Extension Point`.
The Extension Point is exposed and could be added operations within the scope of the Pipeline Framework Plugins Interfaces as plugins
and those plugins are performed in any order.
On the other hand, the Internal APIs are not exposed and could not add any operations as opposed to the Extension Point.

![Kubeflow TrainerPipelineFramework](./TrainerPipelineFramework.drawio.svg)

- `Startup Phase`:
- Internal API:
- `TrainJobController`: Set up TrainJob controller and register it to Manager.
- `Built-in Webhook Servers`: Set up Built-in Admission Webhook Servers and register those to Manager.
- `Start Manager`: Start Manager.
- Extension Point
- `WatchExtension`: This registers arbitrary reconciler builders for watching any kind of resources
and triggering TrainJob reconciliations.
- `PreExecution Phase`:
- Extension Point:
- `CustomValidation`: This registers validators for validating any kind of resources to Admission Validating Webhook Servers
when TrainJob is created and updated.
- `Build Phase`:
- Internal API:
- `ComponentDeployer`: This deploys built components (resources) to the cluster which is performed as a part of reconciler.
- Extension Point:
- `EnforcePodGroupPolicy`: This configures PodGroup specific parameters (e.x, specified in TrainingRuntime `.spec.podGroupPolicy`)
to any kind of resources like PodSpec.
- `EnforceMLPolicy`: This configure MachineLearning framework specific parameters (e.x, specified in TrainingRuntime `.spec.mlPolicy`)
to any kind of resources like PodSpec.
- `ComponentBuilder`: This builds Kubernetes resources leveraging `RuntimeInfo` and `TrainJob`.
`RuntimeInfo` is abstracted objects extracted from runtimes like TrainingRuntime and ClusterTrainingRuntime.
- `PostExecution Phase`:
- Internal API:
- `SupendedCondition`: Check if TrainJob is suspended state, and then add `Suspended` condition to TrainJob.
- `CreatedConditon`: Check if TrainJob is created state, and then add `Created` condition to TrainJob.
- Extension Point:
- `TerminalCondition`: Check if TrainJob is terminated state, and then add `Complete` condition with
a propagated terminal reason and message from child Jobs to TrainJob.

## Migration from Kubeflow Training V1

These API changes will not be compatible with Training Operator V1 APIs. Thus, existing users have
Expand Down
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

0 comments on commit 8a60919

Please sign in to comment.