KEP-2170: Add Kubeflow Trainer Pipeline Framework Design #2439

Open
wants to merge 1 commit into
base: master
55 changes: 55 additions & 0 deletions docs/proposals/2170-kubeflow-training-v2/README.md
@@ -1692,6 +1692,61 @@ _Will be added after initial implementation for PyTorch._

_Will be added after initial implementation for PyTorch._

## Pipeline Framework
Member

This is great, thank you for adding this @tenzen-y!
Please can you create a dedicated issue so we can update the Kubeflow Trainer operators docs with this guide.

Member Author

Yeah, sure. In that case, where do we want to put this?
I am wondering whether we should create Kubeflow Trainer > User Guide > PlatformDeveloper Guide.

Member

I think we should add it under Operator Guides: https://www.kubeflow.org/docs/components/trainer/operator-guides/

Member Author

Ah, I didn't find that. That sounds great.


We introduce the framework as an internal mechanism so that we can easily extend
how Runtimes and TrainJobs are combined.

The framework is called the Kubeflow Trainer Pipeline Framework, and it has 4 phases, as shown in the following
overview.

![Overview](./TrainerPipelineFrameworkOverview.drawio.svg)

As described below, the phases are executed in order, except that the `Startup Phase` is executed only once,
when the trainer-controller-manager starts:

- `Startup Phase`: Initialize internal components at once when the trainer-controller-manager starts.
Member

Suggested change
- `Startup Phase`: Initialize internal components at once when the trainer-controller-manager starts.
- `Startup Phase`: Initialize internal components at once when the `kubeflow-trainer-controller-manager` starts.

Member Author

SGTM

- `PreExecution Phase`: This phase is executed as part of the validating admission webhooks, which are triggered when a TrainJob is created or updated.
- `Build Phase`: This phase builds the child Kubernetes resources and deploys them to the cluster.
- `PostExecution Phase`: This phase is executed after the `Build Phase`.
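
The ordering described above can be sketched as follows. This is a minimal illustration in Python rather than the controller's Go code: the `Pipeline` class and its method names are hypothetical, and it assumes that the `Build Phase` and `PostExecution Phase` run back to back within one reconciliation.

```python
from dataclasses import dataclass, field


@dataclass
class Pipeline:
    """Hypothetical model of the phase ordering; not the real controller code."""

    log: list[str] = field(default_factory=list)
    started: bool = False

    def startup(self) -> None:
        # Startup Phase: runs once, when the controller manager starts.
        if self.started:
            raise RuntimeError("Startup Phase must run only once")
        self.started = True
        self.log.append("Startup")

    def admit(self, trainjob: str) -> None:
        # PreExecution Phase: runs in the validating webhook on create or update.
        self.log.append(f"PreExecution({trainjob})")

    def reconcile(self, trainjob: str) -> None:
        # Build Phase first, then PostExecution Phase (assumed to share one reconcile).
        self.log.append(f"Build({trainjob})")
        self.log.append(f"PostExecution({trainjob})")


pipeline = Pipeline()
pipeline.startup()
for job in ("job-a", "job-b"):
    pipeline.admit(job)
    pipeline.reconcile(job)
print(pipeline.log)
```

`Startup` appears once in the log, while the other three phases repeat for every TrainJob.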

As you can see in the diagram, each phase has 2 types of APIs, `Internal API` and `Extension Point`.
The Extension Point is exposed and could be added operations within the scope of the Pipeline Framework Plugins Interfaces as plugins
and those plugins are performed in any order.
Comment on lines +1714 to +1715
Member

Suggested change
The Extension Point is exposed and could be added operations within the scope of the Pipeline Framework Plugins Interfaces as plugins
and those plugins are performed in any order.
The Extension Point is exposed, allowing operations to be added as plugins within the scope of the Pipeline Framework Plugins Interfaces. These plugins can be executed in any order.

Rephrase this sentence so it would be more readable:)

Member Author

Actually, "These plugins can be executed in any order." is incorrect, since we cannot control the order.
So, I will refine this to "These plugins are executed in any order."

On the other hand, the Internal APIs are not exposed, so, unlike the Extension Points, operations cannot be added to them.
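
The distinction between the two API types can be illustrated with a small registry. This is a sketch under stated assumptions, not the framework's actual interface: the `Framework` class, `register`, and `run` are hypothetical names, and plugins are modeled as plain Python callables. A set is used on purpose so that the example has no plugin ordering to depend on.

```python
from typing import Callable

Plugin = Callable[[dict], None]

# Extension Point names as listed in this section; Internal APIs are deliberately absent.
EXTENSION_POINTS = frozenset({
    "WatchExtension",
    "CustomValidation",
    "EnforcePodGroupPolicy",
    "EnforceMLPolicy",
    "ComponentBuilder",
    "TerminalCondition",
})


class Framework:
    """Hypothetical registry; the class and method names are illustrative."""

    def __init__(self) -> None:
        self._plugins: dict[str, set[Plugin]] = {name: set() for name in EXTENSION_POINTS}

    def register(self, point: str, plugin: Plugin) -> None:
        # Only Extension Points accept plugins; Internal APIs are not exposed.
        if point not in self._plugins:
            raise ValueError(f"{point!r} is not an Extension Point")
        self._plugins[point].add(plugin)

    def run(self, point: str, obj: dict) -> None:
        # A set has no defined iteration order, so plugins must not rely on one.
        for plugin in self._plugins[point]:
            plugin(obj)


def add_team_label(obj: dict) -> None:
    obj.setdefault("labels", {})["team"] = "ml"


def add_tier_label(obj: dict) -> None:
    obj.setdefault("labels", {})["tier"] = "training"


framework = Framework()
framework.register("EnforceMLPolicy", add_team_label)
framework.register("EnforceMLPolicy", add_tier_label)

pod_spec: dict = {}
framework.run("EnforceMLPolicy", pod_spec)
print(pod_spec)
```

Because the two plugins touch different keys, the result is the same whichever runs first. Registering against an Internal API name such as `ComponentDeployer` raises `ValueError` in this sketch.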

![Kubeflow TrainerPipelineFramework](./TrainerPipelineFramework.drawio.svg)

- `Startup Phase`:
Member

I am wondering if we should also say that cluster operators can:

  1. Register new plugins into existing TrainingRuntime and ClusterTrainingRuntime
  2. Register new runtimes into runtime framework (e.g. SlurmRuntime).
    Or you think, it is out-of-scope since we don't support registration of new runtimes yet ?

Member Author

> Register new runtimes into runtime framework (e.g. SlurmRuntime).
> Or you think, it is out-of-scope since we don't support registration of new runtimes yet ?

Yeah, the current framework does not support inserting arbitrary plugins and runtimes (although, technically, we could do it). However, I think we can add another Internal API, `Initialize Trainer Framework Pipelines`, to the diagram to clarify when the pipelines are initialized. WDYT?

Member

Yeah, I think that would be nice.

  - Internal API:
    - `TrainJobController`: Sets up the TrainJob controller and registers it with the Manager.
    - `Built-in Webhook Servers`: Sets up the built-in admission webhook servers and registers them with the Manager.
    - `Start Manager`: Starts the Manager.
  - Extension Point:
    - `WatchExtension`: Registers arbitrary reconciler builders that watch any kind of resource
      and trigger TrainJob reconciliations.
- `PreExecution Phase`:
  - Extension Point:
    - `CustomValidation`: Registers validators for any kind of resource with the validating admission webhook servers;
      the validators run when a TrainJob is created or updated.
- `Build Phase`:
  - Internal API:
    - `ComponentDeployer`: Deploys the built components (resources) to the cluster as part of the reconciler.
  - Extension Point:
    - `EnforcePodGroupPolicy`: Applies PodGroup-specific parameters (e.g., those specified in TrainingRuntime `.spec.podGroupPolicy`)
      to any kind of resource, such as a PodSpec.
    - `EnforceMLPolicy`: Applies machine learning framework-specific parameters (e.g., those specified in TrainingRuntime `.spec.mlPolicy`)
      to any kind of resource, such as a PodSpec.
    - `ComponentBuilder`: Builds Kubernetes resources from `RuntimeInfo` and `TrainJob`.
      `RuntimeInfo` is an abstracted object extracted from runtimes such as TrainingRuntime and ClusterTrainingRuntime.
- `PostExecution Phase`:
  - Internal API:
    - `SuspendedCondition`: Checks whether the TrainJob is suspended and, if so, adds the `Suspended` condition to the TrainJob.
    - `CreatedCondition`: Checks whether the TrainJob is in the created state and, if so, adds the `Created` condition to the TrainJob.
Contributor

Maybe more a general comment rather than something strictly in the scope of this PR, I find the name of this condition a bit confusing as the TrainJob has to be "created", even before that condition can be applied.

So either it could be stated that this condition is about whether the children components of the TrainJob have been created, or possibly rename the condition to something like InitializedCondition.

Member Author

Good point. That makes sense. The problem with the current `Created` condition is that it mixes the semantics of "created TrainJob" and "created sub-objects based on runtime and job".

So, we might want to reconsider it. @astefanutti @andreyvelich What about trying to change "Created" condition type name in a follow-up PR? If we decide which name we should use instead of "Created", we can fix this.

Contributor

> What about trying to change "Created" condition type name in a follow-up PR?

That sounds good to me 👍🏼.

Member

Sounds good, let's discuss it separately.
@astefanutti So do you suggest that we split Created condition for TrainJob between two ?
https://github.com/kubeflow/trainer/blob/8a6091907a2df6ab1d55b9713ead631f72ece56e/docs/proposals/2170-kubeflow-training-v2/README.md#state-transition

Contributor

@andreyvelich I was more simply thinking about renaming it to something like Initialized or ComponentsCreated more than splitting it. As it stands it may be interpreted as the TrainJob is created, which is not what it is.

  - Extension Point:
    - `TerminalCondition`: Checks whether the TrainJob has terminated and, if so, adds the `Complete` condition to the TrainJob
      with the terminal reason and message propagated from the child Jobs.
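
The `Build Phase` and `PostExecution Phase` entries above can be read as a small data flow. The following Python sketch is only a conceptual model: every function, field, label, and reason string is hypothetical, and it assumes (following the order of the list) that the policy enforcement steps run before `ComponentBuilder`, with `ComponentDeployer` applying the result.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RuntimeInfo:
    """Hypothetical stand-in for the data abstracted from a runtime."""

    num_nodes: int = 1
    labels: dict = field(default_factory=dict)


def enforce_pod_group_policy(info: RuntimeInfo, runtime_spec: dict) -> None:
    # EnforcePodGroupPolicy: mark the workload when a pod group policy is set.
    if "podGroupPolicy" in runtime_spec:
        info.labels["pod-group"] = "enabled"  # illustrative label only


def enforce_ml_policy(info: RuntimeInfo, runtime_spec: dict) -> None:
    # EnforceMLPolicy: carry framework-specific settings onto the info.
    ml_policy = runtime_spec.get("mlPolicy", {})
    info.num_nodes = ml_policy.get("numNodes", info.num_nodes)


def build_components(info: RuntimeInfo, trainjob_name: str) -> list[dict]:
    # ComponentBuilder: turn RuntimeInfo plus the TrainJob into child resources.
    return [{
        "kind": "Job",
        "name": trainjob_name,
        "parallelism": info.num_nodes,
        "labels": dict(info.labels),
    }]


def deploy(cluster: dict, components: list[dict]) -> None:
    # ComponentDeployer (Internal API): apply the built resources to the cluster.
    for component in components:
        cluster[(component["kind"], component["name"])] = component


def terminal_condition(child_conditions: list[dict]) -> Optional[dict]:
    # TerminalCondition: propagate reason and message from a terminal child Job.
    for cond in child_conditions:
        if cond.get("terminal"):
            return {"type": "Complete", "reason": cond["reason"], "message": cond["message"]}
    return None


runtime_spec = {"mlPolicy": {"numNodes": 4}, "podGroupPolicy": {}}
info = RuntimeInfo()
enforce_pod_group_policy(info, runtime_spec)
enforce_ml_policy(info, runtime_spec)

cluster: dict = {}
deploy(cluster, build_components(info, "demo"))

condition = terminal_condition([
    {"terminal": True, "reason": "ExampleReason", "message": "all child Jobs finished"},
])
print(cluster, condition)
```

The cluster is modeled as a dictionary keyed by kind and name, so deploying the same component twice overwrites it rather than duplicating it.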

## Migration from Kubeflow Training V1

These API changes will not be compatible with Training Operator V1 APIs. Thus, existing users have