KEP-2170: Add Kubeflow Trainer Pipeline Framework Design #2439
@@ -1692,6 +1692,61 @@ _Will be added after initial implementation for PyTorch._

_Will be added after initial implementation for PyTorch._

## Pipeline Framework

We introduce the framework as an internal mechanism so that we can easily extend
how Runtimes and TrainJobs are combined.

The framework is called the Kubeflow Trainer Pipeline Framework, and it has 4 phases, as shown in the following
overview.

![Overview](./TrainerPipelineFrameworkOverview.drawio.svg)

As described below, the phases are basically executed one after another, although the `Startup Phase` is executed only once,
when the trainer-controller-manager starts:

- `Startup Phase`: Initializes internal components once, when the trainer-controller-manager starts.

Suggested change

SGTM

- `PreExecution Phase`: Executed as part of the admission validating webhooks, which are triggered when a TrainJob is created or updated.
- `Build Phase`: Builds the child Kubernetes resources and deploys them to the cluster.
- `PostExecution Phase`: Executed after the `Build Phase`.

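The phase ordering listed above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: `Phase`, `Pipeline`, `Start`, and `HandleTrainJob` are hypothetical names, and the sketch assumes that a TrainJob rejected during validation never reaches the later phases. In the real controller the webhook and the reconciler are triggered separately; here they are folded into one call for brevity.

```go
package main

import "fmt"

// Phase is a hypothetical name for one stage of the pipeline.
type Phase string

const (
	Startup       Phase = "Startup"
	PreExecution  Phase = "PreExecution"
	Build         Phase = "Build"
	PostExecution Phase = "PostExecution"
)

// Pipeline records which phases ran. It is an illustrative stand-in for
// the trainer-controller-manager, not its real type.
type Pipeline struct {
	started bool
	Log     []Phase
}

// Start runs the Startup Phase. Repeated calls are no-ops, mirroring
// "executed only once".
func (p *Pipeline) Start() {
	if p.started {
		return
	}
	p.started = true
	p.Log = append(p.Log, Startup)
}

// HandleTrainJob runs the remaining phases in order for one TrainJob event.
// validate stands in for the admission webhook; this sketch assumes that a
// validation error stops the later phases.
func (p *Pipeline) HandleTrainJob(validate func() error) error {
	if !p.started {
		return fmt.Errorf("pipeline has not been started")
	}
	p.Log = append(p.Log, PreExecution)
	if err := validate(); err != nil {
		return err
	}
	p.Log = append(p.Log, Build, PostExecution)
	return nil
}

func main() {
	p := &Pipeline{}
	p.Start()
	// Two TrainJob events: Startup is not repeated between them.
	for i := 0; i < 2; i++ {
		if err := p.HandleTrainJob(func() error { return nil }); err != nil {
			fmt.Println("rejected:", err)
		}
	}
	fmt.Println(p.Log)
}
```

Running the sketch shows a single `Startup` entry followed by one `PreExecution`, `Build`, `PostExecution` sequence per TrainJob event.
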
As shown in the diagram, each phase has 2 types of APIs: `Internal API` and `Extension Point`.
Extension Points are exposed, so operations can be added to them as plugins within the scope of the Pipeline Framework plugin interfaces.
Those plugins may be executed in any order.

Comment on lines +1714 to +1715

Suggested change

Rephrase this sentence so it would be more readable:)

Actually,

On the other hand, the Internal APIs are not exposed, so no operations can be added to them, unlike the Extension Points.

![Kubeflow Trainer Pipeline Framework](./TrainJobPipelineFramework.drawio.svg)

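The difference between an Extension Point and an Internal API can be sketched as below. All names here (`Plugin`, `CustomValidationPlugin`, `Framework`, `Register`, `startManager`, `nonEmptyName`) are hypothetical and chosen for illustration; they are not taken from the Kubeflow Trainer code base. The sketch collects every plugin error instead of stopping at the first one, so the outcome does not depend on the order in which plugins run.

```go
package main

import "fmt"

// Plugin is a hypothetical minimal interface that every plugin satisfies.
type Plugin interface {
	Name() string
}

// CustomValidationPlugin is a hypothetical Extension Point interface.
type CustomValidationPlugin interface {
	Plugin
	Validate(trainJobName string) error
}

// Framework holds the plugins registered at each Extension Point.
type Framework struct {
	validators []CustomValidationPlugin
}

// Register is the exposed entry point. It accepts a plugin only if it
// implements an Extension Point interface known to the framework.
func (f *Framework) Register(p Plugin) error {
	v, ok := p.(CustomValidationPlugin)
	if !ok {
		return fmt.Errorf("plugin %q implements no known extension point", p.Name())
	}
	f.validators = append(f.validators, v)
	return nil
}

// RunCustomValidation calls every registered validator and collects all
// errors, so the result is the same whatever the plugin order is.
func (f *Framework) RunCustomValidation(trainJobName string) []error {
	var errs []error
	for _, v := range f.validators {
		if err := v.Validate(trainJobName); err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", v.Name(), err))
		}
	}
	return errs
}

// startManager stands in for an Internal API: it is unexported and has no
// hook through which a plugin could add operations.
func (f *Framework) startManager() string {
	return "manager started"
}

// nonEmptyName is an example plugin that rejects empty TrainJob names.
type nonEmptyName struct{}

func (nonEmptyName) Name() string { return "NonEmptyName" }

func (nonEmptyName) Validate(trainJobName string) error {
	if trainJobName == "" {
		return fmt.Errorf("TrainJob name must not be empty")
	}
	return nil
}

func main() {
	f := &Framework{}
	if err := f.Register(nonEmptyName{}); err != nil {
		fmt.Println(err)
	}
	fmt.Println(f.startManager())
	fmt.Println(f.RunCustomValidation(""))
}
```
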
- `Startup Phase`:

I am wondering if we should also say that cluster operators can:

Yeah, the current framework does not support inserting arbitrary plugins and runtimes (for sure, technically, we could do it). However, I think we can add another InternalAPI

Yeah, I think that would be nice.

  - Internal API:
    - `TrainJobController`: Sets up the TrainJob controller and registers it with the Manager.
    - `Built-in Webhook Servers`: Sets up the built-in Admission Webhook Servers and registers them with the Manager.
    - `Start Manager`: Starts the Manager.
  - Extension Point:
    - `WatchExtension`: Registers arbitrary reconciler builders that watch any kind of resource
      and trigger TrainJob reconciliations.
- `PreExecution Phase`:
  - Extension Point:
    - `CustomValidation`: Registers validators for any kind of resource with the Admission Validating Webhook Servers.
      The validators run when a TrainJob is created or updated.
- `Build Phase`:
  - Internal API:
    - `ComponentDeployer`: Deploys the built components (resources) to the cluster as part of the reconciler.
  - Extension Point:
    - `EnforcePodGroupPolicy`: Applies PodGroup-specific parameters (e.g., those specified in TrainingRuntime `.spec.podGroupPolicy`)
      to any kind of resource, such as a PodSpec.
    - `EnforceMLPolicy`: Applies machine learning framework-specific parameters (e.g., those specified in TrainingRuntime `.spec.mlPolicy`)
      to any kind of resource, such as a PodSpec.
    - `ComponentBuilder`: Builds Kubernetes resources from `RuntimeInfo` and `TrainJob`.
      `RuntimeInfo` is an abstraction of the objects extracted from runtimes such as TrainingRuntime and ClusterTrainingRuntime.
- `PostExecution Phase`:
  - Internal API:
    - `SuspendedCondition`: Checks whether the TrainJob is suspended and, if so, adds the `Suspended` condition to the TrainJob.
    - `CreatedCondition`: Checks whether the TrainJob is in the created state and, if so, adds the `Created` condition to the TrainJob.

Maybe more a general comment rather than something strictly in the scope of this PR, I find the name of this condition a bit confusing as the TrainJob has to be "created", even before that condition can be applied. So either it could be stated that this condition is about whether the children components of the TrainJob have been created, or possibly rename the condition to something like

Good point. That makes sense. The current
So, we might want to reconsider it. @astefanutti @andreyvelich What about trying to change "Created" condition type name in a follow-up PR? If we decide which name we should use instead of "Created", we can fix this.

That sounds good to me 👍🏼.

Sounds good, let's discuss it separately.

@andreyvelich I was more simply thinking about renaming it to something like

  - Extension Point:
    - `TerminalCondition`: Checks whether the TrainJob has terminated and, if so, adds the `Complete` condition to the TrainJob,
      with the terminal reason and message propagated from the child Jobs.

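As a rough illustration of how the `Build Phase` entries above could fit together, the sketch below applies enforcement plugins to a runtime description, then runs builders, then hands the results to a deployer. Everything in it is an assumption made for the example: the types `RuntimeInfo`, `TrainJob`, and `Resource` are heavily simplified, the function names are hypothetical, the `NUM_NODES` variable and `example.com/pod-group` label are invented, and the enforce, build, deploy ordering is inferred from the descriptions rather than confirmed by the KEP.

```go
package main

import "fmt"

// RuntimeInfo is a simplified, hypothetical stand-in for the abstracted
// runtime object described above.
type RuntimeInfo struct {
	Labels map[string]string
	Env    map[string]string
}

// TrainJob is a simplified, hypothetical stand-in for the TrainJob API object.
type TrainJob struct {
	Name     string
	NumNodes int
}

// Resource stands in for a child Kubernetes object.
type Resource struct {
	Kind   string
	Name   string
	Labels map[string]string
	Env    map[string]string
}

// Enforcer models EnforcePodGroupPolicy and EnforceMLPolicy style plugins.
type Enforcer func(info *RuntimeInfo, job TrainJob)

// Builder models a ComponentBuilder style plugin.
type Builder func(info RuntimeInfo, job TrainJob) []Resource

// runBuildPhase applies enforcers, builds resources, then deploys them.
// deploy stands in for the internal ComponentDeployer.
func runBuildPhase(info *RuntimeInfo, job TrainJob, enforcers []Enforcer,
	builders []Builder, deploy func(Resource) error) error {
	for _, enforce := range enforcers {
		enforce(info, job)
	}
	var resources []Resource
	for _, build := range builders {
		resources = append(resources, build(*info, job)...)
	}
	for _, r := range resources {
		if err := deploy(r); err != nil {
			return fmt.Errorf("deploying %s/%s: %w", r.Kind, r.Name, err)
		}
	}
	return nil
}

// enforceMLPolicy sets a hypothetical environment variable from the TrainJob.
func enforceMLPolicy(info *RuntimeInfo, job TrainJob) {
	info.Env["NUM_NODES"] = fmt.Sprint(job.NumNodes)
}

// enforcePodGroupPolicy sets a hypothetical label used for gang scheduling.
func enforcePodGroupPolicy(info *RuntimeInfo, job TrainJob) {
	info.Labels["example.com/pod-group"] = job.Name
}

// buildJobSet builds one illustrative child resource from the enforced info.
func buildJobSet(info RuntimeInfo, job TrainJob) []Resource {
	return []Resource{{Kind: "JobSet", Name: job.Name, Labels: info.Labels, Env: info.Env}}
}

func main() {
	info := &RuntimeInfo{Labels: map[string]string{}, Env: map[string]string{}}
	job := TrainJob{Name: "demo", NumNodes: 2}
	err := runBuildPhase(info, job,
		[]Enforcer{enforceMLPolicy, enforcePodGroupPolicy},
		[]Builder{buildJobSet},
		func(r Resource) error {
			fmt.Printf("deploy %s/%s env=%v labels=%v\n", r.Kind, r.Name, r.Env, r.Labels)
			return nil
		})
	if err != nil {
		fmt.Println("build phase failed:", err)
	}
}
```

The two example enforcers write to different keys, so swapping their order gives the same result, which is consistent with plugins being run in any order.
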
## Migration from Kubeflow Training V1

These API changes will not be compatible with Training Operator V1 APIs. Thus, existing users have

This is great, thank you for adding this @tenzen-y!
Could you please create a dedicated issue so we can update the Kubeflow Trainer operators docs with this guide?

Yeah, sure. In that case, where do we want to put this?
I am wondering if we should create Kubeflow Trainer > User Guide > PlatformDeveloper Guide.

I think we should add them under Operator Guides: https://www.kubeflow.org/docs/components/trainer/operator-guides/

Ah, I didn't find that. That sounds great.