KEP-2170: Update Training V2 APIs in the KEP #2240

Merged: 4 commits, Aug 30, 2024

Member

@andreyvelich andreyvelich commented Aug 28, 2024

I updated the APIs in the Training V2 KEP after our discussions in #2223.
Additionally, @tenzen-y and I discussed renaming the Exporter Job to Finalizer, since we might introduce other post-processing functionality later (in addition to the model exporter).

Please take a look.

/assign @kubeflow/wg-training-leads @shravan-achar @vsoch @kannon92

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@andreyvelich
Member Author

Part of: #2170

@coveralls
coveralls commented Aug 28, 2024

Pull Request Test Coverage Report for Build 10632842905

Details

  • 0 of 9 (0.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.02%) to 31.761%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| pkg/apis/kubeflow.org/v2alpha1/zz_generated.deepcopy.go | 0 | 9 | 0.0% |

| Totals | Coverage Status |
| --- | --- |
| Change from base Build 10600609425 | 0.02% |
| Covered Lines | 3945 |
| Relevant Lines | 12421 |

💛 - Coveralls

Member

@tenzen-y tenzen-y left a comment

Additionally, could we clarify the criteria for when the Initializer, Trainer, and Finalizer will start?
For example, if the Initializer fails, what happens? Will the Trainer and Finalizer start or not?

I have similar questions about the Trainer and Finalizer, so I think we should clarify the criteria for when each task starts.

docs/proposals/2170-kubeflow-training-v2/README.md (outdated)
@@ -819,49 +824,51 @@ orchestration (e.g. using Kubernetes admission webhooks or custom clients).
In the future, we can add more parameters if we find use-cases when it is required.

```golang
type PodSpecOverride struct {
// Name of the training replica in the training runtime template to override
type PodSpecOverrides struct {
```
Suggested change
- type PodSpecOverrides struct {
+ type PodSpecOverride struct {

Could we use the singular name instead of the plural, since we usually choose singular names for field types?
We can imagine the following fields:

PodSpecOverrides []PodSpecOverride `json:"podSpecOverrides"` 

vs

PodSpecOverrides []PodSpecOverrides `json:"podSpecOverrides"` 
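To make the naming point concrete, here is a minimal sketch of the suggested convention: the element type is singular and only the list field is plural. The `TrainJobSpec` wrapper, the `TargetJob` field, and the `encodeSpec` helper are hypothetical names used for illustration; they are not taken from the KEP.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PodSpecOverride is the singular element type, as suggested in the review.
// TargetJob is a hypothetical placeholder field, not the KEP's actual field.
type PodSpecOverride struct {
	TargetJob string `json:"targetJob"`
}

// TrainJobSpec is a trimmed, hypothetical stand-in for the spec that holds the overrides.
type TrainJobSpec struct {
	// The field name is plural because it holds a list; the element type stays singular.
	PodSpecOverrides []PodSpecOverride `json:"podSpecOverrides"`
}

// encodeSpec serializes the spec so the resulting JSON key can be inspected.
func encodeSpec(spec TrainJobSpec) (string, error) {
	out, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	spec := TrainJobSpec{
		PodSpecOverrides: []PodSpecOverride{
			{TargetJob: "trainer"},
			{TargetJob: "initializer"},
		},
	}
	encoded, err := encodeSpec(spec)
	if err != nil {
		panic(err)
	}
	fmt.Println(encoded)
}
```

With this shape, `[]PodSpecOverride` reads as "a list of overrides", whereas `[]PodSpecOverrides` would read as "a list of lists", which appears to be the ambiguity the comment is pointing at.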

// Name for the container.
// ContainerOverrides represents parameters that can be overridden using PodSpecOverrides.
// Parameters from the Trainer, DatasetConfig, and ModelConfig will take precedence.
type ContainerOverrides struct {
Suggested change
- type ContainerOverrides struct {
+ type ContainerOverride struct {

I would recommend the singular name based on the above comment.
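The doc comment in the hunk above says that parameters from the Trainer, DatasetConfig, and ModelConfig take precedence over container overrides. A rough sketch of what that precedence could look like for environment variables follows. The `Env` fields, the `TrainerConfig` type, and the `mergeEnv` helper are assumptions made for illustration; the excerpt does not show which fields the real types carry or how the controller merges them.

```go
package main

import (
	"fmt"
	"sort"
)

// ContainerOverride is a hypothetical, trimmed version of the type under review.
type ContainerOverride struct {
	Name string
	Env  map[string]string
}

// TrainerConfig is a hypothetical stand-in for the Trainer parameters.
type TrainerConfig struct {
	Env map[string]string
}

// mergeEnv copies the override values first and the trainer values second,
// so a key set in both places ends up with the trainer's value.
func mergeEnv(override ContainerOverride, trainer TrainerConfig) map[string]string {
	merged := make(map[string]string, len(override.Env)+len(trainer.Env))
	for k, v := range override.Env {
		merged[k] = v
	}
	for k, v := range trainer.Env {
		merged[k] = v
	}
	return merged
}

func main() {
	override := ContainerOverride{
		Name: "trainer",
		Env:  map[string]string{"LOG_LEVEL": "debug", "CACHE_DIR": "/tmp/cache"},
	}
	trainer := TrainerConfig{Env: map[string]string{"LOG_LEVEL": "info"}}

	merged := mergeEnv(override, trainer)

	// Sort keys so the printed order is deterministic.
	keys := make([]string, 0, len(merged))
	for k := range merged {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s=%s\n", k, merged[k])
	}
}
```

Here `LOG_LEVEL` resolves to the trainer's value, while `CACHE_DIR`, which only the override sets, is kept.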

@andreyvelich
Member Author

> Additionally, could we clarify the criteria for when the Initializer, Trainer, and Finalizer will start?
> For example, if the Initializer fails, what happens? Will the Trainer and Finalizer start or not?

Sure, can we do that in a follow-up PR once we agree on the Initializer and Finalizer structure? For example, an init container plus a sidecar container that is triggered by a preStop lifecycle hook.
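The start criteria are left open in this thread, so the following is only a sketch of one possible policy, not the agreed design: a strictly sequential gate in which a phase may start only if every earlier phase has succeeded. The `Phase` and `Status` types and the `canStart` helper are hypothetical.

```go
package main

import "fmt"

// Phase identifies one step of the hypothetical pipeline.
type Phase string

const (
	Initializer Phase = "Initializer"
	Trainer     Phase = "Trainer"
	Finalizer   Phase = "Finalizer"
)

// Status is a hypothetical outcome of a phase.
type Status string

const (
	Succeeded Status = "Succeeded"
	Failed    Status = "Failed"
)

// order lists the phases in the sequence they would run.
var order = []Phase{Initializer, Trainer, Finalizer}

// canStart reports whether a phase may start under a strict sequential policy:
// every earlier phase must have succeeded. This policy is an assumption.
func canStart(phase Phase, statuses map[Phase]Status) bool {
	for _, p := range order {
		if p == phase {
			return true
		}
		if statuses[p] != Succeeded {
			return false
		}
	}
	return false // unknown phase
}

func main() {
	// The Initializer has failed; nothing after it has run yet.
	statuses := map[Phase]Status{Initializer: Failed}
	for _, p := range order {
		fmt.Printf("%s can start: %v\n", p, canStart(p, statuses))
	}
}
```

Under this policy a failed Initializer blocks both the Trainer and the Finalizer. A different policy, for example letting the Finalizer run for cleanup after a failure, would need a different gate.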

andreyvelich and others added 2 commits August 30, 2024 13:33
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@andreyvelich andreyvelich force-pushed the update-api-v2-kep-2206 branch from 54bdbcc to 273af3c Compare August 30, 2024 12:39
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@andreyvelich
Member Author

@tenzen-y I've updated the PodSpecOverride API.

@tenzen-y
Member

> > Additionally, could we clarify the criteria for when the Initializer, Trainer, and Finalizer will start?
> > For example, if the Initializer fails, what happens? Will the Trainer and Finalizer start or not?
>
> Sure, can we do that in a follow-up PR once we agree on the Initializer and Finalizer structure? For example, an init container plus a sidecar container that is triggered by a preStop lifecycle hook.

@andreyvelich Sure. If so, can we treat the Initializer and Finalizer as a separate feature similar to PodSpecOverride: #2218?

Also, can we create a dedicated issue and work on the design and implementation there?

@andreyvelich
Member Author

We already have an issue for the Initializer job: #2210.
Let me create one for the Finalizer job.

@tenzen-y
Member

> We already have an issue for the Initializer job: #2210.
> Let me create one for the Finalizer job.

Sounds great. Thanks!
/lgtm
/approve

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

@google-oss-prow google-oss-prow bot merged commit 13c3ee8 into kubeflow:master Aug 30, 2024
39 checks passed
@andreyvelich andreyvelich deleted the update-api-v2-kep-2206 branch August 30, 2024 18:51
3 participants