KEP-2170: Add E2E tests for Kubeflow Training V2 #2213

andreyvelich · 2024-08-14T15:32:22Z

Related: #2170

We should add E2E tests for Kubeflow Training V2.

I was thinking we could add the following tests:

/test/e2e/initializer_v2 - Tests for our dataset and model initializer components.
/test/e2e/runtimes - Tests for the existing runtimes without any modification of TrainJob. E.g. the tests might look as follows:

job_id = TrainingClient().train(runtime_ref="torch-distributed")
TrainingClient().get_job_logs(job_id, follow=True)

/test/e2e/notebooks - Tests for our V2 examples with Papermill. I think the majority of our examples can be represented as Jupyter Notebooks, so Data Scientists can quickly run them.
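The notebook tests in the last item could be sketched roughly as follows. This is a minimal sketch, not the project's implementation: the helper names `find_notebooks` and `run_notebooks` and the injectable `execute` callable are hypothetical and introduced only for illustration. By default the sketch assumes Papermill's `execute_notebook(input_path, output_path)` function, which raises an error if a notebook cell fails.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Optional


def find_notebooks(root: Path) -> List[Path]:
    """Return all notebooks under root, skipping Jupyter checkpoint copies."""
    return sorted(
        p for p in root.rglob("*.ipynb")
        if ".ipynb_checkpoints" not in p.parts
    )


def run_notebooks(
    notebooks: Iterable[Path],
    output_dir: Path,
    execute: Optional[Callable[[str, str], object]] = None,
) -> List[Path]:
    """Execute each notebook and return the paths of the executed copies.

    `execute` is a hypothetical injection point; when it is omitted,
    papermill.execute_notebook is used.
    """
    if execute is None:
        # Imported lazily so the helpers can be loaded without Papermill.
        import papermill
        execute = papermill.execute_notebook

    output_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for notebook in notebooks:
        # Note: notebooks that share a file name would overwrite each other here.
        out = output_dir / notebook.name
        execute(str(notebook), str(out))
        outputs.append(out)
    return outputs
```

A test in /test/e2e/notebooks could then call `run_notebooks(find_notebooks(examples_dir), tmp_dir)` and treat any raised exception as a failure, where `examples_dir` and `tmp_dir` are placeholders for wherever the examples and scratch output live.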

/area testing


github-actions · 2024-11-12T20:02:07Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

andreyvelich · 2024-11-15T12:08:30Z

/remove-lifecycle stale
/good-first-issue

google-oss-prow · 2024-11-15T12:08:32Z

@andreyvelich:
This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

/remove-lifecycle stale
/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Electronic-Waste · 2024-11-18T04:58:17Z

/assign

I can help with this. Please let me know if you have different plans, @kubeflow/wg-training-leads.

tenzen-y · 2024-11-28T15:33:09Z

RE: #2328 (comment)

> What do you mean by control-plane-specific E2E in the /test/e2e?

This refers to E2E tests on a Kind cluster that do not use the Python SDK.

I think the v1 API E2E tests have several significant problems. In particular, it is difficult to tell whether a failure comes from the SDK or from the controllers (controller, validations, mutations...).

So, we should implement two types of E2E tests: without the SDK and with the SDK.
However, the two should not share the same test cases. The controller tests focus only on the controllers, and the SDK tests focus only on SDK-specific cases.

andreyvelich · 2024-11-28T21:56:39Z

> This indicates the E2E testings with Kind cluster which does not use Python SDK

> But, those do not have same test cases. The controller testings focus only on controller testings and SDK testings focus only on SDK specific cases.

We will still use Kind with the Python SDK. We just use the SDK to trigger TrainJobs with the configuration that we want to test, so we don't need to store TrainJob YAMLs in our repository.
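Keeping the test configurations in code rather than in stored YAML could be sketched as follows. This is only an illustration under stated assumptions: `train(runtime_ref=...)` and `get_job_logs(job_id, follow=True)` are taken from the snippet in the issue description, while `TrainClient`, `FakeTrainingClient`, `CASES`, `run_case`, and the shape of the returned logs are hypothetical stand-ins so that the sketch runs without a cluster.

```python
from typing import Dict, List, Protocol


class TrainClient(Protocol):
    """The two SDK calls used by the tests (signatures are assumptions)."""

    def train(self, runtime_ref: str) -> str: ...

    def get_job_logs(self, job_id: str, follow: bool = False) -> List[str]: ...


class FakeTrainingClient:
    """Hypothetical in-memory stand-in for the SDK client; no cluster needed."""

    def __init__(self) -> None:
        self.jobs: Dict[str, str] = {}

    def train(self, runtime_ref: str) -> str:
        # Record which runtime the job was created from and return its ID.
        job_id = f"job-{len(self.jobs)}"
        self.jobs[job_id] = runtime_ref
        return job_id

    def get_job_logs(self, job_id: str, follow: bool = False) -> List[str]:
        return [f"started with runtime {self.jobs[job_id]}"]


# Test configurations live in code, so no TrainJob YAML has to be stored.
# Further cases would be appended to this list.
CASES: List[Dict[str, str]] = [{"runtime_ref": "torch-distributed"}]


def run_case(client: TrainClient, case: Dict[str, str]) -> List[str]:
    """Trigger one TrainJob from a configuration and return its logs."""
    job_id = client.train(**case)
    return client.get_job_logs(job_id, follow=True)
```

In a real suite the fake would be replaced by the SDK client pointed at the Kind cluster, and each entry in `CASES` would become one test case.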

> I think that v1 API E2Es have multiple significant problems, which is difficult to understand the failure errors place SDK v

Can you explain in more detail where using the Python SDK might not be sufficient to verify that the controller and webhooks are functionally correct?

andreyvelich added this to KEP-2170: Kubeflow Training V2 API Aug 14, 2024

google-oss-prow bot added the area/testing label Aug 14, 2024

andreyvelich mentioned this issue Aug 28, 2024

KEP-2170: Kubeflow Training V2 API #2170

Open

18 tasks

github-actions bot added the lifecycle/stale label Nov 12, 2024

google-oss-prow bot added good first issue help wanted and removed lifecycle/stale labels Nov 15, 2024

Electronic-Waste moved this from Todo to In Progress in KEP-2170: Kubeflow Training V2 API Nov 18, 2024

Electronic-Waste moved this from In Progress to Todo in KEP-2170: Kubeflow Training V2 API Nov 18, 2024

google-oss-prow bot assigned Electronic-Waste Nov 18, 2024

Electronic-Waste moved this from Todo to In Progress in KEP-2170: Kubeflow Training V2 API Nov 18, 2024

andreyvelich mentioned this issue Nov 28, 2024

KEP-2170: Add Torch Distributed Runtime #2328

Merged

andreyvelich changed the title ~~KEP-2170: Add E2E tests for TrainJob~~ KEP-2170: Add E2E tests for Kubeflow Training V2 Nov 28, 2024

andreyvelich added the release/2.0 label Dec 11, 2024
