KEP-2170: Add E2E tests for Kubeflow Training V2 #2213

Open
3 tasks
Tracked by #2170
andreyvelich opened this issue Aug 14, 2024 · 6 comments

Comments

@andreyvelich
Member

andreyvelich commented Aug 14, 2024

Related: #2170

We should add E2E tests for Kubeflow Training V2.

I was thinking about these tests that we can add:

  • /test/e2e/initializer_v2 - Tests for our dataset and model initializer components.
  • /test/e2e/runtimes - Tests for the existing runtimes without any modification of TrainJob. E.g. the tests might look as follows:
job_id = TrainingClient().train(runtime_ref="torch-distributed")
TrainingClient().get_job_logs(job_id, follow=True)
  • /test/e2e/notebooks - Tests for our V2 examples using Papermill. I think the majority of our examples can be represented as Jupyter Notebooks, so Data Scientists can quickly run them.
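
As a rough illustration of the /test/e2e/runtimes idea above, the following is a minimal sketch of how such a test could be shaped. It does not call the real SDK: `FakeTrainingClient` and `run_runtime_e2e` are hypothetical names introduced here, and only the `train(runtime_ref=...)` and `get_job_logs(job_id, follow=True)` call shapes are taken from the snippet in this issue. The real V2 client and its return values may differ.

```python
# Hypothetical stand-in for the V2 SDK client. Only the method names and
# arguments mirror the snippet in this issue; everything else is assumed.
class FakeTrainingClient:
    def __init__(self):
        self._logs = {}

    def train(self, runtime_ref):
        # Pretend to create a TrainJob from an unmodified runtime.
        job_id = f"{runtime_ref}-job"
        self._logs[job_id] = [f"started {runtime_ref}", "training complete"]
        return job_id

    def get_job_logs(self, job_id, follow=False):
        # The fake returns stored lines; `follow` is accepted but ignored.
        return list(self._logs[job_id])


def run_runtime_e2e(client, runtime_ref):
    """Submit a job against an existing runtime and return its id and logs."""
    job_id = client.train(runtime_ref=runtime_ref)
    logs = client.get_job_logs(job_id, follow=True)
    return job_id, logs


job_id, logs = run_runtime_e2e(FakeTrainingClient(), "torch-distributed")
assert logs[-1] == "training complete"
```

In an actual E2E run, the fake client would be replaced by the SDK client pointed at a test cluster, and the final assertion would check whatever completion signal the job really emits.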

/area testing


This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@andreyvelich
Member Author

/remove-lifecycle stale
/good-first-issue


@andreyvelich:
This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

/remove-lifecycle stale
/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Electronic-Waste
Member

/assign

I can help with this. Please let me know if you have different plans, @kubeflow/wg-training-leads.

@Electronic-Waste Electronic-Waste moved this from Todo to In Progress in KEP-2170: Kubeflow Training V2 API Nov 18, 2024
@andreyvelich andreyvelich changed the title KEP-2170: Add E2E tests for TrainJob KEP-2170: Add E2E tests for Kubeflow Training V2 Nov 28, 2024
@tenzen-y
Member

RE: #2328 (comment)

What do you mean by control-plane-specific E2E in the /test/e2e ?

This refers to E2E tests that run on a Kind cluster and do not use the Python SDK.

I think the v1 API E2Es have several significant problems. In particular, it is difficult to tell where a failure comes from: the SDK or the controllers (controller, validation, mutation, ...).

So, we should implement two types of E2Es: tests without the SDK and tests with the SDK.
However, they should not share the same test cases. The controller tests should focus only on controller behavior, and the SDK tests should focus only on SDK-specific cases.

@andreyvelich
Member Author

This indicates the E2E testings with Kind cluster which does not use Python SDK
But, those do not have same test cases. The controller testings focus only on controller testings and SDK testings focus only on SDK specific cases.

We will still use Kind with the Python SDK. We just use the SDK to trigger TrainJobs with the configuration that we want to test, so we don't need to store TrainJob YAMLs in our repository.
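
One way to picture keeping the TrainJob configurations in code rather than in YAML files is a small table of test cases that drives the SDK. This is only a sketch under assumptions: `TrainJobCase`, `fake_submit`, and `check_cases` are hypothetical names, and the fake submit function stands in for an SDK call against the Kind cluster; it is not how the SDK or the webhooks actually report errors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainJobCase:
    # Hypothetical description of one E2E scenario, kept in Python, not YAML.
    name: str
    runtime_ref: str
    expect_accepted: bool


CASES = [
    TrainJobCase("existing runtime", "torch-distributed", True),
    TrainJobCase("unknown runtime", "does-not-exist", False),
]


def fake_submit(runtime_ref):
    # Stand-in for an SDK call; assumes a rejected job surfaces as ValueError.
    if runtime_ref != "torch-distributed":
        raise ValueError(f"runtime {runtime_ref!r} not found")
    return f"{runtime_ref}-job"


def check_cases(submit, cases):
    """Run every case and record whether the outcome matched expectations."""
    outcomes = {}
    for case in cases:
        try:
            submit(case.runtime_ref)
            accepted = True
        except ValueError:
            accepted = False
        outcomes[case.name] = accepted == case.expect_accepted
    return outcomes


outcomes = check_cases(fake_submit, CASES)
assert all(outcomes.values())
```

Adding a scenario then means adding one entry to `CASES` instead of committing another manifest.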

I think that v1 API E2Es have multiple significant problems, which is difficult to understand the failure errors place SDK v

Can you explain in more detail where using the Python SDK might not be sufficient to verify that the controller and webhooks are functionally correct?
