Add Kubernetes Support #180

Merged
merged 10 commits into from
Sep 3, 2024

TaekyungHeo (Member) commented Aug 21, 2024

Summary

  • Added Kubernetes support with KubernetesSystem, KubernetesJob, and KubernetesRunner classes.
  • Implemented JsonGenStrategy for generating Kubernetes job specs, including SleepKubernetesJsonGenStrategy.
  • Introduced KubernetesInstaller for managing Kubernetes installations.
  • Updated Registry and Tests to include Kubernetes-specific components.

This PR builds upon @wpeng102's work in PR #165.
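
For readers who want a feel for the JSON generation step, here is a minimal sketch that reproduces the Job spec printed in the sleep test log below. The class names come from this PR, but the `gen_json` method name, its parameters, and the `sanitize_k8s_name` helper are assumptions made for illustration; the actual implementation in the PR may differ.

```python
import re
from abc import ABC, abstractmethod
from typing import Any, Dict


def sanitize_k8s_name(name: str) -> str:
    """Hypothetical helper: make a test name usable as a Kubernetes object name.

    Lowercases the name and replaces anything outside [a-z0-9-] with '-',
    so 'Tests.1' becomes 'tests-1' as seen in the log.
    """
    return re.sub(r"[^a-z0-9-]", "-", name.lower()).strip("-")


class JsonGenStrategy(ABC):
    """Assumed interface: build a Kubernetes Job spec for one test."""

    @abstractmethod
    def gen_json(self, job_name: str, namespace: str, seconds: int) -> Dict[str, Any]:
        ...


class SleepKubernetesJsonGenStrategy(JsonGenStrategy):
    """Sketch of a strategy that runs `sleep <seconds>` in a single container."""

    def gen_json(self, job_name: str, namespace: str, seconds: int) -> Dict[str, Any]:
        return {
            "apiVersion": "batch/v1",
            "kind": "Job",
            "metadata": {"name": sanitize_k8s_name(job_name), "namespace": namespace},
            "spec": {
                # Ask Kubernetes to clean up the Job as soon as it finishes.
                "ttlSecondsAfterFinished": 0,
                "template": {
                    "spec": {
                        "containers": [
                            {
                                "args": [f"sleep {seconds}"],
                                "command": ["/bin/bash", "-c"],
                                "image": "ubuntu:22.04",
                                "name": "task",
                            }
                        ],
                        "restartPolicy": "Never",
                    }
                },
            },
        }


spec = SleepKubernetesJsonGenStrategy().gen_json("Tests.1", "default", 10)
```

The resulting `spec` dictionary has the same shape as the `Generated JSON string for test Tests.1` entry in the log.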

Test Plan

  • Install
$ cloudai --mode install --system-config conf/system/kubernetes_cluster.toml --test-templates-dir conf/test_template --tests-dir conf/test/ --test-scenario conf/test_scenario/nccl_test.toml

[INFO] System configuration file: conf/system/kubernetes_cluster.toml
[INFO] Test templates directory: conf/test_template
[INFO] Tests directory: conf/test
[INFO] Test scenario file: conf/test_scenario/nccl_test.toml
[INFO] Output directory: None
[WARNING] No InstallStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No CommandGenStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No JsonGenStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No JobIdRetrievalStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No JobStatusRetrievalStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No ReportGenerationStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No GradingStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No InstallStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No CommandGenStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No JsonGenStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No JobIdRetrievalStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No JobStatusRetrievalStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No ReportGenerationStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No GradingStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No InstallStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No CommandGenStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No JsonGenStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No JobIdRetrievalStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No JobStatusRetrievalStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No ReportGenerationStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No GradingStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No InstallStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No CommandGenStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No JobIdRetrievalStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No ReportGenerationStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No GradingStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No InstallStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No CommandGenStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No JobIdRetrievalStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No ReportGenerationStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No GradingStrategy found for TestTemplateParser and KubernetesSystem
[INFO] System Name: kubernetes-cluster
[INFO] Scheduler: kubernetes
[INFO] CloudAI is already installed.
  • Sleep test scenario
# cloudai --mode run --system-config conf/system/kubernetes_cluster.toml --test-templates-dir conf/test_template --tests-dir conf/test/ --test-scenario conf/test_scenario/sleep.toml

[INFO] System configuration file: conf/system/kubernetes_cluster.toml
[INFO] Test templates directory: conf/test_template
[INFO] Tests directory: conf/test
[INFO] Test scenario file: conf/test_scenario/sleep.toml
[INFO] Output directory: None
[WARNING] No InstallStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No CommandGenStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No JsonGenStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No JobIdRetrievalStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No JobStatusRetrievalStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No ReportGenerationStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No GradingStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No InstallStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No CommandGenStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No JsonGenStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No JobIdRetrievalStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No JobStatusRetrievalStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No ReportGenerationStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No GradingStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No InstallStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No CommandGenStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No JsonGenStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No JobIdRetrievalStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No JobStatusRetrievalStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No ReportGenerationStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No GradingStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No InstallStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No CommandGenStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No JobIdRetrievalStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No ReportGenerationStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No GradingStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No CommandGenStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No JobIdRetrievalStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No ReportGenerationStrategy found for TestTemplateParser and KubernetesSystem
[WARNING] No GradingStrategy found for TestTemplateParser and KubernetesSystem
[INFO] System Name: kubernetes-cluster
[INFO] Scheduler: kubernetes
[INFO] Test Scenario Name: test_scenario_example
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: test_scenario_example

Section Name: Tests.1
  Test Name: sleep_10
  Description: sleep_10
  No dependencies
Section Name: Tests.2
  Test Name: sleep_5
  Description: sleep_5
  Start Post Init: Tests.1, Time: 5 seconds
Section Name: Tests.3
  Test Name: sleep_5
  Description: sleep_5
  Start Post Comp: Tests.1, Time: 0 seconds
Section Name: Tests.4
  Test Name: sleep_20
  Description: sleep_20
  End Post Comp: Tests.1, Time: 5 seconds
[INFO] Initializing Runner
[INFO] Creating KubernetesRunner
[INFO] Starting test scenario execution.
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
[INFO] Generated JSON string for test Tests.1: {'apiVersion': 'batch/v1', 'kind': 'Job', 'metadata': {'name': 'tests-1', 'namespace': 'default'}, 'spec': {'ttlSecondsAfterFinished': 0, 'template': {'spec': {'containers': [{'args': ['sleep 10'], 'command': ['/bin/bash', '-c'], 'image': 'ubuntu:22.04', 'name': 'task'}], 'restartPolicy': 'Never'}}}}
[INFO] Starting test: Tests.4
[INFO] Running test: Tests.4
[INFO] Generated JSON string for test Tests.4: {'apiVersion': 'batch/v1', 'kind': 'Job', 'metadata': {'name': 'tests-4', 'namespace': 'default'}, 'spec': {'ttlSecondsAfterFinished': 0, 'template': {'spec': {'containers': [{'args': ['sleep 20'], 'command': ['/bin/bash', '-c'], 'image': 'ubuntu:22.04', 'name': 'task'}], 'restartPolicy': 'Never'}}}}
[INFO] Delayed start for test Tests.2 by 5 seconds.
[INFO] Starting test: Tests.2
[INFO] Running test: Tests.2
[INFO] Generated JSON string for test Tests.2: {'apiVersion': 'batch/v1', 'kind': 'Job', 'metadata': {'name': 'tests-2', 'namespace': 'default'}, 'spec': {'ttlSecondsAfterFinished': 0, 'template': {'spec': {'containers': [{'args': ['sleep 5'], 'command': ['/bin/bash', '-c'], 'image': 'ubuntu:22.04', 'name': 'task'}], 'restartPolicy': 'Never'}}}}
[INFO] Job completed: Tests.1
[INFO] Delayed start for test Tests.3 by 0 seconds.
[INFO] Starting test: Tests.3
[INFO] Running test: Tests.3
[INFO] Generated JSON string for test Tests.3: {'apiVersion': 'batch/v1', 'kind': 'Job', 'metadata': {'name': 'tests-3', 'namespace': 'default'}, 'spec': {'ttlSecondsAfterFinished': 0, 'template': {'spec': {'containers': [{'args': ['sleep 5'], 'command': ['/bin/bash', '-c'], 'image': 'ubuntu:22.04', 'name': 'task'}], 'restartPolicy': 'Never'}}}}
[INFO] Scheduling termination of job tests-4 after 5 seconds.
[INFO] Job completed: Tests.2
[INFO] Job completed: Tests.4
[INFO] Job completed: Tests.3
[INFO] All test scenario results stored at: /root/cloudai/results/test_scenario_example_2024-08-09_14-17-22
[WARNING] Skipping directory '/root/cloudai/results/test_scenario_example_2024-08-09_14-17-22/Tests.1/0' for test 'sleep_10'
[WARNING] Skipping directory '/root/cloudai/results/test_scenario_example_2024-08-09_14-17-22/Tests.2/0' for test 'sleep_5'
[WARNING] Skipping directory '/root/cloudai/results/test_scenario_example_2024-08-09_14-17-22/Tests.3/0' for test 'sleep_5'
[WARNING] Skipping directory '/root/cloudai/results/test_scenario_example_2024-08-09_14-17-22/Tests.4/0' for test 'sleep_20'
[INFO] All test scenario execution attempts are complete. Please review the 'debug.log' file to confirm successful completion or to identify any issues.
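
To help read the ordering in the sleep scenario log above, the following is a rough sketch of how the three dependency types appear to behave, inferred only from that log. It is an idealized timeline, not the runner's actual scheduling code: the `SleepTest` dataclass, the `timeline` function, and the snake_case dependency names are all hypothetical, and real completion times depend on job startup and status polling.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class SleepTest:
    duration: int                   # seconds the job sleeps
    dep_type: Optional[str] = None  # assumed names: start_post_init, start_post_comp, end_post_comp
    dep_on: Optional[str] = None    # section name of the test this one depends on
    delay: int = 0                  # the "Time: N seconds" value from the scenario dump


def timeline(tests: Dict[str, SleepTest]) -> Dict[str, Tuple[int, int]]:
    """Return idealized (start, end) times in seconds for every test."""
    result: Dict[str, Tuple[int, int]] = {}

    def resolve(name: str) -> Tuple[int, int]:
        if name in result:
            return result[name]
        test = tests[name]
        if test.dep_type is None or test.dep_on is None:
            start, end = 0, test.duration
        else:
            dep_start, dep_end = resolve(test.dep_on)
            if test.dep_type == "start_post_init":
                # Start N seconds after the dependency has started.
                start = dep_start + test.delay
                end = start + test.duration
            elif test.dep_type == "start_post_comp":
                # Start N seconds after the dependency has completed.
                start = dep_end + test.delay
                end = start + test.duration
            elif test.dep_type == "end_post_comp":
                # Start immediately; terminate N seconds after the dependency completes.
                start = 0
                end = min(test.duration, dep_end + test.delay)
            else:
                raise ValueError(f"unknown dependency type: {test.dep_type}")
        result[name] = (start, end)
        return result[name]

    for name in tests:
        resolve(name)
    return result


scenario = {
    "Tests.1": SleepTest(10),
    "Tests.2": SleepTest(5, "start_post_init", "Tests.1", 5),
    "Tests.3": SleepTest(5, "start_post_comp", "Tests.1", 0),
    "Tests.4": SleepTest(20, "end_post_comp", "Tests.1", 5),
}
schedule = timeline(scenario)
```

Under these assumptions, Tests.4 starts together with Tests.1 and is cut short 5 seconds after Tests.1 completes, which matches the `Scheduling termination of job tests-4 after 5 seconds` entry in the log.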

TaekyungHeo and others added 7 commits August 29, 2024 16:03
Co-authored-by: Peng Wang <pengwang@nvidia.com>
Co-authored-by: Peng Wang <pengwang@nvidia.com>
Co-authored-by: Peng Wang <pengwang@nvidia.com>
Co-authored-by: Peng Wang <pengwang@nvidia.com>
Co-authored-by: Peng Wang <pengwang@nvidia.com>
Co-authored-by: Peng Wang <pengwang@nvidia.com>
Co-authored-by: Peng Wang <pengwang@nvidia.com>
TaekyungHeo (Member, Author) commented

@TaekyungHeo and @amaslenn's comment: Let's unblock first. Let's merge without updating this PR.

TaekyungHeo marked this pull request as ready for review September 3, 2024 12:02
TaekyungHeo (Member, Author) commented Sep 3, 2024

Resolving all comments to make forward progress, based on the discussion with @amaslenn.

The following comments should be resolved in follow-up PRs

srivatsankrishnan (Collaborator) left a comment

Left some minor comments. I see them as non-blockers for approval.

conf/common/system/kubernetes_cluster.toml (resolved)
src/cloudai/_core/test.py (resolved)
srinivas212 merged commit 1e6d618 into NVIDIA:main Sep 3, 2024
2 checks passed
Labels: feature, Oct24 (Oct'24 release feature)