Redefine Job, Runner, and System Responsibilities; Refactor Classes #176

Merged: 3 commits, Aug 29, 2024

@TaekyungHeo (Member) commented Aug 19, 2024

Summary

  • Moved job lifecycle methods (is_running, is_completed, kill) from runner to job classes.
  • Enhanced BaseJob with new methods for mode, system, and ID retrieval.
  • Fixed a bug in test scenario parsing.
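
The first two bullets can be pictured with a minimal sketch. It is not the code from this PR: the accessor names (`get_mode`, `get_system_name`, `get_id`), the `FakeJob` stand-in, and the `poll` helper are all hypothetical. Only `is_running`, `is_completed`, and `kill` are named in the summary above.

```python
from abc import ABC, abstractmethod
from typing import List


class BaseJob(ABC):
    """Hypothetical base class: the job, not the runner, owns its lifecycle."""

    def __init__(self, mode: str, system_name: str, job_id: int) -> None:
        self._mode = mode
        self._system_name = system_name
        self._job_id = job_id

    # Accessor names below are assumptions made for this sketch.
    def get_mode(self) -> str:
        return self._mode

    def get_system_name(self) -> str:
        return self._system_name

    def get_id(self) -> int:
        return self._job_id

    @abstractmethod
    def is_running(self) -> bool: ...

    @abstractmethod
    def is_completed(self) -> bool: ...

    @abstractmethod
    def kill(self) -> None: ...


class FakeJob(BaseJob):
    """In-memory stand-in for a scheduler-backed job (illustration only)."""

    def __init__(self, mode: str, system_name: str, job_id: int) -> None:
        super().__init__(mode, system_name, job_id)
        self._state = "RUNNING"

    def is_running(self) -> bool:
        return self._state == "RUNNING"

    def is_completed(self) -> bool:
        return self._state != "RUNNING"

    def kill(self) -> None:
        self._state = "CANCELLED"


def poll(jobs: List[BaseJob]) -> List[int]:
    """Runner-side check: ask each job for its own state, return finished IDs."""
    return [job.get_id() for job in jobs if job.is_completed()]
```

With this split, the runner only iterates over jobs and reacts to their state; anything scheduler-specific stays behind the job interface.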

Test Plan

1. CI
CI passes. I added new unit tests, but they do not cover every change in this PR. I did not extend them further for the following reasons:

  • The changes made in the Job classes are straightforward, and adding unit tests would add complexity without significantly improving coverage.
  • The changes made in the System classes cannot be fully tested with unit tests.

2. Ran on a server
Sleep

$ python cloudaix.py --mode run --system-config conf/v0.6/general/system/system.toml --test-templates-dir conf/v0.6/general/test_template --tests-dir conf/v0.6/general/test --test-scenario conf/v0.6/general/test_scenario/sleep.toml 
[INFO] System configuration file: conf/v0.6/general/system/system.toml
[INFO] Test templates directory: conf/v0.6/general/test_template
[INFO] Tests directory: conf/v0.6/general/test
[INFO] Test scenario file: conf/v0.6/general/test_scenario/sleep.toml
[INFO] Output directory: None
[INFO] System Name: System
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: test_scenario_example
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: test_scenario_example

Section Name: Tests.1
  Test Name: sleep_10
  Description: sleep_10
  No dependencies
Section Name: Tests.2
  Test Name: sleep_5
  Description: sleep_5
  Start Post Init: Tests.1, Time: 5 seconds
Section Name: Tests.3
  Test Name: sleep_5
  Description: sleep_5
  Start Post Comp: Tests.1, Time: 0 seconds
Section Name: Tests.4
  Test Name: sleep_20
  Description: sleep_20
  End Post Comp: Tests.1, Time: 5 seconds
[INFO] Initializing Runner
[INFO] Creating SlurmRunner
[INFO] Starting test scenario execution.
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
[INFO] Executing command for test Tests.1: sbatch /obfuscated/path/results/test_scenario_example_2024-08-23_22-46-25/Tests.1/0/cloudai_sbatch_script.sh
[INFO] Starting test: Tests.4
[INFO] Running test: Tests.4
[INFO] Executing command for test Tests.4: sbatch /obfuscated/path/results/test_scenario_example_2024-08-23_22-46-25/Tests.4/0/cloudai_sbatch_script.sh
[INFO] Delayed start for test Tests.2 by 5 seconds.
[INFO] Starting test: Tests.2
[INFO] Running test: Tests.2
[INFO] Executing command for test Tests.2: sbatch /obfuscated/path/results/test_scenario_example_2024-08-23_22-46-25/Tests.2/0/cloudai_sbatch_script.sh
[INFO] Job completed: Tests.1
[INFO] Delayed start for test Tests.3 by 0 seconds.
[INFO] Starting test: Tests.3
[INFO] Running test: Tests.3
[INFO] Executing command for test Tests.3: sbatch /obfuscated/path/results/test_scenario_example_2024-08-23_22-46-25/Tests.3/0/cloudai_sbatch_script.sh
[INFO] Scheduling termination of job 286816 after 5 seconds.
[INFO] Job completed: Tests.2
[INFO] Job completed: Tests.4
[INFO] Job completed: Tests.3
[INFO] All test scenario results stored at: /obfuscated/path/results/test_scenario_example_2024-08-23_22-46-25
[WARNING] Skipping directory '/obfuscated/path/results/test_scenario_example_2024-08-23_22-46-25/Tests.1/0' for test 'sleep_10'
[WARNING] Skipping directory '/obfuscated/path/results/test_scenario_example_2024-08-23_22-46-25/Tests.2/0' for test 'sleep_5'
[WARNING] Skipping directory '/obfuscated/path/results/test_scenario_example_2024-08-23_22-46-25/Tests.3/0' for test 'sleep_5'
[WARNING] Skipping directory '/obfuscated/path/results/test_scenario_example_2024-08-23_22-46-25/Tests.4/0' for test 'sleep_20'
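
The three dependency kinds printed in the scenario above (Start Post Init, Start Post Comp, End Post Comp) can be read as simple timing rules. The sketch below is one interpretation inferred from this log, not the scheduler logic in the repository. The `Dependency` class, the `schedule` function, and the kind strings are hypothetical, and the model ignores queueing delay on the cluster.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Dependency:
    kind: str      # "start_post_init", "start_post_comp", or "end_post_comp" (assumed names)
    upstream: str  # section name of the test being depended on, e.g. "Tests.1"
    delay: int     # seconds


def schedule(
    upstream_start: int, upstream_duration: int, dep: Optional[Dependency]
) -> Tuple[int, Optional[int]]:
    """Return (start_time, forced_end_time) in seconds since scenario start.

    forced_end_time is None when the test is allowed to run to completion.
    """
    if dep is None:
        return 0, None  # no dependencies: start right away
    upstream_end = upstream_start + upstream_duration
    if dep.kind == "start_post_init":
        # Start `delay` seconds after the upstream test is launched.
        return upstream_start + dep.delay, None
    if dep.kind == "start_post_comp":
        # Start `delay` seconds after the upstream test completes.
        return upstream_end + dep.delay, None
    if dep.kind == "end_post_comp":
        # Start without delay; terminate `delay` seconds after upstream completes.
        return 0, upstream_end + dep.delay
    raise ValueError(f"unknown dependency kind: {dep.kind}")
```

Under these assumptions, with Tests.1 (sleep_10) launched at t=0, Tests.2 starts at t=5, Tests.3 starts at t=10, and Tests.4 (sleep_20) starts at t=0 and is scheduled for termination at t=15. That ordering matches the launch and "Scheduling termination" lines in the log.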

NCCL test

$ python cloudaix.py --mode run --system-config conf/v0.6/general/system/system.toml --test-templates-dir conf/v0.6/general/test_template --tests-dir conf/v0.6/general/test --test-scenario conf/v0.6/general/test_scenario/nccl_test.toml
[INFO] System configuration file: conf/v0.6/general/system/system.toml
[INFO] Test templates directory: conf/v0.6/general/test_template
[INFO] Tests directory: conf/v0.6/general/test
[INFO] Test scenario file: conf/v0.6/general/test_scenario/nccl_test.toml
[INFO] Output directory: None
[INFO] System Name: System
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: nccl-test
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: nccl-test

Section Name: Tests.1
  Test Name: nccl_test_all_reduce
  Description: all_reduce
  No dependencies
[INFO] Initializing Runner
[INFO] Creating SlurmRunner
[INFO] Starting test scenario execution.
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
[INFO] Executing command for test Tests.1: sbatch /obfuscated/path/results/nccl-test_2024-08-23_22-50-48/Tests.1/0/cloudai_sbatch_script.sh
[INFO] Job completed: Tests.1
[INFO] All test scenario results stored at: /obfuscated/path/results/nccl-test_2024-08-23_22-50-48

@TaekyungHeo changed the title from "Refactor job" to "Refactor Job and Runner Classes" on Aug 19, 2024
@TaekyungHeo changed the title from "Refactor Job and Runner Classes" to "Refactor Job and Runner Classes, Fix Bug in Test Scenario Parsing" on Aug 20, 2024
@TaekyungHeo added the Oct24 label (Oct'24 release feature) on Aug 20, 2024
@TaekyungHeo force-pushed the refactor-job branch 2 times, most recently from d127bf8 to 7f18c90, on August 20, 2024 19:18
@TaekyungHeo changed the title from "Refactor Job and Runner Classes, Fix Bug in Test Scenario Parsing" to "Redefine Job, Runner, and System Responsibilities; Refactor Classes" on Aug 20, 2024
@TaekyungHeo marked this pull request as ready for review on August 23, 2024 19:58
Resolved review threads:
  • src/cloudai/_core/base_job.py (outdated)
  • src/cloudai/_core/base_job.py
  • src/cloudai/_core/base_runner.py
  • src/cloudai/_core/system.py
  • src/cloudai/_core/base_job.py
@TaekyungHeo (Member, Author) commented

Rebased to the main branch.

@srivatsankrishnan (Collaborator) left a comment

No major issues with merging this. As discussed in the call, I understand this is also blocking the K8s PR, which needs rebasing. LGTM.

@TaekyungHeo merged commit e0b3418 into NVIDIA:main on Aug 29, 2024
2 checks passed
Labels: enhancement (New feature or request), Oct24 (Oct'24 release feature)