Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Pydantic to verify System schemas #158

Merged
merged 29 commits into from
Sep 16, 2024
Merged

Use Pydantic to verify System schemas #158

merged 29 commits into from
Sep 16, 2024

Conversation

amaslenn
Copy link
Contributor

@amaslenn amaslenn commented Jul 22, 2024

Summary

Use Pydantic to verify System schemas.

  1. Removed System Parsers, added parse_system() into Parser class.
  2. Refactored SlurmSystem: introduced SlurmPartition and SlurmGroup classes to manage node lists.
  3. All Systems are supported now, including K8s.
  4. Added a new mode for verifying system schemas: verify-systems:
    cloudai --mode verify-systems --tests-dir s --test-templates-dir d --system-config conf/common/system

Test Plan

CI

Additional Notes

A message for release notes

We are working on schema improvements to simplify configs management and make them verifiable. This will help ensure that configs are correct before expensive runs on real hardware. Today we are enabling it for System configs.

  1. Added new command for verifying the configs: cloudai --mode verify-systems. --system-config can be a file or a directory to verify all configs in the directory.
  2. Slurm system config format was updated to take advantage of TOML features:
    [partitions]
    [partitions.partition_1]
    name = "partition_1"
    nodes = ["node-[001-100]"]
    
    [partitions.partition_2]
    name = "partition_2"
    nodes = ["node-[101-200]"]
    is now
    [[partitions]]
    name = "partition_1"
    nodes = ["node-[001-100]"]
    
    [[partitions]]
    name = "partition_2"
    nodes = ["node-[101-200]"]
    The same is for groups inside partitions.
  3. System parser objects were removed, this functionality is now handled by Pydantic.

@amaslenn amaslenn changed the title Am/pydantic system Use Pydantic to verify System schemas Jul 23, 2024
@srinivas212 srinivas212 added the Oct24 Oct'24 release feature label Jul 28, 2024
@amaslenn amaslenn marked this pull request as ready for review August 30, 2024 11:39
README.md Outdated Show resolved Hide resolved
conf/common/system/example_slurm_cluster.toml Show resolved Hide resolved
src/cloudai/__main__.py Show resolved Hide resolved
Copy link
Collaborator

@srivatsankrishnan srivatsankrishnan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall. The test plan only relies on CI/Unit tests. Is it also possible to test this on a simple slurm system in addition to CI? This is a big change in my opinion.

Without making modification to the system schemas in private repo, we cannot ideally use this feature. Should wait before this PR is also ready: https://github.com/Mellanox/cloudaix/pull/37?

@amaslenn
Copy link
Contributor Author

Is it also possible to test this on a simple slurm system in addition to CI? This is a big change in my opinion.

What kind of tests would you like to add? Currently, we load our slurm example system config only.

Should wait before this PR is also ready: Mellanox/cloudaix#37?

This PR requires changes in Cloud AI, it cannot be ready until we make a release.

Without making modification to the system schemas in private repo, we cannot ideally use this feature.

The PR above does this update for all System TOML where it is required. It is related to the first comment, but what tests would you add here to be sure it is safe?

@srivatsankrishnan
Copy link
Collaborator

What kind of tests would you like to add? Currently, we load our slurm example system config only.

We can try a simple verification (gpt or grok or even nccl single node test) that Daria is running in one additional internal cluster? Just an additional test other than CI. I assume we can modify the system schema she is using to use the Pydantic. I think your draft PR here can still be used to test this PR?

This PR requires changes in Cloud AI, it cannot be ready until we make a release.

Right, we usually install the a local copy of cloudAI to test features we add. We don't have to wait for a release right? Also, I assume we are also planning to add new features and want to ensure it doesn't break any features we have in the pipeline. Most of the changes in this PR looks transparent and I assume it won't break anything. If you have one example validation on simple verification other than CI that will helpful.

@amaslenn
Copy link
Contributor Author

We can try a simple verification (gpt or grok or even nccl single node test) that Daria is running in one additional internal cluster? Just an additional test other than CI. ...

This all comes to the single question: do we trust our tests or not.
I tend to trust our set of tests, it is pretty comprehensive.

My question to you @srivatsankrishnan is what do you think is missing in our CI so we don't rely on real HW verification as it is slow and costly. Our unit tests could provide much better coverage, we just need to add more tests where needed.

What additional coverage will real HW run provide? And is there a reason not to cover it with unit tests?

@srivatsankrishnan
Copy link
Collaborator

You are making a big change to the system schema and how it's getting processed within cloudAI and it's not about trusting the CI test or not. This PR needs to be aligned with your other draft PR and it would be good if we can test this those changes. You already have changes that one need to make in system schema. I think even a simple sleep test on a real slurm system would be helpful. If we have to rebase of this version, I want to have a validated system schema to begin with.

@amaslenn
Copy link
Contributor Author

I think even a simple sleep test on a real slurm system would be helpful.

I still don't get how this is better compared to all our CI tests. But anyway, I run one test:

cloudai --mode run --system-config conf/common/system/jazz.toml --test-templates-dir conf/common/test_template --tests-dir conf/common/test --test-scenario conf/common/test_scenario/sleep.toml
[INFO] System configuration file: conf/common/system/jazz.toml
[INFO] Test templates directory: conf/common/test_template
[INFO] Tests directory: conf/common/test
[INFO] Test scenario file: conf/common/test_scenario/sleep.toml
[INFO] Output directory: None
[WARNING] No JsonGenStrategy found for TestTemplateParser and SlurmSystem
[WARNING] No JsonGenStrategy found for TestTemplateParser and SlurmSystem
[WARNING] No JsonGenStrategy found for TestTemplateParser and SlurmSystem
[WARNING] No JsonGenStrategy found for TestTemplateParser and SlurmSystem
[WARNING] No JsonGenStrategy found for TestTemplateParser and SlurmSystem
[INFO] System Name: hpchead
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: test_scenario_example
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: test_scenario_example

Section Name: Tests.1
  Test Name: sleep_10
  Description: sleep_10
  No dependencies
[INFO] Initializing Runner
[INFO] Creating SlurmRunner
[INFO] Starting test scenario execution.
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
[INFO] Executing command for test Tests.1: sbatch results/test_scenario_example_2024-09-11_09-57-39/Tests.1/0/cloudai_sbatch_script.sh
[INFO] Job completed: Tests.1
[INFO] All test scenario results stored at: results/test_scenario_example_2024-09-11_09-57-39
[WARNING] Skipping directory 'results/test_scenario_example_2024-09-11_09-57-39/Tests.1/0' for test 'sleep_10'
[INFO] All test scenario execution attempts are complete. Please review the 'debug.log' file to confirm successful completion or to identify any issues.

(for the rest of your message, please see my comment in Cloud AI X PR)

Copy link
Collaborator

@srivatsankrishnan srivatsankrishnan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. LGTM

@amaslenn amaslenn merged commit 11c5592 into main Sep 16, 2024
2 checks passed
@amaslenn amaslenn deleted the am/pydantic-system branch September 16, 2024 08:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Oct24 Oct'24 release feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants