Use Pydantic to verify System schemas #158

amaslenn · 2024-07-22T17:41:17Z

Summary

Use Pydantic to verify System schemas.

Removed System Parsers; added parse_system() to the Parser class.
Refactored SlurmSystem: introduced SlurmPartition and SlurmGroup classes to manage node lists.
All Systems are supported now, including K8s.
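For readers unfamiliar with the approach, here is a minimal sketch of how SlurmPartition and SlurmGroup could look if they are modeled with Pydantic. The field names follow the TOML shown in the release notes (name, nodes, groups); the defaults, the `validate_partition` helper, and the `group_1` sample data are assumptions for illustration, and the actual classes in this PR may differ.

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


class SlurmGroup(BaseModel):
    """A named subset of nodes inside a partition (fields are assumed)."""

    name: str
    nodes: List[str]


class SlurmPartition(BaseModel):
    """A partition with its node list and optional groups (fields are assumed)."""

    name: str
    nodes: List[str] = Field(default_factory=list)
    groups: List[SlurmGroup] = Field(default_factory=list)


def validate_partition(data: dict) -> SlurmPartition:
    # Hypothetical helper: Pydantic raises ValidationError when a required
    # field is missing or a value has the wrong type.
    return SlurmPartition(**data)


partition = validate_partition(
    {
        "name": "partition_1",
        "nodes": ["node-[001-100]"],
        # "group_1" and its node range are made-up sample data.
        "groups": [{"name": "group_1", "nodes": ["node-[001-050]"]}],
    }
)
print(partition.name, [group.name for group in partition.groups])

try:
    validate_partition({"nodes": ["node-[001-100]"]})  # "name" is missing
except ValidationError as exc:
    print(f"invalid partition: {len(exc.errors())} error(s)")
```

The point of the sketch is that the schema is declared once and nested tables (groups inside partitions) are validated recursively, so no hand-written parser is needed per system type.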

Added a new mode for verifying system schemas, verify-systems:

cloudai --mode verify-systems --tests-dir s --test-templates-dir d --system-config conf/common/system

Test Plan

CI

Additional Notes

—

A message for release notes

We are working on schema improvements to simplify config management and make configs verifiable. This will help ensure that configs are correct before expensive runs on real hardware. Today we are enabling it for System configs.

Added a new command for verifying the configs: cloudai --mode verify-systems. --system-config can be a file or a directory; when it is a directory, all configs in it are verified.
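The file-or-directory behavior can be sketched roughly as follows. This is not the code from this PR: `SystemConfig`, `collect_tomls`, and `verify_systems` are hypothetical names, and the two-field schema is a deliberately minimal assumption (the real System models carry more fields). It assumes Python 3.11+ for `tomllib`, or the `tomli` backport on older interpreters.

```python
import sys
from pathlib import Path
from typing import List

from pydantic import BaseModel, ValidationError

try:
    import tomllib  # Python 3.11+
except ModuleNotFoundError:  # assumes the tomli backport is installed
    import tomli as tomllib


class SystemConfig(BaseModel):
    """Hypothetical minimal schema; the real System models have more fields."""

    name: str
    scheduler: str


def collect_tomls(path: Path) -> List[Path]:
    # A single file is verified as-is; a directory is scanned for *.toml files.
    return [path] if path.is_file() else sorted(path.glob("*.toml"))


def verify_systems(path: Path) -> int:
    """Return the number of configs that failed to parse or validate."""
    failures = 0
    for toml_path in collect_tomls(path):
        try:
            with toml_path.open("rb") as f:
                SystemConfig(**tomllib.load(f))
        except (tomllib.TOMLDecodeError, ValidationError) as exc:
            failures += 1
            print(f"FAILED {toml_path}: {exc}", file=sys.stderr)
    return failures
```

With this shape, `verify_systems(Path("conf/common/system"))` returns 0 when every config in the directory passes, which maps naturally onto a process exit code.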

The Slurm system config format was updated to take advantage of TOML features:

[partitions]
[partitions.partition_1]
name = "partition_1"
nodes = ["node-[001-100]"]

[partitions.partition_2]
name = "partition_2"
nodes = ["node-[101-200]"]

is now

[[partitions]]
name = "partition_1"
nodes = ["node-[001-100]"]

[[partitions]]
name = "partition_2"
nodes = ["node-[101-200]"]

The same applies to groups inside partitions.
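To make the difference between the two layouts concrete, the snippet below parses both with the standard-library TOML reader and prints the resulting Python structures. It only illustrates TOML semantics (a table of tables versus an array of tables) and assumes Python 3.11+ for `tomllib`, or the `tomli` backport on older interpreters; the variable names are made up.

```python
try:
    import tomllib  # Python 3.11+
except ModuleNotFoundError:  # assumes the tomli backport is installed
    import tomli as tomllib

OLD_FORMAT = """
[partitions]
[partitions.partition_1]
name = "partition_1"
nodes = ["node-[001-100]"]

[partitions.partition_2]
name = "partition_2"
nodes = ["node-[101-200]"]
"""

NEW_FORMAT = """
[[partitions]]
name = "partition_1"
nodes = ["node-[001-100]"]

[[partitions]]
name = "partition_2"
nodes = ["node-[101-200]"]
"""

old = tomllib.loads(OLD_FORMAT)["partitions"]
new = tomllib.loads(NEW_FORMAT)["partitions"]

# Old: a table of tables, so each partition name appears twice (key and "name").
print(type(old).__name__, list(old))  # dict ['partition_1', 'partition_2']
# New: an array of tables, so each partition is one list element.
print(type(new).__name__, [p["name"] for p in new])  # list ['partition_1', 'partition_2']
```

The list form avoids repeating the name as both a table key and a field, and it maps directly onto a list-typed field in a schema.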

System parser objects were removed; this functionality is now handled by Pydantic.

srivatsankrishnan

LGTM overall. The test plan only relies on CI/Unit tests. Is it also possible to test this on a simple slurm system in addition to CI? This is a big change in my opinion.

Without modifying the system schemas in the private repo, we cannot ideally use this feature. Should we wait until this PR is also ready: https://github.com/Mellanox/cloudaix/pull/37?

amaslenn · 2024-09-10T14:34:45Z

Is it also possible to test this on a simple slurm system in addition to CI? This is a big change in my opinion.

What kind of tests would you like to add? Currently, we load our slurm example system config only.

Should we wait until this PR is also ready: Mellanox/cloudaix#37?

This PR requires changes in Cloud AI, it cannot be ready until we make a release.

Without modifying the system schemas in the private repo, we cannot ideally use this feature.

The PR above makes this update for all System TOMLs where it is required. This relates to the first comment, but what tests would you add here to be sure it is safe?

srivatsankrishnan · 2024-09-10T17:24:57Z

What kind of tests would you like to add? Currently, we load our slurm example system config only.

We can try a simple verification (gpt or grok or even nccl single node test) that Daria is running in one additional internal cluster? Just an additional test other than CI. I assume we can modify the system schema she is using to use Pydantic. I think your draft PR here can still be used to test this PR?

This PR requires changes in Cloud AI, it cannot be ready until we make a release.

Right, we usually install a local copy of cloudAI to test features we add. We don't have to wait for a release, right? Also, I assume we are planning to add new features and want to ensure this doesn't break any features we have in the pipeline. Most of the changes in this PR look transparent and I assume they won't break anything. If you have one example of a simple verification other than CI, that would be helpful.

amaslenn · 2024-09-10T17:38:55Z

We can try a simple verification (gpt or grok or even nccl single node test) that Daria is running in one additional internal cluster? Just an additional test other than CI. ...

This all comes down to a single question: do we trust our tests or not?
I tend to trust our set of tests; it is pretty comprehensive.

My question to you, @srivatsankrishnan, is: what do you think is missing in our CI that makes us rely on real HW verification, which is slow and costly? Our unit tests could provide much better coverage; we just need to add more tests where needed.

What additional coverage will real HW run provide? And is there a reason not to cover it with unit tests?
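As an illustration of the kind of coverage unit tests can give for schema validation, here is a small sketch. `PartitionModel` is a hypothetical stand-in rather than a class from this PR, and the tests are plain functions with asserts so they run under pytest or when called directly.

```python
from typing import List

from pydantic import BaseModel, ValidationError


class PartitionModel(BaseModel):
    """Hypothetical stand-in for a partition schema."""

    name: str
    nodes: List[str]


def test_valid_partition_is_accepted() -> None:
    partition = PartitionModel(name="partition_1", nodes=["node-[001-100]"])
    assert partition.nodes == ["node-[001-100]"]


def test_missing_name_is_rejected() -> None:
    try:
        PartitionModel(nodes=["node-[001-100]"])
    except ValidationError as exc:
        # The first reported error should point at the missing "name" field.
        assert exc.errors()[0]["loc"] == ("name",)
    else:
        raise AssertionError("expected ValidationError")
```

Tests like these check that a malformed config is rejected before any job is submitted. They do not check behavior that only shows up on a real cluster.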

srivatsankrishnan · 2024-09-10T18:44:11Z

You are making a big change to the system schema and how it's processed within cloudAI, and it's not about trusting the CI tests or not. This PR needs to be aligned with your other draft PR, and it would be good if we could test those changes. You already have changes that one needs to make in the system schema. I think even a simple sleep test on a real slurm system would be helpful. If we have to rebase off this version, I want to have a validated system schema to begin with.

amaslenn · 2024-09-11T07:04:15Z

I think even a simple sleep test on a real slurm system would be helpful.

I still don't get how this is better than all our CI tests. But anyway, I ran one test:

cloudai --mode run --system-config conf/common/system/jazz.toml --test-templates-dir conf/common/test_template --tests-dir conf/common/test --test-scenario conf/common/test_scenario/sleep.toml
[INFO] System configuration file: conf/common/system/jazz.toml
[INFO] Test templates directory: conf/common/test_template
[INFO] Tests directory: conf/common/test
[INFO] Test scenario file: conf/common/test_scenario/sleep.toml
[INFO] Output directory: None
[WARNING] No JsonGenStrategy found for TestTemplateParser and SlurmSystem
[WARNING] No JsonGenStrategy found for TestTemplateParser and SlurmSystem
[WARNING] No JsonGenStrategy found for TestTemplateParser and SlurmSystem
[WARNING] No JsonGenStrategy found for TestTemplateParser and SlurmSystem
[WARNING] No JsonGenStrategy found for TestTemplateParser and SlurmSystem
[INFO] System Name: hpchead
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: test_scenario_example
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: test_scenario_example

Section Name: Tests.1
  Test Name: sleep_10
  Description: sleep_10
  No dependencies
[INFO] Initializing Runner
[INFO] Creating SlurmRunner
[INFO] Starting test scenario execution.
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
[INFO] Executing command for test Tests.1: sbatch results/test_scenario_example_2024-09-11_09-57-39/Tests.1/0/cloudai_sbatch_script.sh
[INFO] Job completed: Tests.1
[INFO] All test scenario results stored at: results/test_scenario_example_2024-09-11_09-57-39
[WARNING] Skipping directory 'results/test_scenario_example_2024-09-11_09-57-39/Tests.1/0' for test 'sleep_10'
[INFO] All test scenario execution attempts are complete. Please review the 'debug.log' file to confirm successful completion or to identify any issues.

(for the rest of your message, please see my comment in the Cloud AI X PR)

srivatsankrishnan

Thanks. LGTM

amaslenn added 16 commits July 22, 2024 14:41

Move Parser to a higher level

d7309e9

Dirty implementation of Pydantic models for Systems

c52baf7

Remove System Parsers

34b331e

Merge branch 'main' into am/pydantic-system

b99043c

Simplify code a bit

fa49ad9

Enable 'groups' parsing

264623a

Handle group name the same way as partition name

fb7cab9

Make ruff happy

775bda8

Add missing module

0b8a376

Fixes

e13c0b5

Add pydantic to requirements

9182274

Fix tests

903b970

Fix tests

cbdaf99

Update test, still fails

b073aea

Make it work

3762b89

Add missing file

6f90327

amaslenn changed the title from "Am/pydantic system" to "Use Pydantic to verify System schemas" Jul 23, 2024

srinivas212 added the Oct24 Oct'24 release feature label Jul 28, 2024

TaekyungHeo added the feature label Aug 29, 2024

amaslenn added 6 commits August 30, 2024 10:23

Merge branch 'main' into am/pydantic-system

5157524

Test all systems

5929468

Add mode for verifying system TOMLs

fcf4e37

Fixes

8186189

Make ruff happy

3fd6206

Extend testing

d7bf3dd

amaslenn marked this pull request as ready for review August 30, 2024 11:39

amaslenn requested review from TaekyungHeo, srivatsankrishnan and srinivas212 August 30, 2024 11:40

TaekyungHeo requested a review from artemry-nv September 3, 2024 16:42

amaslenn added 6 commits September 4, 2024 16:40

Merge branch 'main' into am/pydantic-system

2177598

Fixes

cc039c7

Update README

e3ab021

Merge branch 'main' into am/pydantic-system

50608bb

Merge branch 'main' into am/pydantic-system

8cc5381

Merge branch 'main' into am/pydantic-system

faf0397

TaekyungHeo reviewed Sep 10, 2024

README.md (outdated, resolved)

conf/common/system/example_slurm_cluster.toml (resolved)

src/cloudai/__main__.py (resolved)

Address review comments

73069fc

srivatsankrishnan reviewed Sep 10, 2024

TaekyungHeo approved these changes Sep 10, 2024

srivatsankrishnan approved these changes Sep 11, 2024

amaslenn merged commit 11c5592 into main Sep 16, 2024
2 checks passed

amaslenn deleted the am/pydantic-system branch September 16, 2024 08:15

amaslenn mentioned this pull request Sep 19, 2024

Introduce Pydantic to verify Test schema #145

Merged
