Use Pydantic to verify System schemas #158
Conversation
LGTM overall. The test plan relies only on CI/unit tests. Would it also be possible to test this on a simple Slurm system in addition to CI? This is a big change, in my opinion.
Without modifying the system schemas in the private repo, we cannot really use this feature. Should we wait until this PR is also ready: https://github.com/Mellanox/cloudaix/pull/37?
What kind of tests would you like to add? Currently, we load our slurm example system config only.
This PR requires changes in Cloud AI; it cannot be ready until we make a release.
The PR above makes this update for all System TOMLs where it is required. It is related to the first comment, but what tests would you add here to be sure it is safe?
We can try a simple verification (a gpt, grok, or even a single-node nccl test) that Daria is running on one additional internal cluster? Just an additional test other than CI. I assume we can modify the system schema she is using to use the Pydantic models. I think your draft PR here can still be used to test this PR?
Right, we usually install a local copy of cloudAI to test features we add. We don't have to wait for a release, right? Also, I assume we are planning to add new features and want to ensure this doesn't break any features we have in the pipeline. Most of the changes in this PR look transparent, and I assume it won't break anything. If you have one example validation or simple verification other than CI, that would be helpful.
This all comes down to a single question: do we trust our tests or not? My question to you @srivatsankrishnan is: what do you think is missing in our CI, such that we cannot rely on it instead of real HW verification, which is slow and costly? Our unit tests could provide much better coverage; we just need to add more tests where needed. What additional coverage would a real HW run provide? And is there a reason not to cover it with unit tests?
You are making a big change to the system schema and how it gets processed within cloudAI, and it's not about trusting the CI tests or not. This PR needs to be aligned with your other draft PR, and it would be good if we could test those changes together. You already have the changes that one needs to make in the system schema. I think even a simple sleep test on a real Slurm system would be helpful. If we have to rebase off this version, I want to have a validated system schema to begin with.
I still don't see how this is better than all our CI tests. But anyway, I ran one test: cloudai --mode run --system-config conf/common/system/jazz.toml --test-templates-dir conf/common/test_template --tests-dir conf/common/test --test-scenario conf/common/test_scenario/sleep.toml
[INFO] System configuration file: conf/common/system/jazz.toml
[INFO] Test templates directory: conf/common/test_template
[INFO] Tests directory: conf/common/test
[INFO] Test scenario file: conf/common/test_scenario/sleep.toml
[INFO] Output directory: None
[WARNING] No JsonGenStrategy found for TestTemplateParser and SlurmSystem
[WARNING] No JsonGenStrategy found for TestTemplateParser and SlurmSystem
[WARNING] No JsonGenStrategy found for TestTemplateParser and SlurmSystem
[WARNING] No JsonGenStrategy found for TestTemplateParser and SlurmSystem
[WARNING] No JsonGenStrategy found for TestTemplateParser and SlurmSystem
[INFO] System Name: hpchead
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: test_scenario_example
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: test_scenario_example
Section Name: Tests.1
Test Name: sleep_10
Description: sleep_10
No dependencies
[INFO] Initializing Runner
[INFO] Creating SlurmRunner
[INFO] Starting test scenario execution.
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
[INFO] Executing command for test Tests.1: sbatch results/test_scenario_example_2024-09-11_09-57-39/Tests.1/0/cloudai_sbatch_script.sh
[INFO] Job completed: Tests.1
[INFO] All test scenario results stored at: results/test_scenario_example_2024-09-11_09-57-39
[WARNING] Skipping directory 'results/test_scenario_example_2024-09-11_09-57-39/Tests.1/0' for test 'sleep_10'
[INFO] All test scenario execution attempts are complete. Please review the 'debug.log' file to confirm successful completion or to identify any issues.
(For the rest of your message, please see my comment in the Cloud AI X PR.)
Thanks. LGTM
Summary
Use Pydantic to verify System schemas.
- Moved parse_system() into the Parser class.
- SlurmSystem: introduced SlurmPartition and SlurmGroup classes to manage node lists.
- verify-systems: new mode to verify system configs.
Test Plan
- CI
Additional Notes
A message for release notes:
We are working on schema improvements to simplify config management and make configs verifiable. This will help ensure that configs are correct before expensive runs on real hardware. Today we are enabling it for System configs.
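For illustration, here is a minimal, hypothetical sketch of what Pydantic-based system verification can look like. The class names (SlurmSystem, SlurmPartition, SlurmGroup) mirror the ones introduced in this PR, but the fields and the verify_system() helper are assumptions for demonstration, not the actual cloudai code:

```python
# Hypothetical sketch of Pydantic-based system schema validation.
# Class names follow the PR; field names and helpers are assumptions.
from typing import List

from pydantic import BaseModel, ValidationError


class SlurmGroup(BaseModel):
    name: str
    nodes: List[str]


class SlurmPartition(BaseModel):
    name: str
    groups: List[SlurmGroup] = []


class SlurmSystem(BaseModel):
    name: str
    scheduler: str
    partitions: List[SlurmPartition]


def verify_system(data: dict) -> SlurmSystem:
    """Raise ValidationError early, before any expensive run on real HW."""
    return SlurmSystem(**data)


valid = {
    "name": "hpchead",
    "scheduler": "slurm",
    "partitions": [
        {"name": "main", "groups": [{"name": "g1", "nodes": ["node-[001-004]"]}]}
    ],
}
system = verify_system(valid)
print(system.partitions[0].groups[0].nodes)  # ['node-[001-004]']

try:
    verify_system({"name": "broken"})  # missing scheduler and partitions
except ValidationError as exc:
    print(f"rejected with {len(exc.errors())} errors")
```

The point of this pattern is that a malformed config fails fast with a precise error message, instead of surfacing as a job failure hours later on the cluster.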
Use cloudai --mode verify-systems. --system-config can be a file or a directory; in the latter case, all configs in the directory are verified.
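To make this concrete, a minimal Slurm system TOML of the kind this mode would check might look like the following. The field names here are illustrative guesses based on the log output above, not the exact cloudai schema:

```toml
# Hypothetical system config; actual cloudai field names may differ.
name = "hpchead"
scheduler = "slurm"

[[partitions]]
name = "main"

[[partitions.groups]]
name = "group1"
nodes = ["node-[001-004]"]
```

Pointing --system-config at this file verifies just this config; pointing it at the containing directory verifies every config inside.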