Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement NcclTestJobStatusRetrievalStrategy and add corresponding tests #53

Merged
merged 5 commits into from
May 31, 2024

Conversation

TaekyungHeo
Copy link
Member

@TaekyungHeo TaekyungHeo commented May 30, 2024

Summary

Implement NcclTestJobStatusRetrievalStrategy and add corresponding tests

Test Plan

  1. Added the following tests.
  • TestNcclTestJobStatusRetrievalStrategy
    • test_no_stdout_file
    • test_successful_job
    • test_failed_job
  1. Tested on a real system manually.
Test Scenario: nccl-test

Section Name: Tests.1
  Test Name: nccl_test_all_gather
  Description: all_gather
  No dependencies

Section Name: Tests.2
  Test Name: nccl_test_all_reduce
  Description: all_reduce
  Start Post Comp: Tests.1, Time: 0 seconds

Section Name: Tests.3
  Test Name: nccl_test_reduce_scatter
  Description: reduce_scatter
  Start Post Comp: Tests.2, Time: 0 seconds

Section Name: Tests.4
  Test Name: nccl_test_alltoall
  Description: alltoall
  Start Post Comp: Tests.3, Time: 0 seconds
[Kill Tests.1]
Job 260660 for test Tests.1 failed: Missing success indicators in stdout.txt: '# Out of bounds values', '# Avg bus bandwidth'. These keywords are expected to be present in stdout.txt, usually towards the end of the file. Please ensure the NCCL test ran to completion. You can run the generated sbatch script manually and check if stdout.txt is created and contains the expected keywords.

@TaekyungHeo TaekyungHeo force-pushed the failure-nccl-test branch 4 times, most recently from eb178f9 to 7ef6735 Compare May 30, 2024 19:53
@TaekyungHeo TaekyungHeo marked this pull request as ready for review May 30, 2024 20:31
@srinivas212 srinivas212 requested a review from amaslenn May 31, 2024 01:39
@TaekyungHeo TaekyungHeo force-pushed the failure-nccl-test branch 2 times, most recently from 1515c4b to a687059 Compare May 31, 2024 01:50
Copy link
Contributor

@amaslenn amaslenn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, everything is OK and compliant with our design.

But IMO we overuse Strategy pattern. For example, in this case we can put get_job_status(path) into a particular test implementation itself (src/cloudai/schema/test_template/nccl_test/template.py):

  1. By default, TestTemplate has no-op returning OK.
  2. Each test overrides it if needed. Because only test knows how to define it.
  3. We won't need registrations and assertion if such implementation exists. Second part might be especially difficult because one might notice it only while running tests. So we might even consider making it an abstract method to highlight its importance.

I'm not saying we change it right now, but I want to discuss it and better understand why you made it like this.

src/cloudai/_core/base_runner.py Outdated Show resolved Hide resolved
src/cloudai/_core/base_runner.py Show resolved Hide resolved
@TaekyungHeo
Copy link
Member Author

TaekyungHeo commented May 31, 2024

Thanks, @amaslenn. This PR actually depends on #46. Therefore, I assume these comments are for that PR.

I agree that we may overuse the strategy pattern. However, I would like to keep the current design for future use cases. Currently, it is used in a slurm system where the path is given directly. However, this may change, and we may need different implementations for each scheduler. For example, in Meta, we cannot access the file system directly because it has an internal remote file system called manifolds and buckets. Therefore, simply providing an output path may not work. Moreover, the job status may need to be retrieved with a special library or interface as they have an internal scheduler and interfaces. For these reasons, I am still actively using the strategy pattern.

@srinivas212 srinivas212 merged commit c10f541 into NVIDIA:main May 31, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants