Update NCCL Test Templates and Introduce Kubernetes-Specific Strategies (No Shared Dir / Mixin / Sep 3rd) #181

Closed
wants to merge 15 commits into from

TaekyungHeo
Member

@TaekyungHeo TaekyungHeo commented Aug 21, 2024

Summary

This PR depends on #180.

  • Refactoring: Separated the Kubernetes and Slurm strategies, including output readers and report generation.
  • Kubernetes Support: Added Kubernetes-specific strategies for NCCL tests, including:
    • KubernetesNcclTestGradingStrategy
    • KubernetesNcclTestJobStatusRetrievalStrategy
    • NcclTestKubernetesJsonGenStrategy
    • KubernetesNcclTestReportGenerationStrategy
  • NCCL Test Templates: Removed the "_mpi" suffix from subtest_name in all NCCL test templates.
  • Test Coverage: Added unit tests for new Kubernetes strategies and updated existing tests to accommodate changes.
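
For readers unfamiliar with the per-scheduler strategy split described above, here is a minimal sketch of the idea. Only `KubernetesNcclTestJobStatusRetrievalStrategy` is a name from this PR. The base class, the Slurm counterpart, the `STRATEGIES` registry, `get_strategy`, and the method signature are all hypothetical, and the success check (looking for the `# Avg bus bandwidth` summary line that nccl-tests prints) is an assumption, not necessarily what this PR implements.

```python
from abc import ABC, abstractmethod
from typing import Dict, Type

# Summary line printed by nccl-tests; treating it as the success signal is an assumption.
SUCCESS_MARKER = "# Avg bus bandwidth"


class JobStatusRetrievalStrategy(ABC):
    """Hypothetical base interface, not taken from the PR."""

    @abstractmethod
    def is_successful(self, captured_output: str) -> bool:
        """Return True if the captured output indicates a completed run."""


class SlurmNcclTestJobStatusRetrievalStrategy(JobStatusRetrievalStrategy):
    """Hypothetical Slurm counterpart; the class name is an assumption."""

    def is_successful(self, captured_output: str) -> bool:
        return SUCCESS_MARKER in captured_output


class KubernetesNcclTestJobStatusRetrievalStrategy(JobStatusRetrievalStrategy):
    """Class name from the PR summary; the body is illustrative only."""

    def is_successful(self, captured_output: str) -> bool:
        # Assumption: log lines may carry surrounding whitespace,
        # so each line is stripped before matching the marker.
        return any(
            line.strip().startswith(SUCCESS_MARKER)
            for line in captured_output.splitlines()
        )


# Hypothetical registry keyed by scheduler name.
STRATEGIES: Dict[str, Type[JobStatusRetrievalStrategy]] = {
    "slurm": SlurmNcclTestJobStatusRetrievalStrategy,
    "kubernetes": KubernetesNcclTestJobStatusRetrievalStrategy,
}


def get_strategy(scheduler: str) -> JobStatusRetrievalStrategy:
    """Instantiate the strategy registered for the given scheduler."""
    try:
        return STRATEGIES[scheduler]()
    except KeyError:
        raise ValueError(f"no strategy registered for scheduler {scheduler!r}") from None
```

Keeping one class per scheduler and role means scheduler-specific details stay out of the shared test definition, which is the separation the refactoring bullet describes.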

Test Plan

CI passes.

@TaekyungHeo TaekyungHeo added the Oct24 Oct'24 release feature label Aug 21, 2024
@TaekyungHeo TaekyungHeo changed the title K8s nccl Update NCCL test templates and introduce Kubernetes-specific strategies Aug 21, 2024
@TaekyungHeo TaekyungHeo changed the title Update NCCL test templates and introduce Kubernetes-specific strategies Update NCCL Test Templates and Introduce Kubernetes-Specific Strategies Aug 21, 2024
@TaekyungHeo TaekyungHeo force-pushed the k8s-nccl branch 2 times, most recently from c05f803 to 9e8eae3 Compare August 22, 2024 11:08
TaekyungHeo and others added 7 commits August 29, 2024 16:03
Co-authored-by: Peng Wang <pengwang@nvidia.com>
@TaekyungHeo TaekyungHeo changed the title Update NCCL Test Templates and Introduce Kubernetes-Specific Strategies Update NCCL Test Templates and Introduce Kubernetes-Specific Strategies (Sep 3rd) Sep 3, 2024
@TaekyungHeo TaekyungHeo changed the title Update NCCL Test Templates and Introduce Kubernetes-Specific Strategies (Sep 3rd) Update NCCL Test Templates and Introduce Kubernetes-Specific Strategies (No Shared Dir / Mixin / Sep 3rd) Sep 3, 2024
@TaekyungHeo TaekyungHeo closed this Sep 3, 2024
Labels
feature Oct24 Oct'24 release feature