Increase CI coverage for Gaudi2 #6728

Open: wants to merge 3 commits into base: master

76 changes: 47 additions & 29 deletions .github/workflows/hpu-gaudi2.yml
@@ -49,51 +49,69 @@ jobs:
TORCHINDUCTOR_COMPILE_THREADS: 1
TEST_LIST: |
test_accelerator.py
test_activation_checkpointing.py
Contributor

@raza-sikander - it looks like the tests pass now, but they take 2.5 hours to run. Should we move some of the tests to a cron job if they take that long? This yml file triggers on a lot of PRs, so a run that long means we may see queueing there.

Contributor Author

Hi @loadams, we can split it as you mentioned.
Can you share the process for creating a cron job, so I can split the test list in two?
Also, would that mean we have two jobs, one for CI and another that runs nightly?

Contributor

Hi @raza-sikander - yes, we would need to split this into 2 yml files, one for PRs and one for cron jobs, since they would run different sets of tests.

Here is a sample: the nv-nightly workflow runs on PRs only when they modify that workflow file, and otherwise runs once per day, set via the cron option. https://github.com/microsoft/DeepSpeed/blob/877aa0dba673c2aa2157029c28363b804d6ee03d/.github/workflows/nv-nightly.yml#L3C1-L9C24

For the current yml file, we would just remove the schedule section. It would continue to run on PRs, but with a smaller set of tests so that it finishes more quickly, ideally in the ~1 hour it takes now.
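
As a rough illustration of the split described above, the trigger section of a nightly workflow might look like the sketch below. The file name `hpu-gaudi2-nightly.yml`, the workflow name, and the cron time are assumptions made up for this example; they are not taken from this PR or copied from nv-nightly.yml. Only the `on:` block is shown, and the jobs (with the longer test list) are omitted.

```yaml
# Hypothetical file: .github/workflows/hpu-gaudi2-nightly.yml
# Name, path, and schedule are illustrative assumptions, not part of this PR.
name: hpu-gaudi2-nightly

on:
  # Allow manual runs from the Actions tab.
  workflow_dispatch:
  # Run on PRs only when this workflow file itself changes.
  pull_request:
    paths:
      - ".github/workflows/hpu-gaudi2-nightly.yml"
  # Otherwise run once per day (00:00 UTC in this sketch).
  schedule:
    - cron: "0 0 * * *"

# jobs: omitted here; the nightly job would carry the longer TEST_LIST,
# while the PR workflow keeps the shorter one and has no schedule section.
```

With this layout the PR workflow stays short enough to avoid queueing, and the slower tests still run daily.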

test_activation_checkpointing_non_reentrant.py
test_adamw.py
test_autocast.py
test_autotuning.py
test_bf16.py
test_coalesced_collectives.py
test_compression.py
test_csr.py
test_data.py
test_dist.py
test_elastic.py
test_ds_arguments.py
test_run.py
test_multinode_runner.py
test_ds_config_dict.py
test_ds_config_model.py
test_ds_initialize.py
test_dynamic_loss_scale.py
test_elastic.py
test_flops_profiler.py
test_fp16.py
test_get_optim_files.py
test_groups.py
test_hybrid_adam.py
test_ignore_unused_parameters.py
test_init_on_device.py
test_intX_quantization.py
test_latest_checkpoint.py
test_moe_checkpoint.py
test_moe_tp.py
test_monitor.py
(test_zero_optimizer.py and (TestSaveTensorClone or TestZeRONonDistributed))
(test_latest_checkpoint.py and test_missing_latest)
test_multi_output_model.py
test_multinode_runner.py
test_mup_optimizers.py
test_other_optimizer.py
test_partition.py
test_partition_balanced.py
test_pipe.py
test_pipe_module.py
test_pipe_schedule.py
test_pipeline.py
test_pld.py
test_reshape_checkpoint.py
test_run.py
test_runtime_utils.py
test_shared_weights.py
test_sparse.py
test_tag_validation.py
test_pipe_module.py
(test_flops_profiler.py and test_flops_profiler_in_inference)
test_get_optim_files.py
test_groups.py
test_partition_balanced.py
(test_adamw.py and TestAdamConfigs)
test_coalesced_collectives.py
test_activation_checkpointing_non_reentrant.py
test_activation_checkpointing.py
test_data.py
(test_ds_config_dict.py and (TestBasicConfig or TestBatchConfig))
test_ds_config_model.py
test_mup_optimizers.py
(test_pld.py and test_pld_schedule)
test_runtime_utils.py
test_pipe_schedule.py
test_topology.py
(test_ds_initialize.py and (TestClientOptimizer or TestClientLrScheduler))
test_csr.py
(test_fp16.py and (TestZeroEmptyGrad or TestZeroAllowUntestedOptimizer))
(test_bf16.py and TestZeroDtypeCocktail)
test_partition.py
test_ignore_unused_parameters.py
test_universal_checkpoint.py
test_user_args.py
test_zero.py
test_zero_config.py
test_zero_context.py
test_zero_context_ancestry.py
(test_zero_context.py and not TestSerialContext)
test_zero_context_return.py
test_zero_dynamic_class.py
test_zero_leaf_module.py
test_zero_nesting_init.py
test_zero_offloadpp.py
test_zero_optimizer.py
test_zero_tiled.py
test_zeropp.py
(test_zero.py and (TestZero3ParamPartitioningLargeParam or TestZero3ParamPartitioningLargeParam))



# Steps represent a sequence of tasks that will be executed as part of the job
steps: