warning msg/documentation on the tf32 related system flags and usage #6754

wyli · 2023-07-21T17:38:30Z

(Follow-up of #6525) My larger concern is that other operations in MONAI will also be affected by the tf32 issue (since every operation that uses cuda.matmul is affected). This may lead to significant reproducibility issues.

My proposal is to add something like
https://github.com/Lightning-AI/lightning/pull/16037/files#diff-909e246d6c36514f952ae5023bd9fbcc3e8f2c6a0837ebf81d7dc96790b5f938R190-R210
to the related classes/functions in MONAI. Then MONAI will print warnings when the flag is True. I'm not sure when it is best to print the warnings, maybe during import? The warnings could perhaps be suppressed when the flag is explicitly set by the user, but that seems technically challenging.
In addition, add a part to the documentation to educate users on how to use tf32 properly.
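
A minimal sketch of the kind of check proposed above, assuming the warning is driven by environment variables. The helper names (`find_tf32_env_flags`, `warn_on_tf32_env_flags`) are hypothetical and are not MONAI's API; the two variable names are the ones discussed later in this thread. The sketch uses only the standard library, so it does not inspect `torch.backends.cuda.matmul.allow_tf32`.

```python
import os
import warnings

# Environment variables named in this thread as possibly enabling TF32.
TF32_ENV_FLAGS = ("TORCH_ALLOW_TF32_CUBLAS_OVERRIDE", "NVIDIA_TF32_OVERRIDE")


def find_tf32_env_flags(environ=None):
    """Return {name: value} for each TF32-related variable that is set to "1"."""
    environ = os.environ if environ is None else environ
    return {name: environ[name] for name in TF32_ENV_FLAGS if environ.get(name) == "1"}


def warn_on_tf32_env_flags(environ=None):
    """Emit one UserWarning per variable found and return the findings."""
    found = find_tf32_env_flags(environ)
    for name, value in found.items():
        warnings.warn(
            f"Environment variable `{name} = {value}` is set and may enable TF32 mode.",
            UserWarning,
            stacklevel=2,
        )
    return found
```

Passing a plain dict instead of `os.environ` keeps the check easy to exercise without touching the real environment.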

Originally posted by @qingpeng9802 in #6525 (comment)

qingpeng9802 · 2023-07-25T19:27:04Z

Also, I found a related part in the repo, FYI:

MONAI/tests/utils.py

Lines 173 to 198 in 2800a76

    def is_tf32_env():
        """
        The environment variable NVIDIA_TF32_OVERRIDE=0 will override any defaults
        or programmatic configuration of NVIDIA libraries, and consequently,
        cuBLAS will not accelerate FP32 computations with TF32 tensor cores.
        """
        global _tf32_enabled
        if _tf32_enabled is None:
            _tf32_enabled = False
            if (
                torch.cuda.is_available()
                and not version_leq(f"{torch.version.cuda}", "10.100")
                and os.environ.get("NVIDIA_TF32_OVERRIDE", "1") != "0"
                and torch.cuda.device_count() > 0  # at least 11.0
            ):
                try:
                    # with TF32 enabled, the speed is ~8x faster, but the precision has ~2 digits less in the result
                    g_gpu = torch.Generator(device="cuda")
                    g_gpu.manual_seed(2147483647)
                    a_full = torch.randn(1024, 1024, dtype=torch.double, device="cuda", generator=g_gpu)
                    b_full = torch.randn(1024, 1024, dtype=torch.double, device="cuda", generator=g_gpu)
                    _tf32_enabled = (a_full.float() @ b_full.float() - a_full @ b_full).abs().max().item() > 0.001  # 0.1713
                except BaseException:
                    pass
            print(f"tf32 enabled: {_tf32_enabled}")
        return _tf32_enabled
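
To give a feel for the precision loss that the comment in the snippet refers to: TF32 keeps float32's 8-bit exponent but only 10 of its 23 mantissa bits. The sketch below emulates that on the CPU by zeroing the low 13 mantissa bits of a float32 value. `to_tf32` and `relative_error` are hypothetical helpers written for illustration, and truncation is a simplification; the rounding actually applied on tensor cores may differ.

```python
import struct


def to_tf32(x: float) -> float:
    """Emulate TF32 storage: round to float32, then drop the low 13 mantissa bits."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    bits &= 0xFFFFE000  # keep sign (1 bit) + exponent (8 bits) + top 10 mantissa bits
    (y,) = struct.unpack("<f", struct.pack("<I", bits))
    return y


def relative_error(x: float) -> float:
    """Relative error introduced by the emulated truncation (x must be nonzero)."""
    return abs(to_tf32(x) - x) / abs(x)
```

Under this emulation the relative error of a single value stays below 2**-10 (about 1e-3), which is on the same scale as the `0.001` threshold used in the check above.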

about #6754.

### Types of changes

- [x] Non-breaking change (fix or new feature that would not break existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Integration tests passed locally by running `./runtests.sh -f -u --net --coverage`.
- [ ] Quick tests passed locally by running `./runtests.sh --quick --unittests --disttests`.
- [ ] In-line docstrings updated.
- [x] Documentation updated, tested `make html` command in the `docs/` folder.

---------

Signed-off-by: Qingpeng Li <qingpeng9802@gmail.com>

about #6754.

### Description

Show a warning if anything that may enable tf32 is detected.

### Types of changes

- [x] Non-breaking change (fix or new feature that would not break existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Integration tests passed locally by running `./runtests.sh -f -u --net --coverage`.
- [ ] Quick tests passed locally by running `./runtests.sh --quick --unittests --disttests`.
- [x] In-line docstrings updated.
- [ ] Documentation updated, tested `make html` command in the `docs/` folder.

---------

Signed-off-by: Qingpeng Li <qingpeng9802@gmail.com>

myron · 2023-08-15T07:08:45Z

@qingpeng9802 @wyli

Guys, I understand what you're trying to do, but I train on multiple GPUs and the screen fills up with warnings, which is a bit overwhelming.

a) Is there a way to disable these warnings? (I do know that TF32 is enabled.)
b) Does this new check introduce some overhead? It seems like every process in DDP ran it separately.

/usr/local/lib/python3.10/dist-packages/monai/utils/tf32.py:76: UserWarning: Environment variable `TORCH_ALLOW_TF32_CUBLAS_OVERRIDE = 1` is set.
  This environment variable may enable TF32 mode accidentally and affect precision.
  See https://docs.monai.io/en/latest/precision_accelerating.html#precision-and-accelerating
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/monai/utils/tf32.py:76: UserWarning: Environment variable `TORCH_ALLOW_TF32_CUBLAS_OVERRIDE = 1` is set.
  This environment variable may enable TF32 mode accidentally and affect precision.
  See https://docs.monai.io/en/latest/precision_accelerating.html#precision-and-accelerating
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/monai/utils/tf32.py:76: UserWarning: Environment variable `TORCH_ALLOW_TF32_CUBLAS_OVERRIDE = 1` is set.
  This environment variable may enable TF32 mode accidentally and affect precision.
  See https://docs.monai.io/en/latest/precision_accelerating.html#precision-and-accelerating
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/monai/utils/tf32.py:76: UserWarning: Environment variable `TORCH_ALLOW_TF32_CUBLAS_OVERRIDE = 1` is set.
  This environment variable may enable TF32 mode accidentally and affect precision.
  See https://docs.monai.io/en/latest/precision_accelerating.html#precision-and-accelerating
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/monai/utils/tf32.py:76: UserWarning: Environment variable `TORCH_ALLOW_TF32_CUBLAS_OVERRIDE = 1` is set.
  This environment variable may enable TF32 mode accidentally and affect precision.
  See https://docs.monai.io/en/latest/precision_accelerating.html#precision-and-accelerating
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/monai/utils/tf32.py:76: UserWarning: Environment variable `TORCH_ALLOW_TF32_CUBLAS_OVERRIDE = 1` is set.
  This environment variable may enable TF32 mode accidentally and affect precision.
  See https://docs.monai.io/en/latest/precision_accelerating.html#precision-and-accelerating
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/monai/utils/tf32.py:76: UserWarning: Environment variable `TORCH_ALLOW_TF32_CUBLAS_OVERRIDE = 1` is set.
  This environment variable may enable TF32 mode accidentally and affect precision.
  See https://docs.monai.io/en/latest/precision_accelerating.html#precision-and-accelerating
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/monai/utils/tf32.py:76: UserWarning: Environment variable `TORCH_ALLOW_TF32_CUBLAS_OVERRIDE = 1` is set.
  This environment variable may enable TF32 mode accidentally and affect precision.
  See https://docs.monai.io/en/latest/precision_accelerating.html#precision-and-accelerating

qingpeng9802 · 2023-08-15T07:55:07Z

a) There is currently no way to disable it. Maybe we can add an environment variable such as MONAI_ALLOW_TF32, like other libraries did.
b) I'm not sure how the code is called. My guess is that each subprocess imports monai once here. If my guess is correct, the overhead should be okay.

Could you provide the code snippets related to import monai and DDP?
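
Independent of any MONAI-side switch, Python's standard `warnings` filter can hide one specific message. The sketch below is a generic workaround rather than a MONAI feature: the `message` pattern is copied from the start of the warning text in the log above, and `silence_tf32_env_warning` is a hypothetical helper name. It assumes the warning is emitted while `monai` is imported, so the filter has to be installed first, in every process.

```python
import warnings


def silence_tf32_env_warning() -> None:
    """Ignore UserWarnings whose text starts with the message shown in the log above."""
    warnings.filterwarnings(
        "ignore",
        message=r"Environment variable `TORCH_ALLOW_TF32_CUBLAS_OVERRIDE",
        category=UserWarning,
    )


# Call this before `import monai` (assumption: the warning is raised at import time).
silence_tf32_env_warning()
```

Because the filter matches on the message prefix, other `UserWarning`s are still shown.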

wyli · 2023-08-15T08:43:10Z

I think the main ambiguity from a user's perspective often comes from this particular setting: `export TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1` together with `torch.backends.cuda.matmul.allow_tf32=False`, which will enable tf32. How about we only warn for this setting?

qingpeng9802 · 2023-08-15T09:41:20Z

> I think the main ambiguity from a user's perspective often comes from this particular setting: `export TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1` together with `torch.backends.cuda.matmul.allow_tf32=False`, which will enable tf32. How about we only warn for this setting?

The situation is actually a bit complicated.
When TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1, PyTorch sets torch.backends.cuda.matmul.allow_tf32 to True and uses tf32.
When NVIDIA_TF32_OVERRIDE=1, PyTorch does not set torch.backends.cuda.matmul.allow_tf32 to True, but tf32 is still used (internally, by the NVIDIA libraries).
Thus, it is hard to infer the user's intention.

wyli · 2023-08-15T10:03:43Z

> When TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1, PyTorch will set torch.backends.cuda.matmul.allow_tf32 to True, and uses tf32.

OK, it looks like this is consistent in PyTorch 2.0, so I think there's no need to warn in this case?

> When NVIDIA_TF32_OVERRIDE=1, PyTorch will not set torch.backends.cuda.matmul.allow_tf32 to True, and uses tf32 (by NVIDIA lib internally) Thus, it is kind of hard to infer the user's intention.

I don't think NVIDIA_TF32_OVERRIDE should be set in regular use cases, because it potentially changes all the underlying libs/frameworks; our current code correctly warns in this case.

Since this behavior changed across earlier versions of PyTorch, perhaps we can focus on proper warnings for torch>=2.0 only. What do you think?

qingpeng9802 · 2023-08-15T10:47:07Z

TORCH_ALLOW_TF32_CUBLAS_OVERRIDE affects the precision via https://github.com/pytorch/pytorch/blob/v2.0.1-rc4/aten/src/ATen/Context.h#L294, and this was introduced by the issue Lightning-AI/pytorch-lightning#12997 mentioned. Thus, the version boundary should be 1.12? (not sure)

> ok, looks like it's consistent in pytorch 2.0, then I think there's no need to warn in this case?

PyTorch's behavior is consistent, but for users it seems a bit hard to troubleshoot, just like the original issue behind this one. This is essentially a tradeoff between bothering experienced users and leaving inexperienced users without a hint.

I would suggest adding an environment variable as a flag to suppress the warnings. There is a similar idea in huggingface/transformers#16588 (comment).
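
A sketch of how such an opt-out flag could gate the warning. The variable name `MONAI_ALLOW_TF32` is taken from the suggestion earlier in this thread and is treated here as hypothetical, as are the helper names; nothing below is confirmed MONAI behavior.

```python
import os
import warnings

# Hypothetical opt-out variable, named after the suggestion earlier in this thread.
OPT_OUT_VAR = "MONAI_ALLOW_TF32"


def tf32_warning_enabled(environ=None) -> bool:
    """Return False once the user has acknowledged TF32 by setting the variable to "1"."""
    environ = os.environ if environ is None else environ
    return environ.get(OPT_OUT_VAR, "0") != "1"


def maybe_warn_tf32(message: str, environ=None) -> bool:
    """Emit the warning unless the user opted out; return whether it was emitted."""
    if not tf32_warning_enabled(environ):
        return False
    warnings.warn(message, UserWarning, stacklevel=2)
    return True
```

With this design the default stays noisy for users who have not thought about TF32, while anyone who sets the variable explicitly is left alone.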

myron · 2023-08-27T06:13:15Z

Guys, I'm running the AutoRunner() from MONAI on 8 GPUs, and these warnings are overwhelming. It printed them 16 times (probably from DataAnalyzer(), which creates several parallel processes), then another 8 warnings when training starts.

Can we please disable these warnings, or at least show them just once instead of so many times? Thank you.
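
One way to get "just once" in a multi-process launch is to warn only from rank 0. The sketch below assumes the launcher exports a `RANK` environment variable (as `torchrun` does); worker processes started in other ways, for example by a data analysis step, may not have it set and would still warn. `is_rank_zero` and `warn_rank_zero` are hypothetical helper names.

```python
import os
import warnings


def is_rank_zero(environ=None) -> bool:
    """Treat a missing RANK as rank 0 so single-process runs still see the warning."""
    environ = os.environ if environ is None else environ
    return environ.get("RANK", "0") == "0"


def warn_rank_zero(message: str, environ=None) -> bool:
    """Emit the warning only on rank 0; return whether it was emitted."""
    if not is_rank_zero(environ):
        return False
    warnings.warn(message, UserWarning, stacklevel=2)
    return True
```

Each process keeps its own warning state, so an in-process "warn once" guard alone would not reduce the count across 8 ranks; the rank check is what removes the duplicates.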

wyli · 2023-08-27T09:17:12Z

Thanks @myron, I'm creating a feature request and will have a look soon.

wyli mentioned this issue Jul 21, 2023

LocalNormalizedCrossCorrelationLoss and TF32 numerical stability #6525

Closed

wyli added Contribution wanted Feature request labels Jul 21, 2023

qingpeng9802 mentioned this issue Jul 25, 2023

Add tf32 doc #6770

Merged

7 tasks

qingpeng9802 mentioned this issue Aug 3, 2023

Tf32 warnings #6816

Merged

7 tasks

wyli closed this as completed Aug 10, 2023

wyli mentioned this issue Aug 27, 2023

by default less warnings for the tf32 flags #6907

Closed
