Improve FB metric for DDP #1805

KickItLikeShika · 2021-03-16T14:03:18Z

Improve the FractionalBias implementation to be compatible with DDP and add the necessary tests.
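For context, a minimal sketch of what "compatible with DDP" usually means for an averaged metric: each process keeps a running sum and a count, the sums and counts are added across processes, and the division happens only once at the end. The per-sample term below follows the usual fractional bias definition, 2 * (actual - pred) / (actual + pred). The helper names `local_state` and `reduce_and_compute` are hypothetical, and the list of per-rank states stands in for a real all-reduce; this is not the code from the PR.

```python
import math

def local_state(y_pred, y):
    """Return (sum_of_errors, num_examples) for one process's shard."""
    total = 0.0
    for pred, actual in zip(y_pred, y):
        total += 2.0 * (actual - pred) / (actual + pred)
    return total, len(y)

def reduce_and_compute(states):
    """Stand-in for an all-reduce (sum) followed by the final division."""
    total = sum(s for s, _ in states)
    count = sum(n for _, n in states)
    return total / count

# Hypothetical data, split unevenly across two simulated ranks.
y_pred = [1.0, 2.0, 3.0, 4.0]
y = [1.5, 2.5, 2.0, 5.0]

single = reduce_and_compute([local_state(y_pred, y)])
rank0 = local_state(y_pred[:1], y[:1])
rank1 = local_state(y_pred[1:], y[1:])
sharded = reduce_and_compute([rank0, rank1])

# Averaging the per-rank means instead would weight the ranks incorrectly.
mean_of_means = (rank0[0] / rank0[1] + rank1[0] / rank1[1]) / 2
```

Reducing sums and counts separately gives the same value as a single-process run, whereas averaging the per-rank means does not when shards have different sizes.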

Check list:

New tests are added (if a new feature is added)
New doc strings: description and/or example code are in RST format
Documentation is updated (if required)

KickItLikeShika · 2021-03-16T14:19:17Z

Regarding the TPU test failure https://github.com/pytorch/ignite/pull/1805/checks?check_run_id=2122146650: I guess this is the same issue we hit in #1699, so I will try to fix it the same way.

vfdev-5 · 2021-03-16T14:22:21Z

@KickItLikeShika thanks! Yes, I think there is a division by zero somewhere...
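To illustrate the suspected division by zero, here is a small sketch of the per-sample fractional bias term with an optional epsilon in the denominator. The function name `fb_term` is hypothetical, and the value 1e-30 is only the example figure mentioned later in this thread, not a confirmed value from the merged code. In plain Python the unguarded case raises `ZeroDivisionError`; with tensors the same 0/0 would typically produce NaN instead.

```python
def fb_term(pred, actual, eps=0.0):
    """One fractional bias term: 2 * (actual - pred) / (actual + pred + eps)."""
    return 2.0 * (actual - pred) / (actual + pred + eps)

# Without a guard, a sample where both values are zero divides by zero.
try:
    fb_term(0.0, 0.0)
    unguarded = "finite"
except ZeroDivisionError:
    unguarded = "division by zero"

# With a tiny epsilon (assumed value), the same sample contributes 0.0.
guarded = fb_term(0.0, 0.0, eps=1e-30)

# For ordinary inputs the epsilon is far too small to change the result.
ordinary = fb_term(1.0, 3.0, eps=1e-30)
```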

KickItLikeShika · 2021-03-16T15:30:58Z

@vfdev-5 can you please check this https://github.com/pytorch/ignite/pull/1805/checks?check_run_id=2122812029

vfdev-5 · 2021-03-16T15:34:17Z

Yes, this can happen. Let's reduce the tolerance to 1e-5 for XLA.

KickItLikeShika · 2021-03-16T15:43:28Z

@vfdev-5 By reducing the tolerance, you mean it should be smaller than 1e-15, right? For example, 1e-30?

vfdev-5 · 2021-03-16T15:45:37Z

I mean the pytest.approx tolerance, like here:

ignite/tests/ignite/metrics/test_root_mean_squared_error.py

Line 132 in f76c392

_test_distrib_integration(device, tol=1e-4)
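As a rough illustration of the tolerance question, the sketch below uses the standard-library `math.isclose`, which applies a relative tolerance in a similar spirit to the `rel` argument of `pytest.approx`. It assumes that the `tol` argument above ends up in such a comparison, which this thread does not show. The values are hypothetical: `observed` simulates a result that drifts by a few parts per million, as can happen with lower-precision arithmetic on an accelerator.

```python
import math

expected = 0.1111111111
# Hypothetical result with a small relative deviation of 3e-6.
observed = expected * (1 + 3e-6)

# A very small tolerance is a stricter check, so the comparison fails.
strict = math.isclose(observed, expected, rel_tol=1e-15)

# A larger tolerance such as 1e-4 accepts the same deviation.
loose = math.isclose(observed, expected, rel_tol=1e-4)
```

A smaller tolerance number makes the test harder to pass, so moving from 1e-15 towards 1e-30 would tighten the check rather than relax it.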

KickItLikeShika · 2021-03-16T16:56:12Z

@vfdev-5 I have tried adjusting the tolerance and the eps, but it didn't work; I'm still getting the same error.

vfdev-5 · 2021-03-16T17:01:29Z

@KickItLikeShika let's try to repro the issue locally. Please, follow these installation steps:

ignite/.github/workflows/tpu-tests.yml

Lines 52 to 65 in b973213

    
      - name: Install Torch XLA and others
        run: |
          ## Install openblas, mkl, gsutil
          sudo apt-get install -y libopenblas-dev libomp5
          pip install mkl requests gsutil
          ## Install torch & xla
          curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
          python pytorch-xla-env-setup.py --version "nightly"
          ## Install test deps and Ignite
          pip install -r requirements-dev.txt
          python setup.py install

and run the test:

ignite/.github/workflows/tpu-tests.yml

Lines 67 to 74 in b973213

    
      - name: Run Tests
        run: |
          export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/hostedtoolcache/Python/3.6.10/x64/lib
          export XRT_DEVICE_MAP="CPU:0;/job:localservice/replica:0/task:0/device:XLA_CPU:0"
          export XRT_WORKERS="localservice:0;grpc://localhost:40934"
          python -c "import torch_xla; print('torch xla version:', torch_xla.__version__)"
          bash tests/run_tpu_tests.sh

or even

pytest tests/ignite/contrib/metrics/regression/test_fractional_bias.py::test_distrib_single_device_xla -vvv -m tpu

KickItLikeShika · 2021-03-17T12:40:23Z

@vfdev-5 it works now, Thanks!

sdesrozis

@KickItLikeShika Thank you for this work! LGTM

sdesrozis · 2021-03-17T22:13:47Z

The Horovod CI is failing 😔

KickItLikeShika · 2021-03-17T22:16:29Z

@sdesrozis Yes, I see! But I think it's not related, right? The tests for FractionalBias passed. Please check this: https://github.com/pytorch/ignite/pull/1805/checks?check_run_id=2134663201#step:8:1313

KickItLikeShika · 2021-03-17T22:25:11Z

@sdesrozis please restart the CI to see what happens.

sdesrozis · 2021-03-17T22:31:12Z

Let’s see tomorrow about this failure @vfdev-5

KickItLikeShika · 2021-03-17T23:02:01Z

@sdesrozis it works fine, thanks!

vfdev-5 · 2021-03-17T23:23:27Z

@KickItLikeShika we probably have to adjust the tolerance for the GPU tests as well: https://app.circleci.com/pipelines/github/pytorch/ignite/1607/workflows/381ab87c-6226-466e-ab64-4bbe8a835f56/jobs/4649

Please send a follow-up PR with tol ~1e-5.

KickItLikeShika added 2 commits March 16, 2021 15:48

add dist settings and integration test

4c2fe70

add dist tests

f103756

github-actions bot added the module: contrib Contrib module label Mar 16, 2021

KickItLikeShika added 2 commits March 16, 2021 16:51

avoiding nans

c6ce293

avoiding nans

9737951

KickItLikeShika added 2 commits March 16, 2021 17:58

edit tolerence

3ac580f

edit eps

67a8599

reduce tolerence for xla

b8175b1

sdesrozis approved these changes Mar 17, 2021

Merge branch 'master' into improve-fractional-bias

887213e

vfdev-5 merged commit b3f7020 into pytorch:master Mar 17, 2021

KickItLikeShika deleted the improve-fractional-bias branch March 18, 2021 15:17
