Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve FB metric for DDP #1805

Merged
merged 8 commits into from
Mar 17, 2021

Conversation

KickItLikeShika
Copy link
Contributor

Improving FractionalBias implementation to be compatible with DDP and add the necessary tests

Check list:

  • New tests are added (if a new feature is added)
  • New doc strings: description and/or example code are in RST format
  • Documentation is updated (if required)

@github-actions github-actions bot added the module: contrib Contrib module label Mar 16, 2021
@KickItLikeShika
Copy link
Contributor Author

Regarding the TPU test failure https://github.com/pytorch/ignite/pull/1805/checks?check_run_id=2122146650, i guess we got the same issue in #1699, i will try to fix it the same way

@vfdev-5
Copy link
Collaborator

vfdev-5 commented Mar 16, 2021

@KickItLikeShika thanks ! yes, I think it should be division by zero somewhere...

@KickItLikeShika
Copy link
Contributor Author

@vfdev-5
Copy link
Collaborator

vfdev-5 commented Mar 16, 2021

@vfdev-5 can you please check this https://github.com/pytorch/ignite/pull/1805/checks?check_run_id=2122812029

Yes, this can happend. Let's reduce the tolerence to 1e-5 for xla

@KickItLikeShika
Copy link
Contributor Author

KickItLikeShika commented Mar 16, 2021

@vfdev-5 You mean by reducing the tolerence it should be smaller than 1e-15, right? It should be for example 1e-30

@vfdev-5
Copy link
Collaborator

vfdev-5 commented Mar 16, 2021

I mean pytest.approx tolerence, like here:

_test_distrib_integration(device, tol=1e-4)

@KickItLikeShika
Copy link
Contributor Author

@vfdev-5 i have tried to manipulate the tolerence or the eps but it didn't work, still getting the same error

@vfdev-5
Copy link
Collaborator

vfdev-5 commented Mar 16, 2021

@KickItLikeShika let's try to repro the issue locally. Please, follow these installation steps:

- name: Install Torch XLA and others
run: |
## Install openblas, mkl, gsutil
sudo apt-get install -y libopenblas-dev libomp5
pip install mkl requests gsutil
## Install torch & xla
curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
python pytorch-xla-env-setup.py --version "nightly"
## Install test deps and Ignite
pip install -r requirements-dev.txt
python setup.py install

and run the test:

- name: Run Tests
run: |
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/hostedtoolcache/Python/3.6.10/x64/lib
export XRT_DEVICE_MAP="CPU:0;/job:localservice/replica:0/task:0/device:XLA_CPU:0"
export XRT_WORKERS="localservice:0;grpc://localhost:40934"
python -c "import torch_xla; print('torch xla version:', torch_xla.__version__)"
bash tests/run_tpu_tests.sh

or even

pytest tests/ignite/contrib/metrics/regression/test_fractional_bias.py::test_distrib_single_device_xla -vvv -m tpu

@KickItLikeShika
Copy link
Contributor Author

KickItLikeShika commented Mar 17, 2021

@vfdev-5 it works now, Thanks!

Copy link
Contributor

@sdesrozis sdesrozis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@KickItLikeShika Thank you for this work! LGTM

@sdesrozis
Copy link
Contributor

Horovod is ko 😔

@KickItLikeShika
Copy link
Contributor Author

KickItLikeShika commented Mar 17, 2021

@sdesrozis yes i see! but i think it's not related, right? as the tests for FractionBias have been passed. Check this please https://github.com/pytorch/ignite/pull/1805/checks?check_run_id=2134663201#step:8:1313

@KickItLikeShika
Copy link
Contributor Author

@sdesrozis please restart the CI to see what's going to happen..

@sdesrozis
Copy link
Contributor

Let’s see tomorrow about this failure @vfdev-5

@KickItLikeShika
Copy link
Contributor Author

@sdesrozis it works fine, thanks!

@vfdev-5 vfdev-5 merged commit b3f7020 into pytorch:master Mar 17, 2021
@vfdev-5
Copy link
Collaborator

vfdev-5 commented Mar 17, 2021

@KickItLikeShika probably we have to adjust tolerance for GPU tests as well : https://app.circleci.com/pipelines/github/pytorch/ignite/1607/workflows/381ab87c-6226-466e-ab64-4bbe8a835f56/jobs/4649

Please, send a follow-up PR with tol ~1e-5

@KickItLikeShika KickItLikeShika deleted the improve-fractional-bias branch March 18, 2021 15:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module: contrib Contrib module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants