Lack of reproducibility when using Huggingface transformers library (TensorFlow version) #14

Open
dmitriydligach opened this issue Apr 28, 2020 · 11 comments

@dmitriydligach

dmitriydligach commented Apr 28, 2020

Dear developers,

I included in my code all the steps listed in this repository but still could not achieve reproducibility using either TF 2.1 or TF 2.0. Here's the link to my code:

https://github.com/dmitriydligach/Thyme/blob/master/Keras/et.py

Please help.
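
For readers following along, the steps in question usually come down to seeding every random number generator and asking TensorFlow for deterministic op implementations. The sketch below is an assumption about a typical TF 2.x setup, not the code linked above: `seed_everything` is a hypothetical helper, and `TF_DETERMINISTIC_OPS` is the environment variable read by TF 2.1 and later (to my knowledge TF 2.0 does not read it).

```python
import importlib
import os
import random


def seed_everything(seed=42):
    """Seed the RNGs commonly involved in a TensorFlow 2.x training run.

    Hypothetical helper for illustration. Returns the names of the
    libraries that were actually seeded.
    """
    # Request deterministic GPU op implementations (read by TF 2.1+).
    os.environ["TF_DETERMINISTIC_OPS"] = "1"
    # Only affects string hashing in Python processes started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)

    random.seed(seed)
    seeded = ["random"]

    # NumPy and TensorFlow are optional here; skip whichever is not installed.
    optional = (
        ("numpy", lambda m: m.random.seed(seed)),
        ("tensorflow", lambda m: m.random.set_seed(seed)),
    )
    for name, seed_fn in optional:
        try:
            module = importlib.import_module(name)
        except ImportError:
            continue
        seed_fn(module)
        seeded.append(name)
    return seeded
```

Calling `seed_everything(...)` once at the top of the script, before any model or dataset is built, is the usual placement. It does not help if the model uses an op that has no deterministic implementation.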

@MFreidank

@dmitriydligach Did you ever get this resolved?

@dmitriydligach
Author

@MFreidank Nope. I switched to PyTorch, which has a more reliable way to enforce determinism.

@MFreidank

@dmitriydligach Just to verify: your code becomes fully reproducible with PyTorch?

@duncanriach
Collaborator

duncanriach commented Jun 16, 2020

PyTorch has potentially different non-deterministic ops than TensorFlow, and no general mechanism, yet, to enable deterministic op functionality. Both PyTorch and TensorFlow now have the ability to enable deterministic cuDNN functionality.

This code may use an op that happens to be non-deterministic in TensorFlow but deterministic in PyTorch.

I'm hoping to look at this code in detail soon, hopefully today.
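
As an illustration of the deterministic cuDNN functionality mentioned above, here is a minimal sketch for the PyTorch side. `configure_torch_determinism` is a hypothetical helper name; the two `torch.backends.cudnn` flags are the standard PyTorch settings. This only covers cuDNN kernels, so other non-deterministic ops may remain. On the TensorFlow side, the corresponding switch is the `TF_CUDNN_DETERMINISTIC=1` environment variable.

```python
def configure_torch_determinism(seed=0):
    """Seed PyTorch and request deterministic cuDNN kernels.

    Hypothetical helper for illustration. Returns True if PyTorch was
    found and configured, False if it is not installed.
    """
    try:
        import torch
    except ImportError:
        return False

    # Seed PyTorch's default random number generator.
    torch.manual_seed(seed)
    # Ask cuDNN to use deterministic algorithms only.
    torch.backends.cudnn.deterministic = True
    # Disable the autotuner, which may pick different algorithms per run.
    torch.backends.cudnn.benchmark = False
    return True
```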

@dmitriydligach
Author

@MFreidank In most cases, I get the exact same results every time I run my PyTorch code (including loss and accuracy for each epoch). In some (relatively infrequent) cases, there's still a difference, but it's not nearly as large as in the case of tensorflow.
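
A check like the one described (same loss and accuracy for each epoch across runs) can be automated. The sketch below assumes each run's per-epoch metric has been collected into a plain list of floats; `compare_runs` is a hypothetical helper, not part of any library.

```python
def compare_runs(run_a, run_b):
    """Compare per-epoch metric values from two training runs.

    Returns (identical, first_divergent_epoch, max_abs_diff), where epochs
    are numbered from 1 and first_divergent_epoch is None if the runs match
    exactly.
    """
    if len(run_a) != len(run_b):
        raise ValueError("runs must cover the same number of epochs")

    first_divergent = None
    max_diff = 0.0
    for epoch, (a, b) in enumerate(zip(run_a, run_b), start=1):
        diff = abs(a - b)
        # Exact comparison on purpose: determinism means bit-identical values.
        if diff > 0 and first_divergent is None:
            first_divergent = epoch
        max_diff = max(max_diff, diff)
    return first_divergent is None, first_divergent, max_diff
```

Knowing the first epoch at which two runs diverge helps distinguish a setup problem (divergence from epoch 1) from something that only appears later in training.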

@MFreidank

> PyTorch has potentially different non-deterministic ops than TensorFlow, and no general mechanism, yet, to enable deterministic op functionality. Both PyTorch and TensorFlow now have the ability to enable deterministic cuDNN functionality.
>
> This code may use an op that happens to be non-deterministic in TensorFlow but deterministic in PyTorch.
>
> I'm hoping to look at this code in detail soon, hopefully today.

@duncanriach Thanks for your blazingly fast response! :)
I would still have an interest in resolving this issue in TF 2.2 and would highly appreciate it if you could help investigate.

A helpful starting point could be my colab example.

@dmitriydligach Thanks for those additional details. It sounds like there is still slight non-determinism in PyTorch as well, but it might not affect loss/accuracy as strongly. This is valuable information for me, thank you for sharing your experience :)

@duncanriach
Collaborator

@dmitriydligach: I'm sorry that I didn't get to sorting this out in time for you to benefit from determinism in TensorFlow.

@MFreidank: I'll prioritize taking a look at these issues. They could share the same underlying cause, or there could be different sources. Often in these kinds of problems there is a setup issue that is easy to resolve. I intend to add better step-by-step instructions to the README for that. Sometimes a known (and not-yet-fixed) non-deterministic op is being used, and sometimes there is a new discovery: an op that we didn't know was non-deterministic. We'll figure this out.

@MFreidank

MFreidank commented Jun 16, 2020

@duncanriach Thanks a lot for taking the time to look into this and for your encouragement.
I feel much more confident about this now, knowing that someone with your experience will be having a look.

@duncanriach
Collaborator

Hey @dmitriydligach, it looks like we have reproducibility on issue 19 (Huggingface Transformers BERT for TensorFlow). @MFreidank is confirming. Looking at your code, I don't see any reason for there to be non-determinism. I want to repro what you're seeing so that I can debug it. I have it running, but it looks like I have to specify DATA_ROOT and provide data there. Can you give me instructions to repro with the data you're using?

@MFreidank

@duncanriach The non-reproducibility of @dmitriydligach's code may be related to training for multiple epochs; see my update on issue #19.

@dmitriydligach
Author

@duncanriach Thank you very much for looking into this issue.

Unfortunately, I'm not able to provide the data (this is medical data that can only be distributed via a data use agreement). However, perhaps it would help you to know that the data consists of relatively short text fragments (max_len ~ 150 word pieces)...
