
[WIP] Feature/patch/softmax cross entropy with logits #21

Open
wants to merge 8 commits into master
Conversation

MFreidank

Addressing previously reported TensorFlow issue #38185. Also tied to tfdeterminism issues #19, #14, and #9.
The fused implementation of {sparse_,}softmax_cross_entropy_with_logits exhibits non-determinism in its backprop.
This PR provides a patch that keeps the same function interface but performs the constituent computation steps in a non-fused way that is fully deterministic.
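For reference, the non-fused computation the patch relies on can be sketched like this (illustrative helper name and structure, not the PR's actual code):

import tensorflow as tf

def _softmax_cross_entropy_unfused(labels, logits, axis=-1):
    # Compose the loss from non-fused ops whose GPU backprop is
    # deterministic, instead of the fused kernel.
    log_probs = tf.nn.log_softmax(logits, axis=axis)  # numerically stable
    # Cross entropy: -sum(labels * log p) over the class axis.
    return -tf.reduce_sum(labels * log_probs, axis=axis)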

MFreidank added 2 commits June 22, 2020 09:56
Tied to open tensorflow issue #38185.
Non-determinism in backprop of fused implementation
of `{sparse_,}softmax_cross_entropy_with_logits`
has been reported. This work will provide a
patch by routing calls to a deterministic workaround
first described in the tensorflow issue above.
@@ -70,13 +70,20 @@ def _patch():
if re.match("(1\.(14|15)|2\.0)", tf_version):
os.environ['TF_CUDNN_DETERMINISTIC'] = '1'
_patch_bias_add()
_patch_fused_softmax_cross_entropy()
Collaborator

I'll push changes that rough-out enable_determinism and then you can call this from the appropriate part of that.

same as `logits` and its shape is the same as `labels` except that it does
not have the last dimension of `labels`.
"""
raise NotImplementedError()
Collaborator

I like the baby steps. Good job!

@duncanriach
Collaborator

duncanriach commented Jun 22, 2020

Comments above. Heads-up, before you're done, please remember to implement test/test_fused_softmax_cross_entropy.py in the spirit of test/test_patch_bias_add.py. This will use the appropriate forward code from the upstream TensorFlow repo along with the deterministic gradient test code that I provided in TensorFlow Issue 38185 (also testing all the modified op bindings).
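A minimal sketch of that style of gradient-determinism test (TF2 eager API, illustrative only; the real test should mirror test/test_patch_bias_add.py):

import numpy as np
import tensorflow as tf

def gradients_are_deterministic(labels, logits, repeats=5):
    # Compute the input gradient of the loss several times and
    # require bit-exact repeatability across runs.
    grads = []
    for _ in range(repeats):
        with tf.GradientTape() as tape:
            tape.watch(logits)
            loss = tf.reduce_sum(tf.nn.softmax_cross_entropy_with_logits(
                labels=labels, logits=logits))
        grads.append(tape.gradient(loss, logits).numpy())
    return all(np.array_equal(grads[0], g) for g in grads[1:])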

I'm going to work on adding enable_determinism, from which _patch_fused_softmax_cross_entropy can be called.

Comment on lines 149 to 151
tf.nn.softmax_cross_entropy_with_logits = _new_softmax_cross_entropy_with_logits_1_14 # access via public API
nn.softmax_cross_entropy_with_logits = _new_softmax_cross_entropy_with_logits_1_14 # called from tf.keras.layers.convolutional.Conv
nn_ops.softmax_cross_entropy_with_logits = _new_softmax_cross_entropy_with_logits_1_14 # called from tests
Collaborator

Looking at tensorflow/python/ops/nn_ops.py it seems like it might be necessary to remap the bindings for softmax_cross_entropy_with_logits_v2 as well. After a brief review, I'm not sure what the difference between this and the other versions is, though, apart from the deprecated dim parameter.
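If it does turn out to be needed, the remap would presumably mirror the existing ones (sketch, assuming the patched function is compatible with the v2 signature):

tf.nn.softmax_cross_entropy_with_logits_v2 = _new_softmax_cross_entropy_with_logits_1_14  # if the v2 API matches
nn_ops.softmax_cross_entropy_with_logits_v2 = _new_softmax_cross_entropy_with_logits_1_14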


# The original, pre-patched method can be viewed at
# https://github.com/tensorflow/tensorflow/blob/v1.14.0/tensorflow/python/ops/nn_ops.py#L3182
def _new_softmax_cross_entropy_with_logits_1_14(labels, logits, axis=-1, name=None):
Collaborator
duncanriach Jun 22, 2020

BTW, the reason that _new_bias_add_1_14 has the version number in the name is that before stock TensorFlow version 1.14, tf.nn.bias_add had a different API. I have a version of this patch for TensorFlow version 1.13 and earlier that I didn't release because it was not useful without TF_CUDNN_DETERMINISTIC, which was released in version 1.14 of stock TensorFlow. The deterministic bias_add was effectively released in the NVIDIA GPU Cloud (NGC) TensorFlow container version 19.06, which also implemented TF_DETERMINISTIC_OPS and enabled the functionality.

I don't think we need the _1_14 on the end of these method names. However, before enabling this patch to be applied to any given version of TensorFlow (as decided in enable_determinism), it should be confirmed that the patched version of the method has the same API as the original one in that version of TensorFlow.
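One lightweight way to make that confirmation mechanical (illustrative sketch using inspect, not part of this PR):

import inspect

def signatures_match(original_fn, patched_fn):
    # Compare parameter names, order, and defaults so a patched
    # method cannot silently diverge from the API it replaces.
    return (inspect.signature(original_fn).parameters ==
            inspect.signature(patched_fn).parameters)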


# Non-sparse
tf.nn.sparse_softmax_cross_entropy_with_logits = _new_sparse_softmax_cross_entropy_with_logits_1_14 # access via public API
nn.sparse_softmax_cross_entropy_with_logits = _new_sparse_softmax_cross_entropy_with_logits_1_14 # called from tf.keras.layers.convolutional.Conv
Collaborator

The comment at the end of this line is not relevant here (same for the non-sparse version). You should search for calls to sparse_softmax_cross_entropy_with_logits (and the non-sparse version) in tensorflow/python and do the re-binding as appropriate for each op (with similar, perhaps in some cases identical, comments about why you're doing it).
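For example, something along these lines (illustrative) will list the internal call sites to rebind:

$ grep -rn "sparse_softmax_cross_entropy_with_logits" tensorflow/python --include="*.py"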

@duncanriach
Collaborator

duncanriach commented Jun 23, 2020

I just added enable_determinism with this commit. You should use enable_determinism to apply the patch in test/test_patch_fused_softmax_cross_entropy.py.

I'm going to reorganize the testing quite a lot before release. For the testing part, I would like you to focus on getting the following complete and working:

$ cd test
$ echo "Test TF1 API:"
$ ./container.sh tensorflow/tensorflow:1.14.0-gpu python test_patch_fused_softmax_cross_entropy.py
$ echo "Test TF2 API:"
$ ./container.sh tensorflow/tensorflow:2.2.0-gpu python test_patch_fused_softmax_cross_entropy.py

@@ -52,7 +52,7 @@ def _enable_determinism(seed=None):
_patch_bias_add()
if in_ngc_cont and ngc_vers.at_least('19.06') or tf_vers.at_least('2.1'):
os.environ['TF_DETERMINISTIC_OPS'] = '1'
# TODO: Add patch crossentropy here as well? Issue seems to still be present on tf 2.1, 2.2
Collaborator

No. This is the condition for setting TF_DETERMINISTIC_OPS=1: NGC containers with version >= 19.06 or stock TensorFlow with version >= 2.1.

The condition below will ensure that the fused softmax/cross-entropy patch is applied to NGC containers with version >= 19.06 or stock TensorFlow with version >= 1.14 (which includes versions 2.1 and 2.2).
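In code, that condition would look something like this (sketch, reusing the helper names visible in the diff hunk above; exact placement is up to the PR):

if in_ngc_cont and ngc_vers.at_least('19.06') or tf_vers.at_least('1.14'):
    _patch_fused_softmax_cross_entropy()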

print("TensorFlow version %s has been patched "
"using tfdeterminism version %s" %
(tf_version, __version__), file=sys.stderr)
elif re.match("2\.1|2\.2", tf_version):
Collaborator
duncanriach Jun 24, 2020

I want to deprecate patch, not make it work on newer versions of TensorFlow, and to stop advertising it in the documentation (which will be updated before release). If folks want determinism in TensorFlow version 2.1 or 2.2, they should be using enable_determinism. Not adding this does not break anything, because no patch was available for TensorFlow versions 2.1 and 2.2 before.

@phqtuyen

phqtuyen commented Aug 5, 2020

Hi, my name is Thomas. I work on the Oracle Digital Assistant team, where we make extensive use of TensorFlow and tensorflow-determinism; we really appreciate this wonderful library. I just wish to ask: is this PR going to be merged anytime soon?

@duncanriach
Collaborator

Hi Thomas (@phqtuyen),

Thanks for letting us know. These ops seem to be generally high priority for determinism in TensorFlow.

@MFreidank: can you please give an update on this and/or move it forward? Most of the rest of the infrastructure is now in place for this.

@MFreidank
Author

Hi @phqtuyen, @duncanriach
Sorry for the wait; I had been pulled into other things.
I've picked up work on this now and will have an update to the PR before the end of this week.
I'll base my changes on the new infrastructure recently put in place.

@duncanriach
Collaborator

Awesome. Just FYI, I'll be offline next week (2020-08-17 through 2020-08-21) and, since I'll need to be involved in reviews and iterations, that might slow things down a bit.

@duncanriach
Collaborator

duncanriach commented Sep 2, 2020

Hey @MFreidank, are you still planning on completing this PR?

@duncanriach
Collaborator

duncanriach commented Sep 4, 2020

Hi @MFreidank, I'm about to go offline again until 2020-09-16, but I wanted to check in with you again. I think that this op's nondeterminism is near the top of the priority/urgency list for TensorFlow, and I want to get it resolved ASAP. Can you commit to pushing this through from your end?

@MFreidank
Author

MFreidank commented Sep 4, 2020 via email

@duncanriach
Collaborator

Thank you, Moritz.

@MFreidank
Author

MFreidank commented Sep 5, 2020

Hi again @duncanriach,
I've implemented a first version (on top of the most recent master branch with enable_determinism) that patches only tf.nn.sparse_softmax_cross_entropy_with_logits and tf.nn.softmax_cross_entropy_with_logits, and I've integrated the unit test case you shared into test/test_softmax_cross_entropy_with_logits.py.
To test that this works as intended, I'll need access to my GPU, which will only be available to me in another 6 hours.
If the test cases pass, I'll update the PR with the above.
What remains to be determined is which other functions need to be patched after that (by scanning through tensorflow/python as outlined by you above).
I will first focus on getting the minimal patch for the functions in tf.nn to work, including unit tests, and will then broaden it to include the other functions.
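For the sparse variant, one way to route through the non-fused dense path can be sketched as follows (illustrative names, assuming integer class labels):

import tensorflow as tf

def _sparse_softmax_cross_entropy_unfused(labels, logits):
    # Expand integer class labels to one-hot and reuse the dense,
    # deterministic (non-fused) computation.
    num_classes = tf.shape(logits)[-1]
    one_hot_labels = tf.one_hot(labels, num_classes, dtype=logits.dtype)
    log_probs = tf.nn.log_softmax(logits, axis=-1)
    return -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)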

@duncanriach
Collaborator

Thanks, @MFreidank. Sounds good. Just to reiterate, I'll be offline until 2020-09-16 and I'll review this after that.

@duncanriach
Collaborator

Hey @MFreidank, will you please submit those changes so that I can review them?

@duncanriach
Collaborator

@MFreidank, we will soon release version 0.4.0. We intend to include the fused softmax/cross-entropy patch in version 0.5.0, and I want to get started on it ASAP. If I don't hear from you in the next 24 hours, we will proceed to implement it without you.
