Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use tf.function for list column operations #938

Merged
merged 1 commit into from
Dec 27, 2022

Conversation

edknv
Copy link
Contributor

@edknv edknv commented Dec 26, 2022

Addresses NVIDIA-Merlin/dataloader#74.

Goals ⚽

In Tensorflow >= 2.10, there seems to be a race condition or some thread safety issue when Dataloader is loading list columns. The errors described in the above issue happen non-deterministically. When the list column tensors are copied successfully by the time tensorflow begins its execution, the tests run successfully. If the copy is incomplete, the tests fail. This PR proposes a workaround for this issue by using the @tf.function decorator on the methods that involve list columns.

Implementation Details 🚧

Some observations:

  1. This does not happen on CPU, and only happens on GPUs.
  2. This does not happen in graph mode, and only happens in eager mode.

The above two observations suggest two workarounds:

  1. Move the problematic operations to CPU (by using with tf.device("CPU")).
  2. Take the problematic operations out of eager execution and convert them to graph executions (by using tf.function).

The second workaround seems to be the superior approach. One could also argue that all methods, not just the ones involving list columns, that were written in eager mode should be wrapped with @tf.function for efficiency. However, I consider it to be out of scope for this PR. This PR has the minimum changes required to have the unit tests complete successfully; the @tf.function decorator is applied only to the methods that are failing unit tests in Tensorflow 2.10.

Also has a minor fix due to cudf.RangeIndex not having an empty property anymore, which was copied from #904 (comment).

Testing Details 🔍

Removed the upper bound on tensorflow from 2.10 to 2.11 in GPU CI / gpu-ci (pull_request).
This issue is currently blocking the 22.12 release pipeline. Tested manually with nvcr.io/nvstaging/merlin/tmp-merlin-tensorflow-stg:22.12 (which has tensorflow 2.10).

@github-actions
Copy link

Documentation preview

https://nvidia-merlin.github.io/models/review/pr-938

@edknv edknv self-assigned this Dec 27, 2022
@edknv edknv added bug Something isn't working area/tensorflow ci labels Dec 27, 2022
@edknv edknv added this to the Merlin 22.12 milestone Dec 27, 2022
@edknv edknv marked this pull request as ready for review December 27, 2022 00:54
Copy link
Member

@EvenOldridge EvenOldridge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. The wrapping of everything to a function for graph optimization may be something we want to explore in the future. Please create a brief issue describing what's proposed and why.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/tensorflow bug Something isn't working ci
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants