
TF integrations Python + Vulkan failures #9936

Closed
GMNGeoffrey opened this issue Jul 28, 2022 · 7 comments · Fixed by #9330 or #9962
Labels: bindings/python (Python wrapping IREE's C API), hal/vulkan (Runtime Vulkan GPU HAL backend)
@GMNGeoffrey
Contributor

In #9927, I'm consistently hitting the following error in 25 Vulkan TF integrations tests:

python3: /home/runner/actions-runner/_work/iree/iree/third_party/vulkan_memory_allocator/include/vk_mem_alloc.h:11939: void VmaDeviceMemoryBlock::Destroy(VmaAllocator): Assertion `m_pMetadata->IsEmpty() && "Some allocations were not freed before destruction of this memory block!"' failed.

This is the same error message as #3381, but @stellaraccident suggested it was probably a different issue. The odd thing about this issue is that it happens consistently on the GitHub Actions runner but wasn't happening on Kokoro at all. I've tried reducing test parallelism to rule out contention for resources (one difference with the new machines is that they have more cores, and #9905 hit this issue), but that did not help.

For now, I'm going to disable the Vulkan tests in this job, but this will block #9855.

These are the failing tests:

iree_tf_tests/layers/vulkan__Add.run
iree_tf_tests/layers/vulkan__AdditiveAttention.run
iree_tf_tests/layers/vulkan__Attention.run
iree_tf_tests/layers/vulkan__Average.run
iree_tf_tests/layers/vulkan__Concatenate.run
iree_tf_tests/layers/vulkan__Dot.run
iree_tf_tests/layers/vulkan__Maximum.run
iree_tf_tests/layers/vulkan__Minimum.run
iree_tf_tests/layers/vulkan__MultiHeadAttention.run
iree_tf_tests/layers/vulkan__Multiply.run
iree_tf_tests/layers/vulkan__Subtract.run
iree_tf_tests/layers/vulkan__dynamic_dims_Add.run
iree_tf_tests/layers/vulkan__dynamic_dims_Average.run
iree_tf_tests/layers/vulkan__dynamic_dims_Maximum.run
iree_tf_tests/layers/vulkan__dynamic_dims_Minimum.run
iree_tf_tests/layers/vulkan__dynamic_dims_Multiply.run
iree_tf_tests/layers/vulkan__dynamic_dims_Subtract.run
iree_tf_tests/layers/vulkan__full_api_AdditiveAttention.run
iree_tf_tests/layers/vulkan__full_api_Attention.run
iree_tf_tests/layers/vulkan__full_api_Concatenate.run
iree_tf_tests/math/vulkan__accumulate_n.run
iree_tf_tests/math/vulkan__add_n.run
iree_tf_tests/math/vulkan__dynamic_dim_polyval.run
iree_tf_tests/math/vulkan__polyval.run
iree_tf_tests/uncategorized/vulkan__pytree.run

Full logs: https://gist.githubusercontent.com/GMNGeoffrey/6e1ca63782858a2d9ee36157351032a1/raw/e7454d9b4a2fbc8dbdd65630b8bd5c145fdeffdd/logs.txt

@GMNGeoffrey added the hal/vulkan (Runtime Vulkan GPU HAL backend) and bindings/python (Python wrapping IREE's C API) labels on Jul 28, 2022
@ScottTodd
Member

What's the Vulkan environment like on these runners? Vulkan SDK version, GPU driver version, physical GPU vs SwiftShader, etc.

@GMNGeoffrey
Contributor Author

It's all in the swiftshader-frontends Docker container.

@GMNGeoffrey
Contributor Author

It seems like @stellaraccident already has a draft PR to fix this.

GMNGeoffrey added a commit that referenced this issue Jul 28, 2022
I've had to disable Vulkan tests due to #9936

Part of #9855
stellaraccident pushed a commit that referenced this issue Jul 29, 2022
APIs removed:
  * HalDriver.create() (use iree.runtime.get_driver(driver_name) to get a
    cached instance).
  * Environment variable IREE_DEFAULT_DRIVER renamed to
    IREE_DEFAULT_DEVICE to better reflect the new syntax.
  * Config.driver attribute (no longer captured by this class)

APIs added:
  * iree.runtime.query_available_drivers() (alias of HalDriver.query())
  * iree.runtime.get_driver(device_uri)
  * iree.runtime.get_device(device_uri)
  * iree.runtime.get_first_device(device_uris)
  * iree.runtime.Config(*, device: HalDevice) (to configure with an
    explicit device)
  * HalDriver.create_device(device_id: Union[int, tuple])
  * HalDriver.query_available_devices()
  * HalDriver.create_device_by_uri(device_uri: str)

Both driver and device lookup are done by a device URI, as defined by the runtime (when creating a driver, only the 'scheme' is used). Driver instances are cached by name in the native code, which should avoid various bad behaviors around driver lifetimes and careless handling of process state. Devices are optionally cached at the Python level (default True).

Fixes #9277
Expected to fix #9936
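
As a quick orientation for anyone updating call sites, here is a minimal sketch of how the reworked API is meant to be used, based only on the names listed in the commit message above; the exact signatures, URI formats, and return values are assumptions, so check the bindings for the authoritative forms.

```python
# Illustrative sketch only -- not taken from the PR. Names come from the
# commit message above; signatures, URI strings, and return values are
# assumptions.
import iree.runtime as rt

# Drivers are looked up by URI scheme and cached by name in native code.
print(rt.query_available_drivers())   # alias of HalDriver.query()
driver = rt.get_driver("vulkan")      # replaces the removed HalDriver.create()

# Devices are looked up by a full device URI and (by default) cached in Python.
device = rt.get_device("vulkan://0")  # "vulkan://0" is an assumed URI form

# Config now takes an explicit device rather than a driver name; the default
# device env var is IREE_DEFAULT_DEVICE (renamed from IREE_DEFAULT_DRIVER).
config = rt.Config(device=device)
```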
@stellaraccident
Collaborator

Didn't mean to have this auto-close. I haven't verified but do expect that the root cause is fixed. Let me know if not.

@GMNGeoffrey
Contributor Author

I'm still seeing the same failures in #9954

@stellaraccident
Collaborator

Boo. What do I do to test this?

@GMNGeoffrey
Contributor Author

Well, there's the really slow way of iterating via GitHub Actions 😆. I think I actually got the same errors on my 96-core workstation, but I can also point you at a VM with the same config as the CI.

stellaraccident pushed a commit that referenced this issue Jul 29, 2022
This was a bit tricky to track down. It only affects structured argument
packing/unpacking. We were leaking a reference to the sub-list, which,
given that it usually contains buffers, was causing a leak of any
contained device-backed memory buffers. Then, much later, with asserts
enabled, the Vulkan driver would notice that not all allocated memory
was freed and assert.

I am not completely sure of the cause of the non-determinism. Since the
problem manifests on shutdown, my guess (just a guess) is that the normal
shutdown process is sometimes bypassed. This only affects a subset of
TF-specific invocation schemes, and until that was obvious, it was
difficult to see the pattern.

Fixes #9936
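
To make the "ref vs retain" distinction concrete, here is a purely illustrative toy sketch. This is not IREE's actual binding code (the real fix is in the native bindings); the class and function names are invented. It shows why retaining an extra reference to a sub-list of device-backed buffers keeps the allocations alive until the allocator's shutdown assert fires.

```python
# Toy illustration only -- not IREE code. RefCountedList, pack_args_buggy and
# pack_args_fixed are invented names to show the retain-vs-move distinction.

class RefCountedList:
    """Stand-in for a VM list whose elements are device-backed buffers."""

    def __init__(self, buffers):
        self.buffers = buffers
        self.refcount = 1  # the creator holds the initial reference

    def retain(self):
        self.refcount += 1
        return self

    def release(self):
        self.refcount -= 1
        if self.refcount == 0:
            # Only when the count reaches zero are the contained buffers freed.
            self.buffers.clear()


def pack_args_buggy(sub_list, out_args):
    # Bug pattern: the consumer of out_args takes ("steals") ownership, but we
    # also retain an extra reference that nothing ever releases. The count can
    # never reach zero, so the device memory is still live when the Vulkan
    # allocator tears down its memory blocks and asserts.
    out_args.append(sub_list.retain())


def pack_args_fixed(sub_list, out_args):
    # Fixed pattern: move our existing reference into out_args instead of
    # retaining a new one; whoever drains out_args releases it exactly once.
    out_args.append(sub_list)
```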
stellaraccident pushed a commit that referenced this issue Jul 30, 2022
* Move ref vs retain when stealing pointer for structured args.

Fixes #9936

* Re-enable Vulkan tests on new CI.