TF integrations Python + Vulkan failures #9936
Comments
What's the Vulkan environment like on these runners? Vulkan SDK version, GPU driver version, physical GPU vs SwiftShader, etc.
It's all in the swiftshader-frontends docker container
Seems like @stellaraccident already has a draft PR for fixing this
APIs removed:
* HalDriver.create() (use iree.runtime.get_driver(driver_name) to get a cached instance).
* Environment variable IREE_DEFAULT_DRIVER renamed to IREE_DEFAULT_DEVICE to better reflect the new syntax.
* Config.driver attribute (no longer captured by this class).

APIs added:
* iree.runtime.query_available_drivers() (alias of HalDriver.query())
* iree.runtime.get_driver(device_uri)
* iree.runtime.get_device(device_uri)
* iree.runtime.get_first_device(device_uris)
* iree.runtime.Config(*, device: HalDevice) (to configure with an explicit device)
* HalDriver.create_device(device_id: Union[int, tuple])
* HalDriver.query_available_devices()
* HalDriver.create_device_by_uri(device_uri: str)

Both driver and device lookup are done by a device URI, as defined by the runtime (when creating a driver, only the URI's scheme is used). Driver instances are cached by name in the native code, which should avoid various bad behaviors around driver lifetimes and careless handling of process state. Devices are optionally cached at the Python level (default True).

Fixes #9277
Expected to fix #9936
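A minimal sketch of how the new lookup flow might be used, based only on the API names listed above; the "vulkan" URI and exact keyword names are assumptions, not verified against a released runtime:

```python
# Sketch of the new driver/device lookup flow described above; the "vulkan"
# URI and exact signatures are assumptions, not verified against a release.
import iree.runtime as rt

# Drivers are cached by name in native code, so repeated lookups reuse one instance.
print(rt.query_available_drivers())

# Lookup is by device URI; only the scheme ("vulkan") selects the driver.
driver = rt.get_driver("vulkan")
print(driver.query_available_devices())

# Devices are cached at the Python level by default.
device = rt.get_device("vulkan")

# Configure the runtime with an explicit device rather than a driver name.
config = rt.Config(device=device)
```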
Didn't mean to have this auto-close. I haven't verified but do expect that the root cause is fixed. Let me know if not.
I'm still seeing the same failures in #9954
Boo. What do I do to test this?
Well there's the really slow way of iterating via GitHub actions 😆 I think I actually got the same errors on my 96-core workstation, but I can also point you at a VM with the same config as the CI
This was a bit tricky to track down. It only affects structured argument packing/unpacking: we were leaking a reference to the sub-list, which, since it usually contains buffers, leaked any contained device-backed memory buffers. Much later, with asserts enabled, the Vulkan driver would notice that not all allocated memory had been freed and assert. I'm not completely sure about the cause of the non-determinism; since the problem manifests on shutdown, my guess is that the normal shutdown process is sometimes bypassed. This only affects a subset of TF-specific invocation schemes, and until that was obvious, it was hard to see the pattern. Fixes #9936
* Move ref vs retain when stealing pointer for structured args (details in the comment above). Fixes #9936
* Re-enable vulkan tests on new CI.
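For readers less familiar with the move-vs-retain distinction the fix hinges on, here is a toy refcounting sketch in Python (not IREE code; every name is hypothetical) showing how an extra retain on a pointer the caller intended to hand off keeps the sub-list, and hence its contained buffers, alive forever:

```python
# Toy model of move vs. retain semantics; not IREE code, all names hypothetical.
class Ref:
    """A manually refcounted handle, standing in for a VM ref owning device buffers."""

    def __init__(self, name: str):
        self.name = name
        self.count = 1  # one owner on creation: the caller

    def retain(self) -> "Ref":
        self.count += 1
        return self

    def release(self) -> None:
        self.count -= 1
        if self.count == 0:
            print(f"{self.name}: freed")


def store_stealing_buggy(sub_list: Ref) -> Ref:
    # The caller is handing ownership off ("stealing" the pointer), but we
    # retain anyway, so the count ends up one higher than the number of owners.
    return sub_list.retain()


def store_stealing_fixed(sub_list: Ref) -> Ref:
    # Move semantics: take over the caller's reference without a retain.
    return sub_list


stored = store_stealing_buggy(Ref("sub-list"))
stored.release()  # count 2 -> 1: never reaches 0, contained buffers leak
stored = store_stealing_fixed(Ref("sub-list"))
stored.release()  # count 1 -> 0: freed as expected
```

In the leaky variant the allocation is never released, which is consistent with the asserts-enabled Vulkan driver complaining at teardown that not all allocated memory was freed.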
Issue body
In #9927, I'm consistently hitting the following error in 25 Vulkan TF integrations tests
This is the same error message as #3381, but @stellaraccident suggested it was probably a different issue. What's odd about this issue is that it happens consistently on the GitHub Actions runner, but wasn't happening on Kokoro at all. I've tried reducing test parallelism to rule out competition for resources (one difference with the new machines is that they have more cores, and #9905 hit this issue), but that did not help.
For now, I'm going to disable the Vulkan tests in this job, but this will block #9855.
These are the failing tests:
Full logs: https://gist.githubusercontent.com/GMNGeoffrey/6e1ca63782858a2d9ee36157351032a1/raw/e7454d9b4a2fbc8dbdd65630b8bd5c145fdeffdd/logs.txt