TF integrations Python + Vulkan failures #9936
Comments
What's the Vulkan environment like on these runners? Vulkan SDK version, GPU driver version, physical GPU vs SwiftShader, etc.
It's all in the swiftshader-frontends docker container
Seems like @stellaraccident already has a draft PR for fixing this
APIs removed:
* HalDriver.create() (use iree.runtime.get_driver(driver_name) to get a cached instance).
* Environment variable IREE_DEFAULT_DRIVER renamed to IREE_DEFAULT_DEVICE to better reflect the new syntax.
* Config.driver attribute (no longer captured by this class).

APIs added:
* iree.runtime.query_available_drivers() (alias of HalDriver.query())
* iree.runtime.get_driver(device_uri)
* iree.runtime.get_device(device_uri)
* iree.runtime.get_first_device(device_uris)
* iree.runtime.Config(*, device: HalDevice) (to configure with an explicit device)
* HalDriver.create_device(device_id: Union[int, tuple])
* HalDriver.query_available_devices()
* HalDriver.create_device_by_uri(device_uri: str)

Both driver and device lookup are done by a device URI, as defined by the runtime (when creating a driver, only the URI's scheme is used). Driver instances are cached by name in the native code, which should avoid various bad behaviors around driver lifetimes and careless handling of process state. Devices are optionally cached at the Python level (default True).

Fixes #9277
Expected to fix #9936
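A minimal sketch of how the new lookup flow might be used, based only on the API names listed above; the "vulkan" URI and exact keyword names are assumptions, not verified against a released runtime:

```python
# Sketch of the new driver/device lookup flow described above; the "vulkan"
# URI and exact signatures are assumptions, not verified against a release.
import iree.runtime as rt

# Drivers are cached by name in native code, so repeated lookups reuse one instance.
print(rt.query_available_drivers())

# Lookup is by device URI; only the scheme ("vulkan") selects the driver.
driver = rt.get_driver("vulkan")
print(driver.query_available_devices())

# Devices are cached at the Python level by default.
device = rt.get_device("vulkan")

# Configure the runtime with an explicit device rather than a driver name.
config = rt.Config(device=device)
```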
Didn't mean to have this auto-close. I haven't verified but do expect that the root cause is fixed. Let me know if not.
I'm still seeing the same failures in #9954
Boo. What do I do to test this?
Well there's the really slow way of iterating via GitHub actions 😆 I think I actually got the same errors on my 96-core workstation, but I can also point you at a VM with the same config as the CI
This was a bit tricky to track down. It only affects structured argument packing/unpacking: we were leaking a reference to the sub-list, which, since it usually contains buffers, leaked any contained device-backed memory buffers. Much later, with asserts enabled, the Vulkan driver would notice that not all allocated memory had been freed and assert. I'm not completely sure about the cause of the non-determinism; since the problem manifests on shutdown, my guess is that the normal shutdown process is sometimes bypassed. This only affects a subset of TF-specific invocation schemes, and until that was obvious, it was hard to see the pattern. Fixes #9936
* Move ref vs retain when stealing pointer for structured args (details in the comment above). Fixes #9936
* Re-enable vulkan tests on new CI.
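For readers less familiar with the move-vs-retain distinction the fix hinges on, here is a toy refcounting sketch in Python (not IREE code; every name is hypothetical) showing how an extra retain on a pointer the caller intended to hand off keeps the sub-list, and hence its contained buffers, alive forever:

```python
# Toy model of move vs. retain semantics; not IREE code, all names hypothetical.
class Ref:
    """A manually refcounted handle, standing in for a VM ref owning device buffers."""

    def __init__(self, name: str):
        self.name = name
        self.count = 1  # one owner on creation: the caller

    def retain(self) -> "Ref":
        self.count += 1
        return self

    def release(self) -> None:
        self.count -= 1
        if self.count == 0:
            print(f"{self.name}: freed")


def store_stealing_buggy(sub_list: Ref) -> Ref:
    # The caller is handing ownership off ("stealing" the pointer), but we
    # retain anyway, so the count ends up one higher than the number of owners.
    return sub_list.retain()


def store_stealing_fixed(sub_list: Ref) -> Ref:
    # Move semantics: take over the caller's reference without a retain.
    return sub_list


stored = store_stealing_buggy(Ref("sub-list"))
stored.release()  # count 2 -> 1: never reaches 0, contained buffers leak
stored = store_stealing_fixed(Ref("sub-list"))
stored.release()  # count 1 -> 0: freed as expected
```

In the leaky variant the allocation is never released, which is consistent with the asserts-enabled Vulkan driver complaining at teardown that not all allocated memory was freed.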
Issue body
In #9927, I'm consistently hitting the following error in 25 Vulkan TF integrations tests
This is the same error message as #3381, but @stellaraccident suggested it was probably a different issue. What's odd about this issue is that it happens consistently on the GitHub Actions runner, but wasn't happening on Kokoro at all. I've tried reducing test parallelism to rule out competition for resources (one difference with the new machines is that they have more cores, and #9905 hit this issue), but that did not help.
For now, I'm going to disable the Vulkan tests in this job, but this will block #9855.
These are the failing tests:
Full logs: https://gist.githubusercontent.com/GMNGeoffrey/6e1ca63782858a2d9ee36157351032a1/raw/e7454d9b4a2fbc8dbdd65630b8bd5c145fdeffdd/logs.txt