Asan leaks in Vulkan #5716
Comments
I found a few issues here, but haven't yet figured out the right solutions.
So long as IREE_ASSERT_OK is tied to gtest it's fine to wrap in an iree::Status - if we wanted to switch to a C unit testing framework then we'd want to make it safe. We could try making the macros something like:

    #define IREE_ASSERT_OK(rexpr) ASSERT_THAT(iree::Status(rexpr), ..)

or

    #define IREE_ASSERT_OK(rexpr) \
      { \
        iree::Status __status = {(rexpr)}; \
        ASSERT_THAT(__status...) \
      }

though we may want to do some

    iree_status_t status = ...;
    IREE_ASSERT_OK(status);
    do_something(status); // status not valid here

All for making it harder to leak them in tests in the normal path; leaks in failure cases are fine (as barely any test code is expected to gracefully fail).
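The ownership idea behind those macros can be sketched without gtest or IREE. Everything below is a hypothetical stand-in: `sketch_status_t`, `sketch_make_error`, `sketch_status_free`, `ScopedStatus`, and `SKETCH_CHECK_OK` are illustration-only names, not IREE's API. The point is only that a scoped wrapper consumes the status and frees it on every path, so the check itself cannot leak.

```cpp
#include <cassert>

// Hypothetical stand-in for a C status handle: nullptr means OK, anything
// else owns heap storage that must be freed exactly once.
struct SketchStatusStorage { int code; };
using sketch_status_t = SketchStatusStorage*;

// Counts live (unfreed) error statuses so the sketch can show nothing leaks.
static int g_live_statuses = 0;

sketch_status_t sketch_make_error(int code) {
  ++g_live_statuses;
  return new SketchStatusStorage{code};
}

void sketch_status_free(sketch_status_t status) {
  if (status != nullptr) {
    --g_live_statuses;
    delete status;
  }
}

// Takes ownership of the status and frees it when the scope ends.
class ScopedStatus {
 public:
  explicit ScopedStatus(sketch_status_t status) : status_(status) {}
  ~ScopedStatus() { sketch_status_free(status_); }
  ScopedStatus(const ScopedStatus&) = delete;
  ScopedStatus& operator=(const ScopedStatus&) = delete;
  bool ok() const { return status_ == nullptr; }

 private:
  sketch_status_t status_;
};

// Evaluates rexpr once, records whether it was OK, and frees the status
// before leaving the block. The status is consumed: it must not be used
// after this macro.
#define SKETCH_CHECK_OK(rexpr, ok_out)        \
  do {                                        \
    ScopedStatus sketch_scoped_((rexpr));     \
    (ok_out) = sketch_scoped_.ok();           \
  } while (0)
```

A real gtest version would presumably replace the `ok_out` assignment with an `ASSERT_THAT` on the wrapper, as in the macro proposals above; the consume-on-check behavior is what makes `do_something(status)` after the check invalid.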
Yeah, the status leaks came up when I broke dlopen by passing the wrong flags - I finally saw accurate asan stacks, but for status code and not Vulkan code :P

Any dlopen()/dlclose() tweaks may need to reach into the Vulkan loader as well: https://github.com/KhronosGroup/Vulkan-Loader/blob/8198bebc7fe31c3da54b1dfacbb92e8697646701/loader/vk_loader_platform.h#L238-L257. I'm also trying some LD_PRELOAD tricks but no luck so far.
Editing the Vulkan Loader code linked above helped me get useful reports from ASAN, which all pointed to leaks throughout

Possible next steps:
Progress on #5715 and #5716

Leaks in the Vulkan-related libraries we use were hidden behind incomplete handling of shared library loading/unloading in ASan. By disabling calls to `dlclose()` in both `iree/base/internal/dynamic_library_posix.c` and the Vulkan Loader (`libvulkan.so.1`) so those libraries remained open for ASan to reference, I was able to get useful leak reports.

Those reports showed that my NVIDIA system Vulkan ICD (`libnvidia-glcore.so`) was leaking and an up to date SwiftShader (`libvk_swiftshader.so`) was _not_ leaking.

This PR updates SwiftShader to a commit that doesn't leak (with our usage, anyways) and enables most of the Vulkan tests that were previously excluded from running under ASan.

---

A few tests are still failing with crashes in ASan, with logs like this:
```
Tracer caught signal 11: addr=0x0 pc=0x50c558 sp=0x7fb28fdffd10
==50923==LeakSanitizer has encountered a fatal error.
==50923==HINT: For debugging, try setting environment variable LSAN_OPTIONS=verbosity=1:log_threads=1
==50923==HINT: LeakSanitizer does not work under ptrace (strace, gdb, etc)
```
([full logs here](https://source.cloud.google.com/results/invocations/a37ab871-4cab-4591-a5d6-8ad849f196e3/targets/iree%2Fgcp_ubuntu%2Fcmake%2Flinux%2Fx86-swiftshader-asan%2Fpresubmit/log)), so I'm keeping those disabled explicitly.
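For readers who want to reproduce the "keep libraries open under ASan" experiment, here is a minimal sketch of the idea. It is not the actual change made to `dynamic_library_posix.c` or the Vulkan Loader. `SketchCloseLibrary`, `SketchCloseFn`, and `SKETCH_HAS_ASAN` are hypothetical names, and the close function is injected (a real caller would pass `dlclose`) so the sketch compiles without linking against the dynamic loader.

```cpp
#include <cassert>

// Compile-time ASan detection: GCC defines __SANITIZE_ADDRESS__, Clang
// exposes __has_feature(address_sanitizer).
#if defined(__SANITIZE_ADDRESS__)
#define SKETCH_HAS_ASAN 1
#elif defined(__has_feature)
#if __has_feature(address_sanitizer)
#define SKETCH_HAS_ASAN 1
#endif
#endif
#ifndef SKETCH_HAS_ASAN
#define SKETCH_HAS_ASAN 0
#endif

// Same shape as dlclose(): returns 0 on success.
using SketchCloseFn = int (*)(void*);

// Unloads a library unless keep_loaded is set. Returns true only if the
// library was actually unloaded. Skipping the unload deliberately keeps the
// handle open so a sanitizer can still resolve symbols in that module when
// it prints leak reports at exit.
bool SketchCloseLibrary(void* handle, SketchCloseFn close_fn,
                        bool keep_loaded = (SKETCH_HAS_ASAN != 0)) {
  if (handle == nullptr) return false;  // nothing to unload
  if (keep_loaded) return false;        // intentionally leave it loaded
  return close_fn(handle) == 0;
}
```

Defaulting `keep_loaded` from the build configuration keeps normal builds unloading libraries as before, while sanitizer builds leave them mapped.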
Updating SwiftShader helped! Most Vulkan tests are fixed + enabled under ASan now. A few still crash, with logs like the LeakSanitizer output quoted above.
Next step is to try that debugging hint.
Still seeing a few ASan+Vulkan tests fail occasionally with that LeakSanitizer crash.
Seems like the failures are flakes, since they keep popping up one at a time?
See #5716.

I just excluded all lit tests from here: https://github.com/iree-org/iree/blob/db6b68773e4daab4da2de1e835c16c4323b583ec/tests/e2e/models/CMakeLists.txt#L13-L24

As there are other tests in that file using runners other than `lit` that have _not_ crashed (that I'm aware of), I did not exclude the entire directory.

Most recent failure: https://github.com/iree-org/iree/runs/8146877750?check_suite_focus=true
```
1030/1030 Test #46: iree/tests/e2e/models/resnet50_fake_weights.mlir.test ....................................................................***Failed 63.16 sec
-- Testing: 1 tests, 1 workers --
FAIL: IREE :: e2e/models/resnet50_fake_weights.mlir (1 of 1)
******************** TEST 'IREE :: e2e/models/resnet50_fake_weights.mlir' FAILED ********************
Script:
--
: 'RUN: at line 4'; iree-run-mlir --iree-input-type=mhlo --iree-hal-target-backends=llvm-cpu /home/runner/actions-runner/_work/iree/iree/tests/e2e/models/resnet50_fake_weights.mlir --function_input=1x224x224x3xf32 | FileCheck /home/runner/actions-runner/_work/iree/iree/tests/e2e/models/resnet50_fake_weights.mlir
: 'RUN: at line 5'; [[ $IREE_VULKAN_DISABLE == 1 ]] || (iree-run-mlir --iree-input-type=mhlo --iree-hal-target-backends=vulkan-spirv /home/runner/actions-runner/_work/iree/iree/tests/e2e/models/resnet50_fake_weights.mlir --function_input=1x224x224x3xf32 | FileCheck /home/runner/actions-runner/_work/iree/iree/tests/e2e/models/resnet50_fake_weights.mlir)
--
Exit Code: 1

Command Output (stderr):
--
Tracer caught signal 11: addr=0x0 pc=0x61dbf9 sp=0x7f63edb65d30
==53069==LeakSanitizer has encountered a fatal error.
==53069==HINT: For debugging, try setting environment variable LSAN_OPTIONS=verbosity=1:log_threads=1
==53069==HINT: LeakSanitizer does not work under ptrace (strace, gdb, etc)
```
Seeing this test flake on a few presubmits:
https://github.com/iree-org/iree/actions/runs/3316089449/jobs/5477477103
https://github.com/iree-org/iree/actions/runs/3316263166/jobs/5477856081

Looks like #5716
```
FAIL: IREE :: e2e/regression/globals.mlir (1 of 1)
******************** TEST 'IREE :: e2e/regression/globals.mlir' FAILED ********************
Script:
--
: 'RUN: at line 1'; iree-run-mlir --iree-input-type=mhlo --iree-hal-target-backends=vmvx /work/tests/e2e/regression/globals.mlir | FileCheck /work/tests/e2e/regression/globals.mlir
: 'RUN: at line 2'; [[ $IREE_VULKAN_DISABLE == 1 ]] || (iree-run-mlir --iree-input-type=mhlo --iree-hal-target-backends=vulkan-spirv /work/tests/e2e/regression/globals.mlir | FileCheck /work/tests/e2e/regression/globals.mlir)
--
Exit Code: 1

Command Output (stderr):
--
Tracer caught signal 11: addr=0x0 pc=0x624e79 sp=0x7f87b02e3d30
==39757==LeakSanitizer has encountered a fatal error.
==39757==HINT: For debugging, try setting environment variable LSAN_OPTIONS=verbosity=1:log_threads=1
==39757==HINT: LeakSanitizer does not work under ptrace (strace, gdb, etc)
--
********************
********************
Failed Tests (1):
  IREE :: e2e/regression/globals.mlir
```

The other tests in that suite either don't use Vulkan or can lower to just VM calls (no HAL): https://github.com/iree-org/iree/blob/87c5f3f353f9b733f137004c01673f6385515823/tests/e2e/regression/CMakeLists.txt#L16-L22

skip-ci
We have leaks detected by asan (well lsan really) in our vulkan tests. For now I'm disabling vulkan from the asan bot I'm bringing up. I'm not sure if this is something we can fix, at least for swiftshader, or whether we just need to figure out the right sanitizer suppressions.
Logs: https://gist.github.com/13f0f1d8bf176ec8c0d04145fc92e021
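If suppressions turn out to be the way to go, one option is to compile them into the test binaries through LeakSanitizer's `__lsan_default_suppressions` hook. This is only a sketch: the pattern below is an assumption chosen for illustration (a driver library outside the project's control, matched by module name), and whether suppressing it is appropriate would need checking against real reports.

```cpp
#include <cassert>
#include <cstring>

// LeakSanitizer calls this hook, if the program defines it, to obtain
// built-in suppression rules. Each line has the form "leak:<pattern>", where
// the pattern is matched against module, function, and source file names in
// the leak's allocation stack.
extern "C" const char* __lsan_default_suppressions() {
  // Assumed pattern for illustration only: ignore leaks whose stack passes
  // through a vendor driver library the project cannot fix.
  return "leak:libnvidia-glcore.so\n";
}
```

The same rules could instead live in a suppressions file passed via `LSAN_OPTIONS=suppressions=<path>`, which avoids rebuilding when the list changes.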