Asan leaks in Vulkan #5716

Open
GMNGeoffrey opened this issue May 3, 2021 · 6 comments
Labels
bug 🐞 Something isn't working infrastructure Relating to build systems, CI, or testing

Comments

@GMNGeoffrey
Contributor

We have leaks detected by ASan (well, LSan really) in our Vulkan tests. For now I'm disabling Vulkan on the ASan bot I'm bringing up. I'm not sure whether this is something we can fix, at least for SwiftShader, or whether we just need to figure out the right sanitizer suppressions.

Logs: https://gist.github.com/13f0f1d8bf176ec8c0d04145fc92e021

@ScottTodd
Member

I found a few issues here, but haven't yet figured out the right solutions.


IREE_ASSERT_OK leaks iree_status_t structures for non-ok statuses

71: ==108401==ERROR: LeakSanitizer: detected memory leaks
71: 
71: Direct leak of 64 byte(s) in 1 object(s) allocated from:
71:     #0 0x4a45e7 in __interceptor_posix_memalign (/usr/local/google/home/scotttodd/code/iree-build/iree/hal/vulkan/dynamic_symbols_test+0x4a45e7)
71:     #1 0x4e63f6 in iree_aligned_alloc /usr/local/google/home/scotttodd/code/iree/iree/base/allocator.h:135:10
71:     #2 0x4e612d in iree_status_allocate /usr/local/google/home/scotttodd/code/iree/iree/base/status.c:395:60
71:     #3 0x4e04fd in iree::hal::vulkan::DynamicSymbols::CreateFromSystemLoader(iree::ref_ptr<iree::hal::vulkan::DynamicSymbols>*) /usr/local/google/home/scotttodd/code/iree/iree/hal/vulkan/dynamic_symbols.cc:168:12

The code for that lives in https://github.com/google/iree/blob/main/iree/testing/status_matchers.h and was mostly written back before the C++ -> C HAL rewrite. Removing the C++ iree::Status class and updating the test macros to be leak free and built on top of the C API seems like good cleanup.

https://github.com/google/iree/blob/6258b50051e4aeb97ebc10f28e9e240760366c6f/iree/hal/vulkan/dynamic_symbols_test.cc#L43-L45
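
The leak pattern can be sketched with stand-in types. Everything named here (`FakeStatus`, `fake_status_allocate`, `fake_status_free`, `ScopedStatus`) is hypothetical and is not IREE's API; the only assumption carried over from the report above is that a non-OK status owns heap storage. A check that only inspects the handle leaves that storage allocated, while an owning wrapper frees it when the check finishes.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical stand-ins for a C status API; these are NOT IREE's real types.
struct FakeStatus {
  int code;  // 0 means OK
};

static int g_live_statuses = 0;  // statuses allocated but not yet freed

// OK is represented as nullptr (no allocation); failures own heap storage.
FakeStatus* fake_status_allocate(int code) {
  if (code == 0) return nullptr;
  auto* status = static_cast<FakeStatus*>(std::malloc(sizeof(FakeStatus)));
  if (!status) std::abort();
  status->code = code;
  ++g_live_statuses;
  return status;
}

void fake_status_free(FakeStatus* status) {
  if (!status) return;
  --g_live_statuses;
  std::free(status);
}

bool fake_status_is_ok(const FakeStatus* status) { return status == nullptr; }

// Owning wrapper: takes the handle and frees it when it goes out of scope.
class ScopedStatus {
 public:
  explicit ScopedStatus(FakeStatus* status) : status_(status) {}
  ~ScopedStatus() { fake_status_free(status_); }
  ScopedStatus(const ScopedStatus&) = delete;
  ScopedStatus& operator=(const ScopedStatus&) = delete;
  bool ok() const { return fake_status_is_ok(status_); }

 private:
  FakeStatus* status_;
};

// Leaky check: inspects the status but never frees it.
bool leaky_is_ok(FakeStatus* status) { return fake_status_is_ok(status); }

// Leak-free check: the temporary wrapper frees the status before returning.
bool scoped_is_ok(FakeStatus* status) { return ScopedStatus(status).ok(); }
```

With `leaky_is_ok` the caller still has to free the status after a failed check, which is the step a test macro can easily skip.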


dynamic library loading (dlopen/dlsym/dlclose) interferes with ASAN reporting and symbolization

We open libvulkan.so, but ASAN has trouble understanding that and constructing a useful leak report. It seems like ASAN needs loaded libraries to remain open at process exit: see https://stackoverflow.com/questions/44627258/addresssanitizer-and-loading-of-dynamic-libraries-at-runtime-unknown-module and various issues on GitHub. Without any changes, I see all these <unknown module> addresses in the ASAN report:

71: ==110995==ERROR: LeakSanitizer: detected memory leaks
71: 
71: Direct leak of 72704 byte(s) in 1 object(s) allocated from:
71:     #0 0x4a3b7d in malloc (/usr/local/google/home/scotttodd/code/iree-build/iree/hal/vulkan/dynamic_symbols_test+0x4a3b7d)
71:     #1 0x7f8aadec3cd5  (<unknown module>)
71:     #2 0x7f8aadeb31e2  (<unknown module>)
71: 
71: Direct leak of 6304 byte(s) in 4 object(s) allocated from:
71:     #0 0x4a3d12 in __interceptor_calloc (/usr/local/google/home/scotttodd/code/iree-build/iree/hal/vulkan/dynamic_symbols_test+0x4a3d12)
71:     #1 0x7f8ab813c2db  (<unknown module>)
71: 
71: Direct leak of 896 byte(s) in 1 object(s) allocated from:
71:     #0 0x4a3e73 in __interceptor_realloc (/usr/local/google/home/scotttodd/code/iree-build/iree/hal/vulkan/dynamic_symbols_test+0x4a3e73)
71:     #1 0x7f8ab813b582  (<unknown module>)
71: 
71: Direct leak of 192 byte(s) in 2 object(s) allocated from:
71:     #0 0x4a3b7d in malloc (/usr/local/google/home/scotttodd/code/iree-build/iree/hal/vulkan/dynamic_symbols_test+0x4a3b7d)
71:     #1 0x7f8aae67bc97  (<unknown module>)
71: 
71: Indirect leak of 139616 byte(s) in 437 object(s) allocated from:
71:     #0 0x4a3d12 in __interceptor_calloc (/usr/local/google/home/scotttodd/code/iree-build/iree/hal/vulkan/dynamic_symbols_test+0x4a3d12)
71:     #1 0x7f8ab813c2db  (<unknown module>)
71: 
71: Indirect leak of 4923 byte(s) in 7 object(s) allocated from:
71:     #0 0x4a3b7d in malloc (/usr/local/google/home/scotttodd/code/iree-build/iree/hal/vulkan/dynamic_symbols_test+0x4a3b7d)
71:     #1 0x7f8aae67bc97  (<unknown module>)
71: 
71: Indirect leak of 4871 byte(s) in 361 object(s) allocated from:
71:     #0 0x4a3b7d in malloc (/usr/local/google/home/scotttodd/code/iree-build/iree/hal/vulkan/dynamic_symbols_test+0x4a3b7d)
71:     #1 0x7f8ab813be74  (<unknown module>)
71: 
71: Indirect leak of 752 byte(s) in 6 object(s) allocated from:
71:     #0 0x4a3e73 in __interceptor_realloc (/usr/local/google/home/scotttodd/code/iree-build/iree/hal/vulkan/dynamic_symbols_test+0x4a3e73)
71:     #1 0x7f8ab813b582  (<unknown module>)
71: 
71: SUMMARY: AddressSanitizer: 230258 byte(s) leaked in 819 allocation(s).

I tried some variations on our dlopen() flags (mainly adding RTLD_NODELETE) and removing our call to dlclose(), but I still can't get useful data from ASAN.
https://github.com/google/iree/blob/6258b50051e4aeb97ebc10f28e9e240760366c6f/iree/base/internal/dynamic_library_posix.c#L176-L179
https://github.com/google/iree/blob/6258b50051e4aeb97ebc10f28e9e240760366c6f/iree/base/internal/dynamic_library_posix.c#L234-L250
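
For reference, the kind of variation described above might look like the sketch below. The helper names and the base flags (`RTLD_LAZY | RTLD_LOCAL`) are assumptions and are not taken from `dynamic_library_posix.c`; only `dlopen`, `dlclose`, and the `RTLD_*` flags are real. As noted above, this change alone did not produce useful reports.

```cpp
#include <dlfcn.h>

#include <cassert>

// Detect AddressSanitizer on Clang (__has_feature) and GCC (__SANITIZE_ADDRESS__).
#if defined(__has_feature)
#if __has_feature(address_sanitizer)
#define SKETCH_ASAN_ENABLED 1
#endif
#endif
#if !defined(SKETCH_ASAN_ENABLED) && defined(__SANITIZE_ADDRESS__)
#define SKETCH_ASAN_ENABLED 1
#endif
#ifndef SKETCH_ASAN_ENABLED
#define SKETCH_ASAN_ENABLED 0
#endif

// Hypothetical helper: opens a library, asking the loader to keep it mapped
// for the life of the process when built with ASan.
void* SketchOpenLibrary(const char* path) {
  int flags = RTLD_LAZY | RTLD_LOCAL;  // assumed base flags
#if SKETCH_ASAN_ENABLED && defined(RTLD_NODELETE)
  flags |= RTLD_NODELETE;
#endif
  return dlopen(path, flags);
}

// Hypothetical helper: returns true only if dlclose() was called and succeeded.
// Under ASan the handle is intentionally left open so that the module is still
// loaded when the leak report is symbolized at exit.
bool SketchCloseLibrary(void* handle) {
  if (!handle) return false;
#if SKETCH_ASAN_ENABLED
  (void)handle;
  return false;
#else
  return dlclose(handle) == 0;
#endif
}
```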

@benvanik
Collaborator

So long as IREE_ASSERT_OK is tied to gtest it's fine to wrap in an iree::Status - if we wanted to switch to a C unit testing framework then we'd want to make it safe.

We could try making the macros something like:

#define IREE_ASSERT_OK(rexpr) ASSERT_THAT(iree::Status(rexpr), ..)

or

#define IREE_ASSERT_OK(rexpr) \
  { \
    iree::Status __status = {(rexpr)}; \
    ASSERT_THAT(__status...) \
  }

though we may want to do some && trickery to ensure they aren't used later; this would be bad:

iree_status_t status = ...;
IREE_ASSERT_OK(status);
do_something(status);  // status not valid here

All for making it harder to leak them in tests in the normal path --- leaks in failure cases are fine (as barely any test code is expected to gracefully fail).
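
One way the "&& trickery" could look, as a sketch only: `fake_status_t`, `MakeStatus`, and `ConsumeIsOk` are hypothetical stand-ins rather than IREE's API. The consuming function accepts only rvalues, so passing a named status without `std::move` fails to compile, and the moved-from handle is nulled so later use is at least detectable.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for a C status handle; NOT IREE's real type.
struct FakeStatusStorage {
  int code;
};
using fake_status_t = FakeStatusStorage*;

// nullptr means OK; failures allocate storage that the consumer must free.
inline fake_status_t MakeStatus(int code) {
  return code == 0 ? nullptr : new FakeStatusStorage{code};
}

// Rvalue overload: takes ownership, frees the storage, nulls the handle.
inline bool ConsumeIsOk(fake_status_t&& status) {
  const bool ok = (status == nullptr);
  delete status;
  status = nullptr;
  return ok;
}

// Lvalue overload is deleted: ConsumeIsOk(named_status) does not compile.
bool ConsumeIsOk(fake_status_t& status) = delete;

// Compile-time check that the two call forms behave as described.
template <typename T>
auto TryConsume(int)
    -> decltype(ConsumeIsOk(std::declval<T>()), std::true_type{});
template <typename T>
std::false_type TryConsume(...);

static_assert(decltype(TryConsume<fake_status_t&&>(0))::value,
              "temporaries and std::move(status) are accepted");
static_assert(!decltype(TryConsume<fake_status_t&>(0))::value,
              "named statuses are rejected");
```

A macro built on a function like this would force the example above to be written as `IREE_ASSERT_OK(std::move(status))`, which makes the ownership transfer visible at the call site.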

@ScottTodd
Member

Yeah, the status leaks came up when I broke dlopen by passing the wrong flags - I finally saw accurate asan stacks, but for status code and not Vulkan code :P

Any dlopen()/dlclose() tweaks may need to reach into the Vulkan loader as well: https://github.com/KhronosGroup/Vulkan-Loader/blob/8198bebc7fe31c3da54b1dfacbb92e8697646701/loader/vk_loader_platform.h#L238-L257. I'm also trying some LD_PRELOAD tricks but no luck so far.

@ScottTodd
Member

Editing the Vulkan Loader code linked above helped me get useful reports from ASAN, which all pointed to leaks throughout libnvidia-glcore.so on my machine. When I switch to using SwiftShader, I see no leaks in IREE's Vulkan tests.

Possible next steps:

  • Debug inside our Docker images, with dlopen/dlclose changes to IREE's code and the Vulkan Loader
  • Update our SwiftShader build to a more recent commit - maybe there were leaks that were fixed
  • Investigate ASAN suppressions (these might also need the dlopen/dlclose changes?)
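
For the suppressions option, one possibility is to compile a default suppression list into the test binary through LeakSanitizer's `__lsan_default_suppressions` hook. This is a sketch, not something tried in this thread: the pattern below assumes the leaking module is `libnvidia-glcore.so` as observed above, and a module-name pattern probably only matches if the library is still loaded when the report is generated, hence the dlopen/dlclose caveat.

```cpp
#include <cassert>
#include <cstring>

// LeakSanitizer looks for this hook and, if the program defines it, reads
// suppression rules from the returned string (one "leak:<pattern>" per line).
// A pattern is matched against function, source file, and module names.
extern "C" const char* __lsan_default_suppressions() {
  // Assumed pattern: suppress leaks whose stack includes the NVIDIA driver.
  return "leak:libnvidia-glcore.so\n";
}
```

The same rules can live in a file passed via `LSAN_OPTIONS=suppressions=<path>`, which avoids rebuilding the tests.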

ScottTodd added a commit that referenced this issue Mar 15, 2022
Progress on #5715 and #5716

Leaks in the Vulkan-related libraries we use were hidden behind incomplete handling of shared library loading/unloading in ASan. By disabling calls to `dlclose()` in both `iree/base/internal/dynamic_library_posix.c` and the Vulkan Loader (`libvulkan.so.1`) so those libraries remained open for ASan to reference, I was able to get useful leak reports. Those reports showed that my NVIDIA system Vulkan ICD (`libnvidia-glcore.so`) was leaking and an up to date SwiftShader (`libvk_swiftshader.so`) was _not_ leaking.

This PR updates SwiftShader to a commit that doesn't leak (with our usage, anyways) and enables most of the Vulkan tests that were previously excluded from running under ASan.

---

A few tests are still failing with crashes in ASan, with logs like this:
```
Tracer caught signal 11: addr=0x0 pc=0x50c558 sp=0x7fb28fdffd10
==50923==LeakSanitizer has encountered a fatal error.
==50923==HINT: For debugging, try setting environment variable LSAN_OPTIONS=verbosity=1:log_threads=1
==50923==HINT: LeakSanitizer does not work under ptrace (strace, gdb, etc)
```

([full logs here](https://source.cloud.google.com/results/invocations/a37ab871-4cab-4591-a5d6-8ad849f196e3/targets/iree%2Fgcp_ubuntu%2Fcmake%2Flinux%2Fx86-swiftshader-asan%2Fpresubmit/log)), so I'm keeping those disabled explicitly.
@ScottTodd
Member

Updating SwiftShader helped!

Most Vulkan tests are fixed + enabled under ASan now.
These remain:
https://github.com/google/iree/blob/bcb6199e19d5a6320397258a8028fc5ec3459d0d/build_tools/kokoro/gcp_ubuntu/cmake/linux/x86-swiftshader-asan/build.sh#L105-L111

with logs like

Tracer caught signal 11: addr=0x0 pc=0x50c558 sp=0x7fb28fdffd10
==50923==LeakSanitizer has encountered a fatal error.
==50923==HINT: For debugging, try setting environment variable LSAN_OPTIONS=verbosity=1:log_threads=1
==50923==HINT: LeakSanitizer does not work under ptrace (strace, gdb, etc)

(full logs here)

Next step is to try that debugging hint (LSAN_OPTIONS), possibly with dlclose() disabled in IREE code and in libvulkan.so.1 (the Vulkan Loader). I wasn't able to repro those crashes on my local machine, so we might need to debug under Docker.
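
If setting the environment variable on the bots is awkward, the same options can be compiled into a test binary with the `__lsan_default_options` hook. This is a sketch under the assumption that the failing binaries can be rebuilt; to my understanding, options set through `LSAN_OPTIONS` at runtime still take precedence over these defaults.

```cpp
#include <cassert>
#include <cstring>

// LeakSanitizer reads default runtime options from this hook when the
// program defines it. The values mirror the HINT printed in the crash log.
extern "C" const char* __lsan_default_options() {
  return "verbosity=1:log_threads=1";
}
```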

@ScottTodd
Member

Still seeing a few ASan+Vulkan tests fail occasionally with that LeakSanitizer crash:

e2e/models/mnist_fake_weights.mlir: https://github.com/iree-org/iree/runs/7768946982?check_suite_focus=true
e2e/models/fragment_000.mlir.test: #10004

Seems like the failures are flakes, since they keep popping up one at a time?

ScottTodd added a commit that referenced this issue Sep 7, 2022
See #5716. I just excluded all
lit tests from here:
https://github.com/iree-org/iree/blob/db6b68773e4daab4da2de1e835c16c4323b583ec/tests/e2e/models/CMakeLists.txt#L13-L24

As there are other tests in that file using runners other than `lit`
that have _not_ crashed (that I'm aware of), I did not exclude the
entire directory.

Most recent failure:
https://github.com/iree-org/iree/runs/8146877750?check_suite_focus=true

```
1030/1030 Test   #46: iree/tests/e2e/models/resnet50_fake_weights.mlir.test ....................................................................***Failed   63.16 sec
-- Testing: 1 tests, 1 workers --
FAIL: IREE :: e2e/models/resnet50_fake_weights.mlir (1 of 1)
******************** TEST 'IREE :: e2e/models/resnet50_fake_weights.mlir' FAILED ********************
Script:
--
: 'RUN: at line 4';   iree-run-mlir --iree-input-type=mhlo --iree-hal-target-backends=llvm-cpu /home/runner/actions-runner/_work/iree/iree/tests/e2e/models/resnet50_fake_weights.mlir --function_input=1x224x224x3xf32 | FileCheck /home/runner/actions-runner/_work/iree/iree/tests/e2e/models/resnet50_fake_weights.mlir
: 'RUN: at line 5';   [[ $IREE_VULKAN_DISABLE == 1 ]] || (iree-run-mlir --iree-input-type=mhlo --iree-hal-target-backends=vulkan-spirv /home/runner/actions-runner/_work/iree/iree/tests/e2e/models/resnet50_fake_weights.mlir --function_input=1x224x224x3xf32 | FileCheck /home/runner/actions-runner/_work/iree/iree/tests/e2e/models/resnet50_fake_weights.mlir)
--
Exit Code: 1

Command Output (stderr):
--
Tracer caught signal 11: addr=0x0 pc=0x61dbf9 sp=0x7f63edb65d30
==53069==LeakSanitizer has encountered a fatal error.
==53069==HINT: For debugging, try setting environment variable LSAN_OPTIONS=verbosity=1:log_threads=1
==53069==HINT: LeakSanitizer does not work under ptrace (strace, gdb, etc)
```
@GMNGeoffrey GMNGeoffrey removed their assignment Oct 3, 2022
ScottTodd added a commit that referenced this issue Oct 25, 2022
Seeing this test flake on a few presubmits:
https://github.com/iree-org/iree/actions/runs/3316089449/jobs/5477477103
https://github.com/iree-org/iree/actions/runs/3316263166/jobs/5477856081

Looks like #5716

```
FAIL: IREE :: e2e/regression/globals.mlir (1 of 1)
******************** TEST 'IREE :: e2e/regression/globals.mlir' FAILED ********************
Script:
--
: 'RUN: at line 1';   iree-run-mlir --iree-input-type=mhlo --iree-hal-target-backends=vmvx /work/tests/e2e/regression/globals.mlir | FileCheck /work/tests/e2e/regression/globals.mlir
: 'RUN: at line 2';   [[ $IREE_VULKAN_DISABLE == 1 ]] || (iree-run-mlir --iree-input-type=mhlo --iree-hal-target-backends=vulkan-spirv /work/tests/e2e/regression/globals.mlir | FileCheck /work/tests/e2e/regression/globals.mlir)
--
Exit Code: 1

Command Output (stderr):
--
Tracer caught signal 11: addr=0x0 pc=0x624e79 sp=0x7f87b02e3d30
==39757==LeakSanitizer has encountered a fatal error.
==39757==HINT: For debugging, try setting environment variable LSAN_OPTIONS=verbosity=1:log_threads=1
==39757==HINT: LeakSanitizer does not work under ptrace (strace, gdb, etc)

--

********************
********************
Failed Tests (1):
  IREE :: e2e/regression/globals.mlir
```

The other tests in that suite either don't use Vulkan or can lower to
just VM calls (no HAL):
https://github.com/iree-org/iree/blob/87c5f3f353f9b733f137004c01673f6385515823/tests/e2e/regression/CMakeLists.txt#L16-L22

skip-ci
PhaneeshB pushed a commit to PhaneeshB/iree that referenced this issue Oct 31, 2022
…-org#10892)
