asan: problem calling NVIDIA CUDA libraries #629

Open
maleadt opened this issue Dec 2, 2015 · 17 comments

maleadt commented Dec 2, 2015

I'm having an issue using ASAN with the NVIDIA CUDA libraries on an x86_64 system: cuInit returns a bogus error code. Given this small test case (compiled with clang -lcuda):

#include <cuda.h>
#include <stdio.h>

int main() {
    printf("cuInit(0) = %d\n", cuInit(0));
    return 0;
}

When using -fsanitize=address, this call returns 2 (an impossible return value for cuInit) instead of the expected 0. I tried tweaking almost every ASAN option, to no avail.

A verbose log:

$ ASAN_OPTIONS=verbosity=4,debug=1 ./cuinit_asan
==9400==Parsed ASAN_OPTIONS: verbosity=4,debug=1
==9400==AddressSanitizer: failed to intercept '__isoc99_printf'
==9400==AddressSanitizer: failed to intercept '__isoc99_sprintf'
==9400==AddressSanitizer: failed to intercept '__isoc99_snprintf'
==9400==AddressSanitizer: failed to intercept '__isoc99_fprintf'
==9400==AddressSanitizer: failed to intercept '__isoc99_vprintf'
==9400==AddressSanitizer: failed to intercept '__isoc99_vsprintf'
==9400==AddressSanitizer: failed to intercept '__isoc99_vsnprintf'
==9400==AddressSanitizer: failed to intercept '__isoc99_vfprintf'
==9400==AddressSanitizer: libc interceptors initialized
|| `[0x10007fff8000, 0x7fffffffffff]` || HighMem    ||
|| `[0x02008fff7000, 0x10007fff7fff]` || HighShadow ||
|| `[0x00008fff7000, 0x02008fff6fff]` || ShadowGap  ||
|| `[0x00007fff8000, 0x00008fff6fff]` || LowShadow  ||
|| `[0x000000000000, 0x00007fff7fff]` || LowMem     ||
MemToShadow(shadow): 0x00008fff7000 0x000091ff6dff 0x004091ff6e00 0x02008fff6fff
redzone=16
max_redzone=2048
quarantine_size=256M
malloc_context_size=30
SHADOW_SCALE: 3
SHADOW_GRANULARITY: 8
SHADOW_OFFSET: 7fff8000
==9400==Installed the sigaction for signal 11
==9400==SetCurrentThread: 0x7fe10a946000 for thread 0x7fe10a940780
==9400==T0: stack [0x7fff9d022000,0x7fff9d822000) size 0x800000; local=0x7fff9d8219c0
==9400==AddressSanitizer Init done
cuInit(0) = 2
__tls_get_addr: DTLS_Resize 0x7fe10a940760 0
__tls_get_addr: 0x7fe108c81d90 {0x000000000002,0x000000000000} => 0x7fe10a9406a8; tls_beg: 0x7fe10a9406a8; sp: 0x7fff9d821980 num_live_dtls 1
__tls_get_addr: Can't guess glibc version
==9401==Attached to thread 9400.
==9401==Detached from thread 9400.

System specs: clang 3.5.0 from the Debian repositories, on an x86_64 system running Linux 3.16.0 with NVIDIA drivers 340.65 (corresponding to CUDA 6.0). On an otherwise identical 32-bit system, the error does not occur.
I also tried llvm/clang/compiler-rt from SVN trunk; same issue.


kcc commented Dec 2, 2015

I've recently added a flag protect_shadow_gap exactly for this purpose.
cuInit() is trying to mmap a large chunk of address space which intersects with the mapping
that asan makes to protect its "shadow gap".
With ASAN_OPTIONS=protect_shadow_gap=0 cuInit() passes for me, as well as a tiny test using cuda (C++ is built with clang+asan, cuda code is built with clang, see http://llvm.org/docs/CompileCudaWithLLVM.html).
I have not tried anything more complex.

Note that the cuda driver is closed source and no one knows what it does, why it maps a large chunk at a fixed address, or what will happen with that mapping at run-time. So, you are in a warranty-void zone :(
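
If you want to see the collision yourself, a small probe like this (an illustrative sketch; the build line and file name are assumptions) dumps the inaccessible mappings before and after cuInit():

// Sketch: observe the large fixed-address reservation that cuInit() creates.
// Build (assumption): clang -fsanitize=address probe_cuinit.c -lcuda
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

// Print the inaccessible ("---p") mappings, which include both asan's
// shadow gap and the driver's PROT_NONE reservation.
static void dump_prot_none(const char *when) {
    printf("--- inaccessible mappings %s cuInit ---\n", when);
    system("grep -- '---p' /proc/self/maps");
}

int main(void) {
    dump_prot_none("before");
    printf("cuInit(0) = %d\n", cuInit(0));
    dump_prot_none("after");
    return 0;
}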


maleadt commented Dec 2, 2015

What an (awesome) coincidence; thanks for the swift reply! My test case now works; I'll be testing it out on my more complex CUDA back-end soon. Let's see whether all hell really breaks loose 😄

maleadt closed this as completed Dec 2, 2015

kcc commented Dec 2, 2015

FTR, the allocations done in cuInit() are probably related to CUDA's unified-virtual-address-space.
As soon as you call cudaMallocHost() and use that pointer in asan-instrumented code -- you are in trouble. Since I am not a CUDA expert myself I can't (easily) make a test case for that, so any help here is welcome.
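
Something along these lines might be a starting point (an untested sketch; whether it actually reproduces a problem is exactly what needs checking, and the build line mirrors the one used later in this thread):

// Sketch: pinned host memory from cudaMallocHost() touched by asan-instrumented host code.
// Build (assumption): clang++ -fsanitize=address pinned.cc -lcudart_static -lcuda -ldl -lrt -pthread
#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>

int main() {
  void *p = nullptr;
  cudaError_t err = cudaMallocHost(&p, 1024 * sizeof(float));
  printf("cudaMallocHost = %d, ptr = %p\n", (int)err, p);
  if (err != cudaSuccess) return 1;
  float *pinned = static_cast<float *>(p);
  // The interesting part: instrumented host code writes and reads the pinned buffer.
  memset(pinned, 0, 1024 * sizeof(float));
  pinned[0] = 42.0f;
  printf("pinned[0] = %f\n", pinned[0]);
  cudaFreeHost(p);
  return 0;
}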


maleadt commented Dec 3, 2015

As you suspected, disabling protect_shadow_gap isn't enough; even though it makes cuInit succeed, other API calls fail. For example, cuCtxCreate_v2 crashes in the memset interceptor, complaining that Address 0x000200000000 is located in the shadow gap area. I'll see about making a small test case.

However, it seems that the CUDA allocations are very much fixed: always 0x200000000-0xd00000000. I tried altering the shadow offset (kDefaultShort64bitShadowOffset) to 0xd00000000, or back to 1ULL << 44 (which seemed to be the previous behaviour), but that only led to segfaults when compiling more complex code. Maybe I'm doing something wrong; changing the offset should work, right?

Seeing how there's prelink 'detection', I also tried pre-occupying the CUDA range (from the constructor of an LD_PRELOAD library), which forced ASAN to split the shadow gap (/proc/self/maps below; a sketch of such a constructor follows the dump):

7fff7000-8fff7000 rw-p 00000000 00:11 6387124                            /dev/shm/13891 [low shadow] (deleted)
8fff7000-200000000 ---p 00000000 00:11 6387126                           /dev/shm/13891 [shadow gap] (deleted)
200000000-d00000000 ---p 00000000 00:00 0                                CUDA blob
d00000000-2008fff7000 ---p c70009000 00:11 6387126                       /dev/shm/13891 [shadow gap] (deleted)
2008fff7000-10007fef8000 rw-p 00000000 00:11 6387125                     /dev/shm/13891 [high shadow] (deleted)
10007fef8000-10007fff7000 rw-p 00000000 00:00 0 
10007fff7000-10007fff8000 rw-p dfff0000000 00:11 6387125                 /dev/shm/13891 [high shadow] (deleted)
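
A sketch of the kind of constructor I mean (illustrative, not the exact code; file and function names are made up, and the range is the one from the dump above):

// Sketch of an LD_PRELOAD constructor reserving the CUDA range up front.
// Build (assumption): clang -shared -fPIC -o preoccupy.so preoccupy.c
// Run (assumption): LD_PRELOAD=./preoccupy.so ./a.out
// Note: whether this runs before ASAN sets up its mappings depends on how the runtime initializes.
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

__attribute__((constructor))
static void preoccupy_cuda_range(void) {
  void *base = (void *)0x200000000ULL;
  size_t size = (size_t)(0xd00000000ULL - 0x200000000ULL);
  void *p = mmap(base, size, PROT_NONE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED | MAP_NORESERVE, -1, 0);
  if (p == MAP_FAILED)
    perror("mmap of CUDA range");
}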

Then I hooked cuInit to unmap the memory before performing the actual call. However, the memset interceptor still complains about 0x000200000000 being part of the shadow gap. Is this expected? Although the mapping is clearly split, what ASAN prints at verbosity=4 still seems like the default mapping:

|| `[0x10007fff8000, 0x7fffffffffff]` || HighMem    ||
|| `[0x02008fff7000, 0x10007fff7fff]` || HighShadow ||
|| `[0x00008fff7000, 0x02008fff6fff]` || ShadowGap  ||
|| `[0x00007fff8000, 0x00008fff6fff]` || LowShadow  ||
|| `[0x000000000000, 0x00007fff7fff]` || LowMem     ||

Any pointers? Thanks!


kcc commented Dec 3, 2015

Try ASAN_OPTIONS=protect_shadow_gap=0:replace_intrin=0:detect_leaks=0.
I'll update this bug with a test a bit later.


kcc commented Dec 3, 2015

Here is my test (derived from http://llvm.org/docs/CompileCudaWithLLVM.html):

axpy.cu:

#include <helper_cuda.h> // for checkCudaErrors
#include <iostream>
__global__ void axpy(float a, float* x, float* y) {
  y[threadIdx.x] = a * x[threadIdx.x];
}
void test_axpy() {
  const int kDataLen = 4;
  float a = 2.0f;
  float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f};
  float host_y[kDataLen];
  float* device_x;
  float* device_y;
  checkCudaErrors(cudaMalloc(&device_x, kDataLen * sizeof(float)));
  checkCudaErrors(cudaMalloc(&device_y, kDataLen * sizeof(float)));
  checkCudaErrors(cudaMemcpy(device_x, host_x, kDataLen * sizeof(float),
                             cudaMemcpyHostToDevice));
  axpy<<<1, kDataLen>>>(a, device_x, device_y);
  checkCudaErrors(cudaDeviceSynchronize());
  checkCudaErrors(cudaMemcpy(host_y, device_y, kDataLen * sizeof(float),
                             cudaMemcpyDeviceToHost));
  for (int i = 0; i < kDataLen; ++i) {
    std::cout << "y[" << i << "] = " << host_y[i] << "\n";
  }
  checkCudaErrors(cudaDeviceReset());
}

axpy_main.cc:

void test_axpy();
int main() { test_axpy(); }

Build:

clang --cuda-path=cuda -I cuda/samples/common/inc/ axpy.cu -c
clang++ axpy_main.cc axpy.o -lcudart_static -lcuda -ldl -lrt -pthread  -L /usr/lib/x86_64-linux-gnu/ -L cuda/lib64 -fsanitize=address

Run:

ASAN_OPTIONS=protect_shadow_gap=0:replace_intrin=0:detect_leaks=0 ./a.out 

W/o protect_shadow_gap=0 cuInit() returns 2

W/o replace_intrin=0 (which disables checks inside the memset interceptor) I get this:

==17300==ERROR: AddressSanitizer: unknown-crash on address 0x000200000000 at pc 0x00000049b6c6 bp 0x7fffa85cb210 sp 0x7fffa85ca9c8
WRITE of size 65536 at 0x000200000000 thread T0
    #0 0x49b6c5 in __interceptor_memset
    #1 0x7fb0c82e6813  (/usr/lib/x86_64-linux-gnu/libcuda.so.1+0x225813)
    #2 0x7fb0c82e6ee7  (/usr/lib/x86_64-linux-gnu/libcuda.so.1+0x225ee7)
    #3 0x7fb0c82e761d  (/usr/lib/x86_64-linux-gnu/libcuda.so.1+0x22661d)
    #4 0x7fb0c8241bc4  (/usr/lib/x86_64-linux-gnu/libcuda.so.1+0x180bc4)
    #5 0x7fb0c83d7a4e  (/usr/lib/x86_64-linux-gnu/libcuda.so.1+0x316a4e)
    #6 0x7fb0c822761b  (/usr/lib/x86_64-linux-gnu/libcuda.so.1+0x16661b)
    #7 0x7fb0c81f325a in cuDevicePrimaryCtxRetain (/usr/lib/x86_64-linux-gnu/libcuda.so.1+0x13225a)
...
Address 0x000200000000 is located in the shadow gap area.

This failure should be easy to fix, but for now let's use replace_intrin=0.

W/o detect_leaks=0 I get this at the end:

==17352==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 4904 byte(s) in 1 object(s) allocated from:
    #0 0x4b1154 in calloc 
    #1 0x7fe8667fd4a2  (/usr/lib/x86_64-linux-gnu/libcuda.so.1+0x2264a2)
    #2 0x7fe866757bc4  (/usr/lib/x86_64-linux-gnu/libcuda.so.1+0x180bc4)
    #3 0x7fe8668eda4e  (/usr/lib/x86_64-linux-gnu/libcuda.so.1+0x316a4e)
    #4 0x7fe86673d61b  (/usr/lib/x86_64-linux-gnu/libcuda.so.1+0x16661b)
    #5 0x7fe86670925a in cuDevicePrimaryCtxRetain

Now I'd like to have a test that does something more interesting and fails somewhere else.

kcc reopened this Dec 3, 2015

maleadt commented Dec 4, 2015

Concerning your test case, I'm seeing mostly identical behaviour, except for the case without detect_leaks, which works properly at my end. Probably some behavioural change in libcuda; I'm using v340.65.
Even my tests now work properly as soon as I disable replace_intrin. However, like your test case, my host code (i.e. the code invoking libcuda) isn't ASAN-instrumented yet, so new errors could pop up once it is. I'll keep my eyes open for new types of errors and post them here.

That said, shouldn't my two attempted workarounds have succeeded (changing the shadow offset, and forcing a split shadow allocation by mmapping the memory)? I'm willing to have a closer look at both, if it's unexpected behaviour of course.

FTR, it seems like device pointers returned by cudaMalloc and the like overlap with the large blob, which is allocated with PROT_NONE, so maybe NVIDIA is just doing this to catch uses of device pointers in host code.
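
A quick way to check that (an untested sketch) would be to print a few device pointers and compare them against the blob's range:

// Sketch: see whether cudaMalloc() results fall inside the PROT_NONE blob
// observed at 0x200000000-0xd00000000 above.
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

int main() {
  for (int i = 0; i < 4; ++i) {
    void *dev = nullptr;
    if (cudaMalloc(&dev, 1 << 20) != cudaSuccess) return 1;
    uintptr_t a = (uintptr_t)dev;
    printf("device ptr %p is %s the blob\n", dev,
           (a >= 0x200000000ULL && a < 0xd00000000ULL) ? "inside" : "outside");
  }
  return 0;
}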


kcc commented Dec 4, 2015

My libcuda is 346.96. Looks like they've introduced a leak :)

Instrumenting the device (GPU) code with asan is most likely not going to work today; at least for me, adding -fsanitize=address to the compile command of the .cu file crashes the compiler. If you want to track that part, file a bug at llvm.org/bugs, but I am pretty sure nobody will look into it for at least several months -- the entire cuda support is too new.

Your attempted workarounds might have worked; I need to look closer at why they didn't. But it's unlikely we'll be able to accept any such changes.

Apparently, the large blob at 0x200000000 is used by cuda's allocator as an arena.
The main question: is this memory going to be used by the CPU code, or only by the GPU code?
If this memory is commonly used by CPU code (which will break asan), I'd like to have a realistic example.
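
Something like this would qualify, I think (an untested sketch; whether managed allocations actually land in that arena is an assumption):

// Sketch: cudaMallocManaged() returns a pointer that plain,
// asan-instrumented CPU code reads and writes directly.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  void *p = nullptr;
  if (cudaMallocManaged(&p, 1024 * sizeof(float)) != cudaSuccess) return 1;
  float *data = static_cast<float *>(p);
  for (int i = 0; i < 1024; ++i)  // host-side writes, fully instrumented
    data[i] = (float)i;
  printf("data = %p, data[0] = %f\n", p, data[0]);
  cudaFree(p);
  return 0;
}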


bchretien commented Oct 7, 2016

This may be fixed in the next clang release, or at least there are LLVM contributors interested in this issue: https://reviews.llvm.org/D24640 (sanitizer skipped on NVPTX)


kcc commented Oct 7, 2016

I've also made ASAN_OPTIONS=protect_shadow_gap=0 more friendly to cuda, and the tests work in my environment. But YMMV.

@jeremysalwen

I am not getting any errors with kcc's test program/options, but the following short test program is giving a SEGV with the same options:

#include <cuda_runtime.h>
#include <iostream>

int main() {
  void *  host_data;
  cudaError_t status = cudaMalloc(&host_data, 3000);
  std::cout << "status " << status <<"\n";
  std::cout << "value "<<((int*)host_data)[0]<<"\n";
}

Compilation flags:

clang++-3.8 -fsanitize=address axpy_main.cc -lcudart_static -lcuda -ldl -lrt -pthread  -L /usr/lib/x86_64-linux-gnu/ -I /usr/local/cuda-8.0/include/ -L /usr/local/cuda-8.0/lib64/

Error:

ASAN:DEADLYSIGNAL
=================================================================
==9844==ERROR: AddressSanitizer: SEGV on unknown address 0x0004e1470000 (pc 0x0000004f61c5 bp 0x7ffdbfe3c780 sp 0x7ffdbfe3c6a0 T0)
    #0 0x4f61c4  (/tmp/a.out+0x4f61c4)
    #1 0x7f1e8c1b0f44  (/lib/x86_64-linux-gnu/libc.so.6+0x21f44)
    #2 0x41ae47  (/tmp/a.out+0x41ae47)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/tmp/a.out+0x4f61c4)
==9844==ABORTING
status 0

@morehouse

Has there been any progress on this front?


yegord commented Aug 7, 2018

> I am not getting any errors with kcc's test program/options, but the following short test program is giving a SEGV with the same options:

I think you actually wanted to call cudaMallocHost, not cudaMalloc. After I replaced the latter with the former, your example ran fine with both clang-3.9 and clang 6.0.

> Has there been any progress on this front?

I personally managed to successfully run a binary computing a heavy TensorFlow graph on a GPU under asan, using clang-6.0 and ASAN_OPTIONS=protect_shadow_gap=0:replace_intrin=0:detect_leaks=0. So I would say that things work more often than they don't.

@mustafaahmedhussien

Hello,

Why doesn't asan provide an environment variable or a runtime flag, rather than a compile-time option, to let the end user decide where the shadow memory should be?


kcc commented Nov 14, 2018

There is now a compile-time flag to enable "dynamic shadow": -mllvm -asan-force-dynamic-shadow=true
It will cause the compiler to emit code that doesn't depend on the fixed shadow location,
and then the run-time will allocate the shadow whenever it can. There is no env var that tells the run-time where to allocate -- no one has asked before.

The default mode (on Linux) still uses the fixed shadow offset 0x7fff8000 so that the instrumentation doesn't use an extra register (as a performance and code-size optimization).
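
For example, rebuilding the earlier axpy test with the flag would look roughly like this (a sketch; every asan-instrumented translation unit needs the flag):

clang++ -fsanitize=address -mllvm -asan-force-dynamic-shadow=true axpy_main.cc axpy.o -lcudart_static -lcuda -ldl -lrt -pthread -L /usr/lib/x86_64-linux-gnu/ -L cuda/lib64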

@mustafaahmedhussien

The driver I am using is doing an mmap of 500 MB at 0x3fffff000. This hits the default shadow memory range. I tried the force-dynamic-shadow flag with clang 6, but it doesn't work: the shadow memory is still at 0x7fff8000.

My idea is that if I can move the shadow memory outside my driver's mmap range, that would be safer than disabling the shadow-gap protection.

@intractabilis

> Note that the cuda driver is closed source and no one knows what it does, why it maps a large chunk at a fixed address and what will happen with that mapping at run-time.

CUDA has its own memory allocation routines, so I would assume it's for its own heap.

iiSeymour pushed a commit to nanoporetech/dorado that referenced this issue Oct 9, 2023
We already do this in ont_core_cpp since it was more of an issue when
talking to the CUDA API, but since torch talks to CUDA for us it gets
to decide how to handle the issue and it looks like it just prints a
warning and continues on assuming no CUDA support. This is fine for end
users, but for automated testing it means we were skipping tests
without realising.

See google/sanitizers#629 for more details.