Memory indexing problem for 10k threads #1

Open
nitinsatpute opened this issue Sep 14, 2021 · 3 comments

@nitinsatpute

Hello, thank you for your implementation. It's working fine for 2,500 threads. However, if I increase the number of threads to 10k on a GPU with 40GB of memory, it gives the following memory-related error. Could you investigate this issue with me?

========= Invalid __global__ write of size 1
=========     at 0x000001e0 in shuffling_kernel(unsigned char*, unsigned char const *, curandStateXORWOW*, unsigned int, unsigned int)
=========     by thread (999,0,0) in block (3,0,0)
=========     Address 0x7fcf040c2c9f is out of bounds
=========     Device Frame:shuffling_kernel(unsigned char*, unsigned char const *, curandStateXORWOW*, unsigned int, unsigned int) (shuffling_kernel(unsigned char*, unsigned char const *, curandStateXORWOW*, unsigned int, unsigned int) : 0x1e0)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/lib/x86_64-linux-gnu/libcuda.so.1 [0x252e4a]
=========     Host Frame:./exec [0x234f5]
=========     Host Frame:./exec [0x76bf0]
=========     Host Frame:./exec [0xa5dd]
=========     Host Frame:./exec [0x11ad8]
=========     Host Frame:./exec [0x11f45]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf3) [0x270b3]
=========     Host Frame:./exec [0x8c7e]
=========
========= Invalid __global__ write of size 1
=========     at 0x000001e0 in shuffling_kernel(unsigned char*, unsigned char const *, curandStateXORWOW*, unsigned int, unsigned int)
=========     by thread (998,0,0) in block (3,0,0)
=========     Address 0x7fcf040c2c9e is out of bounds
=========     Device Frame:shuffling_kernel(unsigned char*, unsigned char const *, curandStateXORWOW*, unsigned int, unsigned int) (shuffling_kernel(unsigned char*, unsigned char const *, curandStateXORWOW*, unsigned int, unsigned int) : 0x1e0)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/lib/x86_64-linux-gnu/libcuda.so.1 [0x252e4a]
=========     Host Frame:./exec [0x234f5]
=========     Host Frame:./exec [0x76bf0]
=========     Host Frame:./exec [0xa5dd]
=========     Host Frame:./exec [0x11ad8]
=========     Host Frame:./exec [0x11f45]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf3) [0x270b3]
=========     Host Frame:./exec [0x8c7e]

@yeah1kim
Owner

Hello, thank you for your comment.
My code allocates (num_block×num_thread×1,000,000) bytes of GPU memory, and the memory address is 32-bit.
Therefore, if (num_block×num_thread), that is, the number of threads, is 4,000, the memory address fits within 32 bits, and thus the operation is possible.

log2(4,000×1,000,000) = 31.90

However, if (num_block×num_thread) is 5,000, it does not work because the address exceeds 32 bits.

log2(5,000×1,000,000) = 32.22

Because of this problem, even though I have 12GB of global memory, I couldn't get it to work at 10,000 threads.
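The arithmetic above can be checked on the host with a few lines of C++. This is only a sketch under stated assumptions: the function names and the constant `kBytesPerThread` are made up for illustration and are not taken from the project's source; the 1,000,000 bytes per thread is the figure quoted in this comment.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Per-thread buffer size quoted in the thread: one million one-byte samples.
// (Hypothetical name, not from the project's source.)
constexpr std::uint64_t kBytesPerThread = 1000000ULL;

// Total bytes needed for `threads` parallel iterations, computed in 64 bits.
std::uint64_t total_bytes(std::uint64_t threads) {
    assert(threads <= UINT64_MAX / kBytesPerThread);  // keep the 64-bit product exact
    return threads * kBytesPerThread;
}

// log2 of the byte count, the quantity quoted in the comment above.
double log2_bytes(std::uint64_t bytes) {
    return std::log2(static_cast<double>(bytes));
}

// True when the byte count is representable as an unsigned 32-bit integer.
bool fits_in_uint32(std::uint64_t bytes) {
    return bytes <= static_cast<std::uint64_t>(UINT32_MAX);
}
```

With these helpers, `fits_in_uint32(total_bytes(4000))` is true and `fits_in_uint32(total_bytes(5000))` is false, which matches the 31.90 and 32.22 figures.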

<(num_block×num_thread) = 4,000>

 C:\Users\user>C:\Users\user\Downloads\yeah_GPU_SP800_90B_IID-master\cuda_iid\x64\Release\cuda_iid.exe
 <file_name>: Must be relative path to a binary file with 1 million entries (samples).
          ex) C:\Users\user\Desktop\test_data\truerand_1bit.bin
 C:\Users\user\Downloads\yeah_GPU_SP800_90B_IID-master\test_data\truerand_8bit.bin

[the_number_of_samples]: Must be at least 1 million samples. If 0, set as the default(= 1,000,000).
0

[bits_per_symbol]: Must be between 1-8, inclusive.
8

Enter: [num_iteration_in_parallel] [num_cuda_block] [num_cuda_thread]
   Must be [num_iteration_in_parallel] = [num_cuda_block] x [num_cuda_thread].
   If all 0, set as the defaults(2500, 10, 250), and at least 2.5GB of GPU global memory is used.
     - [num_iteration_in_parallel]: The number of iterations in parallel.
                                Must have ([num_iteration_in_parallel] x 1 million bytes) of GPU global memory.
     - [num_cuda_block]: The number of CUDA blocks.
     - [num_cuda_thread]: The number of CUDA blocks.
4000 4 1000

[verbose] Optional verbosity flag(0/1) for more output. 0(false) is the default.
0

Start the permutation testing.
Performing 19 statistical tests on the original data.
Performing 10,000 iterations(shuffling + 18 statisitcal tests) in parallel on the GPU.
Performing 10,000 iterations(shuffling + compression test) using OpenMP on the CPU.
End the permutation testing.

==> Assume that the noise source outputs are IID!

<(num_block×num_thread) = 5000>

 C:\Users\user>C:\Users\user\Downloads\yeah_GPU_SP800_90B_IID-master\cuda_iid\x64\Release\cuda_iid.exe
 <file_name>: Must be relative path to a binary file with 1 million entries (samples).
              ex) C:\Users\user\Desktop\test_data\truerand_1bit.bin
 C:\Users\user\Downloads\yeah_GPU_SP800_90B_IID-master\test_data\truerand_8bit.bin

[the_number_of_samples]: Must be at least 1 million samples. If 0, set as the default(= 1,000,000).
0

[bits_per_symbol]: Must be between 1-8, inclusive.
8

Enter: [num_iteration_in_parallel] [num_cuda_block] [num_cuda_thread]
   Must be [num_iteration_in_parallel] = [num_cuda_block] x [num_cuda_thread].
   If all 0, set as the defaults(2500, 10, 250), and at least 2.5GB of GPU global memory is used.
     - [num_iteration_in_parallel]: The number of iterations in parallel.
                                Must have ([num_iteration_in_parallel] x 1 million bytes) of GPU global memory.
     - [num_cuda_block]: The number of CUDA blocks.
     - [num_cuda_thread]: The number of CUDA blocks.
5000 5 1000

[verbose] Optional verbosity flag(0/1) for more output. 0(false) is the default.
0

Start the permutation testing.
Performing 19 statistical tests on the original data.
Performing 10,000 iterations(shuffling + 18 statisitcal tests) in parallel on the GPU.
**cudaMemcpy failed.**
No need to perform 10,000 compression tests using OpenMP on the CPU.
End the permutation testing.

==> Reject the IID assumption!

@nitinsatpute
Author

Hi again, thank you for your response. I would like to clarify the following point.
"Because of this problem, even though I have 12GB of global memory, I couldn't get it to work at 10,000 threads." Could you comment on the results in Table 9? It shows numbers for N=5,000 and N=10,000 threads.

@yeah1kim
Owner

Hi, thank you for your comment.
My previous response described an issue I was worried about while implementing the code. Unfortunately, it was the wrong answer to your question.

When N (the total number of threads) was 5,000 or more, the code only worked if the size (e.g., 10,000,000,000 or 5,000,000,000) was entered directly as the size argument of the cudaMalloc call that allocates GPU memory of size dev_Ndata.
The results in Table 9 of my thesis came from an experiment in which the memory size was entered directly as the argument to that function.

Thanks to your comment, I rechecked my source code and corrected the type casting of dev_Ndata
(gpu_permutation_testing.cu, lines 58 and 59).

As a result, it now works when N is 5000 or more without modifying the source code further.
I uploaded the modified source code.
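The actual change at lines 58 and 59 is not shown in this thread, so the following is only a generic sketch of the kind of type-casting problem described: a byte count computed from two 32-bit operands wraps modulo 2^32 before it is widened, while casting one operand first keeps the full value. The function names are hypothetical, and it is an assumption that the original size was computed in 32-bit unsigned arithmetic. On a real allocation the widened value would be passed as the `size_t` size argument of `cudaMalloc`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical narrow computation: both operands are 32-bit, so the product
// is reduced modulo 2^32 before it is widened to 64 bits.
std::uint64_t size_narrow(std::uint32_t threads, std::uint32_t samples) {
    std::uint32_t product = threads * samples;  // wraps for large inputs
    return product;
}

// Widened computation: cast one operand first so the multiply happens in 64 bits.
std::uint64_t size_wide(std::uint32_t threads, std::uint32_t samples) {
    return static_cast<std::uint64_t>(threads) * samples;
}

// Usage: the wrapped values follow from 2^32 = 4,294,967,296.
void demo() {
    assert(size_narrow(4000, 1000000) == 4000000000ULL);   // still fits in 32 bits
    assert(size_narrow(5000, 1000000) == 705032704ULL);    // wrapped once
    assert(size_narrow(10000, 1000000) == 1410065408ULL);  // wrapped twice
    assert(size_wide(10000, 1000000) == 10000000000ULL);   // intended size
}
```

Under that assumption, a request for 10,000 threads would allocate only about 1.4GB instead of 10GB, which would be consistent with the out-of-bounds writes reported at the top of this issue.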

Thanks for your comment again.
