Memory indexing problem for 10k threads #1

nitinsatpute · 2021-09-14T05:19:07Z

Hello, thank you for your implementation. It works fine with 2,500 threads. However, if I increase the number of threads to 10k on a GPU with 40GB of memory, it gives the following memory-related error. Could you investigate this issue with me?

========= Invalid __global__ write of size 1
=========     at 0x000001e0 in shuffling_kernel(unsigned char*, unsigned char const *, curandStateXORWOW*, unsigned int, unsigned int)
=========     by thread (999,0,0) in block (3,0,0)
=========     Address 0x7fcf040c2c9f is out of bounds
=========     Device Frame:shuffling_kernel(unsigned char*, unsigned char const *, curandStateXORWOW*, unsigned int, unsigned int) (shuffling_kernel(unsigned char*, unsigned char const *, curandStateXORWOW*, unsigned int, unsigned int) : 0x1e0)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/lib/x86_64-linux-gnu/libcuda.so.1 [0x252e4a]
=========     Host Frame:./exec [0x234f5]
=========     Host Frame:./exec [0x76bf0]
=========     Host Frame:./exec [0xa5dd]
=========     Host Frame:./exec [0x11ad8]
=========     Host Frame:./exec [0x11f45]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf3) [0x270b3]
=========     Host Frame:./exec [0x8c7e]
=========
========= Invalid __global__ write of size 1
=========     at 0x000001e0 in shuffling_kernel(unsigned char*, unsigned char const *, curandStateXORWOW*, unsigned int, unsigned int)
=========     by thread (998,0,0) in block (3,0,0)
=========     Address 0x7fcf040c2c9e is out of bounds
=========     Device Frame:shuffling_kernel(unsigned char*, unsigned char const *, curandStateXORWOW*, unsigned int, unsigned int) (shuffling_kernel(unsigned char*, unsigned char const *, curandStateXORWOW*, unsigned int, unsigned int) : 0x1e0)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/lib/x86_64-linux-gnu/libcuda.so.1 [0x252e4a]
=========     Host Frame:./exec [0x234f5]
=========     Host Frame:./exec [0x76bf0]
=========     Host Frame:./exec [0xa5dd]
=========     Host Frame:./exec [0x11ad8]
=========     Host Frame:./exec [0x11f45]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf3) [0x270b3]
=========     Host Frame:./exec [0x8c7e]


yeah1kim · 2021-09-24T01:55:06Z

Hello, thank you for your comment.
My code uses (num_block×num_thread×1,000,000) bytes of GPU memory, and the memory address is 32-bit.
Therefore, if (num_block×num_thread), that is, the total number of threads, is 4,000, the required address range fits within 32 bits, and thus the operation is possible.

log2(4,000×1,000,000) = 31.90

However, if (num_block×num_thread) is 5,000, it does not work because the required address range exceeds 32 bits.

log2(5,000×1,000,000) = 32.22

Because of this problem, even though I have 12GB of global memory, I couldn't get it to work at 10,000 threads.
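
The arithmetic above can be checked with a small host-side sketch. This is not code from the repository: the names `kBytesPerIteration`, `total_bytes_32bit`, and `fits_in_32bit` are hypothetical, and the sketch assumes `unsigned int` is 32 bits wide, as it is on common CUDA platforms. It only illustrates how a byte count evaluated in 32-bit unsigned arithmetic wraps modulo 2^32 once the thread count goes past roughly 4,294.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical constant: each parallel iteration holds 1 million one-byte samples.
constexpr std::uint32_t kBytesPerIteration = 1000000u;

// Byte count as it comes out when the whole product is evaluated in
// 32-bit unsigned arithmetic: the result wraps modulo 2^32.
std::uint32_t total_bytes_32bit(std::uint32_t num_block, std::uint32_t num_thread) {
    return num_block * num_thread * kBytesPerIteration;
}

// True when num_block * num_thread * 1,000,000 is representable in 32 bits.
bool fits_in_32bit(std::uint32_t num_block, std::uint32_t num_thread) {
    // Widen the first operand so the product itself is computed in 64 bits.
    const std::uint64_t exact =
        static_cast<std::uint64_t>(num_block) * num_thread * kBytesPerIteration;
    return exact <= std::numeric_limits<std::uint32_t>::max();
}
```

With 4 blocks of 1,000 threads the product is 4,000,000,000 and fits. With 5 blocks of 1,000 threads the exact value is 5,000,000,000, but the 32-bit result wraps to 705,032,704, far smaller than the memory the kernel would actually touch.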

<(num_block×num_thread) = 4,000>

 C:\Users\user>C:\Users\user\Downloads\yeah_GPU_SP800_90B_IID-master\cuda_iid\x64\Release\cuda_iid.exe
 <file_name>: Must be relative path to a binary file with 1 million entries (samples).
          ex) C:\Users\user\Desktop\test_data\truerand_1bit.bin
 C:\Users\user\Downloads\yeah_GPU_SP800_90B_IID-master\test_data\truerand_8bit.bin

[the_number_of_samples]: Must be at least 1 million samples. If 0, set as the default(= 1,000,000).
0

[bits_per_symbol]: Must be between 1-8, inclusive.
8

Enter: [num_iteration_in_parallel] [num_cuda_block] [num_cuda_thread]
   Must be [num_iteration_in_parallel] = [num_cuda_block] x [num_cuda_thread].
   If all 0, set as the defaults(2500, 10, 250), and at least 2.5GB of GPU global memory is used.
     - [num_iteration_in_parallel]: The number of iterations in parallel.
                                Must have ([num_iteration_in_parallel] x 1 million bytes) of GPU global memory.
     - [num_cuda_block]: The number of CUDA blocks.
     - [num_cuda_thread]: The number of CUDA blocks.
4000 4 1000

[verbose] Optional verbosity flag(0/1) for more output. 0(false) is the default.
0

Start the permutation testing.
Performing 19 statistical tests on the original data.
Performing 10,000 iterations(shuffling + 18 statisitcal tests) in parallel on the GPU.
Performing 10,000 iterations(shuffling + compression test) using OpenMP on the CPU.
End the permutation testing.

==> Assume that the noise source outputs are IID!

<(num_block×num_thread) = 5000>

 C:\Users\user>C:\Users\user\Downloads\yeah_GPU_SP800_90B_IID-master\cuda_iid\x64\Release\cuda_iid.exe
 <file_name>: Must be relative path to a binary file with 1 million entries (samples).
              ex) C:\Users\user\Desktop\test_data\truerand_1bit.bin
 C:\Users\user\Downloads\yeah_GPU_SP800_90B_IID-master\test_data\truerand_8bit.bin

[the_number_of_samples]: Must be at least 1 million samples. If 0, set as the default(= 1,000,000).
0

[bits_per_symbol]: Must be between 1-8, inclusive.
8

Enter: [num_iteration_in_parallel] [num_cuda_block] [num_cuda_thread]
   Must be [num_iteration_in_parallel] = [num_cuda_block] x [num_cuda_thread].
   If all 0, set as the defaults(2500, 10, 250), and at least 2.5GB of GPU global memory is used.
     - [num_iteration_in_parallel]: The number of iterations in parallel.
                                Must have ([num_iteration_in_parallel] x 1 million bytes) of GPU global memory.
     - [num_cuda_block]: The number of CUDA blocks.
     - [num_cuda_thread]: The number of CUDA blocks.
5000 5 1000

[verbose] Optional verbosity flag(0/1) for more output. 0(false) is the default.
0

Start the permutation testing.
Performing 19 statistical tests on the original data.
Performing 10,000 iterations(shuffling + 18 statisitcal tests) in parallel on the GPU.
**cudaMemcpy failed.**
No need to perform 10,000 compression tests using OpenMP on the CPU.
End the permutation testing.

==> Reject the IID assumption!

nitinsatpute · 2021-09-26T05:41:28Z

Hi again, thank you for your response. But I would like to clarify the following point.
"Because of this problem, even though I have 12GB of global memory, I couldn't get it to work at 10,000 threads." Could you comment on the results in Table 9? It shows numbers for N = 5,000 and N = 10,000 threads.

yeah1kim · 2021-10-26T04:45:01Z

Hi, thank you for your comment.
My previous response described an issue I had been worried about during implementation. Unfortunately, it was the wrong answer.

When N (= the total number of threads) is 5,000 or more, the size (e.g., 5,000,000,000 or 10,000,000,000) had to be entered directly as the size argument of the cudaMalloc call that allocates GPU memory of size dev_Ndata for it to work.
The results in Table 9 of my thesis came from an experiment in which the memory size was entered directly as that argument.

Thanks to your comment, I rechecked my source code and corrected the type casting of dev_Ndata
(gpu_permutation_testing.cu, lines 58 and 59).

As a result, it now works when N is 5000 or more without modifying the source code further.
I uploaded the modified source code.
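
The exact change on those two lines is not shown in this thread. As a hedged illustration of the general kind of fix, the sketch below contrasts casting after the multiplication with casting before it. The function names and parameters are hypothetical, `unsigned int` is assumed to be 32 bits wide, and `std::uint64_t` stands in for the `size_t` that `cudaMalloc` takes on a 64-bit host.

```cpp
#include <cstdint>

// Cast applied too late: the product is evaluated in 32-bit unsigned
// arithmetic and has already wrapped before it is widened.
std::uint64_t bytes_cast_after(unsigned int n_threads, unsigned int n_samples) {
    return static_cast<std::uint64_t>(n_threads * n_samples);
}

// Cast applied first: one operand is widened, so the product is 64-bit.
std::uint64_t bytes_cast_before(unsigned int n_threads, unsigned int n_samples) {
    return static_cast<std::uint64_t>(n_threads) * n_samples;
}

// Start of one thread's private slice inside the large device buffer.
// The same widening is needed here, or high thread ids map to wrong offsets.
std::uint64_t slice_offset(unsigned int thread_id, unsigned int n_samples) {
    return static_cast<std::uint64_t>(thread_id) * n_samples;
}
```

For 10,000 threads and 1,000,000 samples, `bytes_cast_before` yields 10,000,000,000 while `bytes_cast_after` yields 1,410,065,408. That matches the observation that passing the size as a literal worked while the computed value did not.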

Thanks again for your comment.
