CUDA implementation of cell-linked-list nearest neighbour search algorithm on a uniform grid.
Code : neigbour.cu
Cite as : D. Tavker, Parallel Neighbour Search Implementation, https://github.com/deeptavker/piNNS (2018).
The neighbour module is intended as a base module for implementing sub-routines in MPS and SPH codes. The serial implementation follows the cell-linked-list approach: particles are allocated to a uniform cell grid, after which the neighbour search for each particle examines only the particles in the neighbouring cells (9 in 2D, 27 in 3D). The parallel implementation follows Simon Green's paper Particle Simulation using CUDA. It is written in C++ CUDA and uses the CUDA Thrust library for the radix sort operation. There are two different parallel neighbour search functions available, to be chosen depending on the global memory consumption and the memory transfer capability of the connector bus. A serial implementation is also available.
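As an illustration of the uniform-grid idea, the sketch below maps a particle's integer coordinates to a flat cell number, assuming the cell edge length equals the interaction radius `re`; the helper name and the exact indexing convention are illustrative and may differ from neigbour.cu.

```cpp
// Illustrative sketch (assumption): flat cell index for a particle on a uniform
// grid with ncx x ncy cells per z-slab and cell edge length re.
inline int cellIndex(int x, int y, int z,
                     int Xmin, int Ymin, int Zmin,
                     int re, int ncx, int ncy)
{
    int ci = (x - Xmin) / re;              // cell coordinate along x
    int cj = (y - Ymin) / re;              // cell coordinate along y
    int ck = (z - Zmin) / re;              // cell coordinate along z
    return ci + cj * ncx + ck * ncx * ncy; // flattened cell number
}
```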
There are three functions available for use: `NEIGHBOR()`, which is serial, and `neighbour_cuda_1()` and `neighbour_cuda_2()`, which are both GPU-parallelised.
The required arguments are:

- `int*` - `x`, `y`, `z`
- `int` - `Xmax`, `Xmin`, `Ymax`, `Ymin`, `Zmax`, `Zmin`, `re`, `DELTA`, `NUM`, `MAX_NEIGHB`
- `int**` - `neighb`
The function modifies the `neighb` array, which is passed by reference in the arguments, and populates it with the particle IDs of the neighbours and the number of neighbours. For any particle ID `i`, the number of neighbours can be found by querying `neighb[i][1]`, and all the neighbour particle IDs can be iterated over as follows:
```cpp
for(int j = 0; j < neighb[i][1]; j++){
    int id = neighb[i][j+2]; // particle ID of the j-th neighbour of particle i
}
```
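A possible call site for the serial function is sketched below; the exact parameter order of `NEIGHBOR()` and the `NUM x (MAX_NEIGHB + 2)` layout of `neighb` are assumptions based on the argument list and indexing described above, not taken from neigbour.cu.

```cpp
// Sketch only: allocate neighb and call the serial search.
// The argument order shown here is assumed.
int** neighb = new int*[NUM];
for(int i = 0; i < NUM; i++){
    neighb[i] = new int[MAX_NEIGHB + 2]; // [1] = neighbour count, [2..] = neighbour IDs
}
NEIGHBOR(x, y, z, Xmax, Xmin, Ymax, Ymin, Zmax, Zmin, re, DELTA, NUM, MAX_NEIGHB, neighb);
```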
The required arguments are:

- `int*` - `x`, `y`, `z`, `particleHash`, `particleID`, `cellStart`, `cellEnd`
- `int` - `Xmax`, `Xmin`, `Ymax`, `Ymin`, `Zmax`, `Zmin`, `re`, `DELTA`, `NUM`, `MAX_NEIGHB`
This function modifies `particleHash`, `particleID`, `cellStart`, and `cellEnd`, just like the previous function modifies `neighb`, but these four arrays combined take less space in global memory than the `neighb` array and hence constitute a preferable mode of implementation. Together they occupy O(2N + 2M) space, where N is the total number of particles and M is the total number of cells, whereas `neighb` takes up O(N * MAX_NEIGHB) space.
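For illustration, a minimal allocation sketch is given below; the sizes follow the O(2N + 2M) estimate above (per-particle arrays of length N, per-cell arrays of length M). The exact sizes, conventions, and ownership in neigbour.cu may differ.

```cpp
// Sketch (assumption): host-side allocation of the four auxiliary arrays.
// N = number of particles, M = number of cells in the uniform grid.
int* particleHash = new int[N]; // cell number of each particle
int* particleID   = new int[N]; // particle index, reordered alongside the sorted hashes
int* cellStart    = new int[M]; // first index in the sorted arrays for each cell
int* cellEnd      = new int[M]; // end of each cell's range (convention as in neigbour.cu)
```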
The neighbours in this case can be looped over for all particles like so (a sketch follows the list):

- For an index `i`, find the particle ID and the particle's cell number from `particleID` and `particleHash`.
- Find the coordinates of the cell in terms of `i`, `j`, and `k`. Here we use `Cnum = (i-1) + (j-1)*ncx + (k-1)*ncx*ncy`.
- Find the neighbouring cell numbers and iterate over the particles in those cells using `cellStart`, `cellEnd`, and `particleID`. `cellStart` and `cellEnd` are already populated according to `particleHash`, which is key-sorted with `particleID` reordered alongside it.
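The function below sketches this procedure on the host for a single query particle. The grid-size names `ncx`, `ncy`, `ncz` and the exclusive convention for `cellEnd` are assumptions; the real implementation performs this per thread on the GPU.

```cpp
// Sketch (assumptions noted above): visit the neighbours of the particle in
// cell (ci, cj, ck), with cell coordinates numbered from 1 as in the text.
void gatherNeighbours(int ci, int cj, int ck,
                      int ncx, int ncy, int ncz,
                      const int* cellStart, const int* cellEnd,
                      const int* particleID)
{
    for(int dk = -1; dk <= 1; dk++){
        for(int dj = -1; dj <= 1; dj++){
            for(int di = -1; di <= 1; di++){
                int ni = ci + di, nj = cj + dj, nk = ck + dk;
                if(ni < 1 || ni > ncx || nj < 1 || nj > ncy || nk < 1 || nk > ncz) continue;
                int Cnum = (ni-1) + (nj-1)*ncx + (nk-1)*ncx*ncy; // neighbouring cell number
                // particles of cell Cnum occupy a contiguous block of the sorted arrays
                for(int s = cellStart[Cnum]; s < cellEnd[Cnum]; s++){
                    int candidate = particleID[s]; // candidate neighbour particle ID
                    (void)candidate;               // ...distance check against re goes here...
                }
            }
        }
    }
}
```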
All the tests are run on a 64-bit Linux system with a GeForce GT 730 NVIDIA GPU and 16 GB RAM, on an Intel Core i7-7700 @ 3.60GHz x 8 processor.
The time study is done using a separate code which generates random particles in a domain and then runs both the CUDA NNS and the serial NNS on that domain. The resulting `neighb` arrays are compared for the number of neighbours and the particle IDs. As of now, all the tests are passing. The timing is done using the `chrono` library in C++. The functions used are `neighbour_cuda_1()` and `NEIGHBOUR_serial()`. The time is averaged over three trials.
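As a minimal sketch of the timing approach (assuming the test harness simply brackets each call; the actual test code may differ):

```cpp
#include <chrono>
#include <cstdio>

int main(){
    // Sketch: time one neighbour-search call with std::chrono.
    auto t0 = std::chrono::high_resolution_clock::now();
    // neighbour_cuda_1(...);   // or NEIGHBOUR_serial(...), called here in the real test
    auto t1 = std::chrono::high_resolution_clock::now();
    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("NNS time: %.3f ms\n", ms);
    return 0;
}
```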
The number of threads per block is currently set to 512 via the `THREADS_PER_BLOCK` definition, and the maximum number of neighbours is restricted to 1500 via the `MAX_NEIGHB` variable. Both of these can be changed if required.
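For reference, a typical launch configuration derived from such a definition looks like the following; the kernel name `searchKernel` and the launch details are illustrative assumptions, not the ones in neigbour.cu.

```cuda
#define THREADS_PER_BLOCK 512

// Sketch: standard CUDA launch-size calculation for NUM particles.
int blocks = (NUM + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK; // ceil(NUM / 512)
searchKernel<<<blocks, THREADS_PER_BLOCK>>>(/* device pointers ... */);
```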
The first plot depicts the time taken for the neighbour search and the allocation of data to the `neighb` variable by both functions. The second plot depicts the factor by which the CUDA code is faster than the serial code. As can be seen, the CUDA code can achieve more than a 200x speedup for a large number of particles (~50000) with this particular NNS algorithm.
For a case of a 3D landslide, the overall speedup is over 1.45x, which is not much less than the theoretical limit of 1.66x: if the neighbour search consumes 40% of the computation time and the GPU essentially blazes through the search, Amdahl's law gives a maximum overall speedup of 1/(1 - 0.4) ≈ 1.66x. For this test, the values of `MAX_NEIGHB` and `THREADS_PER_BLOCK` were set to 1500 and 512 respectively, and the number of particles was 264815.
The plots are for different numbers of particles. Throughout these tests, `MAX_NEIGHB` is set to 1500. The value of this parameter directly dictates the size of the `neighb` array. These tests show that, in the GPU case, memory considerations become critical to the speedup because of the device-to-host transfers of the `neighb` array.
The following plots depict curves in which `MAX_NEIGHB` is varied while the number of particles is kept constant. The time taken for NNS is recorded for the parallel and serial neighbour search functions, and the speedup is compared. As can be seen, the growing memory traffic (for the `neighb` array) is handled quite easily by the CPU, but for the GPU code the larger device-to-host transfers slow it down noticeably. This demonstrates the throughput optimisation of GPUs as compared to the latency optimisation of CPUs.