Possible fix for the GPU transpose utility #1996

viclafargue · 2021-07-23T15:48:36Z

I identified a bug in the GPU transpose utility and could create an attempt to fix it. The bug is visible when using the GPU approximate nearest neighbors method IVFPQ and is reproducible with the following code (requires cuML) :

import cupy as cp
from cuml.neighbors import NearestNeighbors

n_dims = 2
n_neighbors = 10
n_samples = 65_536
n_unknown = 10_000

index = cp.random.random((n_samples, n_dims))
query = cp.random.random((n_unknown, n_dims))

knn_model = NearestNeighbors(n_neighbors=n_neighbors,  algorithm='ivfpq')
knn_model.fit(index)
nearest_neighbors = knn_model.kneighbors(query)

This code throws the following error:
Faiss assertion 'err__ == cudaSuccess' failed in void faiss::gpu::runTransposeAny(faiss::gpu::Tensor<OtherT, OtherDim, true, int, faiss::gpu::traits::DefaultPtrTraits>&, int, int, faiss::gpu::Tensor<OtherT, OtherDim, true, int, faiss::gpu::traits::DefaultPtrTraits>&, cudaStream_t) [with T = float; int Dim = 3; cudaStream_t = CUstream_st*] at <...>/faiss/faiss/gpu/utils/Transpose.cuh:207; details: CUDA error 9 invalid configuration argument Aborted (core dumped)

The problem appears when the number of samples in the index is above 65535. The issue seems to stem from the fact that in preparation of the launch of the transposeOuter CUDA kernel, the y dimension of the grid of thread blocks is parametrized with the number of rows/samples. Indeed, this causes a problem as the maximum y-, or z-dimension of a grid of thread blocks is 65535 (see CUDA compute compatibility technical specifications). In my understanding, this is what throws a CUDA error 9 invalid configuration argument during the launch of the CUDA kernel that follows.

Here is the part of the code that is concerned by the problem:

faiss/faiss/gpu/utils/Transpose.cuh

Lines 160 to 175 in 7c2d238

    
           auto grid = dim3(in.getSize(1), in.getSize(0)); 
        
           int block = (innerSize < maxThreads) ? innerSize : maxThreads; 
        
           if (totalSize <= (size_t) std::numeric_limits<int>::max()) { 
        
             transposeOuter<T, int32_t><<<grid, block, 0, stream>>>(in.data(), 
        
                                                                    out.data(), 
        
                                                                    in.getSize(0), 
        
                                                                    in.getSize(1), 
        
                                                                    innerSize); 
        
           } else { 
        
             transposeOuter<T, int64_t><<<grid, block, 0, stream>>>(in.data(), 
        
                                                                    out.data(), 
        
                                                                    in.getSize(0), 
        
                                                                    in.getSize(1), 
        
                                                                    innerSize); 
        
           }

Tagging issues:
rapidsai/cuml#4020
#1771
#1835

facebook-github-bot · 2021-07-23T15:48:40Z

Hi @viclafargue!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

wickedfoo · 2021-07-24T19:20:16Z

Thanks for finding this!

Flipping the grid indices doesn't really fix the issue though, it would come up in other circumstances. We'd need to cap the y block dimension and add a loop to the kernel I think.

viclafargue · 2021-07-26T08:49:47Z

Thanks for finding this!

Flipping the grid indices doesn't really fix the issue though, it would come up in other circumstances. We'd need to cap the y block dimension and add a loop to the kernel I think.

Yes indeed, it looks like the issue could come back if somehow in.getSize(1) > 65535 among others. This is just a little attempt to find a fix, but it's probably better that the people who really master this part of the code write good one.

Summary: This is a fix for facebookresearch#1996 namely, that large transposition jobs (where one of the dimensions is > 65535) will still work by performing a loop over the gridDim.y Differential Revision: D31481494 fbshipit-source-id: f8e67502322dae532b4d13018ff85cc6a589bd69

wickedfoo · 2021-10-07T21:01:49Z

fix is in #2075

Summary: Pull Request resolved: #2075 This is a fix for #1996 namely, that large transposition jobs (where one of the dimensions is > 65535) will still work by performing a loop over the gridDim.y Reviewed By: mdouze Differential Revision: D31481494 fbshipit-source-id: af35af36cce27ce7d44128d95cc229dd5c4b4b56

wickedfoo · 2021-10-12T03:59:27Z

should be fixed now

Fix for transpose utility

850baf5

mdouze added the GPU label Jul 30, 2021

wickedfoo mentioned this pull request Oct 7, 2021

Faiss GPU large transpose fix #2075

Closed

wickedfoo closed this Oct 12, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Possible fix for the GPU transpose utility #1996

Possible fix for the GPU transpose utility #1996

viclafargue commented Jul 23, 2021

facebook-github-bot commented Jul 23, 2021

wickedfoo commented Jul 24, 2021

viclafargue commented Jul 26, 2021

wickedfoo commented Oct 7, 2021

wickedfoo commented Oct 12, 2021

	auto grid = dim3(in.getSize(1), in.getSize(0));
	int block = (innerSize < maxThreads) ? innerSize : maxThreads;

	if (totalSize <= (size_t) std::numeric_limits<int>::max()) {
	transposeOuter<T, int32_t><<<grid, block, 0, stream>>>(in.data(),
	out.data(),
	in.getSize(0),
	in.getSize(1),
	innerSize);
	} else {
	transposeOuter<T, int64_t><<<grid, block, 0, stream>>>(in.data(),
	out.data(),
	in.getSize(0),
	in.getSize(1),
	innerSize);
	}

Possible fix for the GPU transpose utility #1996

Possible fix for the GPU transpose utility #1996

Conversation

viclafargue commented Jul 23, 2021

facebook-github-bot commented Jul 23, 2021

Action Required

Process

wickedfoo commented Jul 24, 2021

viclafargue commented Jul 26, 2021

wickedfoo commented Oct 7, 2021

wickedfoo commented Oct 12, 2021