Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible fix for the GPU transpose utility #1996

Closed

Conversation

viclafargue
Copy link

I identified a bug in the GPU transpose utility and could create an attempt to fix it. The bug is visible when using the GPU approximate nearest neighbors method IVFPQ and is reproducible with the following code (requires cuML) :

import cupy as cp
from cuml.neighbors import NearestNeighbors

n_dims = 2
n_neighbors = 10
n_samples = 65_536
n_unknown = 10_000

index = cp.random.random((n_samples, n_dims))
query = cp.random.random((n_unknown, n_dims))

knn_model = NearestNeighbors(n_neighbors=n_neighbors,  algorithm='ivfpq')
knn_model.fit(index)
nearest_neighbors = knn_model.kneighbors(query)

This code throws the following error:
Faiss assertion 'err__ == cudaSuccess' failed in void faiss::gpu::runTransposeAny(faiss::gpu::Tensor<OtherT, OtherDim, true, int, faiss::gpu::traits::DefaultPtrTraits>&, int, int, faiss::gpu::Tensor<OtherT, OtherDim, true, int, faiss::gpu::traits::DefaultPtrTraits>&, cudaStream_t) [with T = float; int Dim = 3; cudaStream_t = CUstream_st*] at <...>/faiss/faiss/gpu/utils/Transpose.cuh:207; details: CUDA error 9 invalid configuration argument Aborted (core dumped)

The problem appears when the number of samples in the index is above 65535. The issue seems to stem from the fact that in preparation of the launch of the transposeOuter CUDA kernel, the y dimension of the grid of thread blocks is parametrized with the number of rows/samples. Indeed, this causes a problem as the maximum y-, or z-dimension of a grid of thread blocks is 65535 (see CUDA compute compatibility technical specifications). In my understanding, this is what throws a CUDA error 9 invalid configuration argument during the launch of the CUDA kernel that follows.

Here is the part of the code that is concerned by the problem:

auto grid = dim3(in.getSize(1), in.getSize(0));
int block = (innerSize < maxThreads) ? innerSize : maxThreads;
if (totalSize <= (size_t) std::numeric_limits<int>::max()) {
transposeOuter<T, int32_t><<<grid, block, 0, stream>>>(in.data(),
out.data(),
in.getSize(0),
in.getSize(1),
innerSize);
} else {
transposeOuter<T, int64_t><<<grid, block, 0, stream>>>(in.data(),
out.data(),
in.getSize(0),
in.getSize(1),
innerSize);
}

Tagging issues:
rapidsai/cuml#4020
#1771
#1835

@facebook-github-bot
Copy link
Contributor

Hi @viclafargue!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@wickedfoo
Copy link
Contributor

Thanks for finding this!

Flipping the grid indices doesn't really fix the issue though, it would come up in other circumstances. We'd need to cap the y block dimension and add a loop to the kernel I think.

@viclafargue
Copy link
Author

Thanks for finding this!

Flipping the grid indices doesn't really fix the issue though, it would come up in other circumstances. We'd need to cap the y block dimension and add a loop to the kernel I think.

Yes indeed, it looks like the issue could come back if somehow in.getSize(1) > 65535 among others. This is just a little attempt to find a fix, but it's probably better that the people who really master this part of the code write good one.

@mdouze mdouze added the GPU label Jul 30, 2021
wickedfoo added a commit to wickedfoo/faiss that referenced this pull request Oct 7, 2021
Summary:
This is a fix for facebookresearch#1996

namely, that large transposition jobs (where one of the dimensions is > 65535) will still work by performing a loop over the gridDim.y

Differential Revision: D31481494

fbshipit-source-id: f8e67502322dae532b4d13018ff85cc6a589bd69
@wickedfoo
Copy link
Contributor

fix is in #2075

facebook-github-bot pushed a commit that referenced this pull request Oct 11, 2021
Summary:
Pull Request resolved: #2075

This is a fix for #1996

namely, that large transposition jobs (where one of the dimensions is > 65535) will still work by performing a loop over the gridDim.y

Reviewed By: mdouze

Differential Revision: D31481494

fbshipit-source-id: af35af36cce27ce7d44128d95cc229dd5c4b4b56
@wickedfoo
Copy link
Contributor

should be fixed now

@wickedfoo wickedfoo closed this Oct 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants