Manually perform 8-bit dot product (__dp4a) #2348

Open
alibayeh opened this issue Jul 21, 2024 · 0 comments
Hi,
The candle-kernel crate throws an error on the Maxwell architecture when migrating code that works fine in PyTorch. The only problem appears to be `__dp4a`, which performs four 8-bit integer dot-product operations with accumulation and is used for quantization. It would be great if the kernel crate included a fallback implementation of this dot product for `__CUDA_ARCH__ < 610`.

#if __CUDA_ARCH__ < 610
// Fallback: emulate __dp4a(a, b, c) by extracting the four signed 8-bit
// lanes of a and b, multiplying them pairwise, and accumulating into c.
__device__ int manual_dp4a(int a, int b, int c) {
    int result = c;
    for (int i = 0; i < 4; ++i) {
        // Narrowing to int8_t preserves the sign of each byte, matching __dp4a.
        int8_t a_byte = (a >> (i * 8)) & 0xFF;
        int8_t b_byte = (b >> (i * 8)) & 0xFF;
        result += a_byte * b_byte;
    }
    return result;
}
#endif


#if __CUDA_ARCH__ >= 610
    ...
    sumi = __dp4a(vi0, u[2*i+0], sumi);
    ...
#else
    ...
    sumi = manual_dp4a(vi0, u[2*i+0], sumi);
    ...
#endif