Is your feature request related to a problem? Please describe.
I have a bit of a crazy idea.
Background: Sum in Spark also lets us do average, so this can end up applying a little more generally. One of the main reasons we cannot fully support sum on all decimal128 values is because Spark will do the sum as a BigDecimal value. Then at different points in the process it will check if that BigDecimal is too large (overflow checking). This means that no matter how many values are in an aggregation Spark is guaranteed to be able to detect the overflow and do the right thing*. Except Spark has an optimization where if the output size is small enough to fit in a Long it will do the aggregation as a long instead. This speeds things up a lot and everyone is happy. However, this means that there is a range of values where Spark can no longer detect overflow. It would take just over 97-million max or min values being summed to overflow. So generally we have taken the approach that if we can detect overflow for 100-million values then we can run it on the GPU without issues. That results in us not being able to detect overflow on anything with a precision higher than 28, so we fall back to the CPU (the max precision of decimal128 is 38).
So in the worst case we would need to be able to do a sum with a precision of 49 to detect overflow on a Decimal value with a precision of 38. But we should be able to do that if we split the sum into two separate parts, each of which can handle summing 100-million entries without overflowing.
If we want to do a sum on a value with a precision of 29, we can split this into two separate sums: one with a precision of 1 (the higher order data) and another with a precision of 28 (the lower order data). Then we add the results back together at the end (checking for overflow on each sub-part first). Similarly, if we wanted to do a sum on a value with a precision of 38, we would split it into a sum with a precision of 10 and another with a precision of 28, then add them back together at the end.
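To make the digit bookkeeping concrete, here is a minimal CPU sketch in Scala using plain BigInteger math (the helper and its names are made up; the real thing would be columnar cudf operations):

```scala
import java.math.BigInteger

// CPU sketch of the proposed split (all names here are made up; the real
// version would be columnar cudf work).  A decimal value is carried as its
// unscaled integer: peeling off the low 28 digits gives a high part with at
// most 10 digits (for precision 38 input) and a low part with at most 28
// digits.  Summing 100-million (10^8) rows then needs at most 18 and 36
// digits respectively, so both partial sums still fit in 38 digits and
// overflow stays detectable.
object SplitUnscaled {
  private val LowMod = BigInteger.TEN.pow(28) // 10^28

  def split(unscaled: BigInteger): (BigInteger, BigInteger) = {
    val qr = unscaled.divideAndRemainder(LowMod)
    (qr(0), qr(1)) // unscaled == qr(0) * 10^28 + qr(1), both truncated toward zero
  }
}
```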
To get the higher order data we do a cudf cast of the input to a scale of (scale - 28), using the Spark definition of scale; we should also be smart about it and figure out if we can drop the precision too. We do a cudf cast instead of a Spark cast because Spark rounds on casting, and we just want to truncate the result. To get the lower order data we keep the same scale/precision as before, but subtract the higher order data from the original value. This might involve casting things back to the same precision/scale, but it should work.
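As a rough CPU model of those two steps (again just a sketch with java.math.BigDecimal, assuming the cudf cast truncates toward zero):

```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

// CPU model of the truncating cast plus subtraction described above (helper
// names are made up; the real version would be cudf column casts).
// setScale with RoundingMode.DOWN truncates toward zero, standing in for the
// cudf cast -- whether cudf truncates toward zero or floors should be verified.
object SplitByScale {
  def split(value: JBigDecimal): (JBigDecimal, JBigDecimal) = {
    val high = value.setScale(value.scale - 28, RoundingMode.DOWN) // truncating cast
    val low  = value.subtract(high)                                // keeps the original scale
    (high, low)                                                    // value == high + low exactly
  }
}
```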
Once we have done the sum on the separate parts we need to check for overflow. I don't 100% know the math for this, but I think if we cudf cast the lower order sum to the same scale as the higher order sum, and add the higher order and casted lower order sums together, we can check the higher order bytes to know if the output would fit. Then we can cast the higher order data to the original scale and add it to the lower order data and do a final check for overflow. The final result overflows if either the higher order check or the lower order check detects an overflow.
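Here is roughly what that recombine-and-check step could look like, again only as a CPU sketch with BigDecimal. The helper is hypothetical, resultPrecision stands in for whatever precision Spark declares for the sum result (e.g. 38), and the exact boundary handling still needs to be worked out:

```scala
import java.math.{BigDecimal => JBigDecimal, BigInteger, RoundingMode}

// CPU model of the recombine-and-check step (hypothetical helper; on the GPU
// these would be columnar ops, and the first check is what tells us whether
// forming the full 128-bit result would silently wrap).
object RecombineSum {
  def recombine(highSum: JBigDecimal, lowSum: JBigDecimal,
                resultPrecision: Int): Option[JBigDecimal] = {
    // 1. Truncating-cast the low sum to the high sum's scale, add it to the
    //    high sum, and check that the combined high digits fit in the digits
    //    the result has above the low 28.
    val carriedLow   = lowSum.setScale(highSum.scale, RoundingMode.DOWN)
    val combinedHigh = highSum.add(carriedLow)
    val highFits = combinedHigh.unscaledValue.abs
      .compareTo(BigInteger.TEN.pow(resultPrecision - 28)) < 0

    // 2. Cast the high sum back to the original scale, add the low sum, and
    //    do the final precision check on the combined result.
    val total = highSum.setScale(lowSum.scale).add(lowSum)
    val totalFits = total.unscaledValue.abs
      .compareTo(BigInteger.TEN.pow(resultPrecision)) < 0

    // Overflow if either check fails (exact boundary behavior still TBD).
    if (highFits && totalFits) Some(total) else None
  }
}
```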
This is likely to use a lot more memory and slow down processing. But it will mean that after this we only have to worry about multiply and divide for decimal128 on the GPU.