[FEA] Fully implement multiply and divide for decimal128 #3974
Comments
Removing from 22.02 until we have a clearer customer need.
Multiply looks relatively simple to implement. Divide, however, is much more difficult. Sadly, it looks like in the worst case we are going to need more than 256 bits.
For divide we need the large intermediate values because we have to shift the dividend/lhs left to make enough room so the output gets the precision/scale it needs, plus 1. The +1 is so we can round the result afterwards to match Spark. In the worst case the output is (38, 38), the dividend/lhs is (38, 0), and the divisor/rhs is (38, 38). This is highly contrived and unrealistic, but possible, so if we can support it we should try to. This would mean shifting the lhs to have a scale of 77 = 38 + 38 + 1 and a precision of 115 = 77 + 38.

spark-rapids/sql-plugin/src/main/scala/org/apache/spark/sql/rapids/arithmetic.scala Lines 851 to 863 in 528c67b
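To make the size bookkeeping concrete, here is a minimal sketch; `DecType` and `divideIntermediate` are made-up names for illustration, not anything in the plugin:

```scala
case class DecType(precision: Int, scale: Int)

// How far the lhs has to be upscaled so the quotient naturally carries
// out.scale digits plus one extra digit for the Spark-style rounding.
def divideIntermediate(lhs: DecType, rhs: DecType, out: DecType): DecType = {
  val newScale = out.scale + rhs.scale + 1
  // Upscaling adds (newScale - lhs.scale) digits of precision
  DecType(lhs.precision + (newScale - lhs.scale), newScale)
}

// Worst case above: out (38, 38), lhs (38, 0), rhs (38, 38)
// divideIntermediate(DecType(38, 0), DecType(38, 38), DecType(38, 38))
//   => DecType(115, 77); 115 decimal digits is roughly 382 bits plus a
//      sign, which is why 384 bits covers everything
```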
We might be able to play some games if we can get the remainder out of the divide algorithm.
With that we would only need a precision of 114 and a scale of 76. But 384 bits can support everything we need, so there really isn't a lot of reason to do it unless it makes the rounding work much, much simpler. I am not sure that what I am proposing actually will work.

For average #6142 we know that the divisor/rhs is a long, so it would end up being a (20, 0), and the output is the input to average with 4 added to the scale, but bound to 38. So worst case we would end up with an output of (38, 38), a dividend/lhs of (38, 34), and a divisor/rhs of (20, 0). This means the lhs would need to be shifted to have a scale of 39 and a precision of 43.

I am not sure if we want to have multiple kernels for different sizes or just one. We probably can start with the hardest one first and then see what kind of a performance boost we get for smaller sizes. The good news is that in either case the rhs stays a smaller value, 128 bits at most, but it could be 64 bits or even 32 bits. This allows us to potentially save a lot of register space when doing the divide. I still need to think about how this might impact the choice of algorithms. 384 bits is still a lot.
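Plugging the average case into the same hypothetical helper reproduces the numbers above:

```scala
// Average case from #6142: out (38, 38), lhs (38, 34), rhs (20, 0)
// newScale = 38 + 0 + 1 = 39; newPrecision = 38 + (39 - 34) = 43
divideIntermediate(DecType(38, 34), DecType(20, 0), DecType(38, 38))
//   => DecType(43, 39), which fits in well under 256 bits
```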
The basics of multiply and divide are in place now.
Is your feature request related to a problem? Please describe.
This is a bit of a crazy idea. I already had a crazy idea about SUM #3944. This is similar but for multiply and divide.
There are algorithms for both multiply and divide that work for large arbitrary precision.
https://en.wikipedia.org/wiki/Multiplication_algorithm
https://en.wikipedia.org/wiki/Division_algorithm
We would also need to write our own arbitrary-precision rounding algorithm, which we could base off of what we and cuDF do already, or try to come up with something new and more efficient.
I think we can adapt these to work with the operations we already have access to on the GPU. If we had to, we could also write custom CUDA kernels that take the inputs, do the operations as 256-bit integer values, and then round the result back.
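As a CPU-side sketch of that compute-wide-then-round-back shape, using `BigInt` as a stand-in for a 256-bit GPU integer type (the function is illustrative, not an existing kernel):

```scala
def multiplyAndRound(lhsUnscaled: BigInt, rhsUnscaled: BigInt,
                     combinedScale: Int, outScale: Int): BigInt = {
  val wide = lhsUnscaled * rhsUnscaled  // would be a 256-bit multiply on the GPU
  val divisor = BigInt(10).pow(combinedScale - outScale)
  val (q, r) = wide /% divisor          // truncating quotient and remainder
  // HALF_UP to match Spark: bump away from zero when the dropped digits
  // are at least half of the divisor
  if (r.abs * 2 >= divisor) q + BigInt(wide.signum) else q
}

// 10.27 * 0.51, both at scale 2: unscaled 1027 * 51 = 52377 at scale 4,
// rounded back to scale 1 => 52, i.e. 5.2
// multiplyAndRound(BigInt(1027), BigInt(51), 4, 1) == BigInt(52)
```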
This is not as fully thought through as the SUM operation is, because there are a number of possible algorithms. I am no expert in this, so I think we could start with something simpler that we know we can implement, even if it is not the most memory/compute efficient, and then we can refine it later if needed.
Let's take multiply as an example, and to keep things simple let's say we can only support decimal values up to a precision of 4. In this example we will do Decimal(4, 2) * Decimal(4, 2) and get an output of Decimal(4, 1) (trying to emulate what Spark does with overflow). We will do the math for 10.27 * 0.51.
The Spark way.
10.27 * 0.51 => 5.2377
5.2377 => 5.2
if (999.9 >= 5.2 >= -999.9) 5.2 else null
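For reference, the same steps with `java.math.BigDecimal` (which Spark's `Decimal` wraps), mostly to pin down the rounding mode:

```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

val product = new JBigDecimal("10.27").multiply(new JBigDecimal("0.51")) // 5.2377
val rounded = product.setScale(1, RoundingMode.HALF_UP)                  // 5.2
val bound   = new JBigDecimal("999.9")
// Overflow check: null on overflow, like Spark in non-ANSI mode
val result  = if (rounded.abs.compareTo(bound) <= 0) rounded else null
```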
For us, we don't have arbitrary precision, so we can do a long multiplication with some adds and existing multiply operations.
10.27 => 1027 => 10 and 27
0.51 => 51 => 00 and 51
00 * 10 = 0000 * 10^(4 - 4) // the 4 comes from adding the digit offsets of the two chunks (each high chunk is shifted left by two digits). The -4 is because that is the resulting scale we would see after doing the original multiply
00 * 27 = 0000 * 10^(2 - 4)
10 * 51 = 0510 * 10^(2 - 4)
27 * 51 = 1377 * 10^(0 - 4)
999.9 >= 0000 >= -999.9
999.9 >= 00.00 >= -999.9
999.9 >= 05.10 >= -999.9
999.9 >= .1377 >= -999.9
If any of them would overflow we know that the final result would also overflow.
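Here is the decomposition above as a small sketch, with 2-digit decimal chunks standing in for the 32-bit or 64-bit words a real kernel would use:

```scala
// 10.27 => unscaled 1027 => chunks 10 (high) and 27 (low)
// 0.51  => unscaled 0051 => chunks 00 (high) and 51 (low)
val (lhsHi, lhsLo) = (10, 27)
val (rhsHi, rhsLo) = (0, 51)

// Partial products with their power-of-ten offsets, before the 10^-4 scale
val partials = Seq(
  (rhsHi * lhsHi, 4), // 0000 * 10^4
  (rhsHi * lhsLo, 2), // 0000 * 10^2
  (lhsHi * rhsLo, 2), // 0510 * 10^2
  (lhsLo * rhsLo, 0)) // 1377 * 10^0

val unscaled = partials.map { case (v, off) => BigInt(v) * BigInt(10).pow(off) }.sum
// unscaled == 52377, i.e. 5.2377 at scale 4, matching the Spark way above
```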
Add the results and round/truncate as we go. We start by adding together all of the numbers that have values below the desired output scale. I think with how Spark works we can just truncate the lower number to the precision we need before rounding, but I am not 100% sure on that; I will need to do some experimentation to see.
.1377 truncate to scale 2 => .13
.13 + 00.00 + 05.10 => 05.23 round to the final precision
05.2
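The same step on unscaled values, again only a CPU-side sketch:

```scala
val low = BigInt(1377)      // .1377 at scale 4
val truncated = low / 100   // .13 at scale 2 (truncate, do not round)
val mid = BigInt(510)       // 05.10 at scale 2; the 00.00 term adds nothing
val sum = truncated + mid   // 05.23 at scale 2
// HALF_UP round from scale 2 down to the output scale 1: 523 => 52, i.e. 05.2
val roundedLow = (sum + 5 * sum.signum) / 10
```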
Move the remaining high-order values to the final output scale and add them in. The rounding could have carried into another digit, taking the result to the maximum precision. But because we know that a decimal128 has enough space to hold this without really overflowing, we don't need to worry about it: we already checked the ranges ahead of time.
0000 + 05.2 => 5.2
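And the final assembly on unscaled values:

```scala
val hi = BigInt(0)                // 0000 from 00 * 10, at scale 0
val result = hi * 10 + roundedLow // rescale to scale 1 and add: unscaled 52 => 5.2
```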
This is already long enough, but I think we can do something similar for divide, and I think we can also reduce the complexity of multiply with a little work.