[FEA] Fully implement multiply and divide for decimal128 #3974
Comments
Removing from 22.02 until we have a clearer customer need.
Multiply looks relatively simple to implement. Divide, however, is much more difficult. Sadly, it looks like in the worst case we are going to need more than 256 bits.
For divide we need the large intermediate values because we have to shift the dividend/lhs left to make enough room so the output gets the precision/scale it needs, plus 1. The +1 is so we can round the result afterwards to match Spark. In the worst case the output is (38, 38), the dividend/lhs is (38, 0), and the divisor/rhs is (38, 38). This is highly contrived and unrealistic, but possible, so if we can support it we should try to. This would mean shifting the lhs to have a scale of 77 = 38 + 38 + 1 and a precision of 115 = 77 + 38.

spark-rapids/sql-plugin/src/main/scala/org/apache/spark/sql/rapids/arithmetic.scala Lines 851 to 863 in 528c67b
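To make the size bookkeeping concrete, here is a minimal sketch; `DecType` and `divideIntermediate` are made-up names for illustration, not anything in the plugin:

```scala
case class DecType(precision: Int, scale: Int)

// How far the lhs has to be upscaled so the quotient naturally carries
// out.scale digits plus one extra digit for the Spark-style rounding.
def divideIntermediate(lhs: DecType, rhs: DecType, out: DecType): DecType = {
  val newScale = out.scale + rhs.scale + 1
  // Upscaling adds (newScale - lhs.scale) digits of precision
  DecType(lhs.precision + (newScale - lhs.scale), newScale)
}

// Worst case above: out (38, 38), lhs (38, 0), rhs (38, 38)
// divideIntermediate(DecType(38, 0), DecType(38, 38), DecType(38, 38))
//   => DecType(115, 77); 115 decimal digits is roughly 382 bits plus a
//      sign, which is why 384 bits covers everything
```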
We might be able to play some games if we can get the remainder out of the divide algorithm.
With that we would only need a precision of 114 and a scale of 76. But 384 bits can support everything we need, so there really isn't a lot of reason to do it unless it makes the rounding work much, much simpler. I am not sure that what I am proposing actually will work.

For average #6142 we know that the divisor/rhs is a long, so it would end up being a (20, 0), and the output is the input to average with 4 added to the scale, but bound to 38. So worst case we would end up with an output of (38, 38), a dividend/lhs of (38, 34), and a divisor/rhs of (20, 0). This means the lhs would need to be shifted to have a scale of 39 and a precision of 43.

I am not sure if we want to have multiple kernels for different sizes or just one. We probably can start with the hardest one first and then see what kind of a performance boost we get for smaller sizes. The good news is that in either case the rhs stays a smaller value, 128 bits at most, but it could be 64 bits or even 32 bits. This allows us to potentially save a lot of register space when doing the divide. I still need to think about how this might impact the choice of algorithms. 384 bits is still a lot.
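Plugging the average case into the same hypothetical helper reproduces the numbers above:

```scala
// Average case from #6142: out (38, 38), lhs (38, 34), rhs (20, 0)
// newScale = 38 + 0 + 1 = 39; newPrecision = 38 + (39 - 34) = 43
divideIntermediate(DecType(38, 34), DecType(20, 0), DecType(38, 38))
//   => DecType(43, 39), which fits in well under 256 bits
```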
The basics of multiply and divide are in place now.
Is your feature request related to a problem? Please describe.
This is a bit of a crazy idea. I already had a crazy idea about SUM #3944. This is similar but for multiply and divide.
There are algorithms for both multiply and divide that work for large arbitrary precision.
https://en.wikipedia.org/wiki/Multiplication_algorithm
https://en.wikipedia.org/wiki/Division_algorithm
We would also need to write our own arbitrary-precision rounding algorithm, which we could base off of what we and cuDF do already, or try to come up with something new and more efficient.
I think we can adapt these to work with the operations we already have access to on the GPU. If we had to, we could also write custom CUDA kernels that take the inputs, do the operations as 256-bit integer values, and then round the result back.
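As a CPU-side sketch of that compute-wide-then-round-back shape, using `BigInt` as a stand-in for a 256-bit GPU integer type (the function is illustrative, not an existing kernel):

```scala
def multiplyAndRound(lhsUnscaled: BigInt, rhsUnscaled: BigInt,
                     combinedScale: Int, outScale: Int): BigInt = {
  val wide = lhsUnscaled * rhsUnscaled  // would be a 256-bit multiply on the GPU
  val divisor = BigInt(10).pow(combinedScale - outScale)
  val (q, r) = wide /% divisor          // truncating quotient and remainder
  // HALF_UP to match Spark: bump away from zero when the dropped digits
  // are at least half of the divisor
  if (r.abs * 2 >= divisor) q + BigInt(wide.signum) else q
}

// 10.27 * 0.51, both at scale 2: unscaled 1027 * 51 = 52377 at scale 4,
// rounded back to scale 1 => 52, i.e. 5.2
// multiplyAndRound(BigInt(1027), BigInt(51), 4, 1) == BigInt(52)
```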
This is not as fully thought through as the SUM operation is, because there are a number of possible algorithms. I am no expert in this, so I think we could start with something simpler that we know we can implement, even if it is not the most memory/compute efficient, and then we can refine it later if needed.
Let's take multiply as an example, and to keep things simple let's say we can only support decimal values up to a precision of 4. In this example we will do Decimal(4, 2) * Decimal(4, 2) and get an output of Decimal(4, 1) (trying to emulate what Spark does with overflow). We will do the math for 10.27 * 0.51.
The Spark way.
10.27 * 0.51 => 5.2377
5.2377 => 5.2
if (999.9 >= 5.2 >= -999.9) 5.2 else null
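For reference, the same steps with `java.math.BigDecimal` (which Spark's `Decimal` wraps), mostly to pin down the rounding mode:

```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

val product = new JBigDecimal("10.27").multiply(new JBigDecimal("0.51")) // 5.2377
val rounded = product.setScale(1, RoundingMode.HALF_UP)                  // 5.2
val bound   = new JBigDecimal("999.9")
// Overflow check: null on overflow, like Spark in non-ANSI mode
val result  = if (rounded.abs.compareTo(bound) <= 0) rounded else null
```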
For us, we don't have arbitrary precision, so we can do a long multiplication with some adds and existing multiply operations.
10.27 => 1027 => 10 and 27
0.51 => 51 => 00 and 51
00 * 10 = 0000 * 10^(4 - 4) // the 4 comes from adding the digit offsets of the two chunks (each high chunk is shifted left by two digits). The -4 is because that is the resulting scale we would see after doing the original multiply
00 * 27 = 0000 * 10^(2 - 4)
10 * 51 = 0510 * 10^(2 - 4)
27 * 51 = 1377 * 10^(0 - 4)
999.9 >= 0000 >= -999.9
999.9 >= 00.00 >= -999.9
999.9 >= 05.10 >= -999.9
999.9 >= .1377 >= -999.9
If any of them would overflow we know that the final result would also overflow.
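Here is the decomposition above as a small sketch, with 2-digit decimal chunks standing in for the 32-bit or 64-bit words a real kernel would use:

```scala
// 10.27 => unscaled 1027 => chunks 10 (high) and 27 (low)
// 0.51  => unscaled 0051 => chunks 00 (high) and 51 (low)
val (lhsHi, lhsLo) = (10, 27)
val (rhsHi, rhsLo) = (0, 51)

// Partial products with their power-of-ten offsets, before the 10^-4 scale
val partials = Seq(
  (rhsHi * lhsHi, 4), // 0000 * 10^4
  (rhsHi * lhsLo, 2), // 0000 * 10^2
  (lhsHi * rhsLo, 2), // 0510 * 10^2
  (lhsLo * rhsLo, 0)) // 1377 * 10^0

val unscaled = partials.map { case (v, off) => BigInt(v) * BigInt(10).pow(off) }.sum
// unscaled == 52377, i.e. 5.2377 at scale 4, matching the Spark way above
```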
Add the results and round/truncate as we go. We start by adding together all of the numbers that have values below the desired output scale. I think with how Spark works we can just truncate the lower number to the precision we need before rounding, but I am not 100% sure on that; I will need to do some experimentation to see.
.1377 truncate to scale 2 => .13
.13 + 00.00 + 05.10 => 05.23 round to the final precision
05.2
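The same step on unscaled values, again only a CPU-side sketch:

```scala
val low = BigInt(1377)      // .1377 at scale 4
val truncated = low / 100   // .13 at scale 2 (truncate, do not round)
val mid = BigInt(510)       // 05.10 at scale 2; the 00.00 term adds nothing
val sum = truncated + mid   // 05.23 at scale 2
// HALF_UP round from scale 2 down to the output scale 1: 523 => 52, i.e. 05.2
val roundedLow = (sum + 5 * sum.signum) / 10
```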
Move the remaining high-order values to the final output scale and add them in. The rounding could have carried into another digit, taking the result to the maximum precision. But because we know that a decimal128 has enough space to hold this without really overflowing, we don't need to worry about it: we already checked the ranges ahead of time.
0000 + 05.2 => 5.2
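And the final assembly on unscaled values:

```scala
val hi = BigInt(0)                // 0000 from 00 * 10, at scale 0
val result = hi * 10 + roundedLow // rescale to scale 1 and add: unscaled 52 => 5.2
```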
This is already long enough, but I think we can do something similar for divide, and I think we can also reduce the complexity of multiply with a little work.