
[jvm-packages] Fix deterministic partitioning with dataset containing Double.NaN #5996

Merged: 1 commit merged into master from fix/determistic_partitioning_mv on Aug 19, 2020

Conversation

@Totoketchup (Contributor) commented Aug 10, 2020

The functions featureValueOfSparseVector and featureValueOfDenseVector could return Float.NaN if the input vector contained any missing values. This made the partition key computation fail: most of the rows ended up in the same partition, leading to memory issues on a single executor. We fixed this by not returning a NaN and simply using the row's hashCode in that case.
We added a test to ensure that the repartitioning is now uniform on an input dataset containing missing values, by checking that the variance of the partition sizes stays below a certain threshold.
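
As a minimal Scala sketch of the idea above (illustrative only, not the actual xgboost4j-spark code; rowHashCode, featureValue, and numWorkers stand in for the values used in the real implementation): a NaN feature value propagates through the arithmetic and NaN.toInt is 0, so every affected row would land in the same partition, while guarding on isNaN and falling back to the row's hash code keeps the keys spread across workers.

```scala
// Illustrative sketch only, not the xgboost4j-spark implementation.
def partitionKey(rowHashCode: Int, featureValue: Float, numWorkers: Int): Int = {
  if (featureValue.isNaN) {
    // Fallback: the row's own hash code still spreads rows across workers.
    math.abs(rowHashCode % numWorkers)
  } else {
    // Without the guard, a NaN featureValue would keep the whole expression
    // NaN, and NaN.toInt is 0, collapsing all such rows into partition 0.
    math.abs(((rowHashCode.toLong + featureValue) % numWorkers).toInt)
  }
}
```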

Another remark about the computation of the key:

  • If the values contained in your DataFrame are all very small, say below 1e-4 / 1e-5, then the computation
    rowHashCode.toLong + featureValue is effectively equivalent to rowHashCode.toLong. Wouldn't it be better to make it more random by computing featureValue.toString().hashCode() instead? It would even fix what I did here too (see the short illustration after this remark). WDYT?
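
A quick illustration of that point (hypothetical values, Scala REPL style): the Long + Float sum is dominated by the hash code when the feature value is tiny, while hashing the string form of the value still varies per value.

```scala
val rowHashCode = 123456789
// The tiny feature value is lost to floating-point rounding: both sums are equal.
val a = rowHashCode.toLong + 1e-5f   // 1.23456792E8
val b = rowHashCode.toLong + 2e-5f   // 1.23456792E8, same as a
// Hashing the string representation still distinguishes the two values.
val ha = 1e-5f.toString.hashCode     // hash of "1.0E-5"
val hb = 2e-5f.toString.hashCode     // hash of "2.0E-5", differs from ha
```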

Signed-off-by: Anthony D'Amato <anthony.damato@hotmail.fr>
@Totoketchup force-pushed the fix/determistic_partitioning_mv branch from 65c81d7 to 8dd1d5c on August 10, 2020 07:00
@codecov-commenter commented Aug 10, 2020

Codecov Report

Merging #5996 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #5996   +/-   ##
=======================================
  Coverage   78.52%   78.52%           
=======================================
  Files          12       12           
  Lines        3013     3013           
=======================================
  Hits         2366     2366           
  Misses        647      647           

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0b2a26f...8dd1d5c. Read the comment docs.

@Totoketchup (Contributor, Author) commented:

Hello @CodingCat, I see that you are the one who designed the deterministic repartitioning for XGBoost, looking at #4786.

I have a question about the computation of the partition key:

Why wouldn't math.abs(row.hashCode) % numWorkers be enough to compute the key? Looking at the implementation of Row.hashCode (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala), we can see that it already uses the values inside the row to compute the hash. What does extracting a feature from the Row and adding it to the key computation bring?
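
For reference, a minimal sketch of the simpler scheme this question describes, keying purely on the Row hash (illustrative only; the local session, column names, and numWorkers are placeholders, not the xgboost4j-spark code):

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

// Key each row by the hash of the whole Row, which already folds in every
// column value, then repartition on that key.
val spark = SparkSession.builder().master("local[*]").appName("row-hash-key").getOrCreate()
import spark.implicits._

val numWorkers = 4
val df = Seq((1.0, 2.0), (3.0, Double.NaN), (5.0, 6.0)).toDF("f0", "f1")

val keyed = df.rdd
  .map(row => (math.abs(row.hashCode % numWorkers), row))
  .partitionBy(new HashPartitioner(numWorkers))
```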

@CodingCat CodingCat changed the title Fix deterministic partitioning with dataset containing Double.NaN [jvm-packagegs] Fix deterministic partitioning with dataset containing Double.NaN Aug 13, 2020
@CodingCat (Member) left a comment

LGTM, thanks! The failed test seems unrelated; restarted the test.

@hcho3 (Collaborator) commented Aug 14, 2020

I'm looking at the failing tests now: #6014

@Totoketchup Totoketchup changed the title [jvm-packagegs] Fix deterministic partitioning with dataset containing Double.NaN [jvm-packages] Fix deterministic partitioning with dataset containing Double.NaN Aug 16, 2020
@hcho3 (Collaborator) commented Aug 17, 2020

Restarting the tests.
