
[jvm-packages] Fix deterministic partitioning with dataset containing Double.NaN #5996

Merged: 1 commit merged into master from fix/determistic_partitioning_mv on Aug 19, 2020

Conversation

@Totoketchup (Contributor) commented Aug 10, 2020

The functions featureValueOfSparseVector and featureValueOfDenseVector could return Float.NaN if the input vector contained any missing values. This made the partition key computation fail: most of the rows ended up in the same partition, leading to memory issues on a single executor. We fixed this by not returning a NaN and simply using the row's hashCode in that case.
We added a test to ensure that the repartitioning is now uniform on an input dataset containing missing values, by checking that the variance of the partition sizes stays below a certain threshold.
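
As a minimal Scala sketch of the idea above (illustrative only, not the actual xgboost4j-spark code; rowHashCode, featureValue, and numWorkers stand in for the values used in the real implementation): a NaN feature value propagates through the arithmetic and NaN.toInt is 0, so every affected row would land in the same partition, while guarding on isNaN and falling back to the row's hash code keeps the keys spread across workers.

```scala
// Illustrative sketch only, not the xgboost4j-spark implementation.
def partitionKey(rowHashCode: Int, featureValue: Float, numWorkers: Int): Int = {
  if (featureValue.isNaN) {
    // Fallback: the row's own hash code still spreads rows across workers.
    math.abs(rowHashCode % numWorkers)
  } else {
    // Without the guard, a NaN featureValue would keep the whole expression
    // NaN, and NaN.toInt is 0, collapsing all such rows into partition 0.
    math.abs(((rowHashCode.toLong + featureValue) % numWorkers).toInt)
  }
}
```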

Another remark about the computation of the key:

  • If the values contained in your DataFrame are all very small, say below 1e-4 / 1e-5, then the computation
    rowHashCode.toLong + featureValue is effectively equivalent to rowHashCode.toLong. Wouldn't it be better to make it more random by computing featureValue.toString().hashCode() instead? It would even fix what I did here too (see the short illustration after this remark). WDYT?
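
A quick illustration of that point (hypothetical values, Scala REPL style): the Long + Float sum is dominated by the hash code when the feature value is tiny, while hashing the string form of the value still varies per value.

```scala
val rowHashCode = 123456789
// The tiny feature value is lost to floating-point rounding: both sums are equal.
val a = rowHashCode.toLong + 1e-5f   // 1.23456792E8
val b = rowHashCode.toLong + 2e-5f   // 1.23456792E8, same as a
// Hashing the string representation still distinguishes the two values.
val ha = 1e-5f.toString.hashCode     // hash of "1.0E-5"
val hb = 2e-5f.toString.hashCode     // hash of "2.0E-5", differs from ha
```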

Signed-off-by: Anthony D'Amato <anthony.damato@hotmail.fr>
@Totoketchup force-pushed the fix/determistic_partitioning_mv branch from 65c81d7 to 8dd1d5c on August 10, 2020 07:00
@codecov-commenter commented Aug 10, 2020

Codecov Report

Merging #5996 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #5996   +/-   ##
=======================================
  Coverage   78.52%   78.52%           
=======================================
  Files          12       12           
  Lines        3013     3013           
=======================================
  Hits         2366     2366           
  Misses        647      647           

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0b2a26f...8dd1d5c. Read the comment docs.

@Totoketchup (Contributor, Author) commented:

Hello @CodingCat, I see that you are the one who designed the deterministic repartitioning for XGBoost, looking at #4786.

I have a question about the computation of the partition key:

Why wouldn't math.abs(row.hashCode) % numWorkers be enough to compute the key? Looking at the implementation of Row.hashCode (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala), we can see that it already uses the values inside the row to compute the hash. What does extracting a feature from the Row and adding it to the key computation bring?
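
For reference, a minimal sketch of the simpler scheme this question describes, keying purely on the Row hash (illustrative only; the local session, column names, and numWorkers are placeholders, not the xgboost4j-spark code):

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

// Key each row by the hash of the whole Row, which already folds in every
// column value, then repartition on that key.
val spark = SparkSession.builder().master("local[*]").appName("row-hash-key").getOrCreate()
import spark.implicits._

val numWorkers = 4
val df = Seq((1.0, 2.0), (3.0, Double.NaN), (5.0, 6.0)).toDF("f0", "f1")

val keyed = df.rdd
  .map(row => (math.abs(row.hashCode % numWorkers), row))
  .partitionBy(new HashPartitioner(numWorkers))
```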

@CodingCat CodingCat changed the title Fix deterministic partitioning with dataset containing Double.NaN [jvm-packagegs] Fix deterministic partitioning with dataset containing Double.NaN Aug 13, 2020
@CodingCat (Member) left a comment

LGTM, thanks! The failed test seems unrelated; restarted the test.

@hcho3 (Collaborator) commented Aug 14, 2020

I'm looking at the failing tests now: #6014

@Totoketchup Totoketchup changed the title [jvm-packagegs] Fix deterministic partitioning with dataset containing Double.NaN [jvm-packages] Fix deterministic partitioning with dataset containing Double.NaN Aug 16, 2020
@hcho3 (Collaborator) commented Aug 17, 2020

Restarting the tests.
