[BREAKING][jvm-packages] fix the non-zero missing value handling #4349

CodingCat · 2019-04-09T18:17:27Z

This is a follow-up PR to ad4de0d.

Spark's VectorAssembler transformer treats only 0 as the missing value. This causes problems when the user treats 0 as a meaningful value and there are enough 0-valued features that VectorAssembler chooses a SparseVector to represent the feature vector, because those 0-valued features have then been filtered out by Spark.

It's fine if the user's DataFrame contains only DenseVectors, since the 0s are kept.

This PR changes the behavior of XGBoost-Spark as follows:

when the user sets a non-zero missing value and we detect a sparse vector, we stop the application to prevent errors.

CodingCat · 2019-04-09T20:13:59Z

@yanboliang the PR has been changed to this

codecov-io · 2019-04-09T21:56:32Z

Codecov Report

Merging #4349 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #4349   +/-   ##
=======================================
  Coverage   67.84%   67.84%           
=======================================
  Files         132      132           
  Lines       12206    12206           
=======================================
  Hits         8281     8281           
  Misses       3925     3925

Last update b72eab3...00a70c7.

superbobry · 2019-04-18T09:34:13Z

I'm sorry, I'm not an active user of XGBoost any more, so I don't feel confident reviewing this. @alois-bissuel, do you want to have a look?

alois-bissuel · 2019-04-18T16:27:48Z

A small comment on the aim of the review, as I did not have enough time to look at the code in detail (and won't have time before next Tuesday).
I might be mistaken, but I think using SparseVector is a way to handle missing values correctly without specifying a special number that signals a missing value.
This is specifically useful when one has only integer (and obviously ordinal) features, since there is no NaN for ints.
I also looked at the code of VectorAssembler in Spark MLlib, and it seems that it correctly handles null features only since Spark v2.4 (see VectorAssembler.scala#L282 in Spark 2.4).
So after this PR, there would be no way of handling missing values other than using a special value (either NaN or one set through the "missing" parameter).

CodingCat · 2019-04-18T17:18:50Z

@alois-bissuel thanks for the review

To clarify the point of the PR and help further review: this PR prevents the following case from happening.

1. The user sets the parameter missing to -1 (or anything other than 0 and NaN) and uses 0 to represent a meaningful feature value.
2. VectorAssembler filters out those 0 values: https://github.com/apache/spark/blob/7a8efc8fe32abb51a5c8803c6afceadf838e46e6/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L262
3. VectorAssembler decides to use a sparse vector to represent that feature vector: https://github.com/apache/spark/blob/branch-2.4/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Vectors.scala#L165-L173
4. We then lose those 0-valued features because of step 3, and XGBoost-Spark also filters out -1.
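
The case above can be sketched in plain Scala without Spark. This is only a toy model: `Vec`, `Dense`, `Sparse`, `assemble`, and `visibleValues` are hypothetical names, and the `1.5 * (nnz + 1) < size` threshold is an assumption about the size heuristic in the linked Vectors.scala, not something confirmed in this thread.

```scala
object ZeroLossSketch {
  // Hypothetical stand-ins for Spark's DenseVector / SparseVector.
  sealed trait Vec { def size: Int }
  final case class Dense(values: Array[Double]) extends Vec {
    def size: Int = values.length
  }
  final case class Sparse(size: Int, indices: Array[Int], values: Array[Double]) extends Vec

  // Models steps 2 and 3: drop zeros, then pick the smaller representation.
  // The threshold is an assumed approximation of Spark's heuristic.
  def assemble(features: Array[Double]): Vec = {
    val nonZero = features.zipWithIndex.filter { case (v, _) => v != 0.0 }
    if (1.5 * (nonZero.length + 1.0) < features.length) {
      Sparse(features.length, nonZero.map(_._2), nonZero.map(_._1))
    } else {
      Dense(features)
    }
  }

  // Models step 4: the (index, value) pairs a consumer sees after dropping
  // `missing`. A sparse vector exposes only its stored entries.
  def visibleValues(v: Vec, missing: Double): Seq[(Int, Double)] = v match {
    case Dense(values) =>
      values.toSeq.zipWithIndex.collect { case (x, i) if x != missing => (i, x) }
    case Sparse(_, indices, values) =>
      indices.toSeq.zip(values.toSeq).filter { case (_, x) => x != missing }
  }
}
```

For a row such as `Array(0.0, 0.0, 0.0, 0.0, -1.0, 2.0)` with missing set to -1, the model picks the sparse form, so only the entry at index 5 survives, while the same row kept dense would still expose its four meaningful zeros.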

CodingCat · 2019-04-25T19:47:26Z

@hcho3 I think the Windows test is broken somehow?

hcho3 · 2019-04-25T20:39:09Z

@CodingCat I’ll take a look

CodingCat · 2019-04-26T18:10:57Z

merged, thanks for the review @alois-bissuel

rongou · 2019-05-15T19:06:27Z

jvm-packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/XGBoost.scala

@@ -107,7 +106,8 @@ object XGBoost extends Serializable {
      removeMissingValues(verifyMissingSetting(xgbLabelPoints, missing),
        missing, (v: Float) => v != missing)
    } else {
-      removeMissingValues(xgbLabelPoints, missing, (v: Float) => !v.isNaN)
+      removeMissingValues(verifyMissingSetting(xgbLabelPoints, missing),


Is this change correct? In the else clause missing.isNaN is true, yet verifyMissingSetting() would throw an exception. Seems like the else clause would always throw an exception. Is that expected?

CodingCat · 2019-05-15

Yes, it's expected.

If you have NaN as the missing value, that means you consider 0 a meaningful value in your dataset. However, VectorAssembler has produced a SparseVector for some of your rows (by filtering out 0s, since VectorAssembler treats only 0 as the missing value).

Instead of proceeding with this incompatibility between XGBoost and VectorAssembler, we have to stop the training to avoid hurting accuracy.

The else clause does not always throw an exception: when every row is a dense vector, training proceeds (because we didn't find any 0s filtered out by VectorAssembler).
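
The behavior described in this reply can be modeled roughly as follows. This is a simplified sketch, not the actual `verifyMissingSetting` from XGBoost.scala: `Point` and `verifyMissing` are hypothetical names, and a row is assumed to be "sparse" whenever it carries explicit indices.

```scala
object MissingCheckSketch {
  // Hypothetical stand-in for a labeled point: `indices` is None for dense rows.
  final case class Point(indices: Option[Array[Int]], values: Array[Float])

  // Fails on the first sparse row when `missing` is anything other than 0.
  // NaN also takes this branch, because NaN != 0.0f is true.
  // The check is lazy: it runs as the returned iterator is consumed.
  def verifyMissing(points: Iterator[Point], missing: Float): Iterator[Point] =
    if (missing != 0.0f) {
      points.map { p =>
        if (p.indices.isDefined) {
          throw new IllegalArgumentException(
            s"missing is set to $missing, but a sparse row was found; " +
              "only missing = 0 is compatible with sparse input")
        }
        p
      }
    } else {
      points
    }
}
```

Under this model, an all-dense input passes through untouched even when missing is NaN, which matches the point that the else clause does not always throw.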

cpfarrell · 2019-08-02T00:50:43Z

This is coming in pretty late, but would it be possible to add a parameter that bypasses this check? It's possible to construct a SparseVector yourself that stores zeros explicitly, so that only the entries absent from it indicate missing values. Doing this lets you solve the problem mentioned in the "Note" at https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html#dealing-with-missing-values. I understand the reason for adding this check, as most users will have used the VectorAssembler class to construct their feature vector. But if you have handled that separately, it would be nice to be able to pass a parameter to the model saying essentially "I know what I'm doing" and have an easy way of handling feature sets that contain both missing values and zeros.

If this isn't the right place to ask this question, I'd be happy to file a separate feature request ticket. I'd also be happy to work on this myself if you're open to the change.
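
A minimal sketch of the hand-built representation described above, in plain Scala. `SparseRow` and `fromOptions` are hypothetical names used only for illustration; with Spark itself the analogous step would presumably go through `Vectors.sparse(size, indices, values)`, which is assumed here to keep explicitly supplied zeros.

```scala
object ExplicitZeroSketch {
  // Hypothetical sparse row: entries not listed in `indices` are missing.
  final case class SparseRow(size: Int, indices: Array[Int], values: Array[Double]) {
    require(indices.length == values.length, "indices and values must align")

    // Some(value) for stored entries (explicit zeros included), None for missing.
    def get(i: Int): Option[Double] = {
      val pos = indices.indexOf(i)
      if (pos >= 0) Some(values(pos)) else None
    }
  }

  // Builds a row from optional features, keeping zeros and dropping only None.
  def fromOptions(features: Seq[Option[Double]]): SparseRow = {
    val present = features.zipWithIndex.collect { case (Some(v), i) => (i, v) }
    SparseRow(features.length, present.map(_._1).toArray, present.map(_._2).toArray)
  }
}
```

Here `fromOptions(Seq(Some(0.0), None, Some(3.0), None))` stores the zero at index 0 and leaves indices 1 and 3 absent, so zeros and missing values stay distinguishable.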

CodingCat · 2019-08-02T03:03:24Z

feel free to open a PR

CodingCat changed the title ~~[jvm-packages] fix the nan and non-zero missing value handling~~ [jvm-packages] fix the non-zero missing value handling Apr 9, 2019

CodingCat requested a review from superbobry April 9, 2019 20:13

CodingCat changed the title ~~[jvm-packages] fix the non-zero missing value handling~~ [breaking][jvm-packages] fix the non-zero missing value handling Apr 9, 2019

CodingCat changed the title ~~[breaking][jvm-packages] fix the non-zero missing value handling~~ [BREAKING][jvm-packages] fix the non-zero missing value handling Apr 9, 2019

CodingCat mentioned this pull request Apr 16, 2019

[jvm-packages] support spark 2.4 and compatibility test with previous xgboost version #4377

Merged

superbobry removed their request for review April 18, 2019 09:34

kkraus14 mentioned this pull request Apr 19, 2019

[REVIEW] Add Python coverage test to gpu build rapidsai/cudf#1461

Merged

CodingCat mentioned this pull request Apr 22, 2019

XGBoost 0.90 Roadmap #4389

Closed

18 tasks

alois-bissuel approved these changes Apr 23, 2019

alois-bissuel reviewed Apr 23, 2019

...xgboost4j-spark/src/test/scala/ml/dmlc/xgboost4j/scala/spark/MissingValueHandlingSuite.scala (resolved)

alois-bissuel approved these changes Apr 25, 2019

CodingCat force-pushed the missing_test branch from 0be80cf to 3d2370e Compare April 25, 2019 16:01

Nan Zhu and others added 6 commits April 26, 2019 08:36

fix the nan and non-zero missing value handling

a9ee41d

fix nan handling part

0ebbac5

add missing value

2fb6dd4

Update MissingValueHandlingSuite.scala

342ea95

Update MissingValueHandlingSuite.scala

c2c6529

stylistic fix

021097c

CodingCat force-pushed the missing_test branch from 3d2370e to 021097c Compare April 26, 2019 15:37

CodingCat merged commit 995698b into dmlc:master Apr 26, 2019

CodingCat deleted the missing_test branch April 26, 2019 18:10

CodingCat mentioned this pull request May 10, 2019

[jvm-packages]fix XGBoost-on-Spark SparseVector missing value problem #4455

Closed

rongou reviewed May 15, 2019

hcho3 mentioned this pull request May 17, 2019

[RFC] Version 0.90 release candidate #4475

Merged

CodingCat mentioned this pull request Aug 5, 2019

[legacy versions] how do we support old versions #4734

Closed

cpfarrell mentioned this pull request Aug 23, 2019

[jvm-packages] Allow for bypassing spark missing value check #4805

Merged

lock bot locked as resolved and limited conversation to collaborators Oct 31, 2019
