[BREAKING][jvm-packages] fix the non-zero missing value handling #4349
Conversation
@yanboliang the PR is changed to this
Codecov Report

@@           Coverage Diff           @@
##           master    #4349   +/-   ##
=======================================
  Coverage   67.84%   67.84%
=======================================
  Files         132      132
  Lines       12206    12206
=======================================
  Hits         8281     8281
  Misses       3925     3925

Continue to review the full report at Codecov.
I'm sorry, I'm not an active user of XGBoost any more, so I don't feel confident reviewing this. @alois-bissuel, do you want to have a look?
A small comment on the aim of the review, as I did not have enough time to look at the code in detail (and won't have time before next Tuesday).
@alois-bissuel thanks for the review. To clarify the point of the PR and help further review: this PR prevents the following case from happening.
...xgboost4j-spark/src/test/scala/ml/dmlc/xgboost4j/scala/spark/MissingValueHandlingSuite.scala
@hcho3 I think the Windows test is somehow broken?
@CodingCat I’ll take a look
Merged, thanks for the review @alois-bissuel
@@ -107,7 +106,8 @@ object XGBoost extends Serializable {
       removeMissingValues(verifyMissingSetting(xgbLabelPoints, missing),
         missing, (v: Float) => v != missing)
     } else {
-      removeMissingValues(xgbLabelPoints, missing, (v: Float) => !v.isNaN)
+      removeMissingValues(verifyMissingSetting(xgbLabelPoints, missing),
Is this change correct? In the `else` clause `missing.isNaN` is true, yet `verifyMissingSetting()` would throw an exception. Seems like the `else` clause would always throw an exception. Is that expected?
Yes, it's expected.
If you set NaN as the missing value, that means you consider 0 a meaningful value in your dataset. However, VectorAssembler may have produced a SparseVector for some of your rows by filtering out 0s, since VectorAssembler treats only 0 as the missing value.
Instead of proceeding with this incompatibility between XGBoost and VectorAssembler, we have to stop the training to avoid silently hurting accuracy.
The else clause does not always throw an exception: when every row is a dense vector, training proceeds, because in that case no 0 was filtered out by VectorAssembler.
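The guard described above can be sketched as follows. This is a minimal, self-contained model in plain Scala, not the actual xgboost4j-spark code: the `FeatureVector` types and the exception message are illustrative stand-ins, but the logic matches the behavior discussed (a non-zero or NaN missing value combined with any sparse row fails fast).

```scala
// Simplified stand-ins for Spark's DenseVector / SparseVector.
sealed trait FeatureVector
final case class DenseVec(values: Array[Float]) extends FeatureVector
final case class SparseVec(size: Int, indices: Array[Int], values: Array[Float]) extends FeatureVector

def verifyMissingSetting(rows: Seq[FeatureVector], missing: Float): Seq[FeatureVector] =
  // NaN != 0.0f evaluates to true, so a NaN missing value also triggers the sparse check.
  if (missing != 0.0f) rows.map {
    case _: SparseVec =>
      throw new IllegalArgumentException(
        s"found a SparseVector while missing is set to $missing; " +
          "zeros may already have been dropped by VectorAssembler")
    case d => d
  } else rows
```

Note that an all-dense dataset passes through unchanged, which is why the else branch does not unconditionally throw.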
This is coming in pretty late, but would it be possible to add a parameter option that avoids this check? You can construct a SparseVector yourself that includes zeros explicitly and let values absent from it indicate a missing value. Doing this solves the problem mentioned in the "Note" at https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html#dealing-with-missing-values. I understand the reason for adding this check, as most users will have used the VectorAssembler class to construct their feature vector, but if you have handled that separately, it would be nice to be able to pass a parameter to the model saying essentially "I know what I'm doing", giving an easy way to handle feature sets that contain both missing values and zeros. If this isn't the right place to ask, I'd be happy to file a separate feature request ticket. I'd also be happy to work on this myself if you are open to the change.
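The workaround described in the comment above can be sketched like this. It is a plain-Scala model, not Spark's API (the `ManualSparseVec` class is hypothetical): real zeros are stored explicitly in the sparse representation, so an absent index means "missing" rather than 0. With Spark one would do the analogous thing by including 0.0 in the values array passed to `org.apache.spark.ml.linalg.Vectors.sparse`.

```scala
// Hypothetical sparse row where absence encodes "missing", not zero.
final case class ManualSparseVec(size: Int, indices: Array[Int], values: Array[Float]) {
  // Stored value if the index is present; NaN (missing) otherwise.
  def apply(i: Int): Float = {
    val pos = indices.indexOf(i)
    if (pos >= 0) values(pos) else Float.NaN
  }
}

// 4 features: f0 = 1.5, f1 = 0 (a real zero, kept explicitly), f2 missing, f3 = 2.0
val row = ManualSparseVec(4, Array(0, 1, 3), Array(1.5f, 0.0f, 2.0f))
```

This only works if the vector was built by hand rather than by VectorAssembler, which is exactly why the commenter asks for an opt-out flag.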
feel free to open a PR
It's a follow-up PR to ad4de0d.
Spark's VectorAssembler transformer treats only 0 as the missing value, which creates problems when the user takes 0 as a meaningful value and there are enough 0-valued features that VectorAssembler uses a SparseVector to represent the feature vector. The reason is that those 0-valued features have been filtered out by Spark.
It's fine if the user's DataFrame only contains DenseVectors, as the 0s are kept.
This PR changes the behavior of XGBoost-Spark as follows:
when the user sets a non-zero missing value and we detect a sparse vector, we stop the application to prevent errors
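For context, the filtering step that the new check guards can be sketched as below. This is a simplified, self-contained version (the real code works on labeled points, and the helper name `removeMissing` is illustrative): entries equal to `missing` are dropped before the data reaches the native trainer, with NaN needing a special case since `NaN != NaN`.

```scala
// Drop entries that match the configured missing value.
def removeMissing(values: Array[Float], missing: Float): Array[Float] =
  if (missing.isNaN) values.filterNot(_.isNaN)   // mirrors (v: Float) => !v.isNaN
  else values.filter(_ != missing)               // mirrors (v: Float) => v != missing
```

Because 0s may already be gone from a SparseVector before this filter ever runs, filtering alone cannot detect the VectorAssembler conflict, hence the up-front verification.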