Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BREAKING][jvm-packages] fix the non-zero missing value handling #4349

Merged
merged 6 commits into from
Apr 26, 2019

Conversation

CodingCat
Copy link
Member

@CodingCat CodingCat commented Apr 9, 2019

it's a successive PR for ad4de0d

Spark's vector assembler transformer only accepts 0 as the missing value, which creates problems when the user takes 0 as the meaningful value and there are enough number of 0-valued features leading vector assembler to use SparseVector to represent the feature vector. The reason is that those 0-valued features has been filtered out by Spark.

It's fine if the user's DataFrame only contains DenseVector, as the 0 is kept.

This PR changes the behavior of XGBoost-Spark as:

when the user sets a non-zero missing value and we detect there is sparse vector, we stop the application to prevent error

@CodingCat CodingCat changed the title [jvm-packages] fix the nan and non-zero missing value handling [jvm-packages] fix the non-zero missing value handling Apr 9, 2019
@CodingCat CodingCat requested a review from superbobry April 9, 2019 20:13
@CodingCat
Copy link
Member Author

@yanboliang the PR is changed to this

@CodingCat CodingCat changed the title [jvm-packages] fix the non-zero missing value handling [breaking][jvm-packages] fix the non-zero missing value handling Apr 9, 2019
@CodingCat CodingCat changed the title [breaking][jvm-packages] fix the non-zero missing value handling [BREAKING][jvm-packages] fix the non-zero missing value handling Apr 9, 2019
@codecov-io
Copy link

Codecov Report

Merging #4349 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #4349   +/-   ##
=======================================
  Coverage   67.84%   67.84%           
=======================================
  Files         132      132           
  Lines       12206    12206           
=======================================
  Hits         8281     8281           
  Misses       3925     3925

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b72eab3...00a70c7. Read the comment docs.

@superbobry
Copy link
Contributor

I'm sorry, I'm not an active user of XGBoost any more, so I don't feel confident reviewing this. @alois-bissuel, do you want to have a look?

@superbobry superbobry removed their request for review April 18, 2019 09:34
@alois-bissuel
Copy link

A small comment on the aim of the review, as I did not have enough time to look at the code in details (and won't have time before next Tuesday).
I might be mistaken, but I think using SparseVector is a way (without specifying a special number which signals a missing value) to correctly handle missing value.
There is a specific use of it if one has only integer (and obviously ordinal) features (there is no NaN for ints).
I also looked at the code of VectorAssembler in Spark MLlib, and it seems that it correctly handles null features only since Spark v2.4 (see VectorAssembler.scala#L282 in Spark 2.4).
So after this review, there should be no other way of handling missing values than using a special value (either Nan or set it through the parameter "missing")

@CodingCat
Copy link
Member Author

@alois-bissuel thanks for the review

and to clarify the point of the PR and help the further review, this PR is to prevent the following case happening

@CodingCat
Copy link
Member Author

@hcho3 I think windows test is broken somehow?

@hcho3
Copy link
Collaborator

hcho3 commented Apr 25, 2019

@CodingCat I’ll take a look

@CodingCat CodingCat merged commit 995698b into dmlc:master Apr 26, 2019
@CodingCat CodingCat deleted the missing_test branch April 26, 2019 18:10
@CodingCat
Copy link
Member Author

merged, thanks for the review @alois-bissuel

@@ -107,7 +106,8 @@ object XGBoost extends Serializable {
removeMissingValues(verifyMissingSetting(xgbLabelPoints, missing),
missing, (v: Float) => v != missing)
} else {
removeMissingValues(xgbLabelPoints, missing, (v: Float) => !v.isNaN)
removeMissingValues(verifyMissingSetting(xgbLabelPoints, missing),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change correct? In the else clause missing.isNaN is true, yet verifyMissingSetting() would throw an exception. Seems like the else clause would always throw an exception. Is that expected?

Copy link
Member Author

@CodingCat CodingCat May 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it's expected

if you have NaN as missing value, that means you think 0 is a meaningful value in your dataset, however, vectorAssembler has produced a sparsevector for some of your rows (by filtering 0, since vectorassembler only thinks 0 as missing value)

instead of proceeding with this incompatible issue between XGBoost/VectorAssembler, we have to stop the training to avoid messing up the accuracy

else clause does not always throw exception, because when you have every row as dense vector, it would proceed (because we didn't find any 0 filtered by vector assembler...)

@cpfarrell
Copy link
Contributor

This is coming in pretty late but would it be possible to add a parameter option that would avoid this check? It's possible to construct a SparseVector yourself that includes zeros in the SparseVector and has the values be missing from it indicate a missing value. Doing this allows you to solve the problem mentioned in the "Note" at https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html#dealing-with-missing-values. I understand the reason for adding this check as most users will have used the VectorAssembler class to construct their feature vector but if you have handled that separately it seems like it would be nice to be able to specify a parameter to the model saying essentially "I know what I'm doing" and have an easy way of handling feature sets that have both missing values and zeros.

If this isn't the right place to be asking this question I'd be happy to file a separate feature request ticket. I'd also be happy to work on this myself if open to the change.

@CodingCat
Copy link
Member Author

This is coming in pretty late but would it be possible to add a parameter option that would avoid this check? It's possible to construct a SparseVector yourself that includes zeros in the SparseVector and has the values be missing from it indicate a missing value. Doing this allows you to solve the problem mentioned in the "Note" at https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html#dealing-with-missing-values. I understand the reason for adding this check as most users will have used the VectorAssembler class to construct their feature vector but if you have handled that separately it seems like it would be nice to be able to specify a parameter to the model saying essentially "I know what I'm doing" and have an easy way of handling feature sets that have both missing values and zeros.

If this isn't the right place to be asking this question I'd be happy to file a separate feature request ticket. I'd also be happy to work on this myself if open to the change.

feel free to open a PR

@lock lock bot locked as resolved and limited conversation to collaborators Oct 31, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants