[jvm-packages] Allow for bypassing spark missing value check #4805
@CodingCat FYI, since I talked with you about this in #4349 (comment). |
Thanks! LGTM, could you also update the doc in https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html#dealing-with-missing-values accordingly? |
Thanks, and yes, I will update the docs. Mind if I do #4727 at the same time, since it would modify the same place in the documentation? Or I could do it in a separate branch if you want. |
Sure, let’s do them together.
|
Due to various things I probably won't be able to get back to this for 3-4 weeks but am still planning on updating the docs so can push this |
e91adf9 to 2054ea9
I just updated the review with a new version after rebasing on master and updating the documentation on how to handle missing values. I ended up rewriting a decent amount of the documentation, as I've seen this issue cause lots of confusion for people trying to use XGBoost on Spark, and I'm not sure what was there before was exactly right. I'm open to reverting those changes if desired, though (or changing them in any way). I'm also open to adding more explanation of what makes this so tricky (I've found it hard to explain to folks), but that might become kind of long in the documentation. Also, any thoughts on having a VectorAssembler implementation that keeps zeros checked into this project, so we could point people directly at it and thus make integration easier? |
@CodingCat Is this ready to be merged? |
I'm traveling without a laptop; I will look at this early next week.
|
merged, thanks for the contribution! |
@cpfarrell @hcho3 This is failing builds. Can someone please help look into this? |
@sriramch It appears to be a minor infraction in the Scala style check. |
See #5134 |
This pull request adds a new parameter in Spark XGBoost that allows bypassing the assertion added as part of https://github.com/dmlc/xgboost/pull/4349/files (related comment: #4349 (comment)).
The reason for this change is that I have my own VectorAssembler implementation that creates sparse vectors in which any value can be encoded as missing. I had been using this assembler to create feature vectors with NaN encoded as missing, which integrated well with XGBoost (both in Spark and when saving the models in binary form for use on other platforms). This change would let me continue doing that by setting this special flag; essentially it's a way of addressing the "Note" section at https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html#dealing-with-missing-values. I plan on providing my VectorAssembler implementation here in case anyone else is interested in it, but it will take a little time to do that.
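To make the idea concrete, here is a minimal sketch in plain Python of what "encoding NaN as missing while keeping zeros" could look like for a sparse vector. It is an illustration only, not the VectorAssembler implementation mentioned above (which is not part of this thread), and it does not use Spark; the names `is_missing` and `assemble_sparse` are hypothetical.

```python
import math


def is_missing(value, missing):
    """Return True if `value` equals the missing marker (NaN-aware)."""
    # NaN never compares equal to itself, so it needs an explicit check.
    if math.isnan(missing):
        return math.isnan(value)
    return value == missing


def assemble_sparse(row, missing):
    """Build a sparse (size, indices, values) triple from a dense row.

    Hypothetical helper: only entries equal to `missing` are omitted,
    so explicit zeros survive when `missing` is NaN.
    """
    indices = [i for i, v in enumerate(row) if not is_missing(v, missing)]
    values = [row[i] for i in indices]
    return len(row), indices, values


row = [0.0, 1.5, float("nan"), 0.0]

# NaN is the missing marker: zeros are stored, the NaN entry is left out.
size, indices, values = assemble_sparse(row, float("nan"))

# For contrast, treating 0.0 as the marker drops the zeros instead.
_, zero_dropped_indices, _ = assemble_sparse(row, 0.0)
```

In the first call the absent entry stands for NaN rather than for zero, which is the situation where a non-zero missing value together with sparse vectors is intentional.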
For the actual code here: I mostly copied what was happening for the missing parameter and passed my new variable around the same way. I tried to follow the style of the rest of the code base, but I've never written Scala before, so there are likely a few style issues that I would be happy to hear about.
Probably the biggest question here is how to make it clear why this exists. I am not at all sure about my naming, so I welcome feedback. I should also maybe document more about what it does and how it should be used (including when it shouldn't be used).
For testing, I ran jvm-packages/dev/build-linux.sh and it succeeded. I would like to try actually running this inside a Spark job to verify it works the way I think it does.
Let me know your thoughts on this.