Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[jvm-packages] Allow for bypassing spark missing value check #4805

Merged
merged 2 commits into from
Dec 18, 2019

Conversation

cpfarrell
Copy link
Contributor

This pull request adds a new parameter in Spark XGBoost that allows for bypassing assertion added as part of https://github.com/dmlc/xgboost/pull/4349/files (related comment #4349 (comment))

The reason for this change is that I have my own VectorAssembler implementation that allows for creating sparse vectors where any value can be encoded as missing. I had been using this assembler to create feature vectors with NaN encoded as missing which integrated well with XGBoost (both in Spark and saving the models in binary form for use on other platforms). This change would let me continue doing that by providing this special flag, essentially its a way of addressing the "Note" section at https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html#dealing-with-missing-values. I plan on providing my VectorAssembler implementation here in case any one else might be interested in it but will take just a little bit of time to do that.

For the actual code here: I mostly just copied what was happening for the missing parameter and passed my new variable around the same way. I tried to follow the style of the rest of the code base but I've never written Scala before so likely have a few style issues would be happy to hear about.

Probably the biggest thing here is how to make it clear why this exists, I am not at all sure on my naming so welcome feedback. Also should maybe document more what it does and how it should be used (including when it shouldn't be used).

For testing, I ran jvm-packages/dev/build-linux.sh and it succeeded. Would like to try actually running this inside a Spark job to verify it works how I think it does.

Let me know thoughts on this.

@cpfarrell
Copy link
Contributor Author

@CodingCat FYI since talked with you about this in #4349 (comment)

@CodingCat CodingCat changed the title Allow for bypassing spark missing value check [jvm-packages] Allow for bypassing spark missing value check Aug 23, 2019
@CodingCat
Copy link
Member

Thanks! LGTM, could you also update the doc in https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html#dealing-with-missing-values accordingly?

@cpfarrell
Copy link
Contributor Author

Thaks, and yea will do on updating docs.

Mind if I do #4727 at the same time since would be modifying the same place in documentation? Or could do it in separate branch if you want.

@CodingCat
Copy link
Member

CodingCat commented Aug 31, 2019 via email

@cpfarrell
Copy link
Contributor Author

Due to various things I probably won't be able to get back to this for 3-4 weeks but am still planning on updating the docs so can push this

@cpfarrell cpfarrell force-pushed the allow_bypassing_spark_missing_check branch from e91adf9 to 2054ea9 Compare November 19, 2019 23:19
@cpfarrell
Copy link
Contributor Author

I just updated review with a new version after rebasing on master and updating documentation for how to handle missing values. I ended up rewriting the documentation a decent amount as I've seen this issue cause lots of confusion with people trying to use spark on xgboost and I'm not sure what was there before was exactly right. Open to reverting those changes if desired though (or changed in any way). Also open to adding more explanation about what makes this so tricky (I've found it hard to explain to folks) but that might become kind of long in the documentation.

Also, any thought of having an implementation of a VectorAssembler that would keep zeros checked into this project so could directly point people at it to use and thus make integration easier

@hcho3
Copy link
Collaborator

hcho3 commented Dec 10, 2019

@CodingCat Is this ready to be merged?

@CodingCat
Copy link
Member

CodingCat commented Dec 10, 2019 via email

@CodingCat
Copy link
Member

merged, thanks for the contribution!

@sriramch
Copy link
Contributor

@cpfarrell @hcho3 this is failing builds. can someone please help looking into this?

@hcho3
Copy link
Collaborator

hcho3 commented Dec 18, 2019

@sriramch It appears a minor infraction in Scala style check.

sriramch added a commit to sriramch/xgboost that referenced this pull request Dec 18, 2019
@hcho3
Copy link
Collaborator

hcho3 commented Dec 18, 2019

See #5134

@lock lock bot locked as resolved and limited conversation to collaborators Mar 17, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants