[jvm-packages] Allow for bypassing spark missing value check #4805

cpfarrell · 2019-08-23T23:08:35Z

This pull request adds a new parameter to Spark XGBoost that allows bypassing the assertion added as part of https://github.com/dmlc/xgboost/pull/4349/files (related comment #4349 (comment)).

The reason for this change is that I have my own VectorAssembler implementation that allows for creating sparse vectors where any value can be encoded as missing. I had been using this assembler to create feature vectors with NaN encoded as missing, which integrated well with XGBoost (both in Spark and when saving the models in binary form for use on other platforms). This change would let me continue doing that by providing this special flag; essentially it's a way of addressing the "Note" section at https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html#dealing-with-missing-values. I plan on providing my VectorAssembler implementation here in case anyone else might be interested in it, but it will take a little bit of time to do that.
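To make the scenario concrete, here is a minimal sketch in plain Python (not Spark or XGBoost code). It assumes the check from #4349 rejects any `missing` value other than 0 when the feature vectors are sparse. The helper names and the `bypass_check` flag are made up for illustration and are not the names used in this PR.

```python
import math

def to_sparse(values, missing):
    """Drop every entry equal to `missing`; return (size, {index: value})."""
    def is_missing(v):
        # NaN never compares equal to itself, so it needs its own test.
        if math.isnan(missing):
            return math.isnan(v)
        return v == missing
    return len(values), {i: v for i, v in enumerate(values) if not is_missing(v)}

def check_missing(missing, is_sparse, bypass_check=False):
    """Hypothetical stand-in for the assertion: sparse input requires missing == 0."""
    if is_sparse and missing != 0.0 and not bypass_check:
        raise ValueError("missing must be 0 when feature vectors are sparse")

row = [0.0, float("nan"), 3.5, 0.0]

# Usual sparse encoding: zeros are dropped, so real zeros look absent.
zero_dropped = to_sparse(row, missing=0.0)

# Custom encoding: NaN is dropped and zeros are stored explicitly.
nan_dropped = to_sparse(row, missing=float("nan"))
```

With the custom encoding, the absent entries are exactly the NaN ones, so `missing = NaN` is consistent with the data even though the vector is sparse. In this sketch that case only passes `check_missing` when `bypass_check=True`, which is the kind of opt-out the new parameter is meant to provide.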

For the actual code here: I mostly just copied what was happening for the missing parameter and passed my new variable around the same way. I tried to follow the style of the rest of the code base, but I've never written Scala before, so there are likely a few style issues that I would be happy to hear about.

Probably the biggest open question is how to make it clear why this exists. I am not at all sure about my naming, so feedback is welcome. I should probably also document more about what it does and how it should be used (including when it shouldn't be used).

For testing, I ran jvm-packages/dev/build-linux.sh and it succeeded. I would still like to try actually running this inside a Spark job to verify it works how I think it does.

Let me know thoughts on this.

cpfarrell · 2019-08-23T23:10:48Z

@CodingCat FYI, since I talked with you about this in #4349 (comment)

CodingCat · 2019-08-27T17:50:16Z

Thanks! LGTM, could you also update the doc in https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html#dealing-with-missing-values accordingly?

cpfarrell · 2019-08-30T23:24:24Z

Thanks, and yeah, will do on updating the docs.

Mind if I do #4727 at the same time, since I would be modifying the same place in the documentation? Or I could do it in a separate branch if you want.

CodingCat · 2019-08-31T00:49:14Z

Sure, let’s do them together

cpfarrell · 2019-09-11T21:29:35Z

Due to various things I probably won't be able to get back to this for 3-4 weeks, but I am still planning on updating the docs so this can be pushed.

cpfarrell · 2019-11-19T23:26:57Z

I just updated the review with a new version after rebasing on master and updating the documentation for how to handle missing values. I ended up rewriting the documentation a decent amount, as I've seen this issue cause lots of confusion for people trying to use XGBoost on Spark, and I'm not sure what was there before was exactly right. I'm open to reverting those changes if desired, though (or changing them in any way). I'm also open to adding more explanation about what makes this so tricky (I've found it hard to explain to folks), but that might become kind of long in the documentation.

Also, any thoughts on having a VectorAssembler implementation that keeps zeros checked into this project, so we could point people directly at it and thus make integration easier?

hcho3 · 2019-12-10T17:34:54Z

@CodingCat Is this ready to be merged?

CodingCat · 2019-12-10T17:37:13Z

Traveling, no laptop; will look at this early next week

CodingCat · 2019-12-18T18:48:32Z

merged, thanks for the contribution!

sriramch · 2019-12-18T22:05:34Z

@cpfarrell @hcho3 this is failing builds. Can someone please help look into this?

hcho3 · 2019-12-18T22:21:24Z

@sriramch It appears to be a minor infraction in the Scala style check.

hcho3 · 2019-12-18T22:31:14Z

See #5134

CodingCat changed the title ~~Allow for bypassing spark missing value check~~ [jvm-packages] Allow for bypassing spark missing value check Aug 23, 2019

CodingCat approved these changes Aug 27, 2019

CodingCat mentioned this pull request Aug 28, 2019

[jvm-packages] Refactor XGBoost.scala to put all params processing in one place #4815

Merged

Chris Farrell added 2 commits November 19, 2019 14:43

Allow for bypassing spark missing value check

cc1b349

Update documentation for dealing with missing values in spark xgboost

2054ea9

cpfarrell force-pushed the allow_bypassing_spark_missing_check branch from e91adf9 to 2054ea9 Compare November 19, 2019 23:19

talalryz mentioned this pull request Nov 26, 2019

Fix xgboost sparse vector support combust/mleap#605

Merged

firestarman mentioned this pull request Dec 11, 2019

[jvm-packages] Support specifying features via multiple columns #5057

Open

CodingCat merged commit bc9d882 into dmlc:master Dec 18, 2019

sriramch mentioned this pull request Dec 18, 2019

implementation of map ranking algorithm on gpu #5129

Merged

hcho3 mentioned this pull request Dec 18, 2019

[jvm-packages] Comply with scala style convention #5134

Merged

sriramch added a commit to sriramch/xgboost that referenced this pull request Dec 18, 2019

Revert "[jvm-packages] Allow for bypassing spark missing value check (dmlc#4805)"

f6b3ff7

This reverts commit bc9d882.

lucagiovagnoli mentioned this pull request Jan 24, 2020

[Roadmap] XGBoost 1.0.0 Roadmap #4680

Closed

9 tasks

lock bot locked as resolved and limited conversation to collaborators Mar 17, 2020
