[jvm-packages] eval_set for xgboost4j-spark #3231

Closed
keanpantraw opened this issue Apr 9, 2018 · 8 comments

keanpantraw commented Apr 9, 2018

There is no way to set a custom evaluation set for ml.dmlc.xgboost4j.scala.spark.XGBoost#trainDistributed. The code inside uses the private ml.dmlc.xgboost4j.scala.spark.Watches class, which just splits the training data with a predefined trainTestRatio and doesn't accept a custom eval set through params.
Is there a particular reason for this limitation, or is it just a stub that could be extended, for example with a DMatrix passed through params? Are there any complications caused by the fact that this is distributed XGBoost? How should such a dataset be stored in params: as a DMatrix, as an RDD, or as something else?
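
To illustrate the limitation, a ratio-based split behaves roughly like the sketch below: every row of the training data is randomly sent to the training side or the held-out side according to trainTestRatio, so there is nowhere to plug in a separately prepared dataset. This is a simplified stand-in in plain Scala, not the actual Watches code; RatioSplitSketch, split, and the seed parameter are hypothetical names chosen for the example.

```scala
import scala.util.Random

// Hypothetical stand-in for a ratio-based train/held-out split (not the real Watches class).
object RatioSplitSketch {
  // Each row goes to the training side when the random draw is <= trainTestRatio,
  // otherwise to the held-out side. Returns (training rows, held-out rows).
  def split[A](rows: Seq[A], trainTestRatio: Double, seed: Long): (Seq[A], Seq[A]) = {
    require(trainTestRatio >= 0.0 && trainTestRatio <= 1.0, "trainTestRatio must be in [0, 1]")
    val rng = new Random(seed)
    rows.partition(_ => rng.nextDouble() <= trainTestRatio)
  }
}
```

With trainTestRatio set to 1.0 every row ends up on the training side and the held-out side is empty, which is why a custom eval set needs a separate entry point.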

keanpantraw pushed a commit to keanpantraw/xgboost that referenced this issue Apr 9, 2018
keanpantraw pushed a commit to keanpantraw/xgboost that referenced this issue Apr 10, 2018
@CodingCat
Member

I think there was a comment about this when the code was brought in: #2710 (comment)

Would you like to give this requirement a shot?

hcho3 mentioned this issue Jul 4, 2018
@hcho3
Collaborator

hcho3 commented Jul 4, 2018

All feature requests are now consolidated to #3439. This issue should be re-opened if someone decides to actively work on implementing this feature.

hcho3 closed this as completed Jul 4, 2018
CodingCat reopened this Jul 10, 2018
CodingCat self-assigned this Jul 10, 2018
@CodingCat
Member

I will work on the eval set this week.

@hcho3
Collaborator

hcho3 commented Aug 1, 2018

@CodingCat There is work in progress to implement a watchlist in the XGBoost4J Scala wrapper: #3544. Can we take advantage of this to implement a watchlist in XGBoost4J-Spark?

@CodingCat
Member

The problem on the Spark side is that you have to find some way to pass in and join (or zip) multiple DataFrames,

then pass a part of each of them to each Spark task, create a DMatrix from each part, and use each of those DMatrix objects inside the task as a watch dataset.

That part is fairly complicated and requires refactoring the current Watches code. I think we can do it in the next version.
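
Roughly, the partition-wise pairing described above could look like the sketch below. It is a plain-Scala simulation, not Spark or XGBoost code: Vector[Vector[Row]] stands in for a partitioned dataset, WatchData stands in for a per-task DMatrix, and Row, WatchData, EvalSetSketch, partition, and watchesPerTask are all hypothetical names. The assumption is that the training data and every eval set are split into the same number of partitions, so that task w receives partition w of each.

```scala
// Hypothetical row type and per-task watch dataset (stand-ins, not XGBoost classes).
final case class Row(label: Float, features: Vector[Float])
final case class WatchData(name: String, rows: Vector[Row])

object EvalSetSketch {
  // Round-robin split into numWorkers partitions (stand-in for repartitioning a dataset).
  def partition(data: Vector[Row], numWorkers: Int): Vector[Vector[Row]] = {
    require(numWorkers > 0, "numWorkers must be positive")
    Vector.tabulate(numWorkers) { w =>
      data.zipWithIndex.collect { case (row, i) if i % numWorkers == w => row }
    }
  }

  // For each worker, pair its slice of the training data with its slice of every
  // named eval set, so each task can build one watch dataset per name.
  def watchesPerTask(
      train: Vector[Row],
      evalSets: Map[String, Vector[Row]],
      numWorkers: Int): Vector[Vector[WatchData]] = {
    val trainParts = partition(train, numWorkers)
    // Sort by name so every task sees the eval sets in the same order.
    val evalParts = evalSets.toVector.sortBy(_._1).map {
      case (name, rows) => name -> partition(rows, numWorkers)
    }
    Vector.tabulate(numWorkers) { w =>
      WatchData("train", trainParts(w)) +: evalParts.map {
        case (name, parts) => WatchData(name, parts(w))
      }
    }
  }
}
```

In a real implementation the pairing would have to happen across Spark partitions rather than in-memory vectors, which is the part that needs the Watches refactoring.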

@hcho3
Collaborator

hcho3 commented Sep 7, 2018

Consolidating to the feature request tracker #3439. Feel free to re-open this issue when anyone starts working on this.

@CodingCat
Member

the feature is implemented in #3910

@eliyara

eliyara commented Feb 13, 2019

Hi @CodingCat,

I need to define a separate validation set for cross validation, using xgboost4j on Spark. I tried the approach here. It does not look like setting "eval_sets" -> Map("dev" -> dev_df) makes any difference. Should I expect the following setup to work the way cross validation does (using TrainValidationSplit)?

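        import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

        // train_df and dev_df are Spark DataFrames defined elsewhere.
        val params = Map[String, Any](
            "eta" -> 0.1,
            "objective" -> "binary:logistic",
            // Named evaluation sets passed through params.
            "eval_sets" -> Map("dev" -> dev_df))
        val booster = new XGBoostClassifier(params)
        booster.setFeaturesCol("features")
        booster.setLabelCol("label")
        booster.setMaxDepth(5)
        booster.setNumRound(150)
        booster.setNumWorkers(4)
        // Fit on the training DataFrame only.
        val xgb_model = booster.fit(train_df)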
