
[jvm-packages]support multiple validation datasets in Spark #3910

Merged

21 commits merged into dmlc:master from multi_eval on Dec 18, 2018

Conversation

CodingCat
Member

@CodingCat CodingCat commented Nov 16, 2018

  • converge the current training/test split with multi validation datasets
  • add unit test for multiple validation set
  • add support for ranking training tasks
  • add unit test for ranking training
  • update tutorial
  • fix early stopping feature
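The changes listed above can be sketched from the user's side roughly as follows (a hypothetical usage sketch based on the API this PR adds; the parameter map, column names, and the `validationDf1`/`validationDf2`/`trainingDf` DataFrames are illustrative, not code from the PR):

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

val classifier = new XGBoostClassifier(Map(
  "objective" -> "binary:logistic",
  "num_round" -> 100
))
classifier.setFeaturesCol("features")
classifier.setLabelCol("label")
// Two named validation sets; the eval metric is reported on each
// of them at every boosting round.
classifier.setEvalSets(Map("eval1" -> validationDf1, "eval2" -> validationDf2))

val model = classifier.fit(trainingDf)
```

`setEvalSets` is called on its own line rather than chained, because (as a reviewer notes later in this thread) it returns `Unit` in this version of the code.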

@CodingCat CodingCat changed the title [WIP][jvm-packages]support multiple validation datasets in Spark [jvm-packages]support multiple validation datasets in Spark Nov 19, 2018
@CodingCat
Member Author

@yanboliang @weitian @superbobry can any of you review this?

@CodingCat
Member Author

[screenshot: example training output]

This is how an example run looks after the change.

@CodingCat CodingCat changed the title [jvm-packages]support multiple validation datasets in Spark [WIP][jvm-packages]support multiple validation datasets in Spark Nov 20, 2018
@CodingCat
Member Author

CodingCat commented Nov 20, 2018

Found one more thing to fix: early stopping.

@CodingCat CodingCat changed the title [WIP][jvm-packages]support multiple validation datasets in Spark [jvm-packages]support multiple validation datasets in Spark Nov 20, 2018
@CodingCat
Member Author

@yanboliang could you take a further look?

@codecov-io

codecov-io commented Nov 26, 2018

Codecov Report

Merging #3910 into master will increase coverage by 0.18%.
The diff coverage is 72.02%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #3910      +/-   ##
============================================
+ Coverage     56.23%   56.41%   +0.18%     
- Complexity      205      210       +5     
============================================
  Files           185      186       +1     
  Lines         14702    14818     +116     
  Branches        498      527      +29     
============================================
+ Hits           8267     8359      +92     
- Misses         6196     6202       +6     
- Partials        239      257      +18
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| ...lc/xgboost4j/scala/spark/params/CustomParams.scala | 84.21% <ø> (+20.21%) | 0 <0> (ø) ⬇️ |
| .../xgboost4j/scala/example/spark/SparkTraining.scala | 0% <0%> (ø) | 0 <0> (ø) ⬇️ |
| .../src/main/java/ml/dmlc/xgboost4j/java/XGBoost.java | 85.33% <100%> (+0.29%) | 46 <0> (+3) ⬆️ |
| ...boost4j/scala/spark/params/NonParamVariables.scala | 100% <100%> (ø) | 0 <0> (?) |
| ...c/xgboost4j/scala/spark/params/GeneralParams.scala | 66.66% <100%> (ø) | 0 <0> (ø) ⬇️ |
| ...xgboost4j/scala/spark/XGBoostTrainingSummary.scala | 35.71% <11.11%> (-27.93%) | 2 <1> (ø) |
| ...cala/ml/dmlc/xgboost4j/scala/spark/DataUtils.scala | 42.1% <60%> (+19.88%) | 0 <0> (ø) ⬇️ |
| .../dmlc/xgboost4j/scala/spark/XGBoostRegressor.scala | 63.9% <66.66%> (+1.52%) | 18 <0> (+1) ⬆️ |
| ...dmlc/xgboost4j/scala/spark/XGBoostClassifier.scala | 65.06% <77.77%> (+0.77%) | 19 <0> (+1) ⬆️ |
| .../scala/ml/dmlc/xgboost4j/scala/spark/XGBoost.scala | 77.12% <79.52%> (-0.38%) | 0 <0> (ø) |

... and 3 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9c4ff50...49d578f.

@CodingCat
Member Author

@yanboliang ping?

@CodingCat
Member Author

@yanboliang ping

@yanboliang
Contributor

Looks good to me overall, thanks.

@CodingCat
Member Author

thanks @yanboliang

@CodingCat CodingCat merged commit c055a32 into dmlc:master Dec 18, 2018
@CodingCat CodingCat deleted the multi_eval branch December 18, 2018 05:04
trait NonParamVariables {
  protected var evalSetsMap: Map[String, DataFrame] = Map.empty

  def setEvalSets(evalSets: Map[String, DataFrame]): Unit = {


I am not sure if this is the right place to comment; let me know if it is not.

I am an XGBoost user, running 0.82 for our production training. One thing I think we can improve: the return value of this function should be `this.type` rather than `Unit`, because as a Scala user I will write code like:

val xgb = new XGBoostClassifier(xgboostParam)
  .setFeaturesCol(MSDataSchema.FEATURE_VECTOR)
  .setLabelCol(MSDataSchema.RELEVANCE_LEVEL)
  .setEvalSets(Map("eval" -> testData))

`xgb` will be `Unit` (`()`) rather than a classifier in this case.

@CodingCat
Member Author

Oh, it’s a typo or an IDE autocompletion; feel free to file a PR addressing this!
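The `this.type` fix the commenter suggests can be shown in isolation. Below is a minimal self-contained sketch (the trait, class, and method names are hypothetical, not the actual xgboost4j API): a setter that returns `this.type` keeps the concrete subclass type through a chain of calls, whereas a `Unit`-returning setter makes the whole chained expression evaluate to `()`.

```scala
trait FluentParams {
  protected var evalNames: Seq[String] = Seq.empty

  // Returning `this.type` instead of Unit lets callers chain setters
  // and still get back the concrete subclass, not the trait.
  def setEvalNames(names: Seq[String]): this.type = {
    evalNames = names
    this
  }
}

class Classifier extends FluentParams {
  def describe(): String = s"evals: ${evalNames.mkString(",")}"
}

object Demo extends App {
  // The chained call type-checks as a Classifier thanks to this.type.
  val clf = new Classifier().setEvalNames(Seq("eval1", "eval2"))
  println(clf.describe()) // prints "evals: eval1,eval2"
}
```

This singleton-type trick is the standard Scala idiom for fluent builders shared across a class hierarchy, and it is what Spark ML's own `Params` setters use.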

@linghaogu

I will, thank you!

@lock lock bot locked as resolved and limited conversation to collaborators May 2, 2019

4 participants