Fix xgboost sparse vector support #605
Conversation
@talalryz Do you know why tests are not passing?
21.17,58.16,1017.16,68.11,452.02
19.94,58.96,1014.16,66.27,455.55
8.73,41.92,1029.41,89.72,480.99
21.91,0,0,0,445.04
What are these zeros representing? Zeros or NaNs?
They represent missing values. The Spark VectorAssembler will form sparse vectors for these rows. The test setup is a pipeline of:
features -> vector assembler -> xgboost model
Having these rows with missing values means that the MLeap XGBoost models will actually have to deal with SparseVectors (or SparseTensors).
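The compression step can be sketched in plain Scala (illustrative only, no Spark dependency) — this mirrors how VectorAssembler keeps only the non-zero entries, so a row like 21.91,0,0,0,445.04 above becomes a SparseVector:

```scala
// Illustrative sketch: how a dense row with zeros collapses into a
// sparse (indices, values) representation, the way Spark's
// VectorAssembler builds a SparseVector. Plain Scala, no Spark needed.
val dense = Array(21.91, 0.0, 0.0, 0.0, 445.04)

val indices = dense.indices.filter(i => dense(i) != 0.0) // positions of non-zeros
val values  = indices.map(dense(_))                      // their values

// indices hold 0 and 4; values hold 21.91 and 445.04 -- the zeros are
// gone, which is why MLeap must handle SparseVectors/SparseTensors.
```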
@@ -106,7 +129,6 @@ class XGBoostClassificationModelParitySpec extends FunSpec
}
assert(Math.abs(v2 - v1) < 0.0000001)
Magic number in code :S
Could this maybe be
SEVEN_SIGNIFICANT_DIGITS = 0.0000001
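The reviewer's suggestion can be sketched like this (constant and helper names are hypothetical, not from the PR):

```scala
// Sketch of the suggestion above: name the tolerance instead of
// repeating the magic number at each assertion site.
val SevenSignificantDigits = 1e-7

def approxEqual(v1: Double, v2: Double): Boolean =
  Math.abs(v2 - v1) < SevenSignificantDigits
```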
val mleapSchema = TypeConverters.sparkSchemaToMleapSchema(dataset)

val data = dataset.collect().map {
  r => Row(r.toSeq: _*)
}
val frame = DefaultLeapFrame(mleapSchema, data)
- val mleapT = mleapTransformer(sparkTransformer)
+ val mleapT = mleapTransformer(sparkTransformer, dataset, bundleCache)
val mleapDataset = mleapT.transform(frame).get
This should ideally be one responsibility.
> @talalryz Do you know why tests are not passing?

The tests are not passing because XGBoost 0.90 does not allow Float.NaN as the missing value for sparse vectors:

ERROR DataBatch: java.lang.RuntimeException: you can only specify missing value as 0.0 (the currently set value NaN) when you have SparseVector or Empty vector as your feature format

This is expected, as this branch does not fix the issues introduced by the recent upgrade to XGBoost 0.90.
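The 0.90 behaviour can be sketched as a guard (assumed logic that mirrors the quoted RuntimeException, not XGBoost's actual source):

```scala
// Assumed sketch of the xgboost4j 0.90 check: with sparse features,
// the only accepted `missing` sentinel is 0.0f. Float.NaN (the old
// default) fails the comparison, producing the error quoted above.
def validateMissing(missing: Float, hasSparseFeatures: Boolean): Unit =
  if (hasSparseFeatures && missing != 0.0f)
    throw new RuntimeException(
      s"you can only specify missing value as 0.0 (the currently set value $missing) " +
        "when you have SparseVector or Empty vector as your feature format")
```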
Updated the branch so that all tests pass.
SparkParityBase.dataset(spark).select("fico_score_group_fnl", "dti").
  filter(col("fico_score_group_fnl") === "500 - 550" ||
    col("fico_score_group_fnl") === "600 - 650")
}

val sparkTransformer: Transformer = {
val mixedDataset: DataFrame = {
This is supposed to be a sparse dataset at some point along the way, right? Since implicit class VectorOps
is implicit, I am not sure where that happens.
I called it a mixed dataset because it has both sparse and dense rows. I can rename it sparseDataset to be clearer.
Mmh, I don't know whether a dataset having both sparse rows and "full" rows would still be sparse or dense. Is the glass half empty or half full?
But I think for readability it's good to have a sparseDataset and a denseDataset, so the tests are clearer.
updated naming to sparseDataset
I updated the tests in this branch to match the XGBoost specification for dealing with missing values: https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html#dealing-with-missing-values All the tests now pass.
@@ -36,6 +36,7 @@ class XGBoostRegressionModelParitySpec extends FunSpec
  private val xgboostParams: Map[String, Any] = Map(
    "eta" -> 0.3,
    "max_depth" -> 2,
+   "missing" -> 0.0f,
Does this mean that missing values are substituted with zeros?
I'm wondering, why not put nulls here?
Short Answer:
- This means that zeros will be assumed to be missing values.
- We can't set it to anything other than 0.0, because XGBoost throws the following error if we do:
  java.lang.RuntimeException: you can only specify missing value as 0.0 (the currently set value NaN) when you have SparseVector or Empty vector as your feature format
Detailed Answer:
The logic behind not allowing any value other than 0.0 is that Spark's VectorAssembler always treats 0 as a missing value, so any SparseVector created by the Spark VectorAssembler will omit 0.0 entries, treating them as missing.
The XGBoost developers decided that, to ensure compatibility with Spark's VectorAssembler, they would enforce setting missing
to 0.0 in their models whenever a SparseVector is used; otherwise, DenseVector and SparseVector inputs would behave inconsistently.
There is more info here: https://github.com/dmlc/xgboost/pull/4349/files
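The mismatch being guarded against can be shown with a small plain-Scala sketch (function name hypothetical, not MLeap's or XGBoost's API): which entries of a row actually reach the booster depends on the `missing` sentinel, and only missing = 0.0 agrees with what an assembler-built SparseVector has already dropped.

```scala
// Which entries of a dense row survive a given `missing` sentinel.
// Illustrative only; presentEntries is a hypothetical name.
def presentEntries(row: Array[Double], missing: Double): Seq[(Int, Double)] =
  row.indices.filter(i => row(i) != missing).map(i => (i, row(i)))

val row = Array(21.91, 0.0, 0.0, 0.0, 445.04)

// missing = NaN: NaN != x is always true, so the dense path would feed
// all five entries (including real zeros) to the booster...
val withNaN  = presentEntries(row, Double.NaN)
// ...while a SparseVector from VectorAssembler has already dropped the
// zeros. missing = 0.0 makes the dense path drop them too.
val withZero = presentEntries(row, 0.0)
```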
This looks good to me. @ancasarb do you know if anyone could review and approve this?
Just some small test cleanup needed, but changes look great, thank you!
looks great, thank you!
What is the bug?
MLeap XGBoost does not support sparse vectors.
How can the bug be found?
I added some sparse rows to the testing data in
xgboost_training.csv
. This causes tests to fail, with predictions from MLeap not matching Spark predictions. I first noticed this bug in #596.
How will the bug be fixed?
Create the DMatrix directly from the SparseTensor and SparseVector, instead of converting to a DenseVector and DenseTensor first.
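The shape of the fix can be sketched like this (types and names assumed for illustration; the real code works with MLeap's tensor types and xgboost4j):

```scala
// Hypothetical sketch -- SparseRow stands in for MLeap's sparse
// vector/tensor types; names are not MLeap's actual API.
final case class SparseRow(size: Int, indices: Array[Int], values: Array[Double])

// Previous approach: densify first, O(size) work per row.
def toDense(r: SparseRow): Array[Double] = {
  val arr = new Array[Double](r.size)
  r.indices.zip(r.values).foreach { case (i, v) => arr(i) = v }
  arr
}

// Fixed approach: hand the (indices, values) pair to the DMatrix row
// builder directly, O(nnz) per row, keeping the sparse structure.
def toDMatrixRow(r: SparseRow): (Array[Int], Array[Double]) =
  (r.indices, r.values)
```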
What does this PR do?
Tests
The regressor test now passes with an updated dataset that includes sparse data.
For the classifier, I added a new test that uses this mixed dataset. This also passes.
**WHAT THIS PR DOES NOT FIX**
These changes have been tested with MLeap v0.11.0 and should work through v0.14.
These changes do not fix issues with v0.15.0, which uses XGBoost v0.90. XGBoost v0.90 no longer supports using Float.NaN as the missing value for sparse vectors, and I have not found a way to make MLeap predictions match Spark predictions. The remaining issues lie within XGBoost itself, and we should be able to resolve them after this approved PR to XGBoost is merged: dmlc/xgboost#4805
EDIT: I updated the tests to match the XGBoost specification for handling missing values: https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html#dealing-with-missing-values
This branch will now work with XGBoost v0.90.