Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix xgboost sparse vector support #605

Merged

Conversation

talalryz
Copy link
Contributor

@talalryz talalryz commented Nov 26, 2019

What is the bug?
mleap xgboost does not support sparse vector

How can the bug be found?
I added some sparse rows to the testing data in xgboost_training.csv. This causes tests to fail with predictions from mleap not matching spark predictions.
I first noticed this bug in #596

How will the bug be fixed?
Create DMatrix directly from sparseTensor and sparseVector, instead of converting to denseVector and denseTensor first.

What does this PR do?

  1. Implements the fix above.
  2. Also fixes a bug in XGBoostClassifier where predict doesn't pass treeLimit

Tests
The regressor test now passes with updated dataset that includes sparse Data.
For the classifier, i added a new test that uses this mixed dataset. This also passes.

** WHAT THIS TEST DOES NOT FIX**
These changes have been tested with mleap v0.11.0, and should work until 0.14. These changes do not fix issues with v0.15.0 which uses xgboost v0.90. XGBoost v0.90 no longer supports using Float.NaN as missing values for sparse vectors, and I have not found a way to make mleap predictions match spark predictions.
There are issues that lie within xgboost itself, and we should be able to resolve after this approved PR to xgboost is merged: dmlc/xgboost#4805

EDIT: I updated the tests to match the xgboost specification for handling missing values: https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html#dealing-with-missing-values

This branch will work with xgboost v0.90.

@lucagiovagnoli
Copy link
Member

@talalryz Do you know why tests are not passing ?

21.17,58.16,1017.16,68.11,452.02
19.94,58.96,1014.16,66.27,455.55
8.73,41.92,1029.41,89.72,480.99
21.91,0,0,0,445.04
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are these zero representing? Zeros or Nans ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are representing missing values. The spark vector assembler will form sparse vectors corresponding to these rows. The test setup is a pipeline of,

features -> vector assembler -> xgboost model

Having these rows with missing values means that the Mleap xgboost models will actually have to deal with SparseVectors ( or SparseTensors)

@@ -106,7 +129,6 @@ class XGBoostClassificationModelParitySpec extends FunSpec
}
assert(Math.abs(v2 - v1) < 0.0000001)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Magic number in code :S
Could this maybe be
SEVEN_SIGNIFICANT_DIGITS = 0.0000001

Comment on lines 177 to 184
val mleapSchema = TypeConverters.sparkSchemaToMleapSchema(dataset)

val data = dataset.collect().map {
r => Row(r.toSeq: _*)
}
val frame = DefaultLeapFrame(mleapSchema, data)
val mleapT = mleapTransformer(sparkTransformer)
val mleapT = mleapTransformer(sparkTransformer, dataset, bundleCache)
val mleapDataset = mleapT.transform(frame).get
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should ideally be one responsibility

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@talalryz Do you know why tests are not passing ?

the tests are not passing because xgboost 0.90 does not allow Float.NaN as missing value for sparse Vectors.

ERROR DataBatch: java.lang.RuntimeException: you can only specify missing value as 0.0 (the currently set value NaN) when you have SparseVector or Empty vector as your feature format

This is expected as this branch does not fix the issues introduced by the recent upgrade to xgboost 0.90.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated the branch so that all tests pass

SparkParityBase.dataset(spark).select("fico_score_group_fnl", "dti").
filter(col("fico_score_group_fnl") === "500 - 550" ||
col("fico_score_group_fnl") === "600 - 650")
}

val sparkTransformer: Transformer = {
val mixedDataset: DataFrame = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is supposed to be a Sparse dataset at some point along the way right ? Due to implicit class VectorOps being implicit I am not sure where that happens

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I called it a mixed Dataset because it has both sparse and dense rows. I can rename it sparseDataset to be more clear

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmh, I don't know if a dataset having both sparse rows and "full" rows would still sparse or dense. Is the glass half empty or half full?
But I think for readability it's good to have a sparkeDataset and a denseDataset, so the tests are clearer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated naming to sparseDataset

@talalryz
Copy link
Contributor Author

talalryz commented Dec 2, 2019

I updated the tests in this branch to match the xgboost specification for dealing with missing values: https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html#dealing-with-missing-values

All the tests now pass.

@ancasarb

@@ -36,6 +36,7 @@ class XGBoostRegressionModelParitySpec extends FunSpec
private val xgboostParams: Map[String, Any] = Map(
"eta" -> 0.3,
"max_depth" -> 2,
"missing" -> 0.0f,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that missing are substituted with zeros ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering, why not putting nulls here ?

Copy link
Contributor Author

@talalryz talalryz Dec 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Short Answer:

  1. This means that zeros will be assumed to be missing values.

  2. We can't set it as anything other than 0.0 because xgboost throws the following error if we do that:

java.lang.RuntimeException: you can only specify missing value as 0.0 (the currently set value NaN) when you have SparseVector or Empty vector as your feature format

Detailed Answer:

The logic behind not allowing any value other than 0.0 is that Sparks VectorAssembler always treats 0 as a missing value, so any SparseVector created by the Spark VectorAssembler will omit 0.0 treating it as missing.
XGboost devs decided that to ensure compatibility with Sparks VectorAssembler they would enforce setting missing as 0.0 in their models whenever a SparseVector was used, otherwise; there would be incompatibility between DenseVector and SparseVector

There is more info here: https://github.com/dmlc/xgboost/pull/4349/files

Copy link
Member

@lucagiovagnoli lucagiovagnoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. @ancasarb do you know if anyone could review and approve this?

Copy link
Member

@ancasarb ancasarb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some small test cleanup needed, but changes look great, thank you!

Copy link
Member

@ancasarb ancasarb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great, thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants