Fix xgboost sparse vector support #605
Conversation
@talalryz Do you know why tests are not passing?
21.17,58.16,1017.16,68.11,452.02
19.94,58.96,1014.16,66.27,455.55
8.73,41.92,1029.41,89.72,480.99
21.91,0,0,0,445.04
What are these zeros representing? Zeros or NaNs?
They represent missing values. The Spark VectorAssembler will form sparse vectors for these rows. The test setup is a pipeline of:
features -> vector assembler -> xgboost model
Having these rows with missing values means that the MLeap XGBoost models will actually have to deal with SparseVectors (or SparseTensors).
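The compression step can be sketched in plain Scala (illustrative only, no Spark dependency) — this mirrors how VectorAssembler keeps only the non-zero entries, so a row like 21.91,0,0,0,445.04 above becomes a SparseVector:

```scala
// Illustrative sketch: how a dense row with zeros collapses into a
// sparse (indices, values) representation, the way Spark's
// VectorAssembler builds a SparseVector. Plain Scala, no Spark needed.
val dense = Array(21.91, 0.0, 0.0, 0.0, 445.04)

val indices = dense.indices.filter(i => dense(i) != 0.0) // positions of non-zeros
val values  = indices.map(dense(_))                      // their values

// indices hold 0 and 4; values hold 21.91 and 445.04 -- the zeros are
// gone, which is why MLeap must handle SparseVectors/SparseTensors.
```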
@@ -106,7 +129,6 @@ class XGBoostClassificationModelParitySpec extends FunSpec
}
assert(Math.abs(v2 - v1) < 0.0000001)
Magic number in code :S
Could this maybe be
SEVEN_SIGNIFICANT_DIGITS = 0.0000001
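The reviewer's suggestion can be sketched like this (constant and helper names are hypothetical, not from the PR):

```scala
// Sketch of the suggestion above: name the tolerance instead of
// repeating the magic number at each assertion site.
val SevenSignificantDigits = 1e-7

def approxEqual(v1: Double, v2: Double): Boolean =
  Math.abs(v2 - v1) < SevenSignificantDigits
```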
val mleapSchema = TypeConverters.sparkSchemaToMleapSchema(dataset)

val data = dataset.collect().map {
  r => Row(r.toSeq: _*)
}
val frame = DefaultLeapFrame(mleapSchema, data)
- val mleapT = mleapTransformer(sparkTransformer)
+ val mleapT = mleapTransformer(sparkTransformer, dataset, bundleCache)
val mleapDataset = mleapT.transform(frame).get
This should ideally be one responsibility.
> @talalryz Do you know why tests are not passing?

The tests are not passing because XGBoost 0.90 does not allow Float.NaN as the missing value for sparse vectors:

ERROR DataBatch: java.lang.RuntimeException: you can only specify missing value as 0.0 (the currently set value NaN) when you have SparseVector or Empty vector as your feature format

This is expected, as this branch does not fix the issues introduced by the recent upgrade to XGBoost 0.90.
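The 0.90 behaviour can be sketched as a guard (assumed logic that mirrors the quoted RuntimeException, not XGBoost's actual source):

```scala
// Assumed sketch of the xgboost4j 0.90 check: with sparse features,
// the only accepted `missing` sentinel is 0.0f. Float.NaN (the old
// default) fails the comparison, producing the error quoted above.
def validateMissing(missing: Float, hasSparseFeatures: Boolean): Unit =
  if (hasSparseFeatures && missing != 0.0f)
    throw new RuntimeException(
      s"you can only specify missing value as 0.0 (the currently set value $missing) " +
        "when you have SparseVector or Empty vector as your feature format")
```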
Updated the branch so that all tests pass.
SparkParityBase.dataset(spark).select("fico_score_group_fnl", "dti").
  filter(col("fico_score_group_fnl") === "500 - 550" ||
    col("fico_score_group_fnl") === "600 - 650")
}

val sparkTransformer: Transformer = {
val mixedDataset: DataFrame = {
This is supposed to be a sparse dataset at some point along the way, right? Since implicit class VectorOps
is implicit, I am not sure where that happens.
I called it a mixed dataset because it has both sparse and dense rows. I can rename it sparseDataset to be clearer.
Mmh, I don't know whether a dataset having both sparse rows and "full" rows would still be sparse or dense. Is the glass half empty or half full?
But I think for readability it's good to have a sparseDataset and a denseDataset, so the tests are clearer.
updated naming to sparseDataset
I updated the tests in this branch to match the XGBoost specification for dealing with missing values: https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html#dealing-with-missing-values All the tests now pass.
@@ -36,6 +36,7 @@ class XGBoostRegressionModelParitySpec extends FunSpec
  private val xgboostParams: Map[String, Any] = Map(
    "eta" -> 0.3,
    "max_depth" -> 2,
+   "missing" -> 0.0f,
Does this mean that missing values are substituted with zeros?
I'm wondering, why not put nulls here?
Short Answer:
- This means that zeros will be assumed to be missing values.
- We can't set it to anything other than 0.0, because XGBoost throws the following error if we do:
  java.lang.RuntimeException: you can only specify missing value as 0.0 (the currently set value NaN) when you have SparseVector or Empty vector as your feature format
Detailed Answer:
The logic behind not allowing any value other than 0.0 is that Spark's VectorAssembler always treats 0 as a missing value, so any SparseVector created by the Spark VectorAssembler will omit 0.0 entries, treating them as missing.
The XGBoost developers decided that, to ensure compatibility with Spark's VectorAssembler, they would enforce setting missing
to 0.0 in their models whenever a SparseVector is used; otherwise, DenseVector and SparseVector inputs would behave inconsistently.
There is more info here: https://github.com/dmlc/xgboost/pull/4349/files
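The mismatch being guarded against can be shown with a small plain-Scala sketch (function name hypothetical, not MLeap's or XGBoost's API): which entries of a row actually reach the booster depends on the `missing` sentinel, and only missing = 0.0 agrees with what an assembler-built SparseVector has already dropped.

```scala
// Which entries of a dense row survive a given `missing` sentinel.
// Illustrative only; presentEntries is a hypothetical name.
def presentEntries(row: Array[Double], missing: Double): Seq[(Int, Double)] =
  row.indices.filter(i => row(i) != missing).map(i => (i, row(i)))

val row = Array(21.91, 0.0, 0.0, 0.0, 445.04)

// missing = NaN: NaN != x is always true, so the dense path would feed
// all five entries (including real zeros) to the booster...
val withNaN  = presentEntries(row, Double.NaN)
// ...while a SparseVector from VectorAssembler has already dropped the
// zeros. missing = 0.0 makes the dense path drop them too.
val withZero = presentEntries(row, 0.0)
```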
This looks good to me. @ancasarb do you know if anyone could review and approve this?
Just some small test cleanup needed, but changes look great, thank you!
looks great, thank you!
What is the bug?
MLeap XGBoost does not support sparse vectors.
How can the bug be found?
I added some sparse rows to the testing data in
xgboost_training.csv
. This causes tests to fail, with predictions from MLeap not matching Spark predictions. I first noticed this bug in #596.
How will the bug be fixed?
Create the DMatrix directly from the SparseTensor and SparseVector, instead of converting to a DenseVector and DenseTensor first.
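The shape of the fix can be sketched like this (types and names assumed for illustration; the real code works with MLeap's tensor types and xgboost4j):

```scala
// Hypothetical sketch -- SparseRow stands in for MLeap's sparse
// vector/tensor types; names are not MLeap's actual API.
final case class SparseRow(size: Int, indices: Array[Int], values: Array[Double])

// Previous approach: densify first, O(size) work per row.
def toDense(r: SparseRow): Array[Double] = {
  val arr = new Array[Double](r.size)
  r.indices.zip(r.values).foreach { case (i, v) => arr(i) = v }
  arr
}

// Fixed approach: hand the (indices, values) pair to the DMatrix row
// builder directly, O(nnz) per row, keeping the sparse structure.
def toDMatrixRow(r: SparseRow): (Array[Int], Array[Double]) =
  (r.indices, r.values)
```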
What does this PR do?
Tests
The regressor test now passes with an updated dataset that includes sparse data.
For the classifier, I added a new test that uses this mixed dataset. This also passes.
**WHAT THIS PR DOES NOT FIX**
These changes have been tested with MLeap v0.11.0 and should work through v0.14.
These changes do not fix issues with v0.15.0, which uses XGBoost v0.90. XGBoost v0.90 no longer supports using Float.NaN as the missing value for sparse vectors, and I have not found a way to make MLeap predictions match Spark predictions. The remaining issues lie within XGBoost itself, and we should be able to resolve them after this approved PR to XGBoost is merged: dmlc/xgboost#4805
EDIT: I updated the tests to match the XGBoost specification for handling missing values: https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html#dealing-with-missing-values
This branch will now work with XGBoost v0.90.