MLeap Serialized Pipeline including a XGBoost Model does not predict same values as Spark Pipeline #625
Comments
@irene3030
Hello again. First of all, thank you very much for your input and for fixing this issue. I did exactly what you told me: I downloaded the master branch (0.16.0-SNAPSHOT) and built the whole project. It worked like a charm! I no longer have the problem, and the predictions are the same as the ones obtained using Spark. I did have one issue, FYI (just in case anyone else bumps into this as well): I had to manually package some of the modules: mleap-xgboost-spark. My colleagues and I are very grateful :)
Great, thanks @talalryz for all your help! Is it OK to close this issue? The changes will be included in the next release.
This is great, thanks @talalryz!
We, at Yelp, had been struggling with this bug ourselves, so we're glad we could help others out along the way :)
Hello,
I am currently working on a project where a machine learning model has been created using Apache Spark & XGBoost4J. In order to deploy this model in a production environment, I've used MLeap and its extension for XGBoost to serialize my pipeline, which includes the following modules: StringIndexer, OneHotEncoderEstimator, VectorAssembler and an XGBoost regression model.
When reading the MLeap Bundle object, I find that the predictions obtained using the serialized XGBoost model included in this object are very different from the ones obtained using the XGBoost model directly with Spark & XGBoost4J-Spark.
Here is how I create my pipeline, train the model and wrap it in an MLeap object:
(Just in case it is not clear, PortatilesModelConstants contains constants such as the names of the columns I am working with.)
And here is how I read the MLeap object and test the pipeline using the testSet. First I obtain my test set transformed through the serialized pipeline. Then I transform it back to a Spark DataFrame and compute the MAE metric:
And both the metrics and predicted values obtained with testSetTransformed and testSetTransformed2 are different:
Here you have a small sample of the test data, showing that the predictions are different:
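As a rough illustration of this kind of comparison, the following sketch (not the original code from this issue) computes the MAE of two prediction sets against the same labels and the largest per-row disagreement between them. All values, variable names and the tolerance are hypothetical and chosen only to show the check; they are not data from this issue.

```python
from typing import Sequence


def mean_absolute_error(labels: Sequence[float], predictions: Sequence[float]) -> float:
    """Average of |label - prediction| over all rows."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("at least one row is required")
    return sum(abs(y - p) for y, p in zip(labels, predictions)) / len(labels)


def max_abs_difference(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest row-wise gap between two prediction sets."""
    if len(a) != len(b):
        raise ValueError("both prediction sets must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical values standing in for the label column and the two
# prediction columns (Spark pipeline vs. deserialized MLeap pipeline).
labels = [100.0, 200.0, 300.0]
spark_predictions = [110.0, 190.0, 300.0]
mleap_predictions = [150.0, 120.0, 310.0]

mae_spark = mean_absolute_error(labels, spark_predictions)
mae_mleap = mean_absolute_error(labels, mleap_predictions)
largest_gap = max_abs_difference(spark_predictions, mleap_predictions)

# Assumed tolerance: a faithful serialization should differ only by
# floating-point noise, so anything above this is treated as a mismatch.
TOLERANCE = 1e-4
pipelines_agree = largest_gap <= TOLERANCE

print(f"MAE (Spark): {mae_spark:.4f}")
print(f"MAE (MLeap): {mae_mleap:.4f}")
print(f"Largest per-row gap: {largest_gap:.4f}, agree: {pipelines_agree}")
```

Comparing the per-row gap as well as the aggregate metric helps because two prediction sets can have similar MAE values while still disagreeing on individual rows.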
Attached to this message, you may find
I would very much appreciate any help you could give me.
Thanks a lot,
Irene
mleap_issue.zip