[DISCUSSION] Integration with PySpark #1698
Comments
@CodingCat Do you know how big the PySpark community is? Most people just use the Scala API. It seems like a lot of things would need to be re-implemented in Python - please correct me if I am wrong. |
I think PySpark is pretty prevalent among data scientists, especially for quick prototyping. I have heard of many cases where data scientists use PySpark to analyze large volumes of data. On the other hand, most production-level scenarios are based on the Scala API (I only know of a single case where people use PySpark in large-scale production). |
Yeah I just feel like the current python API should be able to handle most prototyping needs. I personally care more about Spark when we want more production-ready stuff. Perhaps we should leave the discussion here so people can discuss their needs. In the meantime, it would be great if you can provide some details on approaches/estimates/steps for the integration. |
just noticed some discussions in the community http://apache-spark-developers-list.1001551.n3.nabble.com/Blocked-PySpark-changes-td19712.html it seems that the development of PySpark is lagging behind...as the downstream library, I vote to |
Yeah it's also hard to debug into any issues you encountered (at least when I was trying it last year)... |
The roadmap (#873) says that distributed Python has been implemented. Does that mean xgboost can run on a Hadoop cluster with Python? (I don't mean PySpark.) |
yes, see the example posted in the link |
What is the difference between running xgboost on a Hadoop cluster with Python and running it on a Hadoop cluster with the Scala API? Are there major performance differences? |
@yiming-chen the goal of xgboost4j-spark is to unify ETL and model training in the same pipeline. The question comes down to which language users use when doing ETL. Based on my observation and experience, 95% of users are building their ETL systems with Scala. |
@CodingCat I don't know where you got your 95% stat from, but PySpark is definitely widely used in my experience. For example, we are trying to integrate Airflow to schedule the job for our pipeline and Python would be suitable in that situation. |
@berch PySpark is widely used by you and you are going to integrate with Airflow... is that relevant to what I said? |
@CodingCat @tqchen The data science community will definitely benefit from XGBoost being implemented in PySpark, because:
|
feel free to send a PR , you will find the cost |
to avoid coming back to thread time and time again, I will close the discussion with the conclusion that
|
So we can't use pyspark to load XGBoost-spark model? @CodingCat |
so, actually scala

from pyspark.ml.wrapper import JavaEstimator, JavaModel
from pyspark.ml.param.shared import *
from pyspark.ml.util import *
from pyspark.context import SparkContext


class XGBoost(JavaEstimator, JavaMLWritable, JavaMLReadable, HasRegParam, HasElasticNetParam):

    def __init__(self, paramMap = {}):
        super(XGBoost, self).__init__()
        # Convert the Python dict to a Scala Map through the JVM gateway.
        scalaMap = SparkContext._active_spark_context._jvm.PythonUtils.toScalaMap(paramMap)
        # Instantiate the Scala estimator that this class wraps.
        self._java_obj = self._new_java_obj(
            "ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator", self.uid, scalaMap)
        self._defaultParamMap = paramMap
        self._paramMap = paramMap

    def setParams(self, paramMap = {}):
        return self._set(paramMap)

    def _create_model(self, javaTrainingData):
        return JavaModel(javaTrainingData)

I think it still needs some work but I was able to run |
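A side note on the wrapper above: `paramMap = {}` is a mutable default argument, and Python evaluates it only once, when the function is defined. Because the constructor stores that same dict on the instance, estimators created without arguments would share one dict. A minimal sketch of the effect, using hypothetical stand-in classes (`SharedDefault` and `FreshDefault` are not part of the wrapper, and no Spark is involved):

```python
class SharedDefault:
    """Mirrors the wrapper's signature: the {} default is created only once."""

    def __init__(self, param_map={}):
        self._param_map = param_map


class FreshDefault:
    """Uses None as the sentinel and copies, so every instance owns its dict."""

    def __init__(self, param_map=None):
        self._param_map = dict(param_map or {})


a, b = SharedDefault(), SharedDefault()
a._param_map["eta"] = 0.1             # mutates the single shared default dict
shared_leak = "eta" in b._param_map   # b sees a's parameter

c, d = FreshDefault(), FreshDefault()
c._param_map["eta"] = 0.1
fresh_leak = "eta" in d._param_map    # d keeps its own, still empty, dict
```

Whether this matters in practice depends on how the estimator is used; it only bites when instances are created without an explicit parameter map and then modified.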
Wieslaw thanks for sharing the code snippet on XGBoost PySpark wrapper. Can you share the code on invoking the XGBoost class with the appropriate parameters? Thanks |
@wpopielarski it is an awesome job you did. Can you please share the code on invoking the XGBoost with the parameters needed? That would be a great help! |
this is something like:

from app.xgboost import XGBoost

xgboost_params = {
    "eta" : 0.023,
    "max_depth" : 10,
    "min_child_weight" : 0.3,
    "subsample" : 0.7,
    "colsample_bytree" : 0.82,
    "colsample_bylevel" : 0.9,
    "base_score" : base_score,
    "eval_metric" : "auc",
    "seed" : 49,
    "silent" : 1,
    "objective" : "binary:logistic",
    "round" : 10,
    "nWorkers" : 2,
    "useExternalMemory" : True
}
xgboost_estimator = XGBoost.XGBoost(xgboost_params)
...
model = xgboost_estimator.fit(data) |
I am getting close to doing a PR with proper PySpark support. |
@thesuperzapper , that's great! How long do you think it would take to wrap it up? Do share insights while progressing. Thanks! |
hi, I write a simple version with
from pyspark.ml.classification import JavaClassificationModel, JavaMLWritable, JavaMLReadable, TypeConverters, Param, \
    Params, HasFeaturesCol, HasLabelCol, HasPredictionCol, HasRawPredictionCol, SparkContext
from pyspark.ml.wrapper import JavaModel, JavaWrapper, JavaEstimator


class XGBParams(Params):
    '''
    '''
    eta = Param(Params._dummy(), "eta",
                "step size shrinkage used in update to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta actually shrinks the feature weights to make the boosting process more conservative",
                typeConverter=TypeConverters.toFloat)
    max_depth = Param(Params._dummy(), "max_depth",
                      "maximum depth of a tree, increasing this value will make the model more complex / likely to overfit. 0 indicates no limit, limit is required for depth-wise grow policy. range: [0,∞]",
                      typeConverter=TypeConverters.toInt)
    min_child_weight = Param(Params._dummy(), "min_child_weight",
                             "minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to minimum number of instances needed to be in each node. The larger, the more conservative the algorithm will be. range: [0,∞]",
                             typeConverter=TypeConverters.toFloat)
    max_delta_step = Param(Params._dummy(), "max_delta_step",
                           "Maximum delta step we allow each tree’s weight estimation to be. If the value is set to 0, it means there is no constraint. If it is set to a positive value, it can help make the update step more conservative. Usually this parameter is not needed, but it might help in logistic regression when class is extremely imbalanced. Setting it to a value of 1-10 might help control the update.",
                           typeConverter=TypeConverters.toInt)
    subsample = Param(Params._dummy(), "subsample",
                      "subsample ratio of the training instance. Setting it to 0.5 means that XGBoost randomly collected half of the data instances to grow trees and this will prevent overfitting.",
                      typeConverter=TypeConverters.toFloat)
    colsample_bytree = Param(Params._dummy(), "colsample_bytree",
                             "subsample ratio of columns when constructing each tree",
                             typeConverter=TypeConverters.toFloat)
    colsample_bylevel = Param(Params._dummy(), "colsample_bylevel",
                              "subsample ratio of columns for each split, in each level.",
                              typeConverter=TypeConverters.toFloat)
    max_leaves = Param(Params._dummy(), "max_leaves",
                       "Maximum number of nodes to be added. Only relevant for the ‘lossguide’ grow policy.",
                       typeConverter=TypeConverters.toInt)

    def __init__(self):
        super(XGBParams, self).__init__()


class XGBoostClassifier(JavaEstimator, JavaMLWritable, JavaMLReadable, XGBParams,
                        HasFeaturesCol, HasLabelCol, HasPredictionCol, HasRawPredictionCol):

    def __init__(self, paramMap={}):
        super(XGBoostClassifier, self).__init__()
        # Convert the Python dict to a Scala Map through the JVM gateway.
        scalaMap = SparkContext._active_spark_context._jvm.PythonUtils.toScalaMap(paramMap)
        self._java_obj = self._new_java_obj("ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator", self.uid, scalaMap)
        self._defaultParamMap = paramMap
        self._paramMap = paramMap

    def setParams(self, paramMap={}):
        return self._set(paramMap)

    def _create_model(self, java_model):
        return XGBoostClassificationModel(java_model)


class XGBoostClassificationModel(JavaModel, JavaClassificationModel, JavaMLWritable, JavaMLReadable):

    def getBooster(self):
        return self._call_java("booster")

    def saveBooster(self, save_path):
        # Wrap the JVM booster so its saveModel method can be called from Python.
        jxgb = JavaWrapper(self.getBooster())
        jxgb._call_java("saveModel", save_path)
|
@AakashBasuRZT @haiy, we are now working on this properly in Issue #3370, with PR #3376 providing initial support. |
@haiy could you show me a snippet of code that fits the classifier on some arbitrary dataset? I have followed points 1 and 2 which you have outlined, but I am not able to understand your 3rd point. |
@sagnik-rzt check this sample |
@haiy I'm trying to run this:
and it is giving this exception :
Environment: |
@sagnik-rzt |
@wpopielarski Hey no I haven't done that. Any idea where I can find that jar file? |
@sagnik-rzt hi, |
need to build it your own :), with maven and profile |
@sagnik-rzt not sure what you are going to do but to build fat jar for your OS just clone dmlc xgboost github project, cd to jvm-packages and run mvn with |
Okay so I have built a fat jar with dependencies and then copy-pasted it to $SPARK_HOME/jars.
|
Sorry, but are you running it on a cluster, or locally from some IDE project? If you are using spark-submit, it is better to add the dependencies via the --jars switch.
2018-06-29 14:30 GMT+02:00 sagnik-rzt <notifications@github.com>:
… Okay so I have built a fat jar with dependencies and then copy-pasted it
to $SPARK_HOME/jars.
However, the same exception still pertains:
Traceback (most recent call last):
File "/home/sagnikb/PycharmProjects/xgboost/test_import.py", line 21, in <module>
clf = xgb(params)
File "/usr/lib/ml/dmlc/xgboost4j/scala/spark.py", line 48, in __init__
self._java_obj = self._new_java_obj("dmlc.xgboost4j.scala.spark.XGBoostEstimator", self.uid, scalaMap)
File "/usr/local/lib/python3.6/dist-packages/pyspark/ml/wrapper.py", line 63, in _new_java_obj
return java_obj(*java_args)
TypeError: 'JavaPackage' object is not callable
|
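`TypeError: 'JavaPackage' object is not callable` generally means Py4J could not resolve the name to a class on the JVM classpath, so it handed back a package placeholder instead of a constructor. Two things are worth checking: whether the jar is actually visible to the driver, and whether the class name is spelled exactly (the traceback above passes `dmlc.xgboost4j...`, while the wrapper earlier in the thread uses `ml.dmlc.xgboost4j...`). A rough diagnostic sketch, assuming the jar is a plain zip archive on the local filesystem; `jar_contains_class` and `make_demo_jar` are hypothetical helpers, and the demo jar is an empty stand-in rather than a real XGBoost build:

```python
import os
import tempfile
import zipfile


def jar_contains_class(jar_path, class_name):
    """Return True if the jar holds the .class entry for a fully qualified name."""
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()


def make_demo_jar(directory):
    """Build a tiny stand-in jar (hypothetical, for demonstration only)."""
    path = os.path.join(directory, "demo.jar")
    with zipfile.ZipFile(path, "w") as jar:
        jar.writestr("ml/dmlc/xgboost4j/scala/spark/XGBoostEstimator.class", b"")
    return path


with tempfile.TemporaryDirectory() as tmp:
    demo = make_demo_jar(tmp)
    # The fully qualified name with the "ml." prefix matches the jar entry.
    with_prefix = jar_contains_class(
        demo, "ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator")
    # Dropping the prefix points at a class that is not in the jar.
    without_prefix = jar_contains_class(
        demo, "dmlc.xgboost4j.scala.spark.XGBoostEstimator")
```

Running `jar_contains_class` against the real fat jar with both spellings would show quickly which of the two problems applies.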
I am currently working on rebasing #3376 onto the new Spark branch. In the meantime, a few people have asked how to use the current code with XGBoost-0.72. Here is a zip file with the PySpark code for XGBoost-0.72. All you need to do is:
Note:
|
@thesuperzapper I'm trying to test this with pyspark on the jupyter notebook. My system: When I'm trying to load the XGBoostEstimator I get:
Is this a bug or am I missing some requirements? |
@BogdanCojocar It seems like you're missing the xgboost library. You need both these jars for xgboost to work properly: You can download the required jars from those maven links. |
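Earlier in the thread it was suggested to pass dependencies through spark-submit's `--jars` switch rather than copying them into `$SPARK_HOME/jars`. That switch takes one comma-separated value. A small sketch of assembling such a command; `spark_submit_command`, `train.py`, and the jar paths are placeholders for illustration, not names taken from this thread:

```python
def spark_submit_command(script, jars):
    """Build a spark-submit argument list that ships extra jars with the job."""
    if not jars:
        raise ValueError("at least one jar path is required")
    # --jars expects a single comma-separated value, not repeated flags.
    return ["spark-submit", "--jars", ",".join(jars), script]


# Placeholder paths: substitute the two jars downloaded from Maven.
cmd = spark_submit_command(
    "train.py",
    ["/path/to/first-dependency.jar", "/path/to/second-dependency.jar"],
)
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run` as is, which avoids shell quoting problems when a path contains spaces.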
Thanks @thesuperzapper. Works fine. Great job with this integration to pyspark! |
Any suggestion on how to save the trained model to booster for loading in python module? |
@ericwang915 Typically to get a model which inter-operates with other XGBoost libraries, you would use the However, I forgot to add a method to call the save function in that version of the wrapper, I will do this tomorrow if I get time. (I live in NZ... so timezones) |
Thank you. By the way, during the training process, there is no log showing the evaluation metrics and boosting round even though the silent is set as 1. |
@thesuperzapper thanks for the instruction. I was able to follow your instruction to train/save xgboost model in pyspark. Any idea on how to access other xgboost model function like (scala)getFeatureScore()? |
@ccdtzccdtz currently I am rewiring the pyspark wrapper since 0.8 had massive changes to the Spark API, when finished, I aim to have feature parity with the Spark Scala API. I did not expose the native booster method in my initial pyspark wrapper, but if you use the Spark Scala API, you can call |
I have seen XGBoost on pyspark failing consistently if it is run 2 or more times. I am running it on the same dataset with the same code. First time it succeeds but the second time and subsequently it fails. I am using XGBoost 0.72 on Spark 2.3. I have to restart the pyspark shell to run the job successfully again. I use xgboost.trainWithDataFrame for training purposes. Has anyone seen this issue? |
Hi @thesuperzapper
This is the stack trace: The executor is stuck at: Environment: Python 3.5.4, Spark Version 2.3.1, Xgboost 0.72 |
Can you share your xgboost and spark configurations? How many workers (xgboost workers), spark executors, cores etc.
…-Nitin
On Tue, Sep 4, 2018 at 5:03 AM sagnik-rzt ***@***.***> wrote:
Hi @thesuperzapper
What you prescribed works for me on a single worker node. However, when I try to run pyspark xgboost using more than one worker, the executors become idle and shut down after a while.
This is the stack trace:
'''
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=172.16.1.5, DMLC_TRACKER_PORT=9093, DMLC_NUM_WORKER=3}
2018-09-04 08:52:55 ERROR TaskSchedulerImpl:70 - Lost executor 0 on 192.168.49.43: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
2018-09-04 08:52:55 ERROR AsyncEventQueue:91 - Interrupted while posting to TaskFailedListener. Removing that listener.
java.lang.InterruptedException: ExecutorLost during XGBoost Training: ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
    at org.apache.spark.TaskFailedListener.onTaskEnd(SparkParallelismTracker.scala:116)
    at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
    at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
    at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
    at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)
'''
|
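For the configuration question above, the tracker line in the log already records part of the setup: how many workers the tracker was started for and the address it listens on. A small sketch for pulling those values out, assuming the log format shown in the quoted trace; `parse_tracker_env` is a hypothetical helper, not part of XGBoost:

```python
import re

TRACKER_LINE = (
    "Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=172.16.1.5, "
    "DMLC_TRACKER_PORT=9093, DMLC_NUM_WORKER=3}"
)


def parse_tracker_env(line):
    """Extract the KEY=VALUE pairs from a 'Tracker started, with env={...}' line."""
    match = re.search(r"env=\{([^}]*)\}", line)
    if match is None:
        return {}
    env = {}
    for item in match.group(1).split(","):
        if "=" in item:
            key, value = item.split("=", 1)
            env[key.strip()] = value.strip()
    return env


env = parse_tracker_env(TRACKER_LINE)
requested_workers = int(env["DMLC_NUM_WORKER"])
```

Comparing `requested_workers` with the number of executors Spark actually granted is one place to start; it does not by itself explain the lost executor.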
@sagnik-rzt I am surprised that works at all, as that pyspark wrapper only supports XGBoost 0.72; we are still working on the 0.8 one. |
@thesuperzapper, based on the version you provided, I redid some parts to support xgboost 0.80. All the codes are placed here. |
There are quite a lot more changes needed than you have made to make it work with 0.8. The main reason I haven't just pushed out a 0.8 version is that I really don't want to make an xgboost-specific pipeline object like I have in 0.72. I am working on a way to hopefully have the pyspark xgboost object work with the default pipeline persistence. |
@thesuperzapper, while using the codes for 0.72, I used the If that's not the case, do you know why the training doesn't get distributed across the workers? Update Have you observed this behavior? How do I tackle it? |
I have mostly re-coded the wrapper for XGBoost 0.8, but as my work cluster is still on 2.2, I can't test it easily in distributed mode, as my Dockerized Spark 2.3 cluster can't even train Scala XGBoost distributed models without getting shuffle location missing issues. I think the issues @sagnik-rzt and others are experiencing are related to your cluster config or some deeper issue with Spark-Scala XGBoost. Are you able to train a model in Spark-Scala XGBoost? |
Thanks @thesuperzapper, I thought shuffle locations were handled internally, that is, it would be taken care of independent of the cluster config. But I found this stackoverflow post so will be implementing those suggestions. Also, could you share your 0.8 version, if it is ready? I can test for distribution on my cluster. It has spark 2.3.1 and python 3.5. |
After saving the model and loading it, I am getting the following error: IllegalArgumentException: u'requirement failed: Error loading metadata: Expected class name org.apache.spark.ml.Pipeline but found class name org.apache.spark.ml.PipelineModel'. Can you please help with this? Thanks.
Tried the following option
Getting the following error: No module named ml.dmlc.xgboost4j.scala.spark |
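That `IllegalArgumentException` usually indicates that the saved directory holds a fitted `PipelineModel` while the loader being called expects an unfitted `Pipeline`; loading through `PipelineModel.load` instead is the usual remedy. One way to confirm what was saved is to read the metadata Spark ML writes next to the model. A sketch under two assumptions: the standard Spark ML layout, where a JSON line under `metadata/` carries a `class` field, and a path on the local filesystem rather than HDFS. `saved_class_name` is a hypothetical helper, and the directory built here is a stand-in, not a real saved model:

```python
import glob
import json
import os
import tempfile


def saved_class_name(model_dir):
    """Read the 'class' field from <model_dir>/metadata/part-* (local paths only)."""
    parts = sorted(glob.glob(os.path.join(model_dir, "metadata", "part-*")))
    if not parts:
        raise FileNotFoundError("no metadata part file under " + model_dir)
    with open(parts[0]) as handle:
        return json.loads(handle.readline())["class"]


# Stand-in directory that imitates the metadata of a saved PipelineModel.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "metadata"))
    with open(os.path.join(tmp, "metadata", "part-00000"), "w") as handle:
        handle.write(json.dumps({"class": "org.apache.spark.ml.PipelineModel"}) + "\n")
    found = saved_class_name(tmp)

# A class name ending in PipelineModel calls for the PipelineModel loader.
needs_pipeline_model_loader = found.endswith("PipelineModel")
```

The second error, `No module named ml.dmlc.xgboost4j.scala.spark`, is a Python import failure and is a separate problem from the metadata mismatch.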
DEV DOWNLOAD LINK: sparkxgb.zip This version will work with XGBoost-0.8, but please don't use it for anything other than testing, or contributing to this thread, as stuff will change. The main issue I am aware of with that version is that classification models won't load back after being saved, giving the error: Regardless, I am attempting to properly implement DefaultParamsWritable, which would remove the need for the dedicated |
I just noticed that there are some requests for integration with PySpark http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html
I also received some emails from the users discussing the same topic
I would like to start a discussion here on whether/when we should begin this work
@tqchen @terrytangyuan