[dask] Distributed Lightgbm randomly hangs when multiple train calls are submitted #4942
@rudra0713 may I ask what you're trying to do? It seems like you want to train several models in parallel, changing only the number of trees. If that's the case, you could just train a single model for the maximum number of iterations that you want and choose how many iterations to use at prediction time. In case the number of trees is just a stand-in for varying some other hyperparameter, and what you actually want is to train several models in parallel, you could modify your fit function to call the regular LightGBM sklearn API with something like the following:

```python
import lightgbm as lgb
from sklearn.datasets import make_classification

def train_local_model(X, y, number_of_estimators):
    lgbm_cls = lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', num_leaves=50,
                                  learning_rate=0.1, n_estimators=number_of_estimators, max_depth=5,
                                  bagging_fraction=0.9, feature_fraction=0.9, reg_lambda=0.2,
                                  )
    lgbm_cls.fit(X, y)
    return lgbm_cls

def main(client):
    # n_samples and n_features are defined elsewhere in the original script
    X, y = make_classification(n_samples=n_samples, n_features=n_features, random_state=12345)
    futures = []
    # here I am invoking 5 train calls
    for i in range(100, 105):
        f1 = client.submit(train_local_model, X, y, i)
        futures.append(f1)
    results = client.gather(futures)
    return results
```

The Dask API for LightGBM is meant to train a single model using several workers. The main use case is when your data doesn't fit in a single machine, so you build a cluster of machines where each machine holds a piece of the data and train a model that way.
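For completeness, here is a minimal sketch of the first suggestion above (train one model with the largest number of trees and pick how many to use at prediction time); the data shapes are assumptions:

```python
import lightgbm as lgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10_000, n_features=20, random_state=12345)

# train a single model with the largest number of trees you care about
model = lgb.LGBMClassifier(n_estimators=104)
model.fit(X, y)

# evaluate a smaller ensemble from the same model: use only the first 100 trees
preds_100 = model.predict(X, num_iteration=100)
```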
Hi, thanks for your response. In my actual use case,
The distributed interface is designed to use a set of machines to train one model at a time. The dask interface takes your dask collection, checks which machines have at least one partition of that collection and assigns them to train the model. What you're doing here is asking each machine to train several models at the same time, which isn't possible. One possible solution would be to make smaller clusters (but big enough to hold the data) and use each one to train one model at a time. Let us know if this helps.
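A rough sketch of that "smaller clusters" idea, assuming a purely local setup with hypothetical cluster sizes and synthetic data (in a real deployment each cluster would be a separate set of machines big enough to hold the data):

```python
import dask.array as da
from dask.distributed import Client, LocalCluster
import lightgbm as lgb

def train_on_own_cluster(n_estimators):
    # each model gets its own cluster, so distributed training jobs never share workers
    with LocalCluster(n_workers=2, threads_per_worker=2) as cluster, Client(cluster) as client:
        X = da.random.random((10_000, 20), chunks=(2_500, 20))
        y = (da.random.random((10_000,), chunks=(2_500,)) > 0.5).astype(int)
        model = lgb.DaskLGBMClassifier(n_estimators=n_estimators, client=client)
        model.fit(X, y)
        return model.to_local()  # plain LGBMClassifier, usable after the cluster is closed

models = [train_on_own_cluster(n) for n in range(100, 105)]
```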
Is there a way to at least pinpoint the technical reason for this failure? Random failures are making it hard for me to diagnose the problem. For example, while training multiple models, can LightGBM mess up a worker's knowledge of the data partitions?
Looking a bit closer at the logs you posted, it seems that some workers are dying. Are you seeing this in the dashboard? It could be that they run out of memory. If one worker dies during distributed training in LightGBM, the training process will hang because the others will wait forever for that worker's input.
I agree with @jmoralez, running multiple LightGBM workloads on the cluster at the same time could cause workers to die because of memory issues, and LightGBM distributed training (regardless of whether or not you use Dask) is not resilient to workers dying. I think it's important to note that part of the root cause here is that Dask can't see when those tasks allocate memory or talk to each other, and we don't allow Dask to pick them up and move them to other workers when a worker dies (#4132). Also see #3775 for a bit more information.
I see. Thanks for your response. Is there any chance that support for multiple simultaneous LightGBM train calls will be added as a feature in the near future? Also, if I want to implement this feature, do you have any suggestions on how I can do that? One thing I tried is to provide 2 new ports in the
Are you actually seeing a speedup compared to training sequentially? It seems like there's massive oversubscription, since by default each training process uses all available cores and you have several processes competing for all available resources.
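If the goal is to keep submitting many non-distributed fits to one cluster, one way to reduce that oversubscription is to cap the threads each fit may use; a sketch, where the value 2 is an assumption matching 2 threads per Dask worker:

```python
import lightgbm as lgb

def train_local_model(X, y, number_of_estimators):
    lgbm_cls = lgb.LGBMClassifier(
        n_estimators=number_of_estimators,
        n_jobs=2,  # limit each fit to the threads available to the worker running it
    )
    lgbm_cls.fit(X, y)
    return lgbm_cls
```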
Comparing the runtimes for distributed and non-distributed LightGBM, I think I am getting some surprising numbers,
Currently, I am using the non-distributed LightGBM by submitting multiple calls in parallel, and my aim is to improve those numbers using the distributed LightGBM. But, based on the discussion so far, I do not see a way to improve upon my current numbers. Any suggestions?
I'd say option 3 should be the fastest, because if your data fits in a single machine, using (local) distributed training only introduces overhead from the communication between the workers. In the image above, the number of connected squares is 16 (which is your number of threads) and the total number of squares is greater than 32, which could mean that every worker died while trying to train.
@jmoralez I can agree that workers are dying, but I don't fully understand why they would die because of memory issues.
Can there be any other reason for the workers dying?
LightGBM will use some more memory because it needs to create a Dataset for each training call.
I believe this might be at the end, once they have all finished/died, so maybe the memory goes up along the way. There could be some more info in the workers' logs; have you tried looking there? There's a section in the dashboard where you can see the individual logs for each worker; if you look closely at one, you might see the reason why it dies.
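A small sketch of pulling those per-worker logs through the client instead of the dashboard (the scheduler address is a placeholder):

```python
from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder scheduler address

# get_worker_logs returns a dict mapping each worker address to its recent log records
for worker, worker_logs in client.get_worker_logs().items():
    print(worker)
    for level, message in worker_logs:
        print(f"  [{level}] {message}")
```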
This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!
I'm facing a similar issue. Any updates? The versions are all the same in my case.
Hi @vishalvvs, from #4942 (comment) it seems that the approach in the original issue wasn't the right way to achieve a speedup when training multiple models. Are you trying to do the same thing? If you're facing a different problem, please open a new issue with a description of your problem.
I am receiving the warning below. My script is adapted from https://github.com/microsoft/LightGBM/blob/master/examples/python-guide/dask/regression.py:

```python
import dask.array as da
import lightgbm as lgb

workspace = "C:/Users/pwang/PycharmProjects/Temp/"

if __name__ == "__main__":
```
The Dask workers look good on the dashboard; each worker has 15.9 GiB of memory and 6 threads. Thank you for any help!
@philip-2020 thanks for using LightGBM. What you've reported looks very different from "randomly hangs when multiple train calls are submitted". Please open a new issue at https://github.com/microsoft/LightGBM/issues/new/choose. And please review https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax if you are new to writing text in GitHub-flavored markdown.
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this one.
Description:
When more than one distributed LightGBM train call is submitted at the same time using Dask, sometimes the execution is successful and sometimes it just hangs forever. My machine has 8 workers and 2 threads per worker.
Reproducible Example:
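A minimal sketch of the pattern described above, with synthetic data, a local cluster sized like the machine described, and concurrent fits launched from client-side threads (the data shapes and exact submission mechanism are assumptions, not the original script):

```python
from concurrent.futures import ThreadPoolExecutor

import dask.array as da
from dask.distributed import Client, LocalCluster
import lightgbm as lgb

def train_distributed_model(client, X, y, number_of_estimators):
    model = lgb.DaskLGBMClassifier(n_estimators=number_of_estimators, client=client)
    model.fit(X, y)
    return model

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=8, threads_per_worker=2)
    client = Client(cluster)

    X = da.random.random((100_000, 20), chunks=(12_500, 20))
    y = (da.random.random((100_000,), chunks=(12_500,)) > 0.5).astype(int)

    # several distributed train calls running at the same time on one cluster;
    # this is the pattern that sometimes completes and sometimes hangs
    with ThreadPoolExecutor(max_workers=5) as pool:
        futures = [pool.submit(train_distributed_model, client, X, y, n) for n in range(100, 105)]
        models = [f.result() for f in futures]
```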
The main problem seems to be this one:
[LightGBM] [Fatal] Socket recv error, code: 104
I have checked issue #4625; however, the proposed solution there was for macOS, while my machine runs Linux.
Here is the complete log:
Environment:
Linux
Lightgbm: 3.2.1
dask: 2021.9.1
distributed: 2021.9.1