[BUG] Model is getting stuck in deploying state #2970
Comments
@Zhangxunmt I know you have some suggestions for enhancing this part. Please take a look.
This PR removes remote model auto-redeploy during cluster changes. That does not mean this issue is caused by model auto-redeploy; in fact, the root cause of the model getting stuck in deploying status is still unknown, since it is very difficult to reproduce. The real solution is to support model undeploy while the model status is deploying, which will be implemented very soon. Users can then undeploy the model and redeploy it to mitigate the pain.
The root cause is that, when deploying the model, the manager node sends the deploy request to all eligible nodes in the cluster, but a node can crash at any moment. If a node crashes right after the getEligibleNodes method has run, that node never sends a deploy response to the manager node. The worker node count then never counts down to 0, so the model status is never updated and stays at deploying. To reproduce this issue, you need a small cluster with at least 3 nodes: one manager node and the rest data nodes. Start the manager node and one data node first, then create a model and deploy it. Next, start another data node and add a debug breakpoint in the deploy transport action on the manager node (after it gets all eligible nodes). When the breakpoint is hit, shut down the first data node and resume execution. The model will then stay at deploying status.
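The countdown behavior described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not ml-commons code: the `DeployTracker` class, its method names, the node IDs, and the status strings are all hypothetical, and the `undeploy` method assumes the proposed mitigation (undeploy allowed while the status is deploying) is available.

```python
from dataclasses import dataclass, field


@dataclass
class DeployTracker:
    """Toy model of the manager-side bookkeeping (hypothetical, not ml-commons code)."""

    status: str = "UNDEPLOYED"
    pending: set = field(default_factory=set)

    def deploy(self, eligible_nodes):
        # Snapshot the eligible nodes once, then wait for one response per node.
        self.pending = set(eligible_nodes)
        self.status = "DEPLOYING"

    def on_response(self, node_id):
        # Count down one worker; the status changes only when the count hits 0.
        self.pending.discard(node_id)
        if self.status == "DEPLOYING" and not self.pending:
            self.status = "DEPLOYED"

    def undeploy(self):
        # Assumed mitigation: undeploy is allowed even while status is DEPLOYING.
        self.pending.clear()
        self.status = "UNDEPLOYED"


tracker = DeployTracker()
tracker.deploy(["data-1", "data-2"])   # data-1 crashes right after this snapshot
tracker.on_response("data-2")          # only the surviving node answers
stuck_status = tracker.status          # still "DEPLOYING": data-1 never counts down

tracker.undeploy()                     # clear the stuck state
tracker.deploy(["data-2"])             # redeploy against the nodes that are alive
tracker.on_response("data-2")
final_status = tracker.status
print(stuck_status, final_status)
```

In this toy model the first deploy never leaves the deploying state because the crashed node's response never arrives, while the undeploy-then-redeploy sequence completes against the nodes that are still up.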
@rbhavna Can you update the solution details that will be used to fix this?
Another edge case:
What is the bug?
The model gets stuck in the deploying state while it is being registered on the cluster. We have seen cases where the model is not found on a few nodes.
Scenario
What is the expected behavior?
Model should be undeployed.