Clickhouse-Keeper on Kubernetes - Node/Pod Restart Issues #55219
It's hard to say what happened without the logs from before the shutdown and during the first failed start, to see where the error occurred.
Logs attached. Manifests are here:
I don't trust this manifest (it's from a third-party company). Does the issue reproduce if you run Keeper without Kubernetes?
The manifest looks hairy; I advise throwing it away and writing your own from scratch.
Do we have a Helm chart for ClickHouse Keeper?
@alexey-milovidov we have not tried running Keeper outside of K8s. We weren't going to entertain that unless absolutely necessary, since our install is a single application that will be using Keeper for the moment. Is there a better example of running clickhouse-keeper in Kubernetes? The only part of the config that appears to be very specific is the StatefulSet config, which includes a block that writes the clickhouse-keeper config. Are there any examples of that I could base it on?
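There is no official example in this thread, but as a rough sketch of what such a StatefulSet block typically does, the snippet below derives the Keeper server_id from the StatefulSet pod ordinal and writes a minimal Keeper config before starting the server. The headless service name, ports, and file paths are assumptions, not taken from the manifest under discussion.

```bash
#!/usr/bin/env bash
# Hypothetical init snippet for a Keeper StatefulSet pod (names and paths are
# assumptions, not from the Altinity manifest). It derives server_id from the
# pod ordinal and writes a minimal Keeper config before starting the server.
set -euo pipefail

ORDINAL="${HOSTNAME##*-}"        # clickhouse-keeper-0 -> 0, clickhouse-keeper-1 -> 1, ...
SERVER_ID=$((ORDINAL + 1))       # Keeper server ids conventionally start at 1

mkdir -p /etc/clickhouse-keeper
cat > /etc/clickhouse-keeper/keeper_config.xml <<EOF
<clickhouse>
    <listen_host>0.0.0.0</listen_host>
    <keeper_server>
        <tcp_port>9181</tcp_port>
        <server_id>${SERVER_ID}</server_id>
        <log_storage_path>/var/lib/clickhouse-keeper/coordination/log</log_storage_path>
        <snapshot_storage_path>/var/lib/clickhouse-keeper/coordination/snapshots</snapshot_storage_path>
        <raft_configuration>
            <server><id>1</id><hostname>clickhouse-keeper-0.clickhouse-keeper-headless</hostname><port>9234</port></server>
            <server><id>2</id><hostname>clickhouse-keeper-1.clickhouse-keeper-headless</hostname><port>9234</port></server>
            <server><id>3</id><hostname>clickhouse-keeper-2.clickhouse-keeper-headless</hostname><port>9234</port></server>
        </raft_configuration>
    </keeper_server>
</clickhouse>
EOF

exec /usr/bin/clickhouse-keeper --config-file=/etc/clickhouse-keeper/keeper_config.xml
```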
This part of the manifest looks suspicious; I'm not sure it is correct.
Let's ask @antonio2368 for the details.
@tman5 When logs for the Keeper are included, it would be helpful to set
So why would that IF block be in there? What happens if we remove it?
@tman5 look at Altinity/clickhouse-operator#1234
So will your updated manifests work? Or do we also need to wait for that PR to merge?
@tman5, these manifests are not part of the official ClickHouse product, and we don't support them. We have noticed at least one mistake in these manifests, so they cannot be used.
@alexey-milovidov are there any plans to release an official Helm chart for clickhouse-keeper?
Currently, there are no plans, but we are considering it for the future. Note: it is hard to operate Keeper, ZooKeeper, or any other distributed consensus system in Kubernetes. If you have frequent pod restarts and combine that with either a misconfiguration (as in the example above) or corrupted data on a single node, it can lead to a rollback of the Keeper's state, which in turn leads to "intersecting parts" errors and data loss.
Upon rebooting an underlying Kubernetes node or re-creating the StatefulSet for clickhouse-keeper in K8s, the pods will sometimes come back in a CrashLoop state with errors such as:
clickhouse-keeper version 23.9.1
This issue appears to be similar to #42668; however, this one is on K8s using a StatefulSet. This is the manifest we are using: https://github.com/Altinity/clickhouse-operator/blob/master/deploy/clickhouse-keeper/clickhouse-keeper-3-nodes.yaml
It seems like an order-of-operations/race-condition issue. I can't reproduce it reliably: sometimes node reboots work fine; other times the clickhouse-keeper pods come up in this crash-loop state.
A "fix" is to delete the pod and its PVC and let them re-create. That brings the pod back, but it's not a long-term solution.