Clickhouse-Keeper on Kubernetes - Node/Pod Restart Issues #55219

Closed
tman5 opened this issue Oct 3, 2023 · 16 comments
Labels: invalid, operations, st-need-info (We need extra data to continue (waiting for response))

Comments

tman5 commented Oct 3, 2023

Upon rebooting an underlying Kubernetes node or re-creating the StatefulSet for clickhouse-keeper in k8s, the pods sometimes come back in a CrashLoopBackOff state with errors such as:

2023.10.03 18:13:18.061959 [ 30 ] {} <Error> bool DB::KeeperStateMachine::preprocess(const KeeperStorage::RequestForSession &): Failed to preprocess stored log, aborting to avoid inconsistent state: Code: 49. DB::Exception: Got new ZXID (52749) smaller or equal to current ZXID (52750). It's a bug. (LOGICAL_ERROR), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x0000000000e3247b in /usr/bin/clickhouse-keeper
1. DB::Exception::Exception<long&, long&>(int, FormatStringHelperImpl<std::type_identity<long&>::type, std::type_identity<long&>::type>, long&, long&) @ 0x0000000000871562 in /usr/bin/clickhouse-keeper
2. DB::KeeperStorage::preprocessRequest(std::shared_ptr<Coordination::ZooKeeperRequest> const&, long, long, long, bool, std::optional<DB::KeeperStorage::Digest>, long) @ 0x00000000008748c6 in /usr/bin/clickhouse-keeper
3. DB::KeeperStateMachine::preprocess(DB::KeeperStorage::RequestForSession const&) @ 0x000000000084ffa1 in /usr/bin/clickhouse-keeper
4. DB::KeeperStateMachine::pre_commit(unsigned long, nuraft::buffer&) @ 0x000000000084e762 in /usr/bin/clickhouse-keeper
5. nuraft::raft_server::handle_append_entries(nuraft::req_msg&) @ 0x0000000000c40371 in /usr/bin/clickhouse-keeper
6. nuraft::raft_server::process_req(nuraft::req_msg&, nuraft::raft_server::req_ext_params const&) @ 0x0000000000c0706e in /usr/bin/clickhouse-keeper
7. nuraft::rpc_session::read_complete(std::shared_ptr<nuraft::buffer>, std::shared_ptr<nuraft::buffer>) @ 0x0000000000bec4a8 in /usr/bin/clickhouse-keeper
8. nuraft::rpc_session::read_log_data(std::shared_ptr<nuraft::buffer>, boost::system::error_code const&, unsigned long) @ 0x0000000000bed14a in /usr/bin/clickhouse-keeper
9. boost::asio::detail::read_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::any_io_executor>, boost::asio::mutable_buffers_1, boost::asio::mutable_buffer const*, boost::asio::detail::transfer_all_t, std::__bind<void (nuraft::rpc_session::*)(std::shared_ptr<nuraft::buffer>, boost::system::error_code const&, unsigned long), std::shared_ptr<nuraft::rpc_session> const&, std::shared_ptr<nuraft::buffer>&, std::placeholders::__ph<1> const&, std::placeholders::__ph<2> const&>>::operator()(boost::system::error_code, unsigned long, int) @ 0x0000000000bf8e13 in /usr/bin/clickhouse-keeper
10. boost::asio::detail::reactive_socket_recv_op<boost::asio::mutable_buffers_1, boost::asio::detail::read_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::any_io_executor>, boost::asio::mutable_buffers_1, boost::asio::mutable_buffer const*, boost::asio::detail::transfer_all_t, std::__bind<void (nuraft::rpc_session::*)(std::shared_ptr<nuraft::buffer>, boost::system::error_code const&, unsigned long), std::shared_ptr<nuraft::rpc_session> const&, std::shared_ptr<nuraft::buffer>&, std::placeholders::__ph<1> const&, std::placeholders::__ph<2> const&>>, boost::asio::any_io_executor>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) @ 0x0000000000bf9739 in /usr/bin/clickhouse-keeper
11. boost::asio::detail::scheduler::run(boost::system::error_code&) @ 0x0000000000ba9eb9 in /usr/bin/clickhouse-keeper
12. nuraft::asio_service_impl::worker_entry() @ 0x0000000000ba2db2 in /usr/bin/clickhouse-keeper
13. void* std::__thread_proxy[abi:v15000]<std::tuple<std::unique_ptr<std::__thread_struct, std::default_delete<std::__thread_struct>>, std::__bind<void (nuraft::asio_service_impl::*)(), nuraft::asio_service_impl*>>>(void*) @ 0x0000000000bafa1a in /usr/bin/clickhouse-keeper
 (version 23.10.1.134 (official build))
2023.10.03 18:13:18.063310 [ 12 ] {} <Trace> BaseDaemon: Received signal 6
2023.10.03 18:13:18.063510 [ 34 ] {} <Fatal> BaseDaemon: ########## Short fault info ############
2023.10.03 18:13:18.063531 [ 34 ] {} <Fatal> BaseDaemon: (version 23.10.1.134 (official build), build id: 8062681F29E88365723319EBF0E96CB2D91AA0D5, git hash: 7b6548157c9216eb52867ba10c8c9bd30a3492dd) (from thread 30) Received signal 6
2023.10.03 18:13:18.063538 [ 34 ] {} <Fatal> BaseDaemon: Signal description: Aborted
2023.10.03 18:13:18.063542 [ 34 ] {} <Fatal> BaseDaemon: 
2023.10.03 18:13:18.063548 [ 34 ] {} <Fatal> BaseDaemon: Stack trace: 0x0000000000a3ff30
2023.10.03 18:13:18.063551 [ 34 ] {} <Fatal> BaseDaemon: ########################################
2023.10.03 18:13:18.063559 [ 34 ] {} <Fatal> BaseDaemon: (version 23.10.1.134 (official build), build id: 8062681F29E88365723319EBF0E96CB2D91AA0D5, git hash: 7b6548157c9216eb52867ba10c8c9bd30a3492dd) (from thread 30) (no query) Received signal Aborted (6)
2023.10.03 18:13:18.063563 [ 34 ] {} <Fatal> BaseDaemon: 
2023.10.03 18:13:18.063566 [ 34 ] {} <Fatal> BaseDaemon: Stack trace: 0x0000000000a3ff30
2023.10.03 18:13:18.063619 [ 34 ] {} <Fatal> BaseDaemon: 0. signalHandler(int, siginfo_t*, void*) @ 0x0000000000a3ff30 in /usr/bin/clickhouse-keeper
2023.10.03 18:13:18.063629 [ 34 ] {} <Fatal> BaseDaemon: Integrity check of the executable skipped because the reference checksum could not be read.
2023.10.03 18:13:18.063636 [ 34 ] {} <Fatal> BaseDaemon: Report this error to https://github.com/ClickHouse/ClickHouse/issues
bash: line 27:    11 Aborted                 (core dumped) clickhouse-keeper --config-file=/etc/clickhouse-keeper/keeper_config.xml

clickhouse-keeper version 23.9.1

This issue appears to be similar to #42668; however, this one is on K8s using a StatefulSet. This is the manifest we are using: https://github.com/Altinity/clickhouse-operator/blob/master/deploy/clickhouse-keeper/clickhouse-keeper-3-nodes.yaml

It seems like it's an order of operations/race condition issue. I can't reproduce it reliably. Sometimes node reboots work fine. Other times the clickhouse-keeper pods will come up in this crashloop state.

A "fix" is to delete the pod and PVC and let it re-create. That will bring it back but it's not a long term solution.

tman5 added the potential bug (To be reviewed by developers and confirmed/rejected) label on Oct 3, 2023
antonio2368 (Member) commented:

It's hard to say what happened without the logs before shutdown and during the first failed start to see where the error happened.
It's possible something was incorrectly applied.
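In a StatefulSet, capturing that usually means also grabbing the output of the previous (crashed) container, roughly like this (a sketch; adjust the pod name to the failing replica):

# the keeper logs to the console, so kubectl can collect them directly
kubectl logs clickhouse-keeper-0 --previous > keeper-0-previous.log   # log from the previous restart
kubectl logs clickhouse-keeper-0 > keeper-0-current.log               # log from the current attempt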

tman5 (Author) commented Oct 11, 2023

Logs attached
keeper_logs.zip

Manifests here:

---
# Setup Service to provide access to ClickHouse keeper for clients
apiVersion: v1
kind: Service
metadata:
  # DNS would be like clickhouse-keeper.namespace.svc
  name: clickhouse-keeper
  labels:
    app: clickhouse-keeper
spec:
  ports:
    - port: 2181
      name: client
    - port: 7000
      name: prometheus
  selector:
    app: clickhouse-keeper
    what: node
---
# Setup Headless Service for StatefulSet
apiVersion: v1
kind: Service
metadata:
  # DNS would be like clickhouse-keeper-0.clickhouse-keepers.namespace.svc
  name: clickhouse-keepers
  labels:
    app: clickhouse-keeper
spec:
  ports:
    - port: 9444
      name: raft
  clusterIP: None
  selector:
    app: clickhouse-keeper
    what: node
---
# Setup max number of unavailable pods in StatefulSet
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: clickhouse-keeper-pod-disruption-budget
spec:
  selector:
    matchLabels:
      app: clickhouse-keeper
  maxUnavailable: 1
---
# Setup ClickHouse Keeper settings
apiVersion: v1
kind: ConfigMap
metadata:
  name: clickhouse-keeper-settings
data:
  keeper_config.xml: |
    <clickhouse>
        <include_from>/tmp/clickhouse-keeper/config.d/generated-keeper-settings.xml</include_from>
        <logger>
            <level>trace</level>
            <console>true</console>
        </logger>
        <listen_host>0.0.0.0</listen_host>
        <keeper_server incl="keeper_server">
            <path>/var/lib/clickhouse-keeper</path>
            <tcp_port>2181</tcp_port>
            <coordination_settings>
                <!-- <raft_logs_level>trace</raft_logs_level> -->
                <raft_logs_level>information</raft_logs_level>
            </coordination_settings>
        </keeper_server>
        <prometheus>
            <endpoint>/metrics</endpoint>
            <port>7000</port>
            <metrics>true</metrics>
            <events>true</events>
            <asynchronous_metrics>true</asynchronous_metrics>
            <!-- https://github.com/ClickHouse/ClickHouse/issues/46136 -->
            <status_info>false</status_info>
        </prometheus>
    </clickhouse>

---
# Setup ClickHouse Keeper StatefulSet
apiVersion: apps/v1
kind: StatefulSet
metadata:
  # nodes would be named as clickhouse-keeper-0, clickhouse-keeper-1, clickhouse-keeper-2
  name: clickhouse-keeper
  labels:
    app: clickhouse-keeper
spec:
  selector:
    matchLabels:
      app: clickhouse-keeper
  serviceName: clickhouse-keepers
  replicas: 3
  updateStrategy:
    type: RollingUpdate
  podManagementPolicy: Parallel
  template:
    metadata:
      labels:
        app: clickhouse-keeper
        what: node
      annotations:
        prometheus.io/port: '7000'
        prometheus.io/scrape: 'true'
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: "app"
                    operator: In
                    values:
                      - clickhouse-keeper
              topologyKey: "kubernetes.io/hostname"
      volumes:
        - name: clickhouse-keeper-settings
          configMap:
            name: clickhouse-keeper-settings
            items:
              - key: keeper_config.xml
                path: keeper_config.xml
      containers:
        - name: clickhouse-keeper
          imagePullPolicy: IfNotPresent
          image: "hub.docker.io/dockerhub/clickhouse/clickhouse-keeper:head-alpine"
          resources:
            requests:
              memory: "256M"
              cpu: "1"
            limits:
              memory: "4Gi"
              cpu: "2"
          volumeMounts:
            - name: clickhouse-keeper-settings
              mountPath: /etc/clickhouse-keeper/
            - name: clickhouse-keeper-datadir-volume
              mountPath: /var/lib/clickhouse-keeper
          env:
            - name: SERVERS
              value: "3"
            - name: RAFT_PORT
              value: "9444"
          command:
            - bash
            - -x
            - -c
            - |
              HOST=`hostname -s` &&
              DOMAIN=`hostname -d` &&
              if [[ $HOST =~ (.*)-([0-9]+)$ ]]; then
                  NAME=${BASH_REMATCH[1]}
                  ORD=${BASH_REMATCH[2]}
              else
                  echo "Failed to parse name and ordinal of Pod"
                  exit 1
              fi &&
              export MY_ID=$((ORD+1)) &&
              mkdir -p /tmp/clickhouse-keeper/config.d/ &&
              {
                echo "<yandex><keeper_server>"
                echo "<server_id>${MY_ID}</server_id>"
                echo "<raft_configuration>"
                for (( i=1; i<=$SERVERS; i++ )); do
                    echo "<server><id>${i}</id><hostname>$NAME-$((i-1)).${DOMAIN}</hostname><port>${RAFT_PORT}</port></server>"
                done
                echo "</raft_configuration>"
                echo "</keeper_server></yandex>"
              } > /tmp/clickhouse-keeper/config.d/generated-keeper-settings.xml &&
              cat /tmp/clickhouse-keeper/config.d/generated-keeper-settings.xml &&
              if [[ "1" == "$MY_ID" ]]; then
                clickhouse-keeper --config-file=/etc/clickhouse-keeper/keeper_config.xml --force-recovery
              else
                clickhouse-keeper --config-file=/etc/clickhouse-keeper/keeper_config.xml
              fi
          livenessProbe:
            exec:
              command:
                - bash
                - -xc
                - 'date && OK=$(exec 3<>/dev/tcp/127.0.0.1/2181 ; printf "ruok" >&3 ; IFS=; tee <&3; exec 3<&- ;); if [[ "$OK" == "imok" ]]; then exit 0; else exit 1; fi'
            initialDelaySeconds: 20
            timeoutSeconds: 15
          ports:
            - containerPort: 7000
              name: prometheus
  volumeClaimTemplates:
    - metadata:
        name: clickhouse-keeper-datadir-volume
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 25Gi

alexey-milovidov added the st-need-info (We need extra data to continue (waiting for response)) and operations labels and removed the potential bug label on Oct 13, 2023
alexey-milovidov (Member) commented:

This is the manifest we are using: https://github.com/Altinity...

I don't trust this manifest (it's from a third-party company). Does the issue reproduce if you run Keeper without Kubernetes?
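For example (a rough sketch, not an official recipe), a single Keeper instance can be started straight from the official image on a plain Docker host and restarted in a loop to see whether restarts alone ever trigger the same error. The tag and the use of the image's default single-node config are assumptions here:

docker run -d --name keeper-test -p 2181:2181 clickhouse/clickhouse-keeper:23.9   # pick a tag matching your deployment
docker restart keeper-test                                                        # repeat several times
docker logs keeper-test 2>&1 | grep -i 'Failed to preprocess'                     # check for the same error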

alexey-milovidov (Member) commented Oct 13, 2023

The manifest looks hairy; I advise throwing it away and writing your own from scratch.

jaigopinathmalempati commented:

Do we have a Helm chart for clickhouse-keeper?

tman5 (Author) commented Oct 30, 2023

@alexey-milovidov we have not tried running Keeper outside of K8s. We weren't planning to entertain that unless absolutely necessary, since at the moment our install is a single application that will be using Keeper.

Is there a better example of running clickhouse-keeper in Kubernetes? The only part of the config that appears to be very specific is the StatefulSet config, which includes a block that writes the clickhouse-keeper config. Are there any examples of that I could base it on?

alexey-milovidov (Member) commented:

if [[ "1" == "$MY_ID" ]]; then
    clickhouse-keeper --config-file=/etc/clickhouse-keeper/keeper_config.xml --force-recovery

This looks suspicious, not sure if it is correct.

alexey-milovidov (Member) commented:

Let's ask @antonio2368 for the details.

antonio2368 (Member) commented:

--force-recovery is a flag that should NOT be used in this way; it's a last-resort option for when you have lost enough nodes that quorum can no longer be achieved.
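In other words, every replica can start the same way and let Raft elect a leader on its own. A sketch of the start command from the manifest above with the branch removed (not an official manifest):

# same invocation for every replica, regardless of MY_ID
clickhouse-keeper --config-file=/etc/clickhouse-keeper/keeper_config.xml
# --force-recovery is reserved for manual disaster recovery, i.e. when so many
# nodes are permanently lost that the remaining ones can no longer reach quorum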

antonio2368 (Member) commented:

@tman5 When logs for the Keeper are included, it would be helpful to set keeper_server.coordination_settings.raft_logs_level to trace in config. There will be much more information about the replication process itself.
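For the ConfigMap posted above, that is roughly this change to keeper_config.xml (a sketch; only raft_logs_level changes):

<keeper_server incl="keeper_server">
    <path>/var/lib/clickhouse-keeper</path>
    <tcp_port>2181</tcp_port>
    <coordination_settings>
        <!-- switch back from "information" to "trace" while collecting logs -->
        <raft_logs_level>trace</raft_logs_level>
    </coordination_settings>
</keeper_server>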

tman5 (Author) commented Nov 2, 2023

So why would that IF block be in there? If we remove --force-recovery, then the IF statement wouldn't even be needed. It looks like it's singling out the first server in the cluster?

if [[ "1" == "$MY_ID" ]]; then
  clickhouse-keeper --config-file=/etc/clickhouse-keeper/keeper_config.xml --force-recovery
else
  clickhouse-keeper --config-file=/etc/clickhouse-keeper/keeper_config.xml
fi

Slach (Contributor) commented Nov 2, 2023

@tman5 see Altinity/clickhouse-operator#1234.
Currently, if you want to scale clickhouse-keeper up or down in Kubernetes, you need to wait for #53481 to be merged.

tman5 (Author) commented Nov 2, 2023

So will your updated manifests work? Or do we also need to wait for that PR to merge?

alexey-milovidov (Member) commented:

@tman5, these manifests are not part of the official ClickHouse product, and we don't support them.
"Altinity/clickhouse-operator" is a third-party repository.

We have noticed at least one mistake in these manifests, so they cannot be used.
You can carefully review every line of code of these manifests, remove every line that you don't understand, and then it might be ok.

tman5 (Author) commented Nov 20, 2023

@alexey-milovidov are there any plans to release an official Helm chart for clickhouse-keeper?

alexey-milovidov (Member) commented Nov 20, 2023

Currently, there are no plans, but we are considering it for the future.

Note: it is hard to operate Keeper, ZooKeeper, or any other distributed consensus system in Kubernetes. If frequent pod restarts are combined with either a misconfiguration (as in the example above) or corrupted data on a single node, it can lead to a rollback of the Keeper's state, which in turn leads to "intersecting parts" errors and data loss.
