Clickhouse-Keeper on Kubernetes - Node/Pod Restart Issues #55219

Closed
tman5 opened this issue Oct 3, 2023 · 16 comments
Labels: invalid, operations, st-need-info (We need extra data to continue (waiting for response))

Comments

tman5 commented Oct 3, 2023

Upon rebooting an underlying Kubernetes node or re-creating the StatefulSet for clickhouse-keeper in k8s, the pods sometimes come back in a CrashLoopBackOff state with errors such as:

2023.10.03 18:13:18.061959 [ 30 ] {} <Error> bool DB::KeeperStateMachine::preprocess(const KeeperStorage::RequestForSession &): Failed to preprocess stored log, aborting to avoid inconsistent state: Code: 49. DB::Exception: Got new ZXID (52749) smaller or equal to current ZXID (52750). It's a bug. (LOGICAL_ERROR), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x0000000000e3247b in /usr/bin/clickhouse-keeper
1. DB::Exception::Exception<long&, long&>(int, FormatStringHelperImpl<std::type_identity<long&>::type, std::type_identity<long&>::type>, long&, long&) @ 0x0000000000871562 in /usr/bin/clickhouse-keeper
2. DB::KeeperStorage::preprocessRequest(std::shared_ptr<Coordination::ZooKeeperRequest> const&, long, long, long, bool, std::optional<DB::KeeperStorage::Digest>, long) @ 0x00000000008748c6 in /usr/bin/clickhouse-keeper
3. DB::KeeperStateMachine::preprocess(DB::KeeperStorage::RequestForSession const&) @ 0x000000000084ffa1 in /usr/bin/clickhouse-keeper
4. DB::KeeperStateMachine::pre_commit(unsigned long, nuraft::buffer&) @ 0x000000000084e762 in /usr/bin/clickhouse-keeper
5. nuraft::raft_server::handle_append_entries(nuraft::req_msg&) @ 0x0000000000c40371 in /usr/bin/clickhouse-keeper
6. nuraft::raft_server::process_req(nuraft::req_msg&, nuraft::raft_server::req_ext_params const&) @ 0x0000000000c0706e in /usr/bin/clickhouse-keeper
7. nuraft::rpc_session::read_complete(std::shared_ptr<nuraft::buffer>, std::shared_ptr<nuraft::buffer>) @ 0x0000000000bec4a8 in /usr/bin/clickhouse-keeper
8. nuraft::rpc_session::read_log_data(std::shared_ptr<nuraft::buffer>, boost::system::error_code const&, unsigned long) @ 0x0000000000bed14a in /usr/bin/clickhouse-keeper
9. boost::asio::detail::read_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::any_io_executor>, boost::asio::mutable_buffers_1, boost::asio::mutable_buffer const*, boost::asio::detail::transfer_all_t, std::__bind<void (nuraft::rpc_session::*)(std::shared_ptr<nuraft::buffer>, boost::system::error_code const&, unsigned long), std::shared_ptr<nuraft::rpc_session> const&, std::shared_ptr<nuraft::buffer>&, std::placeholders::__ph<1> const&, std::placeholders::__ph<2> const&>>::operator()(boost::system::error_code, unsigned long, int) @ 0x0000000000bf8e13 in /usr/bin/clickhouse-keeper
10. boost::asio::detail::reactive_socket_recv_op<boost::asio::mutable_buffers_1, boost::asio::detail::read_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::any_io_executor>, boost::asio::mutable_buffers_1, boost::asio::mutable_buffer const*, boost::asio::detail::transfer_all_t, std::__bind<void (nuraft::rpc_session::*)(std::shared_ptr<nuraft::buffer>, boost::system::error_code const&, unsigned long), std::shared_ptr<nuraft::rpc_session> const&, std::shared_ptr<nuraft::buffer>&, std::placeholders::__ph<1> const&, std::placeholders::__ph<2> const&>>, boost::asio::any_io_executor>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) @ 0x0000000000bf9739 in /usr/bin/clickhouse-keeper
11. boost::asio::detail::scheduler::run(boost::system::error_code&) @ 0x0000000000ba9eb9 in /usr/bin/clickhouse-keeper
12. nuraft::asio_service_impl::worker_entry() @ 0x0000000000ba2db2 in /usr/bin/clickhouse-keeper
13. void* std::__thread_proxy[abi:v15000]<std::tuple<std::unique_ptr<std::__thread_struct, std::default_delete<std::__thread_struct>>, std::__bind<void (nuraft::asio_service_impl::*)(), nuraft::asio_service_impl*>>>(void*) @ 0x0000000000bafa1a in /usr/bin/clickhouse-keeper
 (version 23.10.1.134 (official build))
2023.10.03 18:13:18.063310 [ 12 ] {} <Trace> BaseDaemon: Received signal 6
2023.10.03 18:13:18.063510 [ 34 ] {} <Fatal> BaseDaemon: ########## Short fault info ############
2023.10.03 18:13:18.063531 [ 34 ] {} <Fatal> BaseDaemon: (version 23.10.1.134 (official build), build id: 8062681F29E88365723319EBF0E96CB2D91AA0D5, git hash: 7b6548157c9216eb52867ba10c8c9bd30a3492dd) (from thread 30) Received signal 6
2023.10.03 18:13:18.063538 [ 34 ] {} <Fatal> BaseDaemon: Signal description: Aborted
2023.10.03 18:13:18.063542 [ 34 ] {} <Fatal> BaseDaemon: 
2023.10.03 18:13:18.063548 [ 34 ] {} <Fatal> BaseDaemon: Stack trace: 0x0000000000a3ff30
2023.10.03 18:13:18.063551 [ 34 ] {} <Fatal> BaseDaemon: ########################################
2023.10.03 18:13:18.063559 [ 34 ] {} <Fatal> BaseDaemon: (version 23.10.1.134 (official build), build id: 8062681F29E88365723319EBF0E96CB2D91AA0D5, git hash: 7b6548157c9216eb52867ba10c8c9bd30a3492dd) (from thread 30) (no query) Received signal Aborted (6)
2023.10.03 18:13:18.063563 [ 34 ] {} <Fatal> BaseDaemon: 
2023.10.03 18:13:18.063566 [ 34 ] {} <Fatal> BaseDaemon: Stack trace: 0x0000000000a3ff30
2023.10.03 18:13:18.063619 [ 34 ] {} <Fatal> BaseDaemon: 0. signalHandler(int, siginfo_t*, void*) @ 0x0000000000a3ff30 in /usr/bin/clickhouse-keeper
2023.10.03 18:13:18.063629 [ 34 ] {} <Fatal> BaseDaemon: Integrity check of the executable skipped because the reference checksum could not be read.
2023.10.03 18:13:18.063636 [ 34 ] {} <Fatal> BaseDaemon: Report this error to https://github.com/ClickHouse/ClickHouse/issues
bash: line 27:    11 Aborted                 (core dumped) clickhouse-keeper --config-file=/etc/clickhouse-keeper/keeper_config.xml

clickhouse-keeper version 23.9.1

This issue appears to be similar to #42668; however, this one is on K8s using a StatefulSet. This is the manifest we are using: https://github.com/Altinity/clickhouse-operator/blob/master/deploy/clickhouse-keeper/clickhouse-keeper-3-nodes.yaml

It seems like it's an order of operations/race condition issue. I can't reproduce it reliably. Sometimes node reboots work fine. Other times the clickhouse-keeper pods will come up in this crashloop state.

A "fix" is to delete the pod and PVC and let it re-create. That will bring it back but it's not a long term solution.

tman5 added the potential bug (To be reviewed by developers and confirmed/rejected) label on Oct 3, 2023
antonio2368 (Member) commented:

It's hard to say what happened without the logs before shutdown and during the first failed start to see where the error happened.
It's possible something was incorrectly applied.
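In a StatefulSet, capturing that usually means also grabbing the output of the previous (crashed) container, roughly like this (a sketch; adjust the pod name to the failing replica):

# the keeper logs to the console, so kubectl can collect them directly
kubectl logs clickhouse-keeper-0 --previous > keeper-0-previous.log   # log from the previous restart
kubectl logs clickhouse-keeper-0 > keeper-0-current.log               # log from the current attempt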

tman5 (Author) commented Oct 11, 2023

Logs attached
keeper_logs.zip

Manifests here:

---
# Setup Service to provide access to ClickHouse keeper for clients
apiVersion: v1
kind: Service
metadata:
  # DNS would be like clickhouse-keeper.namespace.svc
  name: clickhouse-keeper
  labels:
    app: clickhouse-keeper
spec:
  ports:
    - port: 2181
      name: client
    - port: 7000
      name: prometheus
  selector:
    app: clickhouse-keeper
    what: node
---
# Setup Headless Service for StatefulSet
apiVersion: v1
kind: Service
metadata:
  # DNS would be like clickhouse-keeper-0.clickhouse-keepers.namespace.svc
  name: clickhouse-keepers
  labels:
    app: clickhouse-keeper
spec:
  ports:
    - port: 9444
      name: raft
  clusterIP: None
  selector:
    app: clickhouse-keeper
    what: node
---
# Setup max number of unavailable pods in StatefulSet
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: clickhouse-keeper-pod-disruption-budget
spec:
  selector:
    matchLabels:
      app: clickhouse-keeper
  maxUnavailable: 1
---
# Setup ClickHouse Keeper settings
apiVersion: v1
kind: ConfigMap
metadata:
  name: clickhouse-keeper-settings
data:
  keeper_config.xml: |
    <clickhouse>
        <include_from>/tmp/clickhouse-keeper/config.d/generated-keeper-settings.xml</include_from>
        <logger>
            <level>trace</level>
            <console>true</console>
        </logger>
        <listen_host>0.0.0.0</listen_host>
        <keeper_server incl="keeper_server">
            <path>/var/lib/clickhouse-keeper</path>
            <tcp_port>2181</tcp_port>
            <coordination_settings>
                <!-- <raft_logs_level>trace</raft_logs_level> -->
                <raft_logs_level>information</raft_logs_level>
            </coordination_settings>
        </keeper_server>
        <prometheus>
            <endpoint>/metrics</endpoint>
            <port>7000</port>
            <metrics>true</metrics>
            <events>true</events>
            <asynchronous_metrics>true</asynchronous_metrics>
            <!-- https://github.com/ClickHouse/ClickHouse/issues/46136 -->
            <status_info>false</status_info>
        </prometheus>
    </clickhouse>

---
# Setup ClickHouse Keeper StatefulSet
apiVersion: apps/v1
kind: StatefulSet
metadata:
  # nodes would be named as clickhouse-keeper-0, clickhouse-keeper-1, clickhouse-keeper-2
  name: clickhouse-keeper
  labels:
    app: clickhouse-keeper
spec:
  selector:
    matchLabels:
      app: clickhouse-keeper
  serviceName: clickhouse-keepers
  replicas: 3
  updateStrategy:
    type: RollingUpdate
  podManagementPolicy: Parallel
  template:
    metadata:
      labels:
        app: clickhouse-keeper
        what: node
      annotations:
        prometheus.io/port: '7000'
        prometheus.io/scrape: 'true'
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: "app"
                    operator: In
                    values:
                      - clickhouse-keeper
              topologyKey: "kubernetes.io/hostname"
      volumes:
        - name: clickhouse-keeper-settings
          configMap:
            name: clickhouse-keeper-settings
            items:
              - key: keeper_config.xml
                path: keeper_config.xml
      containers:
        - name: clickhouse-keeper
          imagePullPolicy: IfNotPresent
          image: "hub.docker.io/dockerhub/clickhouse/clickhouse-keeper:head-alpine"
          resources:
            requests:
              memory: "256M"
              cpu: "1"
            limits:
              memory: "4Gi"
              cpu: "2"
          volumeMounts:
            - name: clickhouse-keeper-settings
              mountPath: /etc/clickhouse-keeper/
            - name: clickhouse-keeper-datadir-volume
              mountPath: /var/lib/clickhouse-keeper
          env:
            - name: SERVERS
              value: "3"
            - name: RAFT_PORT
              value: "9444"
          command:
            - bash
            - -x
            - -c
            - |
              HOST=`hostname -s` &&
              DOMAIN=`hostname -d` &&
              if [[ $HOST =~ (.*)-([0-9]+)$ ]]; then
                  NAME=${BASH_REMATCH[1]}
                  ORD=${BASH_REMATCH[2]}
              else
                  echo "Failed to parse name and ordinal of Pod"
                  exit 1
              fi &&
              export MY_ID=$((ORD+1)) &&
              mkdir -p /tmp/clickhouse-keeper/config.d/ &&
              {
                echo "<yandex><keeper_server>"
                echo "<server_id>${MY_ID}</server_id>"
                echo "<raft_configuration>"
                for (( i=1; i<=$SERVERS; i++ )); do
                    echo "<server><id>${i}</id><hostname>$NAME-$((i-1)).${DOMAIN}</hostname><port>${RAFT_PORT}</port></server>"
                done
                echo "</raft_configuration>"
                echo "</keeper_server></yandex>"
              } > /tmp/clickhouse-keeper/config.d/generated-keeper-settings.xml &&
              cat /tmp/clickhouse-keeper/config.d/generated-keeper-settings.xml &&
              if [[ "1" == "$MY_ID" ]]; then
                clickhouse-keeper --config-file=/etc/clickhouse-keeper/keeper_config.xml --force-recovery
              else
                clickhouse-keeper --config-file=/etc/clickhouse-keeper/keeper_config.xml
              fi
          livenessProbe:
            exec:
              command:
                - bash
                - -xc
                - 'date && OK=$(exec 3<>/dev/tcp/127.0.0.1/2181 ; printf "ruok" >&3 ; IFS=; tee <&3; exec 3<&- ;); if [[ "$OK" == "imok" ]]; then exit 0; else exit 1; fi'
            initialDelaySeconds: 20
            timeoutSeconds: 15
          ports:
            - containerPort: 7000
              name: prometheus
  volumeClaimTemplates:
    - metadata:
        name: clickhouse-keeper-datadir-volume
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 25Gi

alexey-milovidov added the st-need-info (We need extra data to continue (waiting for response)) and operations labels and removed the potential bug label on Oct 13, 2023
alexey-milovidov (Member) commented:

This is the manifest we are using: https://github.com/Altinity...

I don't trust this manifest (it's from a third-party company). Does the issue reproduce if you run Keeper without Kubernetes?
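For example (a rough sketch, not an official recipe), a single Keeper instance can be started straight from the official image on a plain Docker host and restarted in a loop to see whether restarts alone ever trigger the same error. The tag and the use of the image's default single-node config are assumptions here:

docker run -d --name keeper-test -p 2181:2181 clickhouse/clickhouse-keeper:23.9   # pick a tag matching your deployment
docker restart keeper-test                                                        # repeat several times
docker logs keeper-test 2>&1 | grep -i 'Failed to preprocess'                     # check for the same error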

alexey-milovidov (Member) commented Oct 13, 2023

The manifest looks hairy; I advise throwing it away and writing your own from scratch.

jaigopinathmalempati commented:

Do we have a Helm chart for clickhouse-keeper?

tman5 (Author) commented Oct 30, 2023

@alexey-milovidov we have not tried running Keeper outside of K8s. We weren't planning to entertain that unless absolutely necessary, since at the moment our install is a single application that will be using Keeper.

Is there a better example of running clickhouse-keeper in Kubernetes? The only part of the config that appears to be very specific is the StatefulSet config, which includes a block that writes the clickhouse-keeper config. Are there any examples of that I could base it on?

alexey-milovidov (Member) commented:

if [[ "1" == "$MY_ID" ]]; then
    clickhouse-keeper --config-file=/etc/clickhouse-keeper/keeper_config.xml --force-recovery

This looks suspicious, not sure if it is correct.

alexey-milovidov (Member) commented:

Let's ask @antonio2368 for the details.

antonio2368 (Member) commented:

--force-recovery is a flag that should NOT be used in this way; it's a last-resort option for when you have lost enough nodes that quorum can no longer be achieved.
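In other words, every replica can start the same way and let Raft elect a leader on its own. A sketch of the start command from the manifest above with the branch removed (not an official manifest):

# same invocation for every replica, regardless of MY_ID
clickhouse-keeper --config-file=/etc/clickhouse-keeper/keeper_config.xml
# --force-recovery is reserved for manual disaster recovery, i.e. when so many
# nodes are permanently lost that the remaining ones can no longer reach quorum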

antonio2368 (Member) commented:

@tman5 When logs for the Keeper are included, it would be helpful to set keeper_server.coordination_settings.raft_logs_level to trace in config. There will be much more information about the replication process itself.
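For the ConfigMap posted above, that is roughly this change to keeper_config.xml (a sketch; only raft_logs_level changes):

<keeper_server incl="keeper_server">
    <path>/var/lib/clickhouse-keeper</path>
    <tcp_port>2181</tcp_port>
    <coordination_settings>
        <!-- switch back from "information" to "trace" while collecting logs -->
        <raft_logs_level>trace</raft_logs_level>
    </coordination_settings>
</keeper_server>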

tman5 (Author) commented Nov 2, 2023

So why would that IF block be in there? If we remove --force-recovery, then the IF statement wouldn't even be needed. It looks like it's singling out the first server in the cluster?

if [[ "1" == "$MY_ID" ]]; then
  clickhouse-keeper --config-file=/etc/clickhouse-keeper/keeper_config.xml --force-recovery
else
  clickhouse-keeper --config-file=/etc/clickhouse-keeper/keeper_config.xml
fi

Slach (Contributor) commented Nov 2, 2023

@tman5 see Altinity/clickhouse-operator#1234.
Currently, if you want to scale clickhouse-keeper up or down in Kubernetes, you need to wait for #53481 to be merged.

tman5 (Author) commented Nov 2, 2023

So will your updated manifests work? Or do we also need to wait for that PR to merge?

alexey-milovidov (Member) commented:

@tman5, these manifests are not part of the official ClickHouse product, and we don't support them.
"Altinity/clickhouse-operator" is a third-party repository.

We have noticed at least one mistake in these manifests, so they cannot be used.
You can carefully review every line of code of these manifests, remove every line that you don't understand, and then it might be ok.

tman5 (Author) commented Nov 20, 2023

@alexey-milovidov are there any plans to release an official Helm chart for clickhouse-keeper?

alexey-milovidov (Member) commented Nov 20, 2023

Currently, there are no plans, but we are considering it for the future.

Note: it is hard to operate Keeper, ZooKeeper, or any other distributed consensus system in Kubernetes. If frequent pod restarts are combined with either a misconfiguration (as in the example above) or corrupted data on a single node, it can lead to a rollback of the Keeper's state, which in turn leads to "intersecting parts" errors and data loss.
