
ETCD database space / quota exceeded, goes into maintenance mode #4005

Closed
KashifSaadat opened this issue Dec 4, 2017 · 12 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@KashifSaadat
Contributor

KashifSaadat commented Dec 4, 2017

Kops Version: kops v1.8.0-beta.2
Kubernetes Version: kubernetes v1.8.2
ETCD Version: v3.0.17 (TLS enabled)
Cloud Provider: AWS

Steps to recreate (will take time):

  1. Create a Kubernetes cluster on the versions specified above, using etcd v3 with config similar to the below (I had 5 members configured; the spec is trimmed here to be less spammy).
  2. Give the cluster some operation time (creating lots of deployments, events, etc.).
  etcdClusters:
  - enableEtcdTLS: true
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master0-az0
      name: a-1
    - encryptedVolume: true
      instanceGroup: master1-az0
      name: a-2
    name: main
    version: 3.0.17
  - enableEtcdTLS: true
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master0-az0
      name: a-1
    - encryptedVolume: true
      instanceGroup: master1-az0
      name: a-2
    name: events
    version: 3.0.17

After some operation time, you may begin to see warnings such as below in the logs:

kubelet[1495]: W1204 11:17:02.533588    1495 status_manager.go:446] Failed to update status for pod "custom-pod-A": etcdserver: mvcc: database space exceeded
kubelet[1495]: W1204 11:17:02.542113    1495 status_manager.go:446] Failed to update status for pod "custom-pod-B": etcdserver: mvcc: database space exceeded
kubelet[1495]: W1204 11:17:02.551753    1495 status_manager.go:446] Failed to update status for pod "canal-hcldk_kube-system(C)": etcdserver: mvcc: database space exceeded
kubelet[1495]: W1204 11:17:02.557246    1495 status_manager.go:446] Failed to update status for pod "custom-pod-D": etcdserver: mvcc: database space exceeded
kubelet[1495]: W1204 11:17:02.565505    1495 status_manager.go:446] Failed to update status for pod "custom-pod-E": etcdserver: mvcc: database space exceeded
kubelet[1495]: \"sizeBytes\":746888}]}}" for node "ip-1-2-3-4.aws-region.compute.internal": etcdserver: mvcc: database space exceeded

Check ETCD Status:

~ # ${ETCD_CMD} --endpoints ${ETCD_ENDPOINT} alarm list
memberID:A alarm:NOSPACE
memberID:B alarm:NOSPACE
memberID:C alarm:NOSPACE

~ # ${ETCD_CMD} --endpoints ${ETCD_ENDPOINT} --write-out=table endpoint status
+------------------------+------------------+---------+---------+-----------+-----------+------------+
|        ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://localhost:4001 | 670630e06d36fd3c |  3.0.17 |  140 MB |      true |       358 |  120256658 |
+------------------------+------------------+---------+---------+-----------+-----------+------------+

~ # df -h | grep "master-vol"
/dev/xvdu         20G  442M   19G   3% /mnt/master-vol-A
/dev/xvdv         20G  419M   19G   3% /mnt/master-vol-B

According to the etcd maintenance docs, the cluster has entered a limited-operation maintenance mode, meaning it will only accept key reads and deletes.


Recovery: history compaction needs to occur (followed possibly by defragmentation, to release the free storage space for reuse) before the cluster is operational again; the steps for this are in the docs linked above.

There are options we could supply to etcd via kops that would hopefully mitigate this issue and reduce the manual maintenance required (although I don't know enough about etcd to be sure):

  • EtcdClusterSpec: Allow ETCD_QUOTA_BACKEND_BYTES to be configurable, so a higher value can be set rather than the default of 0 (which falls back to etcd's low default space quota)
  • EtcdClusterSpec: Allow ETCD_AUTO_COMPACTION_RETENTION to be configurable, so it can trigger automatically without user intervention.
    • Could have some performance implications?
    • If we were to support this, should we default it to be enabled for new clusters?
    • Does periodic defragmentation still need to occur?

EDIT: 1 of the 5 nodes had its etcd volume maxed out at 100%, due to a dodgy deployment. The other 4 were only 3% utilised, as shown in the log snippets above.

Ping @gambol99 @justinsb @chrislovecnm

@justinsb
Member

justinsb commented Dec 4, 2017

So the apiserver issues a compaction every 5 minutes (IIRC). I don't understand exactly the cause, but it looks like an etcd bug. Related:

kubernetes/kubernetes#45037
etcd-io/etcd#8009
etcd-io/etcd#7116

It sounds like an etcd bug; @lavalamp asked for a backport and was told "no", but the fix will be in etcd 3.3.

@KashifSaadat
Contributor Author

Cheers, that's probably the cause of it then!

Regarding the apiserver doing the compaction every 5 minutes: shouldn't that mean the other 4 nodes with disk space remaining should have stayed operational? Or did we still need to run the defrag to reclaim the free space on the members / clear the alarms that had triggered?

@KashifSaadat
Contributor Author

KashifSaadat commented Dec 4, 2017

If anyone runs into the above issue, you can attempt the very rough recovery steps below that I took (tested on CoreOS).

Run this on each of the members affected, which still have available space on the etcd volume:

export ETCDCTL_API=3
ETCD_ENDPOINT_MAIN="https://localhost:4001"
ETCD_ENDPOINT_EVENTS="https://localhost:4002"
CA_FILE="/srv/kubernetes/ca.crt"
ETCD_CMD="etcdctl --cacert ${CA_FILE}"
rev=`${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} endpoint status --write-out="json" | egrep -o '"revision":[0-9]*' | egrep -o '[0-9]*'`
${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} compact $rev
${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} defrag
${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} alarm disarm

For any members that have hit the above-mentioned bug, where the volume is at 100% (I'm not entirely sure whether steps 2-5 are necessary in all cases):

  1. Find the affected member in AWS, terminate the associated ASG and 2x attached EBS Volumes (etcd, etcd-events)
  2. On one of the healthy-ish members, get the etcd member list: ${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} member list
  3. Remove the dead member (should have the same tag name as the ASG / instance you deleted): ${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} member remove <member-id-from-above-command>
  4. Add the member back in; it will be in an un-started state: ${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} member add <etcd-member-name> --peer-urls="https://<etcd-member-name>.internal.${KOPS_CLUSTER_NAME}:2380"
  5. Repeat steps 2-4 for ${ETCD_ENDPOINT_EVENTS}. <etcd-member-name> will differ, and the port will be 2381 rather than 2380.
  6. kops update cluster ${KOPS_CLUSTER_NAME} --yes (this will re-create the ASG and volumes)
  7. Once the new master has started, ssh into the instance
  8. Run as root: systemctl stop kubelet && systemctl stop protokube
  9. Edit both /etc/kubernetes/manifests/etcd.manifest and /etc/kubernetes/manifests/etcd-events.manifest, changing the ETCD_INITIAL_CLUSTER_STATE value to existing
  10. Drop the docker containers: docker kill $(docker ps | grep "etcd" | awk '{print $1}')
  11. For both the etcd volumes, remove the member dirs:
    • rm -rf /mnt/master-vol-<vol-id-main>/var/etcd/data/member
    • rm -rf /mnt/master-vol-<vol-id-events>/var/etcd/data-events/member
  12. Start kubelet: systemctl start kubelet. Wait for the cluster to report healthy again (check etcd member list, kops validate cluster etc).
  13. Start protokube again: systemctl start protokube
  14. Once the cluster is all healthy, slowly terminate the masters one by one (giving time for the cluster to recover), to ensure they are all in a clean state.

The above steps were modified slightly from following this guide: https://github.com/kubernetes/kops/blob/master/docs/single-to-multi-master.md#4---add-the-third-master

@KashifSaadat
Contributor Author

v3.3.0 has officially been released. The following PR should correct issues with logging and pick up version changes for a rolling update: #4371

I'll be testing this out and will see how it goes!

@KashifSaadat
Contributor Author

KashifSaadat commented Mar 2, 2018

Tempted to close this issue now. etcd v3.3.0 appears to resolve it; I'm running a cluster on the newer version (including the PR referenced above) and haven't noticed any problems so far.

Just a note, with kops you'll need to define the new version as follows in your kops spec:

  etcdClusters:
  - etcdMembers:
     ...
    enableEtcdTLS: true
    image: gcr.io/etcd-development/etcd:v3.3.0
    name: main
    version: 3.3.0

The version field doesn't need to be identical to the image, so long as it's 3.x.x.

@justinsb anything more you think we need to do here, or happy to close this?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 31, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 30, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@voigt

voigt commented Nov 14, 2019

Run this on each of the members affected, which still have available space on the etcd volume:

export ETCDCTL_API=3
ETCD_ENDPOINT_MAIN="https://localhost:4001"
ETCD_ENDPOINT_EVENTS="https://localhost:4002"
CA_FILE="/srv/kubernetes/ca.crt"
ETCD_CMD="etcdctl --cacert ${CA_FILE}"
rev=`${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} endpoint status --write-out="json" | egrep -o '"revision":[0-9]*' | egrep -o '[0-9]*'`
${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} compact $rev
${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} defrag
${ETCD_CMD} --endpoints ${ETCD_ENDPOINT_MAIN} alarm disarm

Sorry to resurrect this old issue. I just fell into it.

When I ran the etcdctl [...] defrag I always got this error:
Failed to defragment etcd member[https://127.0.0.1:4001] (context deadline exceeded)

Setting the flag --command-timeout=120s solved this issue for me.

Hope that I could save someone some time.
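For context: --command-timeout is a global etcdctl flag that raises the client-side timeout from its 5s default, which a defrag on a large database can easily exceed. A minimal sketch reusing the endpoint and CA paths from the snippet above; it only builds and prints the command rather than running it against a live cluster:

```shell
# Build the defrag command with an extended client-side timeout.
# --command-timeout is a global etcdctl flag, so it goes before the subcommand.
CA_FILE="/srv/kubernetes/ca.crt"
ETCD_CMD="etcdctl --cacert ${CA_FILE} --command-timeout=120s"
DEFRAG="${ETCD_CMD} --endpoints https://localhost:4001 defrag"
echo "Would run: ${DEFRAG}"
```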

@jsonmp-k8

Does kops support this --quota-backend-bytes param for etcd?

@olemarkus
Member

You can specify which ENV vars to pass on to etcd: https://kops.sigs.k8s.io/cluster_spec/#etcdclusters
So you just have to set ETCD_QUOTA_BACKEND_BYTES there.
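For example, per the cluster_spec docs linked above, env vars go under each etcd cluster's manager section (this assumes etcd-manager is in use; the 8 GiB value here is only an illustration, not a recommendation):

```yaml
etcdClusters:
- etcdMembers:
    ...
  name: main
  manager:
    env:
    - name: ETCD_QUOTA_BACKEND_BYTES
      value: "8589934592"   # 8 GiB; pick a value appropriate for your cluster
```

A rolling update of the masters would presumably be needed for the change to take effect.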

@jsonmp-k8

Thanks @olemarkus

7 participants