Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Etcd size sometimes starts growing and grows until "mvcc: database space exceeded" #8009

Closed
wojtek-t opened this issue May 31, 2017 · 53 comments
Assignees

Comments

@wojtek-t
Copy link
Contributor

We have observed already few cases, where suddenly (after days of running a GKE cluster), the size of database starts growing.

As an example, we have a cluster, that was running without any issues for ~2 weeks (size of database was ~16MB), and then its database started growing. The growth wasn't immediate - it took ~2days, before it reached 4GB limit and there were steps of growing.
For the reference we have backups (snapshots) (done via etcdctl snapshot) that reflect the growth speed (the name contains the time when it was made)

  ... // all snapshots are roughly 16MB
  2017-05-24T04:57:24-07:00_snapshot.db 16,27 MB
  2017-05-24T05:57:26-07:00_snapshot.db 29,06 MB
  2017-05-24T06:57:30-07:00_snapshot.db 108,98 MB
  2017-05-24T07:57:36-07:00_snapshot.db 177,57 MB
  2017-05-24T08:57:51-07:00_snapshot.db 308,4 MB        
  2017-05-24T09:58:32-07:00_snapshot.db 534,54 MB
  2017-05-24T11:00:16-07:00_snapshot.db 655,73 MB
  2017-05-24T12:00:55-07:00_snapshot.db 764,22 MB
  ... // all snapshots of the same size
  2017-05-25T15:15:10-07:00_snapshot.db 764,22 MB
  2017-05-25T16:16:25-07:00_snapshot.db 818,14 MB
  2017-05-25T17:26:35-07:00_snapshot.db 963,93 MB
  ... // all snapshots of the same size
  2017-05-25T22:25:08-07:00_snapshot.db 963,93 MB
  2017-05-25T23:27:03-07:00_snapshot.db 1,56 GB
  2017-05-26T00:30:13-07:00_snapshot.db 1,56 GB
  2017-05-26T01:05:24-07:00_snapshot.db 1,56 GB
  2017-05-26T02:24:21-07:00_snapshot.db 2,18 GB
  ... // all snapshots of the same size
  2017-05-26T08:43:07-07:00_snapshot.db 2,18 GB
  2017-05-26T09:46:47-07:00_snapshot.db 2,19 GB
  ... // all snapshots of the same size
  2017-05-26T16:11:31-07:00_snapshot.db 2,19 GB
  2017-05-26T17:16:47-07:00_snapshot.db 2,65 GB
  2017-05-26T18:22:37-07:00_snapshot.db 3,12 GB
  2017-05-26T19:29:07-07:00_snapshot.db 3,86 GB
  2017-05-26T20:33:24-07:00_snapshot.db 4,6 GB
  <boom>

We've checked that we were doing compaction very regularly every 5m for the whole time - so it doesn't seem to be the same as: #7944
I'm attaching the interested lines from etcd logs in etcd-compaction.txt

[Note time in that logs are in UTC, and time in snapshot names is PST, so 7 hours difference]

To summarize, the compaction was always at most few thousands of transactions (so it's not that we did a lot during some 5m period), though there were some longer compactions, up to ~7s.

I started digging into individual snapshots and found some strange thing (I was using bolt)

  1. 16MB snapshot:
Aggregate statistics for 10 buckets

Page count statistics
        Number of logical branch pages: 10
        Number of physical branch overflow pages: 0
        Number of logical leaf pages: 789
        Number of physical leaf overflow pages: 518
Tree statistics
        Number of keys/value pairs: 1667
        Number of levels in B+tree: 3
Page size utilization
        Bytes allocated for physical branch pages: 40960
        Bytes actually used for branch data: 26494 (64%)
        Bytes allocated for physical leaf pages: 5353472
        Bytes actually used for leaf data: 3411680 (63%)
Bucket statistics
        Total number of buckets: 10
        Total number on inlined buckets: 9 (90%)
        Bytes used for inlined buckets: 536 (0%)
  1. 534MB snapshot (5 hours later):
Aggregate statistics for 10 buckets

Page count statistics
        Number of logical branch pages: 65
        Number of physical branch overflow pages: 0
        Number of logical leaf pages: 5559
        Number of physical leaf overflow pages: 107743
Tree statistics
        Number of keys/value pairs: 13073
        Number of levels in B+tree: 3
Page size utilization
        Bytes allocated for physical branch pages: 266240
        Bytes actually used for branch data: 186912 (70%)
        Bytes allocated for physical leaf pages: 464084992
        Bytes actually used for leaf data: 451590110 (97%)
Bucket statistics
        Total number of buckets: 10
        Total number on inlined buckets: 9 (90%)
        Bytes used for inlined buckets: 536 (0%)
  1. 1.56GB snapshot (another ~36 hours later):
Aggregate statistics for 10 buckets

Page count statistics
        Number of logical branch pages: 70
        Number of physical branch overflow pages: 0
        Number of logical leaf pages: 4525
        Number of physical leaf overflow pages: 115179
Tree statistics
        Number of keys/value pairs: 10978
        Number of levels in B+tree: 3
Page size utilization
        Bytes allocated for physical branch pages: 286720
        Bytes actually used for branch data: 152723 (53%)
        Bytes allocated for physical leaf pages: 490307584
        Bytes actually used for leaf data: 478196884 (97%)
Bucket statistics
        Total number of buckets: 10
        Total number on inlined buckets: 9 (90%)
        Bytes used for inlined buckets: 536 (0%)
  1. 3.86GB snapshot (another ~18 hours later)
Aggregate statistics for 10 buckets

Page count statistics
        Number of logical branch pages: 90
        Number of physical branch overflow pages: 0
        Number of logical leaf pages: 6219
        Number of physical leaf overflow pages: 6791
Tree statistics
        Number of keys/value pairs: 15478
        Number of levels in B+tree: 3
Page size utilization
        Bytes allocated for physical branch pages: 368640
        Bytes actually used for branch data: 209621 (56%)
        Bytes allocated for physical leaf pages: 53288960
        Bytes actually used for leaf data: 36704465 (68%)
Bucket statistics
        Total number of buckets: 10
        Total number on inlined buckets: 9 (90%)
        Bytes used for inlined buckets: 536 (0%)
  1. 4.6GB snapshot (1hour later, right before exceeding space):
Aggregate statistics for 10 buckets

Page count statistics
        Number of logical branch pages: 89
        Number of physical branch overflow pages: 0
        Number of logical leaf pages: 6074
        Number of physical leaf overflow pages: 6713
Tree statistics
        Number of keys/value pairs: 15173
        Number of levels in B+tree: 3
Page size utilization
        Bytes allocated for physical branch pages: 364544
        Bytes actually used for branch data: 204788 (56%)
        Bytes allocated for physical leaf pages: 52375552
        Bytes actually used for leaf data: 36092789 (68%)
Bucket statistics
        Total number of buckets: 10
        Total number on inlined buckets: 9 (90%)
        Bytes used for inlined buckets: 564 (0%)

What is extremely interesting to me is that both:

  • Number of physical leaf overflow pages
  • Bytes allocated for physical leaf pages
    dropped by order of magnitude in this 3.86GB snapshot, but the total size of database didn't drop

Unfortunately I can't provide any of those snapshots due to privacy reasons, but maybe you can see anything that we can investigate (or share results of some commands) that can help with debugging?

@xiang90 @hongchaodeng @mml @lavalamp

@wojtek-t
Copy link
Contributor Author

FYI, restoring from all those snapshots locally and running:

ETCDCTL_API=3 etcdctl get /aaa /zzz > out.txt
ls -la out.txt

always returns files of size in the order of at most tens of MB.

@armstrongli
Copy link

@wojtek-t We got the similar issue on the DB file size increasement. What we have done is to do defrag periodically.
The DB file size decreases a lot(from 4.7GB to ~200MB).

@xiang90
Copy link
Contributor

xiang90 commented May 31, 2017

@benbjohnson can you take a look at the boltdb data? We double checked the etcd txn, and do not feel we can have a long running transaction that can cause the db size continuous to grow.

@xiang90
Copy link
Contributor

xiang90 commented May 31, 2017

@wojtek-t

2017-05-25T23:27:03-07:00_snapshot.db 1,56 GB
2017-05-26T00:30:13-07:00_snapshot.db 1,56 GB
2017-05-26T01:05:24-07:00_snapshot.db 1,56 GB

Do you know why the snapshot interval changed here? it was ~1hr per snapshot. but it is only 30min between the last two.

@benbjohnson
Copy link
Contributor

@xiang90 The snapshot and batchTxBufferred seem to be the only places using bolt.Tx.Begin(false). It seems like it could get hung in a spot such as newSnapshotReaderCloser() where WriteTo() gets blocked on the output writer.

Can you add logging to print the Tx pointer (using the %p formatter) when you open and close a snapshot read transaction?

https://github.com/coreos/etcd/blob/master/mvcc/backend/backend.go#L173

https://github.com/coreos/etcd/blob/master/mvcc/backend/backend.go#L439-L443

I'm not sure how much it'll help for me to look at the DB. Also, I am pretty slammed right now and don't have the bandwidth. My suspicion is that it's the snapshot getting hung or somehow the Tx is getting dropped. There are some spots where a panic() could prevent the Tx from being cleaned up but I couldn't tell if you (or an underlying library like net/http) has a recover() function masking that issue.

@wojtek-t
Copy link
Contributor Author

Do you know why the snapshot interval changed here? it was ~1hr per snapshot. but it is only 30min between the last two.

Good question. Maybe the underlying VM was restarted or sth? But that's just guessing, I don't really know.

@xiang90
Copy link
Contributor

xiang90 commented May 31, 2017

@wojtek-t As @benbjohnson mentioned, if there is a pending snapshot for a long time, we will have troubles. so i want you to check if there is any failed or pending snapshot. if there is not, we probably can rule out the long running txn thing. we also add quite a few warning lines around long running read only txn in etcd master. if you can reproduce the issue, i would suggest you to back port the debugging patch and give it a try.

@wojtek-t
Copy link
Contributor Author

@xiang90 - are there any logs that may confirm it in 3.0.17? We don't have reproduction for it (and backporting is hard on our side).

@xiang90
Copy link
Contributor

xiang90 commented May 31, 2017

are there any logs that may confirm it in 3.0.17

sorry. not really. :(

We don't have reproduction for it

what we plan to do at our side is to set up a few long running k8s stressful clusters against etcd3 while doing 10minute interval snapshot and some levels of controlled failure injection. Not sure if we can get a reproduce though.

But, for 3.0, we are at a bad situation here. No reproduce, not enough logging. We have some guess around what happened, but they are not super convincing. The only way to move forward is probably to get more debugging logging and reproduce it.

@armstrongli if you guys can reproduce it, it would be super helpful.

@wojtek-t
Copy link
Contributor Author

wojtek-t commented May 31, 2017

@xiang90 - do you know why between 1.56GB and 3.8GB the number of:

  • Number of physical leaf overflow pages
  • Bytes allocated for physical leaf pages

dropped by order of magnitude, but the size of DB more the doubled?
That seems pretty strange to me....

@wojtek-t
Copy link
Contributor Author

what we plan to do at our side is to set up a few long running k8s stressful clusters against etcd3 while doing 10minute interval snapshot and some levels of controlled failure injection. Not sure if we can get a reproduce though.

Ironicaly, so far those problems are happening on small clusters (less than 10 nodes). And there isn't big churn on them (at least on some of them). So we don't even need scale for this...

@xiang90
Copy link
Contributor

xiang90 commented May 31, 2017

Ironicaly, so far those problems are happening on small clusters (less than 10 nodes).

Hmm.... Is it because most of GKE users run small clusters? Or the bad percentage is indeed larger for small (low activity) clusters?

@wojtek-t
Copy link
Contributor Author

Hmm.... Is it because most of GKE users run small clusters? Or the bad percentage is indeed larger for small (low activity) clusters?

Can be both. What I wanted to say is just potentially it may not be easier to reproduce in scale than in small clusters.

@wojtek-t
Copy link
Contributor Author

@xiang90 - I looked into logs of our backup tasks. Unfortunately, it seems that they were log-rotated in the meantime, but we have something. In particular I'm seeing this:

E0526 08:22:32.952657       5 main.go:209] stdout from failed etcdctl backup:
E0526 08:22:32.952698       5 main.go:210] stderr from failed etcdctl backup: 2017/05/26 08:22:30 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:2379: getsockopt: connection refused"; Reconnecting to {"127.0.0.1:2379" <nil>}
2017/05/26 08:22:31 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:2379: getsockopt: connection refused"; Reconnecting to {"127.0.0.1:2379" <nil>}
Error:  grpc: timed out when dialing

Those times are UTC, so this happened during an hour when DB size grown from 1.56GB to 2.18GB.
I don't see any other failures later, though I don't know what happened before.

@xiang90
Copy link
Contributor

xiang90 commented May 31, 2017

dropped by order of magnitude, but the size of DB more the doubled?

boltdb neve reclaims the pages. What you see there are pages in use, there is also a freelist that tracks all freed pages. for 3.8gb one, i assume most of the pages become free?

one actually interesting point of your data is that the overflow pages are huge after 15.6MB one and becomes normal in the 3.8gb one.

in 1.56 gb one, for example, the overflow page/logic page ~ 20, which means the avg size of the key is (4000+ 115179) * 4kb/10978. is that what you expect?

@xiang90
Copy link
Contributor

xiang90 commented May 31, 2017

E0526 08:22:32.952657 5 main.go:209] stdout from failed etcdctl backup:

so we do have failed snapshot?

@wojtek-t
Copy link
Contributor Author

so we do have failed snapshot?

It failed on the client side - I don't know what happened on the server side (if it even reached the server or not).

@xiang90
Copy link
Contributor

xiang90 commented May 31, 2017

FYI, restoring from all those snapshots locally and running:

ETCDCTL_API=3 etcdctl get /aaa /zzz > out.txt
ls -la out.txt
always returns files of size in the order of at most tens of MB.

This is also strange. For the 500MB one, the actual usage is 97%, which is (451590110). Probably it worth to look at the data through https://github.com/br0xen/boltbrowser to see what are in there.

I am curious how 10978 k8s keys take up to ~450MB.

@xiang90
Copy link
Contributor

xiang90 commented May 31, 2017

E0526 08:22:32.952698 5 main.go:210] stderr from failed etcdctl backup: 2017/05/26 08:22:30 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:2379: getsockopt: connection refused"; Reconnecting to {"127.0.0.1:2379" }

do you have server side logging around the same time?

@xiang90
Copy link
Contributor

xiang90 commented May 31, 2017

What I wanted to say is just potentially it may not be easier to reproduce in scale than in small clusters.

correct. we will try that. it would be helpful if you guys can also do some testing. i believe google probably has more computation resources than us :P.

@wojtek-t
Copy link
Contributor Author

do you have server side logging around the same time?

I do - what should I look for?

@xiang90
Copy link
Contributor

xiang90 commented May 31, 2017

I do - what should I look for?

any panics? any warning or error level logging?

@wojtek-t
Copy link
Contributor Author

I don't see any panic and warnings. But it seems that I missed something important initially. It seems that etcd was restarted multiple times in the meantime. It was also restarted around that time. I'm trying to figure out why....

@wojtek-t
Copy link
Contributor Author

wojtek-t commented Jun 1, 2017

So I looked deeper into what was happening with this cluster and it was very heavily overloaded. In particular we were restarting different components every now and then do to them not responding to liveness probes (etcd) or not being able to refresh master leases.

Regarding the large objects - those were huge endpoints containing O(1000) failed pods. This is k8s issue and I will file a separate issue in k8s repo for it, but it should cause etcd database space exceeded in my opinion.

@wojtek-t
Copy link
Contributor Author

wojtek-t commented Jun 1, 2017

So I started looking into other clusters in which we faced the same issue, and it seems that there are some long-running "etcdctl snapshot save" operations there.

In particular, examples from logs from other clusters:

2017-05-26 23:07:46 UTC	Starting backup.
...
2017-05-26 23:07:51 UTC	Backing up file {"/var/lib/etcd/backups" "snapshot.db"}

[The second line means backup is done and we are starting to upload it to where it should be stored.]
This produced backup of size 193MB

And then it started getting longer:

2017-05-27 00:07:53 UTC	Starting backup.	
...
2017-05-27 00:09:31 UTC	Backing up file {"/var/lib/etcd/backups" "snapshot.db"}

[It was still 193MB backup - and there were a bunch of backups like this, like 7 or so, each taking ~2m]

Then this one 4 minutes long:

2017-05-27 07:23:13 UTC	Starting backup.
...
2017-05-27 07:27:36 UTC	Backing up file {"/var/lib/etcd/backups" "snapshot.db"}

And the one more which started before we exceeded database space, but finished after (apiserver logs below):

2017-05-27 08:28:35 UTC	Starting backup.
...
2017-05-27 08:42:05 UTC	Backing up file {"/var/lib/etcd/backups" "snapshot.db"}

This one produced backup of 1.29GB size.

2017-05-27 08:34:52 UTC status.go:62    apiserver received an error that is not an metav1.Status: etcdserver: mvcc: database space exceeded      
2017-05-27 08:34:52 UTC status.go:62    apiserver received an error that is not an metav1.Status: etcdserver: mvcc: database space exceeded      
2017-05-27 08:34:52 UTC status.go:62    apiserver received an error that is not an metav1.Status: etcdserver: mvcc: database space exceeded

That said, I'm not seeing any logs about those long backups in etcd log (or I don't know what to look at there).

I also looked at the cluster for which I reported the issue (I managed to get all previous logs from doing backups) and the situation was pretty similar:

  • initially they were taking ~15-20s
  • then it started growing to 20-30s
  • and then again it grown to 1m30s+, e.g.:
2017-05-26 00:24:57 UTC	Starting backup.
...
2017-05-26 00:26:35 UTC	Backing up file {"/var/lib/etcd/backups" "snapshot.db"}

or

2017-05-26 07:28:54 UTC	Starting backup
...
2017-05-26 07:30:14 UTC	Backing up file {"/var/lib/etcd/backups" "snapshot.db"}

Then to 2minutes+

2017-05-26 16:44:31 UTC	Starting backup.
...
2017-05-26 16:46:47 UTC	Backing up file {"/var/lib/etcd/backups" "snapshot.db"}

then to 4minutes+

2017-05-27 01:18:27 UTC	Starting backup.
...
2017-05-27 01:22:37 UTC	Backing up file {"/var/lib/etcd/backups" "snapshot.db"}

or 3 minutes+ right before we hit the "mvcc: database space exceeded":

2017-05-27 03:30:18 UTC	Starting backup.
...
2017-05-27 03:33:24 UTC	Backing up file {"/var/lib/etcd/backups" "snapshot.db"}

@wojtek-t
Copy link
Contributor Author

wojtek-t commented Jun 1, 2017

Though in the second case, the machine was overloaded, which might have also impact the time backups were taking.

@xiang90
Copy link
Contributor

xiang90 commented Jun 1, 2017

but it should cause etcd database space exceeded in my opinion.

you mean should not, right? I tend to agree on this if that is the only bad thing happened.

A few minutes snapshot is not a unreasonably long running one. I was worrying about there is a hanging one or leaky one (never finished).

Thanks for all the information. I will try to see what can happen when etcd is taking a snapshot + receiving some large value updates.

We will keep you updated on this.

@wojtek-t
Copy link
Contributor Author

@xiang90 - great thanks!
When can we expect this to be part of some release? Also will it be backported to 3.0.* and/or 3.1.* ?

@xiang90 - friendly ping ^^

@sebastianwoinar
Copy link

Has this been released already?
Even 3.2.6 seems to still rely on the old backend (https://github.com/coreos/etcd/blob/v3.2.6/glide.yaml#L5).

@heyitsanthony
Copy link
Contributor

@sebastianwoinar it will be in 3.3.

@tatsuhiro-t
Copy link

We experienced the same issue today in our kubernetes cluster. DB size gets from 80MB to 12GB.
We did very frequent snapshot (less than 5min). It looks like the root cause is boltdb, but does frequent snapshot trigger boltdb issue? Or just a completely different?

@loxo33
Copy link

loxo33 commented Nov 1, 2017

We are experiencing this issue with 3.2.7 and 3.2.9. Can this issue be re-opened and investigated, please?

@xiang90
Copy link
Contributor

xiang90 commented Nov 1, 2017

@loxo33 the patch is merged in to master, not 3.2.x

@xiang90
Copy link
Contributor

xiang90 commented Nov 1, 2017

We experienced the same issue today in our kubernetes cluster. DB size gets from 80MB to 12GB.
We did very frequent snapshot (less than 5min). It looks like the root cause is boltdb, but does frequent snapshot trigger boltdb issue? Or just a completely different?

probably a boltdb issue.

@christianhuening
Copy link

We might be hitting this as well with etcd 3.1. So what is the recommended action to get a working cluster again? Compacting, defragmenting and disarming does not work as the alarm gets re-raised immediately

@sokoow
Copy link

sokoow commented Dec 27, 2017

Where are we with this ? according to the comment above, it'd should have been in 3.3, but this is still unresolved right ?

@karankh
Copy link

karankh commented Dec 28, 2017

Did you guys try to run defrag periodically, we do and it helps keep the size in check.

@ghost
Copy link

ghost commented Jan 10, 2018

Is there any recommended action yet? We are running into this cluster issue aswell.

@christianhuening
Copy link

@ricklagerweij For me it turned out to be a configuration issue. I had to set the --quota-backend-bytes flag to something larger than 2GB for my cluster to continue working, as the cluster size was 2,3 GB. It defaults to 2GB, which is documented here: https://github.com/coreos/etcd/blob/master/Documentation/dev-guide/limit.md#storage-size-limit
After that the defrag etc worked again as expected.

@ghost
Copy link

ghost commented Jan 10, 2018

@christianhuening Thanks! That makes sense as 2 out of 3 nodes had a database which had a size of 2.1GB. I'm not familiar with etcd, how would I tweak this on an existing cluster? This page doesn't really give me much operational recommendations https://github.com/coreos/etcd/blob/master/Documentation/op-guide/configuration.md

@christianhuening
Copy link

@ricklagerweij Well, assuming you have it running via systemd: You go to node 1, edit the systemd service file for etcd and add --quota-backend-bytes as an argument (pass the desired GBs as bytes (3GB = 3.221.225.472). Then you restart etcd on that node via systemctl restart etcd and monitor the status of the cluster on another node via etcdctl until all 3 nodes are healthy again (you can loose 1 out of 3 nodes with 3 nodes due to quorum) . Then repeat for nodes 2 and 3.

@ghost
Copy link

ghost commented Jan 11, 2018

@christianhuening makes sense, thanks again.

@typhoonzero
Copy link

@xiang90 I've encountered the same issue now, according to the discussion, I need to update the version of my etcd right? Is there any workaround so I can get a quick fix?

@h1z3y3
Copy link

h1z3y3 commented Nov 2, 2020

mark

@patsevanton
Copy link

In which version is this issue fixed ?

@yuzhiquan
Copy link
Contributor

In which version is this issue fixed ?

3.2.10 later

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests