defrag stops the process completely for noticeable duration #9222
Yeah, we've been meaning to document this behavior. Do you want to add some notes here https://github.com/coreos/etcd/blob/master/Documentation/op-guide/maintenance.md#defragmentation? Thanks.
We probably want to change the behavior to be non-stop-the-world.
In the meantime,
This would be helpful. No objection.
@gyuho Thanks for the reference. We hit an issue with the stop-the-world behavior on some older version of etcd, so I need to revalidate with the latest version (I think we got some deadline exceeded errors). Currently, errors are supposed to be hidden by the internal retry mechanism, and clients should not get any errors from a process that is stopped for defrag. So if you run defrag on a single node at a time, you should not notice any problem. Is my understanding correct? In that case, from an operational perspective, it would be enough if we could see the actual usage of the db size.
It seems my assumption above is not true. I tested with etcd v3.1.14 servers (3-node cluster) and a test client app implemented with clientv3 from master (64713e5). The test client app simply spawns 4 goroutines that issue User apps most likely detect those errors and alert or start failure-state procedures. So I assume it would be great if we could improve the retry mechanism for deadline exceeded errors from a node.
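For reference, a minimal sketch of the kind of test client described above, assuming the four goroutines issue Put requests in a loop with a short per-request deadline (the exact operation is cut off in the comment above, so that part is an assumption). The import path shown is today's; at the time of this thread it was github.com/coreos/etcd/clientv3.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for j := 0; j < 10000; j++ {
				ctx, cancel := context.WithTimeout(context.Background(), time.Second)
				_, err := cli.Put(ctx, fmt.Sprintf("key-%d", id), fmt.Sprintf("val-%d", j))
				cancel()
				if err != nil {
					// While a member is defragmenting, these calls can fail
					// with "context deadline exceeded" instead of being
					// retried transparently.
					fmt.Printf("goroutine %d: put failed: %v\n", id, err)
				}
			}
		}(i)
	}
	wg.Wait()
}
```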
Put, as a mutable operation, won't be retried by our client balancer, to keep at-most-once semantics. But we could start with returning errors on in-progress defragmentation, so that users know that requests fail because of ongoing maintenance.
I see. It makes sense not to retry mutable operations. A new option like
As a first step, how about exposing the I checked bbolt, and it only exposes per-bucket info here: https://godoc.org/github.com/coreos/bbolt#Bucket.Stats. We need to aggregate all buckets to have a per-db view. Do you want to help with exposing the metrics?
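A minimal sketch of the aggregation being suggested, assuming the per-db in-use figure is just the sum of each top-level bucket's BranchInuse and LeafInuse from Bucket.Stats (etcd only uses top-level buckets, so nested buckets are not handled here). The import path matches the godoc link above; whether the eventual metric was computed exactly this way is not settled in this comment.

```go
package main

import (
	"fmt"

	bolt "github.com/coreos/bbolt"
)

// inUseBytes sums the bytes actually used by branch and leaf pages across
// all top-level buckets.  The on-disk file size also counts free pages,
// which only a defrag releases back to the filesystem.
func inUseBytes(db *bolt.DB) (int64, error) {
	var total int64
	err := db.View(func(tx *bolt.Tx) error {
		return tx.ForEach(func(name []byte, b *bolt.Bucket) error {
			s := b.Stats()
			total += int64(s.BranchInuse + s.LeafInuse)
			return nil
		})
	})
	return total, err
}

func main() {
	// Illustrative path; in practice this aggregation would run inside etcd
	// itself, not against a live member's file (bolt holds a file lock).
	db, err := bolt.Open("member/snap/db", 0600, &bolt.Options{ReadOnly: true})
	if err != nil {
		panic(err)
	}
	defer db.Close()

	n, err := inUseBytes(db)
	if err != nil {
		panic(err)
	}
	fmt.Printf("in-use bytes: %d\n", n)
}
```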
@xiang90 It would be a big help for operations. I'll take a look at the doc and see if I can add the metrics myself.
Created a PR. I did some tries and found that Since boltdb allocates pages before the existing pages get full, and it also manages branches and leaves separately, byte-level actual usage didn't make much sense.
@yudai I can see two ways to solve the stop-the-world problem.
1. Before a defrag starts, the etcd server needs to notify its clients and migrate them to other servers. After the defrag, the clients need to migrate back to the server for rebalancing purposes.
2. Defrag is basically a database file rewrite. Right now, to keep consistency between the old file and the one being rewritten, we stop accepting writes into the old file. To make the defrag concurrent, we need to either write to both the old and new files, or keep a copy of the new writes and flush them into the new file after the rewrite is finished.
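To make the second option concrete, a toy, self-contained sketch of the "keep a copy of the new writes and flush them afterwards" idea; it is not etcd code, and the maps merely stand in for the old and new database files.

```go
// Package defrag is a toy illustration only; none of these types exist in
// etcd.  The maps stand in for the old and new database files.
package defrag

import "sync"

type defragger struct {
	mu        sync.Mutex
	old       map[string]string // the "current" db file
	pending   map[string]string // writes accepted while the rewrite runs
	rewriting bool
}

// Put keeps serving writes during the rewrite, remembering them so they can
// be flushed into the new file afterwards.
func (d *defragger) Put(k, v string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.old[k] = v
	if d.rewriting {
		d.pending[k] = v
	}
}

// Defrag rewrites the file from a snapshot, then briefly locks to flush the
// buffered writes and switch over to the new file.
func (d *defragger) Defrag() {
	d.mu.Lock()
	d.rewriting = true
	d.pending = map[string]string{}
	snapshot := make(map[string]string, len(d.old))
	for k, v := range d.old {
		snapshot[k] = v
	}
	d.mu.Unlock()

	// Long-running rewrite against the snapshot; Put keeps working.
	newFile := make(map[string]string, len(snapshot))
	for k, v := range snapshot {
		newFile[k] = v
	}

	// Short critical section instead of a stop-the-world pause.
	d.mu.Lock()
	defer d.mu.Unlock()
	for k, v := range d.pending {
		newFile[k] = v
	}
	d.old = newFile
	d.rewriting = false
}
```

In a real storage engine the buffered writes would also have to preserve ordering and deletes; the toy ignores both.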
For option 1: do we have a control channel for server-initiated notifications now? If not, we might need to embed the notification in replies.
This sounds like a more ideal but more complicated solution. What I have in mind is that we can do something similar to virtual machine live migration for defrag, just like you mentioned. Technically it should be feasible, but I can imagine some precise work is required. I rather like option 2, because client implementations don't need to take care of the situation (I guess this kind of procedure is somewhat error-prone to implement). I assume we don't like fat clients.
#2 might not be all that hard to implement. While 100% defragmenting a bolt db file is easiest to implement by rewriting the entire file, there might be a 90%+ solution that just blocks writes (not reads) while performing a scan of the pages and applying some basic packing. To actually reduce the db file size would still require a subsequent stop-the-world operation to shrink the file, and a routine to recalculate the freelist as part of the file shrink. But that pause should be almost imperceptible, no longer than the current stop-the-world pauses that already happen to expand the file size. That step might require some careful thought, but it seems doable. I would love to get to the point where bolt keeps statistics on the db file and triggers these types of "defragmentation" operations automatically as needed. That said, generally having the ability that #1 provides (taking a member out of rotation for online maintenance) also sounds useful. So I'm in favor of doing both, at least eventually :) I'd be happy to spend some time on #2 since I'm already spun up on that code.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
Copying my comment from kubernetes/kubernetes#93280: An important optimization in etcd is the separation of log replication (i.e., Raft) and the state machine (i.e., key-value storage), where a slow apply in the state machine only affects the local key-value storage without blocking replication itself. This presents a set of challenges for the defrag operation:
In the short term, we may reduce the upper bound from 5k to a lower number. Also, we can use the consistent index to apply pending entries and safely survive restarts. Right now, the consistent index is only used for snapshot comparison and preventing re-applies. Please correct me if I am wrong.
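A rough, self-contained illustration of the re-apply protection described above; the type and field names are made up for the sketch and are not etcd's actual code.

```go
// Package replay is an illustrative sketch, not etcd's actual code: the
// consistent index records the Raft index of the last applied entry, so
// entries replayed after a restart can be skipped instead of re-applied.
package replay

type entry struct {
	Index uint64
	Data  []byte
}

type store struct {
	consistentIndex uint64 // persisted together with every apply
	kv              map[string]string
}

func (s *store) apply(e entry) {
	if e.Index <= s.consistentIndex {
		// Already applied before the restart; applying it again would
		// double-apply the mutation, so skip it.
		return
	}
	applyToKV(s.kv, e.Data)
	// In etcd the index update happens in the same backend transaction as
	// the apply itself, so the two never diverge.
	s.consistentIndex = e.Index
}

func applyToKV(kv map[string]string, data []byte) {
	// Decoding and applying the entry is elided in this sketch.
	_ = kv
	_ = data
}
```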
I have a question about the last bullet point of @gyuho's comment:
Are pending entries stored in memory, or do they get written to the raft log (disk)?
@cenkalti Pending entries are stored in memory, but they have also already been written to the WAL. Pending entries must be committed entries that have been persisted on a majority of the nodes in the cluster.
What I observed
The defrag command blocks all the requests handled by the process for a while when the process holds a lot of data and is receiving requests.
New requests and ongoing streaming requests such as Watch() can be canceled by their deadlines, and the cluster can become unstable.
Side effects
This "stop the world" behavior makes the operation of etcd clusters harder. We have the 8GB limitation of the DB size and need to control the DB size not to reach the border. However, clusters won't release DB space even after compaction until you run the defrag command. That means you cannot track this important metric without virtually stopping every etcd processe periodically.
(Btw, can we add a new metric for "actual DB usage"?)
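For operators landing on this issue, a small sketch of the "defrag one member at a time" workaround mentioned earlier in the thread, using clientv3's maintenance API; the endpoints are placeholders, and Status only reports the on-disk DbSize rather than the in-use size this paragraph asks about.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Placeholder endpoints; defragmenting one member at a time keeps the
	// other members available while each one pauses.
	endpoints := []string{"10.0.0.1:2379", "10.0.0.2:2379", "10.0.0.3:2379"}
	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	for _, ep := range endpoints {
		ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
		if st, err := cli.Status(ctx, ep); err == nil {
			// DbSize is the on-disk file size, not the in-use size asked
			// about above.
			fmt.Printf("%s: db size %d bytes before defrag\n", ep, st.DbSize)
		}
		// Defragment blocks this member's requests while the file is
		// rewritten, which is the behavior reported in this issue.
		if _, err := cli.Defragment(ctx, ep); err != nil {
			fmt.Printf("%s: defrag failed: %v\n", ep, err)
		}
		cancel()
	}
}
```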
CC/ @xiang90