defrag stops the process completely for noticeable duration #9222
Yeah, we've been meaning to document this behavior. Do you want to add some notes here https://github.com/coreos/etcd/blob/master/Documentation/op-guide/maintenance.md#defragmentation? Thanks.
We probably want to change the behavior to be non-stop-the-world.
In the meantime,
This would be helpful. No objection.
@gyuho Thanks for the reference. We hit an issue with the stop-the-world behavior on some older version of etcd, so I need to revalidate with the latest version (I think we got some deadline exceeded errors). Currently, errors are supposed to be hidden by the internal retry mechanism, and clients should not get any errors from a process that is stopped for defrag. So if you run defrag on a single node at a time, you should not notice any problem. Is my understanding correct? In that case, from an operational perspective, it would be enough if we could see the actual usage of the db size.
It seems my assumption above is not true. I tested with etcd v3.1.14 servers (3-node cluster) and a test client app implemented with clientv3 from master (64713e5). The test client app simply spawns 4 goroutines that issue User apps most likely detect those errors and alert or start failure-state procedures. So I assume it would be great if we could improve the retry mechanism for deadline exceeded errors from a node.
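For reference, a minimal sketch of the kind of test client described above, assuming the four goroutines issue Put requests in a loop with a short per-request deadline (the exact operation is cut off in the comment above, so that part is an assumption). The import path shown is today's; at the time of this thread it was github.com/coreos/etcd/clientv3.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for j := 0; j < 10000; j++ {
				ctx, cancel := context.WithTimeout(context.Background(), time.Second)
				_, err := cli.Put(ctx, fmt.Sprintf("key-%d", id), fmt.Sprintf("val-%d", j))
				cancel()
				if err != nil {
					// While a member is defragmenting, these calls can fail
					// with "context deadline exceeded" instead of being
					// retried transparently.
					fmt.Printf("goroutine %d: put failed: %v\n", id, err)
				}
			}
		}(i)
	}
	wg.Wait()
}
```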
Put, as a mutable operation, won't be retried by our client balancer, to keep at-most-once semantics. But we could start with returning errors on in-progress defragmentation, so that users know that requests fail because of ongoing maintenance.
I see. It makes sense not to retry mutable operations. A new option like
As a first step, how about exposing the I checked bbolt, and it only exposes per-bucket info here: https://godoc.org/github.com/coreos/bbolt#Bucket.Stats. We need to aggregate all buckets to have a per-db view. Do you want to help with exposing the metrics?
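A minimal sketch of the aggregation being suggested, assuming the per-db in-use figure is just the sum of each top-level bucket's BranchInuse and LeafInuse from Bucket.Stats (etcd only uses top-level buckets, so nested buckets are not handled here). The import path matches the godoc link above; whether the eventual metric was computed exactly this way is not settled in this comment.

```go
package main

import (
	"fmt"

	bolt "github.com/coreos/bbolt"
)

// inUseBytes sums the bytes actually used by branch and leaf pages across
// all top-level buckets.  The on-disk file size also counts free pages,
// which only a defrag releases back to the filesystem.
func inUseBytes(db *bolt.DB) (int64, error) {
	var total int64
	err := db.View(func(tx *bolt.Tx) error {
		return tx.ForEach(func(name []byte, b *bolt.Bucket) error {
			s := b.Stats()
			total += int64(s.BranchInuse + s.LeafInuse)
			return nil
		})
	})
	return total, err
}

func main() {
	// Illustrative path; in practice this aggregation would run inside etcd
	// itself, not against a live member's file (bolt holds a file lock).
	db, err := bolt.Open("member/snap/db", 0600, &bolt.Options{ReadOnly: true})
	if err != nil {
		panic(err)
	}
	defer db.Close()

	n, err := inUseBytes(db)
	if err != nil {
		panic(err)
	}
	fmt.Printf("in-use bytes: %d\n", n)
}
```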
@xiang90 It would be a big help for operations. I'll take a look at the doc and see if I can add the metrics myself.
Created a PR. I did some tries and found that Since boltdb allocates pages before the existing pages get full, and it also manages branches and leaves separately, byte-level actual usage didn't make much sense.
@yudai I can see two ways to solve the stop-the-world problem.
1. Before a defrag starts, the etcd server needs to notify its clients and migrate them to other servers. After the defrag, the clients need to migrate back to the server for rebalancing purposes.
2. Defrag is basically a database file rewrite. Right now, to keep consistency between the old file and the one being rewritten, we stop accepting writes into the old file. To make the defrag concurrent, we need to either write to both the old and new files, or keep a copy of the new writes and flush them into the new file after the rewrite is finished.
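To make the second option concrete, a toy, self-contained sketch of the "keep a copy of the new writes and flush them afterwards" idea; it is not etcd code, and the maps merely stand in for the old and new database files.

```go
// Package defrag is a toy illustration only; none of these types exist in
// etcd.  The maps stand in for the old and new database files.
package defrag

import "sync"

type defragger struct {
	mu        sync.Mutex
	old       map[string]string // the "current" db file
	pending   map[string]string // writes accepted while the rewrite runs
	rewriting bool
}

// Put keeps serving writes during the rewrite, remembering them so they can
// be flushed into the new file afterwards.
func (d *defragger) Put(k, v string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.old[k] = v
	if d.rewriting {
		d.pending[k] = v
	}
}

// Defrag rewrites the file from a snapshot, then briefly locks to flush the
// buffered writes and switch over to the new file.
func (d *defragger) Defrag() {
	d.mu.Lock()
	d.rewriting = true
	d.pending = map[string]string{}
	snapshot := make(map[string]string, len(d.old))
	for k, v := range d.old {
		snapshot[k] = v
	}
	d.mu.Unlock()

	// Long-running rewrite against the snapshot; Put keeps working.
	newFile := make(map[string]string, len(snapshot))
	for k, v := range snapshot {
		newFile[k] = v
	}

	// Short critical section instead of a stop-the-world pause.
	d.mu.Lock()
	defer d.mu.Unlock()
	for k, v := range d.pending {
		newFile[k] = v
	}
	d.old = newFile
	d.rewriting = false
}
```

In a real storage engine the buffered writes would also have to preserve ordering and deletes; the toy ignores both.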
For option 1: do we have a control channel for server-initiated notifications now? If not, we might need to embed the notification in replies.
This sounds like a more ideal but more complicated solution. What I have in mind is that we can do something similar to virtual machine live migration for defrag, just like you mentioned. Technically it should be feasible, but I can imagine some precise work is required. I rather like option 2, because client implementations don't need to take care of the situation (I guess this kind of procedure is somewhat error-prone to implement). I assume we don't like fat clients.
#2 might not be all that hard to implement. While 100% defragmenting a bolt db file is easiest to implement by rewriting the entire file, there might be a 90%+ solution that just blocks writes (not reads) while performing a scan of the pages and applying some basic packing. To actually reduce the db file size would still require a subsequent stop-the-world operation to shrink the file, and a routine to recalculate the freelist as part of the file shrink. But that pause should be almost imperceptible, no longer than the current stop-the-world pauses that already happen to expand the file size. That step might require some careful thought, but it seems doable. I would love to get to the point where bolt keeps statistics on the db file and triggers these types of "defragmentation" operations automatically as needed. That said, generally having the ability that #1 provides (taking a member out of rotation for online maintenance) also sounds useful. So I'm in favor of doing both, at least eventually :) I'd be happy to spend some time on #2 since I'm already spun up on that code.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
Copying my comment from kubernetes/kubernetes#93280: An important optimization in etcd is the separation of log replication (i.e., Raft) and the state machine (i.e., key-value storage), where a slow apply in the state machine only affects the local key-value storage without blocking replication itself. This presents a set of challenges for the defrag operation:
In the short term, we may reduce the upper bound from 5k to a lower number. Also, we can use the consistent index to apply pending entries and safely survive restarts. Right now, the consistent index is only used for snapshot comparison and preventing re-applies. Please correct me if I am wrong.
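A rough, self-contained illustration of the re-apply protection described above; the type and field names are made up for the sketch and are not etcd's actual code.

```go
// Package replay is an illustrative sketch, not etcd's actual code: the
// consistent index records the Raft index of the last applied entry, so
// entries replayed after a restart can be skipped instead of re-applied.
package replay

type entry struct {
	Index uint64
	Data  []byte
}

type store struct {
	consistentIndex uint64 // persisted together with every apply
	kv              map[string]string
}

func (s *store) apply(e entry) {
	if e.Index <= s.consistentIndex {
		// Already applied before the restart; applying it again would
		// double-apply the mutation, so skip it.
		return
	}
	applyToKV(s.kv, e.Data)
	// In etcd the index update happens in the same backend transaction as
	// the apply itself, so the two never diverge.
	s.consistentIndex = e.Index
}

func applyToKV(kv map[string]string, data []byte) {
	// Decoding and applying the entry is elided in this sketch.
	_ = kv
	_ = data
}
```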
I have a question about the last bullet point of @gyuho's comment:
Are pending entries stored in memory, or do they get written to the raft log (disk)?
@cenkalti Pending entries are stored in memory, but they have also already been written to the WAL. Pending entries must be committed entries that have been persisted on a majority of the nodes in the cluster.
What I observed
The defrag command blocks all the requests handled by the process for a while when the process holds a lot of data and is receiving requests.
New requests and ongoing streaming requests such as Watch() can be canceled by their deadlines, and the cluster can become unstable.
Side effects
This "stop the world" behavior makes the operation of etcd clusters harder. We have the 8GB limitation of the DB size and need to control the DB size not to reach the border. However, clusters won't release DB space even after compaction until you run the defrag command. That means you cannot track this important metric without virtually stopping every etcd processe periodically.
(Btw, can we add a new metric for "actual DB usage"?)
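For operators landing on this issue, a small sketch of the "defrag one member at a time" workaround mentioned earlier in the thread, using clientv3's maintenance API; the endpoints are placeholders, and Status only reports the on-disk DbSize rather than the in-use size this paragraph asks about.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Placeholder endpoints; defragmenting one member at a time keeps the
	// other members available while each one pauses.
	endpoints := []string{"10.0.0.1:2379", "10.0.0.2:2379", "10.0.0.3:2379"}
	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	for _, ep := range endpoints {
		ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
		if st, err := cli.Status(ctx, ep); err == nil {
			// DbSize is the on-disk file size, not the in-use size asked
			// about above.
			fmt.Printf("%s: db size %d bytes before defrag\n", ep, st.DbSize)
		}
		// Defragment blocks this member's requests while the file is
		// rewritten, which is the behavior reported in this issue.
		if _, err := cli.Defragment(ctx, ep); err != nil {
			fmt.Printf("%s: defrag failed: %v\n", ep, err)
		}
		cancel()
	}
}
```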
CC/ @xiang90