defrag stops the process completely for noticeable duration #9222

Open
yudai opened this issue Jan 24, 2018 · 21 comments

Comments

@yudai
Contributor

yudai commented Jan 24, 2018

What I observed

The defrag command blocks all requests handled by the process for a while when the process has a lot of data and incoming requests.
New requests and ongoing streaming requests such as Watch() can be canceled by their deadlines, and the cluster can become unstable.

Side effects

This "stop the world" behavior makes operating etcd clusters harder. We have the 8GB limit on the DB size and need to keep the DB size from reaching that boundary. However, clusters won't release DB space even after compaction until you run the defrag command. That means you cannot track this important metric without effectively stopping every etcd process periodically.

(Btw, can we add a new metric for "actual DB usage"?)

CC/ @xiang90

@gyuho
Contributor

gyuho commented Jan 24, 2018

Yeah, we've been meaning to document this behavior.

Do you want to add some notes here https://github.com/coreos/etcd/blob/master/Documentation/op-guide/maintenance.md#defragmentation?

Thanks.

@xiang90
Contributor

xiang90 commented Jan 24, 2018

@gyuho

we probably want to change the behavior to non-stop-the-world.

@gyuho
Contributor

gyuho commented Jan 24, 2018

In the meantime,

(Btw, can we add a new metric for "actual DB usage"?)

This would be helpful. No objection.

@xiang90
Contributor

xiang90 commented Jan 24, 2018

@yudai @gyuho

hm... i am not sure if boltdb exposes those stats itself. need to double check.

@yudai
Contributor Author

yudai commented Jan 25, 2018

@gyuho Thanks for the reference.

We hit an issue with the stop-the-world behavior on an older version of etcd, so I need to revalidate with the latest version. (I think we got some deadline exceeded errors.)

Currently, errors are supposed to be hidden by the internal retry mechanism, and clients should not get any errors from a process that is stopped for defrag. So if you run defrag on a single node at a time, you should not notice any problem. Is my understanding correct?

In that case, from an operational perspective, being able to see the actual usage of the db would be enough.

@yudai
Contributor Author

yudai commented Jan 25, 2018

It seems my assumption above is not true.

I tested with etcd v3.1.14 servers (3-node cluster) and a test client app implemented with clientv3 from master (64713e5). The test client app simply spawns 4 goroutines that issue Put() requests every 10 milliseconds.
I triggered defrag on a node and it took a while, around 8 seconds (I'm using an HDD for this env), and some requests simply returned context deadline exceeded errors. So they are not hidden by the retry mechanism.
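For reference, a minimal sketch of such a test client (the endpoints and timeout values below are placeholders, not the exact ones I used):

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379", "127.0.0.1:22379", "127.0.0.1:32379"}, // placeholder endpoints
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ { // 4 writer goroutines
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for seq := 0; ; seq++ {
				ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
				_, err := cli.Put(ctx, fmt.Sprintf("key-%d", id), fmt.Sprintf("val-%d", seq))
				cancel()
				if err != nil {
					// While one member runs defrag, this shows up as
					// "context deadline exceeded" instead of being retried.
					fmt.Printf("goroutine %d: put error: %v\n", id, err)
				}
				time.Sleep(10 * time.Millisecond) // one Put every 10ms per goroutine
			}
		}(i)
	}
	wg.Wait()
}
```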

User apps will most likely detect those errors and alert or start failure-handling procedures. So I assume it would be great if we could improve the retry mechanism for deadline exceeded errors from a single node.

@gyuho
Contributor

gyuho commented Jan 25, 2018

The test client app simply spawns 4 goroutines that issue Put() requests every 10 milliseconds.

Put, as a mutable operation, won't be retried by our client balancer, to keep at-most-once semantics. But we could start with returning errors on in-progress defragmentation, so that users know that requests are failing because of ongoing maintenance.

@yudai
Contributor Author

yudai commented Jan 25, 2018

I see. It makes sense not to retry mutable operations.

A new option like WithRetryMutable() could be another solution. I think it's easier to add such an option to Put() calls than to triage errors in my own code, which feels more error prone.
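To illustrate the error triaging I'd like to avoid, the application-side workaround today looks roughly like this (just a sketch; the helper name, the backoff, and the error matching are mine, and retrying a Put this way gives up at-most-once semantics):

```go
import (
	"context"
	"strings"
	"time"

	"github.com/coreos/etcd/clientv3"
)

// putWithRetry retries a Put when the failure looks like a timeout caused by
// a temporarily unresponsive member (e.g. one running defrag).
func putWithRetry(cli *clientv3.Client, key, val string, attempts int) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		_, err := cli.Put(ctx, key, val)
		cancel()
		if err == nil {
			return nil
		}
		lastErr = err
		// Depending on the client version, the timeout may surface as
		// context.DeadlineExceeded or as a gRPC status error.
		if err != context.DeadlineExceeded && !strings.Contains(err.Error(), "deadline exceeded") {
			return err // not a timeout-type error; don't retry
		}
		time.Sleep(time.Duration(i+1) * 100 * time.Millisecond) // simple backoff
	}
	return lastErr
}
```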

@xiang90
Contributor

xiang90 commented Jan 26, 2018

@yudai

as a first step, how about exposing the in-use/allocated stats of the database? if that ratio is low, a defrag is probably needed.

i checked bbolt, and it only exposes per bucket info here: https://godoc.org/github.com/coreos/bbolt#Bucket.Stats. we need to aggregate all buckets to have a per db view.

do you want to help with exposing the metrics?
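Something along these lines could aggregate the per-bucket stats into a per-db view (only a sketch against the bbolt API; summing BranchInuse/LeafInuse is an assumption about what is worth exposing):

```go
import (
	bolt "github.com/coreos/bbolt"
)

// dbUsage sums per-bucket page usage to approximate how many bytes of the
// database file are actually in use, and reports the file size alongside it.
func dbUsage(path string) (inuse, total int64, err error) {
	db, err := bolt.Open(path, 0600, &bolt.Options{ReadOnly: true})
	if err != nil {
		return 0, 0, err
	}
	defer db.Close()

	err = db.View(func(tx *bolt.Tx) error {
		total = tx.Size() // db file size as seen by this transaction
		return tx.ForEach(func(name []byte, b *bolt.Bucket) error {
			s := b.Stats()
			// BranchInuse/LeafInuse are bytes actually used in branch and leaf pages.
			inuse += int64(s.BranchInuse) + int64(s.LeafInuse)
			return nil
		})
	})
	return inuse, total, err
}
```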

@yudai
Contributor Author

yudai commented Jan 27, 2018

@xiang90 It would be a big help for operations. I'll take a look at the doc and see if I can add the metrics myself.

@yudai
Contributor Author

yudai commented Jan 30, 2018

Created a PR.

I did some experiments and found that Stats.FreePageN is a good indicator of free space. You can calculate page-level actual usage using this value. PendingPageN is also part of the db's internal free list. However, it seems those pages are not actually free (for rollback? you always see 3 or 4 in this metric anyway), so I didn't include that value in the calculation.

Since boltdb allocates pages before the existing pages get full, and it also manages branches and leaves separately, byte-level actual usage didn't make much sense; in-use/allocated does not reach 1 in any case.
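The page-level calculation boils down to something like this (a sketch; the helper name is mine and error handling is minimal):

```go
import (
	"os"

	bolt "github.com/coreos/bbolt"
)

// inUseBytes estimates page-level "actual usage" of the db file:
// total file size minus the space held by pages on the freelist.
func inUseBytes(db *bolt.DB) (int64, error) {
	fi, err := os.Stat(db.Path())
	if err != nil {
		return 0, err
	}
	freeBytes := int64(db.Stats().FreePageN) * int64(db.Info().PageSize)
	// PendingPageN is intentionally left out: those pages are still held
	// by the freelist for in-flight transactions (see above).
	return fi.Size() - freeBytes, nil
}
```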

@xiang90
Contributor

xiang90 commented Feb 5, 2018

@yudai I can see two ways to solve the stop-the-world problem.

  1. automatic client migration before and after the defrag

Before a defrag starts, the etcd server needs to notify its clients and migrate them to other servers. After the defrag, the clients need to migrate back to the server for rebalancing purposes.

  2. make defrag concurrent internally

defrag is basically a database file rewrite. right now, to keep the old file and the one being rewritten consistent, we stop accepting writes into the old file. To make defrag concurrent, we need to either write to both the old and the new file, or keep a copy of the new writes and flush them into the new one after the rewrite is finished.
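A toy illustration of the second variant (buffer the new writes during the rewrite and replay them at the end), ignoring deletes, reads, and error handling; in real bolt the cheap snapshot would be a read-only transaction rather than a map copy:

```go
import "sync"

type kvPair struct{ k, v string }

// store is a stand-in for the backend: data plays the role of the db file.
type store struct {
	mu        sync.Mutex
	data      map[string]string
	buffering bool
	buffer    []kvPair // writes that arrive while the rewrite is running
}

func (s *store) Put(k, v string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[k] = v
	if s.buffering {
		s.buffer = append(s.buffer, kvPair{k, v})
	}
}

// Defrag snapshots the current contents, runs the expensive rewrite without
// blocking Put(), then briefly locks to replay buffered writes and swap.
func (s *store) Defrag(rewrite func(snapshot map[string]string) map[string]string) {
	s.mu.Lock()
	s.buffering = true
	snap := make(map[string]string, len(s.data))
	for k, v := range s.data {
		snap[k] = v
	}
	s.mu.Unlock()

	fresh := rewrite(snap) // the expensive "write the new file" step runs concurrently with Puts

	s.mu.Lock() // short pause instead of a full stop-the-world
	for _, w := range s.buffer {
		fresh[w.k] = w.v // catch up on writes made during the rewrite
	}
	s.data, s.buffering, s.buffer = fresh, false, nil
	s.mu.Unlock()
}
```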

@xiang90
Contributor

xiang90 commented Feb 5, 2018

/cc @jpbetz @wojtek-t

@yudai
Contributor Author

yudai commented Feb 5, 2018

@xiang90

automatic client migration before and after the defrag

Do we have a control channel for server-initiated notifications now? If not, we might need to embed the notification in replies.

make defrag concurrent internally

This sounds like a more ideal but more complicated solution. What I have in mind is that we can do something similar to virtual machine live migration for defrag, just like you mentioned. Technically it should be feasible, but I can imagine some precise work is required.
Or we might add something like a "forwarding mode", which makes the process redirect incoming requests to other processes, and activate it while running defrag? A process would work as a proxy while it's unresponsive due to defrag.

I rather like option 2, because client implementations don't need to take care of the situation (I guess this kind of procedure is somewhat error prone to implement). I assume we don't like fat clients.

@jpbetz
Contributor

jpbetz commented Feb 6, 2018

#2 might not be all that hard to implement.

While defragmenting a bolt db file 100% is easiest to implement by rewriting the entire file, there might be a 90%+ solution that just blocks writes (not reads) while performing a scan of the pages and applying some basic packing. Actually reducing the db file size would still require a subsequent stop-the-world operation to shrink the file, and a routine to recalculate the freelist as part of the file shrink. But that pause should be almost imperceptible, no longer than the current stop-the-world pauses that already happen to expand the file size. That step might require some careful thought, but it seems doable. I would love to get to the point where bolt keeps statistics on the db file and triggers these types of "defragmentation" operations automatically as needed.

That said, generally having the ability that #1 provides (taking a member out of rotation for online maintenance) also sounds useful.

So I'm in favor of doing both, at least eventually :) I'd be happy to spend some time on #2 since I'm already spun up on that code.

@stale

stale bot commented Apr 7, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 7, 2020
@stale stale bot closed this as completed Apr 28, 2020
@gyuho gyuho reopened this Jul 24, 2020
@stale stale bot removed the stale label Jul 24, 2020
@gyuho gyuho self-assigned this Jul 24, 2020
@gyuho
Contributor

gyuho commented Jul 24, 2020

Copying my comment from kubernetes/kubernetes#93280

An important optimization in etcd is the separation of log replication (i.e., Raft) and the state machine (i.e., key-value storage), where a slow apply in the state machine only affects the local key-value storage without blocking replication itself. This presents a set of challenges for the defrag operation:

  • A write request to "any" node (leader or follower) incurs a proposal message to the leader for log replication.
  • The log replication proceeds regardless of the local storage state: the committed index may increment while the applied index stalls (e.g. due to slow apply or defragmentation). The state machine still accepts proposals.
  • As a result, the leader's applied index may fall behind a follower's, or vice versa.
  • The divergence stops when the gap between the applied and committed index reaches 5,000: mutable requests fail with a "too many requests" error, while immutable requests may succeed.
  • Worse, if the slow apply happens due to defrag, all immutable requests block as well.
  • Even worse, if the server with pending applies crashes, volatile state may be lost, which violates durability. Raft advances without waiting for the state machine layer to apply.

In the short term, we may reduce the upper bound from 5k to a lower number. Also, we can use the consistent index to apply pending entries and safely survive restarts. Right now, the consistent index is only used for snapshot comparison and preventing re-applies.
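For context, the 5,000 bound above corresponds to a backpressure check in etcdserver roughly along these lines (paraphrased from memory, not a verbatim copy of the real code, which returns rpctypes' "too many requests" error):

```go
import "errors"

// maxGapBetweenApplyAndCommitIndex mirrors the bound described above.
const maxGapBetweenApplyAndCommitIndex = 5000

// checkRequestGap rejects new proposals when the state machine (applied index)
// has fallen too far behind the raft log (committed index).
func checkRequestGap(committedIndex, appliedIndex uint64) error {
	if committedIndex > appliedIndex+maxGapBetweenApplyAndCommitIndex {
		return errors.New("etcdserver: too many requests")
	}
	return nil
}
```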

Please correct me if I am wrong.

@stale

stale bot commented Oct 22, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Oct 22, 2020
@stale stale bot closed this as completed Nov 13, 2020
@gyuho gyuho reopened this Nov 14, 2020
@stale stale bot removed the stale label Nov 14, 2020
@stale

stale bot commented Feb 12, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@cenkalti
Member

I have a question about the last bullet point of @gyuho's comment:

Even worse, if the server with pending applies crashes, volatile state may be lost, which violates durability. Raft advances without waiting for the state machine layer to apply.

Are pending entries stored in memory, or do they get written to the raft log (disk)?

@chaochn47
Member

@cenkalti Pending entries are stored in memory, but they have also already been written to the WAL. Pending entries must be committed entries, which have been persisted on a majority of the nodes in the cluster.
