storage: Raft entry cache grows (inefficiently) to huge size #13231

Closed
bdarnell opened this issue Jan 30, 2017 · 20 comments

@bdarnell (Contributor)

Each store has a cache for raft entries, configured by default to be 16MB. On gamma, we see that on one node (at a time), we have over a gigabyte of raft entries in memory. I believe that not all of these entries are in the cache, but they are all being held in place by the cache, because Replica.Entries allocates one large array of raftpb.Entry objects, so that a reference to any one of them keeps the whole array alive.
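
A minimal Go sketch (not CockroachDB code; the entry type stands in for raftpb.Entry) of the retention behavior described above: a pointer to any one element of a large slice keeps the entire backing array reachable, because the GC cannot free part of a single allocation.

package main

import "fmt"

// entry stands in for raftpb.Entry; only the allocation behavior matters here.
type entry struct {
	Index uint64
	Data  []byte
}

func main() {
	// One large allocation backing ~1M entries, as a single call that loads
	// the whole raft log might produce.
	ents := make([]entry, 1<<20)

	// A pointer to a single element pins the whole backing array; the GC
	// cannot collect part of an allocation.
	pinned := &ents[42]

	ents = nil // dropping the slice header frees nothing while pinned is live
	fmt.Println(pinned.Index)
}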

Additionally, whenever we load a large array of entries, we try to add them all to the cache, inserting them one by one and then evicting all but the last 16MB. This is actually the bigger concern in the gamma cluster at this time, since the process takes long enough (and blocks the server) that the node loses leases and never makes any progress.

Why do we load this monolithic block of entries? Whenever a new node becomes leader, it loads all uncommitted entries to see if there are any config changes. This is the one place where we load raft entries without any chunking, and it would be easy to add chunking here. In addition, we may want to change the behavior of the raft.Entries method to break up the arrays it uses when adding entries to the cache (trading off the allocation overhead of smaller allocations against the wasted memory of sibling array entries). Finally, this problem is also a result of our inability to throttle incoming raft entries: the log appears to keep growing faster than we can process it.

@bdarnell bdarnell self-assigned this Jan 30, 2017
@bdarnell (Contributor, Author)

For the record, the range this is affecting is an ordinary data range in the block_writer table, not a special system range. One range has over 2M raft log entries, compared to under 100 for other ranges. If the cluster had survived a little longer, this range could have split and the problem could have been distributed over more nodes. The only theory I have for why this problem started appearing on gamma is that it has been running binaries with modified compaction settings, which might have slowed things down once the cluster reached a certain size (without letting it survive long enough to split enough times).

@petermattis (Collaborator)

Reading all of the Raft log just to count the pending config changes is a bit silly. One short-term fix would be to special-case Entries for maxBytes == math.MaxUint64: when that is true, we could avoid caching entries and only return entries with Type == EntryConfChange. This is a bit of a hack, though. It would be nicer if there were an explicit NumberPendingConf method on the raft.Storage interface.
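
A rough sketch of that special case, in hedged form: the Replica receiver and the iterateLogEntries / entriesWithCache helpers below are illustrative placeholders, not CockroachDB's actual Replica.Entries code.

// entries is a stand-in for the raft.Storage Entries implementation.
func (r *Replica) entries(lo, hi, maxBytes uint64) ([]raftpb.Entry, error) {
	if maxBytes == math.MaxUint64 {
		// The caller is reading "everything" (e.g. a new leader counting
		// pending config changes): scan the log without populating the
		// raftEntryCache and return only the conf-change entries.
		var confEnts []raftpb.Entry
		err := r.iterateLogEntries(lo, hi, func(ent raftpb.Entry) error {
			if ent.Type == raftpb.EntryConfChange {
				confEnts = append(confEnts, ent)
			}
			return nil
		})
		return confEnts, err
	}
	// Normal path: read from storage, add to the raftEntryCache, honor maxBytes.
	return r.entriesWithCache(lo, hi, maxBytes)
}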

@bdarnell (Contributor, Author)

I was wrong about this being the only place we load an unbounded number of entries at once - we also do it with committed entries that need to be applied. Changing that will be a bit trickier, since the code currently assumes that after a Ready cycle the new applied index can be set to the previous commit index. I don't think there's any deep obstacle to changing this, though.

@petermattis (Collaborator)

> I don't think there's any deep obstacle to changing this, though.

This is good to change in any case; I've noticed in the past that processing a large number of Raft entries in a single Ready call can take a significant amount of time, causing missed heartbeats and other fun.

@petermattis (Collaborator)

Unless I'm missing something, raftEntryCache.addEntries adds entries individually. While a slice is passed to addEntries, we don't hold a reference to that slice, only to copies of the individual raftpb.Entry structs.
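
A minimal sketch of that point (this is not the real raftEntryCache; the key type and map are illustrative, and raftpb is the etcd raft protobuf package): entries are stored by value, keyed by index, so the caller's []raftpb.Entry backing array is not retained once addEntries returns.

package sketch

import "github.com/coreos/etcd/raft/raftpb"

// entryCacheKey and entryCache are stand-ins; only the copy semantics matter.
type entryCacheKey struct {
	rangeID uint64
	index   uint64
}

type entryCache struct {
	entries map[entryCacheKey]raftpb.Entry
}

func (c *entryCache) addEntries(rangeID uint64, ents []raftpb.Entry) {
	for _, ent := range ents {
		// Each raftpb.Entry is copied by value into the map. Nothing keeps a
		// reference to the ents slice itself, so its backing array can be
		// collected even if the cached copies survive.
		c.entries[entryCacheKey{rangeID, ent.Index}] = ent
	}
}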

I think we should fix both of the places in etcd/raft where an unbounded number of log entries are read, but after re-familiarizing myself with the raft code neither change seems trivial. Mucking up logic at the Raft level is a recipe for last-minute instability. I'm punting any change to early in the 1.1 cycle.

@petermattis petermattis modified the milestones: 1.1, 1.0 Apr 19, 2017
@petermattis (Collaborator)

It is too late in the cycle to consider fixing etcd/raft. Punting this to 1.2.

@dianasaur323 or @nstewart This is a small-to-medium-sized stability task.

@petermattis petermattis modified the milestones: 1.2, 1.1 Aug 23, 2017
@dianasaur323 (Contributor)

kk, added to Airtable and marked you as an owner :). It's currently sized as a medium, to be conservative.

@a6802739 (Contributor)

@bdarnell @petermattis could I take a stab at this? I'm very interested in this part. :)

@petermattis (Collaborator)

@a6802739 Go ahead, but it's probably worthwhile to sketch out your plan before implementing, as you'll likely need to modify etcd/raft.

@a6802739 (Contributor) commented Sep 4, 2017

@petermattis, thank you very much.

Here is my understanding; I'm not sure if it's accurate.

When we load an unbounded number of Raft entries from the Raft state machine, we first add them all to the cache, and then the cache evicts most of them because they exceed maxBytes?

So what we want to do is chunk ready.Entries in etcd/raft to make sure each chunk doesn't exceed maxBytes?

And if we modify etcd/raft, how should we submit the change?

@petermattis (Collaborator)

> Here is my understanding; I'm not sure if it's accurate.
>
> When we load an unbounded number of Raft entries from the Raft state machine, we first add them all to the cache, and then the cache evicts most of them because they exceed maxBytes?
>
> So what we want to do is chunk ready.Entries in etcd/raft to make sure each chunk doesn't exceed maxBytes?

I'm not sure what you mean by chunking ready.Entries. We can't easily change the API for raft.Storage.Entries without breaking compatibility.

There are 4 places where Raft reads an unbounded number of log entries:

  • raft.becomeLeader: the leader checks that none of the pending entries are a configuration change. Rather than reading the entries, we could extend raft.Storage to allow retrieving the pending configuration change count directly. We'd still have to read the entries, but could avoid inserting them into the entry cache, and the read could be done without consuming large amounts of memory (see the sketch after this list).
  • raft.step upon receiving a MsgHup: similar to becomeLeader, Raft needs a count of the pending configuration changes, not the entries themselves.
  • raftLog.nextEnts: called from RawNode.Ready to get the committed entries that need to be applied. In this case, we'd want to retrieve and apply entries in batches.
  • raftLog.allEntries: this is only called from tests.
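
For the first two bullets, a hedged sketch of what such a raft.Storage extension could look like; none of this exists in etcd/raft, and the interface name, method, and fallback below are assumptions.

package sketch

import (
	"math"

	"github.com/coreos/etcd/raft"
	"github.com/coreos/etcd/raft/raftpb"
)

// PendingConfCounter is a hypothetical optional extension of raft.Storage for
// implementations that can count pending config changes without materializing
// (or caching) every entry.
type PendingConfCounter interface {
	raft.Storage
	// NumPendingConf returns the number of EntryConfChange entries in
	// (committed, lastIndex].
	NumPendingConf(committed uint64) (int, error)
}

// numPendingConf is roughly how becomeLeader / MsgHup handling could use it.
// (In the real raft code the count would also have to cover unstable entries,
// which this sketch ignores.)
func numPendingConf(s raft.Storage, committed, last uint64) (int, error) {
	if pc, ok := s.(PendingConfCounter); ok {
		return pc.NumPendingConf(committed)
	}
	if committed >= last {
		return 0, nil
	}
	// Fallback: today's behavior, reading every pending entry at once.
	ents, err := s.Entries(committed+1, last+1, math.MaxUint64)
	if err != nil {
		return 0, err
	}
	n := 0
	for _, e := range ents {
		if e.Type == raftpb.EntryConfChange {
			n++
		}
	}
	return n, nil
}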

> And if we modify etcd/raft, how should we submit the change?

We have to submit a patch upstream to the etcd/raft repository. Changing the Raft code needs to be done with care in order to maintain backward compatibility. Fair warning: this isn't a very easy change to make.

@bdarnell (Contributor, Author) commented Sep 5, 2017

This issue is less critical than it used to be, thanks to the introduction of quotaPool. Now we ensure that the uncommitted tail of the raft log is never larger than 1MiB (unless there is a single entry that large). So while it would be nice in principle to fix up the places where etcd/raft attempts to load all the entries into memory at once, it's not a very pressing concern for CockroachDB.

That said, I think I'd address the raft.becomeLeader and raft.step (after MsgHup) cases without modifying raft.Storage. We could just replace the calls to raftLog.Slice(..., noLimit) with a loop that makes a series of calls with a reasonable limit.

raftLog.nextEnts is the tricky one. It will require analysis of the raft code to determine whether it's safe to return partial results here. If it is, then it's easy and we just need to set a limit here. If not, we may need to make some changes to the way we generate the Ready struct. My specific concern here is what happens when HardState.Commit is higher than the last CommittedEntry.
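
For reference, a sketch of a size-limited variant of nextEnts; the maxSize parameter is the change under discussion, not an existing etcd/raft API, and apart from replacing noLimit the body mirrors the current method. A partial batch here is exactly the case where HardState.Commit runs ahead of the last entry in Ready.CommittedEntries.

// nextEnts today returns everything in (applied, committed]; the only change
// sketched here is passing a caller-supplied maxSize instead of noLimit.
func (l *raftLog) nextEnts(maxSize uint64) (ents []pb.Entry) {
	off := max(l.applied+1, l.firstIndex())
	if l.committed+1 > off {
		ents, err := l.slice(off, l.committed+1, maxSize)
		if err != nil {
			l.logger.Panicf("unexpected error when getting unapplied entries (%v)", err)
		}
		return ents
	}
	return nil
}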

Finally, this issue was originally about some inefficiencies in the raftEntryCache. My main concern, that references into a shared entry slice would keep the whole backing array alive, appears to be mistaken (per Peter's comment #13231 (comment)), but it's still somewhat wasteful that we add entries to the cache only to evict them immediately. I'm not sure it's worth doing anything there since the quotaPool limits the problem.

@a6802739 (Contributor) commented Sep 5, 2017

@petermattis, thank you for your explanation.

> raft.becomeLeader: the leader checks that none of the pending entries are a configuration change. Rather than reading the entries, we could extend raft.Storage to allow retrieving the pending configuration change count directly. We'd still have to read the entries, but could avoid inserting them into the entry cache, and the read could be done without consuming large amounts of memory.
>
> raft.step upon receiving a MsgHup: similar to becomeLeader, Raft needs a count of the pending configuration changes, not the entries themselves.

From my understanding, this means we should add a NumberOfPendingConf method to raft.Storage? And we would still call Replica.Entries to iterate over all of the entries, but avoid inserting them into the raftEntryCache?

> raftLog.nextEnts: called from RawNode.Ready to get the committed entries that need to be applied. In this case, we'd want to retrieve and apply entries in batches.

For this case, should we just limit the number of entries we read from RocksDB according to the remaining capacity of the raftEntryCache?

@bdarnell Do you mean there is no need to worry about this, or that we just need to fix the cases in raft.becomeLeader and raft.step?

And what do you mean by "We could just replace the calls to raftLog.Slice(..., noLimit) with a loop that makes a series of calls with a reasonable limit"? Do you mean we should pass a reasonable limit in place of noLimit when calling raftLog.Slice?

@bdarnell (Contributor, Author) commented Sep 5, 2017

> @bdarnell Do you mean there is no need to worry about this, or that we just need to fix the cases in raft.becomeLeader and raft.step?

I don't think we have an urgent need to do any of this, but if we do something we should fix becomeLeader, step, and nextEnts (we can ignore allEntries because it is test-only).

> And what do you mean by "We could just replace the calls to raftLog.Slice(..., noLimit) with a loop that makes a series of calls with a reasonable limit"? Do you mean we should pass a reasonable limit in place of noLimit when calling raftLog.Slice?

Instead of introducing a new Storage method like NumberOfPendingConf, we could change the code in becomeLeader from this (error handling removed)

	ents, err := r.raftLog.entries(r.raftLog.committed+1, noLimit)
	nconf := numOfPendingConf(ents)

to something like this

nconf := 0
for i := r.raftLog.committed + 1; i <= r.raftLog.lastIndex(); {
    // Read at most ~1MB of entries per call instead of noLimit
    // (error handling removed, as above).
    ents, _ := r.raftLog.entries(i, 1024*1024)
    if len(ents) == 0 {
        break
    }
    nconf += numOfPendingConf(ents)
    // Advance past the last entry returned rather than re-reading it.
    i = ents[len(ents)-1].Index + 1
}

Adding a new Storage method would allow us to make some optimizations (like possibly bypassing the entry cache), but I don't think it's worth making a change to the interface.

@a6802739 (Contributor) commented Sep 6, 2017

@bdarnell, yeah, I understand what I need to do now.

So for nextEnts(), could we just call nextEnts(1024 * 1024), replacing noLimit with 1024 * 1024?

@a6802739 a6802739 self-assigned this Sep 6, 2017
@bdarnell (Contributor, Author) commented Sep 6, 2017

> So for nextEnts(), could we just call nextEnts(1024 * 1024), replacing noLimit with 1024 * 1024?

Maybe. As I said, it's not clear whether this is safe. (And I'd probably add a field to raft.Config instead of redefining noLimit.) We'll need a new test (in the etcd/raft repo) that commits a large number of entries at once and is affected by this limit, to make sure everything works correctly.

@a6802739 (Contributor) commented Sep 7, 2017

@bdarnell, Thank you very much.

So we could just add a field like MaxEntriesPerLoad to raft.Config, and if it is set, use it in place of noLimit in raftLog.entries() and raftLog.nextEnts()?
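
A sketch of what that knob could look like; MaxEntriesPerLoad is the name proposed above, neither the field nor the helper exists in etcd/raft, and despite the name it bounds bytes, matching how noLimit is used today.

package sketch

import "math"

const noLimit = math.MaxUint64

// Config sketches only the proposed knob, not the real raft.Config.
type Config struct {
	// MaxEntriesPerLoad bounds the total byte size of log entries read in a
	// single raftLog.entries / nextEnts call where noLimit is used today.
	// Zero preserves the current unbounded behavior.
	MaxEntriesPerLoad uint64
}

// entryLoadLimit is what newRaft-style wiring could pass down to raftLog.
func (c *Config) entryLoadLimit() uint64 {
	if c.MaxEntriesPerLoad == 0 {
		return noLimit
	}
	return c.MaxEntriesPerLoad
}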

@bdarnell (Contributor, Author) commented Sep 7, 2017

Yes, I think so.

@petermattis (Collaborator)

@bdarnell Is there anything left to do here?

@petermattis petermattis assigned bdarnell and unassigned petermattis Jan 25, 2018
@bdarnell (Contributor, Author)

No, I think we can close it.
