storage: Raft entry cache grows (inefficiently) to huge size #13231

Closed
bdarnell opened this issue Jan 30, 2017 · 20 comments

@bdarnell (Contributor)

Each store has a cache for raft entries, configured by default to be 16MB. On gamma, we see that on one node (at a time), we have over a gigabyte of raft entries in memory. I believe that not all of these entries are in the cache, but they are all being held in place by the cache, because Replica.Entries allocates one large array of raftpb.Entry objects, so that a reference to any one of them keeps the whole array alive.
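
A minimal Go sketch (not CockroachDB code; the entry type stands in for raftpb.Entry) of the retention behavior described above: a pointer to any one element of a large slice keeps the entire backing array reachable, because the GC cannot free part of a single allocation.

package main

import "fmt"

// entry stands in for raftpb.Entry; only the allocation behavior matters here.
type entry struct {
	Index uint64
	Data  []byte
}

func main() {
	// One large allocation backing ~1M entries, as a single call that loads
	// the whole raft log might produce.
	ents := make([]entry, 1<<20)

	// A pointer to a single element pins the whole backing array; the GC
	// cannot collect part of an allocation.
	pinned := &ents[42]

	ents = nil // dropping the slice header frees nothing while pinned is live
	fmt.Println(pinned.Index)
}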

Additionally, whenever we load a large array of entries, we try to add them all to the cache, inserting them one by one and then evicting all but the last 16MB. This is actually the bigger concern in the gamma cluster at this time, since the process takes long enough (and blocks the server) that the node loses leases and never makes any progress.

Why do we load this monolithic block of entries? Whenever a new node becomes leader, it loads all uncommitted entries to see if there are any config changes. This is the one place where we load raft entries without any chunking, and it would be easy to add chunking here. In addition, we may want to change the behavior of the raft.Entries method to break up the arrays it uses when adding entries to the cache (trading off the allocation overhead of smaller allocations against the wasted memory of sibling array entries). Finally, this problem is also a result of our inability to throttle incoming raft entries: the log appears to keep growing faster than we can process it.

@bdarnell bdarnell self-assigned this Jan 30, 2017
@bdarnell (Contributor, Author)

For the record, the range this is affecting is an ordinary data range in the block_writer table, not a special system range. One range has over 2M raft log entries, compared to under 100 for other ranges. If the cluster had survived a little longer, this range could have split and the problem could have been distributed over more nodes. The only theory I have for why this problem started appearing on gamma is that it has been running binaries with modified compaction settings, which might have slowed things down once the cluster reached a certain size (without letting it survive long enough to split enough times).

@petermattis (Collaborator)

Reading all of the Raft log just to count the pending config changes is a bit silly. One short-term fix would be to special-case Entries for maxBytes == math.MaxUint64: when that is true, we could avoid caching entries and only return entries with Type == EntryConfChange. This is a bit of a hack, though. It would be nicer if there were an explicit NumberPendingConf method on the raft.Storage interface.
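
A rough sketch of that special case, in hedged form: the Replica receiver and the iterateLogEntries / entriesWithCache helpers below are illustrative placeholders, not CockroachDB's actual Replica.Entries code.

// entries is a stand-in for the raft.Storage Entries implementation.
func (r *Replica) entries(lo, hi, maxBytes uint64) ([]raftpb.Entry, error) {
	if maxBytes == math.MaxUint64 {
		// The caller is reading "everything" (e.g. a new leader counting
		// pending config changes): scan the log without populating the
		// raftEntryCache and return only the conf-change entries.
		var confEnts []raftpb.Entry
		err := r.iterateLogEntries(lo, hi, func(ent raftpb.Entry) error {
			if ent.Type == raftpb.EntryConfChange {
				confEnts = append(confEnts, ent)
			}
			return nil
		})
		return confEnts, err
	}
	// Normal path: read from storage, add to the raftEntryCache, honor maxBytes.
	return r.entriesWithCache(lo, hi, maxBytes)
}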

@bdarnell (Contributor, Author)

I was wrong about this being the only place we load an unbounded number of entries at once - we also do it with committed entries that need to be applied. Changing that will be a bit trickier, since the code currently assumes that after a Ready cycle the new applied index can be set to the previous commit index. I don't think there's any deep obstacle to changing this, though.

@petermattis (Collaborator)

> I don't think there's any deep obstacle to changing this, though.

This is good to change in any case; I've noticed in the past that processing a large number of Raft entries in a single Ready call can take a significant amount of time, causing missed heartbeats and other fun.

@petermattis (Collaborator)

Unless I'm missing something, raftEntryCache.addEntries adds entries individually. While a slice is passed to addEntries, we don't hold a reference to that slice, only to copies of the individual raftpb.Entry structs.
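
A minimal sketch of that point (this is not the real raftEntryCache; the key type and map are illustrative, and raftpb is the etcd raft protobuf package): entries are stored by value, keyed by index, so the caller's []raftpb.Entry backing array is not retained once addEntries returns.

package sketch

import "github.com/coreos/etcd/raft/raftpb"

// entryCacheKey and entryCache are stand-ins; only the copy semantics matter.
type entryCacheKey struct {
	rangeID uint64
	index   uint64
}

type entryCache struct {
	entries map[entryCacheKey]raftpb.Entry
}

func (c *entryCache) addEntries(rangeID uint64, ents []raftpb.Entry) {
	for _, ent := range ents {
		// Each raftpb.Entry is copied by value into the map. Nothing keeps a
		// reference to the ents slice itself, so its backing array can be
		// collected even if the cached copies survive.
		c.entries[entryCacheKey{rangeID, ent.Index}] = ent
	}
}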

I think we should fix both of the places in etcd/raft where an unbounded number of log entries are read, but after re-familiarizing myself with the raft code neither change seems trivial. Mucking up logic at the Raft level is a recipe for last-minute instability. I'm punting any change to early in the 1.1 cycle.

@petermattis petermattis modified the milestones: 1.1, 1.0 Apr 19, 2017
@petermattis (Collaborator)

It is too late in the cycle to consider fixing etcd/raft. Punting this to 1.2.

@dianasaur323 or @nstewart This is a small-to-medium-sized stability task.

@petermattis petermattis modified the milestones: 1.2, 1.1 Aug 23, 2017
@dianasaur323 (Contributor)

kk, added to Airtable and marked you as an owner :). It's currently sized as a medium, to be conservative.

@a6802739 (Contributor)

@bdarnell @petermattis could I take a stab at this? I'm very interested in this part. :)

@petermattis (Collaborator)

@a6802739 Go ahead, but it's probably worthwhile to sketch out your plan before implementing, as you'll likely need to modify etcd/raft.

@a6802739 (Contributor) commented Sep 4, 2017

@petermattis, thank you very much.

Here is my understanding; I'm not sure if it's accurate.

When we load an unbounded number of Raft entries from the Raft state machine, we first add them all to the cache, and then the cache evicts most of them because they exceed maxBytes?

So what we want to do is chunk ready.Entries in etcd/raft to make sure each chunk doesn't exceed maxBytes?

And if we modify etcd/raft, how should we submit the change?

@petermattis (Collaborator)

> Here is my understanding; I'm not sure if it's accurate.
>
> When we load an unbounded number of Raft entries from the Raft state machine, we first add them all to the cache, and then the cache evicts most of them because they exceed maxBytes?
>
> So what we want to do is chunk ready.Entries in etcd/raft to make sure each chunk doesn't exceed maxBytes?

I'm not sure what you mean by chunking ready.Entries. We can't easily change the API for raft.Storage.Entries without breaking compatibility.

There are 4 places where Raft reads an unbounded number of log entries:

  • raft.becomeLeader: the leader checks that none of the pending entries are a configuration change. Rather than reading the entries, we could extend raft.Storage to allow retrieving the pending configuration change count directly. We'd still have to read the entries, but could avoid inserting them into the entry cache, and the read could be done without consuming large amounts of memory (see the sketch after this list).
  • raft.step upon receiving a MsgHup: similar to becomeLeader, Raft needs a count of the pending configuration changes, not the entries themselves.
  • raftLog.nextEnts: called from RawNode.Ready to get the committed entries that need to be applied. In this case, we'd want to retrieve and apply entries in batches.
  • raftLog.allEntries: this is only called from tests.
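
For the first two bullets, a hedged sketch of what such a raft.Storage extension could look like; none of this exists in etcd/raft, and the interface name, method, and fallback below are assumptions.

package sketch

import (
	"math"

	"github.com/coreos/etcd/raft"
	"github.com/coreos/etcd/raft/raftpb"
)

// PendingConfCounter is a hypothetical optional extension of raft.Storage for
// implementations that can count pending config changes without materializing
// (or caching) every entry.
type PendingConfCounter interface {
	raft.Storage
	// NumPendingConf returns the number of EntryConfChange entries in
	// (committed, lastIndex].
	NumPendingConf(committed uint64) (int, error)
}

// numPendingConf is roughly how becomeLeader / MsgHup handling could use it.
// (In the real raft code the count would also have to cover unstable entries,
// which this sketch ignores.)
func numPendingConf(s raft.Storage, committed, last uint64) (int, error) {
	if pc, ok := s.(PendingConfCounter); ok {
		return pc.NumPendingConf(committed)
	}
	if committed >= last {
		return 0, nil
	}
	// Fallback: today's behavior, reading every pending entry at once.
	ents, err := s.Entries(committed+1, last+1, math.MaxUint64)
	if err != nil {
		return 0, err
	}
	n := 0
	for _, e := range ents {
		if e.Type == raftpb.EntryConfChange {
			n++
		}
	}
	return n, nil
}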

> And if we modify etcd/raft, how should we submit the change?

We have to submit a patch upstream to the etcd/raft repository. Changing the Raft code needs to be done with care in order to maintain backward compatibility. Fair warning: this isn't a very easy change to make.

@bdarnell (Contributor, Author) commented Sep 5, 2017

This issue is less critical than it used to be, thanks to the introduction of quotaPool. Now we ensure that the uncommitted tail of the raft log is never larger than 1MiB (unless there is a single entry that large). So while it would be nice in principle to fix up the places where etcd/raft attempts to load all the entries into memory at once, it's not a very pressing concern for CockroachDB.

That said, I think I'd address the raft.becomeLeader and raft.step (after MsgHup) cases without modifying raft.Storage. We could just replace the calls to raftLog.Slice(..., noLimit) with a loop that makes a series of calls with a reasonable limit.

raftLog.nextEnts is the tricky one. It will require analysis of the raft code to determine whether it's safe to return partial results here. If it is, then it's easy and we just need to set a limit here. If not, we may need to make some changes to the way we generate the Ready struct. My specific concern here is what happens when HardState.Commit is higher than the last CommittedEntry.
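
For reference, a sketch of a size-limited variant of nextEnts; the maxSize parameter is the change under discussion, not an existing etcd/raft API, and apart from replacing noLimit the body mirrors the current method. A partial batch here is exactly the case where HardState.Commit runs ahead of the last entry in Ready.CommittedEntries.

// nextEnts today returns everything in (applied, committed]; the only change
// sketched here is passing a caller-supplied maxSize instead of noLimit.
func (l *raftLog) nextEnts(maxSize uint64) (ents []pb.Entry) {
	off := max(l.applied+1, l.firstIndex())
	if l.committed+1 > off {
		ents, err := l.slice(off, l.committed+1, maxSize)
		if err != nil {
			l.logger.Panicf("unexpected error when getting unapplied entries (%v)", err)
		}
		return ents
	}
	return nil
}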

Finally, this issue was originally about some inefficiencies in the raftEntryCache. My main concern, that references into a shared entry slice would keep the whole backing array alive, appears to be mistaken (per Peter's comment #13231 (comment)), but it's still somewhat wasteful that we add entries to the cache only to evict them immediately. I'm not sure it's worth doing anything there since the quotaPool limits the problem.

@a6802739 (Contributor) commented Sep 5, 2017

@petermattis, thank you for your explanation.

> raft.becomeLeader: the leader checks that none of the pending entries are a configuration change. Rather than reading the entries, we could extend raft.Storage to allow retrieving the pending configuration change count directly. We'd still have to read the entries, but could avoid inserting them into the entry cache, and the read could be done without consuming large amounts of memory.
>
> raft.step upon receiving a MsgHup: similar to becomeLeader, Raft needs a count of the pending configuration changes, not the entries themselves.

From my understanding, this means we should add a NumberOfPendingConf method to raft.Storage? And we would still call Replica.Entries to iterate over all of the entries, but avoid inserting them into the raftEntryCache?

> raftLog.nextEnts: called from RawNode.Ready to get the committed entries that need to be applied. In this case, we'd want to retrieve and apply entries in batches.

For this case, should we just limit the number of entries we read from RocksDB according to the remaining capacity of the raftEntryCache?

@bdarnell Do you mean there is no need to worry about this, or that we just need to fix the cases in raft.becomeLeader and raft.step?

And what do you mean by "We could just replace the calls to raftLog.Slice(..., noLimit) with a loop that makes a series of calls with a reasonable limit"? Do you mean we should pass a reasonable limit in place of noLimit when calling raftLog.Slice?

@bdarnell (Contributor, Author) commented Sep 5, 2017

> @bdarnell Do you mean there is no need to worry about this, or that we just need to fix the cases in raft.becomeLeader and raft.step?

I don't think we have an urgent need to do any of this, but if we do something we should fix becomeLeader, step, and nextEnts (we can ignore allEntries because it is test-only).

> And what do you mean by "We could just replace the calls to raftLog.Slice(..., noLimit) with a loop that makes a series of calls with a reasonable limit"? Do you mean we should pass a reasonable limit in place of noLimit when calling raftLog.Slice?

Instead of introducing a new Storage method like NumberOfPendingConf, we could change the code in becomeLeader from this (error handling removed)

	ents, err := r.raftLog.entries(r.raftLog.committed+1, noLimit)
	nconf := numOfPendingConf(ents)

to something like this

nconf := 0
for i := r.raftLog.committed + 1; i <= r.raftLog.lastIndex(); {
    // Read at most ~1MB of entries per call instead of noLimit
    // (error handling removed, as above).
    ents, _ := r.raftLog.entries(i, 1024*1024)
    if len(ents) == 0 {
        break
    }
    nconf += numOfPendingConf(ents)
    // Advance past the last entry returned rather than re-reading it.
    i = ents[len(ents)-1].Index + 1
}

Adding a new Storage method would allow us to make some optimizations (like possibly bypassing the entry cache), but I don't think it's worth making a change to the interface.

@a6802739 (Contributor) commented Sep 6, 2017

@bdarnell, yeah, I understand what I need to do now.

So for nextEnts(), could we just call nextEnts(1024 * 1024), replacing noLimit with 1024 * 1024?

@a6802739 a6802739 self-assigned this Sep 6, 2017
@bdarnell (Contributor, Author) commented Sep 6, 2017

> So for nextEnts(), could we just call nextEnts(1024 * 1024), replacing noLimit with 1024 * 1024?

Maybe. As I said, it's not clear whether this is safe. (And I'd probably add a field to raft.Config instead of redefining noLimit.) We'll need a new test (in the etcd/raft repo) that commits a large number of entries at once and is affected by this limit, to make sure everything works correctly.

@a6802739 (Contributor) commented Sep 7, 2017

@bdarnell, Thank you very much.

So we could just add a field like MaxEntriesPerLoad to raft.Config, and if it is set, use it in place of noLimit in raftLog.entries() and raftLog.nextEnts()?
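
A sketch of what that knob could look like; MaxEntriesPerLoad is the name proposed above, neither the field nor the helper exists in etcd/raft, and despite the name it bounds bytes, matching how noLimit is used today.

package sketch

import "math"

const noLimit = math.MaxUint64

// Config sketches only the proposed knob, not the real raft.Config.
type Config struct {
	// MaxEntriesPerLoad bounds the total byte size of log entries read in a
	// single raftLog.entries / nextEnts call where noLimit is used today.
	// Zero preserves the current unbounded behavior.
	MaxEntriesPerLoad uint64
}

// entryLoadLimit is what newRaft-style wiring could pass down to raftLog.
func (c *Config) entryLoadLimit() uint64 {
	if c.MaxEntriesPerLoad == 0 {
		return noLimit
	}
	return c.MaxEntriesPerLoad
}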

@bdarnell (Contributor, Author) commented Sep 7, 2017

Yes, I think so.

@petermattis (Collaborator)

@bdarnell Is there anything left to do here?

@petermattis petermattis assigned bdarnell and unassigned petermattis Jan 25, 2018
@bdarnell (Contributor, Author)

No, I think we can close it.
