
[experiment] storage: provoke inverted lsm #79083

Closed
wants to merge 2 commits from the `repro-inverted-lsm` branch

Conversation

@tbg (Member) commented Mar 30, 2022

Not quite working yet. Stay tuned.

@cockroach-teamcity (Member) commented

This change is Reviewable

@tbg tbg force-pushed the repro-inverted-lsm branch 10 times, most recently from 9f39e83 to a6e57f9 on March 31, 2022 08:09
- support `COCKROACH_DEBUG_PEBBLE_FILE_WRITE_DELAY=100ms` (see the sketch below)
- pick up pebble version with its own hacks:
  - support `COCKROACH_DEBUG_PEBBLE_INGEST_L0=true`
  - support `COCKROACH_PEBBLE_COMPACTION_DELAY=10s`
- script for experiment

Release note: None
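
Not part of the commit, just for illustration: below is a minimal, self-contained Go sketch of how an env-var-driven write delay like `COCKROACH_DEBUG_PEBBLE_FILE_WRITE_DELAY` could be injected. The `delayedWriter` type and the way it wraps an `io.Writer` are assumptions made for this sketch; the actual patch presumably wires the delay into Pebble's file-writing path directly.

```go
// Hypothetical illustration only; not the code from this PR.
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
	"time"
)

// writeDelayFromEnv parses COCKROACH_DEBUG_PEBBLE_FILE_WRITE_DELAY
// (e.g. "100ms"); an unset or malformed value means no delay.
func writeDelayFromEnv() time.Duration {
	d, err := time.ParseDuration(os.Getenv("COCKROACH_DEBUG_PEBBLE_FILE_WRITE_DELAY"))
	if err != nil {
		return 0
	}
	return d
}

// delayedWriter sleeps before every Write, crudely simulating a slow disk.
type delayedWriter struct {
	w     io.Writer
	delay time.Duration
}

func (d delayedWriter) Write(p []byte) (int, error) {
	if d.delay > 0 {
		time.Sleep(d.delay)
	}
	return d.w.Write(p)
}

func main() {
	w := delayedWriter{w: io.Discard, delay: writeDelayFromEnv()}
	start := time.Now()
	if _, err := io.Copy(w, strings.NewReader("some sstable bytes")); err != nil {
		panic(err)
	}
	fmt.Printf("write took %s (delay=%s)\n", time.Since(start), w.delay)
}
```
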
@tbg (Member, Author) commented Mar 31, 2022

This was interesting. Over the course of the bank import, n3's L0 file count climbed to the ~20k region (nodes eventually filled up their disks, which paused the job before it was too late - that was nice to see). Admission control was definitely aware of this:

I220331 09:11:24.772576 220 util/admission/granter.go:1674 ⋮ [-] 1995 IO overload on store 3 (files 20928, sub-levels 3): admitted: 2, added: 0, removed (0, 18676044), admit: (9338022.000000, 9625942)

The SST ingestion itself never seems to have been delayed, which makes sense: most files were probably AddSSTabled on n1 and n2 (which run stock master, didn't fall behind, and ingested everything straight into L6), and n3 never developed high read-amp (the sub-level count was 3 at the end; the number of L0 files is not taken into account there).

Also, the cluster looked completely healthy; n3 didn't register as an outlier in I/O latencies at all. So 20k L0 files seem to be completely fine as long as we don't get high read-amp. (@cockroachdb/storage is aware of some issues once you have a lot more of them, something about slowing down flushes, I think.)

There was some admission control UX that I didn't understand and that I'm not sure is intentional. Half-way through the import, I decided to add some kv0 load to the system. However, CREATE DATABASE kv would simply hang for many minutes. Invoking it again in a new shell would eventually complete before the original, hanging invocation returned; canceling and retrying the hanging invocation then went through. I'm not sure what was wrong there.

Running kv0 was similarly a bit odd. Without a rate limit, the workload would do 50-100 qps in the first second and then completely stall "forever" (I waited 800s before I gave up). However, with --max-rate 5 the workload would run at a constant 5 qps. I was starting to look into this more, but then the IMPORT paused itself (due to low disk) and the workload was able to push 2000+ qps at that point. Ranges probably split and possibly moved around during that time, perhaps transferring leases off n3 to a node whose admission control would have been unconcerned about node health.

cc @sumeerbhola, curious about your take on the expected behavior of admission control while the import is ongoing. We're pumping more and more SSTs into L0. What should happen to write requests arriving at ranges for which n3 holds the lease? Is blocking for 800s+ expected here?
I think I can reproduce everything above fairly easily, so if this is of interest I can probably record a Loom of the relevant parts, or we can sit down together when times are a little less hectic.

@tbg tbg force-pushed the repro-inverted-lsm branch 3 times, most recently from 36eda87 to 9c1d89a on March 31, 2022 10:03
@tbg tbg force-pushed the repro-inverted-lsm branch 3 times, most recently from a9947f6 to 6c9914a on March 31, 2022 13:29
@sumeerbhola (Collaborator) commented

Indeed interesting -- thanks for running this!

  • Does this import have any secondary indexes? I ask since an import with (large) secondary indexes usually doesn't do well at ingesting into L6. Also, I am curious what the ingest behavior looked like on n3, which did fall behind. The compactions in the Pebble DB logs should give us a summary of n3's ingest behavior.
  • I am curious what the store admission control behavior on each of the nodes was. Did it ever delay based on L0 file count or sub-level count? The overload dashboard would give us the summary. I would definitely expect delays on n3 in admission control for commands that were proposed there, since the L0 file count is so high.
  • 20K L0 files being fine may be due to two reasons: (a) presumably you were running on master, which has the incremental L0Sublevels building logic (it will also be in 22.1), so flushes that need to construct a new L0Sublevels will not be slow; (b) if all the load is AddSSTables being ingested, there may not be any flushes at all (the summary from the tool will tell us that).
  • That log statement from granter.go is interesting, and demonstrates the deficiency mentioned in admission: byte tokens for store admission #79092. Because nothing was admitted, we are using bytesAddedPerWork = 1, and because no bytes were added and bytes were only removed due to L0=>Lbase compactions, the number of work tokens is ~9M, which is huge. This means that if AddSSTable commands started getting proposed at this node, there would be enough tokens to admit 9M AddSSTable requests in the next 15s. If we instead kept these as 9M byte tokens and each AddSSTable consumed tokens equal to its byte size (I am over-simplifying here -- there will be an adjustment for how many bytes get ingested into L0), we would have sane behavior (see the sketch after this list).
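
To make the arithmetic in the last bullet concrete, here is a small standalone sketch (my own illustration, not the granter.go code). The 18676044 bytes-removed figure is taken from the log line quoted earlier; the 1 MiB AddSSTable size is an arbitrary assumption.

```go
package main

import "fmt"

func main() {
	const bytesRemovedFromL0 = 18_676_044 // "removed (0, 18676044)" in the quoted log line
	const bytesAddedPerWork = 1           // fallback used when nothing has been admitted yet

	// Current behavior: tokens are counted in requests. Roughly half the
	// removed bytes divided by 1 byte-per-work matches the ~9.3M value in
	// "admit: (9338022.000000, ...)", i.e. enough tokens to admit ~9.3M
	// AddSSTable requests in the next 15s.
	requestTokens := bytesRemovedFromL0 / 2 / bytesAddedPerWork
	fmt.Println("request tokens:", requestTokens)

	// Alternative sketched in #79092: keep the tokens in bytes and charge
	// each AddSSTable its byte size (ignoring the adjustment for how much
	// of the SST actually lands in L0).
	const assumedSSTBytes = 1 << 20 // assume 1 MiB per AddSSTable, purely for illustration
	byteTokens := bytesRemovedFromL0 / 2
	fmt.Printf("byte tokens: %d => ~%d AddSSTables of %d bytes each\n",
		byteTokens, byteTokens/assumedSSTBytes, assumedSSTBytes)
}
```

The point is only the order of magnitude: ~9.3M request tokens versus ~9 MiB of byte tokens, which under the byte-token scheme would admit just a handful of 1 MiB ingests per interval.
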

@tbg (Member, Author) commented Mar 31, 2022

@sumeerbhola I'm currently running some other experiments, but I think it would be helpful if I set this experiment up again and we then poked at it synchronously. This is the bank table import, which has no secondary indexes as far as I know.

Also, in case this wasn't completely clear: I'm forcing Pebble to ingest into L0 (and preventing it from doing move compactions) to see what that state looks like. I don't have a way to "provoke" this state on stock CRDB.

@tbg tbg force-pushed the repro-inverted-lsm branch 2 times, most recently from 9845641 to 12502e9 on March 31, 2022 19:36
- disable size-based per-replica queue size (100)
- disable split delay helper. It will backpressure because the LHS on n3
  is always way behind. We only have one import processor per node and
  if their splits block here, it ruins the experiment.
- drain n3, since n3's raftMu is held for extended periods of time and
  this artificially throttles when proposals are acked, thus slowing the
  import down to a crawl (which is not a natural mechanism).
- disable quota pool
- give n3 a 2000 thread scheduler
- disable "max addsst per store" semaphores

TODO pick up David's patch https://github.com/cockroachdb/cockroach/compare/master...dt:import-procs?expand=1

Hopefully an easy way to reproduce [cockroachdb#71805].

[cockroachdb#71805]: cockroachdb#71805 (comment)

Release note: None
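
For illustration only, here is a hedged Go sketch of how the toggles in this commit message could be expressed as env-driven experiment knobs. The `experimentKnobs` struct and the `EXPERIMENT_*` variable names are made up; only `COCKROACH_DEBUG_PEBBLE_INGEST_L0` and `COCKROACH_PEBBLE_COMPACTION_DELAY` come from the first commit, and the actual patch toggles these mechanisms inside CockroachDB directly.

```go
// Hypothetical illustration of env-driven experiment knobs; not the PR's code.
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

type experimentKnobs struct {
	ingestIntoL0      bool          // COCKROACH_DEBUG_PEBBLE_INGEST_L0 (first commit)
	compactionDelay   time.Duration // COCKROACH_PEBBLE_COMPACTION_DELAY (first commit)
	disableSplitDelay bool          // stand-in for "disable split delay helper"
	disableQuotaPool  bool          // stand-in for "disable quota pool"
}

func envBool(name string) bool {
	b, _ := strconv.ParseBool(os.Getenv(name)) // unset => false
	return b
}

func envDuration(name string) time.Duration {
	d, err := time.ParseDuration(os.Getenv(name)) // unset/malformed => 0
	if err != nil {
		return 0
	}
	return d
}

func loadKnobs() experimentKnobs {
	return experimentKnobs{
		ingestIntoL0:      envBool("COCKROACH_DEBUG_PEBBLE_INGEST_L0"),
		compactionDelay:   envDuration("COCKROACH_PEBBLE_COMPACTION_DELAY"),
		disableSplitDelay: envBool("EXPERIMENT_DISABLE_SPLIT_DELAY"), // made-up name
		disableQuotaPool:  envBool("EXPERIMENT_DISABLE_QUOTA_POOL"),  // made-up name
	}
}

func main() {
	fmt.Printf("knobs: %+v\n", loadKnobs())
}
```
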