
[experiment] storage: provoke inverted lsm #79083

Closed
wants to merge 2 commits from the `repro-inverted-lsm` branch

Conversation

@tbg (Member) commented Mar 30, 2022

Not quite working yet. Stay tuned.

@cockroach-teamcity (Member) commented

This change is Reviewable

@tbg tbg force-pushed the repro-inverted-lsm branch 10 times, most recently from 9f39e83 to a6e57f9 on March 31, 2022 08:09
- support `COCKROACH_DEBUG_PEBBLE_FILE_WRITE_DELAY=100ms` (see the sketch below)
- pick up pebble version with its own hacks:
  - support `COCKROACH_DEBUG_PEBBLE_INGEST_L0=true`
  - support `COCKROACH_PEBBLE_COMPACTION_DELAY=10s`
- script for experiment

Release note: None
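
Not part of the commit, just for illustration: below is a minimal, self-contained Go sketch of how an env-var-driven write delay like `COCKROACH_DEBUG_PEBBLE_FILE_WRITE_DELAY` could be injected. The `delayedWriter` type and the way it wraps an `io.Writer` are assumptions made for this sketch; the actual patch presumably wires the delay into Pebble's file-writing path directly.

```go
// Hypothetical illustration only; not the code from this PR.
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
	"time"
)

// writeDelayFromEnv parses COCKROACH_DEBUG_PEBBLE_FILE_WRITE_DELAY
// (e.g. "100ms"); an unset or malformed value means no delay.
func writeDelayFromEnv() time.Duration {
	d, err := time.ParseDuration(os.Getenv("COCKROACH_DEBUG_PEBBLE_FILE_WRITE_DELAY"))
	if err != nil {
		return 0
	}
	return d
}

// delayedWriter sleeps before every Write, crudely simulating a slow disk.
type delayedWriter struct {
	w     io.Writer
	delay time.Duration
}

func (d delayedWriter) Write(p []byte) (int, error) {
	if d.delay > 0 {
		time.Sleep(d.delay)
	}
	return d.w.Write(p)
}

func main() {
	w := delayedWriter{w: io.Discard, delay: writeDelayFromEnv()}
	start := time.Now()
	if _, err := io.Copy(w, strings.NewReader("some sstable bytes")); err != nil {
		panic(err)
	}
	fmt.Printf("write took %s (delay=%s)\n", time.Since(start), w.delay)
}
```
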
@tbg (Member, Author) commented Mar 31, 2022

This was interesting. Over the course of the bank import, n3's L0 file count climbed to the ~20k region (nodes eventually filled up their disks, which paused the job before it was too late - that was nice to see). Admission control was definitely aware of this:

I220331 09:11:24.772576 220 util/admission/granter.go:1674 ⋮ [-] 1995 IO overload on store 3 (files 20928, sub-levels 3): admitted: 2, added: 0, removed (0, 18676044), admit: (9338022.000000, 9625942)

The SST ingestion itself never seems to have been delayed, which makes sense: most files were probably AddSSTabled on n1 and n2 (which run stock master, didn't fall behind, and ingested everything straight into L6), and n3 never developed high read-amp (the sub-level count was 3 at the end; the number of L0 files is not taken into account there).

Also, the cluster looked completely healthy; n3 didn't register as an outlier in I/O latencies at all. So 20k L0 files seem to be completely fine as long as we don't get high read-amp. (@cockroachdb/storage is aware of some issues once you have a lot more of them, something about slowing down flushes, I think.)

There was some admission control UX that I didn't understand and that I'm not sure is intentional. Half-way through the import, I decided to add some kv0 load to the system. However, CREATE DATABASE kv would simply hang for many minutes. Invoking it again in a new shell would eventually complete before the original, hanging invocation returned; canceling and retrying the hanging invocation then went through. I'm not sure what was wrong there.

Running kv0 was similarly a bit odd. Without a rate limit, the workload would do 50-100 qps in the first second and then completely stall "forever" (I waited 800s before I gave up). However, with --max-rate 5 the workload would run at a constant 5 qps. I was starting to look into this more, but then the IMPORT paused itself (due to low disk) and the workload was able to push 2000+ qps at that point. Ranges probably split and possibly moved around during that time, perhaps transferring leases off n3 to a node whose admission control would have been unconcerned about node health.

cc @sumeerbhola, curious about your take on the expected behavior of admission control while the import is ongoing. We're pumping more and more SSTs into L0. What should happen to write requests arriving at ranges for which n3 holds the lease? Is blocking for 800s+ expected here?
I think I can reproduce everything above fairly easily, so if this is of interest I can probably record a Loom of the relevant parts, or we can sit down together when times are a little less hectic.

@tbg tbg force-pushed the repro-inverted-lsm branch 3 times, most recently from 36eda87 to 9c1d89a on March 31, 2022 10:03
@tbg tbg force-pushed the repro-inverted-lsm branch 3 times, most recently from a9947f6 to 6c9914a on March 31, 2022 13:29
@sumeerbhola (Collaborator) commented

Indeed interesting -- thanks for running this!

  • Does this import have any secondary indexes? I ask since an import with (large) secondary indexes usually doesn't do well at ingesting into L6. Also, I am curious what the ingest behavior looked like on n3, which did fall behind. The compactions in the Pebble DB logs should give us a summary of n3's ingest behavior.
  • I am curious what the store admission control behavior on each of the nodes was. Did it ever delay based on L0 file count or sub-level count? The overload dashboard would give us the summary. I would definitely expect delays on n3 in admission control for commands that were proposed there, since the L0 file count is so high.
  • 20K L0 files being fine may be due to two reasons: (a) presumably you were running on master, which has the incremental L0Sublevels building logic (it will also be in 22.1), so flushes that need to construct a new L0Sublevels will not be slow; (b) if all the load is AddSSTables being ingested, there may not be any flushes at all (the summary from the tool will tell us that).
  • That log statement from granter.go is interesting, and demonstrates the deficiency mentioned in admission: byte tokens for store admission #79092. Because nothing was admitted, we are using bytesAddedPerWork = 1, and because no bytes were added and bytes were only removed due to L0=>Lbase compactions, the number of work tokens is ~9M, which is huge. This means that if AddSSTable commands started getting proposed at this node, there would be enough tokens to admit 9M AddSSTable requests in the next 15s. If we instead kept these as 9M byte tokens and each AddSSTable consumed tokens equal to its byte size (I am over-simplifying here -- there will be an adjustment for how many bytes get ingested into L0), we would have sane behavior (see the sketch after this list).
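
To make the arithmetic in the last bullet concrete, here is a small standalone sketch (my own illustration, not the granter.go code). The 18676044 bytes-removed figure is taken from the log line quoted earlier; the 1 MiB AddSSTable size is an arbitrary assumption.

```go
package main

import "fmt"

func main() {
	const bytesRemovedFromL0 = 18_676_044 // "removed (0, 18676044)" in the quoted log line
	const bytesAddedPerWork = 1           // fallback used when nothing has been admitted yet

	// Current behavior: tokens are counted in requests. Roughly half the
	// removed bytes divided by 1 byte-per-work matches the ~9.3M value in
	// "admit: (9338022.000000, ...)", i.e. enough tokens to admit ~9.3M
	// AddSSTable requests in the next 15s.
	requestTokens := bytesRemovedFromL0 / 2 / bytesAddedPerWork
	fmt.Println("request tokens:", requestTokens)

	// Alternative sketched in #79092: keep the tokens in bytes and charge
	// each AddSSTable its byte size (ignoring the adjustment for how much
	// of the SST actually lands in L0).
	const assumedSSTBytes = 1 << 20 // assume 1 MiB per AddSSTable, purely for illustration
	byteTokens := bytesRemovedFromL0 / 2
	fmt.Printf("byte tokens: %d => ~%d AddSSTables of %d bytes each\n",
		byteTokens, byteTokens/assumedSSTBytes, assumedSSTBytes)
}
```

The point is only the order of magnitude: ~9.3M request tokens versus ~9 MiB of byte tokens, which under the byte-token scheme would admit just a handful of 1 MiB ingests per interval.
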

@tbg (Member, Author) commented Mar 31, 2022

@sumeerbhola I'm currently running some other experiments, but I think it would be helpful if I set this experiment up again and we then poked at it synchronously. This is the bank table import, which has no secondary indexes as far as I know.

Also, in case this wasn't completely clear: I'm forcing Pebble to ingest into L0 (and preventing it from doing move compactions) to see what that state looks like. I don't have a way to "provoke" this state on stock CRDB.

@tbg tbg force-pushed the repro-inverted-lsm branch 2 times, most recently from 9845641 to 12502e9 on March 31, 2022 19:36
- disable size-based per-replica queue size (100)
- disable split delay helper. It will backpressure because the LHS on n3
  is always way behind. We only have one import processor per node and
  if their splits block here, it ruins the experiment.
- drain n3, since n3's raftMu is held for extended periods of time and
  this artificially throttles when proposals are acked, thus slowing the
  import down to a crawl (which is not a natural mechanism).
- disable quota pool
- give n3 a 2000 thread scheduler
- disable "max addsst per store" semaphores

TODO pick up David's patch https://github.com/cockroachdb/cockroach/compare/master...dt:import-procs?expand=1

Hopefully an easy way to reproduce [cockroachdb#71805].

[cockroachdb#71805]: cockroachdb#71805 (comment)

Release note: None
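
For illustration only, here is a hedged Go sketch of how the toggles in this commit message could be expressed as env-driven experiment knobs. The `experimentKnobs` struct and the `EXPERIMENT_*` variable names are made up; only `COCKROACH_DEBUG_PEBBLE_INGEST_L0` and `COCKROACH_PEBBLE_COMPACTION_DELAY` come from the first commit, and the actual patch toggles these mechanisms inside CockroachDB directly.

```go
// Hypothetical illustration of env-driven experiment knobs; not the PR's code.
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

type experimentKnobs struct {
	ingestIntoL0      bool          // COCKROACH_DEBUG_PEBBLE_INGEST_L0 (first commit)
	compactionDelay   time.Duration // COCKROACH_PEBBLE_COMPACTION_DELAY (first commit)
	disableSplitDelay bool          // stand-in for "disable split delay helper"
	disableQuotaPool  bool          // stand-in for "disable quota pool"
}

func envBool(name string) bool {
	b, _ := strconv.ParseBool(os.Getenv(name)) // unset => false
	return b
}

func envDuration(name string) time.Duration {
	d, err := time.ParseDuration(os.Getenv(name)) // unset/malformed => 0
	if err != nil {
		return 0
	}
	return d
}

func loadKnobs() experimentKnobs {
	return experimentKnobs{
		ingestIntoL0:      envBool("COCKROACH_DEBUG_PEBBLE_INGEST_L0"),
		compactionDelay:   envDuration("COCKROACH_PEBBLE_COMPACTION_DELAY"),
		disableSplitDelay: envBool("EXPERIMENT_DISABLE_SPLIT_DELAY"), // made-up name
		disableQuotaPool:  envBool("EXPERIMENT_DISABLE_QUOTA_POOL"),  // made-up name
	}
}

func main() {
	fmt.Printf("knobs: %+v\n", loadKnobs())
}
```
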