[experiment] storage: provoke inverted lsm #79083
Conversation
Force-pushed from 9f39e83 to a6e57f9
- support `COCKROACH_DEBUG_PEBBLE_FILE_WRITE_DELAY=100ms`
- pick up pebble version with its own hacks:
  - support `COCKROACH_DEBUG_PEBBLE_INGEST_L0=true`
  - support `COCKROACH_PEBBLE_COMPACTION_DELAY=10s`
- script for experiment

Release note: None
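For illustration, here is a minimal Go sketch (not the actual patch) of how debug overrides of this kind could be read from the environment at startup. Only the variable names come from the commit message above; the helpers are assumptions.

```go
// Hedged sketch, not the real patch: read the debug overrides named in the
// commit message from the environment, falling back to "off" when unset.
package main

import (
	"fmt"
	"os"
	"time"
)

// envDuration returns the duration in the named env var, or def if it is
// unset or unparseable.
func envDuration(name string, def time.Duration) time.Duration {
	if v := os.Getenv(name); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return def
}

// envBool reports whether the named env var is set to "true".
func envBool(name string) bool {
	return os.Getenv(name) == "true"
}

func main() {
	fileWriteDelay := envDuration("COCKROACH_DEBUG_PEBBLE_FILE_WRITE_DELAY", 0)
	compactionDelay := envDuration("COCKROACH_PEBBLE_COMPACTION_DELAY", 0)
	ingestL0 := envBool("COCKROACH_DEBUG_PEBBLE_INGEST_L0")
	fmt.Println(fileWriteDelay, compactionDelay, ingestL0)
}
```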
This was interesting. Over the course of the bank import, n3's L0 file count climbed to the ~20k region (nodes eventually filled up their disks, which paused the job before it was too late - that was nice to see). Admission control was definitely aware of this:
The SST ingestion itself never seems to have been delayed, which makes sense. Most files were probably AddSSTabled on n1 and n2 (which run stock master and didn't fall behind, ingesting everything straight into L6), and n3 never got a high read-amp (sublevels was 3 at the end; the number of L0 files is not taken into account then). Also, the cluster was completely healthy-looking: n3 didn't register as an outlier in I/O latencies at all. So 20k L0 files seem to be completely fine as long as we don't get high read-amp. (@cockroachdb/storage is aware of some issues when you have a lot more of them, something about slowing down flushes, I think.)

There was some UX around admission control that I didn't understand and for which I'm unsure whether it's intentional. Half-way through the import, I decided to add some kv0 load to the system. However,

Running kv0 was similarly a bit odd. Without a rate limit, the workload would do 50-100 qps in the first second and then completely stall "forever" (I waited 800s before I gave up). However, with

cc @sumeerbhola, curious on your take about the expected behavior of admission control while the import was ongoing. We're pumping more and more SSTs into L0. What should happen for write requests arriving at ranges for which n3 holds the lease? Is blocking for 800s+ expected here?
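To make the read-amp point concrete, here is a minimal Go sketch (not Pebble's or CockroachDB's actual code) of the usual way LSM read amplification is counted: L0 contributes one unit per overlapping sub-level, and each non-empty level below L0 contributes one more, so the raw L0 file count doesn't enter into it.

```go
// Hedged sketch: illustrates why ~20k L0 files with only 3 sub-levels can
// still mean low read amplification. Not Pebble's real accounting code.
package main

import "fmt"

// readAmp approximates LSM read amplification: each L0 sub-level may need a
// separate lookup, as may each non-empty level below L0.
func readAmp(l0Sublevels, nonEmptyLowerLevels int) int {
	return l0Sublevels + nonEmptyLowerLevels
}

func main() {
	// n3 at the end of the import: ~20k L0 files but only 3 sub-levels,
	// with (say) one populated lower level.
	fmt.Println(readAmp(3, 1)) // 4: low read-amp despite ~20k L0 files
}
```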
Force-pushed from 36eda87 to 9c1d89a
Force-pushed from a9947f6 to 6c9914a
Indeed interesting -- thanks for running this!
@sumeerbhola I'm currently running some other experiments, but I think it would be helpful if I set up this experiment again and then we poked at it synchronously? This is the

Also, in case this wasn't completely clear: I'm forcing pebble to ingest into L0 (and preventing it from doing move compactions) to see what that state looks like. I don't have a way to "provoke" this state on stock CRDB.
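A minimal sketch of what such a debug hook could look like, with hypothetical names (this is not Pebble's real ingest-placement API): when the override is on, ingested SSTs are pinned to L0 instead of the lowest level they would normally land in.

```go
// Hedged sketch of the kind of override described above. Names are made up;
// the point is only that ingestion is forced into L0 rather than the lowest
// non-overlapping level (and the move-compaction escape hatch is disabled
// separately, so those tables stay in L0).
package main

import "fmt"

const numLevels = 7

// ingestTargetLevel picks the level an ingested SST goes to. lowestFit is
// whatever normal placement logic would choose (e.g. L6 when the table
// overlaps nothing); forceL0 models COCKROACH_DEBUG_PEBBLE_INGEST_L0.
func ingestTargetLevel(lowestFit int, forceL0 bool) int {
	if forceL0 {
		return 0
	}
	return lowestFit
}

func main() {
	fmt.Println(ingestTargetLevel(numLevels-1, true))  // 0: forced into L0
	fmt.Println(ingestTargetLevel(numLevels-1, false)) // 6: normal placement
}
```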
Force-pushed from 9845641 to 12502e9
- disable size-based per-replica queue size (100)
- disable split delay helper. It will backpressure because the LHS on n3 is always way behind. We only have one import processor per node, and if their splits block here, it ruins the experiment.
- drain n3, since n3's raftMu is held for extended periods of time and this artificially throttles when proposals are acked, thus slowing the import down to a crawl (which is not a natural mechanism).
- disable quota pool
- give n3 a 2000 thread scheduler
- disable "max addsst per store" semaphores

TODO pick up David's patch https://github.com/cockroachdb/cockroach/compare/master...dt:import-procs?expand=1

Hopefully easy way to reproduce [cockroachdb#71805].

[cockroachdb#71805]: cockroachdb#71805 (comment)

Release note: None
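As a reading aid, here is a hedged Go sketch that just collects the knobs listed above into one hypothetical config struct; none of the field names are real cluster settings or env vars, they only document the shape of the experiment run.

```go
// Hedged sketch (hypothetical names): the experiment tweaks from the commit
// message gathered in one place.
package main

import (
	"fmt"
	"time"
)

// experimentConfig mirrors the tweaks listed above; it is documentation, not
// an actual CockroachDB configuration surface.
type experimentConfig struct {
	disablePerReplicaQueueLimit bool          // normally size-based, ~100 entries
	disableSplitDelayHelper     bool          // avoid backpressure from the lagging LHS on n3
	drainN3Leases               bool          // keep n3's raftMu out of the proposal-ack path
	disableQuotaPool            bool          // no proposal quota throttling
	schedulerWorkersN3          int           // oversized raft scheduler on n3
	disableAddSSTSemaphore      bool          // no per-store AddSSTable cap
	pebbleCompactionDelay       time.Duration // see COCKROACH_PEBBLE_COMPACTION_DELAY above
}

func main() {
	cfg := experimentConfig{
		disablePerReplicaQueueLimit: true,
		disableSplitDelayHelper:     true,
		drainN3Leases:               true,
		disableQuotaPool:            true,
		schedulerWorkersN3:          2000,
		disableAddSSTSemaphore:      true,
		pebbleCompactionDelay:       10 * time.Second,
	}
	fmt.Printf("%+v\n", cfg)
}
```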
Not quite working yet. Stay tuned.