[dnm] kv: expose (and use) byte batch response size limit #44341
Conversation
TODO: due to the RocksDB/pebble modes of operation, there are currently two ways to do anything. Thanks to the two possible response types for batches (KVs vs bytes) there is another factor of two. In short, I expect to have to make three more similar changes to be able to fully implement byte hints for scans. TODO: testing, and potentially write the code in a saner way.

----

A fledgling step towards cockroachdb#19721 is allowing incoming KV requests to bound the size of the response in terms of bytes rather than rows. This commit provides the necessary functionality to pebbleMVCCScanner: the new maxBytes field stops the scan once the size of the result meets or exceeds the threshold (at least one key will be added, regardless of its size).

The classic example of the problem this addresses is a table in which each row is, say, ~1mb in size. A full table scan will currently fetch data from KV in batches of [10k] rows, causing at least 10GB of data to be held in memory at any given moment. This sort of thing does happen in practice; we have a long-failing roachtest cockroachdb#33660 because of just that, and anecdotally, OOMs in production clusters are regularly caused by individual queries consuming excessive amounts of memory at the KV level.

Plumbing this limit into a header field on BatchRequest and down to the engine level will allow the batching in [10k] to become byte-sized in nature, thus avoiding one obvious source of OOMs. This doesn't solve cockroachdb#19721 in general (many such requests could still work together to consume a lot of memory), but it's a sane baby step that might already avoid a large portion of OOMs.

[10k]: https://github.com/cockroachdb/cockroach/blob/0a658c19cd164e7c021eaff7f73db173f0650e8c/pkg/sql/row/kv_batch_fetcher.go#L25-L29

Release note: None
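To make the stopping rule concrete, here is a minimal, self-contained Go sketch (not the actual pebbleMVCCScanner code; the `kvPair` type and `accumulate` helper are invented for illustration) of a scan that stops once the accumulated result meets or exceeds a byte threshold while always returning at least one key:

```go
package main

import "fmt"

// kvPair is a stand-in for a decoded key/value pair returned by a scan.
type kvPair struct {
	key, value []byte
}

// accumulate appends pairs to the result until the total size meets or
// exceeds maxBytes (0 means no limit). At least one pair is always included,
// mirroring the "at least one key will be added, regardless of its size" rule.
// It returns the result and the index at which a resumed scan would continue.
func accumulate(pairs []kvPair, maxBytes int64) (result []kvPair, resumeIdx int) {
	var size int64
	for i, p := range pairs {
		if maxBytes > 0 && size >= maxBytes {
			return result, i
		}
		result = append(result, p)
		size += int64(len(p.key) + len(p.value))
	}
	return result, len(pairs)
}

func main() {
	rows := []kvPair{
		{[]byte("a"), make([]byte, 1<<20)}, // ~1MB values
		{[]byte("b"), make([]byte, 1<<20)},
		{[]byte("c"), make([]byte, 1<<20)},
	}
	got, resume := accumulate(rows, 1<<20)
	fmt.Println(len(got), resume) // 1 1: one row returned, scan resumes at index 1
}
```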
This is a (very rough) prototype for allowing consumers of the KV API to restrict the size of result sets returned from spans. It does so by simply plumbing through a byte hint from the kvfetcher to the ultimate MVCC scan.

I mostly wanted to see that this all works, and it does. In the below, the `kv` table contains ~10k 1mb rows.

This PR:

```
root@:26257/defaultdb> select sum(length(v)) from kv;
      sum
---------------
  10538188800
(1 row)

Time: 39.029189s
```

Before:

```
root@:26257/defaultdb ?> select sum(length(v)) from kv;
      sum
---------------
  10538188800
(1 row)

Time: 56.87552s
```

Finished up, this work should already de-flake cockroachdb#33660.

Release note: None
I can't think of anything. Seems like pure goodness in terms of both speed and memory usage.
I'm glad this was as straightforward as we had thought. The approach looks good and the results in your experiment are very promising. Have you gotten a chance to run this with `hotspotsplits/nodes=4`?
Reviewed 7 of 8 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal, @knz, @nvanbenschoten, and @tbg)
pkg/kv/dist_sender.go, line 1226 at r2 (raw file):
`mightStopEarly := ba.MaxSpanRequestKeys > 0 || ba.MaxSpanResponseBytes > 0`
nit: can this be pulled up above `canParallelize` and only computed once? I think it can also be set as the initial value of `canParallelize`.
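For illustration, a minimal sketch of the restructuring this nit suggests, with a hypothetical `batchHeader` type standing in for the real request header:

```go
package main

import "fmt"

// batchHeader is a stand-in for the real request header; only the two limit
// fields from the snippet above are modeled.
type batchHeader struct {
	MaxSpanRequestKeys   int64
	MaxSpanResponseBytes int64
}

// canParallelize folds the "might stop early" check into its initial value:
// a batch carrying a key or byte limit might not need results from later
// ranges, so it shouldn't be fanned out to all of them at once.
func canParallelize(h batchHeader) bool {
	mightStopEarly := h.MaxSpanRequestKeys > 0 || h.MaxSpanResponseBytes > 0
	return !mightStopEarly
}

func main() {
	fmt.Println(canParallelize(batchHeader{}))                              // true
	fmt.Println(canParallelize(batchHeader{MaxSpanResponseBytes: 1 << 20})) // false
}
```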
pkg/sql/row/kv_batch_fetcher.go, line 229 at r2 (raw file):
var ba roachpb.BatchRequest
ba.Header.MaxSpanRequestKeys = f.getBatchSize()
ba.Header.MaxSpanResponseBytes = 1 << 20 // 1MB
I'm assuming this isn't going to stay constant like this. If not, what were you envisioning as the policy to use for setting the byte limit on these requests?
- pull out a constant like we have with `kvBatchSize`? (the first two options are sketched below)
- pull out a cluster setting?
- dynamically grow the response size like we do in `getBatchSizeForIdx`? (I don't think this makes sense because there's no concept of soft row-size limits)
- use the optimizer to make size estimates using histograms and pass that in here?
- something else?
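A rough, self-contained sketch of the first two options (a package-level default plus a runtime override); the names and the env-var plumbing are invented stand-ins, not CockroachDB's settings API:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// defaultScanResponseBytes plays the role that kvBatchSize plays for the
// row-count limit: a package-level default for the byte-based limit.
const defaultScanResponseBytes = 1 << 20 // 1MB

// scanResponseBytes sketches the "cluster setting" flavor: the default can be
// overridden at runtime. An environment variable stands in for CockroachDB's
// settings infrastructure, purely to keep the example self-contained.
func scanResponseBytes() int64 {
	if v, err := strconv.ParseInt(os.Getenv("SCAN_RESPONSE_BYTES"), 10, 64); err == nil && v > 0 {
		return v
	}
	return defaultScanResponseBytes
}

func main() {
	fmt.Println(scanResponseBytes()) // 1048576 unless overridden
}
```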
pkg/storage/batcheval/cmd_scan.go, line 47 at r2 (raw file):
var kvData [][]byte
var numKvs int64
opts := engine.MVCCScanOptions{
Are we not planning on supporting byte limits for the `roachpb.KEY_VALUES` response format? We probably don't want to get into the business of estimating the additional overhead of the `KeyValue` protos when we inflate the batch representation, but even just using the same mechanism to provide a limit would be useful.
pkg/storage/batcheval/cmd_scan.go, line 53 at r2 (raw file):
kvData, numKvs, resumeSpan, intents, err = engine.MVCCScanToBytes(
    ctx, reader, args.Key, args.EndKey, cArgs.MaxKeys, h.Timestamp,
This seems to indicate that the `max` arg belongs in the `MVCCScanOptions` struct as well.
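A hedged sketch of what folding both limits into the options struct could look like; the `mvccScanOptions` fields here are illustrative, not the actual `engine.MVCCScanOptions` definition:

```go
package main

import "fmt"

// mvccScanOptions sketches the suggestion: group both limits in one options
// value rather than passing the key limit positionally and the byte limit via
// options. Field names are illustrative, not the engine package's definition.
type mvccScanOptions struct {
	Reverse  bool
	MaxKeys  int64 // stop after this many keys (0 = unlimited)
	MaxBytes int64 // stop once the response reaches this size (0 = unlimited)
}

// mvccScanToBytes is a stub call site; a real implementation would walk the
// MVCC iterator and honor both limits. Only the shape of the call matters here.
func mvccScanToBytes(startKey, endKey string, opts mvccScanOptions) {
	fmt.Printf("scan [%s, %s) maxKeys=%d maxBytes=%d\n",
		startKey, endKey, opts.MaxKeys, opts.MaxBytes)
}

func main() {
	mvccScanToBytes("a", "z", mvccScanOptions{MaxKeys: 10000, MaxBytes: 1 << 20})
}
```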
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal, @knz, @nvanbenschoten, and @tbg)
pkg/storage/batcheval/cmd_scan.go, line 47 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Are we not planning on supporting byte limits for the `roachpb.KEY_VALUES` response format? We probably don't want to get into the business of estimating the additional overhead of the `KeyValue` protos when we inflate the batch representation, but even just using the same mechanism to provide a limit would be useful.
I think we should support `MaxSpanResponseBytes` with the `KEY_VALUES` format. And agree that we shouldn't try to estimate the additional overhead of the `KeyValue` protos. Simply passing through the `MaxSpanResponseBytes` setting to `MVCCScan` seems sufficient.
@tbg is this going to make it into this release? I'm working on paginating CDC backfills.

Given this is coming soon, I think I'm going to just do (1) above and leave (2) to adopt this. In the short term I'm going to leave a TODO in CDC backfills to adopt the byte limit code rather than implementing something of my own.

Relates to #44482 and the CDC portion of #39717 (comment).
It is planned for this release, yes (whether it makes it will depend on how happy we are with the results).
@ajwerner this is going to be my must-have for the coming milestone, so yes (and I'm planning to get it done early, too). Mind linking me to the pagination issue?
Never mind, found the issue: #43356
In cockroachdb#44440 we added a `targetSize` parameter to enable pagination of export requests. In that PR we defined `targetSize` such that the export returns just before the key that would lead to the `targetSize` being exceeded. This definition is unfortunate when thinking about a total size limit for pagination in the DistSender (which we'll add when cockroachdb#44341 comes in).

Imagine a case where we set a total byte limit of 1MB and a file byte limit of 1MB. That setting should lead to at most a single file being emitted (assuming one range holds enough data). If we used the previous definition, we'd create a file which is just below 1MB and then the DistSender would need to send another request containing a tiny amount of data.

This brings the behavior in line with the semantics introduced in cockroachdb#44341 for ScanRequests and is just easier to reason about.

Release note: None
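A small worked example of the two definitions, with invented helpers and sizes, shows why the new semantics avoid the tiny follow-up request:

```go
package main

import "fmt"

// splitBefore models the old semantics: a file is cut just before the key
// that would push it past target.
func splitBefore(sizes []int64, target int64) (fileBytes int64, next int) {
	for i, s := range sizes {
		if fileBytes+s > target {
			return fileBytes, i
		}
		fileBytes += s
	}
	return fileBytes, len(sizes)
}

// splitAtOrAfter models the new semantics (matching the byte-limit behavior
// in #44341): the key that makes the file meet or exceed target is included.
func splitAtOrAfter(sizes []int64, target int64) (fileBytes int64, next int) {
	for i, s := range sizes {
		fileBytes += s
		if fileBytes >= target {
			return fileBytes, i + 1
		}
	}
	return fileBytes, len(sizes)
}

func main() {
	sizes := []int64{400 << 10, 400 << 10, 400 << 10} // three ~400KB keys
	target := int64(1 << 20)                          // 1MB file and total limit

	oldBytes, oldNext := splitBefore(sizes, target)
	newBytes, newNext := splitAtOrAfter(sizes, target)
	fmt.Println(oldBytes, oldNext) // 819200 2: file just under 1MB, one key left for a follow-up request
	fmt.Println(newBytes, newNext) // 1228800 3: a single file, nothing left over
}
```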
This commit adopts the API change in cockroachdb#44440 and the previous commit. It adds a hidden cluster setting to control the target size. There's not a ton of testing but there's some.

Further work includes:

* Adding a mechanism to limit the number of files returned from an ExportRequest for use in CDC backfills. This is currently blocked on cockroachdb#44341.

I'm omitting a release note because the setting is hidden.

Release note: None.
This PR is motivated by the desire to get the memory usage of CDC under control in the presence of much larger ranges. Currently, when a changefeed decides it needs to do a backfill, it breaks the spans up along range boundaries and then fetches the data (with some parallelism) for the backfill. The memory overhead was somewhat bounded by the range size. If we want to make the range size dramatically larger, the memory usage would become a function of that new, much larger range size.

Fortunately, we don't have much need for these `ExportRequest`s any more. Another fascinating revelation of late is that the `ScanResponse` does indeed include MVCC timestamps (not that we necessarily needed them, but it's a good idea to keep them for compatibility).

The `ScanRequest` currently permits a limit on `NumRows`, which this commit utilizes. I wanted to get this change typed in anticipation of cockroachdb#44341, which will provide a limit on `NumBytes`. I retained the existing parallelism, as ScanRequests with limits are not parallelized.

I would like to do some benchmarking, but I feel pretty okay about the testing we have in place already. @danhhz what do you want to see here?

Relates to cockroachdb#39717.

Release note: None.
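As a hedged illustration of the pagination pattern this enables (toy types, not the changefeed code): issue limited scans and continue from the resume span until the span is exhausted, so memory is bounded by one batch rather than one range.

```go
package main

import "fmt"

// span is a toy key span; real code would use roachpb.Span.
type span struct{ start, end int }

// scanLimited stands in for sending a ScanRequest with MaxSpanRequestKeys set
// (and, once available, MaxSpanResponseBytes). It returns how many rows were
// fetched and, if it stopped early, a resume span covering the remainder.
func scanLimited(s span, maxRows int) (rows int, resume *span) {
	if n := s.end - s.start; n > maxRows {
		return maxRows, &span{s.start + maxRows, s.end}
	}
	return s.end - s.start, nil
}

// backfill keeps issuing limited scans, continuing from the resume span,
// until the original span is exhausted.
func backfill(s span, maxRows int) int {
	total := 0
	for {
		rows, resume := scanLimited(s, maxRows)
		total += rows
		if resume == nil {
			return total
		}
		s = *resume
	}
}

func main() {
	fmt.Println(backfill(span{0, 10500}, 1000)) // 10500 rows, fetched 1000 at a time
}
```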
44663: changefeedccl: use ScanRequest instead of ExportRequest during backfills r=danhhz a=ajwerner

This PR is motivated by the desire to get the memory usage of CDC under control in the presence of much larger ranges. Currently, when a changefeed decides it needs to do a backfill, it breaks the spans up along range boundaries and then fetches the data (with some parallelism) for the backfill. The memory overhead was somewhat bounded by the range size. If we want to make the range size dramatically larger, the memory usage would become a function of that new, much larger range size.

Fortunately, we don't have much need for these `ExportRequest`s any more. Another fascinating revelation of late is that the `ScanResponse` does indeed include MVCC timestamps (not that we necessarily needed them, but it's a good idea to keep them for compatibility).

The `ScanRequest` currently permits a limit on `NumRows`, which this commit utilizes. I wanted to get this change typed in anticipation of #44341, which will provide a limit on `NumBytes`. I retained the existing parallelism, as ScanRequests with limits are not parallelized.

I would like to do some benchmarking, but I feel pretty okay about the testing we have in place already. @danhhz what do you want to see here?

Relates to #39717.

Release note: None.

44719: sql: add telemetry for uses of alter primary key r=otan a=rohany

Fixes #44716.

This PR adds a telemetry counter for uses of the alter primary key command.

Release note (sql change): This PR adds collected telemetry from clusters upon using the alter primary key command.

44721: vendor: bump pebble to 89adc50375ffd11c8e62f46f1a5c320012cffafe r=petermattis a=petermattis

* db: additional tweak to the sstable boundary generation
* db: add memTable.logSeqNum
* db: force flushing of overlapping queued memtables during ingestion
* tool: lsm visualization tool
* db: consistently handle key decoding failure
* cmd/pebble: fix lint for maxOpsPerSec's type inference
* tool: add "find" command
* cmd/pebble: fix --rate flag
* internal/metamorphic: use the strict MemFS and add an operation to reset the DB
* db: make DB.Close() wait for any ongoing deletion of obsolete files
* sstable: encode varints directly into buf in blockWriter.store
* sstable: micro-optimize Writer.addPoint()
* sstable: minor cleanup of Writer/blockWriter
* sstable: micro-optimize blockWriter.store

Fixes #44631

Release note: None

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
Co-authored-by: Rohan Yadav <rohany@alumni.cmu.edu>
Co-authored-by: Peter Mattis <petermattis@gmail.com>
Very excited for this!
Not much to see here in terms of code, but this is a good venue to ask the question: any reason to not be doing this?
First commit is #44339
This is a (very rough) prototype for allowing consumers of the KV API to
restrict the size of result sets returned from spans. It does so
by simply plumbing through a byte hint from the kvfetcher to the
ultimate MVCC scan.
I mostly wanted to see that this all works, and it does. The `kv` table contains ~10k 1mb rows; for this PR vs. before, see the timings above. 12:58 is this PR, ~13:00 is master.
Finished up, this work should already de-flake #33660.
Release note: None