[dnm] kv: expose (and use) byte batch response size limit #44341
Conversation
TODO: due to the RocksDB/pebble modes of operation, there are currently two ways to do anything. Thanks to the two possible response types for batches (KVs vs bytes) there is another factor of two. In short, I expect to have to make three more similar changes to be able to fully implement byte hints for scans. TODO: testing, and potentially write the code in a saner way.

----

A fledgling step towards cockroachdb#19721 is allowing incoming KV requests to bound the size of the response in terms of bytes rather than rows. This commit provides the necessary functionality to pebbleMVCCScanner: the new maxBytes field stops the scan once the size of the result meets or exceeds the threshold (at least one key will be added, regardless of its size).

The classic example of the problem this addresses is a table in which each row is, say, ~1mb in size. A full table scan will currently fetch data from KV in batches of [10k] rows, causing at least 10GB of data to be held in memory at any given moment. This sort of thing does happen in practice; we have a long-failing roachtest cockroachdb#33660 because of just that, and anecdotally, OOMs in production clusters are regularly caused by individual queries consuming excessive amounts of memory at the KV level.

Plumbing this limit into a header field on BatchRequest and down to the engine level will allow the batching in [10k] to become byte-sized in nature, thus avoiding one obvious source of OOMs. This doesn't solve cockroachdb#19721 in general (many such requests could still work together to consume a lot of memory), but it's a sane baby step that might already avoid a large portion of OOMs.

[10k]: https://github.com/cockroachdb/cockroach/blob/0a658c19cd164e7c021eaff7f73db173f0650e8c/pkg/sql/row/kv_batch_fetcher.go#L25-L29

Release note: None
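To make the stopping rule concrete, here is a minimal, self-contained Go sketch (not the actual pebbleMVCCScanner code; the `kvPair` type and `accumulate` helper are invented for illustration) of a scan that stops once the accumulated result meets or exceeds a byte threshold while always returning at least one key:

```go
package main

import "fmt"

// kvPair is a stand-in for a decoded key/value pair returned by a scan.
type kvPair struct {
	key, value []byte
}

// accumulate appends pairs to the result until the total size meets or
// exceeds maxBytes (0 means no limit). At least one pair is always included,
// mirroring the "at least one key will be added, regardless of its size" rule.
// It returns the result and the index at which a resumed scan would continue.
func accumulate(pairs []kvPair, maxBytes int64) (result []kvPair, resumeIdx int) {
	var size int64
	for i, p := range pairs {
		if maxBytes > 0 && size >= maxBytes {
			return result, i
		}
		result = append(result, p)
		size += int64(len(p.key) + len(p.value))
	}
	return result, len(pairs)
}

func main() {
	rows := []kvPair{
		{[]byte("a"), make([]byte, 1<<20)}, // ~1MB values
		{[]byte("b"), make([]byte, 1<<20)},
		{[]byte("c"), make([]byte, 1<<20)},
	}
	got, resume := accumulate(rows, 1<<20)
	fmt.Println(len(got), resume) // 1 1: one row returned, scan resumes at index 1
}
```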
This is a (very rough) prototype for allowing consumers of the KV API to restrict the size of result sets returned from spans. It does so by simply plumbing through a byte hint from the kvfetcher to the ultimate MVCC scan.

I mostly wanted to see that this all works, and it does. In the below, the `kv` table contains ~10k 1mb rows.

This PR:

```
root@:26257/defaultdb> select sum(length(v)) from kv;
      sum
---------------
  10538188800
(1 row)

Time: 39.029189s
```

Before:

```
root@:26257/defaultdb ?> select sum(length(v)) from kv;
      sum
---------------
  10538188800
(1 row)

Time: 56.87552s
```

Finished up, this work should already de-flake cockroachdb#33660.

Release note: None
I can't think of anything. Seems like pure goodness in terms of both speed and memory usage.
I'm glad this was as straightforward as we had thought. The approach looks good and the results in your experiment are very promising. Have you gotten a chance to run this with `hotspotsplits/nodes=4`?
Reviewed 7 of 8 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal, @knz, @nvanbenschoten, and @tbg)
pkg/kv/dist_sender.go, line 1226 at r2 (raw file):
`mightStopEarly := ba.MaxSpanRequestKeys > 0 || ba.MaxSpanResponseBytes > 0`
nit: can this be pulled up above `canParallelize` and only computed once? I think it can also be set as the initial value of `canParallelize`.
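For illustration, a minimal sketch of the restructuring this nit suggests, with a hypothetical `batchHeader` type standing in for the real request header:

```go
package main

import "fmt"

// batchHeader is a stand-in for the real request header; only the two limit
// fields from the snippet above are modeled.
type batchHeader struct {
	MaxSpanRequestKeys   int64
	MaxSpanResponseBytes int64
}

// canParallelize folds the "might stop early" check into its initial value:
// a batch carrying a key or byte limit might not need results from later
// ranges, so it shouldn't be fanned out to all of them at once.
func canParallelize(h batchHeader) bool {
	mightStopEarly := h.MaxSpanRequestKeys > 0 || h.MaxSpanResponseBytes > 0
	return !mightStopEarly
}

func main() {
	fmt.Println(canParallelize(batchHeader{}))                              // true
	fmt.Println(canParallelize(batchHeader{MaxSpanResponseBytes: 1 << 20})) // false
}
```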
pkg/sql/row/kv_batch_fetcher.go, line 229 at r2 (raw file):
var ba roachpb.BatchRequest
ba.Header.MaxSpanRequestKeys = f.getBatchSize()
ba.Header.MaxSpanResponseBytes = 1 << 20 // 1MB
I'm assuming this isn't going to stay constant like this. If not, what were you envisioning as the policy to use for setting the byte limit on these requests?
- pull out a constant like we have with `kvBatchSize`? (the first two options are sketched below)
- pull out a cluster setting?
- dynamically grow the response size like we do in `getBatchSizeForIdx`? (I don't think this makes sense because there's no concept of soft row-size limits)
- use the optimizer to make size estimates using histograms and pass that in here?
- something else?
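A rough, self-contained sketch of the first two options (a package-level default plus a runtime override); the names and the env-var plumbing are invented stand-ins, not CockroachDB's settings API:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// defaultScanResponseBytes plays the role that kvBatchSize plays for the
// row-count limit: a package-level default for the byte-based limit.
const defaultScanResponseBytes = 1 << 20 // 1MB

// scanResponseBytes sketches the "cluster setting" flavor: the default can be
// overridden at runtime. An environment variable stands in for CockroachDB's
// settings infrastructure, purely to keep the example self-contained.
func scanResponseBytes() int64 {
	if v, err := strconv.ParseInt(os.Getenv("SCAN_RESPONSE_BYTES"), 10, 64); err == nil && v > 0 {
		return v
	}
	return defaultScanResponseBytes
}

func main() {
	fmt.Println(scanResponseBytes()) // 1048576 unless overridden
}
```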
pkg/storage/batcheval/cmd_scan.go, line 47 at r2 (raw file):
var kvData [][]byte
var numKvs int64
opts := engine.MVCCScanOptions{
Are we not planning on supporting byte limits for the `roachpb.KEY_VALUES` response format? We probably don't want to get into the business of estimating the additional overhead of the `KeyValue` protos when we inflate the batch representation, but even just using the same mechanism to provide a limit would be useful.
pkg/storage/batcheval/cmd_scan.go, line 53 at r2 (raw file):
kvData, numKvs, resumeSpan, intents, err = engine.MVCCScanToBytes(
    ctx, reader, args.Key, args.EndKey, cArgs.MaxKeys, h.Timestamp,
This seems to indicate that the `max` arg belongs in the `MVCCScanOptions` struct as well.
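A hedged sketch of what folding both limits into the options struct could look like; the `mvccScanOptions` fields here are illustrative, not the actual `engine.MVCCScanOptions` definition:

```go
package main

import "fmt"

// mvccScanOptions sketches the suggestion: group both limits in one options
// value rather than passing the key limit positionally and the byte limit via
// options. Field names are illustrative, not the engine package's definition.
type mvccScanOptions struct {
	Reverse  bool
	MaxKeys  int64 // stop after this many keys (0 = unlimited)
	MaxBytes int64 // stop once the response reaches this size (0 = unlimited)
}

// mvccScanToBytes is a stub call site; a real implementation would walk the
// MVCC iterator and honor both limits. Only the shape of the call matters here.
func mvccScanToBytes(startKey, endKey string, opts mvccScanOptions) {
	fmt.Printf("scan [%s, %s) maxKeys=%d maxBytes=%d\n",
		startKey, endKey, opts.MaxKeys, opts.MaxBytes)
}

func main() {
	mvccScanToBytes("a", "z", mvccScanOptions{MaxKeys: 10000, MaxBytes: 1 << 20})
}
```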
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal, @knz, @nvanbenschoten, and @tbg)
pkg/storage/batcheval/cmd_scan.go, line 47 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Are we not planning on supporting byte limits for the `roachpb.KEY_VALUES` response format? We probably don't want to get into the business of estimating the additional overhead of the `KeyValue` protos when we inflate the batch representation, but even just using the same mechanism to provide a limit would be useful.
I think we should support `MaxSpanResponseBytes` with the `KEY_VALUES` format. And agree that we shouldn't try to estimate the additional overhead of the `KeyValue` protos. Simply passing through the `MaxSpanResponseBytes` setting to `MVCCScan` seems sufficient.
@tbg is this going to make it into this release? I'm working on paginating CDC backfills.

Given this is coming soon, I think I'm going to just do (1) above and leave (2) to adopt this. In the short term I'm going to leave a TODO in CDC backfills to adopt the byte limit code rather than implementing something of my own.

Relates to #44482 and the CDC portion of #39717 (comment).
It is planned for this release, yes (whether it makes it will depend on how happy we are with the results).
@ajwerner this is going to be my must-have for the coming milestone, so yes (and I'm planning to get it done early, too). Mind linking me to the pagination issue?
Never mind, found the issue: #43356
In cockroachdb#44440 we added a `targetSize` parameter to enable pagination of export requests. In that PR we defined `targetSize` such that the export returns just before the key that would lead to the `targetSize` being exceeded. This definition is unfortunate when thinking about a total size limit for pagination in the DistSender (which we'll add when cockroachdb#44341 comes in).

Imagine a case where we set a total byte limit of 1MB and a file byte limit of 1MB. That setting should lead to at most a single file being emitted (assuming one range holds enough data). If we used the previous definition, we'd create a file which is just below 1MB and then the DistSender would need to send another request containing a tiny amount of data.

This brings the behavior in line with the semantics introduced in cockroachdb#44341 for ScanRequests and is just easier to reason about.

Release note: None
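A small worked example of the two definitions, with invented helpers and sizes, shows why the new semantics avoid the tiny follow-up request:

```go
package main

import "fmt"

// splitBefore models the old semantics: a file is cut just before the key
// that would push it past target.
func splitBefore(sizes []int64, target int64) (fileBytes int64, next int) {
	for i, s := range sizes {
		if fileBytes+s > target {
			return fileBytes, i
		}
		fileBytes += s
	}
	return fileBytes, len(sizes)
}

// splitAtOrAfter models the new semantics (matching the byte-limit behavior
// in #44341): the key that makes the file meet or exceed target is included.
func splitAtOrAfter(sizes []int64, target int64) (fileBytes int64, next int) {
	for i, s := range sizes {
		fileBytes += s
		if fileBytes >= target {
			return fileBytes, i + 1
		}
	}
	return fileBytes, len(sizes)
}

func main() {
	sizes := []int64{400 << 10, 400 << 10, 400 << 10} // three ~400KB keys
	target := int64(1 << 20)                          // 1MB file and total limit

	oldBytes, oldNext := splitBefore(sizes, target)
	newBytes, newNext := splitAtOrAfter(sizes, target)
	fmt.Println(oldBytes, oldNext) // 819200 2: file just under 1MB, one key left for a follow-up request
	fmt.Println(newBytes, newNext) // 1228800 3: a single file, nothing left over
}
```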
This commit adopts the API change in cockroachdb#44440 and the previous commit. It adds a hidden cluster setting to control the target size. There's not a ton of testing but there's some.

Further work includes:

* Adding a mechanism to limit the number of files returned from an ExportRequest for use in CDC backfills. This is currently blocked on cockroachdb#44341.

I'm omitting a release note because the setting is hidden.

Release note: None.
This PR is motivated by the desire to get the memory usage of CDC under control in the presence of much larger ranges. Currently, when a changefeed decides it needs to do a backfill, it breaks the spans up along range boundaries and then fetches the data (with some parallelism) for the backfill. The memory overhead was somewhat bounded by the range size. If we want to make the range size dramatically larger, the memory usage would become a function of that new, much larger range size.

Fortunately, we don't have much need for these `ExportRequest`s any more. Another fascinating revelation of late is that the `ScanResponse` does indeed include MVCC timestamps (not that we necessarily needed them, but it's a good idea to keep them for compatibility).

The `ScanRequest` currently permits a limit on `NumRows`, which this commit utilizes. I wanted to get this change typed in anticipation of cockroachdb#44341, which will provide a limit on `NumBytes`. I retained the existing parallelism, as ScanRequests with limits are not parallelized.

I would like to do some benchmarking, but I feel pretty okay about the testing we have in place already. @danhhz what do you want to see here?

Relates to cockroachdb#39717.

Release note: None.
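As a hedged illustration of the pagination pattern this enables (toy types, not the changefeed code): issue limited scans and continue from the resume span until the span is exhausted, so memory is bounded by one batch rather than one range.

```go
package main

import "fmt"

// span is a toy key span; real code would use roachpb.Span.
type span struct{ start, end int }

// scanLimited stands in for sending a ScanRequest with MaxSpanRequestKeys set
// (and, once available, MaxSpanResponseBytes). It returns how many rows were
// fetched and, if it stopped early, a resume span covering the remainder.
func scanLimited(s span, maxRows int) (rows int, resume *span) {
	if n := s.end - s.start; n > maxRows {
		return maxRows, &span{s.start + maxRows, s.end}
	}
	return s.end - s.start, nil
}

// backfill keeps issuing limited scans, continuing from the resume span,
// until the original span is exhausted.
func backfill(s span, maxRows int) int {
	total := 0
	for {
		rows, resume := scanLimited(s, maxRows)
		total += rows
		if resume == nil {
			return total
		}
		s = *resume
	}
}

func main() {
	fmt.Println(backfill(span{0, 10500}, 1000)) // 10500 rows, fetched 1000 at a time
}
```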
44663: changefeedccl: use ScanRequest instead of ExportRequest during backfills r=danhhz a=ajwerner

This PR is motivated by the desire to get the memory usage of CDC under control in the presence of much larger ranges. Currently, when a changefeed decides it needs to do a backfill, it breaks the spans up along range boundaries and then fetches the data (with some parallelism) for the backfill. The memory overhead was somewhat bounded by the range size. If we want to make the range size dramatically larger, the memory usage would become a function of that new, much larger range size.

Fortunately, we don't have much need for these `ExportRequest`s any more. Another fascinating revelation of late is that the `ScanResponse` does indeed include MVCC timestamps (not that we necessarily needed them, but it's a good idea to keep them for compatibility).

The `ScanRequest` currently permits a limit on `NumRows`, which this commit utilizes. I wanted to get this change typed in anticipation of #44341, which will provide a limit on `NumBytes`. I retained the existing parallelism, as ScanRequests with limits are not parallelized.

I would like to do some benchmarking, but I feel pretty okay about the testing we have in place already. @danhhz what do you want to see here?

Relates to #39717.

Release note: None.

44719: sql: add telemetry for uses of alter primary key r=otan a=rohany

Fixes #44716.

This PR adds a telemetry counter for uses of the alter primary key command.

Release note (sql change): This PR adds collected telemetry from clusters upon using the alter primary key command.

44721: vendor: bump pebble to 89adc50375ffd11c8e62f46f1a5c320012cffafe r=petermattis a=petermattis

* db: additional tweak to the sstable boundary generation
* db: add memTable.logSeqNum
* db: force flushing of overlapping queued memtables during ingestion
* tool: lsm visualization tool
* db: consistently handle key decoding failure
* cmd/pebble: fix lint for maxOpsPerSec's type inference
* tool: add "find" command
* cmd/pebble: fix --rate flag
* internal/metamorphic: use the strict MemFS and add an operation to reset the DB
* db: make DB.Close() wait for any ongoing deletion of obsolete files
* sstable: encode varints directly into buf in blockWriter.store
* sstable: micro-optimize Writer.addPoint()
* sstable: minor cleanup of Writer/blockWriter
* sstable: micro-optimize blockWriter.store

Fixes #44631

Release note: None

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
Co-authored-by: Rohan Yadav <rohany@alumni.cmu.edu>
Co-authored-by: Peter Mattis <petermattis@gmail.com>
Very excited for this!
Not much to see here in terms of code, but this is a good venue to ask the question: any reason to not be doing this?
First commit is #44339
This is a (very rough) prototype for allowing consumers of the KV API to
restrict the size of result sets returned from spans. It does so
by simply plumbing through a byte hint from the kvfetcher to the
ultimate MVCC scan.
I mostly wanted to see that this all works, and it does. The `kv` table contains ~10k 1mb rows; for this PR vs. before, see the timings above. 12:58 is this PR, ~13:00 is master.
Finished up, this work should already de-flake #33660.
Release note: None