admission,kvserver: improved byte token estimation for writes #85059

Merged: 1 commit merged into cockroachdb:master on Aug 9, 2022

Conversation


@sumeerbhola sumeerbhola commented Jul 26, 2022

The existing scheme for byte token estimation simply looked
at the total bytes added to L0 and divided it among the number
of requests. This was because (a) the parameters to provide
better size information for the request were not populated by
kvserver, and (b) the basic estimation approach was flawed,
since it assumed that regular writes would be roughly equal
sized and that ingests would tell what fraction went into L0.

The kvserver-side plumbing for improving (a) was done in a
preceding PR (#83937). This one completes that plumbing to pass
admission.StoreWorkDoneInfo on to the admission control package.
In this scheme the {WriteBytes,IngestedBytes} are provided
post-proposal evaluation, and the IngestedBytes is for the
whole LSM. This PR makes changes to the plumbing in the
admission package: specifically, the post-work-done token
adjustments are performed via the granterWithStoreWriteDone
interface and the addition to granterWithIOTokens. The former
also returns the token adjustments to StoreWorkQueue so
that the per-tenant fairness accounting in WorkQueue can be
updated.
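
For orientation, here is a rough sketch of the two granter hooks named above. The method names come from this PR's discussion, but the exact signatures and full method sets are abridged and partly assumed, so treat this as illustrative rather than the merged code:

```go
// granterWithStoreWriteDone embeds granter (it is handed to
// StoreWorkQueue, whose internal WorkQueue needs a granter) and is
// invoked when admitted work completes, returning the token
// adjustment so per-tenant fairness accounting can be updated.
type granterWithStoreWriteDone interface {
	granter
	storeWriteDone(originalTokens int64, doneInfo StoreWorkDoneInfo) (additionalTokens int64)
}

// granterWithIOTokens is used by ioLoadListener, the entity that
// periodically looks at load and computes the byte tokens to grant.
type granterWithIOTokens interface {
	// setAvailableIOTokensLocked returns the tokens used in the
	// previous interval (this signature is quoted in the review below).
	setAvailableIOTokensLocked(tokens int64) (tokensUsed int64)
	// ...plus the setter for admitted-work-done adjustments discussed
	// in the review comments.
}
```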

The main changes in this PR are in the byte token
estimation logic in the admission package, where the
estimation now uses a linear model y = a.x + b, where
x is the bytes provided in admission.StoreWorkDoneInfo,
and y is the bytes added to L0 via write or ingestion.
If we consider regular writes, one can expect that even
with many different-sized workloads concurrently active
on a node, we should be able to fit a model where
a is roughly 2 and b is tiny -- this is because x is the
bytes written to the raft log and does not include the
subsequent state machine application. Similarly, one can
expect the a term to be in the interval [0,1] for ingested
work. The linear model is meant to fix flaw (b) mentioned
earlier. The current linear model fitting in
store_token_estimation.go is very simple and can be
independently improved in the future -- there are code
comments outlining this. Additionally, all the byte token
estimation logic in granter.go has been removed, which
is better from a code readability perspective.
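
To make the fitting concrete, below is a minimal, self-contained sketch of one fitting interval as described here and in the review discussion (per-work constant pinned at its lower bound of 1 byte, the remainder attributed to the multiplier, then exponential smoothing with the alpha=0.5 mentioned later in this thread). It deliberately omits the multiplier bounds and the zero/anomaly heuristics that the real store_token_estimation.go applies, so treat it as illustrative:

```go
package admission

// tokensLinearModel is an illustrative y = a.x + b model: given
// accounted bytes x for a work item, it estimates the bytes that
// item added to L0.
type tokensLinearModel struct {
	multiplier float64 // a
	constant   int64   // b, per work item, lower-bounded at 1
}

// fitInterval fits the latest interval exactly and smooths the result
// into the running model. Summed over all requests in the interval,
// the constraint is: actualBytes ≈ a*accountedBytes + workCount*b.
func fitInterval(
	smoothed tokensLinearModel, accountedBytes, actualBytes, workCount int64,
) tokensLinearModel {
	const alpha = 0.5    // smoothing factor noted in the review discussion
	constant := int64(1) // pin b at its lower bound of 1 byte
	rem := actualBytes - workCount*constant
	if rem < 0 {
		rem = 0
	}
	multiplier := float64(rem) / float64(accountedBytes) // assumes accountedBytes > 0
	// Exponentially smooth the interval fit into the running model.
	smoothed.multiplier = alpha*multiplier + (1-alpha)*smoothed.multiplier
	smoothed.constant = int64(alpha*float64(constant) + (1-alpha)*float64(smoothed.constant))
	return smoothed
}
```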

This change was evaluated with a single node that first
saw a kv0 workload that writes 64KB blocks, then
additionally a kv0 workload that writes 4KB blocks, and
finally a third workload that starts doing an index
backfill due to creating an index on the v column in
the kv table.

Here are snippets from a sequence of log statements when
only the first workload (64KB writes) was running:

```
write-model 1.46x+1 B (smoothed 1.50x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 78 KiB
write-model 1.37x+1 B (smoothed 1.36x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 80 KiB
write-model 1.50x+1 B (smoothed 1.43x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 79 KiB
write-model 1.39x+1 B (smoothed 1.30x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 77 KiB
```

Note that the parameter a in a.x does fluctuate. The
additive value b stays at the minimum of 1 byte, which
is desirable. There is no change to the starting ingest
model since there are no ingestions.

After both the 4KB and 64KB writes are active the log
statements look like:

```
write-model 1.85x+1 B (smoothed 1.78x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 59 KiB
write-model 1.23x+1 B (smoothed 1.51x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 47 KiB
write-model 1.21x+1 B (smoothed 1.36x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 40 KiB
```

Note that the b value stays at 1 byte. The tokens consumed
at admission time are evenly divided among requests, so
the value has dropped.

When the index backfill is also running, the sstables are
ingested into L5 and L6, so the x value in the ingested
model is high, but what is ingested into L0 is low, which
means a becomes very small for the ingested-model -- see
the smoothed 0.00x+1 B below. There is choppiness in this
experiment with respect to the write model and the
at-admission-tokens values,
which is caused by a high number of write stalls. This
was not planned for, and is a side-effect of huge Pebble
manifests caused by 64KB keys. So ignore those values in
the following log statements.

```
write-model 1.93x+1 B (smoothed 1.56x+2 B) + ingested-model 0.00x+1 B (smoothed 0.00x+1 B) + at-admission-tokens 120 KiB
write-model 2.34x+1 B (smoothed 1.95x+1 B) + ingested-model 0.00x+1 B (smoothed 0.00x+1 B) + at-admission-tokens 157 KiB
```

Fixes #79092
Informs #82536

Release note: None

@sumeerbhola sumeerbhola requested review from tbg, irfansharif and a team July 26, 2022 12:48
@sumeerbhola sumeerbhola requested review from a team as code owners July 26, 2022 12:48
@cockroach-teamcity (Member) commented:

This change is Reviewable

@sumeerbhola (Collaborator Author) commented:

Here is a kv0 workload with 1KB writes, and then an index creation was started for the v column. The first log statement is after the index creation has been running for 1+ min and all the ingested sstables have been going to levels below L0 (note ingested-model 0.00x+1 B (smoothed 0.03x+1 B)). Subsequent ones start seeing some ingested sstables going into L0: note the fluctuating fraction of bytes ingested into L0, which is why there is significant fluctuation in the smoothed ingested-model.

```
I220726 15:18:43.046660 379 util/admission/granter.go:2078 ⋮ [s1] 749  IO overload: compaction score 0.850 (499 ssts, 17 sub-levels), L0 growth 518 MiB (write 518 MiB ingest 0 B): requests 168843 with 441 MiB acc-write + 147 MiB acc-ingest + write-model 1.17x+1 B (smoothed 1.18x+1 B) + ingested-model 0.00x+1 B (smoothed 0.03x+1 B) + at-admission-tokens 2.9 KiB, compacted 366 MiB [≈405 MiB], flushed 863 MiB [≈912 MiB]; admitting 1.3 GiB (rate 88 MiB/s) due to memtable flush (multiplier 1.450) (used 540 MiB)

I220726 15:18:58.079426 379 util/admission/granter.go:2078 ⋮ [s1] 1004  IO overload: compaction score 0.900 (701 ssts, 18 sub-levels), L0 growth 644 MiB (write 612 MiB ingest 33 MiB): requests 149733 with 400 MiB acc-write + 131 MiB acc-ingest + write-model 1.53x+1 B (smoothed 1.35x+1 B) + ingested-model 0.25x+1 B (smoothed 0.14x+1 B) + at-admission-tokens 3.7 KiB, compacted 335 MiB [≈370 MiB], flushed 856 MiB [≈884 MiB]; admitting 1.2 GiB (rate 81 MiB/s) due to memtable flush (multiplier 1.375) (used 481 MiB)
I220726 15:19:13.060785 379 util/admission/granter.go:2078 ⋮ [s1] 1256  IO overload: compaction score 0.800 (683 ssts, 16 sub-levels), L0 growth 594 MiB (write 561 MiB ingest 33 MiB): requests 156666 with 419 MiB acc-write + 154 MiB acc-ingest + write-model 1.34x+1 B (smoothed 1.35x+1 B) + ingested-model 0.21x+1 B (smoothed 0.18x+1 B) + at-admission-tokens 3.8 KiB, compacted 613 MiB [≈492 MiB], flushed 941 MiB [≈912 MiB]; admitting 1.2 GiB (rate 79 MiB/s) due to memtable flush (multiplier 1.300) (used 599 MiB)
I220726 15:19:28.046080 379 util/admission/granter.go:2078 ⋮ [s1] 1482  IO overload: compaction score 0.850 (727 ssts, 17 sub-levels), L0 growth 619 MiB (write 603 MiB ingest 16 MiB): requests 150585 with 403 MiB acc-write + 130 MiB acc-ingest + write-model 1.50x+1 B (smoothed 1.42x+1 B) + ingested-model 0.12x+1 B (smoothed 0.15x+1 B) + at-admission-tokens 4.0 KiB, compacted 579 MiB [≈535 MiB], flushed 879 MiB [≈896 MiB]; admitting 1.1 GiB (rate 73 MiB/s) due to memtable flush (multiplier 1.225) (used 567 MiB)
I220726 15:19:46.296262 379 util/admission/granter.go:2078 ⋮ [s1] 1739  IO overload: compaction score 0.850 (704 ssts, 17 sub-levels), L0 growth 624 MiB (write 624 MiB ingest 0 B): requests 148967 with 399 MiB acc-write + 147 MiB acc-ingest + write-model 1.57x+1 B (smoothed 1.49x+1 B) + ingested-model 0.00x+1 B (smoothed 0.08x+1 B) + at-admission-tokens 4.1 KiB, compacted 620 MiB [≈578 MiB], flushed 735 MiB [≈815 MiB]; admitting 938 MiB (rate 62 MiB/s) due to memtable flush (multiplier 1.150) (used 607 MiB)
I220726 15:20:01.296500 379 util/admission/granter.go:2078 ⋮ [s1] 2008  IO overload: compaction score 0.950 (738 ssts, 19 sub-levels), L0 growth 594 MiB (write 545 MiB ingest 49 MiB): requests 153130 with 409 MiB acc-write + 122 MiB acc-ingest + write-model 1.33x+1 B (smoothed 1.41x+1 B) + ingested-model 0.40x+1 B (smoothed 0.24x+1 B) + at-admission-tokens 4.1 KiB, compacted 522 MiB [≈550 MiB], flushed 841 MiB [≈828 MiB]; admitting 890 MiB (rate 59 MiB/s) due to memtable flush (multiplier 1.075) (used 623 MiB)
I220726 15:20:16.296577 379 util/admission/granter.go:2078 ⋮ [s1] 2293  IO overload: compaction score 1.050[L0-overload] (734 ssts, 21 sub-levels), L0 growth 631 MiB (write 614 MiB ingest 16 MiB): requests 152568 with 408 MiB acc-write + 130 MiB acc-ingest + write-model 1.51x+1 B (smoothed 1.46x+1 B) + ingested-model 0.12x+1 B (smoothed 0.18x+1 B) + at-admission-tokens 4.1 KiB, compacted 720 MiB [≈635 MiB], flushed 877 MiB [≈852 MiB]; admitting 434 MiB (rate 29 MiB/s) due to L0 growth (used 623 MiB)
I220726 15:20:31.296849 379 util/admission/granter.go:2078 ⋮ [s1] 2349  IO overload: compaction score 1.200[L0-overload] (427 ssts, 24 sub-levels), L0 growth 311 MiB (write 311 MiB ingest 0 B): requests 117114 with 312 MiB acc-write + 33 MiB acc-ingest + write-model 1.00x+1 B (smoothed 1.23x+1 B) + ingested-model 0.00x+1 B (smoothed 0.09x+1 B) + at-admission-tokens 3.4 KiB, compacted 909 MiB [≈772 MiB], flushed 1.3 GiB [≈1.1 GiB]; admitting 410 MiB (rate 27 MiB/s) due to L0 growth (used 565 MiB)
I220726 15:21:01.296077 379 util/admission/granter.go:2078 ⋮ [s1] 2541  IO overload: compaction score 0.700 (535 ssts, 14 sub-levels), L0 growth 527 MiB (write 511 MiB ingest 16 MiB): requests 159753 with 425 MiB acc-write + 146 MiB acc-ingest + write-model 1.20x+1 B (smoothed 1.03x+1 B) + ingested-model 0.11x+1 B (smoothed 0.10x+1 B) + at-admission-tokens 2.9 KiB, compacted 239 MiB [≈361 MiB], flushed 822 MiB [≈1.0 GiB]; admitting 899 MiB (rate 60 MiB/s) due to memtable flush (multiplier 0.875) (used 400 MiB)
I220726 15:21:16.296006 379 util/admission/granter.go:2078 ⋮ [s1] 2800  IO overload: compaction score 0.650 (527 ssts, 13 sub-levels), L0 growth 641 MiB (write 576 MiB ingest 65 MiB): requests 144533 with 388 MiB acc-write + 146 MiB acc-ingest + write-model 1.49x+1 B (smoothed 1.26x+1 B) + ingested-model 0.44x+1 B (smoothed 0.27x+1 B) + at-admission-tokens 3.7 KiB, compacted 442 MiB [≈402 MiB], flushed 882 MiB [≈954 MiB]; admitting 764 MiB (rate 51 MiB/s) due to memtable flush (multiplier 0.800) (used 411 MiB)
I220726 15:21:31.307433 379 util/admission/granter.go:2078 ⋮ [s1] 3029  IO overload: compaction score 0.600 (518 ssts, 12 sub-levels), L0 growth 564 MiB (write 548 MiB ingest 16 MiB): requests 152196 with 406 MiB acc-write + 122 MiB acc-ingest + write-model 1.35x+1 B (smoothed 1.30x+1 B) + ingested-model 0.13x+1 B (smoothed 0.20x+1 B) + at-admission-tokens 3.8 KiB, compacted 597 MiB [≈499 MiB], flushed 893 MiB [≈924 MiB]; admitting 670 MiB (rate 45 MiB/s) due to memtable flush (multiplier 0.725) (used 575 MiB)
I220726 15:21:46.296270 379 util/admission/granter.go:2078 ⋮ [s1] 3318  IO overload: compaction score 0.750 (625 ssts, 15 sub-levels), L0 growth 619 MiB (write 619 MiB ingest 0 B): requests 151091 with 404 MiB acc-write + 163 MiB acc-ingest + write-model 1.53x+1 B (smoothed 1.42x+1 B) + ingested-model 0.00x+1 B (smoothed 0.10x+1 B) + at-admission-tokens 4.0 KiB, compacted 482 MiB [≈490 MiB], flushed 830 MiB [≈877 MiB]; admitting 570 MiB (rate 38 MiB/s) due to memtable flush (multiplier 0.650) (used 579 MiB)
I220726 15:22:01.296298 379 util/admission/granter.go:2078 ⋮ [s1] 3430  IO overload: compaction score 1.050[L0-overload] (595 ssts, 21 sub-levels), L0 growth 570 MiB (write 570 MiB ingest 0 B): requests 148162 with 398 MiB acc-write + 81 MiB acc-ingest + write-model 1.43x+1 B (smoothed 1.43x+1 B) + ingested-model 0.00x+1 B (smoothed 0.05x+1 B) + at-admission-tokens 4.0 KiB, compacted 767 MiB [≈629 MiB], flushed 977 MiB [≈927 MiB]; admitting 397 MiB (rate 26 MiB/s) due to L0 growth (used 700 MiB)
I220726 15:22:31.296533 379 util/admission/granter.go:2078 ⋮ [s1] 3567  IO overload: compaction score 0.650 (396 ssts, 13 sub-levels), L0 growth 514 MiB (write 465 MiB ingest 49 MiB): requests 165243 with 439 MiB acc-write + 131 MiB acc-ingest + write-model 1.06x+1 B (smoothed 1.01x+1 B) + ingested-model 0.38x+1 B (smoothed 0.21x+1 B) + at-admission-tokens 2.9 KiB, compacted 83 MiB [≈355 MiB], flushed 859 MiB [≈1.0 GiB]; admitting 526 MiB (rate 35 MiB/s) due to memtable flush (multiplier 0.500) (used 459 MiB)
I220726 15:22:46.296644 379 util/admission/granter.go:2078 ⋮ [s1] 3864  IO overload: compaction score 0.850 (646 ssts, 17 sub-levels), L0 growth 648 MiB (write 632 MiB ingest 16 MiB): requests 146041 with 389 MiB acc-write + 122 MiB acc-ingest + write-model 1.62x+1 B (smoothed 1.32x+1 B) + ingested-model 0.13x+1 B (smoothed 0.17x+1 B) + at-admission-tokens 3.7 KiB, compacted 249 MiB [≈302 MiB], flushed 848 MiB [≈950 MiB]; admitting 475 MiB (rate 32 MiB/s) due to memtable flush (multiplier 0.500) (used 438 MiB)
I220726 15:23:01.296200 379 util/admission/granter.go:2078 ⋮ [s1] 3869  IO overload: compaction score 0.650 (315 ssts, 13 sub-levels), L0 growth 343 MiB (write 327 MiB ingest 16 MiB): requests 139648 with 375 MiB acc-write + 16 MiB acc-ingest + write-model 0.87x+1 B (smoothed 1.09x+1 B) + ingested-model 0.99x+1 B (smoothed 0.58x+1 B) + at-admission-tokens 3.1 KiB, compacted 957 MiB [≈629 MiB], flushed 1.2 GiB [≈1.1 GiB]; admitting 556 MiB (rate 37 MiB/s) due to memtable flush (multiplier 0.500) (used 617 MiB)
I220726 15:23:16.300511 379 util/admission/granter.go:2078 ⋮ [s1] 3886  IO overload: compaction score 0.700 (389 ssts, 14 sub-levels), L0 growth 454 MiB (write 454 MiB ingest 0 B): requests 165673 with 441 MiB acc-write + 98 MiB acc-ingest + write-model 1.03x+1 B (smoothed 1.06x+1 B) + ingested-model 0.00x+1 B (smoothed 0.29x+1 B) + at-admission-tokens 3.0 KiB, compacted 218 MiB [≈424 MiB], flushed 871 MiB [≈992 MiB]; admitting 496 MiB (rate 33 MiB/s) due to memtable flush (multiplier 0.500) (used 702 MiB)
I220726 15:23:31.331614 379 util/admission/granter.go:2078 ⋮ [s1] 4076  IO overload: compaction score 0.700 (492 ssts, 14 sub-levels), L0 growth 580 MiB (write 580 MiB ingest 0 B): requests 153719 with 411 MiB acc-write + 114 MiB acc-ingest + write-model 1.41x+1 B (smoothed 1.24x+1 B) + ingested-model 0.00x+1 B (smoothed 0.15x+1 B) + at-admission-tokens 3.4 KiB, compacted 484 MiB [≈454 MiB], flushed 898 MiB [≈945 MiB]; admitting 473 MiB (rate 32 MiB/s) due to memtable flush (multiplier 0.500) (used 574 MiB)
```

@sumeerbhola (Collaborator Author) commented:

For this previous experiment: over time the index backfill starts having more ingests into L0, so we start seeing log statements like:

```
ingested-model 0.99x+1 B (smoothed 0.68x+1 B)
ingested-model 0.67x+1 B (smoothed 0.67x+1 B)
ingested-model 1.00x+1 B (smoothed 0.83x+1 B)
```

@irfansharif irfansharif (Contributor) left a comment

Flushing some comments.

Reviewed 7 of 17 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)


pkg/kv/kvserver/mvcc_gc_queue.go line 473 at r1 (raw file):

		r.admissionController.AdmittedKVWorkDone(admissionHandle, writeBytes)
		if writeBytes != nil {
			writeBytes.Release()

BTW, you can fold this nil check into the implementation of .Release() to skip needing to do so at every caller. Then your callers can simply:

_, writeBytes, pErr := r.repl.SendWithWriteBytes(ctx, ba)
defer writeBytes.Release()
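
A self-contained sketch of that suggestion. The StoreWriteBytes type name and the pool are hypothetical stand-ins for whatever SendWithWriteBytes actually returns and reuses:

```go
package kvserver

import "sync"

// StoreWriteBytes carries the write/ingest sizes for a request.
type StoreWriteBytes struct {
	// fields elided
}

// writeBytesPool is a hypothetical pool backing StoreWriteBytes reuse.
var writeBytesPool = sync.Pool{
	New: func() interface{} { return new(StoreWriteBytes) },
}

// Release is a no-op on a nil receiver, so callers can unconditionally
// write `defer writeBytes.Release()` with no nil check at each site.
func (wb *StoreWriteBytes) Release() {
	if wb == nil {
		return
	}
	writeBytesPool.Put(wb)
}
```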

pkg/kv/kvserver/store.go line 3845 at r1 (raw file):

				if ah.storeAdmissionQ != nil {
					// No bytes were written.
					_ = ah.storeAdmissionQ.AdmittedWorkDone(ah.storeWorkHandle, admission.StoreWorkDoneInfo{})

For my own understanding: what happens if we don't call AdmittedWorkDone in the face of errors? We'll have taken away tokens from the underlying work queue without having used it?


pkg/util/admission/granter.go line 708 at r1 (raw file):

	// since the grant coordinator doesn't know when to call the tryGrant since
	// it does not know whether tokens have become available.
	sg.coord.mu.Lock()

[nit] Could you nest everything this mutex protects within some embedded struct?


pkg/util/admission/granter.go line 1824 at r1 (raw file):

// The interface is used by the entity that periodically looks at load and
// computes the tokens to grant (ioLoadListener).
type granterWithIOTokens interface {

Did you mean to embed the granter interface here? Like you did for granterWith{StoreWriteDone,LockedCalls}.


pkg/util/admission/granter.go line 1835 at r1 (raw file):

	// possible for tokens to be returned.
	setAvailableIOTokensLocked(tokens int64) (tokensUsed int64)
	// setAdmittedDoneAdjustments supplies the adjustments to use when

Comment needs updating, it refers to setAdmittedDoneAdjustments which isn't what we have below.


pkg/util/admission/granter.go line 2083 at r1 (raw file):

}

func (io *ioLoadListener) copyAuxFromPerWorkEstimator() {

This is doing more than just copying aux. This particular method is a bit difficult to understand; a few things would make it easier:

  • comment on adjustTokensResult
  • comment on storePerWorkTokenEstimator

If I'm reading it all correctly, this is just for the pretty printing above near "IO overload: %s" (right?). Could we get rid of this copying by instead printing directly the contents of perWorkTokenEstimator? Maintaining the contents twice across two places makes it possible to forget to update one or the other, or not use the right one if not copied over correctly.


pkg/util/admission/granter.go line 2442 at r1 (raw file):

		ib(res.aux.perWorkTokensAux.intIngestedLinearModel.constant),
		res.ingestedLM.multiplier, ib(res.ingestedLM.constant))
	p.Printf("at-admission-tokens %s, ", ib(res.requestEstimates.writeTokens))

[nit] Add a comment here, the ones above and below are helpful for outsiders to quickly interpret each thing.

@irfansharif irfansharif (Contributor) left a comment

This is really great work!

Reviewed 10 of 17 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)


-- commits line 13 at r1:
Add a PR reference here for future readers.


-- commits line 27 at r1:
Mention what the y term represents here.


-- commits line 35 at r1:
s/imagine being/imagine the a term being?


pkg/util/admission/store_token_estimation.go line 36 at r1 (raw file):

//    and ingests.
//
// Over time, we should change the code to perform step (b) for (1) and for

We should also be able to do (a) for (1) like mentioned above.


pkg/util/admission/store_token_estimation.go line 37 at r1 (raw file):

//
// Over time, we should change the code to perform step (b) for (1) and for
// the raft log for (3). This will require noticing under raft that a write is

Even if we didn't, the linear model here should settle to some appropriate multiplier term, right?


pkg/util/admission/store_token_estimation.go line 75 at r1 (raw file):

//
// Based on these observations we adopt a linear model for estimating the
// actual bytes (y), given the accounted bytes (x): y = a.x + b. The

y == actual bytes making into just L0, right?


pkg/util/admission/store_token_estimation.go line 88 at r1 (raw file):

[~0, 1] ... [~0, 1.5]

Why the ~0? And not just 0 and -0.5 respectively?


pkg/util/admission/store_token_estimation.go line 109 at r1 (raw file):

// linear regression, under the assumption that the workload is stable.
// However, the simple approach here should be an improvement on the additive
// approach we currently use.

s/currently use/previously used?


pkg/util/admission/store_token_estimation.go line 126 at r1 (raw file):

	// L0, the actual bytes will be zero and the accounted bytes non-zero. We
	// need to update the model in this case.
	updateWithZeroActualNonZeroAccounted bool

This appears in the constructor as well, and is a bit specific to the internals. If it's specifically for the ingested bytes model, perhaps call it just that: ingestedBytesModel bool. Looks like we were doing just that in the datadriven directive as well.


pkg/util/admission/store_token_estimation.go line 148 at r1 (raw file):

// bytes claimed by these work items (accountedBytes), and the actual bytes
// observed in the LSM for that interval (actualBytes).
func (f *tokensLinearModelFitter) updateModelUsingIntervalStats(

[nit] This looks right to me but the tiniest bit of pseudo-code up top will make it easier to follow. I personally don't have many points of reference for how auto-adjusting linear models like this are typically built. I understand that we're setting the constant term to 1 to first learn the appropriate multiplier, and then using that multiplier to compute an additive adjustment for the constant; what I don't have as much of an intuition for is that this doesn't cause wild oscillations in the constant term. I can look at the tests, they're instructive, and probably play with this myself to build an intuition, but it would also be helpful to understand how you arrived at this scheme. I see that you mention it might not be a good scheme, so maybe much of what I'm saying is redundant and we've just found this to work well enough in our experiments.


pkg/util/admission/store_token_estimation.go line 188 at r1 (raw file):

	// Put a lower bound of 1 on constant. So workCount tokens go into that.
	constant := int64(1)
	multiplier := float64(max(0, actualBytes-workCount*constant)) / float64(accountedBytes)

Could you explain the workCount * constant bit to me? A (naive) derivation:

y = ax + b
a = (y - b)/x
multiplier = (actualBytes - constant)/accountedBytes

pkg/util/admission/store_token_estimation.go line 306 at r1 (raw file):

// TODO(sumeer):
// - Make followers tell admission control about the size of their write and
//   sstable bytes, without asking for permission. We will include these in

By not seeking permission you just mean this work is never going to wait in admission queues, right?


pkg/util/admission/work_queue.go line 1553 at r1 (raw file):

	// time. The token subtraction at admission time is completely based on past
	// estimates. This estimation is improved at work completion time via size
	// information provided in StoreWorkDoneInfo.

Is it worth adding a TODO here to allow callers to optionally provide information at admission time that the admission package can entirely disregard? This is for cases where we do know more about the incoming request (say for ingests or snapshots) and our internal estimates may be off. I'm not sure how important this is (you've discussed this elsewhere).


pkg/util/admission/testdata/format_adjust_tokens_stats.txt line 6 at r1 (raw file):

compaction score 0.000 (0 ssts, 0 sub-levels), L0 growth 0 B (write 0 B ingest 0 B): requests 0 with 0 B acc-write + 0 B acc-ingest + write-model 0.00x+0 B (smoothed 0.00x+0 B) + ingested-model 0.00x+0 B (smoothed 0.00x+0 B) + at-admission-tokens 0 B, compacted 0 B [≈0 B], flushed 0 B [≈0 B]; admitting all
real-numbers:
compaction score 2.700[L0-overload] (195 ssts, 27 sub-levels), L0 growth 577 MiB (write 0 B ingest 0 B): requests 0 with 0 B acc-write + 0 B acc-ingest + write-model 0.00x+0 B (smoothed 0.00x+0 B) + ingested-model 0.00x+0 B (smoothed 0.00x+0 B) + at-admission-tokens 0 B, compacted 77 MiB [≈62 MiB], flushed 0 B [≈0 B]; admitting 116 MiB (rate 7.7 MiB/s) due to L0 growth (used 0 B)

Update the test to include non-zero values for some of this new data?


pkg/util/admission/testdata/tokens_linear_model_fitter line 12 at r1 (raw file):

update accounted-bytes=0 actual-bytes=100 work-count=2
----
int: 2.00x+48 smoothed: 1.88x+36 per-work-accounted: 1

Is there an error term or equivalent that we can print out here to understand how closely we're fitting to the most recent interval? Perhaps before the fitting and after to make sure we're trending in the right direction.

@tbg tbg (Member) left a comment

Flushing some comments, too.

Reviewed 4 of 17 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @sumeerbhola)


pkg/kv/kvserver/store.go line 3869 at r1 (raw file):

		err := ah.storeAdmissionQ.AdmittedWorkDone(ah.storeWorkHandle, doneInfo)
		if err != nil {
			// This shouldn't be happening, so log.

This looks like a hot path, so if we log here, it might tank the server assuming we hit an error repeatedly. Could you

a) make this an assertion if util.RaceEnabled is set, and
b) add a log.Every(10*time.Second) that is consulted before logging (defense in depth)?
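
A sketch of the suggested combination of (a) and (b); log.Every and util.RaceEnabled are named in the comment above, but the exact call shapes here are assumed from memory of the cockroachdb util packages, so treat this fragment as illustrative:

```go
// Package-level limiter so a repeated error logs at most every 10s.
var admittedWorkDoneErrEvery = log.Every(10 * time.Second)

// ... in the error path:
if err := ah.storeAdmissionQ.AdmittedWorkDone(ah.storeWorkHandle, doneInfo); err != nil {
	if util.RaceEnabled {
		// Hard assertion under race builds: this should be unreachable.
		log.Fatalf(ctx, "%v", err)
	}
	// Defense in depth: rate-limit logging on this hot path.
	if admittedWorkDoneErrEvery.ShouldLog() {
		log.Errorf(ctx, "AdmittedWorkDone failed: %v", err)
	}
}
```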


pkg/server/node.go line 1018 at r1 (raw file):

		n.admissionController.AdmittedKVWorkDone(handle, writeBytes)
		if writeBytes != nil {
			writeBytes.Release()

ditto what Irfan said about making the nil check part of Release.


pkg/util/admission/granter.go line 702 at r1 (raw file):

}

func (sg *kvStoreTokenGranter) storeWriteDone(

Could you add a method comment here or to the interface this implements (and reference that interface here)?


pkg/util/admission/granter.go line 707 at r1 (raw file):

	// We don't bother with the *Locked dance through the GrantCoordinator here
	// since the grant coordinator doesn't know when to call the tryGrant since
	// it does not know whether tokens have become available.

I don't understand this comment, want to link to a method that does do the dance so that the interested reader can piece together what you mean?


pkg/util/admission/granter.go line 709 at r1 (raw file):

	// it does not know whether tokens have become available.
	sg.coord.mu.Lock()
	exhaustedFunc := func() bool {

Does this allocate on each call (see scripts/goescape)? If so, could make this a method on sg.


pkg/util/admission/granter.go line 1838 at r1 (raw file):

	// storeWriteDone is called. Note that this is not the adjustment to use at
	// admission time -- that is handled by StoreWorkQueue and is not in scope
	// of this granter. This asymmetry is due to the need to use all the

It could be helpful to point out that there is a large explanatory comment on tokensLinearModel.


pkg/util/admission/store_token_estimation.go line 17 at r1 (raw file):

// The logic in this file deals with token estimation for a store write in two
// situations: (a) at admission time, (b) when the admitted work is done. At
// (a) we have no information provided about the work size (NB: this choice is

I feel like this comment could do a better job driving home what I think is the main point here, namely that we use a model to "guess" the tokens that a given write should request at admission time, "trained" on actual resource consumptions that earlier writes reported upon completion. As is, I suspect that newcomers to this code would read everything five times and still walk away confused about this basic idea.


pkg/util/admission/store_token_estimation.go line 148 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

[nit] This looks right to me but the tiniest bit of pseudo-code up top will make it easier to follow. I personally don't have many points of reference for how auto-adjusting linear models like this are typically built. I understand that we're setting the constant term to 1 to first learn the appropriate multiplier, and then using that multiplier to compute an additive adjustment for the constant; what I don't have as much of an intuition for is that this doesn't cause wild oscillations in the constant term. I can look at the tests, they're instructive, and probably play with this myself to build an intuition, but it would also be helpful to understand how you arrived at this scheme. I see that you mention it might not be a good scheme, so maybe much of what I'm saying is redundant and we've just found this to work well enough in our experiments.

I've also stared at this a bit and also think the comments here could be improved to explain the "how" and not just the "why". When we talk about fitting, usually I expect us to try to choose the parameters such that a score is minimized. So what is the score here? And why is that score a good target to minimize? And if there isn't a principled score, why do we fit the way we do?

It looks as though we're primarily saying: we definitely know that writes have an overhead, so b should be at least one. Let's assume it is exactly one (i.e. summing up ax+b over all tracked requests we get a*accounted+workCount=actual). That then determines the other parameter (the slope a). But then we turn around and say that we still have to determine b? If everything were arbitrary-precision arithmetic we would get the same b out of that computation. I assume we don't in practice! And I assume we don't care about integer math errors either and that the main reason we might need to adjust b is because actualBytes could be less than workCount. Motivating why this could happen in practice would be helpful.


pkg/util/admission/store_token_estimation.go line 188 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Could you explain the workCount * constant bit to me? A (naive) derivation:

y = ax + b
a = (y - b)/x
multiplier = (actualBytes - constant)/accountedBytes

Each individual request contributes a*<accounted_for_request> + b, where we say that b is one. Summing this up across all requests, you get the constraint actual_combined_writes = a * sum_of_request_accounted + num_of_requests * b which is what you want to make true.
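
(For a concrete instance of that constraint, see the fitter testdata quoted later in this thread: update accounted-bytes=100 actual-bytes=180 work-count=4 yields int: 1.76x+1, i.e. with b pinned at 1, a = (180 - 4*1)/100 = 1.76.)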


pkg/util/admission/store_token_estimation.go line 207 at r1 (raw file):

		constant:   constant,
	}
	// Smooth the multiplier and constant factors.

Is this the right order of things? I would've expected us to smooth the inputs first, and then do an exact fit on the smoothed values. Instead, we start with the choppy values, fit, and then smooth the parameters. This seems, mathematically speaking, more meaningless than the other way around. Any reason why one is better than the other here?

@irfansharif irfansharif (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @sumeerbhola, and @tbg)


pkg/util/admission/store_token_estimation.go line 188 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Each individual request contributes a*<accounted_for_request> + b, where we say that b is one. Summing this up across all requests, you get the constraint actual_combined_writes = a * sum_of_request_accounted + num_of_requests * b which is what you want to make true.

Doh, I was missing the fact that this was a summation across all requests. Thanks.

@sumeerbhola sumeerbhola force-pushed the work_estimate3 branch 2 times, most recently from 8b6f568 to a83da0c on August 8, 2022 21:58
@sumeerbhola sumeerbhola (Collaborator Author) left a comment

TFTRs!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @tbg)


-- commits line 13 at r1:

Previously, irfansharif (irfan sharif) wrote…

Add a PR reference here for future readers.

Done


-- commits line 27 at r1:

Previously, irfansharif (irfan sharif) wrote…

Mention what the y term represents here.

Done


-- commits line 35 at r1:

Previously, irfansharif (irfan sharif) wrote…

s/imagine being/imagine the a term being?

Done


pkg/kv/kvserver/mvcc_gc_queue.go line 473 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

BTW, you can fold this nil check into the implementation of .Release() to skip needing to do so at every caller. Then your callers can simply:

_, writeBytes, pErr := r.repl.SendWithWriteBytes(ctx, ba)
defer writeBytes.Release()

Done. Also fixed the {Stores,Store,Replica}.Send methods to call Release.


pkg/kv/kvserver/store.go line 3845 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

For my own understanding: what happens if we don't call AdmittedWorkDone in the face of errors? We'll have taken away tokens from the underlying work queue without having used it?

Correct. This was just an oversight I noticed when making the changes in this PR, so it is unrelated.


pkg/kv/kvserver/store.go line 3869 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This looks like a hot path, so if we log here, it might tank the server assuming we hit an error repeatedly. Could you

a) make this an assertion if util.RaceEnabled is set, and
b) add a log.Every(10*time.Second) that is consulted before logging (defense in depth)?

Done


pkg/server/node.go line 1018 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

ditto what Irfan said about making the nil check part of Release.

Done


pkg/util/admission/granter.go line 702 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Could you add a method comment here or to the interface this implements (and reference that interface here)?

I've added references to the interfaces in all the various interface methods here and in slotGranter and tokenGranter.


pkg/util/admission/granter.go line 707 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I don't understand this comment, want to link to a method that does do the dance so that the interested reader can piece together what you mean?

I've added a longer comment spelling out how the granter/GrantCoordinator interaction usually happens and why we are making an exception in this case.


pkg/util/admission/granter.go line 708 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

[nit] Could you nest everything this mutex protects within some embedded struct?

The mutex is in the GrantCoordinator while these various granter implementations are in their own struct. I've added a small reminder comment here.


pkg/util/admission/granter.go line 709 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Does this allocate on each call (see scripts/goescape)? If so, could make this a method on sg.

I ran scripts/goescape (definitely more convenient than using go build -gcflags directly), and it does not escape.
Why would closures that are not returned from a function cause heap allocation due to what they capture?


pkg/util/admission/granter.go line 1824 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Did you mean to embed the granter interface here? Like you did for granterWith{StoreWriteDone,LockedCalls}.

No I didn't mean to embed granter. The reason granterWithStoreWriteDone embeds granter is because it is passed to StoreWorkQueue which needs a granter for its internal WorkQueue (the WorkQueue is a requester and the code is structured to have requester-granter pairs that interact with each other).

And granterWithLockedCalls embeds granter because of a historical artifact -- in practice the GrantCoordinator only calls the *Locked methods and does not need the granter methods. This embedding will get removed in #85722 for other reasons.


pkg/util/admission/granter.go line 1835 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Comment needs updating, it refers to setAdmittedDoneAdjustments which isn't what we have below.

Done


pkg/util/admission/granter.go line 1838 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

It could be helpful to point out that there is a large explanatory comment on tokensLinearModel.

Done


pkg/util/admission/granter.go line 2083 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

This is doing more than just copying aux. This particular method is a bit difficult to understand; a few things would make it easier:

  • comment on adjustTokensResult
  • comment on storePerWorkTokenEstimator

If I'm reading it all correctly, this is just for the pretty printing above near "IO overload: %s" (right?). Could we get rid of this copying by instead printing directly the contents of perWorkTokenEstimator? Maintaining the contents twice across two places makes it possible to forget to update one or the other, or not use the right one if not copied over correctly.

I've added comments in adjustTokensResult, perWorkTokensAux, adjustTokensAuxComputations about how these are encapsulating the numerical state, and in some cases auxiliary numerical state. I see your concern, but I think there is also the simplicity of having all the relevant logging state in one place. This should be easy to revisit later once the dust has settled after the queue of changes that will be made to ioLoadListener.


pkg/util/admission/granter.go line 2442 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

[nit] Add a comment here, the ones above and below are helpful for outsiders to quickly interpret each thing.

Done


pkg/util/admission/store_token_estimation.go line 17 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I feel like this comment could do a better job driving home what I think is the main point here, namely that we use a model to "guess" the tokens that a given write should request at admission time, "trained" on actual resource consumptions that earlier writes reported upon completion. As is, I suspect that newcomers to this code would read everything five times and still walk away confused about this basic idea.

I've added something to this effect in the second paragraph.


pkg/util/admission/store_token_estimation.go line 36 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

We should also be able to do (a) for (1) like mentioned above.

Ack. I didn't change anything in the comment since the earlier "this choice is debatable ..." remark covered this.


pkg/util/admission/store_token_estimation.go line 37 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Even if we didn't, the linear model here should settle to some appropriate multiplier term, right?

Not necessarily. If you have 10 nodes in a cluster and 1/10th of the writes are local and the rest are unaccounted for, you would need the multiplier to be 10.a (if a was the multiplier that one should rightfully be fitting). I worry about allowing the multiplier to fluctuate without bound (there is a code comment below "to prevent wild fluctuations and to account for what we know about the system"). So we end up with a large b value, which leads to badness.


pkg/util/admission/store_token_estimation.go line 75 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

y == actual bytes making into just L0, right?

Yes -- I've added a comment below. In #85722 the same modeling approach is used to handle all incoming bytes into the LSM.


pkg/util/admission/store_token_estimation.go line 88 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

[~0, 1] ... [~0, 1.5]

Why the ~0? And not just 0 and -0.5 respectively?

I changed this to 0 for the first case and 0.001 for the second. Also added a TODO to consider lowering the lower bound of 0.001.


pkg/util/admission/store_token_estimation.go line 109 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

s/currently use/previously used?

Done


pkg/util/admission/store_token_estimation.go line 126 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

This appears in the constructor as well, and is a bit specific to the internals. If it's specifically for the ingested bytes model, perhaps call it just that: ingestedBytesModel bool. Looks like we were doing just that in the datadriven directive as well.

I added ForIngestedModel to the name. It is being used in makeStorePerWorkTokenEstimator() which passes true.


pkg/util/admission/store_token_estimation.go line 148 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I've also stared at this a bit and also think the comments here could be improved to explain the "how" and not just the "why". When we talk about fitting, usually I expect us to try to choose the parameters such that a score is minimized. So what is the score here? And why is that score a good target to minimize? And if there isn't a principled score, why do we fit the way we do?

It looks as though we're primarily saying: we definitely know that writes have an overhead, so b should be at least one. Let's assume it is exactly one (i.e. summing up ax+b over all tracked requests we get a*accounted+workCount=actual). That then determines the other parameter (the slope a). But then we turn around and say that we still have to determine b? If everything were arbitrary-precision arithmetic we would get the same b out of that computation. I assume we don't in practice! And I assume we don't care about integer math errors either and that the main reason we might need to adjust b is because actualBytes could be less than workCount. Motivating why this could happen in practice would be helpful.

There isn't a principled score. We are trying to exactly fit the model for the interval. My initial intuition was that the workload can fluctuate, so it would be good to fit the interval perfectly. But then there is of course the worry that we got unlucky with the different sources of information corresponding to accountedBytes and actualBytes, hence the smoothing. Even the choice of alpha=0.5 is arbitrary; it worked well enough. The other heuristics were more important: (a) updateWithZeroActualNonZeroAccounted, (b) halving the constant when we choose not to update, and (c) the "anomaly" case. These came about from running a series of experiments and observing the behavior.

I've added a comment at the top explaining the high-level approach and why we don't care about integer errors.

Also, I realized there is no code comment stating that we are trying to minimize b when fitting, i.e., we want the multiplier to do most/all of the "lifting". That probably added to the confusion, since there are an infinite number of models we can fit to this one equation. I've rectified that with a code comment.


pkg/util/admission/store_token_estimation.go line 207 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Is this the right order of things? I would've expected us to smooth the inputs first, and then do an exact fit on the smoothed values. Instead, we start with the choppy values, fit, and then smooth the parameters. This seems, mathematically speaking, more meaningless than the other way around. Any reason why one is better than the other here?

That's a good question. I honestly don't know which one is mathematically more meaningful -- I thought (without any mathematical justification) that it was important to fit what was observed in the last 15s interval, and then went from there. This worked ok. Note that there is some smoothing of the input for the one case when accountedBytes was <= 0 and actualBytes > 0.


pkg/util/admission/store_token_estimation.go line 306 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

By not seeking permission you just mean this work is never going to wait in admission queues, right?

Correct. This is done in #85127.


pkg/util/admission/work_queue.go line 1553 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Is it worth adding a TODO here to allow callers to optionally provide information at admission time that the admission package can entirely disregard? This is for cases where we do know more about the incoming request (say for ingests or snapshots) and our internal estimates may be off. I'm not sure how important this is (you've discussed this elsewhere).

Added

	// TODO(sumeer): in some cases, like AddSSTable requests, we do have size
	// information at proposal time, and may be able to use it fruitfully.

For snapshots we are not doing proposals, so that is not relevant here. For proposed writes, we could do better in some cases.


pkg/util/admission/testdata/format_adjust_tokens_stats.txt line 6 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Update the test to include non-zero values for some of this new data?

Good point. Added a TODO for a future PR

// TODO(sumeer): we now do more work outside adjustTokensInner, so the parts
// of the adjustTokensResult computed by adjustTokensInner has become a subset
// of what is logged below, and the rest is logged with 0 values. Expand this
// test to call adjustTokens.

pkg/util/admission/testdata/tokens_linear_model_fitter line 12 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Is there an error term or equivalent that we can print out here to understand how closely we're fitting to the most recent interval? Perhaps before the fitting and after to make sure we're trending in the right direction.

Not sure I understand. Like usual datadriven tests, whether the output is actually right is based on manual inspection. I've added some comments here.

@irfansharif irfansharif (Contributor) left a comment

Reviewed 5 of 12 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @sumeerbhola, and @tbg)


pkg/util/admission/store_token_estimation.go line 37 at r1 (raw file):

Previously, sumeerbhola wrote…

Not necessarily. If you have 10 nodes in a cluster and 1/10th of the writes are local and the rest are unaccounted for, you would need the multiplier to be 10.a (if a was the multiplier that one should rightfully be fitting). I worry about allowing the multiplier to fluctuate without bound (there is a code comment below "to prevent wild fluctuations and to account for what we know about the system"). So we end up with a large b value, which leads to badness.

Hm, forgot about the bounds on a, which do seem reasonable to have. I don't know if a better linear fitter could avoid constraining a (and get a more well-behaved b), but this seems fine as a first step.


pkg/util/admission/testdata/tokens_linear_model_fitter line 12 at r1 (raw file):

Previously, sumeerbhola wrote…

Not sure I understand. Like usual datadriven tests, whether the output is actually right is based on manual inspection. I've added some comments here.

I should've used an example. What I meant is printing out what the last smoothed model would predict from the accounted bytes vs. the real amount, and what the new smoothed model would get. So something like:

```
update accounted-bytes=10 actual-bytes=10 work-count=1
----
int: 0.00x+0 smoothed: 1.88x+9 per-work-accounted: 1 error-reduction=N/A

update accounted-bytes=100 actual-bytes=180 work-count=4
----
int: 1.76x+1 smoothed: 1.82x+5 per-work-accounted: 13 error-reduction=5.55%
```

Math to get to error-reduction=5.55%:

  • Estimation before update: 1.88(100)+(1)9 = 197; error = (197-180)/180 = 9.44%
  • Estimation after update: 1.82(100)+(1)5 = 187; error = (187-180)/180 = 3.88%
  • Error reduction: 9.44% - 3.88%

By looking at error-reduction=5.55% I can easily tell that my model is trending in the right direction and by how much. If it was -ve, that'd be a bug. If later the error was computed over the last N intervals after making the fitter more complicated, the semantics would still hold.
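
A small sketch of that metric, following the arithmetic above (names hypothetical):

```go
package admission

// predict applies a fitted y = a.x + b model to accounted bytes,
// matching the worked example: predict(1.88, 9, 100) = 197.
func predict(multiplier, constant, accountedBytes float64) float64 {
	return multiplier*accountedBytes + constant
}

// relativeError is (predicted-actual)/actual, e.g.
// (197-180)/180 ≈ 9.44%. The error reduction for an update is then
// relativeError(old prediction) - relativeError(new prediction);
// a negative reduction would indicate the fit got worse.
func relativeError(predicted, actual float64) float64 {
	return (predicted - actual) / actual
}
```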

@sumeerbhola sumeerbhola (Collaborator Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @tbg)


pkg/util/admission/store_token_estimation.go line 37 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Hm, forgot about the bounds on a, which do seem reasonable to have. I don't know if a better linear fitter could avoid constraining a (and get a more well-behaved b), but this seems fine as a first step.

Ack


pkg/util/admission/testdata/tokens_linear_model_fitter line 12 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

I should've used an example. What I meant is printing out what the last smoothed model would predict from the accounted bytes vs. the real amount, and what the new smoothed model would get. So something like:

```
update accounted-bytes=10 actual-bytes=10 work-count=1
----
int: 0.00x+0 smoothed: 1.88x+9 per-work-accounted: 1 error-reduction=N/A

update accounted-bytes=100 actual-bytes=180 work-count=4
----
int: 1.76x+1 smoothed: 1.82x+5 per-work-accounted: 13 error-reduction=5.55%
```

Math to get to error-reduction=5.55%:

  • Estimation before update: 1.88(100)+(1)9 = 197; error = (197-180)/180 = 9.44%
  • Estimation after update: 1.82(100)+(1)5 = 187; error = (187-180)/180 = 3.88%
  • Error reduction: 9.44% - 3.88%

By looking at error-reduction=5.55% I can easily tell that my model is trending in the right direction and by how much. If it was -ve, that'd be a bug. If later the error was computed over the last N intervals after making the fitter more complicated, the semantics would still hold.

Got it.
I have added a TODO

# TODO(sumeer): it would be easier to confirm that the new model was a better
# fit for the latest stats if we printed the error for the existing model and
# the new model (see the comment at
# https://github.com/cockroachdb/cockroach/pull/85059#pullrequestreview-1065876690).
# One small problem with that is that tokensLinearModelFitter has heuristics
# that it uses to ignore accounted-bytes when 0. Such problems are
# surmountable.

@sumeerbhola sumeerbhola (Collaborator Author) left a comment

I plan to merge once CI is green, so that the succeeding PRs can be unblocked.
TFTRs!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @tbg)

@sumeerbhola (Collaborator Author) commented:

bors r=irfansharif,tbg


craig bot commented Aug 9, 2022

Build succeeded.

@craig craig bot merged commit 9c840c6 into cockroachdb:master Aug 9, 2022
@tbg tbg (Member) left a comment

Reviewed 6 of 17 files at r1, 11 of 12 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained
