admission,kvserver: improved byte token estimation for writes #85059
Conversation
Force-pushed from feb6fef to 5d1fa1e
Here is a kv0 workload with 1KB writes, and then an index creation started for the v column. The first log statement is from after the index creation has been running for 1+ min, while all the ingested sstables have been going to levels below L0 (note
For this previous experiment, over time the index backfill starts having more ingests into L0, so we start seeing log statements like:
Flushing some comments.
Reviewed 7 of 17 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)
pkg/kv/kvserver/mvcc_gc_queue.go
line 473 at r1 (raw file):
```go
r.admissionController.AdmittedKVWorkDone(admissionHandle, writeBytes)
if writeBytes != nil {
	writeBytes.Release()
}
```
BTW, you can fold this nil check into the implementation of .Release() to skip needing to do so at every caller. Then your callers can simply:
```go
_, writeBytes, pErr := r.repl.SendWithWriteBytes(ctx, ba)
defer writeBytes.Release()
```
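A minimal sketch of folding the nil check in, assuming `writeBytes` is a `*StoreWriteBytes` with a pool-backed lifetime (the pool detail is illustrative, not the actual implementation):

```go
// Release is a no-op on a nil receiver, so callers can defer it
// unconditionally without their own nil check.
func (wb *StoreWriteBytes) Release() {
	if wb == nil {
		return
	}
	// ... return wb to its sync.Pool ...
}
```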
pkg/kv/kvserver/store.go
line 3845 at r1 (raw file):
```go
if ah.storeAdmissionQ != nil {
	// No bytes were written.
	_ = ah.storeAdmissionQ.AdmittedWorkDone(ah.storeWorkHandle, admission.StoreWorkDoneInfo{})
}
```
For my own understanding: what happens if we don't call AdmittedWorkDone in the face of errors? We'll have taken away tokens from the underlying work queue without having used them?
pkg/util/admission/granter.go
line 708 at r1 (raw file):
```go
// since the grant coordinator doesn't know when to call the tryGrant since
// it does not know whether tokens have become available.
sg.coord.mu.Lock()
```
[nit] Could you nest everything this mutex protects within some embedded struct?
pkg/util/admission/granter.go
line 1824 at r1 (raw file):
```go
// The interface is used by the entity that periodically looks at load and
// computes the tokens to grant (ioLoadListener).
type granterWithIOTokens interface {
```
Did you mean to embed the `granter` interface here? Like you did for `granterWith{StoreWriteDone,LockedCalls}`.
pkg/util/admission/granter.go
line 1835 at r1 (raw file):
```go
// possible for tokens to be returned.
setAvailableIOTokensLocked(tokens int64) (tokensUsed int64)
// setAdmittedDoneAdjustments supplies the adjustments to use when
```
Comment needs updating; it refers to `setAdmittedDoneAdjustments`, which isn't what we have below.
pkg/util/admission/granter.go
line 2083 at r1 (raw file):
```go
}

func (io *ioLoadListener) copyAuxFromPerWorkEstimator() {
```
This is doing more than just copying aux. This particular method is a bit difficult to understand; a few things would make it easier:
- comment on adjustTokensResult
- comment on storePerWorkTokenEstimator
If I'm reading it all correctly, this is just for the pretty printing above near "IO overload: %s" (right?). Could we get rid of this copying by instead printing directly the contents of perWorkTokenEstimator? Maintaining the contents twice across two places makes it possible to forget to update one or the other, or not use the right one if not copied over correctly.
pkg/util/admission/granter.go
line 2442 at r1 (raw file):
```go
ib(res.aux.perWorkTokensAux.intIngestedLinearModel.constant),
	res.ingestedLM.multiplier, ib(res.ingestedLM.constant))
p.Printf("at-admission-tokens %s, ", ib(res.requestEstimates.writeTokens))
```
[nit] Add a comment here, the ones above and below are helpful for outsiders to quickly interpret each thing.
This is really great work!
Reviewed 10 of 17 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)
-- commits
line 13 at r1:
Add a PR reference here for future readers.
-- commits
line 27 at r1:
Mention what the `y` term represents here.
-- commits
line 35 at r1:
s/imagine being/imagine the a term being?
pkg/util/admission/store_token_estimation.go
line 36 at r1 (raw file):
```go
// and ingests.
//
// Over time, we should change the code to perform step (b) for (1) and for
```
We should also be able to do (a) for (1) like mentioned above.
pkg/util/admission/store_token_estimation.go
line 37 at r1 (raw file):
```go
//
// Over time, we should change the code to perform step (b) for (1) and for
// the raft log for (3). This will require noticing under raft that a write is
```
Even if we didn't, the linear model here should settle to some appropriate multiplier term, right?
pkg/util/admission/store_token_estimation.go
line 75 at r1 (raw file):
```go
//
// Based on these observations we adopt a linear model for estimating the
// actual bytes (y), given the accounted bytes (x): y = a.x + b. The
```
y == actual bytes making it into just L0, right?
pkg/util/admission/store_token_estimation.go
line 88 at r1 (raw file):
[~0, 1] ... [~0, 1.5]
Why the ~0? And not just 0 and -0.5 respectively?
pkg/util/admission/store_token_estimation.go
line 109 at r1 (raw file):
```go
// linear regression, under the assumption that the workload is stable.
// However, the simple approach here should be an improvement on the additive
// approach we currently use.
```
s/currently use/previously used?
pkg/util/admission/store_token_estimation.go
line 126 at r1 (raw file):
```go
// L0, the actual bytes will be zero and the accounted bytes non-zero. We
// need to update the model in this case.
updateWithZeroActualNonZeroAccounted bool
```
This appears in the constructor as well, and is a bit specific to the internals. If it's specifically for the ingested bytes model, perhaps call it just that: `ingestedBytesModel bool`. Looks like we were doing just that in the datadriven directive as well.
pkg/util/admission/store_token_estimation.go
line 148 at r1 (raw file):
```go
// bytes claimed by these work items (accountedBytes), and the actual bytes
// observed in the LSM for that interval (actualBytes).
func (f *tokensLinearModelFitter) updateModelUsingIntervalStats(
```
[nit] This looks right to me but the tiniest bit of pseudo-code up top will make it easier to follow. I personally don't have many points of reference for how auto-adjusting linear models like this are typically built. I understand that we're setting the constant term to 1 to first learn the appropriate multiplier, and then using that multiplier to compute an additive adjustment for the constant; what I don't have as much of an intuition for is that this doesn't cause wild oscillations in the constant term. I can look at the tests, they're instructive, and probably play with this myself to build an intuition, but it would also be helpful to understand how you arrived at this scheme. I see that you mention it might not be a good scheme, so maybe much of what I'm saying is redundant and we've just found this to work well-enough in our experiments.
pkg/util/admission/store_token_estimation.go
line 188 at r1 (raw file):
```go
// Put a lower bound of 1 on constant. So workCount tokens go into that.
constant := int64(1)
multiplier := float64(max(0, actualBytes-workCount*constant)) / float64(accountedBytes)
```
Could you explain the `workCount * constant` bit to me? A (naive) derivation:
```
y = ax + b
a = (y - b)/x
multiplier = (actualBytes - constant)/accountedBytes
```
pkg/util/admission/store_token_estimation.go
line 306 at r1 (raw file):
```go
// TODO(sumeer):
// - Make followers tell admission control about the size of their write and
// sstable bytes, without asking for permission. We will include these in
```
By not seeking permission you just mean this work is never going to wait in admission queues, right?
pkg/util/admission/work_queue.go
line 1553 at r1 (raw file):
```go
// time. The token subtraction at admission time is completely based on past
// estimates. This estimation is improved at work completion time via size
// information provided in StoreWorkDoneInfo.
```
Is it worth adding a TODO here to allow callers to optionally provide information at admission time that the admission package can entirely disregard? This is for cases where we do know more about the incoming request (say for ingests or snapshots) and our internal estimates may be off. I'm not sure how important this is (you've discussed this elsewhere).
pkg/util/admission/testdata/format_adjust_tokens_stats.txt
line 6 at r1 (raw file):
```
compaction score 0.000 (0 ssts, 0 sub-levels), L0 growth 0 B (write 0 B ingest 0 B): requests 0 with 0 B acc-write + 0 B acc-ingest + write-model 0.00x+0 B (smoothed 0.00x+0 B) + ingested-model 0.00x+0 B (smoothed 0.00x+0 B) + at-admission-tokens 0 B, compacted 0 B [≈0 B], flushed 0 B [≈0 B]; admitting all
real-numbers: compaction score 2.700[L0-overload] (195 ssts, 27 sub-levels), L0 growth 577 MiB (write 0 B ingest 0 B): requests 0 with 0 B acc-write + 0 B acc-ingest + write-model 0.00x+0 B (smoothed 0.00x+0 B) + ingested-model 0.00x+0 B (smoothed 0.00x+0 B) + at-admission-tokens 0 B, compacted 77 MiB [≈62 MiB], flushed 0 B [≈0 B]; admitting 116 MiB (rate 7.7 MiB/s) due to L0 growth (used 0 B)
```
Update the test to include non-zero values for some of this new data?
pkg/util/admission/testdata/tokens_linear_model_fitter
line 12 at r1 (raw file):
```
update accounted-bytes=0 actual-bytes=100 work-count=2
----
int: 2.00x+48 smoothed: 1.88x+36 per-work-accounted: 1
```
Is there an error term or equivalent that we can print out here to understand how closely we're fitting to the most recent interval? Perhaps before the fitting and after to make sure we're trending in the right direction.
Flushing some comments, too.
Reviewed 4 of 17 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @sumeerbhola)
pkg/kv/kvserver/store.go
line 3869 at r1 (raw file):
```go
err := ah.storeAdmissionQ.AdmittedWorkDone(ah.storeWorkHandle, doneInfo)
if err != nil {
	// This shouldn't be happening, so log.
```
This looks like a hot path, so if we log here, it might tank the server assuming we hit an error repeatedly. Could you (a) make this an assertion if `util.RaceEnabled` is set, and (b) add a `log.Every(10*time.Second)` that is consulted before logging (defense in depth)?
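A rough sketch of that suggestion, assuming the `util.RaceEnabled` constant and the `log.Every`/`ShouldLog` rate limiter from `pkg/util/log`; the surrounding error handling is illustrative, not the PR's actual code:

```go
// Package-level limiter so repeated failures log at most once per 10s.
var admittedWorkDoneLogLimiter = log.Every(10 * time.Second)

if err := ah.storeAdmissionQ.AdmittedWorkDone(ah.storeWorkHandle, doneInfo); err != nil {
	if util.RaceEnabled {
		panic(err) // assert under race builds, where crashing is acceptable
	}
	if admittedWorkDoneLogLimiter.ShouldLog() {
		log.Errorf(ctx, "AdmittedWorkDone failed: %v", err)
	}
}
```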
pkg/server/node.go
line 1018 at r1 (raw file):
```go
n.admissionController.AdmittedKVWorkDone(handle, writeBytes)
if writeBytes != nil {
	writeBytes.Release()
}
```
Ditto what Irfan said about making the nil check part of `Release`.
pkg/util/admission/granter.go
line 702 at r1 (raw file):
```go
}

func (sg *kvStoreTokenGranter) storeWriteDone(
```
Could you add a method comment here or to the interface this implements (and reference that interface here)?
pkg/util/admission/granter.go
line 707 at r1 (raw file):
```go
// We don't bother with the *Locked dance through the GrantCoordinator here
// since the grant coordinator doesn't know when to call the tryGrant since
// it does not know whether tokens have become available.
```
I don't understand this comment, want to link to a method that does do the dance so that the interested reader can piece together what you mean?
pkg/util/admission/granter.go
line 709 at r1 (raw file):
```go
// it does not know whether tokens have become available.
sg.coord.mu.Lock()
exhaustedFunc := func() bool {
```
Does this allocate on each call (see `scripts/goescape`)? If so, could make this a method on `sg`.
pkg/util/admission/granter.go
line 1838 at r1 (raw file):
```go
// storeWriteDone is called. Note that this is not the adjustment to use at
// admission time -- that is handled by StoreWorkQueue and is not in scope
// of this granter. This asymmetry is due to the need to use all the
```
It could be helpful to point out that there is a large explanatory comment on `tokensLinearModel`.
pkg/util/admission/store_token_estimation.go
line 17 at r1 (raw file):
```go
// The logic in this file deals with token estimation for a store write in two
// situations: (a) at admission time, (b) when the admitted work is done. At
// (a) we have no information provided about the work size (NB: this choice is
```
I feel like this comment could do a better job driving home what I think is the main point here, namely that we use a model to "guess" the tokens that a given write should request at admission time, "trained" on actual resource consumptions that earlier writes reported upon completion. As is, I suspect that newcomers to this code would read everything five times and still walk away confused about this basic idea.
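To make that scheme concrete, here is a conceptual sketch with hypothetical names (not the package's actual API) of the guess-then-true-up flow:

```go
// linearModel maps accounted bytes x to estimated actual bytes y = a*x + b.
type linearModel struct {
	multiplier float64 // a
	constant   int64   // b
}

type estimator struct {
	model             linearModel // fitted on what completed work reported
	atAdmissionTokens int64       // smoothed per-request estimate
}

// At admission time no size information is available, so charge the
// historical per-request estimate.
func (e *estimator) admit() int64 { return e.atAdmissionTokens }

// At completion the work reports its accounted bytes (StoreWorkDoneInfo);
// the model converts those to estimated actual bytes, and the difference
// from what was charged at admission is adjusted against the granter.
func (e *estimator) workDone(accountedBytes int64) (adjustment int64) {
	actual := int64(e.model.multiplier*float64(accountedBytes)) + e.model.constant
	return actual - e.atAdmissionTokens
}
```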
pkg/util/admission/store_token_estimation.go
line 148 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
[nit] This looks right to me but the tiniest bit of pseudo-code up top will make it easier to follow. I personally don't have many points of reference for how auto-adjusting linear models like this are typically built. I understand that we're setting the constant term to 1 to first learn the appropriate multiplier, and then using that multiplier to compute an additive adjustment for the constant; what I don't have as much of an intuition for is that this doesn't cause wild oscillations in the constant term. I can look at the tests, they're instructive, and probably play with this myself to build an intuition, but it would also be helpful to understand how you arrived at this scheme. I see that you mention it might not be a good scheme, so maybe much of what I'm saying is redundant and we've just found this to work well-enough in our experiments.
I've also stared at this a bit and also think the comments here could be improved to explain the "how" and not just the why. When we talk about fitting, usually I expect us to try to choose the parameters such that a score is minimized. So what is the score here? And why is that score a good target to minimize? And if there isn't a principled score, why do we fit the way we do?
It looks as though we're primarily saying: we definitely know that writes have an overhead, so `b` should be at least one. Let's assume it is exactly one (i.e. summing up `ax+b` over all tracked requests we get `a*accounted+workCount=actual`). That then determines the other parameter (the slope `a`). But then we turn around and say that we still have to determine `b`? If everything were arbitrary-precision arithmetic we would get the same `b` out of that computation. I assume we don't in practice! And I assume we don't care about integer math errors either, and that the main reason we might need to adjust `b` is because `actualBytes` could be less than `workCount`. Motivating why this could happen in practice would be helpful.
pkg/util/admission/store_token_estimation.go
line 188 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Could you explain the `workCount * constant` bit to me? A (naive) derivation:
```
y = ax + b
a = (y - b)/x
multiplier = (actualBytes - constant)/accountedBytes
```
Each individual request contributes `a*<accounted_for_request> + b`, where we say that `b` is one. Summing this up across all requests, you get the constraint `actual_combined_writes = a * sum_of_request_accounted + num_of_requests * b`, which is what you want to make true.
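A worked example of that constraint with the constant pinned at 1 (the function name here is hypothetical; the real logic lives in `updateModelUsingIntervalStats`):

```go
// fitInterval pins the constant at 1 byte per work item and solves the
// summed constraint actual = a*accounted + workCount*b for the slope a.
func fitInterval(accountedBytes, actualBytes, workCount int64) (multiplier float64, constant int64) {
	constant = 1
	rest := actualBytes - workCount*constant
	if rest < 0 {
		rest = 0 // actualBytes can fall below workCount; keep a non-negative
	}
	multiplier = float64(rest) / float64(accountedBytes)
	return multiplier, constant
}
```

For example, 10 requests that accounted for 1000 bytes while 2010 bytes actually landed in L0 give a = (2010 - 10*1)/1000 = 2.0 and b = 1, matching the roughly 2x expected when x is the raft log write and y additionally includes state machine application.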
pkg/util/admission/store_token_estimation.go
line 207 at r1 (raw file):
```go
	constant: constant,
}
// Smooth the multiplier and constant factors.
```
Is this the right order of things? I would've expected us to smooth the inputs first, and then do an exact fit on the smoothed values. Instead, we start with the choppy values, fit, and then smooth the parameters. This seems, mathematically speaking, more meaningless than the other way around. Any reason why one is better than the other here?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @sumeerbhola, and @tbg)
pkg/util/admission/store_token_estimation.go
line 188 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Each individual request contributes `a*<accounted_for_request> + b`, where we say that `b` is one. Summing this up across all requests, you get the constraint `actual_combined_writes = a * sum_of_request_accounted + num_of_requests * b`, which is what you want to make true.
Doh, I was missing the fact that this was a summation across all requests. Thanks.
Force-pushed from 8b6f568 to a83da0c
TFTRs!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @tbg)
Previously, irfansharif (irfan sharif) wrote…
Add a PR reference here for future readers.
Done
Previously, irfansharif (irfan sharif) wrote…
Mention what the `y` term represents here.
Done
Previously, irfansharif (irfan sharif) wrote…
s/imagine being/imagine the a term being?
Done
pkg/kv/kvserver/mvcc_gc_queue.go
line 473 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
BTW, you can fold this nil check into the implementation of .Release() to skip needing to do so at every caller. Then your callers can simply:
```go
_, writeBytes, pErr := r.repl.SendWithWriteBytes(ctx, ba)
defer writeBytes.Release()
```
Done. Also fixed the `{Stores,Store,Replica}.Send` methods to call Release.
pkg/kv/kvserver/store.go
line 3845 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
For my own understanding: what happens if we don't call AdmittedWorkDone in the face of errors? We'll have taken away tokens from the underlying work queue without having used it?
Correct. This was just an oversight I noticed when making the changes in this PR, so it is unrelated.
pkg/kv/kvserver/store.go
line 3869 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This looks like a hot path, so if we log here, it might tank the server assuming we hit an error repeatedly. Could you (a) make this an assertion if `util.RaceEnabled` is set, and (b) add a `log.Every(10*time.Second)` that is consulted before logging (defense in depth)?
Done
pkg/server/node.go
line 1018 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Ditto what Irfan said about making the nil check part of `Release`.
Done
pkg/util/admission/granter.go
line 702 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Could you add a method comment here or to the interface this implements (and reference that interface here)?
I've added references to the interfaces in all the various interface methods here and in `slotGranter` and `tokenGranter`.
pkg/util/admission/granter.go
line 707 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I don't understand this comment, want to link to a method that does do the dance so that the interested reader can piece together what you mean?
I've added a longer comment spelling out how the granter, GrantCoordinator interaction usually happens and why we are making an exception in this case.
pkg/util/admission/granter.go
line 708 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
[nit] Could you nest everything this mutex protects within some embedded struct?
The mutex is in the GrantCoordinator while these various granter implementations are in their own struct. I've added a small reminder comment here.
pkg/util/admission/granter.go
line 709 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Does this allocate on each call (see `scripts/goescape`)? If so, could make this a method on `sg`.
I ran `scripts/goescape` (definitely more convenient than using `go build -gcflags` directly), and it does not escape.
Why would closures that are not returned from a function cause heap allocation due to what they capture?
pkg/util/admission/granter.go
line 1824 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Did you mean to embed the `granter` interface here? Like you did for `granterWith{StoreWriteDone,LockedCalls}`.
No, I didn't mean to embed `granter`. The reason `granterWithStoreWriteDone` embeds `granter` is because it is passed to `StoreWorkQueue`, which needs a `granter` for its internal `WorkQueue` (the `WorkQueue` is a `requester`, and the code is structured to have requester-granter pairs that interact with each other).
And `granterWithLockedCalls` embeds `granter` because of a historical artifact -- in practice the `GrantCoordinator` only calls the `*Locked` methods and does not need the `granter` methods. This embedding will get removed in #85722 for other reasons.
pkg/util/admission/granter.go
line 1835 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Comment needs updating; it refers to `setAdmittedDoneAdjustments`, which isn't what we have below.
Done
pkg/util/admission/granter.go
line 1838 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
It could be helpful to point out that there is a large explanatory comment on `tokensLinearModel`.
Done
pkg/util/admission/granter.go
line 2083 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
This is doing more than just copying aux. This particular method is a bit difficult to understand; a few things would make it easier:
- comment on adjustTokensResult
- comment on storePerWorkTokenEstimator
If I'm reading it all correctly, this is just for the pretty printing above near "IO overload: %s" (right?). Could we get rid of this copying by instead printing directly the contents of perWorkTokenEstimator? Maintaining the contents twice across two places makes it possible to forget to update one or the other, or not use the right one if not copied over correctly.
I've added comments in `adjustTokensResult`, `perWorkTokensAux`, and `adjustTokensAuxComputations` about how these encapsulate the numerical state, and in some cases auxiliary numerical state. I see your concern, but I think there is also the simplicity of having all the relevant logging state in one place. This should be easy to revisit later once the dust has settled after the queue of changes that will be made to `ioLoadListener`.
pkg/util/admission/granter.go
line 2442 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
[nit] Add a comment here, the ones above and below are helpful for outsiders to quickly interpret each thing.
Done
pkg/util/admission/store_token_estimation.go
line 17 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I feel like this comment could do a better job driving home what I think is the main point here, namely that we use a model to "guess" the tokens that a given write should request at admission time, "trained" on actual resource consumptions that earlier writes reported upon completion. As is, I suspect that newcomers to this code would read everything five times and still walk away confused about this basic idea.
I've added something to this effect in the second paragraph.
pkg/util/admission/store_token_estimation.go
line 36 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
We should also be able to do (a) for (1) like mentioned above.
Ack. I didn't change anything in the comment since the "this choice is debatable ..." comment earlier covered this.
pkg/util/admission/store_token_estimation.go
line 37 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Even if we didn't, the linear model here should settle to some appropriate multiplier term, right?
Not necessarily. If you have 10 nodes in a cluster and 1/10th of the writes are local and the rest are unaccounted for, you would need the multiplier to be 10.a (if a was the multiplier that one should rightfully be fitting). I worry about allowing the multiplier to fluctuate without bound (there is a code comment below "to prevent wild fluctuations and to account for what we know about the system"). So we end up with a large b value, which leads to badness.
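To spell that out with the numbers above: if the true multiplier is a, total writes are W, and only W/10 of them are accounted for, an unconstrained fit would need multiplier = (a·W)/(W/10) = 10a for the equation to balance, which the [~0, 1.5]-style bounds on the multiplier deliberately disallow; the residual then spills into the constant b.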
pkg/util/admission/store_token_estimation.go
line 75 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
y == actual bytes making it into just L0, right?
Yes -- I've added a comment below. In #85722 the same modeling approach is used to handle all incoming bytes into the LSM.
pkg/util/admission/store_token_estimation.go
line 88 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
[~0, 1] ... [~0, 1.5]
Why the ~0? And not just 0 and -0.5 respectively?
I changed this to 0 for the first case and 0.001 for the second. Also added a TODO to consider lowering the lower bound of 0.001.
pkg/util/admission/store_token_estimation.go
line 109 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
s/currently use/previously used?
Done
pkg/util/admission/store_token_estimation.go
line 126 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
This appears in the constructor as well, and is a bit specific to the internals. If it's specifically for the ingested bytes model, perhaps call it just that: `ingestedBytesModel bool`. Looks like we were doing just that in the datadriven directive as well.
I added `ForIngestedModel` to the name. It is being used in `makeStorePerWorkTokenEstimator()`, which passes true.
pkg/util/admission/store_token_estimation.go
line 148 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I've also stared at this a bit and also think the comments here could be improved to explain the "how" and not just the why. When we talk about fitting, usually I expect us to try to choose the parameters such that a score is minimized. So what is the score here? And why is that score a good target to minimize? And if there isn't a principled score, why do we fit the way we do?
It looks as though we're primarily saying: we definitely know that writes have an overhead, so `b` should be at least one. Let's assume it is exactly one (i.e. summing up `ax+b` over all tracked requests we get `a*accounted+workCount=actual`). That then determines the other parameter (the slope `a`). But then we turn around and say that we still have to determine `b`? If everything were arbitrary-precision arithmetic we would get the same `b` out of that computation. I assume we don't in practice! And I assume we don't care about integer math errors either, and that the main reason we might need to adjust `b` is because `actualBytes` could be less than `workCount`. Motivating why this could happen in practice would be helpful.
There isn't a principled score. We are trying to exactly fit the model for the interval. My initial intuition was that the workload can fluctuate, so it would be good to fit the interval perfectly. But then there is of course the worry that we got unlucky with the different sources of information corresponding to accountedBytes and actualBytes, hence the smoothing. Even the choice of alpha=0.5 is arbitrary. It worked well enough. The other heuristics were more important, like (a) `updateWithZeroActualNonZeroAccounted`, (b) halving the constant when we choose not to update, and (c) the "anomaly" case; these came about based on running a series of experiments and observing the behavior.
I've added a comment at the top explaining the high-level approach and why we don't care about integer errors.
Also, I realized there is no code comment stating that we are trying to minimize b when fitting, i.e., we want the multiplier to do most/all the "lifting". That probably added to the confusion, since there are an infinite number of models we can fit to this one equation. I've rectified that with a code comment.
pkg/util/admission/store_token_estimation.go
line 207 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Is this the right order of things? I would've expected us to smooth the inputs first, and then do an exact fit on the smoothed values. Instead, we start with the choppy values, fit, and then smooth the parameters. This seems, mathematically speaking, more meaningless than the other way around. Any reason why one is better than the other here?
That's a good question. I honestly don't know which one is mathematically more meaningful -- I thought (without any mathematical justification) that it was important to fit what was observed in the last 15s interval, and then went from there. This worked ok. Note that there is some smoothing of the input for the one case when accountedBytes was <= 0 and actualBytes > 0.
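For reference, a minimal sketch of the parameter smoothing under discussion, assuming simple exponential smoothing with the alpha=0.5 mentioned earlier (names are illustrative, not the package's actual identifiers):

```go
const alpha = 0.5 // noted above as arbitrary, but it worked well enough

// smooth blends the freshly fitted interval value into the running value.
func smooth(prev, cur float64) float64 {
	return alpha*cur + (1-alpha)*prev
}

// Applied to the fitted parameters after each 15s interval:
//   smoothedMultiplier = smooth(smoothedMultiplier, intervalMultiplier)
//   smoothedConstant   = smooth(smoothedConstant, intervalConstant)
```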
pkg/util/admission/store_token_estimation.go
line 306 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
By not seeking permission you just mean this work is never going to wait in admission queues, right?
Correct. This is done in #85127.
pkg/util/admission/work_queue.go
line 1553 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Is it worth adding a TODO here to allow callers to optionally provide information at admission time that the admission package can entirely disregard? This is for cases where we do know more about the incoming request (say for ingests or snapshots) and our internal estimates may be off. I'm not sure how important this is (you've discussed this elsewhere).
Added
```go
// TODO(sumeer): in some cases, like AddSSTable requests, we do have size
// information at proposal time, and may be able to use it fruitfully.
```
For snapshots we are not doing proposals, so that is not relevant here. For proposed writes, we could do better in some cases.
pkg/util/admission/testdata/format_adjust_tokens_stats.txt
line 6 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Update the test to include non-zero values for some of this new data?
Good point. Added a TODO for a future PR:
```go
// TODO(sumeer): we now do more work outside adjustTokensInner, so the parts
// of the adjustTokensResult computed by adjustTokensInner has become a subset
// of what is logged below, and the rest is logged with 0 values. Expand this
// test to call adjustTokens.
```
pkg/util/admission/testdata/tokens_linear_model_fitter
line 12 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Is there an error term or equivalent that we can print out here to understand how closely we're fitting to the most recent interval? Perhaps before the fitting and after to make sure we're trending in the right direction.
Not sure I understand. Like usual datadriven tests, whether the output is actually right is based on manual inspection. I've added some comments here.
Reviewed 5 of 12 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @sumeerbhola, and @tbg)
pkg/util/admission/store_token_estimation.go
line 37 at r1 (raw file):
Previously, sumeerbhola wrote…
Not necessarily. If you have 10 nodes in a cluster and 1/10th of the writes are local and the rest are unaccounted for, you would need the multiplier to be 10.a (if a was the multiplier that one should rightfully be fitting). I worry about allowing the multiplier to fluctuate without bound (there is a code comment below "to prevent wild fluctuations and to account for what we know about the system"). So we end up with a large b value, which leads to badness.
Hm, forgot about the bounds on a, which do seem reasonable to have. I don't know if a better linear fitter could avoid constraining a (and get a more well-behaved b), but this seems fine as a first step.
pkg/util/admission/testdata/tokens_linear_model_fitter
line 12 at r1 (raw file):
Previously, sumeerbhola wrote…
Not sure I understand. Like usual datadriven tests, whether the output is actually right is based on manual inspection. I've added some comments here.
I should've used an example. What I meant is printing out what the last smoothed model would compute the account-bytes as vs. the real amount, and what the new smoothed model would get. So something like:
```
update accounted-bytes=10 actual-bytes=10 work-count=1
----
int: 0.00x+0 smoothed: 1.88x+9 per-work-accounted: 1 error-reduction=N/A

update accounted-bytes=100 actual-bytes=180 work-count=4
----
int: 1.76x+1 smoothed: 1.82x+5 per-work-accounted: 13 error-reduction=5.55%
```
Math to get to `error-reduction=5.55%`:
- Estimation before update: 1.88(100) + (1)9 = 197; error = (197-180)/180 = 9.44%
- Estimation after update: 1.82(100) + (1)5 = 187; error = (187-180)/180 = 3.88%
- Error reduction: 9.44% - 3.88%
By looking at `error-reduction=5.55%` I can easily tell that my model is trending in the right direction and by how much. If it was -ve, that'd be a bug. If later the error was computed over the last N intervals after making the fitter more complicated, the semantics would still hold.
Force-pushed from a83da0c to c7396ee
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @tbg)
pkg/util/admission/store_token_estimation.go
line 37 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Hm, forgot about the bounds on a, which do seem reasonable to have. I don't know if a better linear fitter could avoid constraining a (and get a more well-behaved b), but this seems fine as a first step.
Ack
pkg/util/admission/testdata/tokens_linear_model_fitter
line 12 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
I should've used an example. What I meant is printing out what the last smoothed model would compute the account-bytes as vs. the real amount, and what the new smoothed model would get. So something like:
```
update accounted-bytes=10 actual-bytes=10 work-count=1
----
int: 0.00x+0 smoothed: 1.88x+9 per-work-accounted: 1 error-reduction=N/A

update accounted-bytes=100 actual-bytes=180 work-count=4
----
int: 1.76x+1 smoothed: 1.82x+5 per-work-accounted: 13 error-reduction=5.55%
```
Math to get to `error-reduction=5.55%`:
- Estimation before update: 1.88(100) + (1)9 = 197; error = (197-180)/180 = 9.44%
- Estimation after update: 1.82(100) + (1)5 = 187; error = (187-180)/180 = 3.88%
- Error reduction: 9.44% - 3.88%
By looking at `error-reduction=5.55%` I can easily tell that my model is trending in the right direction and by how much. If it was -ve, that'd be a bug. If later the error was computed over the last N intervals after making the fitter more complicated, the semantics would still hold.
Got it.
Got it. I have added a TODO:
```
# TODO(sumeer): it would be easier to confirm that the new model was a better
# fit for the latest stats if we printed the error for the existing model and
# the new model (see the comment at
# https://github.com/cockroachdb/cockroach/pull/85059#pullrequestreview-1065876690).
# One small problem with that is that tokensLinearModelFitter has heuristics
# that it uses to ignore accounted-bytes when 0. Such problems are
# surmountable.
```
I plan to merge once CI is green, so that the succeeding PRs can be unblocked.
TFTRs!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @tbg)
bors r=irfansharif,tbg
Build succeeded.
Reviewed 6 of 17 files at r1, 11 of 12 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained
The existing scheme for byte token estimation simply looked
at the total bytes added to L0 and divided it among the number
of requests. This was because (a) the parameters to provide
better size information for the request were not populated by
kvserver, (b) the basic estimation approach was flawed since
it assumed that regular writes would be roughly equal sized,
and assumed that ingests would tell what fraction went into L0.
The kvserver-side plumbing for improving (a) was done in a
preceding PR (#83937). This one completes that plumbing to pass on
admission.StoreWorkDoneInfo to the admission control package.
In this scheme the {WriteBytes,IngestedBytes} are provided
post-proposal evaluation, and the IngestedBytes is for the
whole LSM. This PR makes changes to the plumbing in the
admission package: specifically, the post-work-done token
adjustments are performed via the granterWithStoreWriteDone
interface and the addition to granterWithIOTokens. The former
also returns the token adjustments to StoreWorkQueue so
that the per-tenant fairness accounting in WorkQueue can be
updated.
The main changes in this PR are in the byte token
estimation logic in the admission package, where the
estimation now uses a linear model y=a.x + b, where
x is the bytes provided in admission.StoreWorkDoneInfo,
and y is the bytes added to L0 via write or ingestion.
If we consider regular writes, one can expect that even
with many different sized workloads concurrently being
active on a node, we should be able to fit a model where
a is roughly 2 and b is tiny -- this is because x is the
bytes written to the raft log and does not include the
subsequent state machine application. Similarly, one can
expect the a term to be in the interval [0,1] for ingested
work. The linear model is meant to fix flaw (b) mentioned
earlier. The current linear model fitting in
store_token_estimation.go is very simple and can be
independently improved in the future -- there are code
comments outlining this. Additionally, all the byte token
estimation logic in granter.go has been removed, which
is better from a code readability perspective.
This change was evaluated with a single node that first
saw a kv0 workload that writes 64KB blocks, then
additionally a kv0 workload that writes 4KB blocks, and
finally a third workload that starts doing an index
backfill due to creating an index on the v column in
the kv table.
Here are snippets from a sequence of log statements when
only the first workload (64KB writes) was running:
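```
write-model 1.46x+1 B (smoothed 1.50x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 78 KiB
write-model 1.37x+1 B (smoothed 1.36x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 80 KiB
write-model 1.50x+1 B (smoothed 1.43x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 79 KiB
write-model 1.39x+1 B (smoothed 1.30x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 77 KiB
```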
Note that the parameter a in a.x does fluctuate. The
additive value b stays at the minimum of 1 byte, which
is desirable. There is no change to the starting ingest
model since there are no ingestions.
After both the 4KB and 64KB writes are active the log
statements look like:
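```
write-model 1.85x+1 B (smoothed 1.78x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 59 KiB
write-model 1.23x+1 B (smoothed 1.51x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 47 KiB
write-model 1.21x+1 B (smoothed 1.36x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 40 KiB
```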
Note that the b value stays at 1 byte. The tokens consumed
at admission time are evenly divided among requests, so
the value has dropped.
When the index backfill is also running, the sstables are
ingested into L5 and L6, so the x value in the ingested
model is high, but what is ingested into L0 is low, which
means a becomes very small for the ingested-model -- see
the smoothed 0.00x+1 B below. There is choppiness in this
experiment wrt the write model and the at-admission-tokens,
which is caused by a high number of write stalls. This
was not planned for, and is a side-effect of huge Pebble
manifests caused by 64KB keys. So ignore those values in
the following log statements.
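```
write-model 1.93x+1 B (smoothed 1.56x+2 B) + ingested-model 0.00x+1 B (smoothed 0.00x+1 B) + at-admission-tokens 120 KiB
write-model 2.34x+1 B (smoothed 1.95x+1 B) + ingested-model 0.00x+1 B (smoothed 0.00x+1 B) + at-admission-tokens 157 KiB
```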
Fixes #79092
Informs #82536
Release note: None