admission: account for write-amp in disk bandwidth limiter #129005
Conversation
Force-pushed b46e84e to 59b1ea3
Changes here are picked and cleaned up from #127838.
The experiment with these changes was run here.
Some initial comments on disk_bandwidth.go, concerning the inputs and outputs. Once we settle on that, I think the rest will just be plumbing adjustments.
Reviewed all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi)
pkg/util/admission/disk_bandwidth.go
line 125 at r1 (raw file):
}

type diskTokens struct {
This could use some code commentary that these are all in terms of bytes.
Presumably these are all for elastic work, since that is the only one for which we are initially going to impose disk bandwidth limits. But even regular work will presumably subtract from these (but won't wait for a positive value). Again, needs a comment.
The absence of a write amp float is a bit puzzling. The caller will need to use the model of LSM incoming bytes and multiply by the estimated write amp to subtract from these for the usual case, and not do that multiplier when subtracting for incoming snapshots. Oh, I see now that ioLoadListener is peering into io.diskBandwidthLimiter.state.estimatedWriteAmp. I think it's cleaner for it to be the output.
pkg/util/admission/disk_bandwidth.go
line 134 at r1 (raw file):
// computeElasticTokens is called every adjustmentInterval.
func (d *diskBandwidthLimiter) computeElasticTokens(
	id intervalDiskLoadInfo, il intervalUtilInfo,
intervalUtilInfo seems peculiar.
- I think it is only used for computing write-amp. So most of the fields here are not relevant.
- The tokens embedded in il.requestedTokens.writeBWTokens are not really bandwidth tokens. These are what we call IOTokens, representing what got ingested + flushed into the LSM. One question is whether we even need those IOTokens. We have the ingested and flushed bytes from the LSM, and that can be the denominator. When there are a lot of ingestions, such as snapshot ingestions that go into L6, this will cause the write amp to artificially drop, which is not ideal. We could explicitly subtract that from both the numerator and denominator. A sketch of this idea follows.
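A minimal sketch of that idea, with illustrative names (estimateWriteAmp and its parameters are assumptions, not the actual fields):

package main

import "fmt"

// estimateWriteAmp computes interval disk writes divided by flushed plus
// ingested LSM bytes, subtracting snapshot ingestions from both the
// numerator and the denominator so that L6 snapshot ingests do not
// artificially lower the estimate.
func estimateWriteAmp(intDiskWrites, intFlushed, intIngested, intSnapshotIngested int64) float64 {
	num := intDiskWrites - intSnapshotIngested
	denom := intFlushed + intIngested - intSnapshotIngested
	if denom <= 0 || num <= 0 {
		return 1 // no meaningful incoming bytes; assume no amplification
	}
	return float64(num) / float64(denom)
}

func main() {
	// 10 MB flushed + 5 MB ingested produced 60 MB of disk writes, of
	// which 5 MB were a range snapshot ingested directly into L6.
	fmt.Println(estimateWriteAmp(60<<20, 10<<20, 5<<20, 5<<20)) // 5.5
}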
pkg/util/admission/disk_bandwidth.go
line 138 at r1 (raw file):
// TODO(aaditya): Include calculation for read and IOPS.
// Issue: https://github.com/cockroachdb/cockroach/issues/107623
writeBWTokens := int64(float64(id.provisionedBandwidth)*id.elasticBandwidthMaxUtil) - id.readBandwidth
This needs a comment that we are using the read bandwidth over the last 15s interval as a predictor of the needs of the next 15s. Eventually we want to be faster in reacting, but this is ok-ish. We should probably smooth the readBandwidth and take the max of the smoothed value and the last interval, so that we err on the side of a higher value and reduced writeBWTokens. Something like the sketch below.
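A minimal sketch of that smoothing, assuming alpha = 0.5 and illustrative names (nextReadEstimate is not the actual function):

package main

import (
	"fmt"
	"math"
)

const alpha = 0.5 // smoothing factor; an assumed value

// nextReadEstimate exponentially smooths the observed read bandwidth and
// takes the max of the smoothed value and the last interval, so that we
// err on the side of a higher read estimate and fewer write tokens.
func nextReadEstimate(smoothed, lastInterval float64) float64 {
	smoothed = alpha*lastInterval + (1-alpha)*smoothed
	return math.Max(smoothed, lastInterval)
}

func main() {
	fmt.Println(nextReadEstimate(100, 40)) // reads dropped: smoothed value 70 wins
	fmt.Println(nextReadEstimate(40, 100)) // reads spiked: last interval 100 wins
}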
pkg/util/admission/disk_bandwidth.go
line 161 at r1 (raw file):
}
d.state.estimatedWriteAmp = smoothedWriteAmp
d.state.diskBWUtil = float64(usedTokens.writeBWTokens) / float64(writeBWTokens)
is diskBWUtil for a metric?
Force-pushed 3d4fd3b to bb503fc
Changed quite a bit of code since the last review. PTAL again. Tests are completely broken; I will fix them later.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)
pkg/util/admission/disk_bandwidth.go
line 125 at r1 (raw file):
Previously, sumeerbhola wrote…
This could use some code commentary that these are all in terms of bytes.
Presumably these are all for elastic work, since that is the only one for which we are initially going to impose disk bandwidth limits. But even regular work will presumably subtract from these (but won't wait for a positive value). Again, needs a comment.
The absence of a write amp float is a bit puzzling. The caller will need to use the model of LSM incoming bytes and multiply by the estimated write amp to subtract from these for the usual case, and not do that multiplier when subtracting for incoming snapshots. Oh, I see now that ioLoadListener is peering into io.diskBandwidthLimiter.state.estimatedWriteAmp. I think it's cleaner for it to be the output.
Thanks for the pointer here; I changed the code around quite a bit. Now we use a linear model for w-amp, just like the other IO token estimations (sketched below).
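For reference, a minimal sketch of such a linear model (the names linearModel and apply are illustrative; the package's actual model may differ):

package main

import "fmt"

// linearModel predicts actual disk writes y from accounted LSM bytes x
// as y = multiplier*x + constant, mirroring the per-work token
// estimations elsewhere in the admission package.
type linearModel struct {
	multiplier float64
	constant   int64
}

func (m linearModel) apply(bytes int64) int64 {
	return int64(float64(bytes)*m.multiplier) + m.constant
}

func main() {
	// A write-amp model of 10x + 1: 50 accounted LSM bytes are expected
	// to become 501 bytes of disk writes once compactions run.
	wamp := linearModel{multiplier: 10, constant: 1}
	fmt.Println(wamp.apply(50)) // 501
}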
pkg/util/admission/disk_bandwidth.go
line 134 at r1 (raw file):
Previously, sumeerbhola wrote…
intervalUtilInfo seems peculiar.
- I think it is only used for computing write-amp. So most of the fields here are not relevant.
- The tokens embedded in il.requestedTokens.writeBWTokens are not really bandwidth tokens. These are what we call IOTokens, representing what got ingested + flushed into the LSM. One question is whether we even need those IOTokens. We have the ingested and flushed bytes from the LSM, and that can be the denominator. When there are a lot of ingestions, such as snapshot ingestions that go into L6, this will cause the write amp to artificially drop, which is not ideal. We could explicitly subtract that from both the numerator and denominator.
- Removed intervalUtilInfo altogether.
- Fixed the whole w-amp calculation. PTAL at updateEstimates in storePerWorkTokenEstimator.
pkg/util/admission/disk_bandwidth.go
line 138 at r1 (raw file):
Previously, sumeerbhola wrote…
This needs a comment that we are using the read bandwidth over the last 15s interval as a predictor of the needs of the next 15s. Eventually we want to be faster in reacting, but this is ok-ish. We should probably smooth the readBandwidth and take the max of the smoothed value and the last interval, so that we err on the side of a higher value and reduced writeBWTokens.
Done.
pkg/util/admission/disk_bandwidth.go
line 161 at r1 (raw file):
Previously, sumeerbhola wrote…
is diskBWUtil for a metric?
Yes, it's used as a threshold for printing the log line, and it is also printed in the log line. It is used as a proxy for an "overloaded" disk.
Force-pushed bb503fc to 8489a9e
a bunch of small comments -- looks quite clean.
Reviewed 2 of 12 files at r1, 5 of 6 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi)
pkg/util/admission/disk_bandwidth.go
line 53 at r2 (raw file):
//
// Due to these challenges, we adopt a goal of simplicity of design, and
// strong abstraction boundaries.
The "Due to these challenges" and the two bullets following it are stale.
pkg/util/admission/disk_bandwidth.go
line 94 at r2 (raw file):
provisionedBandwidth int64
// intProvisionedDiskBandwidth represents the disk writes available in an
// adjustmentInterval.
we shouldn't need both bandwidth (bytes/s) and bytes. Given that we need to produce byte tokens, I see that we need the bytes. Can you just change the code so that this file can use the package-wide adjustmentInterval?
pkg/util/admission/disk_bandwidth.go
line 99 at r2 (raw file):
// elastic requests
// TODO(aaditya): hook this into the cluster setting
elasticBandwidthMaxUtil float64
where is this max util being set? I am guessing this is being passed in to every call to diskBandwidthLimiter, since the caller will sample the latest cluster setting value?
pkg/util/admission/disk_bandwidth.go
line 145 at r2 (raw file):
const alpha = 0.5
smoothedReadBandwidth := alpha*float64(id.readBandwidth) + alpha*float64(d.state.diskLoad.readBandwidth)
intReadBandwidth := int64(math.Max(smoothedReadBandwidth, float64(id.readBandwidth))) * adjustmentInterval
This is intReadBytes, yes? We should use the word bandwidth for fields that are in bytes/s.
pkg/util/admission/disk_bandwidth.go
line 149 at r2 (raw file):
// TODO(aaditya): consider setting a different floor to avoid startving out
// elastic writes completely due to out-sized reads from above.
writeBWTokens = int64(math.Max(0, float64(writeBWTokens)))
let's call these diskWriteTokens to avoid creating ambiguity about the units.
pkg/util/admission/disk_bandwidth.go
line 191 at r2 (raw file):
readBWTokens: l.readBWTokens + r.readBWTokens,
writeBWTokens: l.writeBWTokens + r.writeBWTokens,
readIOPSTokens: l.readBWTokens + r.readIOPSTokens,
typo: l.readIOPSTokens
pkg/util/admission/granter.go
line 405 at r2 (raw file):
if sg.coordMu.availableIOTokens[admissionpb.RegularWorkClass] > 0 {
	sg.subtractTokensLocked(count, count, false)
	sg.coordMu.diskTokensAvailable.writeBWTokens -= count
subtracting only the count isn't quite right, since it doesn't take into account the write amp. What we should be doing is to apply the linear model here and subtract that number, and in storeReplicatedWorkAdmittedLocked apply the linear model again and subtract the difference (see the sketch below).
There is one problem of course -- the linear model may have changed from this subtraction to the correction. Unfortunately it is painful to plumb more than the originalTokens uint64 from the admission time to the done time. I think we can tolerate the change in the linear model and leave a TODO to improve this situation later. Also, with RACv2 we won't have an initial admission phase for most requests, so this will be ok.
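A sketch of the suggested two-step deduction, with hypothetical names (bucket, admit, done are illustrative, not the actual granter code):

package main

import "fmt"

type linearModel struct {
	multiplier float64
	constant   int64
}

func (m linearModel) apply(bytes int64) int64 {
	return int64(float64(bytes)*m.multiplier) + m.constant
}

type bucket struct{ diskWriteTokens int64 }

// admit deducts write-amp-adjusted tokens up front, based on the
// estimated size of the write.
func (b *bucket) admit(estimatedBytes int64, writeAmpLM linearModel) {
	b.diskWriteTokens -= writeAmpLM.apply(estimatedBytes)
}

// done applies the model again to the actual bytes and subtracts only
// the difference. If the model changed between the two calls, the total
// deduction can over- or under-shoot, which we tolerate.
func (b *bucket) done(actualBytes, estimatedBytes int64, writeAmpLM linearModel) {
	b.diskWriteTokens -= writeAmpLM.apply(actualBytes) - writeAmpLM.apply(estimatedBytes)
}

func main() {
	var b bucket
	lm := linearModel{multiplier: 10, constant: 1}
	b.admit(50, lm)                // deducts 501
	b.done(200, 50, lm)            // deducts 2001 - 501 = 1500 more
	fmt.Println(b.diskWriteTokens) // -2001, i.e. exactly lm.apply(200)
}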
pkg/util/admission/io_load_listener.go
line 605 at r2 (raw file):
writeBandwidth: int64((diskStats.BytesWritten - prevCumBytesWritten) / adjustmentInterval),
provisionedBandwidth: diskStats.ProvisionedBandwidth,
intProvisionedDiskBandwidth: diskStats.ProvisionedBandwidth * adjustmentInterval,
it's odd to need both the total bytes and the rate (bandwidth).
pkg/util/admission/store_token_estimation.go
line 131 at r2 (raw file):
// compactions after bytes are written to L0. It models the relationship
// between ingests (not including range snapshots) plus incoming L0 bytes and
// total write throughput in a given interval. We ignore range snapshots here,
nit: total disk write throughput
pkg/util/admission/store_token_estimation.go
line 200 at r2 (raw file):
e.cumL0WriteBytes = l0Metrics.BytesFlushed
e.cumL0IngestedBytes = l0Metrics.BytesIngested
e.cumLSMIngestedBytes = cumLSMIngestedBytes
e.cumDiskWrites needs to be initialized here.
pkg/util/admission/store_token_estimation.go
line 304 at r2 (raw file):
e.cumL0WriteBytes = l0Metrics.BytesFlushed
e.cumL0IngestedBytes = l0Metrics.BytesIngested
e.cumLSMIngestedBytes = cumLSMIngestedBytes
ditto
A unit test should catch this.
Force-pushed 3f442a3 to 115abcc
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Force-pushed 065dfe0 to ca3d6bd
Force-pushed 746dc8b to 65b4ad1
TFTR! Addressed all the comments. PTAL. Most of the new changes are in the unit tests. Sorry, these are pretty verbose due to the data-driven test and because we added a few fields to the output of io_load_listener.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)
pkg/util/admission/disk_bandwidth.go
line 53 at r2 (raw file):
Previously, sumeerbhola wrote…
The "Due to these challenges" and the two bullets following it are stale.
Done. Reworded a little to make it more accurate.
pkg/util/admission/disk_bandwidth.go
line 94 at r2 (raw file):
Previously, sumeerbhola wrote…
we shouldn't need both bandwidth (bytes/s) and bytes. Given that we need to produce byte tokens, I see that we need the bytes. Can you just change the code so that this file can use the package-wide adjustmentInterval?
Done. Removed the per second values.
pkg/util/admission/disk_bandwidth.go
line 99 at r2 (raw file):
Previously, sumeerbhola wrote…
where is this max util being set? I am guessing this is being passed in to every call to diskBandwidthLimiter, since the caller will sample the latest cluster setting value?
Yes. Set by caller, comes from cluster setting.
pkg/util/admission/disk_bandwidth.go
line 145 at r2 (raw file):
Previously, sumeerbhola wrote…
This is intReadBytes, yes? We should use the word bandwidth for fields that are in bytes/s.
Yeah, changed the wording.
pkg/util/admission/disk_bandwidth.go
line 149 at r2 (raw file):
Previously, sumeerbhola wrote…
let's call these diskWriteTokens to avoid creating ambiguity about the units.
Done.
pkg/util/admission/disk_bandwidth.go
line 191 at r2 (raw file):
Previously, sumeerbhola wrote…
typo: l.readIOPSTokens
Done.
pkg/util/admission/granter.go
line 405 at r2 (raw file):
Previously, sumeerbhola wrote…
subtracting only the count isn't quite right, since it doesn't take into account the write amp. What we should be doing is to apply the linear model here and subtract that number, and in storeReplicatedWorkAdmittedLocked apply the linear model again and subtract the difference. There is one problem of course -- the linear model may have changed from this subtraction to the correction. Unfortunately it is painful to plumb more than the originalTokens uint64 from the admission time to the done time. I think we can tolerate the change in the linear model and leave a TODO to improve this situation later. Also, with RACv2 we won't have an initial admission phase for most requests, so this will be ok.
Done.
pkg/util/admission/io_load_listener.go
line 605 at r2 (raw file):
Previously, sumeerbhola wrote…
it's odd to need both the total bytes and the rate (bandwidth).
Done. Removed.
pkg/util/admission/store_token_estimation.go
line 131 at r2 (raw file):
Previously, sumeerbhola wrote…
nit: total disk write throughput
Done.
pkg/util/admission/store_token_estimation.go
line 200 at r2 (raw file):
Previously, sumeerbhola wrote…
e.cumDiskWrites
needs to be initialized here.
Done.
pkg/util/admission/store_token_estimation.go
line 304 at r2 (raw file):
Previously, sumeerbhola wrote…
ditto
A unit test should catch this.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 12 files at r1, 1 of 6 files at r3, 1 of 6 files at r6, 13 of 13 files at r8, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi)
pkg/util/admission/disk_bandwidth.go
line 54 at r8 (raw file):
// Due to these challenges, we adopt a goal of simplicity of design.
//
// - The disk bandwidth limiter estimates write byte tokens using the
nit: estimates disk write ...
(i.e., repeat disk, to avoid any confusion)
pkg/util/admission/disk_bandwidth.go
line 57 at r8 (raw file):
// provisioned value and subtracts away reads seen in the previous interval.
//
// - We try to estimate the write amplification using a model of the incoming
nit: drop "try to", so simply "We estimate ..."
pkg/util/admission/disk_bandwidth.go
line 83 at r8 (raw file):
// intervalDiskLoadInfo provides disk stats over an adjustmentInterval.
type intervalDiskLoadInfo struct {
	// intReadBytes represents read bytes in a given interval.
nit: represents measured disk read bytes ...; same for intWriteBytes.
pkg/util/admission/disk_bandwidth.go
line 142 at r8 (raw file):
intReadBytes := int64(math.Max(smoothedReadBytes, float64(id.intReadBytes)))
diskWriteTokens := int64(float64(id.intProvisionedDiskBytes)*id.elasticBandwidthMaxUtil) - intReadBytes
// TODO(aaditya): consider setting a different floor to avoid startving out
nit: starving
pkg/util/admission/granter.go
line 405 at r8 (raw file):
// `storeReplicatedWorkAdmittedLocked()`. There is an obvious gap in
// accounting here, since the model could change between the two calls for the
// same request. It is acceptable to do so because the second call in
The second call being a better estimate of the size of the work is definitely true. But I think the "obvious gap" part could use some elaboration.
The problem is: say we initially thought the write was 50 bytes, and applied the writeAmpLM of 10x + 1, so 501 bytes. Then the linear model changes to 5x + 1, and we find that the actual write was 200 bytes. So we need 1001, and we think we have subtracted 251, so we additionally take 750. But we have already taken 501, so we only needed to take 500. So we take more than needed, which will hurt other work. And one can construct an example where we take less. The good thing is that the linear model changes every 15s, so the number of such cases is hopefully low. The sketch below works through these numbers.
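To make the arithmetic concrete, a small sketch that reproduces these numbers, assuming a y = a*x + b model:

package main

import "fmt"

func apply(mult float64, c, x int64) int64 { return int64(float64(x)*mult) + c }

func main() {
	admitted := apply(10, 1, 50) // admission: 50 bytes under the 10x+1 model -> 501 deducted
	needed := apply(5, 1, 200)   // done: actual 200 bytes under the new 5x+1 model -> 1001
	believed := apply(5, 1, 50)  // what the new model says the admission deduction was -> 251
	correction := needed - believed
	fmt.Println(correction)                     // 750 additionally deducted
	fmt.Println(admitted + correction - needed) // 250 deducted beyond the 1001 needed
}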
pkg/util/admission/granter.go
line 581 at r8 (raw file):
sg.coordMu.availableIOTokens[admissionpb.ElasticWorkClass] = max(sg.coordMu.availableIOTokens[admissionpb.ElasticWorkClass], 0)
// Same for disk write tokens.
It may also be less necessary to remember the deficit in the token bucket for disk writes, since these writes have already happened. Shaping future writes to be lower because we exceeded the bucket size in the previous interval doesn't seem beneficial given there isn't memory here (LSM L0 has memory, in comparison, since the read amp may still be higher).
pkg/util/admission/io_load_listener.go
line 92 at r8 (raw file):
settings.SystemOnly,
"kvadmission.store.elastic_disk_bandwidth_max_util",
"sets the max utilization for disk bandwidth for elastic traffic",
0.9,
0.9 seems high as a default, given we are approximating many things. If we exceed provisioned bandwidth, it has a negative effect on foreground latency. Should we start with 0.8?
pkg/util/admission/store_token_estimation.go
line 182 at r8 (raw file):
// excluding ignored bytes.
intAdjustedLSMWrites int64
intAdjustedWriteBytes int64
is this intAdjustedDiskWriteBytes?
pkg/util/admission/store_token_estimation.go
line 255 at r8 (raw file):
intDiskWrite := int64(cumDiskWrite - e.cumDiskWrites)
adjustedIntLSMWrites := adjustedIntL0WriteBytes + adjustedIntLSMIngestedBytes
adjustedIntWrites := intDiskWrite - intIgnoredIngestedBytes - intL0IgnoredWriteBytes
nit: let's call this adjustedIntDiskWrites
pkg/util/admission/testdata/granter
line 582 at r8 (raw file):
(chain: id: 0 active: false index: 5) io-avail: 370(370), disk-write-tokens-avail: -20
# Both tokens become positive, since 280 tokens are returned, so one work is granted.
are two work items being granted here? Does the comment need updating?
Force-pushed 65b4ad1 to e07b0af
TFTR!
bors r=sumeerbhola
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @sumeerbhola)
pkg/util/admission/disk_bandwidth.go
line 54 at r8 (raw file):
Previously, sumeerbhola wrote…
nit: estimates disk write ...
(i.e., repeat disk, to avoid any confusion)
Done.
pkg/util/admission/disk_bandwidth.go
line 57 at r8 (raw file):
Previously, sumeerbhola wrote…
nit: drop "try to", so simply "We estimate ..."
Done.
pkg/util/admission/disk_bandwidth.go
line 83 at r8 (raw file):
Previously, sumeerbhola wrote…
nit: represents measured disk read bytes ...; same for intWriteBytes.
Done.
pkg/util/admission/disk_bandwidth.go
line 142 at r8 (raw file):
Previously, sumeerbhola wrote…
nit: starving
Done.
pkg/util/admission/granter.go
line 405 at r8 (raw file):
Previously, sumeerbhola wrote…
The second call being a better estimate of the size of the work is definitely true. But I think the "obvious gap" part could use some elaboration.
The problem is: say we initially thought the write was 50 bytes, and applied the writeAmpLM of 10x + 1, so 501 bytes. Then the linear model changes to 5x + 1, and we find that the actual write was 200 bytes. So we need 1001, and we think we have subtracted 251, so we additionally take 750. But we have already taken 501, so we only needed to take 500. So we take more than needed, which will hurt other work. And one can construct an example where we take less. The good thing is that the linear model changes every 15s, so the number of such cases is hopefully low.
Done.
pkg/util/admission/granter.go
line 581 at r8 (raw file):
Previously, sumeerbhola wrote…
It may also be less necessary to remember the deficit in the token bucket for disk writes, since these writes have already happened. Shaping future writes to be lower because we exceeded the bucket size in the previous interval doesn't seem beneficial given there isn't memory here (LSM L0 has memory, in comparison, since the read amp may still be higher).
Done.
pkg/util/admission/io_load_listener.go
line 92 at r8 (raw file):
Previously, sumeerbhola wrote…
0.9 seems high as a default, given we are approximating many things. If we exceed provisioned bandwidth, it has a negative effect on foreground latency. Should we start with 0.8?
Done.
pkg/util/admission/store_token_estimation.go
line 182 at r8 (raw file):
Previously, sumeerbhola wrote…
is this intAdjustedDiskWriteBytes?
Done.
pkg/util/admission/store_token_estimation.go
line 255 at r8 (raw file):
Previously, sumeerbhola wrote…
nit: let's call this adjustedIntDiskWrites
Done.
pkg/util/admission/testdata/granter
line 582 at r8 (raw file):
Previously, sumeerbhola wrote…
are two work items being granted here? Does the comment need updating?
Done.
bors r-
Canceled.
Force-pushed f182a6d to 701493c
Force-pushed 701493c to 55f1edd
bors r=sumeerbhola
Previously, we would calculate elastic disk bandwidth tokens using
arbitrary load thresholds and an estimate of incoming bytes into the LSM
through flushes and ingestions. This calculation lacked accounting for
write amplification in the LSM.

This patch simplifies the disk bandwidth limiter to remove the disk load
watcher and simply adjust tokens using the known provisioned disk
bandwidth. For token deduction, we create a write-amp model that captures
the relationship between incoming LSM bytes and actual disk writes.

The token granting semantics are as follows:
- elastic writes: deduct tokens, and wait for a positive count in the bucket.
- regular writes: deduct tokens, but proceed even with no tokens available.

The requested write bytes are "inflated" using the estimated write
amplification to account for async compactions in the LSM.

This patch also lays the framework for future integrations where we can
account for range snapshot ingestions separately, as those don't incur
the same write amplification as flushed LSM bytes do.
Informs: #86857
Release note: None