admission: byte tokens for store admission #79092
Comments
This switch to byte tokens will result in better accounting for large writes, including ingests, based on whether their bytes land in L0 or elsewhere. It is also a precursor to taking flush capacity (in bytes) into account.

The store write admission control path now uses a StoreWorkQueue, which wraps a WorkQueue and provides additional functionality:
- Work can specify WriteBytes and whether it is an IngestRequest. This is used to decide how many byte tokens to consume.
- Done work specifies how many bytes were ingested into L0, so token consumption can be adjusted.

The main framework change is that a single work item can consume multiple (byte) tokens, which ripples through the various interfaces, including requester and granter. There is associated cleanup: kvGranter, which was handling both slots and tokens, is eliminated since in practice it was only doing one or the other. Instead, for the slot case the slotGranter is reused, and for the token case there is a new kvStoreTokenGranter. The main logic change is in ioLoadListener, which computes byte tokens and various estimates. The change is (mostly) neutral if no write provides WriteBytes, since the usual estimation takes over.

The integration changes in this PR are superficial in that the requests don't provide WriteBytes. Improvements to the integration, along with experimental results, will happen in future PRs.

Informs cockroachdb#79092
Informs cockroachdb#77357

Release note: None
80480: admission: change store write admission control to use byte tokens r=ajwerner a=sumeerbhola (description as in the preceding comment)

Informs #79092
Informs #77357

Release note: None

80892: hlc: move ParseHLC / DecimalToHLC to util/hlc r=ajwerner a=otan

util/hlc needs util/log, but ParseHLC/DecimalToHLC doesn't need to be in tree. To remove the tree dependency on util/log, move the mentioned functions.

Release note: None

80898: util/cache: remove dependency on util/log r=ajwerner a=otan

Use a hook in IntervalCache to log any errors instead.

Release note: None

Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
Co-authored-by: Oliver Tan <otan@cockroachlabs.com>
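For readers skimming the thread, here is a minimal Go sketch of what byte-token admission with a post-work-done adjustment looks like. The names (byteTokenGranter, tryGet, adjust, admitWrite) and the queueing behavior are illustrative assumptions only; they do not mirror the actual admission package API.

```
// Sketch of byte-token admission for store writes. The type and method
// names here are hypothetical, not the admission package's real API.
package main

import (
	"errors"
	"fmt"
	"sync"
)

// byteTokenGranter hands out byte tokens that would be periodically
// replenished by an I/O load listener (not shown).
type byteTokenGranter struct {
	mu              sync.Mutex
	availableTokens int64
}

// tryGet consumes `tokens` bytes worth of tokens, returning false if the
// store is out of tokens and the work must queue.
func (g *byteTokenGranter) tryGet(tokens int64) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.availableTokens <= 0 {
		return false
	}
	// Tokens may go negative; over-admission is corrected by later refills.
	g.availableTokens -= tokens
	return true
}

// adjust reconciles the estimate made at admission time with the bytes
// that actually landed in L0 once the work is done.
func (g *byteTokenGranter) adjust(estimated, actual int64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.availableTokens += estimated - actual
}

func admitWrite(g *byteTokenGranter, writeBytes int64) error {
	if !g.tryGet(writeBytes) {
		return errors.New("queued: no byte tokens available")
	}
	return nil
}

func main() {
	g := &byteTokenGranter{availableTokens: 1 << 20} // 1 MiB of tokens
	if err := admitWrite(g, 64<<10); err != nil {    // a 64 KiB write
		fmt.Println(err)
	}
	// Suppose only 32 KiB of that write ended up in L0: return the difference.
	g.adjust(64<<10, 32<<10)
	fmt.Println("byte tokens remaining:", g.availableTokens)
}
```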
The changes to the admission control package are done. The remaining work is
I don't think bulk has any roachtests with real concurrent SQL writes alongside bulk ingest; the ingest ones (IMPORT, RESTORE) have the normal system-driven writes (liveness, job progress, etc.) but that's it. Schema might have something that adds an index while running a workload, though? @ajwerner ?
There are a few, but I don't know much about them. See cockroach/pkg/cmd/roachtest/tests/schemachange.go, lines 304 to 329 at 21a286f.
The existing scheme for byte token estimation simply looked at the total bytes added to L0 and divided it among the number of requests. This was because (a) the parameters to provide better size information for the request were not populated by kvserver, and (b) the basic estimation approach was flawed, since it assumed that regular writes would be roughly equal sized and assumed that ingests would tell what fraction went into L0.

The kvserver-side plumbing for improving (a) was done in a preceding PR (cockroachdb#83937). This PR completes that plumbing to pass admission.StoreWorkDoneInfo on to the admission control package. In this scheme the {WriteBytes,IngestedBytes} are provided post-proposal evaluation, and the IngestedBytes is for the whole LSM. This PR makes changes to the plumbing in the admission package: specifically, the post-work-done token adjustments are performed via the granterWithStoreWriteDone interface and the addition to granterWithIOTokens. The former also returns the token adjustments to StoreWorkQueue so that the per-tenant fairness accounting in WorkQueue can be updated.

The main changes in this PR are in the byte token estimation logic in the admission package, where the estimation now uses a linear model y = a·x + b, where x is the bytes provided in admission.StoreWorkDoneInfo and y is the bytes added to L0 via write or ingestion. If we consider regular writes, one can expect that even with many different-sized workloads concurrently active on a node, we should be able to fit a model where a is roughly 2 and b is tiny -- this is because x is the bytes written to the raft log and does not include the subsequent state machine application. Similarly, one can expect the a term to be in the interval [0,1] for ingested work. The linear model is meant to fix flaw (b) mentioned earlier. The current linear model fitting in store_token_estimation.go is very simple and can be independently improved in the future -- there are code comments outlining this. Additionally, all the byte token estimation logic in granter.go has been removed, which is better from a code readability perspective.

This change was evaluated with a single node that first saw a kv0 workload writing 64KB blocks, then additionally a kv0 workload writing 4KB blocks, and finally a third workload that starts an index backfill by creating an index on the v column of the kv table. Here are snippets from a sequence of log statements when only the first workload (64KB writes) was running:

```
write-model 1.46x+1 B (smoothed 1.50x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 78 KiB
write-model 1.37x+1 B (smoothed 1.36x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 80 KiB
write-model 1.50x+1 B (smoothed 1.43x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 79 KiB
write-model 1.39x+1 B (smoothed 1.30x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 77 KiB
```

Note that the parameter a in a·x does fluctuate. The additive value b stays at the minimum of 1 byte, which is desirable. There is no change to the starting ingest model since there are no ingestions.

After both the 4KB and 64KB writes are active, the log statements look like:

```
write-model 1.85x+1 B (smoothed 1.78x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 59 KiB
write-model 1.23x+1 B (smoothed 1.51x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 47 KiB
write-model 1.21x+1 B (smoothed 1.36x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 40 KiB
```

Note that the b value stays at 1 byte. The tokens consumed at admission time are evenly divided among requests, so that value has dropped.

When the index backfill is also running, the sstables are ingested into L5 and L6, so the x value in the ingested model is high, but what is ingested into L0 is low, which means a becomes very small for the ingested-model -- see the smoothed 0.00x+1 B below. There is choppiness in this experiment with respect to the write model and the at-admission-tokens, which is caused by a high number of write stalls. This was not planned for, and is a side effect of huge Pebble manifests caused by 64KB keys. So ignore those values in the following log statements.

```
write-model 1.93x+1 B (smoothed 1.56x+2 B) + ingested-model 0.00x+1 B (smoothed 0.00x+1 B) + at-admission-tokens 120 KiB
write-model 2.34x+1 B (smoothed 1.95x+1 B) + ingested-model 0.00x+1 B (smoothed 0.00x+1 B) + at-admission-tokens 157 KiB
```

Fixes cockroachdb#79092
Informs cockroachdb#82536

Release note: None
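The per-interval fitting and exponential smoothing described above can be sketched roughly as follows. The constants, clamping bounds, and function names here are hypothetical assumptions for illustration and do not reflect the actual code in store_token_estimation.go.

```
// Simplified sketch of per-interval linear-model fitting for byte tokens:
// y ≈ a*x + b, where x is the bytes accounted by requests and y is the
// bytes that actually landed in L0. All constants are illustrative.
package main

import "fmt"

type linearModel struct {
	a float64 // multiplier
	b int64   // per-request additive term, in bytes
}

const smoothingAlpha = 0.5 // weight given to the latest interval

// fitInterval updates the smoothed model given one interval's observations.
func fitInterval(smoothed linearModel, accountedBytes, actualL0Bytes, numRequests int64,
	minA, maxA float64) linearModel {
	var cur linearModel
	if accountedBytes > 0 {
		// Fit the multiplier first, then put any remainder into b.
		cur.a = float64(actualL0Bytes) / float64(accountedBytes)
		if cur.a < minA {
			cur.a = minA
		} else if cur.a > maxA {
			cur.a = maxA
		}
		residual := actualL0Bytes - int64(cur.a*float64(accountedBytes))
		if numRequests > 0 {
			cur.b = residual / numRequests
		}
		if cur.b < 1 {
			cur.b = 1 // keep a tiny positive additive term
		}
	} else {
		cur = smoothed // nothing accounted this interval; carry the old model forward
	}
	return linearModel{
		a: smoothingAlpha*cur.a + (1-smoothingAlpha)*smoothed.a,
		b: int64(smoothingAlpha*float64(cur.b) + (1-smoothingAlpha)*float64(smoothed.b)),
	}
}

func main() {
	m := linearModel{a: 1.75, b: 1}
	// One interval: requests accounted 100 MiB of raft-log writes, and
	// 150 MiB was added to L0 across 2000 requests.
	m = fitInterval(m, 100<<20, 150<<20, 2000, 0.5, 3.0)
	fmt.Printf("smoothed model: %.2fx+%d B\n", m.a, m.b)
}
```

This is only meant to show the shape of the "fit then smooth" loop that produces log lines like `write-model 1.46x+1 B (smoothed 1.50x+1 B)` above.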
84761: schematelemetry,eventpb: add schema telemetry r=postamar a=postamar

This commit adds:
- the event definitions and logic for generating them,
- the scheduling and jobs boilerplate to periodically log them.

Care is taken to redact all strings present in descriptors which might unintentionally leak PII. The event generation logic is tested on the schema of a bootstrapped test cluster: the test checks that the events match expectations.

Fixes #84284.

Release note (general change): CRDB will now collect schema info if phoning home is enabled. This schema info is added to the telemetry log by a built-in scheduled job which runs on a weekly basis by default. This recurrence can be changed via the sql.schema.telemetry.recurrence cluster setting. The schedule can also be paused via PAUSE SCHEDULE followed by its ID, which can be retrieved by querying SELECT * FROM [SHOW SCHEDULES] WHERE label = 'sql-schema-telemetry'.

85059: admission,kvserver: improved byte token estimation for writes r=irfansharif,tbg a=sumeerbhola (description as in the preceding comment)

Fixes #79092
Informs #82536

Release note: None

Co-authored-by: Marius Posta <marius@cockroachlabs.com>
Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
Admission control is not aware of the bytes that will be written by each work item. It does compute tokens based on the bytes being written to L0 and compacted out of L0, but it turns those into a token per work item by using the mean bytes written per work item. Such estimates are fine for small writes, but proper accounting for larger writes (like AddSSTable, and in the future range snapshot application) is preferable, since it avoids spikes of over-admission that are later compensated for by under-admission.
Also, we should be properly accounting for how many of the ingested bytes were added to L0.
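As a rough numeric illustration of why the mean-based scheme over-admits large writes (the numbers below are hypothetical, not from a real run):

```
// Hypothetical illustration: under mean-based estimation every request is
// charged the interval mean, so a 32 MiB AddSSTable pays the same number
// of byte tokens as a 4 KiB write.
package main

import "fmt"

func main() {
	// Three 4 KiB writes and one 32 MiB ingest admitted in the same interval.
	requestBytes := []int64{4 << 10, 4 << 10, 4 << 10, 32 << 20}
	var total int64
	for _, b := range requestBytes {
		total += b
	}
	meanTokens := total / int64(len(requestBytes))
	for _, b := range requestBytes {
		fmt.Printf("%8d B request charged %d B of tokens under mean-based estimation\n", b, meanTokens)
	}
	// With byte tokens, each request would instead be charged (an estimate of)
	// its own contribution to L0, avoiding over-admission of the large ingest.
}
```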
We plan to:
#75120 contains a WIP PR with this change.
Deficiencies:
The overload threshold at which we want to start constraining writes could be different for user-facing and background operations (like index backfills): by sharing a single queue and a single source of tokens for that queue, we also share the same overload thresholds. This is probably not an immediate problem, since rocksdb.ingest_backpressure.l0_file_count_threshold and admission.l0_sub_level_count_overload_threshold both default to 20. There is a way to address this in the future via a hierarchical token bucket scheme (sketched below): the admission.ioLoadListener would produce high_overload_tokens and low_overload_tokens, where background operations have to consume both, while foreground operations only use the former (we are trying to do something similar for CPU slots for background activities like concurrent compression threads in Pebble compactions).
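A minimal sketch of that hierarchical scheme, assuming two buckets named after the prose above; this is illustrative only, not an actual CockroachDB design or API:

```
// Sketch of the hierarchical token idea: background work must obtain a token
// from both buckets, foreground work only from the high-overload bucket, so
// background work gets throttled at a lower (stricter) overload level.
package main

import "fmt"

type tokenBuckets struct {
	highOverloadTokens int64 // replenished until the high overload threshold is hit
	lowOverloadTokens  int64 // replenished until a lower, stricter threshold is hit
}

// tryAdmit consumes tokens for one unit of work.
func (t *tokenBuckets) tryAdmit(background bool, tokens int64) bool {
	if t.highOverloadTokens < tokens {
		return false
	}
	if background && t.lowOverloadTokens < tokens {
		return false
	}
	t.highOverloadTokens -= tokens
	if background {
		t.lowOverloadTokens -= tokens
	}
	return true
}

func main() {
	t := &tokenBuckets{highOverloadTokens: 10 << 20, lowOverloadTokens: 1 << 20}
	fmt.Println("foreground admitted:", t.tryAdmit(false, 4<<20)) // true
	fmt.Println("background admitted:", t.tryAdmit(true, 4<<20))  // false: not enough low-overload tokens
}
```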
This issue is split off from the broader #75066
Jira issue: CRDB-14556
Epic CRDB-14607