
roachtest: attempt to handle VM overload under tpccbench #62361

Merged (1 commit) on Apr 2, 2021

Conversation

@tbg (Member) commented Mar 22, 2021

See #62039.

tpccbench, by design, pushes CRDB into overload territory. The test
harness handles nodes crashing or tpmC tanking well. However, it was
not prepared to handle the cloud VMs going unresponsive for ~minutes,
which is one common failure mode.

This commit tweaks the line search to be resilient to failures to
communicate with the cloud VM in the one place where it matters
(stopping the cluster at the beginning of a new search attempt).

The hope is that this will allow the search to run to completion,
even in the face of overload-imposed temporary VM outages. It is
not expected to do this reliably, but at least anecdotally most
VMs seem to come back a few minutes in.

Release note: None
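
For illustration, a minimal sketch of the kind of retry described above, assuming a hypothetical stopCluster helper (this is not the actual diff):

package sketch

import (
    "context"
    "fmt"
    "time"
)

// stopWithRetry keeps trying to stop the cluster for a few minutes, on the
// assumption that a browned-out VM usually becomes reachable again shortly.
func stopWithRetry(ctx context.Context, stopCluster func(context.Context) error) error {
    deadline := time.Now().Add(5 * time.Minute) // assumed overall retry budget
    for {
        err := stopCluster(ctx)
        if err == nil {
            return nil
        }
        if time.Now().After(deadline) {
            return fmt.Errorf("cluster did not stop in time: %w", err)
        }
        time.Sleep(30 * time.Second) // give the VM a chance to recover
    }
}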

@tbg (Member, Author) commented Mar 23, 2021

I realized that this isn't nearly enough. The line search ends on any error:

pass, err := pred(s.current())
if err != nil {
    return 0, err
}

I did a quick sanity check running with 2500 warehouses (max is < 2200) and yep, a node quickly crashes, the line search ends, and the test fails hard. It seems that any overload error encountered by the test would fail the test (though perhaps the stats for any passing run would still get picked up by roachperf, thus providing at least a data point? Not sure).

I'll do some more work here to make it actually resilient to overload, i.e. this should be interpreted as a failing run of TPCC, not a failure of the test.
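
A rough sketch of that direction (illustrative names, not the actual roachtest code): wrap the search predicate so that recoverable failures count as a failing run instead of ending the search, and only a designated class of errors aborts it.

package sketch

import (
    "errors"
    "log"
)

// fatalError marks problems that should abort the whole search (assumed type).
type fatalError struct{ err error }

func (f *fatalError) Error() string { return f.err.Error() }

// resilientPred converts recoverable errors from the underlying predicate
// into a failing run (false, nil) so the line search keeps narrowing down.
func resilientPred(pred func(warehouses int) (bool, error)) func(int) (bool, error) {
    return func(warehouses int) (bool, error) {
        pass, err := pred(warehouses)
        if err == nil {
            return pass, nil
        }
        var fatal *fatalError
        if errors.As(err, &fatal) {
            return false, err // unrecoverable: end the search
        }
        log.Printf("warehouses=%d: %v; treating as a failing run", warehouses, err)
        return false, nil
    }
}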

@irfansharif (Contributor) commented:

> It seems that any overload error encountered by the test would fail the test

I've seen instances where, if the overload happens early on during the workload (say 3m in, instead of 8m in), the unresponsive VM is "back" by the time we've finished the 10m workload (albeit in degraded mode, using the compute of 2/3 nodes) and is therefore roachprod stop-able, so the test doesn't actually fail. Not that it's directly relevant to this PR, but you might find it interesting.

@tbg (Member, Author) commented Mar 25, 2021

The version I just pushed is starting to look good. The snippet below (from test.log) shows a node crashing during the 2500-warehouse run; we then line search to 2488 warehouses but can't reset the cluster because of a VM brown-out; we correctly retry, it works out, and the 2488-warehouse run is now in progress.

I'm running ten instances of this test, so I'll see if anything else shakes out.

14:20:28 tpcc.go:903: test status: running benchmark, warehouses=2500
14:20:28 cluster.go:2318: [w6] > ./cockroach workload run tpcc --warehouses=2500 --active-warehouses=2500 --tolerate-errors --ramp=5m0s --duration=10m0s --histograms=perf/warehouses=2500/stats.json {pgurl:1-3}
3: 7324
2: 7237
1: 7595
2: kill exited nonzero
2: dead
14:31:24 tpcc.go:974: --- SEARCH ITER FAIL: TPCC 2500 resulted in 0.0 tpmC and failed due to monitor failure: unexpected node event: 2: dead
14:31:24 tpcc.go:805: initializing cluster for 2488 warehouses (search attempt: 2)
14:31:24 cluster.go:2207: test status: stopping cluster
14:31:24 cluster.go:387: > /home/tobias/go/src/github.com/cockroachdb/cockroach/roachprod.tpccbench-harden stop tobias-1616680721-05-n4cpu16:1-3
tobias-1616680721-05-n4cpu16: stopping and waiting....................................................................................................................................................................
2: exit status 255: 
I210325 14:34:09.311235 1 (gostd) cluster_synced.go:1732  [-] 1  command failed
14:34:09 tpcc.go:824: unable to stop cluster; retrying to allow vm to recover: /home/tobias/go/src/github.com/cockroachdb/cockroach/roachprod.tpccbench-harden stop tobias-1616680721-05-n4cpu16:1-3 returned: exit status 1
14:34:39 cluster.go:2207: test status: stopping cluster
14:34:39 cluster.go:387: > /home/tobias/go/src/github.com/cockroachdb/cockroach/roachprod.tpccbench-harden stop tobias-1616680721-05-n4cpu16:1-3
tobias-1616680721-05-n4cpu16: stopping and waiting.
14:34:41 cluster.go:2157: test status: starting cluster
14:34:41 cluster.go:387: > /home/tobias/go/src/github.com/cockroachdb/cockroach/roachprod.tpccbench-harden start --encrypt=false tobias-1616680721-05-n4cpu16:1-3
tobias-1616680721-05-n4cpu16: starting nodes
tobias-1616680721-05-n4cpu16: initializing cluster
tobias-1616680721-05-n4cpu16: setting cluster settings
14:35:11 tpcc.go:903: test status: running benchmark, warehouses=2488
14:35:11 cluster.go:2318: [w6] > ./cockroach workload run tpcc --warehouses=2500 --active-warehouses=2488 --tolerate-errors --ramp=5m0s --duration=10m0s --histograms=perf/warehouses=2488/stats.json {pgurl:1-3}
2: 8998
3: 8411
1: 8950

@tbg (Member, Author) commented Mar 25, 2021

Looks like some of these VMs are browning out for quite a while (note the time since "stopping cluster" on some of these):

[screenshot: roachtest statuses, several stuck in "stopping cluster" for a long time]

@tbg (Member, Author) commented Mar 25, 2021

Ugh, all test runs are stuck stopping their clusters, some of them for 20 minutes :S I wonder if these VMs are ever coming back.

As sort of a trial run, I reset one VM I couldn't ssh to for a while via

CLOUDSDK_CORE_PROJECT=cockroach-ephemeral gcloud compute instances reset --zone us-east1-b tobias-1616680721-04-n4cpu16-0002

and I assumed this was going to backfire due to the VM losing its local SSD, but it didn't! So I think the test could in theory continue. Unfortunately, it's still stuck on

stopping cluster (set 36m57s ago)

I might need to add a cancellation here, so that retries are eager. But either way, it's not great for a roachtest to invoke cloud-specific commands to reset a VM; we'd have to properly add support for that, and honestly that doesn't sound like a good idea. Still, we need to somehow make this test resilient, since it's not like we can run it on some other infrastructure anytime soon.
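
For what it's worth, a minimal sketch of the "eager retries" idea: bound each stop attempt with its own timeout so a connection hanging on a browned-out VM can't stall the loop (stopAttempt is a stand-in, not a real helper, and the budgets are made up):

package sketch

import (
    "context"
    "time"
)

// stopEagerly retries stopping the cluster, capping each attempt at two
// minutes so that a hung attempt doesn't block the next one.
func stopEagerly(ctx context.Context, stopAttempt func(context.Context) error) error {
    var err error
    for i := 0; i < 10; i++ { // assumed retry budget
        attemptCtx, cancel := context.WithTimeout(ctx, 2*time.Minute)
        err = stopAttempt(attemptCtx)
        cancel()
        if err == nil {
            return nil
        }
        time.Sleep(30 * time.Second) // give the VM a chance to recover
    }
    return err
}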

@tbg (Member, Author) commented Mar 25, 2021

Starting a new attempt. I added `roachprod reset`, and it will be invoked before trying to stop the cluster in each iteration (so that it's in the hot path and I can be sure it works all the time).
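
Roughly, the hot-path ordering would look like this (sketch only; the binary path and cluster name are placeholders, and `roachprod reset` is the new subcommand mentioned above):

package sketch

import (
    "fmt"
    "os/exec"
)

// resetThenStop reboots the VMs first (so a browned-out machine comes back),
// then stops the cockroach processes before the next search attempt.
func resetThenStop(roachprodBin, cluster string) error {
    if out, err := exec.Command(roachprodBin, "reset", cluster).CombinedOutput(); err != nil {
        return fmt.Errorf("reset failed: %v: %s", err, out)
    }
    if out, err := exec.Command(roachprodBin, "stop", cluster).CombinedOutput(); err != nil {
        return fmt.Errorf("stop failed: %v: %s", err, out)
    }
    return nil
}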

@tbg (Member, Author) commented Mar 25, 2021

Hey, look at this: it actually does seem to work. Tests reliably make it from the 2500 run into the 2488 run:

[screenshot: test statuses showing runs progressing from the 2500 to the 2488 warehouse step]

@tbg (Member, Author) commented Mar 25, 2021

[screenshots omitted]

@tbg (Member, Author) commented Mar 26, 2021

Some questionable results, but at least results! One run failed because my computer went to sleep, which tripped up a gcloud invocation.

tobias@td:~/go/src/github.com/cockroachdb/cockroach$ rg 'MAX WAREHOUSES' -- artifacts/tpccbench/
artifacts/tpccbench/nodes=3/cpu=16/run_3/test.log
469:MAX WAREHOUSES = 2170

artifacts/tpccbench/nodes=3/cpu=16/run_1/test.log
447:MAX WAREHOUSES = 1834

artifacts/tpccbench/nodes=3/cpu=16/run_8/test.log
438:MAX WAREHOUSES = 2086

artifacts/tpccbench/nodes=3/cpu=16/run_4/test.log
633:MAX WAREHOUSES = 961

artifacts/tpccbench/nodes=3/cpu=16/run_2/test.log
389:MAX WAREHOUSES = 2230

artifacts/tpccbench/nodes=3/cpu=16/run_9/test.log
442:MAX WAREHOUSES = 2086

artifacts/tpccbench/nodes=3/cpu=16/run_5/test.log
390:MAX WAREHOUSES = 2170

artifacts/tpccbench/nodes=3/cpu=16/run_6/test.log
647:MAX WAREHOUSES = 961

artifacts/tpccbench/nodes=3/cpu=16/run_10/test.log
463:MAX WAREHOUSES = 2122

irfansharif added a commit to irfansharif/cockroach that referenced this pull request Apr 1, 2021
We arrived at the previous default rate of 10% back in cockroachdb#59379. This was
back when we were creating real tracing spans for all statements, and
for sampled statements, we were propagating additional stats payloads.
Consequently what cockroachdb#59379 ended up measuring (and finding the overhead
acceptable) for was the performance hit we would incur for propagating
stats payloads for statements already using real tracing spans.

Since then, the landscape has changed. Notably we introduced cockroachdb#61777,
which made it so that we were only using real tracing spans for sampled
statements. This was done after performance analysis in cockroachdb#59424 showed
that the use of real tracing spans in all statements resulted in
tremendous overhead, for no real benefit.

What this now leaves us with is a sampling rate that was tuned by only
considering the stats payload overhead. What we want now is to also
consider the overhead of using real tracing spans for sampled
statements, vs. not. Doing this analysis gives us a very different
picture for what the default rate should be.

---

To find out what the overhead for sampled statements is currently, we
experimented with kv95/enc=false/nodes=1/cpu=32. It's a simple
benchmark that does little more than one-off statements, so should give
us a concise picture of the sampling overhead. We ran six experiments
in total (each corresponding to a pair of read+write rows), done in
groups of three (each group corresponding to a table below). Each
run in turn is comprised of 10 iterations of kv95, and what's varied
between each run is the default sampling rate. We pin a sampling rate of
0.0 as the baseline that effectively switches off sampling entirely (and
tracing), and measure the throughput degradation as we vary the sampling
rate.

                          ops/sec            ops/sec
    --------------------|------------------|------------------
    rate   op      grp  | median    diff   | mean      diff
    --------------------|------------------|------------------
    0.00 / read  / #1   | 69817.90         | 69406.37
    0.01 / read  / #1   | 69300.35  -0.74% | 68717.23  -0.99%
    0.10 / read  / #1   | 67743.35  -2.97% | 67601.81  -2.60%
    0.00 / write / #1   |  3672.55         |  3653.63
    0.01 / write / #1   |  3647.65  -0.68% |  3615.90  -1.03%
    0.10 / write / #1   |  3567.20  -2.87% |  3558.90  -2.59%

                          ops/sec            ops/sec
    --------------------|------------------|------------------
    rate   op      grp  | median    diff   | mean      diff
    --------------------|------------------|------------------
    0.00 / read  / #2   | 69440.80         | 68893.24
    0.01 / read  / #2   | 69481.55  +0.06% | 69463.13  +0.82% (probably in the noise margin)
    0.10 / read  / #2   | 67841.80  -2.30% | 66992.55  -2.76%
    0.00 / write / #2   |  3652.45         |  3625.24
    0.01 / write / #2   |  3657.55  +0.14% |  3654.34  +0.80%
    0.10 / write / #2   |  3570.75  -2.24% |  3526.04  -2.74%

The results above suggest that the current default rate of 10% is too
high, and a 1% rate is much more acceptable.
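
(For reference, the diff columns are just the relative change against the 0.00 baseline; a minimal sketch of that computation, using the group #1 read medians:)

package main

import "fmt"

// diffPct computes the relative change of a measurement against the
// zero-sampling baseline, matching the "diff" columns above.
func diffPct(baseline, measured float64) float64 {
    return (measured - baseline) / baseline * 100
}

func main() {
    // group #1 read medians: 0.01 rate vs 0.00 baseline -> about -0.74%
    fmt.Printf("%+.2f%%\n", diffPct(69817.90, 69300.35))
}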

---

The fact that the cost of sampling is largely dominated by tracing is
extremely unfortunate. We have ideas for how that can be improved
(prototyped in cockroachdb#62227), but they're much too invasive to backport to
21.1. It's unfortunate that we only discovered the overhead this late in
the development cycle. It was due to two major reasons:
- cockroachdb#59992 landed late in the cycle, and enabled tracing for realsies (by
  propagating real tracing spans across rpc boundaries). We had done
  sanity checking for the tracing overhead before this point, but failed
  to realize that cockroachdb#59992 would merit re-analysis.
- The test that alerted us to the degradation (tpccbench) had been
  persistently failing for a myriad of other reasons, so we didn't learn
  until too late that tracing was the latest offender. tpccbench also
  doesn't deal with VM overload well (something cockroachdb#62361 hopes to
  address), and after tracing was enabled for realsies, this was the
  dominant failure mode. This resulted in perf data not making its way
  to roachperf, which further hid possible indicators we had a major
  regression on our hands. We also didn't have a healthy process looking
  at roachperf on a continual basis, something we're looking to rectify
  going forward. We would've picked up on this regression had we been
  closely monitoring the kv95 charts.

Release note: None
@irfansharif (Contributor) commented:
Flakes were due to #62604. This PR worked beautifully for some test runs I did as part of #62998; will merge on green.

@irfansharif (Contributor) commented:
bors r+

See cockroachdb#62039.

`tpccbench`, by design, pushes CRDB into overload territory. However,
tpccbench was not handling the resulting conditions at all. Any node
death or stalling VM would fail the whole test run, rather than retry
again with a lower warehouse count.

This commit makes most errors "recoverable" in the sense that they
simply mark the run as failing while the line search continues.
Exceptions are communicated via `t.Fatal`, which will abort the
whole run instead.

We also make the `c.Stop()` step at the beginning of each search
step resilient to VMs browning out under memory pressure (from
the previous run), by patiently retrying for a few minutes.

The hope is that this will allow the search to run to completion, even
in the face of overload-imposed temporary VM outages. It is not expected
to do this perfectly - after all, VMs don't generally return from
brown-out within any fixed time period - but at least anecdotally most
VMs seem to come back a few minutes in.

Release note: None
@craig (bot) commented Apr 1, 2021

Canceled.

@irfansharif (Contributor) commented:
bors r+

@craig (bot) commented Apr 2, 2021

Build succeeded.

@craig (bot) merged commit 0966b27 into cockroachdb:master on Apr 2, 2021
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Apr 2, 2021
@tbg deleted the tpccbench-harden branch on Apr 2, 2021, 05:59