roachtest: tpccbench/nodes=3/cpu=16 failed #59424

cockroach-teamcity · 2021-01-26T10:56:13Z

(roachtest).tpccbench/nodes=3/cpu=16 failed on master@d86781c07065421f4a4d8bf5d988900ab07fdce5:

		Error types: (1) *main.withCommandDetails (2) *exec.ExitError

	cluster.go:1666,context.go:140,cluster.go:1655,test_runner.go:848: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2615395-1611644173-19-n4cpu16 --oneshot --ignore-empty-nodes: exit status 1 4: skipped
		3: dead
		2: dead
		1: 12227
		Error: UNCLASSIFIED_PROBLEM: 3: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) secondary error attachment
		  | 2: dead
		  | (1) attached stack trace
		  |   -- stack trace:
		  |   | main.glob..func14
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  |   | main.wrap.func1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  |   | github.com/spf13/cobra.(*Command).execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  |   | github.com/spf13/cobra.(*Command).ExecuteC
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  |   | github.com/spf13/cobra.(*Command).Execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  |   | main.main
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1852
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:204
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		  | Wraps: (2) 2: dead
		  | Error types: (1) *withstack.withStack (2) *errutil.leafError
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1852
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:204
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (4) 3: dead
		Error types: (1) errors.Unclassified (2) *secondary.withSecondaryError (3) *withstack.withStack (4) *errutil.leafError

More

Artifacts: /tpccbench/nodes=3/cpu=16
Related:

roachtest: tpccbench/nodes=3/cpu=16 failed #55802 roachtest: tpccbench/nodes=3/cpu=16 failed C-test-failure O-roachtest O-robot T-bulkio branch-release-20.1

See this test on roachdash
_{powered by pkg/cmd/internal/issues}

The text was updated successfully, but these errors were encountered:

aayushshah15 · 2021-01-26T15:47:31Z

Very similar symptoms to what we're seeing in #58298. It's unclear to me what's causing the crash. The dmesg file doesn't have any mention of oom-killer.

Heap profile doesn't seem to indicate any foulness and looks like 64d6d87 is doing what it's supposed to.

However, I am seeing a whole lot of gcTxnRecords in the stacktraces, as @nvanbenschoten previously observed.

cat goroutine_dump.2021-01-26T08_47_17.877.double_since_last_dump.000006950.txt | rg "gcTxnRecord" | wc -l
     251

irfansharif · 2021-02-01T22:54:17Z

Should no longer be a release blocker as long as the release is picking up #59431.

cockroach-teamcity · 2021-02-15T09:09:37Z

(roachtest).tpccbench/nodes=3/cpu=16 failed on master@5971ecb9dd1a25c81cd6012d6be1ff922802eae5:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=3/cpu=16/run_1
	cluster.go:2219,tpcc.go:807,search.go:43,search.go:173,tpcc.go:803,tpcc.go:617,test_runner.go:767: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-2676655-1613372337-19-n4cpu16:1-3 returned: exit status 1
		(1) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-2676655-1613372337-19-n4cpu16:1-3 returned
		  | stderr:
		  |
		  | stdout:
		  | teamcity-2676655-1613372337-19-n4cpu16: stopping and waiting..................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
		  | 2: exit status 255: Connection to 35.222.134.47 closed by remote host.
		  |
		  | I210215 09:00:36.471961 1 (gostd) cluster_synced.go:1732  [-] 1  command failed
		Wraps: (2) exit status 1
		Error types: (1) *main.withCommandDetails (2) *exec.ExitError

More

Artifacts: /tpccbench/nodes=3/cpu=16
Related:

roachtest: tpccbench/nodes=3/cpu=16 failed #60094 roachtest: tpccbench/nodes=3/cpu=16 failed C-test-failure O-roachtest O-robot branch-release-20.2 release-blocker
roachtest: tpccbench/nodes=3/cpu=16 failed #55802 roachtest: tpccbench/nodes=3/cpu=16 failed C-test-failure O-roachtest O-robot T-bulkio branch-release-20.1

See this test on roachdash
_{powered by pkg/cmd/internal/issues}

cockroach-teamcity · 2021-02-18T09:05:27Z

(roachtest).tpccbench/nodes=3/cpu=16 failed on master@3c223f5f5162103110a790743b687ef2bf952489:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=3/cpu=16/run_1
	tpcc.go:917,tpcc.go:617,test_runner.go:767: monitor failure: unexpected node event: 3: dead
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2675
		  | main.runTPCCBench.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:894
		  | github.com/cockroachdb/cockroach/pkg/util/search.searchWithSearcher
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:43
		  | github.com/cockroachdb/cockroach/pkg/util/search.(*lineSearcher).Search
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:173
		  | main.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:803
		  | main.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:617
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:767
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 3: dead
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

More

Artifacts: /tpccbench/nodes=3/cpu=16
Related:

roachtest: tpccbench/nodes=3/cpu=16 failed #60094 roachtest: tpccbench/nodes=3/cpu=16 failed C-test-failure O-roachtest O-robot branch-release-20.2 release-blocker
roachtest: tpccbench/nodes=3/cpu=16 failed #55802 roachtest: tpccbench/nodes=3/cpu=16 failed C-test-failure O-roachtest O-robot T-bulkio branch-release-20.1

See this test on roachdash
_{powered by pkg/cmd/internal/issues}

cockroach-teamcity · 2021-02-19T08:53:44Z

(roachtest).tpccbench/nodes=3/cpu=16 failed on master@83e70ce84b740e27e721c3b73c38a4b8b515094a:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=3/cpu=16/run_1
	tpcc.go:917,tpcc.go:617,test_runner.go:767: monitor failure: unexpected node event: 3: dead
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2675
		  | main.runTPCCBench.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:894
		  | github.com/cockroachdb/cockroach/pkg/util/search.searchWithSearcher
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:43
		  | github.com/cockroachdb/cockroach/pkg/util/search.(*lineSearcher).Search
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:173
		  | main.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:803
		  | main.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:617
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:767
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 3: dead
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

More

Artifacts: /tpccbench/nodes=3/cpu=16
Related:

roachtest: tpccbench/nodes=3/cpu=16 failed #60747 roachtest: tpccbench/nodes=3/cpu=16 failed C-test-failure O-roachtest O-robot branch-60635 release-blocker
roachtest: tpccbench/nodes=3/cpu=16 failed #60094 roachtest: tpccbench/nodes=3/cpu=16 failed C-test-failure O-roachtest O-robot branch-release-20.2 release-blocker
roachtest: tpccbench/nodes=3/cpu=16 failed #55802 roachtest: tpccbench/nodes=3/cpu=16 failed C-test-failure O-roachtest O-robot T-bulkio branch-release-20.1

See this test on roachdash
_{powered by pkg/cmd/internal/issues}

nvanbenschoten · 2021-02-20T03:02:25Z

The last two failures are likely fallout from #60765.

cockroach-teamcity · 2021-02-21T09:02:31Z

(roachtest).tpccbench/nodes=3/cpu=16 failed on master@64c4aef909f4382523cd9248341ca9f4448d841a:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=3/cpu=16/run_1
	tpcc.go:917,tpcc.go:617,test_runner.go:767: monitor failure: unexpected node event: 2: dead
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2676
		  | main.runTPCCBench.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:894
		  | github.com/cockroachdb/cockroach/pkg/util/search.searchWithSearcher
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:43
		  | github.com/cockroachdb/cockroach/pkg/util/search.(*lineSearcher).Search
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:173
		  | main.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:803
		  | main.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:617
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:767
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 2: dead
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

More

Artifacts: /tpccbench/nodes=3/cpu=16
Related:

roachtest: tpccbench/nodes=3/cpu=16 failed #60747 roachtest: tpccbench/nodes=3/cpu=16 failed C-test-failure O-roachtest O-robot branch-60635 release-blocker
roachtest: tpccbench/nodes=3/cpu=16 failed #60094 roachtest: tpccbench/nodes=3/cpu=16 failed C-test-failure O-roachtest O-robot branch-release-20.2 release-blocker
roachtest: tpccbench/nodes=3/cpu=16 failed #55802 roachtest: tpccbench/nodes=3/cpu=16 failed C-test-failure O-roachtest O-robot T-bulkio branch-release-20.1

See this test on roachdash
_{powered by pkg/cmd/internal/issues}

cockroach-teamcity · 2021-02-22T10:51:41Z

(roachtest).tpccbench/nodes=3/cpu=16 failed on master@bf9744bad5a416a4b06907f0f3dd42896f7342f3:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=3/cpu=16/run_1
	cluster.go:2220,tpcc.go:807,search.go:43,search.go:173,tpcc.go:803,tpcc.go:617,test_runner.go:767: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-2702231-1613977007-25-n4cpu16:1-3 returned: exit status 1
		(1) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-2702231-1613977007-25-n4cpu16:1-3 returned
		  | stderr:
		  |
		  | stdout:
		  | <... some data truncated by circular buffer; go to artifacts for details ...>
		  | .....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
		  | 0: exit status 255: 
		  | 1: exit status 255: 
		  | I210222 10:42:40.368808 1 (gostd) cluster_synced.go:1732  [-] 1  command failed
		Wraps: (2) exit status 1
		Error types: (1) *main.withCommandDetails (2) *exec.ExitError

More

Artifacts: /tpccbench/nodes=3/cpu=16
Related:

roachtest: tpccbench/nodes=3/cpu=16 failed #60747 roachtest: tpccbench/nodes=3/cpu=16 failed C-test-failure O-roachtest O-robot branch-60635 release-blocker
roachtest: tpccbench/nodes=3/cpu=16 failed #60094 roachtest: tpccbench/nodes=3/cpu=16 failed C-test-failure O-roachtest O-robot branch-release-20.2 release-blocker
roachtest: tpccbench/nodes=3/cpu=16 failed #55802 roachtest: tpccbench/nodes=3/cpu=16 failed C-test-failure O-roachtest O-robot T-bulkio branch-release-20.1

See this test on roachdash
_{powered by pkg/cmd/internal/issues}

cockroach-teamcity · 2021-02-23T11:10:45Z

(roachtest).tpccbench/nodes=3/cpu=16 failed on master@5cfd7e5553a3072a1490d392390dddf968844215:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=3/cpu=16/run_1
	cluster.go:2220,tpcc.go:807,search.go:43,search.go:173,tpcc.go:803,tpcc.go:617,test_runner.go:767: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-2707822-1614064242-26-n4cpu16:1-3 returned: exit status 1
		(1) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-2707822-1614064242-26-n4cpu16:1-3 returned
		  | stderr:
		  |
		  | stdout:
		  | <... some data truncated by circular buffer; go to artifacts for details ...>
		  | .....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
		  | 1: exit status 255: 
		  | 2: exit status 255: 
		  | I210223 11:01:44.715647 1 (gostd) cluster_synced.go:1732  [-] 1  command failed
		Wraps: (2) exit status 1
		Error types: (1) *main.withCommandDetails (2) *exec.ExitError

More

Artifacts: /tpccbench/nodes=3/cpu=16
Related:

roachtest: tpccbench/nodes=3/cpu=16 failed #60747 roachtest: tpccbench/nodes=3/cpu=16 failed C-test-failure O-roachtest O-robot branch-60635 release-blocker
roachtest: tpccbench/nodes=3/cpu=16 failed #60094 roachtest: tpccbench/nodes=3/cpu=16 failed C-test-failure O-roachtest O-robot branch-release-20.2 release-blocker
roachtest: tpccbench/nodes=3/cpu=16 failed #55802 roachtest: tpccbench/nodes=3/cpu=16 failed C-test-failure O-roachtest O-robot T-bulkio branch-release-20.1

See this test on roachdash
_{powered by pkg/cmd/internal/issues}

irfansharif · 2021-02-23T18:24:39Z

08:41:31 test.go:196: test status: 
08:41:31 tpcc.go:911: --- SEARCH ITER FAIL: TPCC 2200 resulted in 6142.5 tpmC and failed due to efficiency value of 22.150252691927975 is below ppassing threshold of 85
08:41:31 tpcc.go:805: initializing cluster for 2188 warehouses (search attempt: 2)

I think this is a real failure that needs to be investigated. Partway through the workload we seem to hit 0QPS due to crashed nodes. Except the logs for these nodes aren't available, because the machines have gone? Could be an infra-flake, but I'm not sure why this is the only test (that I'm aware of) that's seeing a pattern like this.

 238.0s        0           84.8          228.4  22548.6  47244.6  81604.4 103079.2 newOrder
  238.0s        0           17.9           24.7  12348.0  38654.7  47244.6  47244.6 orderStatus
  238.0s        0          195.5          241.4  24696.1  49392.1  90194.3 103079.2 payment
  238.0s        0           24.9           24.3  15569.3  34359.7  47244.6  47244.6 stockLevel
  239.0s        0           16.0           23.5  38654.7  49392.1  51539.6  51539.6 delivery
  239.0s        0          107.3          227.9  31138.5  73014.4 103079.2 103079.2 newOrder
  239.0s        0            8.0           24.7  21474.8  45097.2  45097.2  45097.2 orderStatus
  239.0s        0          104.3          240.8  23622.3  47244.6  68719.5  85899.3 payment
  239.0s        0           14.0           24.2  23622.3  36507.2  45097.2  45097.2 stockLevel
  240.0s        0            0.0           23.4      0.0      0.0      0.0      0.0 delivery
  240.0s        0            0.0          227.0      0.0      0.0      0.0      0.0 newOrder
  240.0s        0            0.0           24.6      0.0      0.0      0.0      0.0 orderStatus
  240.0s        0            0.0          239.8      0.0      0.0      0.0      0.0 payment
  240.0s        0            0.0           24.1      0.0      0.0      0.0      0.0 stockLevel
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  241.0s        0            0.0           23.3      0.0      0.0      0.0      0.0 delivery
  241.0s        0            0.0          226.0      0.0      0.0      0.0      0.0 newOrder
  241.0s        0            0.0           24.5      0.0      0.0      0.0      0.0 orderStatus
  241.0s        0            0.0          238.8      0.0      0.0      0.0      0.0 payment
  241.0s        0            0.0           24.0      0.0      0.0      0.0      0.0 stockLevel
  242.0s        0            0.0           23.2      0.0      0.0      0.0      0.0 delivery
  242.0s        0            0.0          225.1      0.0      0.0      0.0      0.0 newOrder
  242.0s        0            0.0           24.4      0.0      0.0      0.0      0.0 orderStatus
  242.0s        0            0.0          237.8      0.0      0.0      0.0      0.0 payment
  242.0s        0            0.0           23.9      0.0      0.0      0.0      0.0 stockLevel
  243.0s        0            0.0           23.1      0.0      0.0      0.0      0.0 delivery
  243.0s        0            0.0          224.1      0.0      0.0      0.0      0.0 newOrder
  243.0s        0            0.0           24.3      0.0      0.0      0.0      0.0 orderStatus
  243.0s        0            0.0          236.9      0.0      0.0      0.0      0.0 payment
  243.0s        0            0.0           23.8      0.0      0.0      0.0      0.0 stockLevel
  244.0s        0            0.0           23.0      0.0      0.0      0.0      0.0 delivery
  244.0s        0            0.0          223.2      0.0      0.0      0.0      0.0 newOrder
  244.0s        0            0.0           24.2      0.0      0.0      0.0      0.0 orderStatus
  244.0s        0            0.0          235.9      0.0      0.0      0.0      0.0 payment
  244.0s        0            0.0           23.7      0.0      0.0      0.0      0.0 stockLevel
E210222 08:35:35.374278 1 workload/cli/run.go:471  [-] 3  error in delivery: ERROR: rpc error: code = Canceled desc = context canceled (SQLSTATE XXUUU)
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  245.0s       15            1.0           22.9  21474.8  21474.8  21474.8  21474.8 delivery
  245.0s       15            0.0          222.3      0.0      0.0      0.0      0.0 newOrder
  245.0s       15            0.0           24.1      0.0      0.0      0.0      0.0 orderStatus
  245.0s       15            2.0          234.9  21474.8  27917.3  27917.3  27917.3 payment
  245.0s       15            0.0           23.6      0.0      0.0      0.0      0.0 stockLevel
  246.0s       15            0.0           22.8      0.0      0.0      0.0      0.0 delivery
  246.0s       15            0.0          221.4      0.0      0.0      0.0      0.0 newOrder
  246.0s       15            0.0           24.0      0.0      0.0      0.0      0.0 orderStatus
  246.0s       15            0.0          234.0      0.0      0.0      0.0      0.0 payment
  246.0s       15            0.0           23.5      0.0      0.0      0.0      0.0 stockLevel
  247.0s       15            0.0           22.7      0.0      0.0      0.0      0.0 delivery
  247.0s       15            0.0          220.5      0.0      0.0      0.0      0.0 newOrder
  247.0s       15            0.0           23.9      0.0      0.0      0.0      0.0 orderStatus
  247.0s       15            0.0          233.0      0.0      0.0      0.0      0.0 payment
  247.0s       15            0.0           23.4      0.0      0.0      0.0      0.0 stockLevel
  248.0s       15            0.0           22.6      0.0      0.0      0.0      0.0 delivery
  248.0s       15            0.0          219.6      0.0      0.0      0.0      0.0 newOrder
  248.0s       15            0.0           23.8      0.0      0.0      0.0      0.0 orderStatus
  248.0s       15            0.0          232.1      0.0      0.0      0.0      0.0 payment
  248.0s       15            0.0           23.3      0.0      0.0      0.0      0.0 stockLevel
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  249.0s       15            0.0           22.5      0.0      0.0      0.0      0.0 delivery
  249.0s       15            2.0          218.8  16106.1  24696.1  24696.1  24696.1 newOrder
  249.0s       15            0.0           23.7      0.0      0.0      0.0      0.0 orderStatus
  249.0s       15            0.0          231.2      0.0      0.0      0.0      0.0 payment
  249.0s       15            0.0           23.2      0.0      0.0      0.0      0.0 stockLevel
  250.0s       15            0.0           22.4      0.0      0.0      0.0      0.0 delivery
  250.0s       15            0.0          217.9      0.0      0.0      0.0      0.0 newOrder
  250.0s       15            0.0           23.6      0.0      0.0      0.0      0.0 orderStatus
  250.0s       15            0.0          230.2      0.0      0.0      0.0      0.0 payment
  250.0s       15            0.0           23.1      0.0      0.0      0.0      0.0 stockLevel
  251.0s       15            0.0           22.3      0.0      0.0      0.0      0.0 delivery
  251.0s       15            1.0          217.0  38654.7  38654.7  38654.7  38654.7 newOrder
  251.0s       15            0.0           23.5      0.0      0.0      0.0      0.0 orderStatus
  251.0s       15            0.0          229.3      0.0      0.0      0.0      0.0 payment
  251.0s       15            0.0           23.1      0.0      0.0      0.0      0.0 stockLevel
  252.0s       15            0.0           22.3      0.0      0.0      0.0      0.0 delivery
  252.0s       15            0.0          216.2      0.0      0.0      0.0      0.0 newOrder
  252.0s       15            0.0           23.4      0.0      0.0      0.0      0.0 orderStatus
  252.0s       15            0.0          228.4      0.0      0.0      0.0      0.0 payment
  252.0s       15            0.0           23.0      0.0      0.0      0.0      0.0 stockLevel

nvanbenschoten · 2021-02-23T18:27:49Z

These are exactly the symptoms we would expect to see due to the bug fixed by #60992, and the timing lines up. So I'd optimistically sit on this and wait to see if it continues reproducing before spending time investigating.

cockroach-teamcity · 2021-02-24T08:52:06Z

(roachtest).tpccbench/nodes=3/cpu=16 failed on master@ec011620c7cf299fdbb898db692b36454defc4a2:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=3/cpu=16/run_1
	tpcc.go:917,tpcc.go:617,test_runner.go:767: monitor failure: unexpected node event: 1: dead
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2676
		  | main.runTPCCBench.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:894
		  | github.com/cockroachdb/cockroach/pkg/util/search.searchWithSearcher
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:43
		  | github.com/cockroachdb/cockroach/pkg/util/search.(*lineSearcher).Search
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:173
		  | main.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:803
		  | main.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:617
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:767
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 1: dead
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

More

Artifacts: /tpccbench/nodes=3/cpu=16
Related:

roachtest: tpccbench/nodes=3/cpu=16 failed #55802 roachtest: tpccbench/nodes=3/cpu=16 failed C-test-failure O-roachtest O-robot T-bulkio branch-release-20.1

See this test on roachdash
_{powered by pkg/cmd/internal/issues}

cockroach-teamcity · 2021-02-25T08:54:47Z

(roachtest).tpccbench/nodes=3/cpu=16 failed on master@c7e088826bc079620dfd3b5ae75d1c15cd9cd16d:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=3/cpu=16/run_1
	tpcc.go:917,tpcc.go:617,test_runner.go:767: monitor failure: unexpected node event: 2: dead
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2676
		  | main.runTPCCBench.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:894
		  | github.com/cockroachdb/cockroach/pkg/util/search.searchWithSearcher
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:43
		  | github.com/cockroachdb/cockroach/pkg/util/search.(*lineSearcher).Search
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:173
		  | main.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:803
		  | main.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:617
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:767
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 2: dead
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

More

Artifacts: /tpccbench/nodes=3/cpu=16
Related:

roachtest: tpccbench/nodes=3/cpu=16 failed #55802 roachtest: tpccbench/nodes=3/cpu=16 failed C-test-failure O-roachtest O-robot T-bulkio branch-release-20.1

See this test on roachdash
_{powered by pkg/cmd/internal/issues}

cockroach-teamcity · 2021-02-26T11:16:00Z

(roachtest).tpccbench/nodes=3/cpu=16 failed on master@6601d827b814d4e85a1081b03bf2562d8ac2a4ab:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=3/cpu=16/run_1
	cluster.go:2220,tpcc.go:807,search.go:43,search.go:173,tpcc.go:803,tpcc.go:617,test_runner.go:767: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-2721381-1614322964-25-n4cpu16:1-3 returned: exit status 1
		(1) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-2721381-1614322964-25-n4cpu16:1-3 returned
		  | stderr:
		  |
		  | stdout:
		  | <... some data truncated by circular buffer; go to artifacts for details ...>
		  | .....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
		  | 0: exit status 255: 
		  | 2: exit status 255: 
		  | I210226 11:07:00.140381 1 (gostd) cluster_synced.go:1732  [-] 1  command failed
		Wraps: (2) exit status 1
		Error types: (1) *main.withCommandDetails (2) *exec.ExitError

More

Artifacts: /tpccbench/nodes=3/cpu=16
Related:

roachtest: tpccbench/nodes=3/cpu=16 failed #55802 roachtest: tpccbench/nodes=3/cpu=16 failed C-test-failure O-roachtest O-robot T-bulkio branch-release-20.1

See this test on roachdash
_{powered by pkg/cmd/internal/issues}

cockroach-teamcity · 2021-02-27T08:57:27Z

(roachtest).tpccbench/nodes=3/cpu=16 failed on master@9595a158f0233e1c3d86786ec4462dd39c7beb20:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=3/cpu=16/run_1
	tpcc.go:917,tpcc.go:617,test_runner.go:767: monitor failure: unexpected node event: 1: dead
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2676
		  | main.runTPCCBench.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:894
		  | github.com/cockroachdb/cockroach/pkg/util/search.searchWithSearcher
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:43
		  | github.com/cockroachdb/cockroach/pkg/util/search.(*lineSearcher).Search
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:173
		  | main.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:803
		  | main.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:617
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:767
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 1: dead
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

More

Artifacts: /tpccbench/nodes=3/cpu=16
Related:

roachtest: tpccbench/nodes=3/cpu=16 failed #55802 roachtest: tpccbench/nodes=3/cpu=16 failed C-test-failure O-roachtest O-robot T-bulkio branch-release-20.1

See this test on roachdash
_{powered by pkg/cmd/internal/issues}

yuzefovich · 2021-03-10T18:15:05Z

Thanks for the find about mistakenly attaching flow stats regardless of the fact whether the execution stats are collected. I opened up a PR to fix it.

@irfansharif #61532 recently was merged which changed the code quoted here. We now should be using the span created here for stats propagation:

cockroach/pkg/sql/conn_executor.go

Lines 1467 to 1471 in c7f9851

    
           ctx, sp := tracing.EnsureChildSpan( 
        
           	ctx, ex.server.cfg.AmbientCtx.Tracer, 
        
           	// We print the type of command, not the String() which includes long 
        
           	// statements. 
        
           	cmd.command())

irfansharif · 2021-03-11T15:14:19Z

Gotcha, thanks for the heads up. I think we'll want to revert #61532. I'm assuming the original motivation was to reduce the overhead of sampling in order to not create any new spans? Well, with #61777 the statement's span will be a no-op one, so when we know we're sampling, we should go and create a new span. I think in the end we'd be doing no worse for sampled statements (relative to #61532).

yuzefovich · 2021-03-12T00:28:49Z

Yes, the only motivation to #61532 was to reduce the perf hit of sampling, so if #61777 effectively makes the stmt's span a noop, the revert of #61532 sounds good to me so that the instrumentation is responsible for ensuring that a proper span is present when sampling is needed.

This isn't something I am planning to finish, but I wanted to see if I would immediately run into any snags when trying to get `roachprod start` to spin up a Docker container instead of running bare-metal. The motivation is twofold: - I'm generally interested in how we involve roachprod over time. Specifically, the question is whether in the long run systems testing should be done in k8s and which, if any, stepping stones there are to evolve what we currently do into that direction. - The current bare-metal setup goes [unresponsive] when CRDB goes into overdrive, and tests fail in the most opaque ways. We then incur a large tax for debugging these situations since we can't access the cluster in that state. Work that would need to be done to really finish this: - maintain ubuntu images that come with docker installed (right now needs to be manually set up) and the `ubuntu` user set up and the images we need cached - we hit the old problem of having to pass the uid/gid to the container to avoid creating files as root, I hacked around it by hard-coding them - Setting the cluster settings, etc, is all done via the `./cockroach` binary but that is no longer a thing. Ideally that should use SQL - The whole premise of uploading binaries is out of the window, we need to deal in CRDB containers exclusively, though we could conceivably use a wrapper container running the uploaded binary if we wanted to retain how roachprod/test work. [unresponsive]: cockroachdb#59424 (comment) Release note: None

This drastically reduces the memory overhead for tracing we're observing in cockroachdb#59424. This commit does a few disparate things to make it happen: 1. We now access the tracing span through txnState.Ctx exclusively. This gives us a single point to hijack, which we'll later do. By default txn's are initialized with a no-op span. If later on session tracing is enabled, we'll create a real (verbose) span and swap it out with the txn's no-op one. This gives us the same semantics as earlier, and on the plus side, we're not re-using the same tracing span when session tracing is toggled. 2. Hard tracing methods to work with no-op spans. Specifically GetRecording and TraceID. 3. Remove a crash vector through crdb_internal.trace_id. It was previously reaching into the first recording to retrieve a trace ID. But it's not guaranteed that recordings are non-empty. This could be used to induce panics in the server. This PR will need to get backported to 21.1. Fixes cockroachdb#59424. Release note: None

cockroach-teamcity · 2021-03-12T19:08:18Z

(roachtest).tpccbench/nodes=3/cpu=16 failed on master@4b98115dfda02a9498f566958bd915c45ec7e449:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=3/cpu=16/run_1
	tpcc.go:917,tpcc.go:617,test_runner.go:767: monitor failure: unexpected node event: 2: dead
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2676
		  | main.runTPCCBench.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:894
		  | github.com/cockroachdb/cockroach/pkg/util/search.searchWithSearcher
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:43
		  | github.com/cockroachdb/cockroach/pkg/util/search.(*lineSearcher).Search
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:173
		  | main.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:803
		  | main.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:617
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:767
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 2: dead
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

More

Artifacts: /tpccbench/nodes=3/cpu=16
Related:

roachtest: tpccbench/nodes=3/cpu=16 failed #61696 roachtest: tpccbench/nodes=3/cpu=16 failed C-test-failure O-roachtest O-robot branch-release-21.1 release-blocker
roachtest: tpccbench/nodes=3/cpu=16 failed #55802 roachtest: tpccbench/nodes=3/cpu=16 failed C-test-failure O-roachtest O-robot T-bulkio branch-release-20.1

See this test on roachdash
_{powered by pkg/cmd/internal/issues}

This drastically reduces the memory overhead for tracing we're observing in cockroachdb#59424. This commit does a few disparate things to make it happen: 1. We now access the tracing span through txnState.Ctx exclusively. This gives us a single point to hijack, which we'll later do. By default txn's are initialized with a no-op span. If later on session tracing is enabled, we'll create a real (verbose) span and swap it out with the txn's no-op one. This gives us the same semantics as earlier, and on the plus side, we're not re-using the same tracing span when session tracing is toggled. 2. Hard tracing methods to work with no-op spans. Specifically GetRecording and TraceID. 3. Remove a crash vector through crdb_internal.trace_id. It was previously reaching into the first recording to retrieve a trace ID. But it's not guaranteed that recordings are non-empty. This could be used to induce panics in the server. This PR will need to get backported to 21.1. Fixes cockroachdb#59424. Release note: None

61777: sql: only create real spans when session tracing/sampling r=irfansharif a=irfansharif This drastically reduces the memory overhead for tracing we're observing in #59424. This commit does a few disparate things to make it happen: 1. We now access the tracing span through txnState.Ctx exclusively. This gives us a single point to hijack, which we'll later do. By default txn's are initialized with a no-op span. If later on session tracing is enabled, we'll create a real (verbose) span and swap it out with the txn's no-op one. This gives us the same semantics as earlier, and on the plus side, we're not re-using the same tracing span when session tracing is toggled. 2. Hard tracing methods to work with no-op spans. Specifically GetRecording and TraceID. 3. Remove a crash vector through crdb_internal.trace_id. It was previously reaching into the first recording to retrieve a trace ID. But it's not guaranteed that recordings are non-empty. This could be used to induce panics in the server. This PR will need to get backported to 21.1. Fixes #59424. Release note: None --- +cc @cockroachdb/kv-east. Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>

This drastically reduces the memory overhead for tracing we're observing in cockroachdb#59424. This commit does a few disparate things to make it happen: 1. We now access the tracing span through txnState.Ctx exclusively. This gives us a single point to hijack, which we'll later do. By default txn's are initialized with a no-op span. If later on session tracing is enabled, we'll create a real (verbose) span and swap it out with the txn's no-op one. This gives us the same semantics as earlier, and on the plus side, we're not re-using the same tracing span when session tracing is toggled. 2. Hard tracing methods to work with no-op spans. Specifically GetRecording and TraceID. 3. Remove a crash vector through crdb_internal.trace_id. It was previously reaching into the first recording to retrieve a trace ID. But it's not guaranteed that recordings are non-empty. This could be used to induce panics in the server. This PR will need to get backported to 21.1. Fixes cockroachdb#59424. Release note: None

--- sql: only create real spans when session tracing/sampling r=irfansharif a=irfansharif This drastically reduces the memory overhead for tracing we're observing in cockroachdb#59424. This commit does a few disparate things to make it happen: 1. We now access the tracing span through txnState.Ctx exclusively. This gives us a single point to hijack, which we'll later do. By default txn's are initialized with a no-op span. If later on session tracing is enabled, we'll create a real (verbose) span and swap it out with the txn's no-op one. This gives us the same semantics as earlier, and on the plus side, we're not re-using the same tracing span when session tracing is toggled. 2. Hard tracing methods to work with no-op spans. Specifically GetRecording and TraceID. 3. Remove a crash vector through crdb_internal.trace_id. It was previously reaching into the first recording to retrieve a trace ID. But it's not guaranteed that recordings are non-empty. This could be used to induce panics in the server. This PR will need to get backported to 21.1. Fixes cockroachdb#59424. Release note: None

We arrived at the previous default rate of 10% back in cockroachdb#59379. This was back when we were creating real tracing spans for all statements, and for sampled statements, we were propagating additional stats payloads. Consequently what cockroachdb#59379 ended up measuring (and finding the overhead acceptable) for was the performance hit we would incur for propagating stats payloads for statements already using real tracing spans. Since then, the landscape has changed. Notably we introduced cockroachdb#61777, which made it so that we were only using real tracing spans for sampled statements. This was done after performance analysis in cockroachdb#59424 showed that the use of real tracing spans in all statements resulted in tremendous overhead, for no real benefit. What this now leaves us with is a sampling rate that was tuned by only considering the stats payload overhead. What we want now is to also consider the overhead of using real tracing spans for sampled statements, vs. not. Doing this analysis gives us a very different picture for what the default rate should be. --- To find out what the overhead for sampled statements are currently, we experimented with kv95/enc=false/nodes=1/cpu=32. It's a simple benchmark that does little more than one-off statements, so should give us a concise picture of the sampling overhead. We ran six experiments in total (each corresponding to a pair of read+write rows), done in groups of three (each group corresponding to a table below). Each run in turn is comprised of 10 iterations of kv95, and what's varied between each run is the default sampling rate. We pin a sampling rate of 0.0 as the baseline that effectively switches off sampling entirely (and tracing), and measure the throughput degradation as we vary the sampling rate. ops/sec ops/sec --------------------|------------------|------------------ rate op grp | median diff | mean diff --------------------|------------------|------------------ 0.00 / read / #1 | 69817.90 | 69406.37 0.01 / read / #1 | 69300.35 -0.74% | 68717.23 -0.99% 0.10 / read / #1 | 67743.35 -2.97% | 67601.81 -2.60% 0.00 / write / #1 | 3672.55 | 3653.63 0.01 / write / #1 | 3647.65 -0.68% | 3615.90 -1.03% 0.10 / write / #1 | 3567.20 -2.87% | 3558.90 -2.59% ops/sec ops/sec --------------------|------------------|------------------ rate op grp | median diff | mean diff --------------------|------------------|------------------ 0.00 / read / #2 | 69440.80 68893.24 0.01 / read / #2 | 69481.55 +0.06% 69463.13 +0.82% (probably in the noise margin) 0.10 / read / #2 | 67841.80 -2.30% 66992.55 -2.76% 0.00 / write / #2 | 3652.45 3625.24 0.01 / write / #2 | 3657.55 -0.14% 3654.34 +0.80% 0.10 / write / #2 | 3570.75 -2.24% 3526.04 -2.74% The results above suggest that the current default rate of 10% is too high, and a 1% rate is much more acceptable. --- The fact that the cost of sampling is largely dominated by tracing is extremely unfortunate. We have ideas for how that can be improved (prototyped in cockroachdb#62227), but they're much too invasive to backport to 21.1. It's unfortunate that we only discovered the overhead this late in the development cycle. It was due to two major reasons: - cockroachdb#59992 landed late in the cycle, and enabled tracing for realsies (by propagating real tracing spans across rpc boundaries). We had done sanity checking for the tracing overhead before this point, but failed to realize that cockroachdb#59992 would merit re-analysis. - The test that alerted us to the degradation (tpccbench) had be persistently failing for a myriad of other reasons, so we didn't learn until too late that tracing was the latest offendor. tpccbench also doesn't deal with VM overload well (something cockroachdb#62361 hopes to address), and after tracing was enabled for realsies, this was the dominant failure mode. This resulted in perf data not making it's way to roachperf, which further hid possible indicators we had a major regression on our hands. We also didn't have a healthy process looking at roachperf on a continual basis, something we're looking to rectify going forward. We would've picked up on this regression had we been closely monitoring the kv95 charts. Release note: None

62998: sql: lower default sampling rate to 1% r=irfansharif a=irfansharif We arrived at the previous default rate of 10% back in #59379. This was back when we were creating real tracing spans for all statements, and for sampled statements, we were propagating additional stats payloads. Consequently what #59379 ended up measuring (and finding the overhead acceptable) for was the performance hit we would incur for propagating stats payloads for statements already using real tracing spans. Since then, the landscape has changed. Notably we introduced #61777, which made it so that we were only using real tracing spans for sampled statements. This was done after performance analysis in #59424 showed that the use of real tracing spans in all statements resulted in tremendous overhead, for no real benefit. What this now leaves us with is a sampling rate that was tuned by only considering the stats payload overhead. What we want now is to also consider the overhead of using real tracing spans for sampled statements, vs. not. Doing this analysis gives us a very different picture for what the default rate should be. --- To find out what the overhead for sampled statements are currently, we experimented with kv95/enc=false/nodes=1/cpu=32. It's a simple benchmark that does little more than one-off statements, so should give us a concise picture of the sampling overhead. We ran six experiments in total (each corresponding to a pair of read+write rows), done in groups of three (each group corresponding to a table below). Each run in turn is comprised of 10 iterations of kv95, and what's varied between each run is the default sampling rate. We pin a sampling rate of 0.0 as the baseline that effectively switches off sampling entirely (and tracing), and measure the throughput degradation as we vary the sampling rate. ops/sec ops/sec --------------------|------------------|------------------ rate op grp | median diff | mean diff --------------------|------------------|------------------ 0.00 / read / #1 | 69817.90 | 69406.37 0.01 / read / #1 | 69300.35 -0.74% | 68717.23 -0.99% 0.10 / read / #1 | 67743.35 -2.97% | 67601.81 -2.60% 0.00 / write / #1 | 3672.55 | 3653.63 0.01 / write / #1 | 3647.65 -0.68% | 3615.90 -1.03% 0.10 / write / #1 | 3567.20 -2.87% | 3558.90 -2.59% ops/sec ops/sec --------------------|------------------|------------------ rate op grp | median diff | mean diff --------------------|------------------|------------------ 0.00 / read / #2 | 69440.80 68893.24 0.01 / read / #2 | 69481.55 +0.06% 69463.13 +0.82% (probably in the noise margin) 0.10 / read / #2 | 67841.80 -2.30% 66992.55 -2.76% 0.00 / write / #2 | 3652.45 3625.24 0.01 / write / #2 | 3657.55 -0.14% 3654.34 +0.80% 0.10 / write / #2 | 3570.75 -2.24% 3526.04 -2.74% The results above suggest that the current default rate of 10% is too high, and a 1% rate is much more acceptable. --- The fact that the cost of sampling is largely dominated by tracing is extremely unfortunate. We have ideas for how that can be improved (prototyped in #62227), but they're much too invasive to backport to 21.1. Release note: None Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>

We arrived at the previous default rate of 10% back in cockroachdb#59379. This was back when we were creating real tracing spans for all statements, and for sampled statements, we were propagating additional stats payloads. Consequently what cockroachdb#59379 ended up measuring (and finding the overhead acceptable) for was the performance hit we would incur for propagating stats payloads for statements already using real tracing spans. Since then, the landscape has changed. Notably we introduced cockroachdb#61777, which made it so that we were only using real tracing spans for sampled statements. This was done after performance analysis in cockroachdb#59424 showed that the use of real tracing spans in all statements resulted in tremendous overhead, for no real benefit. What this now leaves us with is a sampling rate that was tuned by only considering the stats payload overhead. What we want now is to also consider the overhead of using real tracing spans for sampled statements, vs. not. Doing this analysis gives us a very different picture for what the default rate should be. --- To find out what the overhead for sampled statements are currently, we experimented with kv95/enc=false/nodes=1/cpu=32. It's a simple benchmark that does little more than one-off statements, so should give us a concise picture of the sampling overhead. We ran six experiments in total (each corresponding to a pair of read+write rows), done in groups of three (each group corresponding to a table below). Each run in turn is comprised of 10 iterations of kv95, and what's varied between each run is the default sampling rate. We pin a sampling rate of 0.0 as the baseline that effectively switches off sampling entirely (and tracing), and measure the throughput degradation as we vary the sampling rate. ops/sec ops/sec --------------------|------------------|------------------ rate op grp | median diff | mean diff --------------------|------------------|------------------ 0.00 / read / #1 | 69817.90 | 69406.37 0.01 / read / #1 | 69300.35 -0.74% | 68717.23 -0.99% 0.10 / read / #1 | 67743.35 -2.97% | 67601.81 -2.60% 0.00 / write / #1 | 3672.55 | 3653.63 0.01 / write / #1 | 3647.65 -0.68% | 3615.90 -1.03% 0.10 / write / #1 | 3567.20 -2.87% | 3558.90 -2.59% ops/sec ops/sec --------------------|------------------|------------------ rate op grp | median diff | mean diff --------------------|------------------|------------------ 0.00 / read / #2 | 69440.80 68893.24 0.01 / read / #2 | 69481.55 +0.06% 69463.13 +0.82% (probably in the noise margin) 0.10 / read / #2 | 67841.80 -2.30% 66992.55 -2.76% 0.00 / write / #2 | 3652.45 3625.24 0.01 / write / #2 | 3657.55 -0.14% 3654.34 +0.80% 0.10 / write / #2 | 3570.75 -2.24% 3526.04 -2.74% The results above suggest that the current default rate of 10% is too high, and a 1% rate is much more acceptable. --- The fact that the cost of sampling is largely dominated by tracing is extremely unfortunate. We have ideas for how that can be improved (prototyped in cockroachdb#62227), but they're much too invasive to backport to 21.1. It's unfortunate that we only discovered the overhead this late in the development cycle. It was due to two major reasons: - cockroachdb#59992 landed late in the cycle, and enabled tracing for realsies (by propagating real tracing spans across rpc boundaries). We had done sanity checking for the tracing overhead before this point, but failed to realize that cockroachdb#59992 would merit re-analysis. - The test that alerted us to the degradation (tpccbench) had be persistently failing for a myriad of other reasons, so we didn't learn until too late that tracing was the latest offendor. tpccbench also doesn't deal with VM overload well (something cockroachdb#62361 hopes to address), and after tracing was enabled for realsies, this was the dominant failure mode. This resulted in perf data not making it's way to roachperf, which further hid possible indicators we had a major regression on our hands. We also didn't have a healthy process looking at roachperf on a continual basis, something we're looking to rectify going forward. We would've picked up on this regression had we been closely monitoring the kv95 charts. Release note: None

aayushshah15 mentioned this issue Jan 26, 2021

tracing: revert trace.mode default to legacy #59431

Merged

irfansharif self-assigned this Jan 26, 2021

irfansharif removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Feb 1, 2021

This was referenced Feb 9, 2021

roachtest: tpccbench/nodes=3/cpu=16 failed #60094

Closed

roachtest: tpccbench/nodes=3/cpu=16 failed #55802

Closed

cockroach-teamcity mentioned this issue Feb 18, 2021

roachtest: tpccbench/nodes=3/cpu=16 failed #60747

Closed

irfansharif removed their assignment Feb 23, 2021

tbg mentioned this issue Mar 12, 2021

[dnm] roachprod: start crdb in a container #61901

Closed

irfansharif mentioned this issue Mar 12, 2021

sql: only create real spans when session tracing/sampling #61777

Merged

craig bot closed this as completed in d7407d5 Mar 13, 2021

This was referenced Mar 13, 2021

release-21.1: sql: only create real spans when session tracing/sampling #61965

Merged

roachtest: lack of crdb process isolation makes for noisy tests #62010

Closed

tracing: improve memory usage by using metadata-only child span types #62020

Closed

irfansharif mentioned this issue Mar 22, 2021

[dnm] *: cherry-pick 61777 onto 59992 #62349

Closed

irfansharif mentioned this issue Apr 1, 2021

sql: lower default sampling rate to 1% #62998

Merged

irfansharif mentioned this issue Apr 2, 2021

release-21.1: sql: lower default sampling rate to 1% #63006

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

roachtest: tpccbench/nodes=3/cpu=16 failed #59424

roachtest: tpccbench/nodes=3/cpu=16 failed #59424

cockroach-teamcity commented Jan 26, 2021

aayushshah15 commented Jan 26, 2021

irfansharif commented Feb 1, 2021

cockroach-teamcity commented Feb 15, 2021

cockroach-teamcity commented Feb 18, 2021

cockroach-teamcity commented Feb 19, 2021

nvanbenschoten commented Feb 20, 2021

cockroach-teamcity commented Feb 21, 2021

cockroach-teamcity commented Feb 22, 2021

cockroach-teamcity commented Feb 23, 2021

irfansharif commented Feb 23, 2021

nvanbenschoten commented Feb 23, 2021

cockroach-teamcity commented Feb 24, 2021

cockroach-teamcity commented Feb 25, 2021

cockroach-teamcity commented Feb 26, 2021

cockroach-teamcity commented Feb 27, 2021

yuzefovich commented Mar 10, 2021

irfansharif commented Mar 11, 2021

yuzefovich commented Mar 12, 2021

cockroach-teamcity commented Mar 12, 2021

roachtest: tpccbench/nodes=3/cpu=16 failed #59424

roachtest: tpccbench/nodes=3/cpu=16 failed #59424

Comments

cockroach-teamcity commented Jan 26, 2021

aayushshah15 commented Jan 26, 2021

irfansharif commented Feb 1, 2021

cockroach-teamcity commented Feb 15, 2021

cockroach-teamcity commented Feb 18, 2021

cockroach-teamcity commented Feb 19, 2021

nvanbenschoten commented Feb 20, 2021

cockroach-teamcity commented Feb 21, 2021

cockroach-teamcity commented Feb 22, 2021

cockroach-teamcity commented Feb 23, 2021

irfansharif commented Feb 23, 2021

nvanbenschoten commented Feb 23, 2021

cockroach-teamcity commented Feb 24, 2021

cockroach-teamcity commented Feb 25, 2021

cockroach-teamcity commented Feb 26, 2021

cockroach-teamcity commented Feb 27, 2021

yuzefovich commented Mar 10, 2021

irfansharif commented Mar 11, 2021

yuzefovich commented Mar 12, 2021

cockroach-teamcity commented Mar 12, 2021