roachtest: schemachange/index/tpcc/w=1000 failed #62320
Node 4 is dead. Why? I don't know.
Feels similar to #54934.
I wonder what it would take to go "meta" and run a roachtest where CRDB intentionally leaks memory and the job of the test is to verify traces of the OOM killer. Yes, desperate.
Ack, will give it a shot.
I wasn't suggesting that! But it is appreciated. You may want to crib the
In a repro, I don't think the machine was hosed. The process was just gone without a trace.
That's what it looks like in the above run too; I just figure that when you do this programmatically you'll end up with a broken VM some fraction of the time. But yeah, if it's a semi-manual experiment, it probably doesn't help too much.
This change unearths runtime out-of-memory errors. Before this change, a runtime-detected failure to allocate memory would not make it to the logs. This was tested by injecting a rapid memory leak into the process and verifying first that the failure was not present in the logs, and then, with this change, that the stack traces were present in the stderr log. I'm omitting a release note primarily because I do not know how to frame one. Relates to debugging cockroachdb#62320. Release note: None
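For concreteness, a leak injection of the sort described in the commit message could look roughly like the sketch below. This is a hypothetical illustration (the function name leakRapidly and the allocation sizes are made up for this example), not the code that was actually used:

```go
package main

import "time"

// leakRapidly grows the heap without bound so that the Go runtime's
// allocator eventually fails (or the kernel OOM killer intervenes),
// which is the failure mode the logging change is meant to surface.
func leakRapidly() {
	var sink [][]byte
	for {
		// Retain ~64 MiB per iteration so the process blows up quickly.
		sink = append(sink, make([]byte, 64<<20))
		time.Sleep(10 * time.Millisecond)
	}
}

func main() {
	go leakRapidly()
	select {} // block until the leak kills the process
}
```

Whether the Go runtime prints its own out-of-memory stack traces to stderr or the kernel OOM killer terminates the process first depends on the limits in play, which is exactly why capturing stderr in the logs matters here.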
I'm inclined to take this off the schema board, as there isn't much evidence that this OOM lives in schema land. Happy to continue to provide assistance.
roachtest.schemachange/index/tpcc/w=1000 failed with artifacts on master @ 9e821aa456bfbe8b5ec3a8ef79ae3954d52cb675:

Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh schemachange/index/tpcc/w=1000

Same failure on other branches
roachtest.schemachange/index/tpcc/w=1000 failed with artifacts on master @ 3cd29547884feb24b150b67ca2a88073798a35fd:

Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh schemachange/index/tpcc/w=1000

Same failure on other branches
roachtest.schemachange/index/tpcc/w=1000 failed with artifacts on master @ 2d7f9a08bac49040a6d77c5465e2b4fd8fdac1f4:

Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh schemachange/index/tpcc/w=1000

Same failure on other branches
The first new failure (#62320 (comment)) seems to have happened after the CREATE INDEX completed:
The CRDB logs say "unhappy": nodes go in and out of liveness, Raft ready handling is slow, etc. The Pebble log on n1 ends with:
and that lines up with the end of the CRDB log, and likely with the time the node died. Here's the inuse_space:
Second one: #62320 (comment).
We're not getting anything for n5, which I think is the one hit hardest and browned out.
The third one, #62320 (comment), failed before statement 3 (CREATE INDEX) returned, with n4 dying. Not the OOM killer, but also nothing in the logs, unfortunately, despite #63472. Note that work is ongoing to improve our infra (#64177). I'm not seeing any clear offenders in inuse_space, but I'm also not sure what's expected. From the looks of it, we're past backfilling here.
There is currently a regression on master, #64214, which started April 21st. The first failure in this last burst is April 24th, which lines up. @RaduBerinde, could you (or someone on Queries) get 10 passing runs of this once the regression is fixed? If they still fail, report back here; otherwise, close. This is straightforward via the script in https://go.crdb.dev/roachstress.
@ajwerner says:
I started some runs with my fix for #64214 and I already got an "unexpected node event: 4: dead" failure.
I kicked off 10 runs of this test at the commit right before c3b1617 and 10 runs including c3b1617. CPU is very high in both cases; however, the baseline RAM usage is definitely higher with that change in (before: 7.5-9GB, after: 9-11GB), so that commit does somehow make things noticeably worse. Unfortunately, heap profiles don't provide many clues (they look basically identical with and without it). All other metrics also seem about the same.
roachtest.schemachange/index/tpcc/w=1000 failed with artifacts on master @ d84625c42d11e38775b6a1ff3f4089e839fe3e82:

Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh schemachange/index/tpcc/w=1000

Same failure on other branches
I think I have a fix: I got 10 successful runs with e05af1c.
Interesting. I know you said in the commit message that you didn't know, but it feels like it would be really instructive if, after the fact, we could use hindsight to arrive at a way to catch this kind of problem by inspecting profiles next time.
I tried to find a microbenchmark in
The profiles have been quite unhelpful in this investigation. One thing that was kind of a clue was the fact
roachtest.schemachange/index/tpcc/w=1000 failed with artifacts on master @ 42e2e4265087751695afc05fc5ad8ec534fe0121:

Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh schemachange/index/tpcc/w=1000

Same failure on other branches
We should try to figure out which part of the fix is actually needed and why. For example, slices of pointers shouldn't help unless we're relying on some kind of aliasing.
Is that change in allocations fixed by e05af1c? I would be surprised; I can't see why that commit would reduce any allocations (except one small thing: we are now reusing
By the way, note that when we are reusing slices by slicing with
roachtest.schemachange/index/tpcc/w=1000 failed with artifacts on master @ 684e753c15f3fc58df79b6ea70e7b6715eae4835:

Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh schemachange/index/tpcc/w=1000

Same failure on other branches
@RaduBerinde yeah, I think you're right that
is the problem. The following diff fixes the memory leak:
I am aware that slicing to
Yes, but only up to however many entries will be needed in the new instance. It's possible that occasionally the slices get way longer than normal.
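For readers following along, the slice-reuse pitfall under discussion is a standard Go one: re-slicing a buffer back to length zero keeps its backing array, and any pointers still stored in it, reachable. Below is a minimal, hypothetical illustration of the pattern and of a common mitigation (clearing entries before reuse); the type row and the variable buf are invented for this sketch and are not the actual CockroachDB code:

```go
package main

// row stands in for whatever element type the reused buffer holds.
type row struct {
	payload []byte
}

// buf is reused across batches.
var buf []*row

// resetLeaky re-slices to length zero. The backing array and every *row
// it still references stay reachable, so one unusually large batch pins
// its memory for the lifetime of the buffer.
func resetLeaky() {
	buf = buf[:0]
}

// resetClearing nils out stale entries first, letting the garbage
// collector reclaim the rows even though the backing array is kept.
func resetClearing() {
	for i := range buf {
		buf[i] = nil
	}
	buf = buf[:0]
}

func main() {
	_ = resetLeaky
	_ = resetClearing
}
```

A variant of the same fix is to drop the backing array entirely and allocate a fresh slice whenever its capacity has grown far beyond what a typical batch needs, which addresses the "occasionally way longer than normal" case mentioned above.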
So can we revisit why we didn't see this in profiles? If I am reading the analysis correctly, we see a difference in memory use on the order of gigabytes. That should certainly show up in heap profiles such as inuse_space, shouldn't it? This kind of problem is bound to occur again, and we need to be able to use the Go tooling to hunt it down proactively.
Hm, I thought that the diff above fixes the issue (my "reproduction" is running TPCC over a couple of minutes, restarting the cluster with different binaries, and looking at RAM usage; the fix above was sufficient to bring the usage down to what is expected), but
Hm, I think that the diff above (or 0100ba7) fixes the OOM scenario of this issue with
My reproduction was running TPCC1000 for 5 minutes with different binaries and observing the reported RAM usage. For example, here I alternated "before" (without 0100ba7) and "after" (with 0100ba7) twice: this clearly shows an increased RAM usage in the "before" case. I also collected the heap profiles with
Looking at the
In particular,
Additionally, I'm confused by the following on the "before" profile:
Why does the top part report 3.5MB while the bottom part reports over 100MB? They refer to the same code, so I'm quite confused by the profiles. Should I be taking profiles with some other parameters? Am I missing something?
Note that #64177 added a 95% hard memory limit to cockroach in roachprod. Maybe the test was just squeezing by before.
That would be explained if
Oh yeah, that must be it. Then I think we're good.
Indeed, silly me. I was too focused on a particular code path. I guess looking at
On the former, we see only "positive" changes over the 2-minute period (IIUC), whereas on the latter we see incremental changes.
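For future investigations of this kind, one workable recipe (a sketch under the assumption that profiles are snapshotted via the standard Go tooling; the helper writeHeapProfile and the file names are made up here) is to capture inuse_space before and after a load window and diff the two:

```go
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

// writeHeapProfile forces a GC so that inuse_space reflects only live
// memory, then writes a heap profile to the given path.
func writeHeapProfile(path string) {
	f, err := os.Create(path)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	runtime.GC()
	if err := pprof.WriteHeapProfile(f); err != nil {
		log.Fatal(err)
	}
}

func main() {
	writeHeapProfile("heap_before.pb.gz")
	time.Sleep(2 * time.Minute) // run the workload in between
	writeHeapProfile("heap_after.pb.gz")
}
```

The two snapshots can then be compared with go tool pprof -sample_index=inuse_space -diff_base=heap_before.pb.gz heap_after.pb.gz, which highlights growth in live memory rather than cumulative allocations; alloc_space only ever increases over the window, which matches the "only positive changes" observation above.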
Hm, I am still very confused by this: those 100MB of unreleased converters come from
Maybe it's a similar problem -
Yeah, quite possibly; I added another commit auditing our
BTW, I meant my previous comment
as a rough summary of my understanding of how we could have investigated the memory leak if we didn't know which commit broke things or what the fix is.
roachtest.schemachange/index/tpcc/w=1000 failed with artifacts on master @ dbf711a83969210bcf706d65f53f888856da6ad0:

Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh schemachange/index/tpcc/w=1000

Same failure on other branches
The last failure is interesting: n4 crashed with OOM during the check queries. I don't think I've seen a case like this before, but unfortunately the debug.zip collection from n4 wasn't successful. Anyway, we should fix the known problems, and likely the OOMs will go away.
roachtest.schemachange/index/tpcc/w=1000 failed with artifacts on master @ f9dca40d3afb0a605eeee76bec4a07ff65d6fec0:

Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh schemachange/index/tpcc/w=1000

Same failure on other branches
(roachtest).schemachange/index/tpcc/w=1000 failed on master@893643b63ea0b1cfa4888c6b73b5c68a9c100c3a:
Artifacts: /schemachange/index/tpcc/w=1000