Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: schemachange/index/tpcc/w=1000 failed #62320

Closed
cockroach-teamcity opened this issue Mar 21, 2021 · 63 comments · Fixed by #64453
Closed

roachtest: schemachange/index/tpcc/w=1000 failed #62320

cockroach-teamcity opened this issue Mar 21, 2021 · 63 comments · Fixed by #64453
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.

Comments

@cockroach-teamcity
Copy link
Member

(roachtest).schemachange/index/tpcc/w=1000 failed on master@893643b63ea0b1cfa4888c6b73b5c68a9c100c3a:

		  | Error types: (1) *exec.ExitError
		Wraps: (5) context canceled
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *main.withCommandDetails (4) *secondary.withSecondaryError (5) *errors.errorString

	cluster.go:2688,tpcc.go:162,schemachange.go:302,test_runner.go:768: monitor failure: unexpected node event: 4: dead
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2676
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2684
		  | main.runTPCC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:162
		  | main.makeIndexAddTpccTest.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/schemachange.go:302
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:768
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 4: dead
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

	cluster.go:1667,context.go:140,cluster.go:1656,test_runner.go:849: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2797327-1616306680-69-n5cpu16 --oneshot --ignore-empty-nodes: exit status 1 5: skipped
		4: dead
		2: 4354
		1: 5126
		3: 4970
		Error: UNCLASSIFIED_PROBLEM: 4: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1852
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:204
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (3) 4: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError

More

Artifacts: /schemachange/index/tpcc/w=1000
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 21, 2021
@ajwerner
Copy link
Contributor

node 4 is dead, why? I don't know.

@ajwerner
Copy link
Contributor

Feels similar to #54934.

@tbg
Copy link
Member

tbg commented Mar 29, 2021

I wonder what it would take to go "meta" and run a roachtest where CRDB intentionally leaks memory and the job of the test is to verify traces of the oom killer. Yes, desperate.

@ajwerner
Copy link
Contributor

I wonder what it would take to go "meta" and run a roachtest where CRDB intentionally leaks memory and the job of the test is to verify traces of the oom killer. Yes, desperate.

ack, will give it a shot.

@tbg
Copy link
Member

tbg commented Mar 29, 2021

I wasn't suggesting that! But it is appreciated. You may want to crib the roachprod reset command from #62361 so that you can reliably access the VMs post overload

@ajwerner
Copy link
Contributor

In a repro, I don't think the machine was hosed. The process was just gone without a trace.

@tbg
Copy link
Member

tbg commented Mar 29, 2021

That's what it looks like in the above run too, I just figure that when you do this programmatically you'll end up with the broken VM some fraction of the time. But yeah if it's a semi-manual experiment, probably doesn't help too much

ajwerner added a commit to ajwerner/cockroach that referenced this issue Mar 29, 2021
This change unearths runtime out-of-memory errors. Before this change, a
runtime-detected failure to allocate memory would not make it to the logs.

This was tested by injecting a rapid memory leak into the process and
determining first that it was not present in logs and then with this
change that the stack traces were present in the stderr log.

I'm omitting a release note primarily because I do not know how to frame one.

Relates to debugging cockroachdb#62320.

Release note: None
@ajwerner
Copy link
Contributor

I'm inclined to take this off of the schema board as there isn't much evidence that this OOM lives in schema land. Happy to continue to provide assistance.

@cockroach-teamcity
Copy link
Member Author

roachtest.schemachange/index/tpcc/w=1000 failed with artifacts on master @ 9e821aa456bfbe8b5ec3a8ef79ae3954d52cb675:

		  | Error types: (1) *exec.ExitError
		Wraps: (5) context canceled
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *main.withCommandDetails (4) *secondary.withSecondaryError (5) *errors.errorString

	cluster.go:2774,tpcc.go:162,schemachange.go:302,test_runner.go:777: monitor failure: unexpected node event: 1: dead
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2762
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2770
		  | main.runTPCC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:162
		  | main.makeIndexAddTpccTest.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/schemachange.go:302
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 1: dead
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

	cluster.go:1716,context.go:140,cluster.go:1705,test_runner.go:858: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2917180-1619205254-67-n5cpu16 --oneshot --ignore-empty-nodes: exit status 1 1: dead
		5: skipped
		3: 10663
		4: 11020
		2: 10657
		Error: UNCLASSIFIED_PROBLEM: 1: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1162
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:284
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:856
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:960
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:897
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:2057
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:204
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (3) 1: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError
Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh schemachange/index/tpcc/w=1000

Same failure on other branches

/cc @cockroachdb/sql-schema

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.schemachange/index/tpcc/w=1000 failed with artifacts on master @ 3cd29547884feb24b150b67ca2a88073798a35fd:

		  |   992.0s      116            0.0           42.8      0.0      0.0      0.0      0.0 stockLevel
		  | _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		  |   993.0s      116            0.0           42.7      0.0      0.0      0.0      0.0 delivery
		  |   993.0s      116            0.0          426.9      0.0      0.0      0.0      0.0 newOrder
		  |   993.0s      116            0.0           42.7      0.0      0.0      0.0      0.0 orderStatus
		  |   993.0s      116            0.0          427.3      0.0      0.0      0.0      0.0 payment
		  |   993.0s      116            0.0           42.7      0.0      0.0      0.0      0.0 stockLevel
		  |   994.0s      116            0.0           42.6      0.0      0.0      0.0      0.0 delivery
		  |   994.0s      116            0.0          426.4      0.0      0.0      0.0      0.0 newOrder
		  |   994.0s      116            0.0           42.7      0.0      0.0      0.0      0.0 orderStatus
		  |   994.0s      116            0.0          426.9      0.0      0.0      0.0      0.0 payment
		  |   994.0s      116            0.0           42.7      0.0      0.0      0.0      0.0 stockLevel
		Wraps: (4) secondary error attachment
		  | signal: killed
		  | (1) signal: killed
		  | Error types: (1) *exec.ExitError
		Wraps: (5) context canceled
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *main.withCommandDetails (4) *secondary.withSecondaryError (5) *errors.errorString

	cluster.go:2774,tpcc.go:162,schemachange.go:302,test_runner.go:777: monitor failure: monitor task failed: t.Fatal() was called
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2762
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2770
		  | main.runTPCC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:162
		  | main.makeIndexAddTpccTest.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/schemachange.go:302
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2818
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2732
		  | runtime.doInit
		  | 	/usr/local/go/src/runtime/proc.go:5652
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:191
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh schemachange/index/tpcc/w=1000

Same failure on other branches

/cc @cockroachdb/sql-schema

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.schemachange/index/tpcc/w=1000 failed with artifacts on master @ 2d7f9a08bac49040a6d77c5465e2b4fd8fdac1f4:

		  |   | main.main
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:2057
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:204
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		  | Wraps: (2) 2: dead
		  | Error types: (1) *withstack.withStack (2) *errutil.leafError
		Wraps: (4) secondary error attachment
		  | 3: dead
		  | (1) attached stack trace
		  |   -- stack trace:
		  |   | main.glob..func14
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1162
		  |   | main.wrap.func1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:284
		  |   | github.com/spf13/cobra.(*Command).execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:856
		  |   | github.com/spf13/cobra.(*Command).ExecuteC
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:960
		  |   | github.com/spf13/cobra.(*Command).Execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:897
		  |   | main.main
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:2057
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:204
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		  | Wraps: (2) 3: dead
		  | Error types: (1) *withstack.withStack (2) *errutil.leafError
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1162
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:284
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:856
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:960
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:897
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:2057
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:204
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (6) 4: dead
		Error types: (1) errors.Unclassified (2) *secondary.withSecondaryError (3) *secondary.withSecondaryError (4) *secondary.withSecondaryError (5) *withstack.withStack (6) *errutil.leafError
Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh schemachange/index/tpcc/w=1000

Same failure on other branches

/cc @cockroachdb/sql-schema

This test on roachdash | Improve this report!

@tbg
Copy link
Member

tbg commented Apr 26, 2021

first new failure (#62320 (comment)) seems to have happened after the create index completed:

00:04:13 schemachange.go:475: addindex: running statement 1...
00:07:37 schemachange.go:480: addindex: statement 1: "CREATE UNIQUE INDEX ON tpcc.order (o_entry_d, o_w_id, o_d_id, o_carrier_id, o_id);" took 3m24.02283682s
00:08:37 schemachange.go:475: addindex: running statement 2...
00:14:00 schemachange.go:480: addindex: statement 2: "CREATE INDEX ON tpcc.order (o_carrier_id);" took 5m23.131883449s
00:15:00 schemachange.go:475: addindex: running statement 3...
00:29:46 schemachange.go:480: addindex: statement 3: "CREATE INDEX ON tpcc.customer (c_last, c_first);" took 14m46.095225082s
00:29:46 schemachange.go:482: addindex: ran 3 statements in 26m33.250717138s

The CRDB logs say "unhappy". Nodes go in and out of liveness, raft ready handling is slow, etc.

The pebble log on n1 ends with:

I210424 00:54:35.660269 25295238 3@vendor/github.com/cockroachdb/pebble/db.go:1368 â‹® [n1,pebble] 57926 write stall beginning: memtable count limit reached

and that lines up with the end of the CRDB log, and likely the time the node died.

Here's the inuse_space:

image

@tbg
Copy link
Member

tbg commented Apr 26, 2021

second one #62320 (comment)
This has overloaded Vs, the ssh connections dropped and we weren't able to get all artifacts. There's this at the end of n3's pebble logs:

I210425 00:52:23.724033 7976094 3@storage/pebble.go:598 ⋮ [n3,pebble] 24388 disk slowness detected: write to file ‹/mnt/data1/cockroach/009967.sst› has been ongoing for 16.5s

We're not getting anything for n5, which I think is the one hardest hit & browned out.

@tbg
Copy link
Member

tbg commented Apr 26, 2021

third #62320 (comment) failed before statement 3 (create index) returned, with n4 dying. Not oomkiller, but also nothing in the logs, unfortunately, despite #63472. Note that work is ongoing to improve our infra, #64177.

Not seeing any clear offenders in inuse_space, but also not sure what's expected. From the looks of it, we're past backfilling here.

image

@tbg
Copy link
Member

tbg commented Apr 26, 2021

There is currently a regression on master, #64214, which started April 21st. The first failure in this last burst is April 24th, which lines up.

@RaduBerinde could you (or someone on Queries) get 10 passing runs of this when the regression is fixed (if they still fail, report back here, otherwise close). This is straightforward via the script in https://go.crdb.dev/roachstress.

@tbg tbg removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Apr 26, 2021
@tbg
Copy link
Member

tbg commented Apr 26, 2021

@ajwerner says:

[These profiles are] Likely past the index backfill
It could be running the validation query, though I doubt it. The reason I doubt it is that I think we weren't using vectorized in that query until like last week

@RaduBerinde
Copy link
Member

I started some runs with my fix for #64214 and I already got a "unexpected node event: 4: dead" failure.

@yuzefovich
Copy link
Member

I kicked off 10 runs of this test right before c3b1617 and including c3b1617. In all runs I observe that CPU is very high in both cases; however, the baseline RAM usage is definitely higher with that change in - before 7.5-9GB, after 9-11GB, so indeed that commit somehow makes things noticeably worse. Unfortunately heap profiles don't provide many clues (they look basically identical w/o and w/). Also, all other metrics seems about the same.

@cockroach-teamcity
Copy link
Member Author

roachtest.schemachange/index/tpcc/w=1000 failed with artifacts on master @ d84625c42d11e38775b6a1ff3f4089e839fe3e82:

		  | Error types: (1) *exec.ExitError
		Wraps: (5) context canceled
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *main.withCommandDetails (4) *secondary.withSecondaryError (5) *errors.errorString

	cluster.go:2800,tpcc.go:162,schemachange.go:302,test_runner.go:777: monitor failure: unexpected node event: 2: dead
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2788
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2796
		  | main.runTPCC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:162
		  | main.makeIndexAddTpccTest.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/schemachange.go:302
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 2: dead
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

	cluster.go:1716,context.go:140,cluster.go:1705,test_runner.go:858: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2940313-1619763594-72-n5cpu16 --oneshot --ignore-empty-nodes: exit status 1 5: skipped
		2: dead
		3: 10565
		1: 10918
		4: 10200
		Error: UNCLASSIFIED_PROBLEM: 2: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1162
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:284
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:856
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:960
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:897
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:2057
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:204
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (3) 2: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError
Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh schemachange/index/tpcc/w=1000

Same failure on other branches

/cc @cockroachdb/sql-schema

This test on roachdash | Improve this report!

@yuzefovich
Copy link
Member

I think I have a fix - got 10 successful runs with e05af1c.

@tbg
Copy link
Member

tbg commented Apr 30, 2021

Interesting. I know you said you didn't know on the commit message, but it feels like it would be really instructive if, after the fact, we could use hindsight to arrive at a way to catch this kind of problem by inspecting profiles next time.

@yuzefovich
Copy link
Member

I tried to find a microbenchmark in pkg/bench that would have a significant regression on c3b1617, but all numbers are basically the same. So my current guess is that this issue only represents itself on a larger scale longer running benchmark.

The profiles have been quite unhelpful in this investigation. One thing that was kind of like a clue was the fact NewColOperator allocated slightly more objects (over 10s interval, the footprint of that method went from 2% to say 4%), but this clue wasn't that insightful because we have already identified the culprit commit. My intuition is that possibly the copies that are made due to operating with structs by value somehow escape the profile, but I have no proof.

@cockroach-teamcity
Copy link
Member Author

roachtest.schemachange/index/tpcc/w=1000 failed with artifacts on master @ 42e2e4265087751695afc05fc5ad8ec534fe0121:

		  | Error types: (1) *exec.ExitError
		Wraps: (5) context canceled
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *main.withCommandDetails (4) *secondary.withSecondaryError (5) *errors.errorString

	cluster.go:2800,tpcc.go:162,schemachange.go:302,test_runner.go:777: monitor failure: unexpected node event: 3: dead
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2788
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2796
		  | main.runTPCC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:162
		  | main.makeIndexAddTpccTest.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/schemachange.go:302
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 3: dead
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

	cluster.go:1716,context.go:140,cluster.go:1705,test_runner.go:858: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2941631-1619809231-73-n5cpu16 --oneshot --ignore-empty-nodes: exit status 1 3: dead
		5: skipped
		4: 10500
		1: 10127
		2: 10526
		Error: UNCLASSIFIED_PROBLEM: 3: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1162
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:284
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:856
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:960
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:897
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:2057
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:204
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (3) 3: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError
Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh schemachange/index/tpcc/w=1000

Same failure on other branches

/cc @cockroachdb/sql-schema

This test on roachdash | Improve this report!

@RaduBerinde
Copy link
Member

We should try to figure out which part of the fix is actually needed and why. For example, slices of pointers shouldn't help unless we're relying on some kind of aliasing.

@RaduBerinde
Copy link
Member

RaduBerinde commented May 1, 2021

One thing that was kind of like a clue was the fact NewColOperator allocated slightly more objects

Is that change in allocations fixed by e05af1c? I would be surprised, I can't see why that commit would reduce any allocations (except one small thing, we are now reusing r.OpWithMetaInfo.StatsCollectors).

By the way, note that when we are reusing slices by slicing with [:0] (e.g. in NewColOperator.Release), the slices still hold on to the previous objects. This is because we can access them by reslicing (example: https://play.golang.org/p/8LecNromKw3). It's worth auditing the slices and see if they can hold on to any large objects. The fix would be to zero out each element of the slice before slicing to zero length.

@cockroach-teamcity
Copy link
Member Author

roachtest.schemachange/index/tpcc/w=1000 failed with artifacts on master @ 684e753c15f3fc58df79b6ea70e7b6715eae4835:

		  | Error types: (1) *exec.ExitError
		Wraps: (5) context canceled
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *main.withCommandDetails (4) *secondary.withSecondaryError (5) *errors.errorString

	cluster.go:2800,tpcc.go:162,schemachange.go:302,test_runner.go:777: monitor failure: unexpected node event: 1: dead
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2788
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2796
		  | main.runTPCC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:162
		  | main.makeIndexAddTpccTest.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/schemachange.go:302
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 1: dead
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

	cluster.go:1716,context.go:140,cluster.go:1705,test_runner.go:858: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2944550-1619896066-75-n5cpu16 --oneshot --ignore-empty-nodes: exit status 1 1: dead
		5: skipped
		2: 10294
		4: 10790
		3: 10279
		Error: UNCLASSIFIED_PROBLEM: 1: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1162
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:284
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:856
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:960
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:897
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:2057
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:204
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (3) 1: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError
Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh schemachange/index/tpcc/w=1000

Same failure on other branches

/cc @cockroachdb/sql-schema

This test on roachdash | Improve this report!

@yuzefovich
Copy link
Member

yuzefovich commented May 2, 2021

@RaduBerinde yeah, I think you're right that

By the way, note that when we are reusing slices by slicing with [:0] (e.g. in NewColOperator.Release), the slices still hold on to the previous objects. This is because we can access them by reslicing (example: https://play.golang.org/p/8LecNromKw3). It's worth auditing the slices and see if they can hold on to any large objects. The fix would be to zero out each element of the slice before slicing to zero length.

is the problem. The following diff fixes the memory leak:

diff --git a/pkg/sql/colexec/colexecargs/op_creation.go b/pkg/sql/colexec/colexecargs/op_creation.go
index 3a9341e9e6..a6c66a8bf9 100644
--- a/pkg/sql/colexec/colexecargs/op_creation.go
+++ b/pkg/sql/colexec/colexecargs/op_creation.go
@@ -147,10 +147,20 @@ func (r *NewColOperatorResult) Release() {
        for _, releasable := range r.Releasables {
                releasable.Release()
        }
+       for i := range r.StatsCollectors {
+               r.StatsCollectors[i] = nil
+       }
+       for i := range r.MetadataSources {
+               r.MetadataSources[i] = nil
+       }
+       for i := range r.ToClose {
+               r.ToClose[i] = nil
+       }
        *r = NewColOperatorResult{
                OpWithMetaInfo: OpWithMetaInfo{
-                       MetadataSources: r.OpWithMetaInfo.MetadataSources[:0],
-                       ToClose:         r.OpWithMetaInfo.ToClose[:0],
+                       StatsCollectors: r.StatsCollectors[:0],
+                       MetadataSources: r.MetadataSources[:0],
+                       ToClose:         r.ToClose[:0],
                },
                ColumnTypes: r.ColumnTypes[:0],
                OpMonitors:  r.OpMonitors[:0],

I am aware that slicing to :0 keeps the references to the old objects within the slice as "live", so they won't get GCed, but my assumption was that NewColOperatorResult objects are sync.Pooled, and the old references would get overwritten the next time r is picked up from the pool. Not sure why that assumption doesn't hold or causes the memory leaks though.

@RaduBerinde
Copy link
Member

the old references would get overwritten the next time r is picked up from the pool

Yes, but only up to how many entries will be needed in the new instance. It's possible that occasionally the slices get way longer than normal.

@tbg
Copy link
Member

tbg commented May 3, 2021

So can we revisit why we didn't see this in profiles? If I am reading the analysis correctly we see a difference in memory use on the order of GB. This certainly needs to show up in heap profiles like inuse_space? This kind of problem is bound to occur again and we need to be able to use the go tooling to hunt it down proactively.

@yuzefovich
Copy link
Member

Hm, I thought that the diff above fixes the issue (my "reproduction" is: running tpcc over a couple of minutes, restarting the cluster with different binaries, and looking at RAM usage; and the fix above was sufficient to bring the usage down to what is expected), but tpccbench/nodes=3/cpu=16 (#64093) still fails. However, with the commit that switches to pointers to struct I only had successful runs. Will look into this more.

@yuzefovich
Copy link
Member

Hm, tpccbench/nodes=3/cpu=16 fails with the pointer change too. I definitely saw it succeed when working on older SHAs, without rebasing on top of master, but let's ignore that issue for now.

I think that the diff above (or 0100ba7) fixes the OOM scenario of this issue with schemachange roachtest.

My reproduction was running TPCC1000 for 5 minutes with different binaries and observing the reported RAM usage. For example, here I alternated "before" (without 0100ba7) and "after" (with 0100ba7) twice:

Screen Shot 2021-05-03 at 11 30 35 AM

This clearly shows an increased RAM usage in the "before" case.

I also collected the heap profiles with heap?seconds=120&si=inuse_space options from node 1 profiles.zip.

Looking at the inuse_space there does exist noticeable difference:

  • before:
Showing nodes accounting for 269.18MB, 24.66% of 1091.70MB total
Dropped 990 nodes (cum <= 5.46MB)
Showing top 10 nodes out of 454
      flat  flat%   sum%        cum   cum%
   89.80MB  8.23%  8.23%   112.32MB 10.29%  github.com/cockroachdb/cockroach/pkg/sql/colconv.getNewVecToDatumConverter
      31MB  2.84% 11.07%       31MB  2.84%  github.com/cockroachdb/cockroach/pkg/sql/sem/tree.NewNumVal
   22.52MB  2.06% 13.13%    22.52MB  2.06%  github.com/cockroachdb/cockroach/pkg/sql/colconv.glob..func1
   21.51MB  1.97% 15.10%    21.51MB  1.97%  github.com/cockroachdb/cockroach/pkg/col/coldataext.newDatumVec
  • after
Showing nodes accounting for 65.91MB, 13.99% of 470.98MB total
Dropped 546 nodes (cum <= 2.35MB)
Showing top 10 nodes out of 569
      flat  flat%   sum%        cum   cum%
   19.67MB  4.18%  4.18%    35.18MB  7.47%  github.com/cockroachdb/pebble.finishInitializingIter
   15.51MB  3.29%  7.47%    15.51MB  3.29%  github.com/cockroachdb/pebble.finishInitializingIter.func1
    7.01MB  1.49%  8.96%     7.01MB  1.49%  github.com/cockroachdb/pebble/sstable.glob..func2
    6.50MB  1.38% 10.34%     6.50MB  1.38%  internal/profile.(*Profile).postDecode
    5.51MB  1.17% 11.51%     7.02MB  1.49%  github.com/cockroachdb/cockroach/pkg/sql/colconv.getNewVecToDatumConverter

In particular, github.com/cockroachdb/cockroach/pkg/sql/colconv.getNewVecToDatumConverter accounts for extra 100MB. That method is called when creating a materializer and the object is picked up from the sync.Pool. It is released back to the pool in Materializer.Release; the materializer is also picked up from another pool, so I'd expect that the materializers aren't released properly too if the converters aren't.

Additionally, I'm confused by the following on the "before" profile:

(pprof) list NewAllVecToDatum
Total: 1.07GB
ROUTINE ======================== github.com/cockroachdb/cockroach/pkg/sql/colconv.NewAllVecToDatumConverter in /go/src/github.com/cockroachdb/cockroach/pkg/sql/colconv/vec_to_datum.eg.go
         0     3.50MB (flat, cum)  0.32% of Total
         .          .     77:}
         .          .     78:
         .          .     79:// NewAllVecToDatumConverter is like NewVecToDatumConverter except all of the
         .          .     80:// vectors in the batch will be converted.
         .          .     81:func NewAllVecToDatumConverter(batchWidth int) *VecToDatumConverter {
         .     3.50MB     82:	c := getNewVecToDatumConverter(batchWidth)
         .          .     83:	if cap(c.vecIdxsToConvert) < batchWidth {
         .          .     84:		c.vecIdxsToConvert = make([]int, batchWidth)
         .          .     85:	} else {
         .          .     86:		c.vecIdxsToConvert = c.vecIdxsToConvert[:batchWidth]
         .          .     87:	}
(pprof) list getNewVec
Total: 1.07GB
ROUTINE ======================== github.com/cockroachdb/cockroach/pkg/sql/colconv.getNewVecToDatumConverter in /go/src/github.com/cockroachdb/cockroach/pkg/sql/colconv/vec_to_datum.eg.go
   89.80MB   112.32MB (flat, cum) 10.29% of Total
         .          .     56:		return &VecToDatumConverter{}
         .          .     57:	},
         .          .     58:}
         .          .     59:
         .          .     60:func getNewVecToDatumConverter(batchWidth int) *VecToDatumConverter {
         .    22.52MB     61:	c := vecToDatumConverterPool.Get().(*VecToDatumConverter)
         .          .     62:	if cap(c.convertedVecs) < batchWidth {
   89.80MB    89.80MB     63:		c.convertedVecs = make([]tree.Datums, batchWidth)
         .          .     64:	} else {
         .          .     65:		c.convertedVecs = c.convertedVecs[:batchWidth]
         .          .     66:	}
         .          .     67:	return c
         .          .     68:}

Why does the top part report 3.5MB while the bottom part over 100MB? They refer to the same code, so I'm quite confused the profiles.

Should I be taking profiles with some other parameters? Am I missing something?

@RaduBerinde
Copy link
Member

Note that #64177 added a 95% hard memory limit to cockroach in roachprod. Maybe the test was just squeezing by before.

Why does the top part report 3.5MB while the bottom part over 100MB? They refer to the same code, so I'm quite confused the profiles.

That would be explained if getNewVecToDatumConverter is also called in other places, and most of the memory usage comes from those. The 3.50MB only covers the memory allocated by getNewVecToDatumConverter when called from that specific place.

@yuzefovich
Copy link
Member

yuzefovich commented May 3, 2021

Note that #64177 added a 95% hard memory limit to cockroach in roachprod. Maybe the test was just squeezing by before.

Oh yeah, that must be it. Then I think we're good.

That would be explained if getNewVecToDatumConverter is also called in other places, and most of the memory usage comes from those. The 3.50MB only covers the memory allocated by getNewVecToDatumConverter when called from that specific place.

Indeed, silly me. I was too focused on a particular code path.

I guess looking at inuse_space and focusing on NewColOperator we can see the leak:

  • before:
    Screen Shot 2021-05-03 at 12 11 28 PM

  • after:
    Screen Shot 2021-05-03 at 12 11 51 PM

On the former, we see only "positive" changes over the 2 minute period (IIUC), whereas on the latter we see incremental changes.

@yuzefovich
Copy link
Member

yuzefovich commented May 3, 2021

Hm, I am still very confused by this - those 100MB of unreleased converters come from NewBuiltinFunctionOperator and NewTupleProjOp; however, those operators aren't released at all either, so it's possible that 2 minute long heap profiles aren't comparable to each other at all although there the workload running against the cluster is the same.

@yuzefovich
Copy link
Member

yuzefovich commented May 3, 2021

Collected another profile (over 30 seconds) when the cockroach is running with GODEBUG=memprofilerate=1024. This option indicates that one allocation should be sampled for each 1024 of bytes of allocations (by default this rate is 1 per 512KiB; I also tried using memprofilerate=1 to sample every allocation, but then fetching the profile timed out :/ ).

Looking at inuse_space, getNewVecToDatumConverter method again shows up noticeably:
Screen Shot 2021-05-03 at 1 58 15 PM


Screen Shot 2021-05-03 at 2 00 22 PM


Responding to my previous comment

Hm, I am still very confused by this - those 100MB of unreleased converters come from NewBuiltinFunctionOperator and NewTupleProjOp; however, those operators aren't released at all either

My understanding is that the fact whether the components are released (as in execinfra.Releasable interface) doesn't matter for the purposes of inuse_space heap profile - what matters is that the components are considered "live" from the perspective of Go GC. The increase in "inuse_space" by getNewVecToDatumConverter indicates a possible memory leak, so it should be investigated. Comparing two heap profiles taken over the period of the same length on "before" and "after" could be used roughly but shouldn't be treated in the way profiles from the microbenchmarks can be compared.

@RaduBerinde
Copy link
Member

Maybe it's a similar problem - VecToDatumConverter.Release() should clear out the slices?

@yuzefovich
Copy link
Member

yuzefovich commented May 4, 2021

Yeah, quite possibly, I added another commit auditing our Release behavior.

BTW, I meant my previous comment

The increase in "inuse_space" by getNewVecToDatumConverter indicates a possible memory leak, so it should be investigated. Comparing two heap profiles taken over the period of the same length on "before" and "after" could be used roughly but shouldn't be treated in the way profiles from the microbenchmarks can be compared.

as a rough summary of my understanding of how we could have investigated the memory leak if we didn't know which commit broke things or what is the fix.

@cockroach-teamcity
Copy link
Member Author

roachtest.schemachange/index/tpcc/w=1000 failed with artifacts on master @ dbf711a83969210bcf706d65f53f888856da6ad0:

		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) output in run_012822.982_n5_cockroach_workload_check_tpcc
		Wraps: (3) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2945603-1620068645-72-n5cpu16:5 -- ./cockroach workload check tpcc --warehouses=1000 {pgurl:1} returned
		  | stderr:
		  | I210504 01:28:25.906717 1 workload/tpcc/tpcc.go:395  [-] 1  check 3.3.2.1 took 201.178305ms
		  | I210504 01:28:44.045525 1 workload/tpcc/tpcc.go:395  [-] 2  check 3.3.2.2 took 18.138692815s
		  | Error: check failed: 3.3.2.2: pq: rpc error: code = Canceled desc = context canceled
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 5. Command with error:
		  |   | ``````
		  |   | ./cockroach workload check tpcc --warehouses=1000 {pgurl:1}
		  |   | ``````
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (4) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *main.withCommandDetails (4) *exec.ExitError

	cluster.go:1716,context.go:140,cluster.go:1705,test_runner.go:858: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2945603-1620068645-72-n5cpu16 --oneshot --ignore-empty-nodes: exit status 1 5: skipped
		1: 10693
		4: dead
		3: 9543
		2: 9636
		Error: UNCLASSIFIED_PROBLEM: 4: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1162
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:284
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:856
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:960
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:897
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:2057
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:204
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (3) 4: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError
Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh schemachange/index/tpcc/w=1000

Same failure on other branches

/cc @cockroachdb/sql-schema

This test on roachdash | Improve this report!

@yuzefovich
Copy link
Member

The last failure is interesting - n4 crashed with OOM during the check queries. I don't think I've seen a case like this before, but unfortunately the debug.zip collection from n4 wasn't successful. Anyway, we should fix known problems, and likely the OOMs will go away.

@cockroach-teamcity
Copy link
Member Author

roachtest.schemachange/index/tpcc/w=1000 failed with artifacts on master @ f9dca40d3afb0a605eeee76bec4a07ff65d6fec0:

		  | Error types: (1) *exec.ExitError
		Wraps: (5) context canceled
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *main.withCommandDetails (4) *secondary.withSecondaryError (5) *errors.errorString

	cluster.go:2800,tpcc.go:162,schemachange.go:302,test_runner.go:777: monitor failure: unexpected node event: 1: dead
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2788
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2796
		  | main.runTPCC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:162
		  | main.makeIndexAddTpccTest.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/schemachange.go:302
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 1: dead
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

	cluster.go:1716,context.go:140,cluster.go:1705,test_runner.go:858: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2949834-1620155081-67-n5cpu16 --oneshot --ignore-empty-nodes: exit status 1 1: dead
		5: skipped
		4: 10604
		2: 10427
		3: 10585
		Error: UNCLASSIFIED_PROBLEM: 1: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1162
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:284
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:856
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:960
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:897
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:2057
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:204
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (3) 1: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError
Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh schemachange/index/tpcc/w=1000

Same failure on other branches

/cc @cockroachdb/sql-schema

This test on roachdash | Improve this report!

@craig craig bot closed this as completed in #64453 May 5, 2021
@craig craig bot closed this as completed in ecf484d May 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

6 participants