
roachtest: kv/contention/nodes=4 failed #36089

Closed
cockroach-teamcity opened this issue Mar 24, 2019 · 56 comments · Fixed by #38782
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. S-3 Medium-low impact: incurs increased costs for some users (incl lower avail, recoverable bad data)
Milestone

Comments

@cockroach-teamcity
Member

SHA: https://github.com/cockroachdb/cockroach/commits/c59f5347d5424edb90575fb0fd50bad677953752

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1195732&tab=buildLog

The test failed on release-19.1:
	ts_util.go:89,kv.go:186,cluster.go:1605,errgroup.go:57: spent 5.263158% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
	cluster.go:1626,kv.go:189,test.go:1214: Goexit() was called

@cockroach-teamcity cockroach-teamcity added this to the 19.1 milestone Mar 24, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Mar 24, 2019
@nvanbenschoten nvanbenschoten self-assigned this Mar 25, 2019
@nvanbenschoten
Member

spent 5.263158% of time below target of ... 5.000000%. We're close enough to the target that this is probably just a low-probability flake. I'll leave this as is and see whether it fails again over the next few days. If not, I'll bump the target threshold slightly, as it was chosen arbitrarily.
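For reference, the passing criterion amounts to a check like the following. This is a toy reconstruction of the idea, not the actual ts_util.go verification code, and it assumes the time-series has already been reduced to per-interval transaction rates:

// Fail the run if the fraction of sampled intervals whose rate falls below
// minRate exceeds maxFraction (5% here). Illustrative sketch only.
package main

import "fmt"

func fractionBelowTarget(ratesPerSec []float64, minRate float64) float64 {
	if len(ratesPerSec) == 0 {
		return 0
	}
	below := 0
	for _, r := range ratesPerSec {
		if r < minRate {
			below++
		}
	}
	return float64(below) / float64(len(ratesPerSec))
}

func main() {
	// 19 sampled intervals, one of which dipped below 10 txn/s: 1/19 = 5.263158%.
	rates := []float64{12, 15, 3, 11, 14, 13, 12, 16, 14, 12, 11, 13, 15, 12, 14, 11, 13, 12, 14}
	frac := fractionBelowTarget(rates, 10)
	const maxFraction = 0.05
	fmt.Printf("spent %f%% of time below target of %f txn/s, wanted no more than %f%%\n",
		frac*100, 10.0, maxFraction*100)
	if frac > maxFraction {
		fmt.Println("FAIL")
	}
}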

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/a1e6e9decc9dec15a32bbb6d30efc67ca45a532a

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1204585&tab=buildLog

The test failed on release-19.1:
	ts_util.go:89,kv.go:186,cluster.go:1605,errgroup.go:57: spent 5.263158% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
	cluster.go:1626,kv.go:189,test.go:1216: Goexit() was called

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 16, 2019
Fixes cockroachdb#36089.

We've seen that this test can occasionally dip above 5% of time below
the minimum QPS. This doesn't seem to be indicative of a full QPS stall
and this passing criteria was somewhat arbitrary, so relax it.

Release note: None
@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/83de585d331b05a4aa02a65b353bed6bf829b696

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1247383&tab=buildLog

The test failed on master:
	ts_util.go:89,kv.go:186,cluster.go:1667,errgroup.go:57: spent 14.035088% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
	cluster.go:1688,kv.go:189,test.go:1237: Goexit() was called

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/ec4728ae986b46d4f57009233b86971198b275ed

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1255121&tab=buildLog

The test failed on master:
	ts_util.go:89,kv.go:186,cluster.go:1667,errgroup.go:57: spent 7.017544% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
	cluster.go:1688,kv.go:189,test.go:1237: Goexit() was called

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/99306ec3e9fcbba01c05431cbf496e8b5b8954b4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1260033&tab=buildLog

The test failed on master:
	ts_util.go:89,kv.go:186,cluster.go:1761,errgroup.go:57: spent 7.017544% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
	cluster.go:1782,kv.go:189,test.go:1245: Goexit() was called

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/dff4132a80e62c6c5ad603ff6c608b09419d4e3e

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1264632&tab=buildLog

The test failed on branch=master, cloud=gce:
	ts_util.go:89,kv.go:186,cluster.go:1814,errgroup.go:57: spent 10.344828% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
	cluster.go:1835,kv.go:189,test.go:1253: Goexit() was called

@nvanbenschoten
Member

The increased flakiness looks somewhat correlated with #36748. I wonder if there are cases where we were relying on liveness expiration to get transactions unstuck. I'm going to run this with a huge txn liveness expiration and see if things get even worse.

@nvanbenschoten
Member

That did the trick. I can clearly see transactions waiting on other transactions for 30+ seconds after bumping TxnLivenessHeartbeatMultiplier to 30. I think it's likely that a transaction is failing to roll back its transaction record.

@nvanbenschoten
Member

I see the transactions that are failing to roll back their transaction records hitting errors like
TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_ABORTED_RECORD_FOUND): "sql txn" id=c4215f04 key=/Table/53/1/-7656810924711545548/0 rw=true pri=0.01158742 stat=PENDING epo=0 ts=1556562081.872413329,1 orig=1556562051.258792452,0 max=1556562051.260084221,0 wto=true seq=13

@nvanbenschoten
Member

So actually that doesn't say very much.

@nvanbenschoten
Member

On the /debug/requests page I saw pushers getting stuck in the contentionQueue and repeatedly hitting TransactionAbortedError(ABORT_REASON_PUSHER_ABORTED) errors every 100ms (after waiting for cycle detection to be ready). This makes sense because contentionQueue.add doesn't return an error, so transactions can fall into loops where they hit an intent, try to push, fail to push due to an error, and repeat. I added a few lines to plumb the error back up through the IntentResolver and things immediately got a lot better. I no longer see any long stalls in the workload even with the 30s TxnLivenessThreshold.

I'm going to make the following changes:

  1. propagate errors from the contentionQueue instead of swallowing them (see the sketch after this list)
  2. tie TxnLivenessHeartbeatMultiplier to an environment variable and set it to 60s in this roachtest. We should never need to wait on a transaction's liveness to expire when all nodes are live, so the low timeout was only serving to hide bugs like this
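Here is a minimal stand-alone sketch of point 1, contrasting a queue whose add swallows push errors with one that returns them to the caller. The names (tryPush, addSwallowingErrors, addPropagatingErrors) are illustrative placeholders, not CockroachDB's actual contentionQueue API:

// Illustrative only: models the failure mode described above, where a pusher
// that has itself been aborted keeps retrying because the error never
// reaches the transaction coordinator.
package main

import (
	"errors"
	"fmt"
	"time"
)

var errPusherAborted = errors.New("TransactionAbortedError(ABORT_REASON_PUSHER_ABORTED)")

// tryPush simulates a push attempt that keeps failing because the pusher
// itself has already been aborted.
func tryPush() error { return errPusherAborted }

// addSwallowingErrors mimics a queue whose add() has no error return: the
// caller never learns that its pushes are doomed, so it retries indefinitely
// (bounded to a few iterations here so the example terminates).
func addSwallowingErrors() {
	for i := 0; i < 3; i++ {
		if err := tryPush(); err != nil {
			time.Sleep(100 * time.Millisecond) // retry after the usual backoff
			continue
		}
		return
	}
}

// addPropagatingErrors returns the push error so the caller can surface it to
// the transaction coordinator instead of looping.
func addPropagatingErrors() error {
	return tryPush()
}

func main() {
	addSwallowingErrors()
	fmt.Println("swallowed: caller never saw", errPusherAborted)
	fmt.Println("propagated:", addPropagatingErrors())
}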

@nvanbenschoten
Member

Another option is to have a cluster setting/env var that disables transaction expiration entirely so that a failure to roll a transaction back causes everyone else to get stuck. The benefit of this is that it would be very loud.

A compromise would be to have a knob that allows us to set TxnLivenessThreshold to some high value and another to instruct PushTxn to fatal if we ever stumble upon an expired PENDING transaction. The nice part about this is that it would be equally as loud in roachtests as the first option but wouldn't cause the tests to hang forever. @andreimatei any opinion on this?
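One way that compromise could look, sketched as hypothetical testing knobs. The names below are assumptions for illustration, not actual CockroachDB settings or knobs:

// Hypothetical knobs for the compromise described above.
package main

import (
	"fmt"
	"log"
	"time"
)

type txnLivenessKnobs struct {
	// TxnLivenessThreshold: how long a txn may go without heartbeating before
	// pushers may treat it as expired. Set very high in the roachtest so a
	// failure to roll back a record shows up as a QPS stall.
	TxnLivenessThreshold time.Duration
	// FatalOnExpiredPending: crash when a push finds an expired PENDING txn,
	// so the bug is loud without making the test hang forever.
	FatalOnExpiredPending bool
}

func onPushFoundExpiredPending(knobs txnLivenessKnobs, txnID string) {
	if knobs.FatalOnExpiredPending {
		log.Fatalf("found expired PENDING transaction %s; its record was likely never rolled back", txnID)
	}
	fmt.Printf("pushing expired PENDING transaction %s\n", txnID)
}

func main() {
	knobs := txnLivenessKnobs{TxnLivenessThreshold: 10 * time.Minute}
	onPushFoundExpiredPending(knobs, "c4215f04")
}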

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 30, 2019
Informs cockroachdb#36089.

Before this commit, requests could get stuck repeatedly attempting to
push a transaction only to repeatedly find that they themselves were
already aborted. The error would not propagate up to the transaction
coordinator and the request would get stuck.

This commit fixes this behavior by correctly propagating errors observed
by the contentionQueue.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 30, 2019
…hold

Fixes cockroachdb#36089.

This commit bumps the TxnLivenessThreshold for clusters running
`kv/contention/nodes=4` to 10 minutes. This is sufficiently large
such that if at any point a transaction is abandoned then all other
transactions will begin waiting for it and the test will fail to
achieve its minimum QPS requirement.

Release note: None
@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/efb45869b242137e5c178b10c646c3ed025fff36

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1266041&tab=buildLog

The test failed on branch=master, cloud=gce:
	ts_util.go:89,kv.go:186,cluster.go:1814,errgroup.go:57: spent 12.280702% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
	cluster.go:1835,kv.go:189,test.go:1253: Goexit() was called

@andreimatei
Contributor

A compromise would be to have a knob that allows us to set TxnLivenessThreshold to some high value and another to instruct PushTxn to fatal if we ever stumble upon an expired PENDING transaction. The nice part about this is that it would be equally as loud in roachtests as the first option but wouldn't cause the tests to hang forever. @andreimatei any opinion on this?

I like it.

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/24feca7a4106f08c73534e16ebb79d949a479f35

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1268176&tab=buildLog

The test failed on branch=master, cloud=gce:
	ts_util.go:89,kv.go:186,cluster.go:1814,errgroup.go:57: spent 12.280702% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
	cluster.go:1835,kv.go:189,test.go:1251: Goexit() was called

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/ff969dd6cb0e087327f0b210f728f588b2f5aeb0

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1270738&tab=buildLog

The test failed on branch=master, cloud=gce:
	ts_util.go:89,kv.go:186,cluster.go:1814,errgroup.go:57: spent 22.807018% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
	cluster.go:1835,kv.go:189,test.go:1251: Goexit() was called

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/856ba9108f112f85d406bbe88d2208651859336e

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1274175&tab=buildLog

The test failed on branch=master, cloud=gce:
	ts_util.go:89,kv.go:186,cluster.go:1812,errgroup.go:57: spent 15.789474% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
	cluster.go:1833,kv.go:189,test.go:1251: Goexit() was called

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 10, 2019
Fixes cockroachdb#36089.
Informs cockroachdb#37199.

This commit addresses the second concern from cockroachdb#37199 (comment)
by implementing its suggestion #1.

It augments the TxnMeta struct to begin storing the transaction's
minimum timestamp. This allows pushers to determine with perfect
accuracy whether an intent is part of a transaction that can eventually
be committed or whether it has been aborted by another pusher and is
uncommittable.

This allows us to get rid of the false positive cases where a pusher
incorrectly detects that a transaction is still active and begins
waiting on it. In the worst case, this could lead to transactions
waiting for the entire transaction expiration for a contending txn.

Release note: None
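The gist of the change, as a minimal stand-alone sketch. The Timestamp, TxnMeta, and canStillCommit names below are illustrative assumptions, not the actual CockroachDB definitions:

// Toy model: with the txn's minimum timestamp stored on every intent, a
// pusher that finds no transaction record can decide whether a record could
// still be created, instead of waiting for the liveness threshold.
package main

import "fmt"

type Timestamp struct{ WallTime int64 }

func (t Timestamp) Less(o Timestamp) bool { return t.WallTime < o.WallTime }

// TxnMeta stands in for the per-intent transaction metadata; the commit above
// adds the transaction's minimum timestamp so it travels with every intent.
type TxnMeta struct {
	ID           string
	MinTimestamp Timestamp
}

// canStillCommit is a toy version of the pusher's decision: if the pushee's
// minimum timestamp is below the floor at which its range would still allow a
// transaction record to be created, no record can ever be written, so the
// transaction is uncommittable and the pusher need not wait for it to expire.
func canStillCommit(pushee TxnMeta, recordCreationFloor Timestamp) bool {
	return !pushee.MinTimestamp.Less(recordCreationFloor)
}

func main() {
	floor := Timestamp{WallTime: 200}
	fmt.Println(canStillCommit(TxnMeta{ID: "a", MinTimestamp: Timestamp{WallTime: 150}}, floor)) // false: treat as aborted
	fmt.Println(canStillCommit(TxnMeta{ID: "b", MinTimestamp: Timestamp{WallTime: 250}}, floor)) // true: keep waiting
}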
craig bot pushed a commit that referenced this issue Jul 11, 2019
38782: storage: persist minimum transaction timestamps in intents r=nvanbenschoten a=nvanbenschoten

Fixes #36089.
Informs #37199.

This commit addresses the second concern from #37199 (comment) by implementing its suggestion #1.

It augments the TxnMeta struct to begin storing the transaction's minimum timestamp. This allows pushers to determine with perfect accuracy whether an intent is part of a transaction that can eventually be committed or whether it has been aborted by another pusher and is uncommittable.

This allows us to get rid of the false positive cases where a pusher incorrectly detects that a transaction is still active and begins waiting on it. In the worst case, this could lead to transactions waiting for the entire transaction expiration for a contending txn.

@tbg I'm assigning you because you reviewed most of the lazy transaction record stuff (especially #33523), but let me know if you'd like me to find a different reviewer.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig craig bot closed this as completed in #38782 Jul 11, 2019
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 29, 2019
release-19.1 is susceptible to the issues described in cockroachdb#36089, so it
won't reliably pass this test.

Release note: None
craig bot pushed a commit that referenced this issue Jul 30, 2019
39149: roachtest: skip kv/contention/nodes=4 for release-19.1 r=tbg a=nvanbenschoten

Fixes #39116.

release-19.1 is susceptible to the issues described in #36089, so it won't reliably pass this test.

39160: storage: add DisableRaftLogQueue to StoreTestingKnobs r=tbg a=nvanbenschoten

Pulled from #38954, which I want to keep focused, especially with the PR's new secondary focus on refactoring entry application to be easier to mock and test.

Release note: None

39161: storage: address TODO in TestPushTxnHeartbeatTimeout r=tbg a=nvanbenschoten

Pulled from #38954, which I want to keep focused, especially with the PR's new secondary focus on refactoring entry application to be easier to mock and test.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Sep 9, 2019
…stamp

Fixes "Bug 1" from cockroachdb#36089 (comment).

This commit fixes the bug described in the referenced issue where MVCCScan can
read committed MVCC values at timestamps larger than a scan's read timestamp if
it finds an intent for the same transaction but from a previous epoch at a
timestamp larger than the scan's read timestamp.

Fixing this bug resolves the current issue in cockroachdb#36089.

Release note (bug fix): Fix bug where MVCC value at future timestamp is
returned after a transaction restart.
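A toy model of the invariant this commit restores: a scan at read timestamp readTS must never surface a version written above readTS, even if a higher-timestamp intent from a previous epoch of the same transaction sits on the key. The Version type and scanKey function are illustrative, not CockroachDB's MVCC implementation:

// scanKey returns the newest committed version at or below readTS; versions
// above readTS (including any left behind by an earlier epoch of a restarted
// transaction) must be skipped, never returned.
package main

import "fmt"

type Version struct {
	WallTime int64
	Value    string
}

func scanKey(versions []Version, readTS int64) (Version, bool) {
	var best Version
	found := false
	for _, v := range versions {
		if v.WallTime <= readTS && (!found || v.WallTime > best.WallTime) {
			best, found = v, true
		}
	}
	return best, found
}

func main() {
	versions := []Version{
		{WallTime: 90, Value: "old"},
		{WallTime: 120, Value: "future"}, // written above the scan's read timestamp
	}
	v, ok := scanKey(versions, 100)
	fmt.Println(v, ok) // {90 old} true — never "future"
}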