
roachtest: kv/contention/nodes=4 failed #36089

Closed
cockroach-teamcity opened this issue Mar 24, 2019 · 56 comments · Fixed by #38782
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. S-3 Medium-low impact: incurs increased costs for some users (incl lower avail, recoverable bad data)
Milestone

Comments

@cockroach-teamcity
Member

SHA: https://github.com/cockroachdb/cockroach/commits/c59f5347d5424edb90575fb0fd50bad677953752

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1195732&tab=buildLog

The test failed on release-19.1:
	ts_util.go:89,kv.go:186,cluster.go:1605,errgroup.go:57: spent 5.263158% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
	cluster.go:1626,kv.go:189,test.go:1214: Goexit() was called

@cockroach-teamcity cockroach-teamcity added this to the 19.1 milestone Mar 24, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Mar 24, 2019
@nvanbenschoten nvanbenschoten self-assigned this Mar 25, 2019
@nvanbenschoten
Member

spent 5.263158% of time below target of ... 5.000000%. We're close enough to the target that this is probably just a low-probability flake. I'll leave this as is and see whether it fails again over the next few days. If not, I'll bump the target threshold slightly, as it was chosen arbitrarily.
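For reference, the passing criterion amounts to a check like the following. This is a toy reconstruction of the idea, not the actual ts_util.go verification code, and it assumes the time-series has already been reduced to per-interval transaction rates:

// Fail the run if the fraction of sampled intervals whose rate falls below
// minRate exceeds maxFraction (5% here). Illustrative sketch only.
package main

import "fmt"

func fractionBelowTarget(ratesPerSec []float64, minRate float64) float64 {
	if len(ratesPerSec) == 0 {
		return 0
	}
	below := 0
	for _, r := range ratesPerSec {
		if r < minRate {
			below++
		}
	}
	return float64(below) / float64(len(ratesPerSec))
}

func main() {
	// 19 sampled intervals, one of which dipped below 10 txn/s: 1/19 = 5.263158%.
	rates := []float64{12, 15, 3, 11, 14, 13, 12, 16, 14, 12, 11, 13, 15, 12, 14, 11, 13, 12, 14}
	frac := fractionBelowTarget(rates, 10)
	const maxFraction = 0.05
	fmt.Printf("spent %f%% of time below target of %f txn/s, wanted no more than %f%%\n",
		frac*100, 10.0, maxFraction*100)
	if frac > maxFraction {
		fmt.Println("FAIL")
	}
}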

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/a1e6e9decc9dec15a32bbb6d30efc67ca45a532a

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1204585&tab=buildLog

The test failed on release-19.1:
	ts_util.go:89,kv.go:186,cluster.go:1605,errgroup.go:57: spent 5.263158% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
	cluster.go:1626,kv.go:189,test.go:1216: Goexit() was called

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 16, 2019
Fixes cockroachdb#36089.

We've seen that this test can occasionally dip above 5% of time below
the minimum QPS. This doesn't seem to be indicative of a full QPS stall
and this passing criteria was somewhat arbitrary, so relax it.

Release note: None
@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/83de585d331b05a4aa02a65b353bed6bf829b696

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1247383&tab=buildLog

The test failed on master:
	ts_util.go:89,kv.go:186,cluster.go:1667,errgroup.go:57: spent 14.035088% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
	cluster.go:1688,kv.go:189,test.go:1237: Goexit() was called

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/ec4728ae986b46d4f57009233b86971198b275ed

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1255121&tab=buildLog

The test failed on master:
	ts_util.go:89,kv.go:186,cluster.go:1667,errgroup.go:57: spent 7.017544% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
	cluster.go:1688,kv.go:189,test.go:1237: Goexit() was called

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/99306ec3e9fcbba01c05431cbf496e8b5b8954b4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1260033&tab=buildLog

The test failed on master:
	ts_util.go:89,kv.go:186,cluster.go:1761,errgroup.go:57: spent 7.017544% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
	cluster.go:1782,kv.go:189,test.go:1245: Goexit() was called

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/dff4132a80e62c6c5ad603ff6c608b09419d4e3e

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1264632&tab=buildLog

The test failed on branch=master, cloud=gce:
	ts_util.go:89,kv.go:186,cluster.go:1814,errgroup.go:57: spent 10.344828% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
	cluster.go:1835,kv.go:189,test.go:1253: Goexit() was called

@nvanbenschoten
Member

The increased flakiness looks somewhat correlated with #36748. I wonder if there are cases where we were relying on liveness expiration to get transactions unstuck. I'm going to run this with a huge txn liveness expiration and see if things get even worse.

@nvanbenschoten
Member

That did the trick. I can clearly see transactions waiting on other transactions for 30+ seconds after bumping TxnLivenessHeartbeatMultiplier to 30. I think it's likely that a transaction is failing to roll back its transaction record.

@nvanbenschoten
Member

I see the transactions that are failing to roll back their transaction records hitting errors like
TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_ABORTED_RECORD_FOUND): "sql txn" id=c4215f04 key=/Table/53/1/-7656810924711545548/0 rw=true pri=0.01158742 stat=PENDING epo=0 ts=1556562081.872413329,1 orig=1556562051.258792452,0 max=1556562051.260084221,0 wto=true seq=13

@nvanbenschoten
Member

So actually that doesn't say very much.

@nvanbenschoten
Member

On the /debug/requests page I saw pushers getting stuck in the contentionQueue and repeatedly hitting TransactionAbortedError(ABORT_REASON_PUSHER_ABORTED) errors every 100ms (after waiting for cycle detection to be ready). This makes sense because contentionQueue.add doesn't return an error, so transactions can fall into loops where they hit an intent, try to push, fail to push due to an error, and repeat. I added a few lines to plumb the error back up through the IntentResolver and things immediately got a lot better. I no longer see any long stalls in the workload even with the 30s TxnLivenessThreshold.

I'm going to make the following changes:

  1. propagate errors from the contentionQueue instead of swallowing them (see the sketch after this list)
  2. tie TxnLivenessHeartbeatMultiplier to an environment variable and set it to 60s in this roachtest. We should never need to wait on a transaction's liveness to expire when all nodes are live, so the low timeout was only serving to hide bugs like this
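Here is a minimal stand-alone sketch of point 1, contrasting a queue whose add swallows push errors with one that returns them to the caller. The names (tryPush, addSwallowingErrors, addPropagatingErrors) are illustrative placeholders, not CockroachDB's actual contentionQueue API:

// Illustrative only: models the failure mode described above, where a pusher
// that has itself been aborted keeps retrying because the error never
// reaches the transaction coordinator.
package main

import (
	"errors"
	"fmt"
	"time"
)

var errPusherAborted = errors.New("TransactionAbortedError(ABORT_REASON_PUSHER_ABORTED)")

// tryPush simulates a push attempt that keeps failing because the pusher
// itself has already been aborted.
func tryPush() error { return errPusherAborted }

// addSwallowingErrors mimics a queue whose add() has no error return: the
// caller never learns that its pushes are doomed, so it retries indefinitely
// (bounded to a few iterations here so the example terminates).
func addSwallowingErrors() {
	for i := 0; i < 3; i++ {
		if err := tryPush(); err != nil {
			time.Sleep(100 * time.Millisecond) // retry after the usual backoff
			continue
		}
		return
	}
}

// addPropagatingErrors returns the push error so the caller can surface it to
// the transaction coordinator instead of looping.
func addPropagatingErrors() error {
	return tryPush()
}

func main() {
	addSwallowingErrors()
	fmt.Println("swallowed: caller never saw", errPusherAborted)
	fmt.Println("propagated:", addPropagatingErrors())
}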

@nvanbenschoten
Member

Another option is to have a cluster setting/env var that disables transaction expiration entirely so that a failure to roll a transaction back causes everyone else to get stuck. The benefit of this is that it would be very loud.

A compromise would be to have a knob that allows us to set TxnLivenessThreshold to some high value and another to instruct PushTxn to fatal if we ever stumble upon an expired PENDING transaction. The nice part about this is that it would be equally as loud in roachtests as the first option but wouldn't cause the tests to hang forever. @andreimatei any opinion on this?
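One way that compromise could look, sketched as hypothetical testing knobs. The names below are assumptions for illustration, not actual CockroachDB settings or knobs:

// Hypothetical knobs for the compromise described above.
package main

import (
	"fmt"
	"log"
	"time"
)

type txnLivenessKnobs struct {
	// TxnLivenessThreshold: how long a txn may go without heartbeating before
	// pushers may treat it as expired. Set very high in the roachtest so a
	// failure to roll back a record shows up as a QPS stall.
	TxnLivenessThreshold time.Duration
	// FatalOnExpiredPending: crash when a push finds an expired PENDING txn,
	// so the bug is loud without making the test hang forever.
	FatalOnExpiredPending bool
}

func onPushFoundExpiredPending(knobs txnLivenessKnobs, txnID string) {
	if knobs.FatalOnExpiredPending {
		log.Fatalf("found expired PENDING transaction %s; its record was likely never rolled back", txnID)
	}
	fmt.Printf("pushing expired PENDING transaction %s\n", txnID)
}

func main() {
	knobs := txnLivenessKnobs{TxnLivenessThreshold: 10 * time.Minute}
	onPushFoundExpiredPending(knobs, "c4215f04")
}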

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 30, 2019
Informs cockroachdb#36089.

Before this commit, requests could get stuck repeatedly attempting to
push a transaction only to repeatedly find that they themselves were
already aborted. The error would not propagate up to the transaction
coordinator and the request would get stuck.

This commit fixes this behavior by correctly propagating errors observed
by the contentionQueue.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 30, 2019
…hold

Fixes cockroachdb#36089.

This commit bumps the TxnLivenessThreshold for clusters running
`kv/contention/nodes=4` to 10 minutes. This is sufficiently large
such that if at any point a transaction is abandoned then all other
transactions will begin waiting for it and the test will fail to
achieve its minimum QPS requirement.

Release note: None
@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/efb45869b242137e5c178b10c646c3ed025fff36

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1266041&tab=buildLog

The test failed on branch=master, cloud=gce:
	ts_util.go:89,kv.go:186,cluster.go:1814,errgroup.go:57: spent 12.280702% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
	cluster.go:1835,kv.go:189,test.go:1253: Goexit() was called

@andreimatei
Contributor

A compromise would be to have a knob that allows us to set TxnLivenessThreshold to some high value and another to instruct PushTxn to fatal if we ever stumble upon an expired PENDING transaction. The nice part about this is that it would be equally as loud in roachtests as the first option but wouldn't cause the tests to hang forever. @andreimatei any opinion on this?

I like it.

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/24feca7a4106f08c73534e16ebb79d949a479f35

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1268176&tab=buildLog

The test failed on branch=master, cloud=gce:
	ts_util.go:89,kv.go:186,cluster.go:1814,errgroup.go:57: spent 12.280702% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
	cluster.go:1835,kv.go:189,test.go:1251: Goexit() was called

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/ff969dd6cb0e087327f0b210f728f588b2f5aeb0

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1270738&tab=buildLog

The test failed on branch=master, cloud=gce:
	ts_util.go:89,kv.go:186,cluster.go:1814,errgroup.go:57: spent 22.807018% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
	cluster.go:1835,kv.go:189,test.go:1251: Goexit() was called

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/856ba9108f112f85d406bbe88d2208651859336e

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1274175&tab=buildLog

The test failed on branch=master, cloud=gce:
	ts_util.go:89,kv.go:186,cluster.go:1812,errgroup.go:57: spent 15.789474% of time below target of 10.000000 txn/s, wanted no more than 5.000000%
	cluster.go:1833,kv.go:189,test.go:1251: Goexit() was called

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 10, 2019
Fixes cockroachdb#36089.
Informs cockroachdb#37199.

This commit addresses the second concern from cockroachdb#37199 (comment)
by implementing its suggestion #1.

It augments the TxnMeta struct to begin storing the transaction's
minimum timestamp. This allows pushers to determine with perfect
accuracy whether an intent is part of a transaction that can eventually
be committed or whether it has been aborted by another pusher and is
uncommittable.

This allows us to get rid of the false positive cases where a pusher
incorrectly detects that a transaction is still active and begins
waiting on it. In the worst case, this could lead to transactions
waiting for the entire transaction expiration for a contending txn.

Release note: None
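The gist of the change, as a minimal stand-alone sketch. The Timestamp, TxnMeta, and canStillCommit names below are illustrative assumptions, not the actual CockroachDB definitions:

// Toy model: with the txn's minimum timestamp stored on every intent, a
// pusher that finds no transaction record can decide whether a record could
// still be created, instead of waiting for the liveness threshold.
package main

import "fmt"

type Timestamp struct{ WallTime int64 }

func (t Timestamp) Less(o Timestamp) bool { return t.WallTime < o.WallTime }

// TxnMeta stands in for the per-intent transaction metadata; the commit above
// adds the transaction's minimum timestamp so it travels with every intent.
type TxnMeta struct {
	ID           string
	MinTimestamp Timestamp
}

// canStillCommit is a toy version of the pusher's decision: if the pushee's
// minimum timestamp is below the floor at which its range would still allow a
// transaction record to be created, no record can ever be written, so the
// transaction is uncommittable and the pusher need not wait for it to expire.
func canStillCommit(pushee TxnMeta, recordCreationFloor Timestamp) bool {
	return !pushee.MinTimestamp.Less(recordCreationFloor)
}

func main() {
	floor := Timestamp{WallTime: 200}
	fmt.Println(canStillCommit(TxnMeta{ID: "a", MinTimestamp: Timestamp{WallTime: 150}}, floor)) // false: treat as aborted
	fmt.Println(canStillCommit(TxnMeta{ID: "b", MinTimestamp: Timestamp{WallTime: 250}}, floor)) // true: keep waiting
}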
craig bot pushed a commit that referenced this issue Jul 11, 2019
38782: storage: persist minimum transaction timestamps in intents r=nvanbenschoten a=nvanbenschoten

Fixes #36089.
Informs #37199.

This commit addresses the second concern from #37199 (comment) by implementing its suggestion #1.

It augments the TxnMeta struct to begin storing the transaction's minimum timestamp. This allows pushers to determine with perfect accuracy whether an intent is part of a transaction that can eventually be committed or whether it has been aborted by another pusher and is uncommittable.

This allows us to get rid of the false positive cases where a pusher incorrectly detects that a transaction is still active and begins waiting on it. In the worst case, this could lead to transactions waiting for the entire transaction expiration for a contending txn.

@tbg I'm assigning you because you reviewed most of the lazy transaction record stuff (especially #33523), but let me know if you'd like me to find a different reviewer.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig craig bot closed this as completed in #38782 Jul 11, 2019
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 29, 2019
release-19.1 is susceptible to the issues described in cockroachdb#36089, so it
won't reliably pass this test.

Release note: None
craig bot pushed a commit that referenced this issue Jul 30, 2019
39149: roachtest: skip kv/contention/nodes=4 for release-19.1 r=tbg a=nvanbenschoten

Fixes #39116.

release-19.1 is susceptible to the issues described in #36089, so it won't reliably pass this test.

39160: storage: add DisableRaftLogQueue to StoreTestingKnobs r=tbg a=nvanbenschoten

Pulled from #38954, which I want to keep focused, especially with the PR's new secondary focus on refactoring entry application to be easier to mock and test.

Release note: None

39161: storage: address TODO in TestPushTxnHeartbeatTimeout r=tbg a=nvanbenschoten

Pulled from #38954, which I want to keep focused, especially with the PR's new secondary focus on refactoring entry application to be easier to mock and test.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Sep 9, 2019
…stamp

Fixes "Bug 1" from cockroachdb#36089 (comment).

This commit fixes the bug described in the referenced issue where MVCCScan can
read committed MVCC values at timestamps larger than a scan's read timestamp if
it finds an intent for the same transaction but from a previous epoch at a
timestamp larger than the scan's read timestamp.

Fixing this bug resolves the current issue in cockroachdb#36089.

Release note (bug fix): Fix bug where MVCC value at future timestamp is
returned after a transaction restart.
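A toy model of the invariant this commit restores: a scan at read timestamp readTS must never surface a version written above readTS, even if a higher-timestamp intent from a previous epoch of the same transaction sits on the key. The Version type and scanKey function are illustrative, not CockroachDB's MVCC implementation:

// scanKey returns the newest committed version at or below readTS; versions
// above readTS (including any left behind by an earlier epoch of a restarted
// transaction) must be skipped, never returned.
package main

import "fmt"

type Version struct {
	WallTime int64
	Value    string
}

func scanKey(versions []Version, readTS int64) (Version, bool) {
	var best Version
	found := false
	for _, v := range versions {
		if v.WallTime <= readTS && (!found || v.WallTime > best.WallTime) {
			best, found = v, true
		}
	}
	return best, found
}

func main() {
	versions := []Version{
		{WallTime: 90, Value: "old"},
		{WallTime: 120, Value: "future"}, // written above the scan's read timestamp
	}
	v, ok := scanKey(versions, 100)
	fmt.Println(v, ok) // {90 old} true — never "future"
}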