
release-1.0: Backports and updates for table lease leak #22563

Merged: 5 commits into cockroachdb:release-1.0 on Feb 11, 2018

Conversation

@bdarnell (Contributor, Author) commented Feb 10, 2018

Four commits:

  1. Backport #17871 ("util/log: don't panic") to fix a hang when the log disk fills up (unrelated, but requested by a customer).
  2. Backport the fix for #18601 ("storage: avoid reading uncommitted tail of Raft log when becoming leader"), which fixes the endless election loop for ranges that have been affected by this bug.
  3. Address #20451 ("sql: no throttling for releasing table leases"). This is new work in this branch, although the change should be propagated to the other branches.
  4. A new 1.0-specific fix for the table lease leak, #20422 ("sql: table lease leakage in 1.0.6").

@bdarnell requested a review from a team on February 10, 2018 17:18
@cockroach-teamcity (Member)

This change is Reviewable

@bdarnell (Contributor, Author) commented Feb 10, 2018

@vivekmenezes Please look at the last commit.

@bdarnell (Contributor, Author)

@cockroachdb/build-prs I patched etcd directly in our vendored repo, which our linter is unhappy with. What should I be doing instead?

@benesch (Contributor) commented Feb 10, 2018 via email

@petermattis (Collaborator)

Review status: 0 of 4 files reviewed at latest revision, all discussions resolved, some commit checks failed.


pkg/sql/lease.go, line 577 at r4 (raw file):

			// it's our responsibility to delete it when we make it no
			// longer the newest.
			defer func(lease *LeaseState) {

Creating defers in a loop is suspicious, though I'm not seeing anything wrong; I'm also not particularly familiar with this code anymore. Is there a reason you're doing this with a defer? I'm also not clear on why this is only done if a new lease was acquired. Questions all over the place, mostly indicating my deficiencies.

How did you test this?



@bdarnell (Contributor, Author)

Review status: 0 of 4 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


pkg/sql/lease.go, line 577 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Creating defers in a loop is suspicious, though I'm not seeing anything wrong; I'm also not particularly familiar with this code anymore. Is there a reason you're doing this with a defer? I'm also not clear on why this is only done if a new lease was acquired. Questions all over the place, mostly indicating my deficiencies.

How did you test this?

I haven't tested it yet (aside from running make test). I'm not entirely sure how to test it.

Moving this out of a defer is a good idea. I'll set a local variable here and move the rest of the code below acquireFromStoreLocked.

The idea behind this change is that we're using a slightly non-standard refcounting pattern. Only the newest lease is allowed to increase its refcount (and as an optimization, the newest lease continues to exist even while its refcount is zero, instead of being destroyed and recreated); older leases are deleted when their refcounts reach zero. If we created a new lease, we've transitioned the previous one from "newest" to "not newest". If its refcount is already zero, no one else will be coming along to clean it up so we have to do it here.
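
To make that concrete, here is a minimal, simplified sketch of the refcounting pattern described above. It is not the actual pkg/sql/lease.go code; leaseEntry, tableState, and the method names are illustrative stand-ins, and error handling is omitted.

```go
package lease

import "sync"

// leaseEntry stands in for *LeaseState in this sketch.
type leaseEntry struct {
	refcount int
}

type tableState struct {
	mu sync.Mutex
	// active is ordered oldest to newest; only the last element may have
	// its refcount increased.
	active []*leaseEntry
}

// acquire hands out the newest lease (the sketch assumes one exists). As an
// optimization, the newest lease keeps existing even while its refcount is
// zero instead of being destroyed and recreated.
func (t *tableState) acquire() *leaseEntry {
	t.mu.Lock()
	defer t.mu.Unlock()
	newest := t.active[len(t.active)-1]
	newest.refcount++
	return newest
}

// release drops a reference. Older (non-newest) leases are deleted as soon
// as their refcount reaches zero.
func (t *tableState) release(l *leaseEntry) {
	t.mu.Lock()
	defer t.mu.Unlock()
	l.refcount--
	if l.refcount == 0 && l != t.active[len(t.active)-1] {
		t.removeLocked(l)
	}
}

// upgrade installs a newly acquired lease as the newest one. The previous
// newest lease has just transitioned from "newest" to "not newest"; if its
// refcount is already zero, nobody else will come along to release it, so
// it must be cleaned up right here.
func (t *tableState) upgrade(newLease *leaseEntry) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if n := len(t.active); n > 0 {
		if prev := t.active[n-1]; prev.refcount == 0 {
			t.removeLocked(prev)
		}
	}
	t.active = append(t.active, newLease)
}

// removeLocked deletes a lease from the active list; t.mu must be held.
func (t *tableState) removeLocked(l *leaseEntry) {
	for i, e := range t.active {
		if e == l {
			t.active = append(t.active[:i], t.active[i+1:]...)
			return
		}
	}
}
```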



@petermattis (Collaborator)

Review status: 0 of 4 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


pkg/sql/lease.go, line 577 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I haven't tested it yet (aside from running make test). I'm not entirely sure how to test it.

Moving this out of a defer is a good idea. I'll set a local variable here and move the rest of the code below acquireFromStoreLocked.

The idea behind this change is that we're using a slightly non-standard refcounting pattern. Only the newest lease is allowed to increase its refcount (and as an optimization, the newest lease continues to exist even while its refcount is zero, instead of being destroyed and recreated); older leases are deleted when their refcounts reach zero. If we created a new lease, we've transitioned the previous one from "newest" to "not newest". If its refcount is already zero, no one else will be coming along to clean it up so we have to do it here.

Ok, I need to look at this in more detail later. For testing, you could manually replicate what I did in #20422 (comment).



@bdarnell force-pushed the 1.0-backports branch 2 times, most recently from fa0014a to acd96e2 on February 10, 2018 19:38
benesch and others added 2 commits February 10, 2018 14:48
Previously, log.outputLogEntry could panic while holding the log mutex.
This would deadlock any goroutine that logged while recovering from the
panic, which is approximately all of the recover routines. Most
annoyingly, the crash reporter would deadlock, swallowing the cause of
the panic.

Avoid panicking while holding the log mutex and use l.exit instead,
which exists for this very purpose. In the process, enforce the
invariant that l.mu is held when l.exit is called. (The previous
behavior was, in fact, incorrect, as l.flushAll should not be called
without holding l.mu.)

Also add a Tcl test to ensure this doesn't break in the future.
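
A rough sketch of the pattern this commit message describes, under the assumption that the details match the description above. This is not the actual util/log code; writeToDisk and flushAllLocked are hypothetical helpers standing in for the real internals.

```go
package log

import (
	"os"
	"sync"
)

type loggerT struct {
	mu sync.Mutex
	// ... buffers, file handles, etc.
}

// exit flushes what it can and terminates the process. It requires l.mu to
// be held, which keeps the flush safe and enforces the invariant mentioned
// in the commit message.
func (l *loggerT) exit(err error) {
	l.flushAllLocked()
	os.Exit(2)
}

func (l *loggerT) outputLogEntry(data []byte) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if err := l.writeToDisk(data); err != nil {
		// Don't panic here: panicking while holding l.mu would deadlock any
		// goroutine that tries to log while recovering from the panic,
		// including the crash reporter. Use the exit path instead.
		l.exit(err)
	}
}

// Hypothetical helpers, present only to make the sketch self-contained.
func (l *loggerT) flushAllLocked()               {}
func (l *loggerT) writeToDisk(data []byte) error { return nil }
```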
@bdarnell force-pushed the 1.0-backports branch 2 times, most recently from 9fc6047 to 5c3d46c on February 10, 2018 21:52
@bdarnell (Contributor, Author) commented Feb 10, 2018

OK, I've tested manually with the procedure from #20422 (comment) and verified that the number of leases holds steady. (The count varies between 4 and 6; I initially thought it was spending more time at 4 than I'd expect for a 5-node cluster, but this was actually a 4-node cluster. I was confused by the way roachprod uses the last machine only for load generation.)

I've also done the necessary glide (not dep in 1.0) magic to make the linter happy.

@petermattis (Collaborator)

Review status: 0 of 6 files reviewed at latest revision, all discussions resolved, some commit checks failed.


pkg/sql/lease.go, line 820 at r8 (raw file):

	// Release to the store asynchronously, without the tableState lock.
	if err := t.stopper.RunLimitedAsyncTask(ctx, removeLeaseSem, true,

This will make removeLease block holding exitingLease.mu. Seems like that could be problematic.
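
For context, a generic sketch of why a semaphore-limited async task can block its caller: when the semaphore channel is full, the send blocks until an earlier task finishes, and here the caller would still be holding the lock mentioned above. This is not the stopper implementation; runLimitedAsyncTask is an illustrative stand-in.

```go
// Illustrative only: a channel used as a counting semaphore to bound the
// number of concurrent release tasks.
func runLimitedAsyncTask(sem chan struct{}, f func()) {
	sem <- struct{}{} // blocks the caller if the limit has been reached
	go func() {
		defer func() { <-sem }()
		f()
	}()
}

// Example: with room for 50 in-flight tasks, the 51st call blocks until one
// of the earlier tasks finishes.
// sem := make(chan struct{}, 50)
// runLimitedAsyncTask(sem, func() { /* release lease to the store */ })
```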



@bdarnell (Contributor, Author)

Review status: 0 of 6 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


pkg/sql/lease.go, line 820 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This will make removeLease block holding exitingLease.mu. Seems like that could be problematic.

Maybe. But under normal conditions the number of leases will be small and the limit will rarely be reached (I could increase the limit to make this less likely). Do you think it's worth doing anything more clever on the 1.0 branch?



@vivekmenezes (Contributor) left a comment


LGTM

pkg/sql/lease.go Outdated
-	if err := t.stopper.RunAsyncTask(ctx, func(ctx context.Context) {
-		m.LeaseStore.Release(ctx, t.stopper, lease)
-	}); err != nil {
+	if err := t.stopper.RunLimitedAsyncTask(ctx, removeLeaseSem, true,
@vivekmenezes (Contributor) commented:

I worry about changing this to use the RunLimitedAsyncTask() because it blocks while holding on to the lock over all leases for a table. I think 1.0 users are likely to use very few tables and so it's very likely there will be only a few of these async tasks created.

@petermattis (Collaborator) commented Feb 11, 2018 via email

Addresses cockroachdb#20451 for the release-1.0 branch
This prevents a leak (only present in 1.0) of these leases, which
could accumulate into a huge amount of work when PurgeOldLeases is
called.

Fixes cockroachdb#20422
Newer branches have a more sophisticated solution for this (cockroachdb#20542)
@petermattis (Collaborator)

:lgtm:


Review status: 0 of 7 files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


pkg/sql/lease.go, line 820 at r8 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Maybe. But under normal conditions the number of leases will be small and the limit will rarely be reached (I could increase the limit to make this less likely). Do you think it's worth doing anything more clever on the 1.0 branch?

Ack, this looks safer. If you see any test failures, you'd need to combine this with watching the stopper's done channel.



@bdarnell merged commit b277f36 into cockroachdb:release-1.0 on Feb 11, 2018
@bdarnell deleted the 1.0-backports branch on February 11, 2018 17:18