sql: schema change after server restart blocks once then looks like DB corruption after retry #9493

knz · 2016-09-21T23:21:23Z

Steps:

create fresh cluster rm -rf cockroach-data; ./cockroach start --background
load the example db: ./cockroach gen example-data startrek | ./cockroach sql
confirm that the example db can be reloaded multiple times (can DROP then CREATE multiple times): repeat step 2 a few times.
stop the server with killall cockroach - wait for the server to finish shutting down
restart the server ./cockroach start --background
attempt step 2 again; observe: the server hangs after SET, upon encountering the first DROP
interrupt the client with Ctrl+C
attempt step 2 again; observe: the first DROP fails with an error, then the client hangs (error: pq: missing fk back reference to quotes.quotes_episode_idx from episodes.primary)
interrupt the client again with Ctrl+C
attempt step 2 again; observe: major confusion

CREATE DATABASE
SET
pq: missing fk back reference to quotes.quotes_episode_idx from episodes.primary
pq: table "episodes" has been deleted
pq: relation "episodes" already exists
pq: table is being deleted
pq: relation "quotes" already exists
pq: table is being deleted
Error: pq: table is being deleted
Failed running "sql"

From that point forward the db cannot be used but also cannot be deleted.

The text was updated successfully, but these errors were encountered:

knz · 2016-09-21T23:21:46Z

cc @andreimatei @dt (the issue does not occur if the table being dropped does not contain FKs)

knz · 2016-09-21T23:23:21Z

NB: after a while (a few minutes), the error disappears.

knz · 2016-09-22T15:33:38Z

After discussing with Andrei:

the root cause is that the server does not release its leases properly upon the shutdown in step 4. As a result when the DROP runs for the first time after that, it has to wait for the existing lease to expire (resulting in the client suspending accordingly). The solution to this is either release the lease upon shutdown, or reuse the existing lease after the restart (or both).
when the client is interrupted in step 7, the DROP has marked the table as deleted. Consequently, any subsequent attempt to DROP will see the table as deleted and fail accordingly. There is a UX issue with this, which can be addressed in a number of ways:
- the subsequent DROPs could wait, like the first.
- the error message could be clarified, e.g. "cannot DROP this table, another DROP is already in progress"
- the entire DROP logic could start by renaming the table to something else, so that subsequent schema updates that want to re-create a table of the same name can succeed while the initial DROP is in progress.

In any case, once the DROP has been initiated, we should care that a subsequent CREATE TABLE with the same name succeeds without error. Right now the user would see "table has been deleted" upon DROP, and then deduce that it's safe to create, but then CREATE fails with "table already exists.".

This is a catastrophic UX failure.

Related: #8497, #9318.

knz · 2016-09-22T15:36:23Z

@dt it would be great if you could expound on the explanation above to clarify where the FK message comes from? pq: missing fk back reference to quotes.quotes_episode_idx from episodes.primary this happens during a 2nd DROP.

andreimatei · 2016-09-26T21:39:20Z

I'll work on improving sql draining to release those leases... I'm real scared.

vivekmenezes · 2016-10-21T21:36:36Z

@andreimatei are you really working on releasing those leases. I don't think we should do that for 1.0

knz · 2016-10-22T08:41:17Z

Vivek I'm not sure whether releasing the leases are the best way to address the issue at hand but the UX problem initially reported here must absolutely be fixed before 1.0 IMO.

andreimatei · 2016-10-23T18:44:14Z

(I've replied this by email 2 days ago, but for some reason it didn't make it in the issue... weird...)

I'm not working on it; I gave up after investigating draining a bit and getting in the weeds and realizing it's not very easy and I started bringing it up with Peter and others for getting someone to work on it :)
But I think we definitely need to do something about it soon. It has come up multiple times from external people in the context of not being able to delete tables.

petermattis · 2016-10-24T00:11:40Z

Draining table leases when a server is stopped doesn't sound too bad, but then I haven't looked at these weeds recently. Alternately, can we clear any previously held table leases when a server restarts?

Re: double DROP of a table. All of the proposed UX options sound reasonable. This sounds like the easiest short term fix:

the error message could be clarified, e.g. "cannot DROP this table, another DROP is already in progress"

andreimatei · 2016-10-25T18:25:28Z

I'm not working on it; I gave up after investigating draining a bit and
realizing it's not very easy and I started bringing it up with Peter and
others for getting someone to work on it :)
But I think we definitely need to do something about it soon. It has come
up multiple times from external people in the context of not being able to
delete tables.

On Fri, Oct 21, 2016 at 5:36 PM, vivekmenezes notifications@github.com
wrote:

@andreimatei https://github.com/andreimatei are you really working on
releasing those leases. I don't think we should do that for 1.0

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#9493 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAXBcYlUUwI-IgcT0xqJHO606ESecugVks5q2TBogaJpZM4KDYRw
.

vivekmenezes · 2016-10-25T18:29:29Z

With all our schema changes being chunked now we certainly can reduce the
timeout to perhaps 1 minute. That along with better error messages should
suffice for 1.0

On Sun, Oct 23, 2016, 2:44 PM Andrei Matei notifications@github.com wrote:

(I've replied this by email 2 days ago, but for some reason it didn't make
it in the issue... weird...)

I'm not working on it; I gave up after investigating draining a bit and
getting in the weeds and realizing it's not very easy and I started
bringing it up with Peter and others for getting someone to work on it :)
But I think we definitely need to do something about it soon. It has come
up multiple times from external people in the context of not being able to
delete tables.

—
You are receiving this because you were assigned.
Reply to this email directly, view it on GitHub
#9493 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/ALOpBJAtYiNcPJbdiUxBUQmPGPQyA8nZks5q26sQgaJpZM4KDYRw
.

vivekmenezes · 2016-10-25T18:45:35Z

@knz do let me know what you'd like to see besides what I've done in
#10063

On Sat, Oct 22, 2016 at 4:41 AM kena notifications@github.com wrote:

Vivek I'm not sure whether releasing the leases are the best way to
address the issue at hand but the UX problem initially reported here must
absolutely be fixed before 1.0 IMO.

—
You are receiving this because you were assigned.
Reply to this email directly, view it on GitHub
#9493 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/ALOpBGwT8nSUeEpTJXMPbvIU2WFoOOEiks5q2cw-gaJpZM4KDYRw
.

dt · 2016-12-20T19:26:20Z

I think we want an early return in Validate, before checking the cross-table references, if the table is being dropped (as it is then expected that back-references do not exist).

knz added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Sep 21, 2016

knz assigned vivekmenezes Sep 21, 2016

knz assigned dt and andreimatei Sep 22, 2016

knz added this to the 1.0 milestone Sep 22, 2016

knz changed the title ~~sql: schema change after server restart blocks once then corrupts DB after retry~~ sql: schema change after server restart blocks once then looks like DB corruption after retry Sep 22, 2016

andreimatei mentioned this issue Oct 3, 2016

sql: Dropping table caused infinite publish (count leases) loop #6837

Closed

vivekmenezes mentioned this issue Oct 21, 2016

sql: SHOW TABLES marks a dropped table with a " (dropped)" suffix #10063

Merged

andreimatei assigned asubiotto Nov 2, 2016

knz added the S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. label Nov 2, 2016

asubiotto mentioned this issue Dec 20, 2016

sql: add drain mode to LeaseManager #12501

Merged

asubiotto mentioned this issue Dec 20, 2016

sql/sqlbase: skip validation of FKs during drop #12514

Merged

asubiotto closed this as completed in #12501 Jan 12, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sql: schema change after server restart blocks once then looks like DB corruption after retry #9493

sql: schema change after server restart blocks once then looks like DB corruption after retry #9493

knz commented Sep 21, 2016 •

edited

Loading

knz commented Sep 21, 2016

knz commented Sep 21, 2016

knz commented Sep 22, 2016

knz commented Sep 22, 2016 •

edited

Loading

andreimatei commented Sep 26, 2016

vivekmenezes commented Oct 21, 2016

knz commented Oct 22, 2016

andreimatei commented Oct 23, 2016

petermattis commented Oct 24, 2016

andreimatei commented Oct 25, 2016

vivekmenezes commented Oct 25, 2016

vivekmenezes commented Oct 25, 2016

dt commented Dec 20, 2016

sql: schema change after server restart blocks once then looks like DB corruption after retry #9493

sql: schema change after server restart blocks once then looks like DB corruption after retry #9493

Comments

knz commented Sep 21, 2016 • edited Loading

knz commented Sep 21, 2016

knz commented Sep 21, 2016

knz commented Sep 22, 2016

knz commented Sep 22, 2016 • edited Loading

andreimatei commented Sep 26, 2016

vivekmenezes commented Oct 21, 2016

knz commented Oct 22, 2016

andreimatei commented Oct 23, 2016

petermattis commented Oct 24, 2016

andreimatei commented Oct 25, 2016

vivekmenezes commented Oct 25, 2016

vivekmenezes commented Oct 25, 2016

dt commented Dec 20, 2016

knz commented Sep 21, 2016 •

edited

Loading

knz commented Sep 22, 2016 •

edited

Loading