Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: schema change after server restart blocks once then looks like DB corruption after retry #9493

Closed
knz opened this issue Sep 21, 2016 · 13 comments
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.
Milestone

Comments

@knz
Copy link
Contributor

knz commented Sep 21, 2016

Steps:

  1. create fresh cluster rm -rf cockroach-data; ./cockroach start --background
  2. load the example db: ./cockroach gen example-data startrek | ./cockroach sql
  3. confirm that the example db can be reloaded multiple times (can DROP then CREATE multiple times): repeat step 2 a few times.
  4. stop the server with killall cockroach - wait for the server to finish shutting down
  5. restart the server ./cockroach start --background
  6. attempt step 2 again; observe: the server hangs after SET, upon encountering the first DROP
  7. interrupt the client with Ctrl+C
  8. attempt step 2 again; observe: the first DROP fails with an error, then the client hangs (error: pq: missing fk back reference to quotes.quotes_episode_idx from episodes.primary)
  9. interrupt the client again with Ctrl+C
  10. attempt step 2 again; observe: major confusion
CREATE DATABASE
SET
pq: missing fk back reference to quotes.quotes_episode_idx from episodes.primary
pq: table "episodes" has been deleted
pq: relation "episodes" already exists
pq: table is being deleted
pq: relation "quotes" already exists
pq: table is being deleted
Error: pq: table is being deleted
Failed running "sql"

From that point forward the db cannot be used but also cannot be deleted.

@knz knz added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Sep 21, 2016
@knz
Copy link
Contributor Author

knz commented Sep 21, 2016

cc @andreimatei @dt (the issue does not occur if the table being dropped does not contain FKs)

@knz
Copy link
Contributor Author

knz commented Sep 21, 2016

NB: after a while (a few minutes), the error disappears.

@knz
Copy link
Contributor Author

knz commented Sep 22, 2016

After discussing with Andrei:

  • the root cause is that the server does not release its leases properly upon the shutdown in step 4. As a result when the DROP runs for the first time after that, it has to wait for the existing lease to expire (resulting in the client suspending accordingly). The solution to this is either release the lease upon shutdown, or reuse the existing lease after the restart (or both).
  • when the client is interrupted in step 7, the DROP has marked the table as deleted. Consequently, any subsequent attempt to DROP will see the table as deleted and fail accordingly. There is a UX issue with this, which can be addressed in a number of ways:
    • the subsequent DROPs could wait, like the first.
    • the error message could be clarified, e.g. "cannot DROP this table, another DROP is already in progress"
    • the entire DROP logic could start by renaming the table to something else, so that subsequent schema updates that want to re-create a table of the same name can succeed while the initial DROP is in progress.

In any case, once the DROP has been initiated, we should care that a subsequent CREATE TABLE with the same name succeeds without error. Right now the user would see "table has been deleted" upon DROP, and then deduce that it's safe to create, but then CREATE fails with "table already exists.".

This is a catastrophic UX failure.

Related: #8497, #9318.

@knz
Copy link
Contributor Author

knz commented Sep 22, 2016

@dt it would be great if you could expound on the explanation above to clarify where the FK message comes from? pq: missing fk back reference to quotes.quotes_episode_idx from episodes.primary this happens during a 2nd DROP.

@knz knz added this to the 1.0 milestone Sep 22, 2016
@knz knz changed the title sql: schema change after server restart blocks once then corrupts DB after retry sql: schema change after server restart blocks once then looks like DB corruption after retry Sep 22, 2016
@andreimatei
Copy link
Contributor

I'll work on improving sql draining to release those leases... I'm real scared.

@vivekmenezes
Copy link
Contributor

@andreimatei are you really working on releasing those leases. I don't think we should do that for 1.0

@knz
Copy link
Contributor Author

knz commented Oct 22, 2016

Vivek I'm not sure whether releasing the leases are the best way to address the issue at hand but the UX problem initially reported here must absolutely be fixed before 1.0 IMO.

@andreimatei
Copy link
Contributor

(I've replied this by email 2 days ago, but for some reason it didn't make it in the issue... weird...)

I'm not working on it; I gave up after investigating draining a bit and getting in the weeds and realizing it's not very easy and I started bringing it up with Peter and others for getting someone to work on it :)
But I think we definitely need to do something about it soon. It has come up multiple times from external people in the context of not being able to delete tables.

@petermattis
Copy link
Collaborator

Draining table leases when a server is stopped doesn't sound too bad, but then I haven't looked at these weeds recently. Alternately, can we clear any previously held table leases when a server restarts?

Re: double DROP of a table. All of the proposed UX options sound reasonable. This sounds like the easiest short term fix:

the error message could be clarified, e.g. "cannot DROP this table, another DROP is already in progress"

@andreimatei
Copy link
Contributor

I'm not working on it; I gave up after investigating draining a bit and
realizing it's not very easy and I started bringing it up with Peter and
others for getting someone to work on it :)
But I think we definitely need to do something about it soon. It has come
up multiple times from external people in the context of not being able to
delete tables.

On Fri, Oct 21, 2016 at 5:36 PM, vivekmenezes notifications@github.com
wrote:

@andreimatei https://github.com/andreimatei are you really working on
releasing those leases. I don't think we should do that for 1.0


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#9493 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAXBcYlUUwI-IgcT0xqJHO606ESecugVks5q2TBogaJpZM4KDYRw
.

@vivekmenezes
Copy link
Contributor

With all our schema changes being chunked now we certainly can reduce the
timeout to perhaps 1 minute. That along with better error messages should
suffice for 1.0

On Sun, Oct 23, 2016, 2:44 PM Andrei Matei notifications@github.com wrote:

(I've replied this by email 2 days ago, but for some reason it didn't make
it in the issue... weird...)

I'm not working on it; I gave up after investigating draining a bit and
getting in the weeds and realizing it's not very easy and I started
bringing it up with Peter and others for getting someone to work on it :)
But I think we definitely need to do something about it soon. It has come
up multiple times from external people in the context of not being able to
delete tables.


You are receiving this because you were assigned.
Reply to this email directly, view it on GitHub
#9493 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/ALOpBJAtYiNcPJbdiUxBUQmPGPQyA8nZks5q26sQgaJpZM4KDYRw
.

@vivekmenezes
Copy link
Contributor

@knz do let me know what you'd like to see besides what I've done in
#10063

On Sat, Oct 22, 2016 at 4:41 AM kena notifications@github.com wrote:

Vivek I'm not sure whether releasing the leases are the best way to
address the issue at hand but the UX problem initially reported here must
absolutely be fixed before 1.0 IMO.


You are receiving this because you were assigned.
Reply to this email directly, view it on GitHub
#9493 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/ALOpBGwT8nSUeEpTJXMPbvIU2WFoOOEiks5q2cw-gaJpZM4KDYRw
.

@knz knz added the S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. label Nov 2, 2016
@dt
Copy link
Member

dt commented Dec 20, 2016

I think we want an early return in Validate, before checking the cross-table references, if the table is being dropped (as it is then expected that back-references do not exist).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.
Projects
None yet
Development

No branches or pull requests

6 participants