jobs,sql: transient errors can leave descriptors permanently corrupted #66685

lucacri · 2021-06-21T20:41:33Z

Describe the problem

Hello, I have a mid size deployment of cockroachDB (12 nodes, 288vcores total). A few days ago I attempted to ALTER a table with the following SQL:

ALTER TABLE messages
	ADD COLUMN sent_f      BOOL NOT NULL AS (sent::BOOL) STORED,
	ADD COLUMN bounced_f   BOOL NOT NULL AS (bounced::BOOL) STORED,
	ADD COLUMN delivered_f BOOL NOT NULL AS (delivered::BOOL) STORED,
	ADD COLUMN opened_f    BOOL NOT NULL AS (opened::BOOL) STORED,
	ADD COLUMN clicked_f   BOOL NOT NULL AS (clicked::BOOL) STORED;

The process failed, eventually. The table is not that big at 29.4GiB, 153 ranges.

Now every time I try to modify that table, I receive the message:

[0A000] ERROR: relation "messages" (1849): unimplemented: cannot perform a schema change operation while an ALTER COLUMN TYPE schema change is in progress

I checked in the SHOW JOBS and there isn't anything running anymore.

To Reproduce

I assume you might be able to reproduce it by altering a table, making it fail, and then try again later.

Expected behavior
Release the table from the state of "being under schema change"

Additional context
Is there any other command to find the state of a table? Any internal table/command to release it?

Epic: CRDB-7912

The text was updated successfully, but these errors were encountered:

blathers-crl · 2021-06-21T20:41:37Z

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

@ajwerner (found keywords: ALTER TABLE)
@jordanlewis (found keywords: ALTER TABLE)

If we have not gotten back to your issue within a few business days, you can try the following:

Join our community slack channel and ask on #cockroachdb.
Try find someone from here if you know they worked closely on the area and CC them.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.}

ajwerner · 2021-06-21T21:11:43Z

It'd be cool to know the reason that the schema change failed as well as the version. I can help you to repair it and, hopefully, ensure that we've fixed whatever bug lead to the problem.

lucacri · 2021-06-21T21:18:30Z

Honestly, I don't remember precisely because we just migrated from MySQL to Cockroach and I had to do a lot of manual changes.

What I think happened is that when I imported from MySQL, Cockroach converted the bool fields to INT2, so I had to run several alter table messages alter column colname type BOOLEAN using colname::BOOLEAN. The problem is that this kind of alter was very impactful to the load, and unreliable sometimes.

That's why I'm now doing the alters in a roundabout way:

create a stored field named field_f ( ALTER TABLE messages ADD COLUMN sent_f BOOL NOT NULL AS (sent::BOOL) STORED;)
then run the following for each field:

ALTER TABLE messages
	ALTER COLUMN sent_f DROP STORED,
	RENAME COLUMN sent TO sent_bk,
	RENAME COLUMN sent_f TO sent,
	ALTER COLUMN sent SET DEFAULT FALSE;

eventually drop the _bk column

I noticed that in this way the change is seamless and has no data loss at all.

ajwerner · 2021-06-21T21:24:57Z

In principle the alter column type ought to be doing roughly exactly that. Sorry you hit issues. Unfortunately we'll need to manually repair the table. We've generally gotten better about being robust to this class of failure and are actively working to make schema changes more bulletproof. Which version are you running? Also, you'll need to give me access to the table descriptor (which does specify its schema) in order to help you repair it. You can get that with:

SELECT encode(descriptor, 'hex') FROM system.descriptor WHERE id = '<table name>'::regclass;

If you don't feel comfortable sharing the schema in public, let me know.

lucacri · 2021-06-21T21:32:02Z

No problem sharing this table schema, it's a pretty standard one :)

Here it is

               encode
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  0ae10c0a086d6573736167657318b90e20eb0d28063a00423e0a02696410011a0c08011040180030005014600020002a186e65787476616c28313834383a3a3a524547434c41535329300050b80e680070007800800100422c0a0d64697363757373696f6e5f696410021a0c0801104018003000501460002000300068007000780080010042340a0b66726f6d5f636c69656e7410031a0c08011010180030005015600020002a08303a3a3a494e5438300068007000780080010042260a07757365725f696410041a0c0801104018003000501460002001300068007000780080010042260a076d65737361676510051a0c08071000180030005019600020003000680070007800800100422f0a10636f6d6d756e69636174696f6e5f696410061a0c08011040180030005014600020013000680070007800800100422a0a0a637265617465645f617410071a0d080910001800300050a009600020013000680070007800800100422a0a0a757064617465645f617410081a0d080910001800300050a00960002001300068007000780080010042280a077375626a65637410091a0e080710fe0118003007509308600020013000680070007800800100422d0a0473656e74100a1a0c08011010180030005015600020002a08303a3a3a494e5438300068007000780080010042300a07626f756e636564100b1a0c08011010180030005015600020002a08303a3a3a494e5438300068007000780080010042320a0964656c697665726564100c1a0c08011010180030005015600020002a08303a3a3a494e54383000680070007800800100422f0a066f70656e6564100d1a0c08011010180030005015600020002a08303a3a3a494e5438300068007000780080010042300a07636c69636b6564100e1a0c08011010180030005015600020002a08303a3a3a494e543830006800700078008001004810524d0a077072696d6172791001180122026964300140004a10080010001a00200028003000380040005a007a0408002000800100880100900103980100a20106080012001800a80100b20100ba01005a710a1e6d657373616765735f64697363757373696f6e5f69645f666f726569676e10021800220d64697363757373696f6e5f69643002380140004a10080010001a00200028003000380040005a007a0408002000800100880100900103980100a20106080012001800a80100b20100ba010060036a2c0a090a0561646d696e10020a0d0a09686d6462757365723110020a080a04726f6f7410021204726f6f74180172500a460a0d66726f6d5f636c69656e745f31100f1a0c08001000180030005010600020002a0566616c736530005a1166726f6d5f636c69656e743a3a424f4f4c6800700078008001001801200228013801729502528a02080f10031a8302637264625f696e7465726e616c2e666f7263655f6572726f7228273033303030272c20636f6e6361742827636f6c756d6e2066726f6d5f636c69656e7420697320756e646572676f696e672074686520414c54455220434f4c554d4e2054595045205553494e472045585052455353494f4e20736368656d61206368616e67652c20696e736572747320617265206e6f7420737570706f7274656420756e74696c2074686520736368656d61206368616e67652069732066696e616c697a65642c20272c2027747269656420746f20696e7365727420272c2066726f6d5f636c69656e743a3a535452494e472c202720696e746f2066726f6d5f636c69656e742729291801200228013801800102880103980100b201ca010a077072696d61727910001a0269641a0d64697363757373696f6e5f69641a0b66726f6d5f636c69656e741a07757365725f69641a076d6573736167651a10636f6d6d756e69636174696f6e5f69641a0a637265617465645f61741a0a757064617465645f61741a077375626a6563741a0473656e741a07626f756e6365641a0964656c6976657265641a066f70656e65641a07636c69636b65641a0d66726f6d5f636c69656e745f31200120022003200420052006200720082009200a200b200c200d200e200f2800b80101c20100da010c0801108380829deaeb99a309e80100f2010408001200f801008002009202009a020a08e5b9e3f1a491a0c316a2023208b90e1002180120910e2a1e6d657373616765735f64697363757373696f6e5f69645f666f726569676e3001380440004800b20200b80200c0021dc80200e00200
(1 row)

Time: 6ms total (execution 3ms / network 3ms)

I hope I copied it correctly. Meanwhile, thank you for your help, it's really appreciated

ajwerner · 2021-06-22T12:04:18Z

Please let me know which version you're using so I can help provide a fix.

lucacri · 2021-06-22T13:18:20Z

I'm using v21.1.2

Server version: CockroachDB CCL v21.1.2 (x86_64-unknown-linux-gnu, built 2021/06/07 18:09:50, go1.15.11)

ajwerner · 2021-06-22T13:40:43Z

Alright, this one is unfortunate. It failed to do the ALTER COLUMN TYPE and then it tried to rollback and that hasn't definitely finished. One thing I'd like to see, if it exists is:

SELECT id, status, created, encode(payload, 'hex') as payload, encode(progress, 'hex') as progress from system.jobs where id = 668335250880888835;

I'll come up with something for you soon that should do the trick. Just so I understand, how live is this table? Is it driving a production app?

lucacri · 2021-06-22T13:44:16Z

Something did come up:

  668335250880888835 | failed | 2021-06-18 15:28:52.276572 | 0ab20120414c54455220434f4c554d4e2066726f6d5f636c69656e74205345542044415441205459504520424f4f4c205553494e472066726f6d5f636c69656e743a3a424f4f4c3b414c544552205441424c452068656c6c6f6d616e61676572732e7075626c69632e6d6573736167657320414c54455220434f4c554d4e2066726f6d5f636c69656e74205345542044415441205459504520424f4f4c205553494e472066726f6d5f636c69656e743a3a424f4f4c1209686d646275736572311888a79982c0a1f10220b491bcf5c0a1f1023202b90e4285036a6f62203636383333353235303838303838383833353a2063616e6e6f742062652072657665727465642c206d616e75616c20636c65616e7570206d61792062652072657175697265643a206661696c656420746f20636f6e6e65637420746f206e3132206174206d792d636f636b726f61636864622d372e6d792d636f636b726f61636864622e6c6f6f70737061726b2d70726f64756374696f6e2e7376632e636c75737465722e6c6f63616c3a32363235373a20696e697469616c20636f6e6e656374696f6e20686561727462656174206661696c65643a20727063206572726f723a20636f6465203d205065726d697373696f6e44656e6965642064657363203d206e313220776173207065726d616e656e746c792072656d6f7665642066726f6d2074686520636c757374657220617420323032312d30362d31372031373a33383a32362e303232363335373833202b30303030205554433b206974206973206e6f7420616c6c6f77656420746f2072656a6f696e2074686520636c75737465726229120e0a0c1a04f70739892204f707398a120e0a0c1a04f70739892204f707398a28b90e3001380252009a01f83212f5320a910a128e0a0a89051286050aa9030aa6030a80016e313220776173207065726d616e656e746c792072656d6f7665642066726f6d2074686520636c757374657220617420323032312d30362d31372031373a33383a32362e303232363335373833202b30303030205554433b206974206973206e6f7420616c6c6f77656420746f2072656a6f696e2074686520636c757374657212a0020a34676f6f676c652e676f6c616e672e6f72672f677270632f696e7465726e616c2f7374617475732f2a7374617475732e4572726f7212360a34676f6f676c652e676f6c616e672e6f72672f677270632f696e7465726e616c2f7374617475732f2a7374617475732e4572726f7222af010a25747970652e676f6f676c65617069732e636f6d2f676f6f676c652e7270632e53746174757312850108071280016e313220776173207065726d616e656e746c792072656d6f7665642066726f6d2074686520636c757374657220617420323032312d30362d31372031373a33383a32362e303232363335373833202b30303030205554433b206974206973206e6f7420616c6c6f77656420746f2072656a6f696e2074686520636c75737465721223696e697469616c20636f6e6e656374696f6e20686561727462656174206661696c65641ab2010a566769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f7574696c2f6e65747574696c2f2a6e65747574696c2e496e697469616c4865617274626561744661696c65644572726f7212580a566769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f7574696c2f6e65747574696c2f2a6e65747574696c2e496e697469616c4865617274626561744661696c65644572726f7212b9026661696c656420746f20636f6e6e65637420746f206e3132206174206d792d636f636b726f61636864622d372e6d792d636f636b726f61636864622e6c6f6f70737061726b2d70726f64756374696f6e2e7376632e636c75737465722e6c6f63616c3a32363235373a20696e697469616c20636f6e6e656374696f6e20686561727462656174206661696c65643a20727063206572726f723a20636f6465203d205065726d697373696f6e44656e6965642064657363203d206e313220776173207065726d616e656e746c792072656d6f7665642066726f6d2074686520636c757374657220617420323032312d30362d31372031373a33383a32362e303232363335373833202b30303030205554433b206974206973206e6f7420616c6c6f77656420746f2072656a6f696e2074686520636c75737465721ac3020a396769746875622e636f6d2f636f636b726f61636864622f6572726f72732f6572727574696c2f2a6572727574696c2e77697468507265666978123b0a396769746875622e636f6d2f636f636b726f61636864622f6572726f72732f6572727574696c2f2a6572727574696c2e776974685072656669781a1e6661696c656420746f20636f6e6e65637420746f206e313220617420c39722a8010a34747970652e676f6f676c65617069732e636f6d2f636f636b726f6163682e6572726f727370622e537472696e675061796c6f616412700a6e6661696c656420746f20636f6e6e65637420746f206e313220617420e280b96d792d636f636b726f61636864622d372e6d792d636f636b726f61636864622e6c6f6f70737061726b2d70726f64756374696f6e2e7376632e636c75737465722e6c6f63616c3a3236323537e280ba1ade280a3c6769746875622e636f6d2f636f636b726f61636864622f6572726f72732f77697468737461636b2f2a77697468737461636b2e77697468537461636b123e0a3c6769746875622e636f6d2f636f636b726f61636864622f6572726f72732f77697468737461636b2f2a77697468737461636b2e77697468537461636b1add270a6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f7270632f6e6f64656469616c65722e282a4469616c6572292e6469616c0a092f676f2f7372632f6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f7270632f6e6f64656469616c65722f6e6f64656469616c65722e676f3a3136390a6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f7270632f6e6f64656469616c65722e282a4469616c6572292e4469616c496e7465726e616c436c69656e740a092f676f2f7372632f6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f7270632f6e6f64656469616c65722f6e6f64656469616c65722e676f3a3133330a6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762f6b76636c69656e742f6b76636f6f72642e282a677270635472616e73706f7274292e4e657874496e7465726e616c436c69656e740a092f676f2f7372632f6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762f6b76636c69656e742f6b76636f6f72642f7472616e73706f72742e676f3a3232320a6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762f6b76636c69656e742f6b76636f6f72642e282a677270635472616e73706f7274292e53656e644e6578740a092f676f2f7372632f6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762f6b76636c69656e742f6b76636f6f72642f7472616e73706f72742e676f3a3136390a6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762f6b76636c69656e742f6b76636f6f72642e282a4469737453656e646572292e73656e64546f5265706c696361730a092f676f2f7372632f6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762f6b76636c69656e742f6b76636f6f72642f646973745f73656e6465722e676f3a313838390a6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762f6b76636c69656e742f6b76636f6f72642e282a4469737453656e646572292e73656e645061727469616c42617463680a092f676f2f7372632f6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762f6b76636c69656e742f6b76636f6f72642f646973745f73656e6465722e676f3a313530370a6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762f6b76636c69656e742f6b76636f6f72642e282a4469737453656e646572292e646976696465416e6453656e644261746368546f52616e6765730a092f676f2f7372632f6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762f6b76636c69656e742f6b76636f6f72642f646973745f73656e6465722e676f3a313134350a6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762f6b76636c69656e742f6b76636f6f72642e282a4469737453656e646572292e53656e640a092f676f2f7372632f6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762f6b76636c69656e742f6b76636f6f72642f646973745f73656e6465722e676f3a3738340a6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762f6b76636c69656e742f6b76636f6f72642e282a74786e4c6f636b476174656b6565706572292e53656e644c6f636b65640a092f676f2f7372632f6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762f6b76636c69656e742f6b76636f6f72642f74786e5f6c6f636b5f676174656b65657065722e676f3a38360a6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762f6b76636c69656e742f6b76636f6f72642e282a74786e4d65747269635265636f72646572292e53656e644c6f636b65640a092f676f2f7372632f6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762f6b76636c69656e742f6b76636f6f72642f74786e5f696e746572636570746f725f6d65747269635f7265636f726465722e676f3a34360a6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762f6b76636c69656e742f6b76636f6f72642e282a74786e436f6d6d6974746572292e53656e644c6f636b65640a092f676f2f7372632f6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762f6b76636c69656e742f6b76636f6f72642f74786e5f696e746572636570746f725f636f6d6d69747465722e676f3a3132360a6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762f6b76636c69656e742f6b76636f6f72642e282a74786e5370616e526566726573686572292e73656e644c6f636b65645769746852656672657368417474656d7074730a092f676f2f7372632f6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762f6b76636c69656e742f6b76636f6f72642f74786e5f696e746572636570746f725f7370616e5f7265667265736865722e676f3a3236370a6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762f6b76636c69656e742f6b76636f6f72642e282a74786e5370616e526566726573686572292e53656e644c6f636b65640a092f676f2f7372632f6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762f6b76636c69656e742f6b76636f6f72642f74786e5f696e746572636570746f725f7370616e5f7265667265736865722e676f3a3230320a6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762f6b76636c69656e742f6b76636f6f72642e282a74786e506970656c696e6572292e53656e644c6f636b65640a092f676f2f7372632f6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762f6b76636c69656e742f6b76636f6f72642f74786e5f696e746572636570746f725f706970656c696e65722e676f3a3235350a6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762f6b76636c69656e742f6b76636f6f72642e282a74786e5365714e756d416c6c6f6361746f72292e53656e644c6f636b65640a092f676f2f7372632f6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762f6b76636c69656e742f6b76636f6f72642f74786e5f696e746572636570746f725f7365715f6e756d5f616c6c6f6361746f722e676f3a3130350a6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762f6b76636c69656e742f6b76636f6f72642e282a74786e4865617274626561746572292e53656e644c6f636b65640a092f676f2f7372632f6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762f6b76636c69656e742f6b76636f6f72642f74786e5f696e746572636570746f725f68656172746265617465722e676f3a3137310a6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762f6b76636c69656e742f6b76636f6f72642e282a54786e436f6f726453656e646572292e53656e640a092f676f2f7372632f6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762f6b76636c69656e742f6b76636f6f72642f74786e5f636f6f72645f73656e6465722e676f3a3531330a6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762e282a4442292e73656e645573696e6753656e6465720a092f676f2f7372632f6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762f64622e676f3a3830390a6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762e282a54786e292e53656e640a092f676f2f7372632f6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762f74786e2e676f3a3935310a6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f73716c2f726f772e6d616b654b564261746368466574636865722e66756e63310a092f676f2f7372632f6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f73716c2f726f772f6b765f62617463685f666574636865722e676f3a3231370a6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f73716c2f726f772e282a74786e4b5646657463686572292e66657463680a092f676f2f7372632f6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f73716c2f726f772f6b765f62617463685f666574636865722e676f3a3337390a6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f73716c2f726f772e282a74786e4b5646657463686572292e6e65787442617463680a092f676f2f7372632f6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f73716c2f726f772f6b765f62617463685f666574636865722e676f3a3438300a6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f73716c2f726f772e282a4b5646657463686572292e4e6578744b560a092f676f2f7372632f6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f73716c2f726f772f6b765f666574636865722e676f3a3132390a6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f73716c2f726f772e282a46657463686572292e4e6578744b65790a092f676f2f7372632f6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f73716c2f726f772f666574636865722e676f3a3730370a6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f73716c2f726f772e282a46657463686572292e53746172745363616e46726f6d0a092f676f2f7372632f6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f73716c2f726f772f666574636865722e676f3a3639370a6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f73716c2f726f772e282a46657463686572292e53746172745363616e0a092f676f2f7372632f6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f73716c2f726f772f666574636865722e676f3a3538370a6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f73716c2f6261636b66696c6c2e282a436f6c756d6e4261636b66696c6c6572292e52756e436f6c756d6e4261636b66696c6c4368756e6b0a092f676f2f7372632f6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f73716c2f6261636b66696c6c2f6261636b66696c6c2e676f3a3239380a6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f73716c2f726f77657865632e282a636f6c756d6e4261636b66696c6c6572292e72756e4368756e6b2e66756e63310a092f676f2f7372632f6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f73716c2f726f77657865632f636f6c756d6e6261636b66696c6c65722e676f3a3132330a6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762e282a4442292e54786e2e66756e63310a092f676f2f7372632f6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762f64622e676f3a3737330a6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762e282a54786e292e657865630a092f676f2f7372632f6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762f74786e2e676f3a3834330a6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762e282a4442292e54786e0a092f676f2f7372632f6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f6b762f64622e676f3a3737320a6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f73716c2f726f77657865632e282a636f6c756d6e4261636b66696c6c6572292e72756e4368756e6b0a092f676f2f7372632f6769746875622e636f6d2f636f636b726f61636864622f636f636b726f6163682f706b672f73716c2f726f77657865632f636f6c756d6e6261636b66696c6c65722e676f3a313036 | 10c4c49abcc0a1f1022211706f70756c6174696e6720736368656d616200

The table is very much in production and live. It's not hammered with reads as much, but the writes are very important (it tracks SMS/emails sent and received from our customers, so I can't miss any of them in fear of duplicate messages being sent)

ajwerner · 2021-06-22T16:14:45Z

At its core, this was due to transient errors being treated as permanent. This is especially bad in the revert. We've repaired the cluster. I'm going to link this to #44594. The basic idea is that we should only fail to revert on very specific errors rather than all errors. The steps to do that are:

First implement some exponential backoff running the jobs.
Change the default to fail out of reverting to treat all errors as retriable rather than as permanent.

Previously, only non-cancelable reverting jobs were retried by default. This commit makes all reverting jobs to retry when they fail. As a result, reverting jobs do not fail due to transient errors. Release justification: a bug fix and high benefit changes to existing functionality Release note: None Fixes: cockroachdb#66685

Previously, only non-cancelable reverting jobs were retried by default. This commit makes all reverting jobs to retry when they fail. As a result, reverting jobs do not fail due to transient errors. Release justification: a bug fix and low-risk updates to new functionality. Release note: Jobs that perform reverting tasks do not fail. Instead, they are retried with exponential-backoff if an error is encountered while reverting. As a result, transient errors do not impact jobs that are reverting. Fixes: cockroachdb#66685

Previously, non-cancelable jobs were retried in running state only if their errors were marked as retryable. Moreover, only non-cancelable reverting jobs were retried by default. This commit makes non-cancelable jobs always retry in running state unless their error is marked as a permanent error. In addition, this commit makes all reverting jobs to retry when they fail. As a result, non-cancelable running jobs and all reverting jobs do not fail due to transient errors. Release justification: low-risk updates to new functionality. Release note (general change): Non-cancelable jobs, such as schema-change GC jobs, now do not fail unless they fail with a permanent error. They retry with exponential-backoff if they fail due to a transient error. Release note (general change): Non-cancelable jobs, such as schema-change GC jobs, now do not fail unless they fail with a permanent error. They retry with exponential-backoff if they fail due to a transient error. Furthermore, Jobs that perform reverting tasks do not fail. Instead, they are retried with exponential-backoff if an error is encountered while reverting. As a result, transient errors do not impact jobs that are reverting. Fixes: cockroachdb#66685

Previously, non-cancelable jobs were retried in running state only if their errors were marked as retryable. Moreover, only non-cancelable reverting jobs were retried by default. This commit makes non-cancelable jobs always retry in running state unless their error is marked as a permanent error. In addition, this commit makes all reverting jobs to retry when they fail. As a result, non-cancelable running jobs and all reverting jobs do not fail due to transient errors. Release justification: low-risk updates to new functionality. Release note (general change): Non-cancelable jobs, such as schema-change GC jobs, now do not fail unless they fail with a permanent error. They retry with exponential-backoff if they fail due to a transient error. Furthermore, Jobs that perform reverting tasks do not fail. Instead, they are retried with exponential-backoff if an error is encountered while reverting. As a result, transient errors do not impact jobs that are reverting. Fixes: cockroachdb#66685

69300: jobs: retry non-cancelable running and all reverting jobs r=ajwerner a=sajjadrizvi Previously, non-cancelable jobs were retried in running state only if their errors were marked as retryable. Moreover, only non-cancelable reverting jobs were retried by default. This commit makes non-cancelable jobs always retry in running state unless their error is marked as a permanent error. In addition, this commit makes all reverting jobs to retry when they fail. As a result, non-cancelable running jobs and all reverting jobs do not fail due to transient errors. Release justification: low-risk updates to new functionality. Release note (general change): Non-cancelable jobs, such as schema-change GC jobs, now do not fail unless they fail with a permanent error. They retry with exponential-backoff if they fail due to a transient error. Furthermore, Jobs that perform reverting tasks do not fail. Instead, they are retried with exponential-backoff if an error is encountered while reverting. As a result, transient errors do not impact the jobs that are reverting. Fixes: #66685 69982: docs/tech-notes: admission control overview r=sumeerbhola a=sumeerbhola Release justification: Non-production code change. Release note: None 70094: tenantcostserver: fix erroneous panic in tests r=RaduBerinde a=RaduBerinde The test-only code that checks the invariants of the `tenant_usage` table inadvertently panics if the query hits an error (such as one that would be expected if the server is shutting down). We now just log the error instead. Fixes #70089. Release note: None Release justification: non-production code change to fix test failure. 70095: tenantcostclient: restrict allowed configuration from the tenant side r=RaduBerinde a=RaduBerinde This change restricts the configuration of tenant cost control from the tenant side. In the future, we will want to have settings where the values come from the host cluster but we don't have that infrastructure today. With tenants being able to set their own settings, they could easily sabotage the cost control mechanisms. This change restricts the allowed values for the target period and the CPU usage allowance, and fixes the cost model configuration to the default. Release note: None Release justification: Necessary fix for the distributed rate limiting functionality, which is vital for the upcoming Serverless MVP release. It allows CRDB to throttle clusters that have run out of free or paid request units (which measure CPU and I/O usage). This functionality is only enabled in multi-tenant scenarios and should have no impact on our dedicated customers. 70102: sql: clean up large row errors and events r=knz,yuzefovich a=michae2 Addresses: #67400, #69477 Remove ViolatesMaxRowSizeErr from CommonLargeRowDetails, as was done for CommonTxnRowsLimitDetails in #69945. Also remove the SafeDetails methods from CommonLargeRowDetails, txnRowsReadLimitErr, and txnRowsWrittenLimitErr, as I don't think we need them. Release note: None (there was no release between the introduction of the LargeRow and LargeRowInternal events and this commit). 70118: kv: lock mutexes for `TxnCoordSender.Epoch()` and `Txn.status()` r=ajwerner a=erikgrinaker ### kvcoord: lock mutex in `TxnCoordSender.Epoch()` Methods that access `TxnCoordSender.mu` fields must lock the mutex first. `Epoch()` didn't. Resolves #70071. Release note: None ### kv: fix mutex locking for `Txn.status` `Txn.status()` fetches the transaction status from the mutex-protected `Txn.mu.sender` field, but callers did not take out the mutex lock when calling it. This patch renames the method to `Txn.statusLocked()`, and updates all callers to take out the lock before calling it. Release note: None Co-authored-by: Sajjad Rizvi <sajjad@cockroachlabs.com> Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com> Co-authored-by: Radu Berinde <radu@cockroachlabs.com> Co-authored-by: Michael Erickson <michae2@cockroachlabs.com> Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>

Previously, non-cancelable jobs were retried in running state only if their errors were marked as retryable. Moreover, only non-cancelable reverting jobs were retried by default. This commit makes non-cancelable jobs always retry in running state unless their error is marked as a permanent error. In addition, this commit makes all reverting jobs to retry when they fail. As a result, non-cancelable running jobs and all reverting jobs do not fail due to transient errors. Release justification: low-risk updates to new functionality. Release note (general change): Non-cancelable jobs, such as schema-change GC jobs, now do not fail unless they fail with a permanent error. They retry with exponential-backoff if they fail due to a transient error. Furthermore, Jobs that perform reverting tasks do not fail. Instead, they are retried with exponential-backoff if an error is encountered while reverting. As a result, transient errors do not impact jobs that are reverting. Fixes: cockroachdb#66685

lucacri added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jun 21, 2021

blathers-crl bot added O-community Originated from the community X-blathers-triaged blathers was able to find an owner labels Jun 21, 2021

ajwerner self-assigned this Jun 22, 2021

ajwerner changed the title ~~Table in constant "schema change" state~~ jobs,sql: transient errors can leave descriptors permanently corrupted Jun 22, 2021

ajwerner assigned sajjadrizvi and unassigned ajwerner Jun 30, 2021

ajwerner mentioned this issue Jul 13, 2021

ALTER table failed and job stuck in reverting. Field now dropped and not able to recreate it anymore #67465

Closed

sajjadrizvi mentioned this issue Aug 18, 2021

jobs: always retry non-cancelable reverting jobs #69087

Merged

sajjadrizvi mentioned this issue Aug 26, 2021

jobs: retry non-cancelable running and all reverting jobs #69300

Merged

craig bot closed this as completed in d7ab27e Sep 13, 2021

This was referenced Sep 20, 2021

release-21.2: jobs: retry non-cancelable running and all reverting jobs #70418

Closed

release-21.2: jobs: allow create table as select jobs to be canceled #70752

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

jobs,sql: transient errors can leave descriptors permanently corrupted #66685

jobs,sql: transient errors can leave descriptors permanently corrupted #66685

lucacri commented Jun 21, 2021 •

edited by postamar

Loading

blathers-crl bot commented Jun 21, 2021

ajwerner commented Jun 21, 2021

lucacri commented Jun 21, 2021

ajwerner commented Jun 21, 2021

lucacri commented Jun 21, 2021

ajwerner commented Jun 22, 2021

lucacri commented Jun 22, 2021

ajwerner commented Jun 22, 2021

lucacri commented Jun 22, 2021

ajwerner commented Jun 22, 2021 •

edited

Loading

jobs,sql: transient errors can leave descriptors permanently corrupted #66685

jobs,sql: transient errors can leave descriptors permanently corrupted #66685

Comments

lucacri commented Jun 21, 2021 • edited by postamar Loading

blathers-crl bot commented Jun 21, 2021

ajwerner commented Jun 21, 2021

lucacri commented Jun 21, 2021

ajwerner commented Jun 21, 2021

lucacri commented Jun 21, 2021

ajwerner commented Jun 22, 2021

lucacri commented Jun 22, 2021

ajwerner commented Jun 22, 2021

lucacri commented Jun 22, 2021

ajwerner commented Jun 22, 2021 • edited Loading

lucacri commented Jun 21, 2021 •

edited by postamar

Loading

ajwerner commented Jun 22, 2021 •

edited

Loading