-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
jobs,sql: transient errors can leave descriptors permanently corrupted #66685
Comments
Hello, I am Blathers. I am here to help you get the issue triaged. Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here. I have CC'd a few people who may be able to assist you:
If we have not gotten back to your issue within a few business days, you can try the following:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
It'd be cool to know the reason that the schema change failed as well as the version. I can help you to repair it and, hopefully, ensure that we've fixed whatever bug lead to the problem. |
Honestly, I don't remember precisely because we just migrated from MySQL to Cockroach and I had to do a lot of manual changes. What I think happened is that when I imported from MySQL, Cockroach converted the That's why I'm now doing the alters in a roundabout way:
ALTER TABLE messages
ALTER COLUMN sent_f DROP STORED,
RENAME COLUMN sent TO sent_bk,
RENAME COLUMN sent_f TO sent,
ALTER COLUMN sent SET DEFAULT FALSE;
I noticed that in this way the change is seamless and has no data loss at all. |
In principle the alter column type ought to be doing roughly exactly that. Sorry you hit issues. Unfortunately we'll need to manually repair the table. We've generally gotten better about being robust to this class of failure and are actively working to make schema changes more bulletproof. Which version are you running? Also, you'll need to give me access to the table descriptor (which does specify its schema) in order to help you repair it. You can get that with: SELECT encode(descriptor, 'hex') FROM system.descriptor WHERE id = '<table name>'::regclass; If you don't feel comfortable sharing the schema in public, let me know. |
No problem sharing this table schema, it's a pretty standard one :) Here it is
I hope I copied it correctly. Meanwhile, thank you for your help, it's really appreciated |
Please let me know which version you're using so I can help provide a fix. |
I'm using v21.1.2
|
Alright, this one is unfortunate. It failed to do the SELECT id, status, created, encode(payload, 'hex') as payload, encode(progress, 'hex') as progress from system.jobs where id = 668335250880888835; I'll come up with something for you soon that should do the trick. Just so I understand, how live is this table? Is it driving a production app? |
Something did come up:
The table is very much in production and live. It's not hammered with reads as much, but the writes are very important (it tracks SMS/emails sent and received from our customers, so I can't miss any of them in fear of duplicate messages being sent) |
At its core, this was due to transient errors being treated as permanent. This is especially bad in the revert. We've repaired the cluster. I'm going to link this to #44594. The basic idea is that we should only fail to revert on very specific errors rather than all errors. The steps to do that are:
|
Previously, only non-cancelable reverting jobs were retried by default. This commit makes all reverting jobs to retry when they fail. As a result, reverting jobs do not fail due to transient errors. Release justification: a bug fix and high benefit changes to existing functionality Release note: None Fixes: cockroachdb#66685
Previously, only non-cancelable reverting jobs were retried by default. This commit makes all reverting jobs to retry when they fail. As a result, reverting jobs do not fail due to transient errors. Release justification: a bug fix and low-risk updates to new functionality. Release note: Jobs that perform reverting tasks do not fail. Instead, they are retried with exponential-backoff if an error is encountered while reverting. As a result, transient errors do not impact jobs that are reverting. Fixes: cockroachdb#66685
Previously, non-cancelable jobs were retried in running state only if their errors were marked as retryable. Moreover, only non-cancelable reverting jobs were retried by default. This commit makes non-cancelable jobs always retry in running state unless their error is marked as a permanent error. In addition, this commit makes all reverting jobs to retry when they fail. As a result, non-cancelable running jobs and all reverting jobs do not fail due to transient errors. Release justification: low-risk updates to new functionality. Release note (general change): Non-cancelable jobs, such as schema-change GC jobs, now do not fail unless they fail with a permanent error. They retry with exponential-backoff if they fail due to a transient error. Release note (general change): Non-cancelable jobs, such as schema-change GC jobs, now do not fail unless they fail with a permanent error. They retry with exponential-backoff if they fail due to a transient error. Furthermore, Jobs that perform reverting tasks do not fail. Instead, they are retried with exponential-backoff if an error is encountered while reverting. As a result, transient errors do not impact jobs that are reverting. Fixes: cockroachdb#66685
Previously, non-cancelable jobs were retried in running state only if their errors were marked as retryable. Moreover, only non-cancelable reverting jobs were retried by default. This commit makes non-cancelable jobs always retry in running state unless their error is marked as a permanent error. In addition, this commit makes all reverting jobs to retry when they fail. As a result, non-cancelable running jobs and all reverting jobs do not fail due to transient errors. Release justification: low-risk updates to new functionality. Release note (general change): Non-cancelable jobs, such as schema-change GC jobs, now do not fail unless they fail with a permanent error. They retry with exponential-backoff if they fail due to a transient error. Release note (general change): Non-cancelable jobs, such as schema-change GC jobs, now do not fail unless they fail with a permanent error. They retry with exponential-backoff if they fail due to a transient error. Furthermore, Jobs that perform reverting tasks do not fail. Instead, they are retried with exponential-backoff if an error is encountered while reverting. As a result, transient errors do not impact jobs that are reverting. Fixes: cockroachdb#66685
Previously, non-cancelable jobs were retried in running state only if their errors were marked as retryable. Moreover, only non-cancelable reverting jobs were retried by default. This commit makes non-cancelable jobs always retry in running state unless their error is marked as a permanent error. In addition, this commit makes all reverting jobs to retry when they fail. As a result, non-cancelable running jobs and all reverting jobs do not fail due to transient errors. Release justification: low-risk updates to new functionality. Release note (general change): Non-cancelable jobs, such as schema-change GC jobs, now do not fail unless they fail with a permanent error. They retry with exponential-backoff if they fail due to a transient error. Furthermore, Jobs that perform reverting tasks do not fail. Instead, they are retried with exponential-backoff if an error is encountered while reverting. As a result, transient errors do not impact jobs that are reverting. Fixes: cockroachdb#66685
Previously, non-cancelable jobs were retried in running state only if their errors were marked as retryable. Moreover, only non-cancelable reverting jobs were retried by default. This commit makes non-cancelable jobs always retry in running state unless their error is marked as a permanent error. In addition, this commit makes all reverting jobs to retry when they fail. As a result, non-cancelable running jobs and all reverting jobs do not fail due to transient errors. Release justification: low-risk updates to new functionality. Release note (general change): Non-cancelable jobs, such as schema-change GC jobs, now do not fail unless they fail with a permanent error. They retry with exponential-backoff if they fail due to a transient error. Furthermore, Jobs that perform reverting tasks do not fail. Instead, they are retried with exponential-backoff if an error is encountered while reverting. As a result, transient errors do not impact jobs that are reverting. Fixes: cockroachdb#66685
Previously, non-cancelable jobs were retried in running state only if their errors were marked as retryable. Moreover, only non-cancelable reverting jobs were retried by default. This commit makes non-cancelable jobs always retry in running state unless their error is marked as a permanent error. In addition, this commit makes all reverting jobs to retry when they fail. As a result, non-cancelable running jobs and all reverting jobs do not fail due to transient errors. Release justification: low-risk updates to new functionality. Release note (general change): Non-cancelable jobs, such as schema-change GC jobs, now do not fail unless they fail with a permanent error. They retry with exponential-backoff if they fail due to a transient error. Furthermore, Jobs that perform reverting tasks do not fail. Instead, they are retried with exponential-backoff if an error is encountered while reverting. As a result, transient errors do not impact jobs that are reverting. Fixes: cockroachdb#66685
Previously, non-cancelable jobs were retried in running state only if their errors were marked as retryable. Moreover, only non-cancelable reverting jobs were retried by default. This commit makes non-cancelable jobs always retry in running state unless their error is marked as a permanent error. In addition, this commit makes all reverting jobs to retry when they fail. As a result, non-cancelable running jobs and all reverting jobs do not fail due to transient errors. Release justification: low-risk updates to new functionality. Release note (general change): Non-cancelable jobs, such as schema-change GC jobs, now do not fail unless they fail with a permanent error. They retry with exponential-backoff if they fail due to a transient error. Furthermore, Jobs that perform reverting tasks do not fail. Instead, they are retried with exponential-backoff if an error is encountered while reverting. As a result, transient errors do not impact jobs that are reverting. Fixes: cockroachdb#66685
Previously, non-cancelable jobs were retried in running state only if their errors were marked as retryable. Moreover, only non-cancelable reverting jobs were retried by default. This commit makes non-cancelable jobs always retry in running state unless their error is marked as a permanent error. In addition, this commit makes all reverting jobs to retry when they fail. As a result, non-cancelable running jobs and all reverting jobs do not fail due to transient errors. Release justification: low-risk updates to new functionality. Release note (general change): Non-cancelable jobs, such as schema-change GC jobs, now do not fail unless they fail with a permanent error. They retry with exponential-backoff if they fail due to a transient error. Furthermore, Jobs that perform reverting tasks do not fail. Instead, they are retried with exponential-backoff if an error is encountered while reverting. As a result, transient errors do not impact jobs that are reverting. Fixes: cockroachdb#66685
69300: jobs: retry non-cancelable running and all reverting jobs r=ajwerner a=sajjadrizvi Previously, non-cancelable jobs were retried in running state only if their errors were marked as retryable. Moreover, only non-cancelable reverting jobs were retried by default. This commit makes non-cancelable jobs always retry in running state unless their error is marked as a permanent error. In addition, this commit makes all reverting jobs to retry when they fail. As a result, non-cancelable running jobs and all reverting jobs do not fail due to transient errors. Release justification: low-risk updates to new functionality. Release note (general change): Non-cancelable jobs, such as schema-change GC jobs, now do not fail unless they fail with a permanent error. They retry with exponential-backoff if they fail due to a transient error. Furthermore, Jobs that perform reverting tasks do not fail. Instead, they are retried with exponential-backoff if an error is encountered while reverting. As a result, transient errors do not impact the jobs that are reverting. Fixes: #66685 69982: docs/tech-notes: admission control overview r=sumeerbhola a=sumeerbhola Release justification: Non-production code change. Release note: None 70094: tenantcostserver: fix erroneous panic in tests r=RaduBerinde a=RaduBerinde The test-only code that checks the invariants of the `tenant_usage` table inadvertently panics if the query hits an error (such as one that would be expected if the server is shutting down). We now just log the error instead. Fixes #70089. Release note: None Release justification: non-production code change to fix test failure. 70095: tenantcostclient: restrict allowed configuration from the tenant side r=RaduBerinde a=RaduBerinde This change restricts the configuration of tenant cost control from the tenant side. In the future, we will want to have settings where the values come from the host cluster but we don't have that infrastructure today. With tenants being able to set their own settings, they could easily sabotage the cost control mechanisms. This change restricts the allowed values for the target period and the CPU usage allowance, and fixes the cost model configuration to the default. Release note: None Release justification: Necessary fix for the distributed rate limiting functionality, which is vital for the upcoming Serverless MVP release. It allows CRDB to throttle clusters that have run out of free or paid request units (which measure CPU and I/O usage). This functionality is only enabled in multi-tenant scenarios and should have no impact on our dedicated customers. 70102: sql: clean up large row errors and events r=knz,yuzefovich a=michae2 Addresses: #67400, #69477 Remove ViolatesMaxRowSizeErr from CommonLargeRowDetails, as was done for CommonTxnRowsLimitDetails in #69945. Also remove the SafeDetails methods from CommonLargeRowDetails, txnRowsReadLimitErr, and txnRowsWrittenLimitErr, as I don't think we need them. Release note: None (there was no release between the introduction of the LargeRow and LargeRowInternal events and this commit). 70118: kv: lock mutexes for `TxnCoordSender.Epoch()` and `Txn.status()` r=ajwerner a=erikgrinaker ### kvcoord: lock mutex in `TxnCoordSender.Epoch()` Methods that access `TxnCoordSender.mu` fields must lock the mutex first. `Epoch()` didn't. Resolves #70071. Release note: None ### kv: fix mutex locking for `Txn.status` `Txn.status()` fetches the transaction status from the mutex-protected `Txn.mu.sender` field, but callers did not take out the mutex lock when calling it. This patch renames the method to `Txn.statusLocked()`, and updates all callers to take out the lock before calling it. Release note: None Co-authored-by: Sajjad Rizvi <sajjad@cockroachlabs.com> Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com> Co-authored-by: Radu Berinde <radu@cockroachlabs.com> Co-authored-by: Michael Erickson <michae2@cockroachlabs.com> Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
Previously, non-cancelable jobs were retried in running state only if their errors were marked as retryable. Moreover, only non-cancelable reverting jobs were retried by default. This commit makes non-cancelable jobs always retry in running state unless their error is marked as a permanent error. In addition, this commit makes all reverting jobs to retry when they fail. As a result, non-cancelable running jobs and all reverting jobs do not fail due to transient errors. Release justification: low-risk updates to new functionality. Release note (general change): Non-cancelable jobs, such as schema-change GC jobs, now do not fail unless they fail with a permanent error. They retry with exponential-backoff if they fail due to a transient error. Furthermore, Jobs that perform reverting tasks do not fail. Instead, they are retried with exponential-backoff if an error is encountered while reverting. As a result, transient errors do not impact jobs that are reverting. Fixes: cockroachdb#66685
Describe the problem
Hello, I have a mid size deployment of cockroachDB (12 nodes, 288vcores total). A few days ago I attempted to
ALTER
a table with the following SQL:The process failed, eventually. The table is not that big at 29.4GiB, 153 ranges.
Now every time I try to modify that table, I receive the message:
I checked in the
SHOW JOBS
and there isn't anything running anymore.To Reproduce
I assume you might be able to reproduce it by altering a table, making it fail, and then try again later.
Expected behavior
Release the table from the state of "being under schema change"
Additional context
Is there any other command to find the state of a table? Any internal table/command to release it?
Epic: CRDB-7912
The text was updated successfully, but these errors were encountered: