[Phase 1][colocation][DocDB][SQLsmith] Colocated table corruption after crash loop #11415
Comments
@def- can you clarify if this was during some DDL phase of the app, or pure CRUD operations? |
Note, I would even ignore the footer not found warnings -- that's just because you're restarting nodes and we don't close the WALs on hard restart |
After the initial DDL phase, so this was only SELECT/INSERT/UPDATE. I have postgres logs, but it's 4 GB uncompressed, so not sure what to search for. Should I upload it somewhere?
Is there an easy way to find this out from the logs? I ran:
No drops were run during the test. A |
I think you cropped the log entry that starts with
Were there any DROP TABLE or ALTER TABLE ADD PRIMARY KEY?
For all of the logs, are there any fatals or errors? So far, you've only shown warning logs, and the real reason is probably somewhere else. Does postgres crash loop? Is the "can't start up anymore" only for when you try to connect to the database (e.g. with ysqlsh)? Are postgres background workers and the postmaster running fine? A lot of information can be answered with a
The log shows
It's more likely to appear in the master logs. |
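One way to answer the fatals/errors question for large master and tserver logs is to count entries per severity and print only the severe ones. This is a minimal sketch, assuming the logs use the glog line prefix (a severity letter I/W/E/F followed by MMDD and a timestamp). The function names are made up for illustration, and the postgres log uses a different format, so it is not covered here.

```python
import re
from collections import Counter
from typing import Iterable, Iterator, Tuple

# glog prefix: severity letter, MMDD, a space, then HH:MM:SS.
GLOG_PREFIX = re.compile(r"^([IWEF])(\d{4}) (\d{2}:\d{2}:\d{2})")


def severity_counts(lines: Iterable[str]) -> Counter:
    """Count log entries per severity letter; lines without a prefix are skipped."""
    counts: Counter = Counter()
    for line in lines:
        match = GLOG_PREFIX.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def severe_lines(lines: Iterable[str], levels: str = "EF") -> Iterator[Tuple[int, str]]:
    """Yield (line number, text) for entries whose severity is in `levels`."""
    for lineno, line in enumerate(lines, start=1):
        match = GLOG_PREFIX.match(line)
        if match and match.group(1) in levels:
            yield lineno, line.rstrip("\n")
```

Both functions accept any iterable of lines, so an open file works directly, for example `severity_counts(open(path, errors="replace"))`. Streaming line by line keeps memory use flat even on multi-gigabyte files.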
Full file (new run) is attached: yb-master.dev-server-dfelsing.dfelsing.log.INFO.20220214-103230.4236.txt
No fatals, two errors, attached:
Doesn't crash, but keeps getting restarted:
Postgres log:
|
You didn't answer this question:
Are you sure the issue is in this log? I don't see any CreateTable at all. I was originally asking only for the rest of that "Missing table" log entry since I wanted to know what other
These don't look serious.
That's what I'm looking for. The postgres logs are more helpful, then. Regarding
that's probably the main culprit. I never saw it before. Did you try removing the file like the hint says? @bmatican, I wouldn't call this "Colocated table corruption" yet since there's no evidence of that. The master log "Missing table" is nonfatal and looks more like issue #11129 like I previously mentioned. |
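Before removing anything, it may help to see where the backup_label file lives and what it contains. The following is a rough sketch that only lists and prints such files under a directory. The helper names are hypothetical, the root directory is whatever data directory applies to the setup, and nothing is deleted or renamed.

```python
from pathlib import Path
from typing import List


def find_backup_labels(root: str) -> List[Path]:
    """Return every regular file named backup_label under root, sorted by path."""
    return sorted(p for p in Path(root).rglob("backup_label") if p.is_file())


def show_backup_labels(root: str) -> None:
    """Print the path and contents of each backup_label file found under root."""
    for path in find_backup_labels(root):
        print(f"--- {path}")
        print(path.read_text(errors="replace"))
```

Keeping a copy of the file contents before removal preserves evidence for later investigation.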
Sorry, forgot to answer. There should have been no DROP TABLE commands, but there were ALTER TABLE commands. For an initial setup I ran:
There is a line |
Tried it now, that helps. Starts up fine now. The content of the file was:
|
That initial setup is interesting.
There's still a chance ALTER TABLE ADD PRIMARY KEY happened. That internally replaces a nonpk table with a pk one, meaning it internally drops the nonpk table. Then, issue #11129 is possible. If you try filtering out lines with "ADD PRIMARY KEY" and rerun, you may not see that warning any more.
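The filtering suggested above could be done with a few lines of Python. This is a line-based sketch only: it assumes each ADD PRIMARY KEY statement sits on a single line of the setup script, the function name is made up, and it matches the literal phrase, so a primary key added through `ADD CONSTRAINT ... PRIMARY KEY` would not be caught.

```python
import re
from typing import Iterable, List

# Match the phrase case-insensitively, tolerating extra whitespace between words.
ADD_PK = re.compile(r"\bADD\s+PRIMARY\s+KEY\b", re.IGNORECASE)


def drop_add_primary_key(lines: Iterable[str]) -> List[str]:
    """Return the input lines with every line containing ADD PRIMARY KEY removed."""
    return [line for line in lines if not ADD_PK.search(line)]
```

If the warning no longer appears when the filtered script is used for the initial setup, that would point toward the internal drop described above.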
Sorry, I missed that and the many others in the file. It looks as expected, and I wonder what that table was. If CreateTable were in the master logs, we could find out. Do you have those earlier master logs? About that backup_label file, it's probably the main reason for this issue. Is it reproducible? |
Not so easy. I ran this a few times for a full day (initial setup, then 4 SQLsmith instances against it for the rest of the day) and hit this issue only twice. The initial setup alone is not enough, crashing less commonly by running with
I uploaded the full var directory: https://ddnet.tw/var-11415.tar.gz |
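With a full var directory at hand, the earlier master logs can be searched for the CreateTable entry of a given table ID. Here is a small sketch of that search. The function names are hypothetical, the match is a plain substring test on both the table ID and the keyword, and compressed logs are not handled.

```python
from pathlib import Path
from typing import Iterable, Iterator, Tuple


def matching_lines(
    lines: Iterable[str], table_id: str, keyword: str = "CreateTable"
) -> Iterator[Tuple[int, str]]:
    """Yield (line number, text) for lines containing both table_id and keyword."""
    for lineno, line in enumerate(lines, start=1):
        if table_id in line and keyword in line:
            yield lineno, line.rstrip("\n")


def search_logs(
    log_dir: str, table_id: str, keyword: str = "CreateTable"
) -> Iterator[Tuple[str, int, str]]:
    """Run matching_lines over every regular file under log_dir."""
    for path in sorted(Path(log_dir).rglob("*")):
        if path.is_file():
            # errors="replace" keeps the scan going past undecodable bytes.
            with path.open("r", errors="replace") as fh:
                for lineno, text in matching_lines(fh, table_id, keyword):
                    yield str(path), lineno, text
```

Changing `keyword` to another phrase, such as the text of the "Missing table" warning, reuses the same scan for other entries that mention the table.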
Found the CreateTable for 000043b10000300080000000000045b2 in
so master dropped the docdb table
but issue #11129 means the drop doesn't fully clean up the metadata. This is not anything new. The backup_label issue needs more investigation. git grep -rl 'backup_label' | sort -r gave me
Can you get a backtrace from the corresponding core, |
Thanks for checking, good to know this is a known issue.
So this is #11363 |
Jira Link: [DB-372](https://yugabyte.atlassian.net/browse/DB-372)
Description
After running SQLsmith for a while (Release build of code state 596eecc) with many trigger crashes (#10152), the data state becomes invalid and the database can't start up anymore. Starting up:
master log during startup:
tserver log looks similar:
ysqlsh then fails:
I can provide the ~/var directory if it helps (1 GB).