Classic queue can run into an exception after upgrade to RabbitMQ 3.13.7 (or 4.0.2) #12367
Comments
@netrmqdev how large are those "big messages"? |
Hello! The first log is probably missing a few lines at the beginning; the line just above what you copied is important. |
Another question is: how long should this "300 message/second, large messages" workload run before this issue can be hit? Last time @lhoguin and @mkuratczyk had to investigate something similar it took a few weeks to find a reasonably reliable way to reproduce. So any details would be appreciated. |
The size varies a lot, from a few KB to a few MB, but most messages are less than 1MB.
I wish I had identified a reproducible pattern, but:
I've found a similar occurrence where I have access to more logs than the first one (the ingestion of the logs into Grafana makes things a bit messy in this case):
|
@netrmqdev there is no backlog of messages from old versions, right? If you are saying it can happen a few days after an upgrade, I assume there are no messages older than that, correct? I just want to rule out anything upgrade-related. Side note: this queue is configured as |
we are also investigating a similar/same case: two single-node clusters were upgraded from 3.12.14 -> 3.13.7. We couldn't reproduce the issue in a test environment, so we don't know yet what attributes of the usage are relevant. The queues crash when trying to fetch a message from the shared msg store, but the location does not match the content of the rdq file:
or
I suspect that compaction somehow corrupts the rdq file or gets the rabbit_msg_store_ets_index out of sync. Compaction is supposed to move still-referenced messages from the end of the file towards the beginning. We could not rule out 100% that there is some data from before the upgrade, but some crashes happened more than a week after the upgrade. |
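For orientation, a minimal sketch of the rdq on-disk layout as assumed in this discussion (simplified, not the authoritative definition): an 8-byte size prefix, the 16-byte msg id, the body, and a trailing 255 byte, with the size covering the msg id plus the body. Reading one entry at a known offset could look roughly like this:

```erlang
%% Minimal sketch (assumed layout, for illustration only): read one
%% message-store entry at a given offset of an rdq file opened with
%% file:open(Path, [read, raw, binary]). Assumes each entry looks like
%% <<Size:64, MsgId:16/binary, Body/binary, 255>> where Size covers
%% the msg id plus the body.
-module(rdq_entry_sketch).
-export([read_entry/2]).

read_entry(Fd, Offset) ->
    case file:pread(Fd, Offset, 8) of
        {ok, <<Size:64/unsigned>>} ->
            {ok, <<MsgId:16/binary, Body/binary>>} =
                file:pread(Fd, Offset + 8, Size),
            {ok, <<255>>} = file:pread(Fd, Offset + 8 + Size, 1),
            {ok, MsgId, Body, Offset + 8 + Size + 1};
        eof ->
            eof
    end.
```

The crashes above suggest that, at the offset recorded in the index, the bytes no longer parse like this (or parse as a different message).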
No, we flush everything before updating RMQ.
Indeed, I noticed that while investigating and I plan to clean that very soon! |
When the queue restarts it should tell you how many persistent messages were dropped. Better yet, killing the node and restarting it will log a few more interesting things as it will force a full node recovery and may drop messages both from the queue and from the message store, if they're truly gone. Would be great to get the logs for a node restarted like this. It would help confirm whether rabbit ate the messages or whether the message store index or queue are out of sync. |
Have any of these environments run out of disk at some point? |
No, they have a lot of free space.
I believe the dropped-messages log on queue restart only exists in 4.0.x, so I only have those 3 logs to share (all from the same server):
|
Thanks. No, the log message exists before 4.0 but only for CQv2; I suppose you were using CQv1. The first two crashes don't drop any messages.
This one tells us that the queue failed to find 48 of its messages in the shared message store (more precisely in its index). Triggering a full restart by killing the node (I insist on killing the node, brutally shutting down the container or the node's process, as a normal restart won't do) will provide us with additional information in the logs to determine where those messages are. |
There are two types of cases I investigated: the first ones where the crash was actively happening, the second ones where I just inspected a backup of the data dir (without any logs or in-memory info available).

First cases

In the first cases I was able to trace the msg store index lookup by
And the MsgId was always found in the index with ref count = 1, eg:
The crash contains the queue state, from field
Processed 33915.rdq and associated entries with queue sequence Ids. The first message 77555 is missing, but all the others are from the head (q3) of the crashed queue.

33915.rdq content

Format is
The badmatch crash has a huge binary in it (luckily). I managed to identify messages 77623-77614 in them. Because there are no zeros between them and they are a continuous reversed block, I suspected they were written there in one go during a compaction - potentially overwriting the missing 77555 message.

Second cases

In the second cases I only had the queue index (CQv1) on disk and the rdq files. In one example the first index segment file starts with (format
The first message 137 is acked, so should have ref count = 0 in the msg store.

rdq content

Format is
I speculate the last compaction was maybe triggered by the ack of 137. This is supported by file timestamps as well. Both 137 and 139 are present in the rdq file (but 137 is not moved towards the beginning). There is an interesting 10MB zero-hole in the middle (right before 137) which is beyond the latest truncation line, so must be a result of a previous compaction. I haven't untangled from the order of SeqIds which compaction might have moved which message where and what could have been the original position of the missing message.

A theory

Theoretically it is possible that
The payload size of 26000 bytes is selected (binary for
This is just speculation and I don't know how likely it is to hit this coincidence. |
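To make the coincidence described in this theory a bit more tangible, here is a self-contained toy (a simplification that ignores msg ids; it is not the real rabbit_msg_store scan code) showing how a scanner that accepts any `<<Size:64, Payload:Size/binary, 255>>` pattern can report an "entry" that is really just bytes inside another message's body:

```erlang
%% Toy illustration of the theory: a naive scanner that accepts any
%% <<Size:64, Payload:Size/binary, 255>> pattern will happily report a
%% "message" embedded inside another message's body. Msg ids are omitted
%% and this is not the real message store scanner.
-module(false_positive_sketch).
-export([demo/0]).

scan(<<Size:64/unsigned, Payload:Size/binary, 255, Rest/binary>>, Offset, Acc) ->
    scan(Rest, Offset + 8 + Size + 1, [{Offset, Size, Payload} | Acc]);
scan(<<_, Rest/binary>>, Offset, Acc) ->
    scan(Rest, Offset + 1, Acc);
scan(<<>>, _Offset, Acc) ->
    lists:reverse(Acc).

demo() ->
    %% A real entry whose body happens to contain a smaller valid-looking
    %% entry: size 3, three payload bytes, then a 255 byte.
    InnerFake  = <<3:64/unsigned, "abc", 255>>,
    RealBody   = <<"prefix-", InnerFake/binary, "-suffix">>,
    RealEntry  = <<(byte_size(RealBody)):64/unsigned, RealBody/binary, 255>>,
    %% Scanning the whole entry finds the real message; scanning after the
    %% real size prefix has been lost (e.g. data moved or overwritten
    %% during compaction) finds only the embedded fake one.
    Whole      = scan(RealEntry, 0, []),
    Misaligned = scan(binary:part(RealEntry, 8, byte_size(RealEntry) - 8), 8, []),
    {Whole, Misaligned}.
```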
Oh yes indeed, all our queues are in CQv1.
I tried that, but I'm not sure it brings a lot more information, although we can see the 2 affected queues dropping messages on restart:
Full log
|
Thanks all. I think the working theory is correct. The odds of this happening are very low, but across the many RabbitMQ installations out there this is bound to happen regularly. I will think about a way to efficiently make sure the message found is a real message; a naive approach would be to check that the remainder here can be decoded with binary_to_term. |
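For readers following along, that naive check could look roughly like the sketch below; it assumes the stored body is an external-term-format binary (which is the premise of using binary_to_term) and is illustrative rather than the actual patch:

```erlang
%% Illustrative sketch of the naive validity check: accept a candidate
%% entry only if its body decodes as an Erlang term. Assumes the body is
%% in external term format; error handling kept deliberately simple.
-module(naive_check_sketch).
-export([looks_like_real_message/1]).

looks_like_real_message(Body) when is_binary(Body) ->
    try
        _ = binary_to_term(Body),
        true
    catch
        error:badarg -> false
    end.
```

As the following comments note, decoding every candidate this way is reasonable at recovery time but may be too expensive during normal operation.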
Sure, if you mean that I create a patch with the naive approach. (So far I couldn't trigger a compaction in a system test env with naive approaches, so first I need to figure that out to verify the patch. Is there anything more to it than waiting 15 seconds after more than half of the messages in an rdq file (which is not the last one) are consumed?) As the size of a msg cannot be larger than 512MB, which can be stored on 4 bytes, what if the first byte of the size were set to 255 (as a beginning-of-msg mark), or some other magic value in the first few bytes? Would that help at all? My concern is that calling binary_to_term is fine at startup during recovery, but it might create too much garbage / be too resource-intensive during normal operation. What if there was a "secondary index", another ets table keeping track of "rdq file" -> "msg id" mappings, so it could be cross-checked or serve as an input for the scanning? (There used to be a query to return all the msg ids for a given file, but it used the index table and was inefficient.) |
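A rough illustration of what that secondary index could look like (purely hypothetical; the table and function names are made up): a bag ets table keyed by rdq file number, maintained alongside the existing msg store index.

```erlang
%% Hypothetical sketch of the "secondary index" idea: a bag ets table
%% mapping each rdq file number to the msg ids it currently contains, so
%% that a scan or compaction could be cross-checked without walking the
%% primary index. All names are made up for illustration.
-module(file_msg_index_sketch).
-export([new/0, record_write/3, forget/3, msg_ids_in_file/2]).

new() ->
    ets:new(msg_store_file_index, [bag, public]).

record_write(Tab, File, MsgId) ->
    true = ets:insert(Tab, {File, MsgId}),
    ok.

forget(Tab, File, MsgId) ->
    true = ets:delete_object(Tab, {File, MsgId}),
    ok.

msg_ids_in_file(Tab, File) ->
    [MsgId || {_File, MsgId} <- ets:lookup(Tab, File)].
```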
Careful, as the naive approach is expensive, so only do it if you have the right test files. Yes, 15s since the file was accessed and half of the data gone. It might be easier to trigger by sending messages that are basically just all 255 bytes. Ideally we could just make a better format, but it's a bit complex; this is best done later, if/when we want to stop having the index fully in memory. Right now my approach to fix this is that when we encounter a potential message we still check if there's a potential message in the next 8 bytes. Then when we cross-reference with the index we should get the invalid messages dropped and only keep the valid ones, including the message we currently lose. The cross-reference is strict (it includes size and offset) so I don't expect any errors after that. |
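A simplified model of that strict cross-reference (a plain map stands in for the real ets index; this is a sketch of the idea, not the actual code): the scan may emit overlapping candidates, and only those whose msg id, offset and size all agree with the index survive.

```erlang
%% Simplified model of the strict cross-reference step: keep only those
%% scanned candidates whose {Offset, Size} match what the index records
%% for their msg id. Overlapping false positives fall out naturally.
%% Example: keep_valid([{a, 0, 100}, {b, 42, 12}], #{a => {0, 100}})
%% returns [{a, 0, 100}]; the overlapping candidate b is dropped.
-module(xref_sketch).
-export([keep_valid/2]).

%% Index      :: #{MsgId => {Offset, Size}}
%% Candidates :: [{MsgId, Offset, Size}]
keep_valid(Candidates, Index) ->
    lists:filter(
      fun({MsgId, Offset, Size}) ->
              maps:get(MsgId, Index, not_found) =:= {Offset, Size}
      end,
      Candidates).
```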
Aha, so scanning could return overlapping messages, but cross-checking would filter out the misinterpreted one. |
Right. Got a test case:
This fails with:
So the first attempt with |
Actually we should probably do the
The reason we want to do this is because even if we try the next 8 bytes, I don't know how much data I need to skip after doing that. Whereas if we look up immediately while scanning, I know whether the message is real or not. I think what needs to happen is that the fun from the two |
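Again only a hedged sketch (not the real fun signatures), but the idea of looking up candidates while scanning could be modelled like this: a rejected candidate advances the scan by a single byte instead of skipping the candidate's whole length, so a real message hiding behind a false positive is not lost.

```erlang
%% Sketch of looking up candidates during the scan: IsReal is a fun the
%% caller provides (e.g. an index lookup on msg id, offset and size).
%% When a candidate is rejected, the scan advances by one byte rather
%% than skipping the candidate's whole length.
-module(scan_lookup_sketch).
-export([scan/2]).

scan(Bin, IsReal) ->
    scan(Bin, 0, IsReal, []).

scan(<<Size:64/unsigned, MsgId:16/binary, Rest/binary>> = Bin, Offset, IsReal, Acc)
  when Size >= 16, byte_size(Rest) >= Size - 16 + 1 ->
    BodySize = Size - 16,
    case Rest of
        <<Body:BodySize/binary, 255, Tail/binary>> ->
            case IsReal(MsgId, Offset, Size) of
                true ->
                    scan(Tail, Offset + 8 + Size + 1, IsReal,
                         [{MsgId, Offset, Body} | Acc]);
                false ->
                    skip_one(Bin, Offset, IsReal, Acc)
            end;
        _ ->
            skip_one(Bin, Offset, IsReal, Acc)
    end;
scan(<<_, Rest/binary>>, Offset, IsReal, Acc) ->
    scan(Rest, Offset + 1, IsReal, Acc);
scan(<<>>, _Offset, _IsReal, Acc) ->
    lists:reverse(Acc).

skip_one(<<_, Rest/binary>>, Offset, IsReal, Acc) ->
    scan(Rest, Offset + 1, IsReal, Acc).
```

In this sketch `IsReal` could be, for example, `fun(Id, Off, Sz) -> maps:get(Id, Index, none) =:= {Off, Sz} end` over a map standing in for the ets index.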
How does this work in case of node recovery? Is the index already populated with msg ids from the queue indexes before reading the rdq files? |
Yes. |
I have opened #12392 with a patch. I have only tested via unit tests so this will require more extensive testing. I had to disable file scanning when we delete files, for now. Checking against the index in that case is too costly, because when the message isn't in the index we try the next byte, we don't skip the message. And deleted files are not supposed to have anything in the index (or in the file). This was basically just an assert so I am considering just removing it. |
@netrmqdev we can provide you a one-off build to test #12392. What kind of package do you use/need? Would you be able to test such a one-off build at all? (It's fine if you wouldn't, of course.) |
I can deploy a special build on some environments where the issue happened, yes. This will give a good idea of the patch fixing things or not. We deploy a custom Docker image based off the rabbitmq management one (with some additional certificates, conf and definitions):
If you can provide a tagged rabbitmq management image (based on 4.0.xx I imagine), that should be fairly easy for me, I believe. |
@netrmqdev this (x86-64) tag should include #12392 (acb04e9): https://hub.docker.com/layers/pivotalrabbitmq/rabbitmq/sha-acb04e929f9b1030dd077c62991147876b08fff5/images/sha256-353eb18c3b4d98fd393eacaf98234a90bfc582f0c6d1f38232f8a4317b0cf4be?context=explore It is technically 4.1.0-based but there are no meaningful changes compared to Please give it a shot and let us know how it goes. |
Thanks @michaelklishin , yes 4.1 based is ok. |
I deployed the image this morning on the same env from the previous logs in 4.0.x, and it did not go well. The RMQ server fully crashed soon after starting.
The "full" log (messages truncated) of the crash
|
Thanks! I have indeed missed something. |
Please try the most up to date image which should be https://hub.docker.com/layers/pivotalrabbitmq/rabbitmq/loic-fix-cq-scan/images/sha256-ab439bb28985d6699f8b5a7c6ca91ca0d183202521f808b888c874b41146c02e?context=explore |
Thanks a lot, I deployed it this morning on the first env and so far so good 👍 |
Hi all! We're observing the same issue. We went from 3.10.0 on CentOS 7 to 3.13.2 on Debian 12 last Thursday and since then have had those crashes every night in our canary environment, where we mirror live traffic of our application and test new releases before staging them to the live environment. Last night I tried out 4.0.2 and the crashes were still happening, although RabbitMQ managed to automatically restart the affected queues, which didn't work with 3.13.2. I have now gone back to 3.12.14 and will observe again for a night or two. Long story short: if someone can provide a DEB package with the attempted fix, I can do some real-world testing of it in our canary environment afterwards. Since we have no real users on there, it would be quite easy for me to do. |
I'm sure @michaelklishin can help. |
@kaistierl I will build a 64-bit Debian package (off of |
@kaistierl would an ARM64 build do? Our x86-64 environment that I used to produce one-off builds needs to be re-created on different infrastructure for reasons outside of my control, and it will take some time. But I can produce a local ARM64 Debian package. |
@michaelklishin unfortunately not - I would need an x86-64 package. ...by the way: after downgrading to 3.12.14 everything has been running stable again for me, so my observations match what was reported by others. |
Here is a generic binary build of #12392, which is platform-independent: https://github.com/rabbitmq/rabbitmq-server/actions/runs/11120630513/artifacts/1999351253. I cannot promise any ETA for one-off x86-64 builds. |
Note that both 3.12 and 3.13 are out of community support, so when a new patch release is out, you will have to move to |
@netrmqdev any updates from your clusters? |
For reasons unrelated to RabbitMQ, I was unable to get as much testing as I wanted on my envs. However, I now have about 4 of them running the patch for ~12-24 hours each, with no issues so far. They will keep running over the weekend, so on Monday I can give more feedback 👍 |
@kaistierl as the patch only modifies one file you can extract |
@gomoripeti RabbitMQ remains open source software under a specific and well documented support policy. Except for serious enough security patches, there will be no more OSS RabbitMQ 3.13.x releases. |
Let's keep this discussion specific to this issue. Any attempts to convince our team to produce a 3.13.x release will be removed. This is something non-negotiable. The patch will be backported to |
@michaelklishin I've had the patched version running non-stop for the past 72h on 4 different environments, and I have not seen any issues. That's very promising! Thanks for your quick reaction to this bug report 👌 |
I just installed the fix in our environment by patching a 4.0.2 installation as suggested in #12367 (comment). I should be able to provide the first meaningful feedback tomorrow, since the crashes happened every night before. |
...all good so far, no crashes last night - looks like the fix works for me as well 🙂 |
Thank you for testing #12392 folks @netrmqdev @kaistierl. I cannot promise a |
Describe the bug
After upgrading the RMQ server from version 3.12.14 to version 3.13.6 or 3.13.7, we randomly started getting "Restarting crashed queue" errors for some of our queues. Those queues can have a throughput of 300 events/s, with big messages. We have also tested version 4.0.1 and the same issue was spotted there too. Every time we roll back to 3.12.14, the issue no longer happens.
Our topology is quite simple: we have only one RabbitMQ server running on k8s with the Docker management image. Clients publishing to and consuming from the affected queues are written in .NET and use the official .NET RabbitMQ client.
Here are some stacktraces of when that happens:
3.13.7
4.0.1
Those issues seem very similar to what is described here: #10902
However, that issue is supposed to have been fixed since 3.13.2.
Reproduction steps
Cannot reproduce in a test environment.
Expected behavior
No queue crashing
Additional context
No response