Deleting bucket eventually fails and makes delete queues stuck #725
WIP Problems
Related Issues
@vstax Thanks for reporting in detail.
As I commented above, there are some problems.
This is not a typo (the head/fetch methods are used internally during a delete-bucket operation).
As I answered in the question above, a restart can cause delete operations to stop in the middle.
It seems the delete queues actually are freed up, even though the number mq-stats displays is non-zero. This kind of inconsistency between the actual items in a queue and the number mq-stats displays can happen when a restart occurs on leo_storage. We will sort this inconsistency problem out somehow. EDIT: filed this inconsistency problem as #731
WIP.
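To illustrate the kind of drift described above, here is a generic sketch (not LeoFS code; the module and function names are invented) of how a message count kept separately from the backing store can disagree with the real number of queued items after a restart:

```erlang
%% Generic illustration only - not LeoFS code. The count lives apart from the
%% items (as a snapshot), so restoring it after a restart without rescanning
%% the backend reproduces the "mq-stats says N, queue actually holds M" effect.
-module(counter_drift_sketch).
-export([new/0, publish/2, consume/1, restart/2, stats/1]).

-record(q, {items = [] :: [term()],
            count = 0  :: non_neg_integer()}).

new() -> #q{}.

%% Enqueue an item and bump the counter.
publish(Item, #q{items = Items, count = C} = Q) ->
    Q#q{items = [Item | Items], count = C + 1}.

%% Dequeue an item and decrement the counter.
consume(#q{items = []} = Q) ->
    {empty, Q};
consume(#q{items = [H | T], count = C} = Q) ->
    {H, Q#q{items = T, count = C - 1}}.

%% After an unclean restart, items come back from the backend but the counter
%% is restored from a stale snapshot, so the two no longer match.
restart(ItemsFromBackend, StaleSnapshotCount) ->
    #q{items = ItemsFromBackend, count = StaleSnapshotCount}.

%% What a stats command would report vs. what is really there.
stats(#q{items = Items, count = C}) ->
    #{reported => C, actual => length(Items)}.
```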
I've made a diagram of the delete-bucket processing to clarify how to fix this issue; the diagram also covers #150.
@mocchira @yosukehara Thank you for analyzing. This seems like a complicated issue; when looking at #150 and #701 I thought this was supposed to work as long as I don't create a bucket with the same name again, but apparently I had too high hopes.
I can't do anything about retries from leo_gateway to leo_storage, but I can try it with a different S3 client that will only issue the "delete bucket" operation once, without retries, and share whether it works any better. However, I've stumbled into something else regarding the queues, so I'll leave everything be for now.
No, not that one. The "mehtod" part is a typo. From here:
I've restarted the managers; the queue size didn't change. I've restarted storage_1 - and the queue started to shrink. I got ~100% CPU usage on that node.
After a minute or two I got two errors in error.log of storage_1:
The CPU usage went to 0 and the queue "froze" again. But half a minute later I see the same 100% CPU usage again. Then it goes to 0 again. Then high again. All this time, the "number of msgs" in the queue doesn't change, but the "interval" number changes:
I had debug logs enabled, and I saw lots of lines in the storage_1 debug log during this time. At first it was like this:
Then (note the gap in time! This - 13:25 - is a few minutes after the queue got "stuck" at 52404. Could it be that something restarted and the queue got "unstuck" for a moment here?):
At some point (13:28:40 to be precise), the messages stopped appearing. I've repeated the experiment with storage_2 and the situation at first was exactly the same, just with different numbers. However, unlike storage_1, there are other messages in the error log:
The last line repeats lots of times, endlessly. I can't execute "mq-stats" for this node anymore: it returns instantly without any results (like it does when a node isn't running). However, its status is indeed "running":
To conclude: it seems to me that it's not the case that the queue numbers are fake and the queues are empty - there is stuff in the queues. Restarting makes them process again for a while, but pretty soon they get stuck once again. Plus, the situation seems different for storage_1 and storage_2.
@vstax thanks for the detailed info.
Oops. Got it :)
The queue size didn't change and displays an invalid number that differs from the actual one, caused by #731.
It seems the half-a-minute cycle is caused by #728. EDIT: found the fault here: https://github.com/leo-project/leofs/blob/1.3.4/apps/leo_storage/src/leo_storage_mq.erl#L342-L363
Getting a slow response from any command through leofs-adm is one of the symptoms of the Erlang runtime being overloaded.
It seems the number is fake and there is still stuff in the queues.
TODO
@vstax We will ask you to do the same test after we finish all the TODOs described above.
I've recognized the problem; I'll send a PR and its fix will be included in v1.3.5.
Well, it started to happen before I stopped storage_2, so all the nodes were running; also, 30 seconds was just a very rough estimate from looking at top - it could be something in the 10-20 second range as well, as I wasn't paying strict attention (it wasn't anything less than 10 seconds, for sure). I might be misunderstanding #728, though.
Interesting find! I've done tests with double-deletes and deletes of non-existent objects before, but that was on 1.3.2.1, before the changes to the queue mechanism.
I will. That load (100-0-100-...) on storage_1 ended around 13:28, when the errors and messages in the debug log stopped appearing. The queue isn't being consumed but the node itself is fine. storage_2 is still in bad shape, doesn't respond to
every 5 seconds. Also, the errors that I've seen on storage_1 never appeared in the storage_2 log. However, when I restart the nodes I'll probably see something else. A question: you've found a case where messages in a queue can stop being processed. But after I restarted a node, I clearly had > 1000 messages disappear from the queue on each node. Besides lots of cases of double-messages like
there were quite a few successful deletes like
on both storage_1 and storage_2. So somehow a node restart makes part of the queue get processed, even though it wasn't being processed while the node was running. I could upload work/queue/4 (around 60 MB for both nodes) somewhere, then restart the nodes and see if this (successful processing of part of the queue) happens again. Would the queue contents help you in debugging?
Since there are multiple consumer processes/files (IIRC, 4 or 8 by default) per queue (in this case ASYNC_DELETION), the period can vary below 30 seconds.
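For what it's worth, a tiny illustrative sketch (not the leo_mq code; module and names are invented) of why several independent consumers polling the same queue produce CPU bursts whose spacing is not one fixed interval:

```erlang
%% Illustrative only - not leo_mq. N consumers each sleep for their own
%% (jittered) interval and then pull one item, so the bursts an observer
%% sees are staggered rather than arriving at a single fixed period.
-module(multi_consumer_sketch).
-export([start/2]).

start(NumConsumers, BaseIntervalMs) ->
    QueuePid = spawn(fun() -> queue_loop(lists:seq(1, 50)) end),
    [spawn(fun() -> consumer(Id, QueuePid, BaseIntervalMs) end)
     || Id <- lists:seq(1, NumConsumers)],
    ok.

queue_loop(Items) ->
    receive
        {pop, From} ->
            case Items of
                []      -> From ! empty,     queue_loop([]);
                [H | T] -> From ! {item, H}, queue_loop(T)
            end
    end.

consumer(Id, QueuePid, BaseIntervalMs) ->
    %% Each consumer wakes up on its own schedule (base interval plus jitter).
    timer:sleep(BaseIntervalMs + rand:uniform(BaseIntervalMs)),
    QueuePid ! {pop, self()},
    receive
        {item, Item} ->
            io:format("consumer ~p processed item ~p~n", [Id, Item]),
            consumer(Id, QueuePid, BaseIntervalMs);
        empty ->
            io:format("consumer ~p: queue empty~n", [Id])
    end.
```

Running multi_consumer_sketch:start(4, 1000). in a shell shows the per-consumer activity arriving at staggered times.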
Makes sense!
Thanks.
It seems something I haven't noticed is still there.
Yes! Please share via anything you like.
@mocchira I've packed the queues (the nodes were running but there was no processing) and uploaded them to https://www.dropbox.com/s/78uitcmohhuq3mq/storage-queues.tar.gz?dl=0 (it's not such a big file, so Dropbox should work, I think?). Now, after I restarted storage_1... a miracle! The queue was fully consumed, without any errors in the logs or anything. The debug log was as usual:
Error log - nothing to show. Queue state during these 4 minutes (removed some extra lines):
I've restarted storage_2 as well. I got no '-decrease/3-lc$^0/1-0-' errors this time! Like, at all. At first the queue was processing, then eventually it got stuck:
I got just this in the error log around the time it froze:
Now it spends 10-20 seconds in a 100% CPU state, then switches back to 0, then 100% again and so on. Just like storage_1 did during the last experiment. And like last time, "mq-stats" makes me wait if executed during a 100% CPU usage period. The whole situation seems quite random... EDIT: 1 hour after the experiment, everything is still the same; 100% CPU usage alternating with 0% CPU usage. Nothing in the error logs (besides some disk watchdog messages, as I'm over 80% disk usage on the volume with AVS files). This is unlike the last experiment (#725 (comment)), when storage_1 wrote something about "Supervisor started" in the logs at some point and stopped consuming CPU soon after. Leo_doctor logs: https://pastebin.com/y9RgXtEK https://pastebin.com/rsxLCwDN and https://pastebin.com/PMFeRxFH EDIT: 21 hours after the experiment, the 100% CPU usage alternating with 0% CPU on storage_2 has stopped. Nothing related in the logs, really; not in error.log nor in erlang.log. No mention of restarts or anything - just according to
@vstax Still WIP, although I'd like to share what I've got at the moment.
Since the second one can be mitigated by reducing the number of leo_mq consumers, I will add this workaround to #725 (comment).
Design considerations for #725 (comment).
I've repeated - or rather, tried to complete - this experiment by re-adding the "bodytest" bucket and removing it again on the latest dev version. I don't expect it to work perfectly, but I wanted to check how the issues that were already fixed helped. Debug logs are disabled to make sure the leo_logger problems won't affect anything. This time, I made sure to abort gateway logs:
storage_0 info log:
Error log:
storage_1 info log:
Error log:
storage_2 info log:
Error log - empty. Queue states:
which was dropping pretty fast
and reached 0 about 2-3 minutes after the start of the operation:
For storage_1, the queue likewise got to this number within 30 seconds, but its status was "suspending"
it was "suspending" all the time during experiment. It barely dropped and stays at this number even now:
For storage_2, the number was this within 30 seconds after start
it was dropping slowly (quite unlike storage_0) and had reached this number when it stopped shrinking:
At this point the system is stable; nothing is going on, there is no load, most of the objects from "bodytest" still aren't removed, and the queues for storage_1 and storage_2 are stalled at the numbers above. There is nothing else in the log files. I stop storage_2 and make a backup of its queues (just in case). I start it; the number in the queue is the same at first. 20-30 seconds after the node is started, it starts to go down. There are new messages in the error logs:
The number in the queue eventually drops to 0. I stop storage_1, make a backup of its queues, and start it again. Right after the start the queue starts processing:
Then the CPU load on the node goes very high, the "mq-stats" command starts to hang, and I see this in the error logs:
The node isn't working at this point. I restart it, and the queue starts to process:
Error log is typical at first:
but then mq-stats command starts to freeze, and I get this:
(the last line starts to repeat at this point). Note that the empty lines in the log file are really there.
Problem 3) is probably fine for now given that #725 (comment) isn't implemented yet, I suppose, but 1) and 2) worry me, as I don't see any currently open issues related to these problems...
I've updated the diagram of the delete-bucket processing, which covers #725 (comment).
@yosukehara Thanks for updating and taking my comments into account. Some comments:
It seems the other concerns have been covered by the diagram above.
@vstax thanks for testing. As you suspected, problems 1 and 2 give me the impression there is something we have not covered yet. Note:
@vstax Thanks for sorting out the remaining problems.
Unfortunately we still couldn't reproduce this issue as of now; however, it should not be a critical one, so we will file it under the 1.4 milestone.
Same status as above; we will file this issue under the 1.4 milestone.
Through some discussions with the other members, we've decided this is the spec, as follows: in case leo_storage goes down while it's enqueuing, the state of the corresponding leo_storage at the manager keeps being "enqueuing" (not changing to "pending", as it would become "enqueuing" again immediately once leo_storage gets back). Let us know if this behavior troubles you.
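If I read that spec correctly, the per-node state handling would look roughly like the sketch below. This is my own simplification, not the actual leo_manager/leo_storage code; the "monitoring -> pending" rollback is an assumption added only for illustration:

```erlang
%% Simplified sketch of the delete-bucket state transitions as I understand
%% the spec above - not the actual implementation.
-module(del_bucket_state_sketch).
-export([next_state/2]).

-type state() :: enqueuing | monitoring | pending | finished.
-type event() :: node_down | enqueue_done | queue_drained.

-spec next_state(state(), event()) -> state().
%% Spec described above: a node going down while enqueuing keeps the state
%% as 'enqueuing', since it would flip back to 'enqueuing' right after the
%% node returns anyway.
next_state(enqueuing,  node_down)     -> enqueuing;
next_state(enqueuing,  enqueue_done)  -> monitoring;
%% Assumption for illustration only: a node lost while monitoring rolls back
%% to 'pending' so the deletion can be retried.
next_state(monitoring, node_down)     -> pending;
next_state(monitoring, queue_drained) -> finished;
next_state(State,      _Event)        -> State.
```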
The two commits below should fix this problem, so please give it a try with 1.3.5 after you are back.
The commit below should fix this problem, so please give it a try with 1.3.5 after you are back.
We will release 1.3.5 today, as the problems caused by stopping storage nodes have been fixed now. Let us know if there is something that doesn't work for you with 1.3.5, except the problems whose milestone we assigned to 1.4.
@mocchira Thank you, I'll test these fixes when I'm able to.
Maybe. We will try to reproduce with a bucket having the "old metadata" format later.
Progress note: we have not succeeded in reproducing the issues yet, even with a bucket having the old metadata format. We will look into other factors that might affect reproducibility.
Thank you. There are lots of timeout errors in my log files (because there are lots of objects to delete, I suppose); I wonder if you're getting similar amounts in your tests, because they might be a factor here. Either way, I'll be able to return to this around the end of the week. I was thinking of doing 3 experiments on the latest version (delete bucket "body", then "bodytest", one by one; then the same, but in reverse order; then both at once) while gathering the output of delete-bucket-stats and mq-stats for all nodes every minute or 30 seconds, to closely monitor and time the moments of state switching and the numbers in the queues, to see if there really is anything abnormal. I could add some other stats gathering here, but I'm not sure what else would help (the debug logs don't output anything relevant, at least).
@vstax Thanks for replying.
Yes, we also have non-negligible timeouts like you due to #764; however, the actual number might be (very) different from yours. I will look into the problem while considering how lots of timeouts affect the behavior of multiple delete-bucket operations at once.
Thanks! Looking forward to seeing your results.
@mocchira WIP, however I can see now that I can get too many messages while deleting a single bucket as well. It's not caused by deleting two buckets at once, but by some other means, it seems. A long time ago I never got such numbers when deleting a single bucket, so it must be either some change between the original delete-bucket implementation and the latest changes, or some third factor that didn't exist back then but plays a role now.
I'm only deleting a single bucket here; there are two buckets with about an equal number of objects, so I don't expect the numbers in the queue to go higher than 1354898 / 2 ~= 677000. EDIT: OK, this looks totally wrong. Here is the state after deleting the "body" bucket completed - "bodytest" was never touched!:
Deleting bucket "body" completely removed all objects both from "body" and "bodytest". How is that possible? If that really is the case, it would explain both problems (removing "body" takes much more time than "bodytest", and too many messages in queues when deleting both "body" and "bodytest" at once). Just in case, both buckets are owned by the same user "body". Right now bucket "bodytest" still exists, but all objects from it on nodes are removed. |
@vstax Oops. Thanks, that totally makes sense! We will fix it ASAP.
@mocchira We need to fix this issue quickly, then we will release v1.3.6 containing the fix.
@mocchira Thank you, this fix works. I did tests deleting the buckets one by one and both at once, and can confirm that issues #803 and #804 are now gone and can be closed. I can still see two minor problems. Even during deletion of a single bucket, the number of messages in the queue at its peak still gets somewhat higher than the number of remaining objects. E.g.
The numbers eventually get closer and closer (and both finish at 0, of course):
The second problem (even more minor) is the premature clearing of delete-bucket-state. It definitely isn't available for "minutes" after bucket deletion is done. Here are a few examples; I was executing delete-bucket-state every 30 seconds and catching it in the "finished" state is nearly impossible - the state disappears almost right away as soon as all queues get to 0:
The next spreadsheets show the combination of mq-stats and delete-bucket-stats. "none" in the bucket state means that the delete stats are "not found". Experiment with deleting the "body" bucket first, then "bodytest". As for the third experiment ("bodytest", then "body") - I didn't finish it; I checked and it went exactly like the "body, then bodytest" one. Initial state of the nodes, and the state after the "body" bucket was deleted in the first experiment, storage_0:
storage_1:
storage_2:
At the end, all objects were removed from all nodes. Buckets "body" and "bodytest" are almost the same ("bodytest" might contain a bit more objects than "body", but the difference should be less than 1%). To summarize the experiment results: deleting the first bucket took 26 minutes, the second (after a pause of a few minutes) - 34 minutes. The queues on storage_1 and storage_2 were filling ("enqueuing") significantly slower for the second bucket, but processing them ("monitoring") went at about the same speed. Deleting both buckets at once took 39 minutes and the performance was identical for both buckets (however, queue filling for the second bucket lagged by about a minute, despite both deletes being executed at about the same time). I don't think there are any issues here.
I've checked what happens if a node is stopped during bucket deletion, and the problem with remaining objects is still present. First, I executed "delete-bucket body" and "delete-bucket bodytest"
Some time later, when all other nodes switched state to "monitoring":
I started storage_0. Eventually all queues on all nodes got to 0 and bucket deletion finished:
But objects remain on storage_0:
Also: the info log on manager_0 is missing "dequeue" for the "body" bucket; there is one only for "bodytest":
Despite all three nodes having "dequeue" in logs for both buckets:
I executed compaction on storage_0 and lots of objects indeed remain; there are objects from both "bodytest" and "body" present on storage_0 (in roughly equal amounts). I uploaded the logs from all storage nodes to https://www.dropbox.com/s/gisr1ujoipu9tdz/storage-logs-objects-remaining.tar.gz?dl=0 I've noticed a (maybe interesting) thing as well, when looking at the numbers "active number of objects: 2574, active size of objects: 9810347778" and "total number of objects: 1354898, total size of objects: 191667623380". The average object size for the node is 191667623380 / 1354898 ~= 140 KB. However, for the objects that weren't deleted the average size is 9810347778 / 2574 ~= 3.8 MB. Isn't that a bit strange? (The "active number / size" numbers can be trusted; they didn't change after compaction either.)
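The arithmetic behind those averages, for reference (integer division over the byte counts quoted above):

```erlang
%% Quick check of the averages quoted above (byte counts from the node stats).
-module(avg_size_check).
-export([check/0]).

check() ->
    AvgAll       = 191667623380 div 1354898,  %% 141462 bytes, ~140 KB
    AvgSurvivors = 9810347778 div 2574,       %% 3811323 bytes, ~3.8 MB
    {AvgAll, AvgSurvivors}.
```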
@vstax thanks for confirming the fix and reporting in detail as always.
OK. I will close later.
Reproduced; however, it's still not clear what causes this problem, so we will file it as another issue and assign the 1.4.0 milestone, as it's not a critical one.
Depending on the timing, the finished status can disappear within a dozen seconds; strictly speaking it ranges from 10 to 30 seconds according to https://github.com/leo-project/leofs/blob/develop/apps/leo_manager/include/leo_manager.hrl#L371-L383. Let us know if this behavior bothers you; we'd like to consider increasing the period.
Thanks for summarizing. I have no confidence, however maybe the length of a bucket name can affect the performance of the enqueuing phase (basically, prefix matches happen on leveldb); I will confirm whether my assumption is correct when I can spare time.
I still couldn't reproduce this one on my dev box, however I figured out how this could happen from the log files you shared in the previous comment. In short, rollback_to_pending (https://github.com/leo-project/leofs/blob/develop/apps/leo_storage/src/leo_storage_handler_del_directory.erl#L375-L379) can fail if leo_backend_db has already finished its termination processing. I will file this as another issue and get in touch with you once I've fixed it.
It doesn't bother me per se, it just doesn't quite match the documentation, which claims that it will be available for minutes :)
The old format is in the "body" bucket, and after the latest changes no test shows its removal being any slower than removing "bodytest" with the new metadata. So I don't think it plays any role here.
Maybe. The difference between s0 and s1/s2 (as you can see from
OK, great. Just in case, here is the listing from diagnose-start of one of the AVS files on storage_0. Most of the remaining objects are first parts of multipart objects, plus lots of big objects in general. This is really quite different from the real distribution of objects before deletion, as only a small % of objects were that big.
Got it. The documentation has been fixed with #811.
OK.
OK, please file an issue if you find a way to reproduce it (I will also try if I can spare time). As you may have noticed, the remaining issues about delete-bucket are filed on
I will ask you to test once those issues get fixed.
@vstax #812 has already been solved; OTOH #813 still remains, however it's not a big deal, so I'd like to ask you to test the delete-bucket related changes when you have time. (I will set the milestone to 1.4.1 for the same reason as #931 (comment).)
I got a test cluster (1.3.4, 3 storage nodes, N=2, W=1). There are two buckets, "body" and "bodytest", each containing the same objects, about 1M in each (there are some other buckets as well, but they hardly contain anything). In other words, there are slightly over 2M objects in the cluster in total. At the start of this test the data is fully consistent. There is some minor load on the cluster with the "body" bucket - some PUT & GET operations, but very few of them. No one tries to access the "bodytest" bucket.
I want to remove "bodytest" with all its objects. I execute s3cmd rb s3://bodytest. I see load on the gateway and storage nodes; after some time, s3cmd fails because of a timeout (I expect this to happen - no way storage can find all 1M objects and mark them as deleted fast enough). I see the leo_async_deletion_queue queues growing on the storage nodes:
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_2@192.168.3.55|grep leo_async_deletion_queue
leo_async_deletion_queue | idling | 97845 | 0 | 3000 | async deletion of objs
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_2@192.168.3.55|grep leo_async_deletion_queue
leo_async_deletion_queue | idling | 102780 | 0 | 3000 | async deletion of objs
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_2@192.168.3.55|grep leo_async_deletion_queue
leo_async_deletion_queue | idling | 104911 | 0 | 3000 | async deletion of objs
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_2@192.168.3.55|grep leo_async_deletion_queue
leo_async_deletion_queue | idling | 108396 | 0 | 3000 | async deletion of objs
The same goes for storage_0 and storage_1. There is hardly any disk load; each storage node consumes 120-130% CPU as per top. Then some errors appear in the error log on gateway_0:
If these errors mean that the gateway sent a "timeout" to the client that requested the "delete bucket" operation, plus some other timeouts due to load on the system - then it's within expectations; as long as all data from that bucket will eventually be marked as "deleted" asynchronously, all is fine.
That's not what happens, however. At some point - a few minutes after the "delete bucket" operation - the delete queues stop growing or shrinking. It's as if they are stuck. Here is their current state - 1.5 hours after the experiment; they got to that state within 5-10 minutes after the start of the experiment and never changed since (I show only one queue here, the others are empty):
There is nothing in the logs of the manager nodes. There is nothing in the erlang.log files on the storage nodes (no mention of restarts or anything). Logs on the storage nodes - info log for storage_0:
Error log on storage_0:
Info log on storage_1:
Error log on storage_1:
Info log on storage_2:
Error log on storage_2:
To summarize the problems:
and on storage_0:
What happened here is that "minor load" that I mentioned. Basically, at 17:17:13 an application tried to do a PUT operation on object body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz. That's a very small object, 27 KB in size. Moments after the successful PUT, a few (5, I believe) other applications did a GET for that object. However, they were all using the same gateway with caching enabled, so they should've gotten the object from the memory cache (at worst, the gateway would've checked the ETag against a storage node). 17:17:13 was in the middle of the "delete bucket" operation, so I suppose the large "processing time" for the PUT was expected. But why the "read_repairer" errors and "primary_inconsistency"?? storage_0 is the "primary" node for this object:
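For context on why a GET can trigger such messages at all, here is a generic sketch of the read-repair idea (the textbook version, with invented names; not the actual read_repairer logic in LeoFS): the node answering a read compares its copy's version with the other replicas and flags any replica that looks stale, which is presumably where a "primary_inconsistency"-style error would come from if the primary itself holds the older copy.

```erlang
%% Generic read-repair sketch with invented names - not the LeoFS code.
-module(read_repair_sketch).
-export([read/2]).

%% Each replica reports {NodeName, Clock, Object}; a bigger Clock means newer.
-type replica() :: {atom(), non_neg_integer(), term()}.

-spec read(replica(), [replica()]) ->
          {ok, term()} | {repair_needed, [atom()], term()}.
read({_PrimaryNode, PrimaryClock, PrimaryObj}, Replicas) ->
    %% Find the newest copy among the primary and the replicas.
    {_, NewestClock, NewestObj} =
        lists:foldl(fun({_, C, _} = R, {_, BestC, _} = Best) ->
                            case C > BestC of
                                true  -> R;
                                false -> Best
                            end
                    end,
                    {primary, PrimaryClock, PrimaryObj},
                    Replicas),
    %% Any node (including the primary) holding an older clock is stale and
    %% would be scheduled for repair; this is where an error/info line about
    %% an inconsistency would be logged.
    Stale = [Node || {Node, C, _} <- Replicas, C < NewestClock]
            ++ [primary || PrimaryClock < NewestClock],
    case Stale of
        [] -> {ok, NewestObj};
        _  -> {repair_needed, Stale, NewestObj}
    end.
```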
. That's a very small object, 27 KB in size. Some moments after successful PUT few (5, I believe) other applications did GET for that object. However, they all were using the same gateway with caching enabled, so they should've gotten the object from memory cache (at worst gateway would've checked ETag against storage node). 17:17:13 was in the middle of "delete bucket" operation, so I suppose the fact that there was large "processing time" for PUT was expected. But why "read_repairer" errors and "primary_inconsistency"?? Storage_0 is "primary" node for this object:The text was updated successfully, but these errors were encountered: