Deleting bucket eventually fails and makes delete queues stuck #725

Open
vstax opened this issue May 4, 2017 · 85 comments

vstax commented May 4, 2017

I have a test cluster (1.3.4, 3 storage nodes, N=2, W=1). There are two buckets, "body" and "bodytest", each containing the same objects, about 1M in each (there are some other buckets as well, but they hardly contain anything). In other words, there are slightly over 2M objects in the cluster in total. At the start of this test the data is fully consistent. There is some minor load on the cluster against the "body" bucket - some PUT & GET operations, but very few of them. No one tries to access the "bodytest" bucket.
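
For reference, the consistency settings on the managers look roughly like this (a sketch of the relevant part of leo_manager_0.conf; key names are as I recall from the 1.3.x default config, and the read/delete levels shown are assumptions since I only mention N and W above):

## Consistency level of the cluster (N=2, W=1; read/delete assumed to be 1)
consistency.num_of_replicas = 2
consistency.write = 1
consistency.read = 1
consistency.delete = 1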

I want to remove "bodytest" with all its objects. I execute s3cmd rb s3://bodytest. I see load on the gateway and storage nodes; after some time, s3cmd fails because of a timeout (I expect this to happen - there is no way storage can find all 1M objects and mark them as deleted fast enough). I see the leo_async_deletion_queue queues growing on the storage nodes:

[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_2@192.168.3.55|grep leo_async_deletion_queue
leo_async_deletion_queue | idling | 97845 | 0 | 3000 | async deletion of objs
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_2@192.168.3.55|grep leo_async_deletion_queue
leo_async_deletion_queue | idling | 102780 | 0 | 3000 | async deletion of objs
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_2@192.168.3.55|grep leo_async_deletion_queue
leo_async_deletion_queue | idling | 104911 | 0 | 3000 | async deletion of objs
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_2@192.168.3.55|grep leo_async_deletion_queue
leo_async_deletion_queue | idling | 108396 | 0 | 3000 | async deletion of objs

The same goes for storage_0 and storage_1. There is hardly any disk load; each storage node consumes 120-130% CPU as per top.
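
For reference, this is roughly how I'm watching the queues on all three nodes (a rough bash sketch; the leofs-adm path and node names are the ones from my test cluster):

#!/usr/bin/env bash
# Poll the async deletion queue on every storage node every 10 seconds.
NODES="storage_0@192.168.3.53 storage_1@192.168.3.54 storage_2@192.168.3.55"
while true; do
    date
    for node in $NODES; do
        printf '%s: ' "$node"
        /usr/local/bin/leofs-adm mq-stats "$node" | grep leo_async_deletion_queue
    done
    sleep 10
done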

Then some errors appear in the error log on gateway_0:

[W]	gateway_0@192.168.3.52	2017-05-04 17:15:53.998704 +0300	1493907353	leo_gateway_s3_api:delete_bucket_2/3	1774	[{cause,timeout}]
[W]	gateway_0@192.168.3.52	2017-05-04 17:15:58.999759 +0300	1493907358	leo_gateway_s3_api:delete_bucket_2/3	1774	[{cause,timeout}]
[W]	gateway_0@192.168.3.52	2017-05-04 17:16:07.11702 +0300	1493907367	leo_gateway_s3_api:delete_bucket_2/3	1774	[{cause,timeout}]
[W]	gateway_0@192.168.3.52	2017-05-04 17:16:12.12715 +0300	1493907372	leo_gateway_s3_api:delete_bucket_2/3	1774	[{cause,timeout}]
[W]	gateway_0@192.168.3.52	2017-05-04 17:16:23.48706 +0300	1493907383	leo_gateway_s3_api:delete_bucket_2/3	1774	[{cause,timeout}]
[W]	gateway_0@192.168.3.52	2017-05-04 17:16:28.49750 +0300	1493907388	leo_gateway_s3_api:delete_bucket_2/3	1774	[{cause,timeout}]
[W]	gateway_0@192.168.3.52	2017-05-04 17:17:01.702719 +0300	1493907421	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_0@192.168.3.53'},{mod,leo_storage_handler_object},{method,put},{cause,timeout}]
[W]	gateway_0@192.168.3.52	2017-05-04 17:17:06.703840 +0300	1493907426	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_1@192.168.3.54'},{mod,leo_storage_handler_object},{method,put},{cause,timeout}]

If these errors mean that the gateway sent "timeout" to the client that requested the "delete bucket" operation, plus some other timeouts due to the load on the system, then it's within expectations; as long as all data from that bucket eventually gets marked as "deleted" asynchronously, all is fine.

That's not what happens, however. At some point - a few minutes after the "delete bucket" operation - the delete queues stop growing or shrinking. It's as if they are stuck. Here is their current state, 1.5 hours after the experiment; they got to that state within 5-10 minutes after the start of the experiment and have never changed since (I show only one queue here, the others are empty):

[root@leo-m0 app]# /usr/local/bin/leofs-adm mq-stats storage_0@192.168.3.53|grep leo_async_deletion_queue
 leo_async_deletion_queue       |   idling    | 0              | 1600           | 500            | async deletion of objs
[root@leo-m0 app]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_queue
 leo_async_deletion_queue       |   idling    | 53559          | 0              | 3000           | async deletion of objs
[root@leo-m0 app]# /usr/local/bin/leofs-adm mq-stats storage_2@192.168.3.55|grep leo_async_deletion_queue
 leo_async_deletion_queue       |   idling    | 136972         | 0              | 3000           | async deletion of objs

There is nothing in the logs of the manager nodes. There is nothing in the erlang.log files on the storage nodes (no mention of restarts or anything). As for the logs on the storage nodes, here is the info log for storage_0:

[I]	storage_0@192.168.3.53	2017-05-04 17:16:06.79494 +0300	1493907366	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,17086}]
[I]	storage_0@192.168.3.53	2017-05-04 17:16:06.80095 +0300	1493907366	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/1a/11/87/1a118728e175f40a10b6390f3f579bfd3a5754401763708c8ef8f0b3bd9e5d84fbdcbb167fa850291032fcbbcd4439ef28d4000000000000.xz">>},{processing_time,14217}]
[I]	storage_0@192.168.3.53	2017-05-04 17:16:06.80515 +0300	1493907366	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/3e/b3/bd/3eb3bde5f5e58a67db86dbea8dd6850810fabdf41ee1e65ba1dd8395279175259b0fc7cf9b4b60f1cbc200d2d8bd541e00e6000000000000.xz">>},{processing_time,12232}]
[I]	storage_0@192.168.3.53	2017-05-04 17:16:24.135151 +0300	1493907384	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,30141}]
[I]	storage_0@192.168.3.53	2017-05-04 17:16:34.277504 +0300	1493907394	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,28198}]
[I]	storage_0@192.168.3.53	2017-05-04 17:16:34.277892 +0300	1493907394	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/0a/59/ab/0a59aba721c409c8f9bf0bba176d10242380842653e6994f782fbe31cb2296b46cba031a085b4af057ab43314631e3691c5c000000000000.xz">>},{processing_time,24674}]
[I]	storage_0@192.168.3.53	2017-05-04 17:16:34.280088 +0300	1493907394	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/25/03/16/250316ef50b26272f99b757409d75c173135d2ef09d972821072348ad071e49897dd7245c1f250db6489a401aea567d9886e000000000000.xz">>},{processing_time,5303}]
[I]	storage_0@192.168.3.53	2017-05-04 17:16:43.179282 +0300	1493907403	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,41173}]
[I]	storage_0@192.168.3.53	2017-05-04 17:16:43.179708 +0300	1493907403	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/1a/11/87/1a118728e175f40a10b6390f3f579bfd3a5754401763708c8ef8f0b3bd9e5d84fbdcbb167fa850291032fcbbcd4439ef28d4000000000000.xz">>},{processing_time,18328}]
[I]	storage_0@192.168.3.53	2017-05-04 17:16:43.180082 +0300	1493907403	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,delete},{key,<<"bodytest/1a/11/87/1a118728e175f40a10b6390f3f579bfd3a5754401763708c8ef8f0b3bd9e5d84fbdcbb167fa850291032fcbbcd4439ef28d4000000000000.xz">>},{processing_time,37101}]
[I]	storage_0@192.168.3.53	2017-05-04 17:16:43.180461 +0300	1493907403	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,delete},{key,<<"bodytest/3e/b3/bd/3eb3bde5f5e58a67db86dbea8dd6850810fabdf41ee1e65ba1dd8395279175259b0fc7cf9b4b60f1cbc200d2d8bd541e00e6000000000000.xz">>},{processing_time,37100}]
[I]	storage_0@192.168.3.53	2017-05-04 17:17:03.11597 +0300	1493907423	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,28734}]
[I]	storage_0@192.168.3.53	2017-05-04 17:17:03.12445 +0300	1493907423	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/04/12/d6/0412d6227769cfef42764802004d465e83b309d884ec841d330ede99a2a551eda52ddda2b9774531e88b0b060dbb3a17c092000000000000.xz">>},{processing_time,27558}]
[I]	storage_0@192.168.3.53	2017-05-04 17:17:03.12986 +0300	1493907423	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/03/68/ed/0368ed293eded8de34b8728325c08f1603edcede7e8de7778c81d558b5b0c9cd9f85307d1c571b9e549ef92e9ec69498005a7b0000000000.xz\n1">>},{processing_time,8691}]
[I]	storage_0@192.168.3.53	2017-05-04 17:17:03.809203 +0300	1493907423	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,56801}]
[I]	storage_0@192.168.3.53	2017-05-04 17:17:03.809742 +0300	1493907423	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/05/67/2b/05672be039f98d72ef426d413ae66d6aa33b472625a448879569a25ca29cdb8699580886a9759e357e97cc834bef15dd84d2000000000000.xz">>},{processing_time,9943}]
[I]	storage_0@192.168.3.53	2017-05-04 17:17:32.116059 +0300	1493907452	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,74073}]
[I]	storage_0@192.168.3.53	2017-05-04 17:17:32.116367 +0300	1493907452	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/34/86/01/348601a3d08bf38a4cb5bd8e22dae951d689def13b7bd1cc9c08cd0200a3cd52c6e015bf3be722d810a94f132752faf278c3000000000000.xz">>},{processing_time,36264}]
[I]	storage_0@192.168.3.53	2017-05-04 17:17:32.116743 +0300	1493907452	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/4e/cc/ef/4eccef1b917e48d1df702faab63181162c7a8f67998d7b5ef11ac33940ffe6362a8d1671c5e5f2c39945669b1d04f1ef0027720000000000.xz\n1">>},{processing_time,14748}]
[I]	storage_0@192.168.3.53	2017-05-04 17:17:32.117065 +0300	1493907452	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/66/06/b6/6606b6f9fab7e81872e8a628b696634b0c25294b0a79fbaf03c10ec49aff1d44a0921efcba5a53127e84d83edf7206d758ac850000000000.xz\n1">>},{processing_time,13881}]
[I]	storage_0@192.168.3.53	2017-05-04 17:17:33.13033 +0300	1493907453	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,30002}]
[I]	storage_0@192.168.3.53	2017-05-04 17:17:36.111764 +0300	1493907456	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/06/04/2d/06042debc9eb14bd654582d79019c02698f9514ae73709dda7f6a614868d294819908ad755cf814fb43059560fe2f0c984c3010000000000.xz">>},{processing_time,30001}]
[I]	storage_0@192.168.3.53	2017-05-04 17:17:40.473183 +0300	1493907460	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/dd/f3/60/ddf360276b8ccf938a0cfdb8f260bdf2acdcd85b9cf1e2c8f8d3b1b2d0ad17554e8a1d7d2490d396e7ad08532c9e90ac7cf2040100000000.xz\n1">>},{processing_time,16375}]
[I]	storage_0@192.168.3.53	2017-05-04 17:17:40.478082 +0300	1493907460	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/2c/5a/d2/2c5ad26023e73f37b47ee824eea753e550b99dc2945281102253e12b88c122dfbc7fcdad9706e0ee6d0dc19e86d10b76a8277f0000000000.xz\n2">>},{processing_time,7066}]
[I]	storage_0@192.168.3.53	2017-05-04 17:17:47.502257 +0300	1493907467	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,84458}]
[I]	storage_0@192.168.3.53	2017-05-04 17:17:47.503120 +0300	1493907467	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/33/8b/4c/338b4c25ca8fbb66a86f24bfc302f2fa4a9c657074c14e41692e5864a121849c4ad9a0f7342a35a16fc906d159980560782b010000000000.xz">>},{processing_time,20645}]
[I]	storage_0@192.168.3.53	2017-05-04 17:17:47.503488 +0300	1493907467	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/0e/6e/8b/0e6e8bcdbc732024193f7114b3f7d607333a9d3212a71e7104aea2b2b3bc137514eadd9c4d7de516e345feb9764186d9389d000000000000.xz">>},{processing_time,22633}]
[I]	storage_0@192.168.3.53	2017-05-04 17:17:47.503863 +0300	1493907467	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/e3/32/e0/e332e00f0f77cf322cbc1d7a30369f8681073076c49868ab9cd5cee6043dfe3ebc8c355ae5899b74602ba763dcba872450bc560000000000.xz\n1">>},{processing_time,5896}]
[I]	storage_0@192.168.3.53	2017-05-04 17:17:47.521029 +0300	1493907467	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,head},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{processing_time,19185}]
[I]	storage_0@192.168.3.53	2017-05-04 17:17:47.521524 +0300	1493907467	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,delete},{key,<<"bodytest/06/56/e3/0656e37a1ff09fb11abf969cb0b795905d3ed78087be15c01ca8e5b840395ca076c82eae02ba6e0f84f9d90dbf0f3a300600100000000000.xz">>},{processing_time,83386}]
[I]	storage_0@192.168.3.53	2017-05-04 17:17:47.521894 +0300	1493907467	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,head},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{processing_time,19168}]
[I]	storage_0@192.168.3.53	2017-05-04 17:17:47.522149 +0300	1493907467	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,delete},{key,<<"bodytest/1a/11/87/1a118728e175f40a10b6390f3f579bfd3a5754401763708c8ef8f0b3bd9e5d84fbdcbb167fa850291032fcbbcd4439ef28d4000000000000.xz">>},{processing_time,64342}]
[I]	storage_0@192.168.3.53	2017-05-04 17:17:47.522401 +0300	1493907467	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,head},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{processing_time,18958}]
[I]	storage_0@192.168.3.53	2017-05-04 17:17:47.522652 +0300	1493907467	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,head},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{processing_time,18803}]
[I]	storage_0@192.168.3.53	2017-05-04 17:17:47.522912 +0300	1493907467	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,head},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{processing_time,14251}]
[I]	storage_0@192.168.3.53	2017-05-04 17:17:47.524355 +0300	1493907467	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,put},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{processing_time,50816}]
[I]	storage_0@192.168.3.53	2017-05-04 17:17:47.525083 +0300	1493907467	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,put},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{processing_time,45786}]
[I]	storage_0@192.168.3.53	2017-05-04 17:17:47.526651 +0300	1493907467	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,delete},{key,<<"bodytest/05/67/2b/05672be039f98d72ef426d413ae66d6aa33b472625a448879569a25ca29cdb8699580886a9759e357e97cc834bef15dd84d2000000000000.xz">>},{processing_time,43717}]
[I]	storage_0@192.168.3.53	2017-05-04 17:17:47.527288 +0300	1493907467	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,put},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{processing_time,22024}]
[I]	storage_0@192.168.3.53	2017-05-04 17:17:47.527732 +0300	1493907467	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,delete},{key,<<"bodytest/4e/cc/ef/4eccef1b917e48d1df702faab63181162c7a8f67998d7b5ef11ac33940ffe6362a8d1671c5e5f2c39945669b1d04f1ef0027720000000000.xz\n1">>},{processing_time,15411}]
[I]	storage_0@192.168.3.53	2017-05-04 17:17:47.528128 +0300	1493907467	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,delete},{key,<<"bodytest/66/06/b6/6606b6f9fab7e81872e8a628b696634b0c25294b0a79fbaf03c10ec49aff1d44a0921efcba5a53127e84d83edf7206d758ac850000000000.xz\n1">>},{processing_time,15411}]

error log on storage_0:

[W]	storage_0@192.168.3.53	2017-05-04 17:16:23.854142 +0300	1493907383	leo_storage_replicator:loop/6	216	[{method,delete},{key,<<"bodytest/3e/b3/bd/3eb3bde5f5e58a67db86dbea8dd6850810fabdf41ee1e65ba1dd8395279175259b0fc7cf9b4b60f1cbc200d2d8bd541e00e6000000000000.xz">>},{cause,timeout}]
[W]	storage_0@192.168.3.53	2017-05-04 17:16:24.850092 +0300	1493907384	leo_storage_replicator:replicate/5	123	[{method,delete},{key,<<"bodytest/3e/b3/bd/3eb3bde5f5e58a67db86dbea8dd6850810fabdf41ee1e65ba1dd8395279175259b0fc7cf9b4b60f1cbc200d2d8bd541e00e6000000000000.xz">>},{cause,timeout}]
[W]	storage_0@192.168.3.53	2017-05-04 17:16:54.851871 +0300	1493907414	leo_storage_replicator:loop/6	216	[{method,delete},{key,<<"bodytest/1a/11/87/1a118728e175f40a10b6390f3f579bfd3a5754401763708c8ef8f0b3bd9e5d84fbdcbb167fa850291032fcbbcd4439ef28d4000000000000.xz">>},{cause,timeout}]
[W]	storage_0@192.168.3.53	2017-05-04 17:16:55.851836 +0300	1493907415	leo_storage_replicator:replicate/5	123	[{method,delete},{key,<<"bodytest/1a/11/87/1a118728e175f40a10b6390f3f579bfd3a5754401763708c8ef8f0b3bd9e5d84fbdcbb167fa850291032fcbbcd4439ef28d4000000000000.xz">>},{cause,timeout}]
[W]	storage_0@192.168.3.53	2017-05-04 17:17:25.853134 +0300	1493907445	leo_storage_replicator:loop/6	216	[{method,delete},{key,<<"bodytest/34/86/01/348601a3d08bf38a4cb5bd8e22dae951d689def13b7bd1cc9c08cd0200a3cd52c6e015bf3be722d810a94f132752faf278c3000000000000.xz">>},{cause,timeout}]
[W]	storage_0@192.168.3.53	2017-05-04 17:17:26.855813 +0300	1493907446	leo_storage_replicator:replicate/5	123	[{method,delete},{key,<<"bodytest/34/86/01/348601a3d08bf38a4cb5bd8e22dae951d689def13b7bd1cc9c08cd0200a3cd52c6e015bf3be722d810a94f132752faf278c3000000000000.xz">>},{cause,timeout}]
[W]	storage_0@192.168.3.53	2017-05-04 17:17:36.117939 +0300	1493907456	leo_storage_replicator:loop/6	216	[{method,delete},{key,<<"bodytest/06/04/2d/06042debc9eb14bd654582d79019c02698f9514ae73709dda7f6a614868d294819908ad755cf814fb43059560fe2f0c984c3010000000000.xz">>},{cause,timeout}]
[W]	storage_0@192.168.3.53	2017-05-04 17:17:37.113194 +0300	1493907457	leo_storage_replicator:replicate/5	123	[{method,delete},{key,<<"bodytest/06/04/2d/06042debc9eb14bd654582d79019c02698f9514ae73709dda7f6a614868d294819908ad755cf814fb43059560fe2f0c984c3010000000000.xz">>},{cause,timeout}]
[W]	storage_0@192.168.3.53	2017-05-04 17:17:47.534739 +0300	1493907467	leo_storage_read_repairer:compare/4	165	[{node,'storage_1@192.168.3.54'},{addr_id,192490066507992604465461441302734706270},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{clock,1493907416708152},{cause,primary_inconsistency}]
[W]	storage_0@192.168.3.53	2017-05-04 17:17:47.538512 +0300	1493907467	leo_storage_read_repairer:compare/4	165	[{node,'storage_1@192.168.3.54'},{addr_id,192490066507992604465461441302734706270},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{clock,1493907421704031},{cause,primary_inconsistency}]
[W]	storage_0@192.168.3.53	2017-05-04 17:17:47.542151 +0300	1493907467	leo_storage_read_repairer:compare/4	165	[{node,'storage_1@192.168.3.54'},{addr_id,192490066507992604465461441302734706270},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{clock,1493907421704031},{cause,primary_inconsistency}]
[W]	storage_0@192.168.3.53	2017-05-04 17:17:47.549344 +0300	1493907467	leo_storage_read_repairer:compare/4	165	[{node,'storage_1@192.168.3.54'},{addr_id,192490066507992604465461441302734706270},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{clock,1493907421704031},{cause,primary_inconsistency}]

Info log on storage_1:

[I]	storage_1@192.168.3.54	2017-05-04 17:16:10.725946 +0300	1493907370	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,21667}]
[I]	storage_1@192.168.3.54	2017-05-04 17:16:20.764386 +0300	1493907380	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,26764}]
[I]	storage_1@192.168.3.54	2017-05-04 17:16:37.95550 +0300	1493907397	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,35064}]
[I]	storage_1@192.168.3.54	2017-05-04 17:16:47.109806 +0300	1493907407	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,40093}]
[I]	storage_1@192.168.3.54	2017-05-04 17:17:03.480048 +0300	1493907423	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,45433}]
[I]	storage_1@192.168.3.54	2017-05-04 17:17:03.480713 +0300	1493907423	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/f0/98/63/f0986371cf98c032e6c870ae6f4a26fac08ae91958c14c86478b39b758ea58953095a32412961708a0a9090e0d2da4edf615cc0100000000.xz\n1">>},{processing_time,5704}]
[I]	storage_1@192.168.3.54	2017-05-04 17:17:13.497836 +0300	1493907433	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,50442}]
[I]	storage_1@192.168.3.54	2017-05-04 17:17:13.503749 +0300	1493907433	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/4e/18/4f/4e184f78326e5665991244965ecf1a3bca129ed4353adb0d8cc63a5c7d8a7a49a7ade04120ba3a5c75e18c5be2da79ffa829a40000000000.xz\n2">>},{processing_time,8485}]
[I]	storage_1@192.168.3.54	2017-05-04 17:17:13.637206 +0300	1493907433	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,put},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{processing_time,16918}]
[I]	storage_1@192.168.3.54	2017-05-04 17:17:13.641295 +0300	1493907433	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,put},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{processing_time,11937}]
[I]	storage_1@192.168.3.54	2017-05-04 17:17:13.660910 +0300	1493907433	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,delete},{key,<<"bodytest/f0/98/63/f0986371cf98c032e6c870ae6f4a26fac08ae91958c14c86478b39b758ea58953095a32412961708a0a9090e0d2da4edf615cc0100000000.xz\n1">>},{processing_time,10180}]

error log on storage_1:

[E]	storage_1@192.168.3.54	2017-05-04 17:16:10.720827 +0300	1493907370	leo_backend_db_eleveldb:prefix_search/3222	{timeout,{gen_server,call,[leo_async_deletion_queue_1,{enqueue,<<131,104,2,110,16,0,38,115,152,115,44,32,50,32,91,246,196,247,235,102,48,217,109,0,0,0,133,98,111,100,121,116,101,115,116,47,57,54,47,54,50,47,97,57,47,57,54,54,50,97,57,57,57,51,51,50,49,51,54,52,53,100,48,50,54,102,51,57,56,53,56,97,48,50,51,99,50,100,48,54,100,101,50,51,98,55,101,56,48,53,52,52,56,48,51,102,48,50,98,100,50,51,52,49,98,102,53,53,102,55,48,54,56,50,100,100,99,54,51,102,52,53,52,52,55,48,99,49,51,102,99,100,48,101,51,51,100,50,52,55,102,48,52,48,48,48,48,48,48,48,48,48,48,48,48,48,48,46,120,122>>,<<131,104,6,100,0,22,97,115,121,110,99,95,100,101,108,101,116,105,111,110,95,109,101,115,115,97,103,101,110,7,0,217,31,144,105,179,78,5,110,16,0,38,115,152,115,44,32,50,32,91,246,196,247,235,102,48,217,109,0,0,0,133,98,111,100,121,116,101,115,116,47,57,54,47,54,50,47,97,57,47,57,54,54,50,97,57,57,57,51,51,50,49,51,54,52,53,100,48,50,54,102,51,57,56,53,56,97,48,50,51,99,50,100,48,54,100,101,50,51,98,55,101,56,48,53,52,52,56,48,51,102,48,50,98,100,50,51,52,49,98,102,53,53,102,55,48,54,56,50,100,100,99,54,51,102,52,53,52,52,55,48,99,49,51,102,99,100,48,101,51,51,100,50,52,55,102,48,52,48,48,48,48,48,48,48,48,48,48,48,48,48,48,46,120,122,110,5,0,160,179,127,210,14,97,0>>},10000]}}
[E]	storage_1@192.168.3.54	2017-05-04 17:16:20.760699 +0300	1493907380	leo_backend_db_eleveldb:prefix_search/3222	{timeout,{gen_server,call,[leo_async_deletion_queue_1,{enqueue,<<131,104,2,110,16,0,166,167,75,13,81,196,48,74,78,229,160,88,48,152,97,28,109,0,0,0,133,98,111,100,121,116,101,115,116,47,48,48,47,48,48,47,48,56,47,48,48,48,48,48,56,55,51,57,57,102,53,55,98,56,53,54,100,57,52,56,50,99,101,53,50,57,50,55,56,99,99,100,53,100,50,55,101,98,51,101,49,49,57,98,53,97,48,99,50,102,99,98,56,57,49,97,52,55,102,100,48,49,56,49,56,51,98,56,51,100,102,101,55,53,99,99,102,51,54,97,53,57,55,101,48,50,101,98,56,50,57,52,54,100,99,48,48,53,97,48,101,48,48,48,48,48,48,48,48,48,48,46,120,122>>,<<131,104,6,100,0,22,97,115,121,110,99,95,100,101,108,101,116,105,111,110,95,109,101,115,115,97,103,101,110,7,0,32,17,41,106,179,78,5,110,16,0,166,167,75,13,81,196,48,74,78,229,160,88,48,152,97,28,109,0,0,0,133,98,111,100,121,116,101,115,116,47,48,48,47,48,48,47,48,56,47,48,48,48,48,48,56,55,51,57,57,102,53,55,98,56,53,54,100,57,52,56,50,99,101,53,50,57,50,55,56,99,99,100,53,100,50,55,101,98,51,101,49,49,57,98,53,97,48,99,50,102,99,98,56,57,49,97,52,55,102,100,48,49,56,49,56,51,98,56,51,100,102,101,55,53,99,99,102,51,54,97,53,57,55,101,48,50,101,98,56,50,57,52,54,100,99,48,48,53,97,48,101,48,48,48,48,48,48,48,48,48,48,46,120,122,110,5,0,170,179,127,210,14,97,0>>},10000]}}
[E]	storage_1@192.168.3.54	2017-05-04 17:16:37.94918 +0300	1493907397	leo_backend_db_eleveldb:prefix_search/3	222	{timeout,{gen_server,call,[leo_async_deletion_queue_1,{enqueue,<<131,104,2,110,16,0,205,38,194,2,251,99,149,185,246,131,149,156,96,116,90,188,109,0,0,0,133,98,111,100,121,116,101,115,116,47,55,56,47,56,56,47,102,97,47,55,56,56,56,102,97,97,48,54,54,101,50,98,54,56,52,54,99,57,98,56,52,51,55,102,52,99,100,55,98,97,53,55,100,99,52,51,55,98,100,100,98,99,51,98,56,53,51,101,101,100,48,53,98,101,57,56,101,48,97,97,99,49,98,97,97,51,51,57,52,101,55,48,55,48,98,48,101,57,98,49,99,101,99,57,99,98,99,57,101,49,50,55,54,99,54,97,56,97,97,48,48,48,48,48,48,48,48,48,48,48,48,46,120,122>>,<<131,104,6,100,0,22,97,115,121,110,99,95,100,101,108,101,116,105,111,110,95,109,101,115,115,97,103,101,110,7,0,184,132,34,107,179,78,5,110,16,0,205,38,194,2,251,99,149,185,246,131,149,156,96,116,90,188,109,0,0,0,133,98,111,100,121,116,101,115,116,47,55,56,47,56,56,47,102,97,47,55,56,56,56,102,97,97,48,54,54,101,50,98,54,56,52,54,99,57,98,56,52,51,55,102,52,99,100,55,98,97,53,55,100,99,52,51,55,98,100,100,98,99,51,98,56,53,51,101,101,100,48,53,98,101,57,56,101,48,97,97,99,49,98,97,97,51,51,57,52,101,55,48,55,48,98,48,101,57,98,49,99,101,99,57,99,98,99,57,101,49,50,55,54,99,54,97,56,97,97,48,48,48,48,48,48,48,48,48,48,48,48,46,120,122,110,5,0,187,179,127,210,14,97,0>>},10000]}}
[E]	storage_1@192.168.3.54	2017-05-04 17:16:47.108568 +0300	1493907407	leo_backend_db_eleveldb:prefix_search/3222	{timeout,{gen_server,call,[leo_async_deletion_queue_1,{enqueue,<<131,104,2,110,16,0,166,167,75,13,81,196,48,74,78,229,160,88,48,152,97,28,109,0,0,0,133,98,111,100,121,116,101,115,116,47,48,48,47,48,48,47,48,56,47,48,48,48,48,48,56,55,51,57,57,102,53,55,98,56,53,54,100,57,52,56,50,99,101,53,50,57,50,55,56,99,99,100,53,100,50,55,101,98,51,101,49,49,57,98,53,97,48,99,50,102,99,98,56,57,49,97,52,55,102,100,48,49,56,49,56,51,98,56,51,100,102,101,55,53,99,99,102,51,54,97,53,57,55,101,48,50,101,98,56,50,57,52,54,100,99,48,48,53,97,48,101,48,48,48,48,48,48,48,48,48,48,46,120,122>>,<<131,104,6,100,0,22,97,115,121,110,99,95,100,101,108,101,116,105,111,110,95,109,101,115,115,97,103,101,110,7,0,23,93,187,107,179,78,5,110,16,0,166,167,75,13,81,196,48,74,78,229,160,88,48,152,97,28,109,0,0,0,133,98,111,100,121,116,101,115,116,47,48,48,47,48,48,47,48,56,47,48,48,48,48,48,56,55,51,57,57,102,53,55,98,56,53,54,100,57,52,56,50,99,101,53,50,57,50,55,56,99,99,100,53,100,50,55,101,98,51,101,49,49,57,98,53,97,48,99,50,102,99,98,56,57,49,97,52,55,102,100,48,49,56,49,56,51,98,56,51,100,102,101,55,53,99,99,102,51,54,97,53,57,55,101,48,50,101,98,56,50,57,52,54,100,99,48,48,53,97,48,101,48,48,48,48,48,48,48,48,48,48,46,120,122,110,5,0,197,179,127,210,14,97,0>>},10000]}}
[E]	storage_1@192.168.3.54	2017-05-04 17:17:03.478769 +0300	1493907423	leo_backend_db_eleveldb:prefix_search/3222	{timeout,{gen_server,call,[leo_async_deletion_queue_1,{enqueue,<<131,104,2,110,16,0,37,221,237,84,123,140,135,76,39,216,128,38,178,216,253,43,109,0,0,0,133,98,111,100,121,116,101,115,116,47,52,56,47,54,100,47,53,102,47,52,56,54,100,53,102,98,51,98,55,55,52,99,56,97,102,51,97,102,54,102,100,48,51,53,102,51,54,55,98,100,48,52,52,100,98,100,56,97,97,55,49,102,51,52,98,54,51,53,51,53,57,99,57,57,102,56,48,55,101,54,51,102,98,99,54,52,48,97,53,99,56,97,98,99,50,49,51,99,52,50,49,52,51,100,52,55,101,98,57,101,55,49,101,48,99,48,52,48,48,48,48,48,48,48,48,48,48,48,48,48,48,46,120,122>>,<<131,104,6,100,0,22,97,115,121,110,99,95,100,101,108,101,116,105,111,110,95,109,101,115,115,97,103,101,110,7,0,76,233,180,108,179,78,5,110,16,0,37,221,237,84,123,140,135,76,39,216,128,38,178,216,253,43,109,0,0,0,133,98,111,100,121,116,101,115,116,47,52,56,47,54,100,47,53,102,47,52,56,54,100,53,102,98,51,98,55,55,52,99,56,97,102,51,97,102,54,102,100,48,51,53,102,51,54,55,98,100,48,52,52,100,98,100,56,97,97,55,49,102,51,52,98,54,51,53,51,53,57,99,57,57,102,56,48,55,101,54,51,102,98,99,54,52,48,97,53,99,56,97,98,99,50,49,51,99,52,50,49,52,51,100,52,55,101,98,57,101,55,49,101,48,99,48,52,48,48,48,48,48,48,48,48,48,48,48,48,48,48,46,120,122,110,5,0,213,179,127,210,14,97,0>>},10000]}}
[E]	storage_1@192.168.3.54	2017-05-04 17:17:13.490725 +0300	1493907433	leo_backend_db_eleveldb:prefix_search/3222	{timeout,{gen_server,call,[leo_async_deletion_queue_1,{enqueue,<<131,104,2,110,16,0,166,167,75,13,81,196,48,74,78,229,160,88,48,152,97,28,109,0,0,0,133,98,111,100,121,116,101,115,116,47,48,48,47,48,48,47,48,56,47,48,48,48,48,48,56,55,51,57,57,102,53,55,98,56,53,54,100,57,52,56,50,99,101,53,50,57,50,55,56,99,99,100,53,100,50,55,101,98,51,101,49,49,57,98,53,97,48,99,50,102,99,98,56,57,49,97,52,55,102,100,48,49,56,49,56,51,98,56,51,100,102,101,55,53,99,99,102,51,54,97,53,57,55,101,48,50,101,98,56,50,57,52,54,100,99,48,48,53,97,48,101,48,48,48,48,48,48,48,48,48,48,46,120,122>>,<<131,104,6,100,0,22,97,115,121,110,99,95,100,101,108,101,116,105,111,110,95,109,101,115,115,97,103,101,110,7,0,225,206,77,109,179,78,5,110,16,0,166,167,75,13,81,196,48,74,78,229,160,88,48,152,97,28,109,0,0,0,133,98,111,100,121,116,101,115,116,47,48,48,47,48,48,47,48,56,47,48,48,48,48,48,56,55,51,57,57,102,53,55,98,56,53,54,100,57,52,56,50,99,101,53,50,57,50,55,56,99,99,100,53,100,50,55,101,98,51,101,49,49,57,98,53,97,48,99,50,102,99,98,56,57,49,97,52,55,102,100,48,49,56,49,56,51,98,56,51,100,102,101,55,53,99,99,102,51,54,97,53,57,55,101,48,50,101,98,56,50,57,52,54,100,99,48,48,53,97,48,101,48,48,48,48,48,48,48,48,48,48,46,120,122,110,5,0,223,179,127,210,14,97,0>>},10000]}}

Info log on storage_2:

[I]	storage_2@192.168.3.55	2017-05-04 17:16:12.956911 +0300	1493907372	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,23920}]
[I]	storage_2@192.168.3.55	2017-05-04 17:16:12.958225 +0300	1493907372	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/1a/11/87/1a118728e175f40a10b6390f3f579bfd3a5754401763708c8ef8f0b3bd9e5d84fbdcbb167fa850291032fcbbcd4439ef28d4000000000000.xz">>},{processing_time,21096}]
[I]	storage_2@192.168.3.55	2017-05-04 17:16:12.958522 +0300	1493907372	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/3e/b3/bd/3eb3bde5f5e58a67db86dbea8dd6850810fabdf41ee1e65ba1dd8395279175259b0fc7cf9b4b60f1cbc200d2d8bd541e00e6000000000000.xz">>},{processing_time,19109}]
[I]	storage_2@192.168.3.55	2017-05-04 17:16:35.444648 +0300	1493907395	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,41450}]
[I]	storage_2@192.168.3.55	2017-05-04 17:16:35.445099 +0300	1493907395	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/06/56/e3/0656e37a1ff09fb11abf969cb0b795905d3ed78087be15c01ca8e5b840395ca076c82eae02ba6e0f84f9d90dbf0f3a300600100000000000.xz">>},{processing_time,12582}]
[I]	storage_2@192.168.3.55	2017-05-04 17:16:35.445427 +0300	1493907395	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/1a/11/87/1a118728e175f40a10b6390f3f579bfd3a5754401763708c8ef8f0b3bd9e5d84fbdcbb167fa850291032fcbbcd4439ef28d4000000000000.xz">>},{processing_time,10552}]
[I]	storage_2@192.168.3.55	2017-05-04 17:16:42.958232 +0300	1493907402	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,30001}]
[I]	storage_2@192.168.3.55	2017-05-04 17:16:53.6668 +0300	1493907413	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/25/03/16/250316ef50b26272f99b757409d75c173135d2ef09d972821072348ad071e49897dd7245c1f250db6489a401aea567d9886e000000000000.xz">>},{processing_time,24030}]
[I]	storage_2@192.168.3.55	2017-05-04 17:16:53.7199 +0300	1493907413	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/1a/82/8c/1a828c7a9d7a334714f91a8d1c56a4ec30a8dd4998c9db79f9dfed87be084a73aa090513d535e36186a986822b1d6ca9bc74010000000000.xz">>},{processing_time,18711}]
[I]	storage_2@192.168.3.55	2017-05-04 17:16:55.327215 +0300	1493907415	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,53321}]
[I]	storage_2@192.168.3.55	2017-05-04 17:17:16.830422 +0300	1493907436	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,69821}]
[I]	storage_2@192.168.3.55	2017-05-04 17:17:16.830874 +0300	1493907436	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/34/86/01/348601a3d08bf38a4cb5bd8e22dae951d689def13b7bd1cc9c08cd0200a3cd52c6e015bf3be722d810a94f132752faf278c3000000000000.xz">>},{processing_time,20975}]
[I]	storage_2@192.168.3.55	2017-05-04 17:17:16.831683 +0300	1493907436	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/f0/98/63/f0986371cf98c032e6c870ae6f4a26fac08ae91958c14c86478b39b758ea58953095a32412961708a0a9090e0d2da4edf615cc0100000000.xz\n1">>},{processing_time,19056}]
[I]	storage_2@192.168.3.55	2017-05-04 17:17:16.832010 +0300	1493907436	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/4e/18/4f/4e184f78326e5665991244965ecf1a3bca129ed4353adb0d8cc63a5c7d8a7a49a7ade04120ba3a5c75e18c5be2da79ffa829a40000000000.xz\n2">>},{processing_time,11818}]
[I]	storage_2@192.168.3.55	2017-05-04 17:17:16.832350 +0300	1493907436	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,delete},{key,<<"bodytest/1a/11/87/1a118728e175f40a10b6390f3f579bfd3a5754401763708c8ef8f0b3bd9e5d84fbdcbb167fa850291032fcbbcd4439ef28d4000000000000.xz">>},{processing_time,63874}]
[I]	storage_2@192.168.3.55	2017-05-04 17:17:16.832687 +0300	1493907436	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,delete},{key,<<"bodytest/3e/b3/bd/3eb3bde5f5e58a67db86dbea8dd6850810fabdf41ee1e65ba1dd8395279175259b0fc7cf9b4b60f1cbc200d2d8bd541e00e6000000000000.xz">>},{processing_time,63874}]
[I]	storage_2@192.168.3.55	2017-05-04 17:17:34.514878 +0300	1493907454	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,76471}]
[I]	storage_2@192.168.3.55	2017-05-04 17:17:34.515241 +0300	1493907454	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/4e/cc/ef/4eccef1b917e48d1df702faab63181162c7a8f67998d7b5ef11ac33940ffe6362a8d1671c5e5f2c39945669b1d04f1ef0027720000000000.xz\n1">>},{processing_time,17146}]
[I]	storage_2@192.168.3.55	2017-05-04 17:17:34.515530 +0300	1493907454	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/33/8b/4c/338b4c25ca8fbb66a86f24bfc302f2fa4a9c657074c14e41692e5864a121849c4ad9a0f7342a35a16fc906d159980560782b010000000000.xz">>},{processing_time,7655}]
[I]	storage_2@192.168.3.55	2017-05-04 17:17:46.878082 +0300	1493907466	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,83833}]
[I]	storage_2@192.168.3.55	2017-05-04 17:17:46.878513 +0300	1493907466	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/0e/6e/8b/0e6e8bcdbc732024193f7114b3f7d607333a9d3212a71e7104aea2b2b3bc137514eadd9c4d7de516e345feb9764186d9389d000000000000.xz">>},{processing_time,22009}]
[I]	storage_2@192.168.3.55	2017-05-04 17:17:46.878950 +0300	1493907466	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/17/f5/9f/17f59f95c7bd20ea310cf7bd14d0c2cc9890444c621b859e03f879ccf2700936abeafbd3d62deee9ed2e58bfa86107e4cea8040100000000.xz\n3">>},{processing_time,8458}]
[I]	storage_2@192.168.3.55	2017-05-04 17:17:46.879426 +0300	1493907466	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/e3/32/e0/e332e00f0f77cf322cbc1d7a30369f8681073076c49868ab9cd5cee6043dfe3ebc8c355ae5899b74602ba763dcba872450bc560000000000.xz\n1">>},{processing_time,5269}]
[I]	storage_2@192.168.3.55	2017-05-04 17:17:46.879704 +0300	1493907466	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,delete},{key,<<"bodytest/06/56/e3/0656e37a1ff09fb11abf969cb0b795905d3ed78087be15c01ca8e5b840395ca076c82eae02ba6e0f84f9d90dbf0f3a300600100000000000.xz">>},{processing_time,71434}]
[I]	storage_2@192.168.3.55	2017-05-04 17:17:46.880035 +0300	1493907466	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,delete},{key,<<"bodytest/1a/11/87/1a118728e175f40a10b6390f3f579bfd3a5754401763708c8ef8f0b3bd9e5d84fbdcbb167fa850291032fcbbcd4439ef28d4000000000000.xz">>},{processing_time,71433}]
[I]	storage_2@192.168.3.55	2017-05-04 17:17:46.880362 +0300	1493907466	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,delete},{key,<<"bodytest/05/67/2b/05672be039f98d72ef426d413ae66d6aa33b472625a448879569a25ca29cdb8699580886a9759e357e97cc834bef15dd84d2000000000000.xz">>},{processing_time,51552}]
[I]	storage_2@192.168.3.55	2017-05-04 17:17:46.881471 +0300	1493907466	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,delete},{key,<<"bodytest/34/86/01/348601a3d08bf38a4cb5bd8e22dae951d689def13b7bd1cc9c08cd0200a3cd52c6e015bf3be722d810a94f132752faf278c3000000000000.xz">>},{processing_time,30049}]
[I]	storage_2@192.168.3.55	2017-05-04 17:17:46.881907 +0300	1493907466	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,delete},{key,<<"bodytest/f0/98/63/f0986371cf98c032e6c870ae6f4a26fac08ae91958c14c86478b39b758ea58953095a32412961708a0a9090e0d2da4edf615cc0100000000.xz\n1">>},{processing_time,30050}]
[I]	storage_2@192.168.3.55	2017-05-04 17:17:46.882233 +0300	1493907466	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,delete},{key,<<"bodytest/4e/18/4f/4e184f78326e5665991244965ecf1a3bca129ed4353adb0d8cc63a5c7d8a7a49a7ade04120ba3a5c75e18c5be2da79ffa829a40000000000.xz\n2">>},{processing_time,30050}]
[I]	storage_2@192.168.3.55	2017-05-04 17:17:46.886477 +0300	1493907466	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,delete},{key,<<"bodytest/4e/cc/ef/4eccef1b917e48d1df702faab63181162c7a8f67998d7b5ef11ac33940ffe6362a8d1671c5e5f2c39945669b1d04f1ef0027720000000000.xz\n1">>},{processing_time,12372}]
[I]	storage_2@192.168.3.55	2017-05-04 17:17:46.887370 +0300	1493907466	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,delete},{key,<<"bodytest/33/8b/4c/338b4c25ca8fbb66a86f24bfc302f2fa4a9c657074c14e41692e5864a121849c4ad9a0f7342a35a16fc906d159980560782b010000000000.xz">>},{processing_time,12373}]

Error log on storage_2:

[W]	storage_2@192.168.3.55	2017-05-04 17:16:21.862361 +0300	1493907381	leo_storage_replicator:loop/6	216	[{method,delete},{key,<<"bodytest/1a/11/87/1a118728e175f40a10b6390f3f579bfd3a5754401763708c8ef8f0b3bd9e5d84fbdcbb167fa850291032fcbbcd4439ef28d4000000000000.xz">>},{cause,timeout}]
[W]	storage_2@192.168.3.55	2017-05-04 17:16:22.862268 +0300	1493907382	leo_storage_replicator:replicate/5	123	[{method,delete},{key,<<"bodytest/1a/11/87/1a118728e175f40a10b6390f3f579bfd3a5754401763708c8ef8f0b3bd9e5d84fbdcbb167fa850291032fcbbcd4439ef28d4000000000000.xz">>},{cause,timeout}]
[W]	storage_2@192.168.3.55	2017-05-04 17:16:52.865051 +0300	1493907412	leo_storage_replicator:loop/6	216	[{method,delete},{key,<<"bodytest/06/56/e3/0656e37a1ff09fb11abf969cb0b795905d3ed78087be15c01ca8e5b840395ca076c82eae02ba6e0f84f9d90dbf0f3a300600100000000000.xz">>},{cause,timeout}]
[W]	storage_2@192.168.3.55	2017-05-04 17:16:53.865848 +0300	1493907413	leo_storage_replicator:replicate/5	123	[{method,delete},{key,<<"bodytest/06/56/e3/0656e37a1ff09fb11abf969cb0b795905d3ed78087be15c01ca8e5b840395ca076c82eae02ba6e0f84f9d90dbf0f3a300600100000000000.xz">>},{cause,timeout}]
[W]	storage_2@192.168.3.55	2017-05-04 17:17:23.868648 +0300	1493907443	leo_storage_replicator:loop/6	216	[{method,delete},{key,<<"bodytest/05/67/2b/05672be039f98d72ef426d413ae66d6aa33b472625a448879569a25ca29cdb8699580886a9759e357e97cc834bef15dd84d2000000000000.xz">>},{cause,timeout}]
[W]	storage_2@192.168.3.55	2017-05-04 17:17:24.867539 +0300	1493907444	leo_storage_replicator:replicate/5	123	[{method,delete},{key,<<"bodytest/05/67/2b/05672be039f98d72ef426d413ae66d6aa33b472625a448879569a25ca29cdb8699580886a9759e357e97cc834bef15dd84d2000000000000.xz">>},{cause,timeout}]

To summarize the problems:

  1. Timeouts on the gateway - not really a problem, as long as the operation goes on asynchronously.
  2. Typos "mehtod,delete", "mehtod,head", "mehtod,fetch" in the info log. Note that it's spelled correctly in the error log :)
  3. The fact that the delete operation did not complete (I picked ~4300 random object names and executed "whereis" for them; around 1750 of them were marked as "deleted" on all nodes and around 2500 weren't deleted on any of them). A sketch of how I checked this is at the end of this comment.
  4. The fact that the delete queues got stuck. How do I "unfreeze" them? Reboot the storage nodes? (Not a problem, I'm just keeping them like that for now in case there is something else to try.) There are no errors or anything right now (however, debug logs are not enabled); the state of all nodes is "running", but the delete queue is not being processed on storage_1 and storage_2.
  5. These lines in the log of storage_1:
[I]	storage_1@192.168.3.54	2017-05-04 17:17:13.637206 +0300	1493907433	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,put},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{processing_time,16918}]
[I]	storage_1@192.168.3.54	2017-05-04 17:17:13.641295 +0300	1493907433	leo_object_storage_event:handle_event/254	[{cause,"slow operation"},{mehtod,put},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{processing_time,11937}]

and on storage_0:

[W]	storage_0@192.168.3.53	2017-05-04 17:17:47.534739 +0300	1493907467	leo_storage_read_repairer:compare/4	165	[{node,'storage_1@192.168.3.54'},{addr_id,192490066507992604465461441302734706270},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{clock,1493907416708152},{cause,primary_inconsistency}]
[W]	storage_0@192.168.3.53	2017-05-04 17:17:47.538512 +0300	1493907467	leo_storage_read_repairer:compare/4	165	[{node,'storage_1@192.168.3.54'},{addr_id,192490066507992604465461441302734706270},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{clock,1493907421704031},{cause,primary_inconsistency}]
[W]	storage_0@192.168.3.53	2017-05-04 17:17:47.542151 +0300	1493907467	leo_storage_read_repairer:compare/4	165	[{node,'storage_1@192.168.3.54'},{addr_id,192490066507992604465461441302734706270},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{clock,1493907421704031},{cause,primary_inconsistency}]
[W]	storage_0@192.168.3.53	2017-05-04 17:17:47.549344 +0300	1493907467	leo_storage_read_repairer:compare/4	165	[{node,'storage_1@192.168.3.54'},{addr_id,192490066507992604465461441302734706270},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{clock,1493907421704031},{cause,primary_inconsistency}]

What happened here is that "minor load" I mentioned. Basically, at 17:17:13 the application tried to do a PUT of the object body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz. That's a very small object, 27 KB in size. Some moments after the successful PUT, a few (5, I believe) other applications did a GET for that object. However, they were all using the same gateway with caching enabled, so they should've gotten the object from the memory cache (at worst the gateway would've checked the ETag against a storage node). 17:17:13 was in the middle of the "delete bucket" operation, so I suppose the large "processing time" for the PUT was expected. But why the "read_repairer" errors and "primary_inconsistency"?? storage_0 is the "primary" node for this object:

[root@leo-m1 ~]# /usr/local/leofs/current/leofs-adm -p 10011 whereis body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz
-------+-----------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
 del?  |            node             |             ring address             |    size    |   checksum   |  has children  |  total chunks  |     clock      |             when
-------+-----------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
       | storage_0@192.168.3.53      | 90d03d01c8e65bffaba62e8fac56165e     |        25K |   dd009e23e7 | false          |              0 | 54eb36e9dcecf  | 2017-05-04 17:17:25 +0300
       | storage_1@192.168.3.54      | 90d03d01c8e65bffaba62e8fac56165e     |        25K |   dd009e23e7 | false          |              0 | 54eb36e9dcecf  | 2017-05-04 17:17:25 +0300
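
For reference, here is roughly how I did the check mentioned in item 3 above (a rough bash sketch: keys.txt is a hypothetical file with one object key per line taken from my application's own upload list, and the check for the deletion marker in the "del?" column is an assumption that may need adjusting to match the actual whereis output; a key counts as "deleted" here if any replica shows the marker, which was good enough for a rough estimate):

#!/usr/bin/env bash
# Run whereis for every sampled key and count how many keys have at least
# one replica flagged as deleted vs. how many show no deletion marker.
deleted=0
present=0
while read -r key; do
    # </dev/null keeps leofs-adm from touching the loop's stdin (keys.txt)
    out=$(/usr/local/leofs/current/leofs-adm -p 10011 whereis "$key" < /dev/null)
    # NOTE: assumes deleted replicas carry a marker at the start of the "del?"
    # column; adjust the grep if your whereis output differs.
    if printf '%s\n' "$out" | grep -q '^[[:space:]]*\*'; then
        deleted=$((deleted + 1))
    else
        present=$((present + 1))
    fi
done < keys.txt
echo "marked deleted: $deleted, not deleted: $present"
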
@mocchira mocchira self-assigned this May 9, 2017
@mocchira mocchira added this to the 1.4.0 milestone May 9, 2017

mocchira commented May 9, 2017

WIP

Problems

  • Retries can happen more than necessary
    • S3 Client <-> leo_gateway
    • leo_gateway <-> leo_manager
  • Some of the processes are not async
  • Insufficient Error Handling on leo_manager

Related Issues


mocchira commented May 10, 2017

@vstax Thanks for reporting in detail.

Timeouts on gateway - not really a problem, as long as operation goes on asynchronously

As I commented above, there are some problems.

  • Too many retries going on in parallel behind the scenes
  • Each retry causes a full scan of the objects stored in LeoFS,
    resulting in LeoFS getting overloaded and restarted, which can cause delete operations to stop in the middle.

Typos "mehtod,delete", "mehtod,head", "mehtod,fetch" in info log. Note that it's correct in error log :)

These are not typos (the head/fetch methods are used internally during a delete bucket operation).

The fact that delete operation did not complete (I have picked a ~4300 random object names and executed "whereis" for them; around 1750 of them was marked as "deleted" on all nodes and around 2500 weren't deleted on any of them).

As I answered in the question above, the restart can cause delete operations to stop in the middle.

The fact that delete queues got stuck. How do I "unfreeze" them? Reboot storage nodes? (not a problem, I'm just keeping them like that for now in case there is something else to try). There no errors or anything right now (however, debug logs are no enabled); state of all nodes is "running", but delete queue is not being processed on storage_1 and storage_2.

It seems the delete queues are actually freed up even though the number mq-stats displays is non-zero. This kind of inconsistency between the actual items in a queue and the number mq-stats displays can happen when a restart happens on leo_storage. We will get over this inconsistency problem somehow.

EDIT: filed this inconsistency problem on #731

These lines in log of storage_1

WIP.

@yosukehara
Member

I've made a diagram of the delete-bucket processing to clarify how to fix this issue; the diagram also covers #150.

(diagram: leofs-del-bucket-processing)


vstax commented May 10, 2017

@mocchira @yosukehara Thank you for analyzing.

This seems like a complicated issue; looking at #150 and #701, I thought this was supposed to work as long as I didn't create a bucket with the same name again, but apparently I had too high hopes.

Too much retries going on in parallel behind the scene

I can't do anything about the retries from leo_gateway to leo_storage, but I can try a different S3 client that will only do the "delete bucket" operation once, without retries, and share whether it works any better. However, I've stumbled onto something else regarding the queues, so I'll leave everything be for now.
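
Something along these lines with the AWS CLI is what I have in mind (untested against LeoFS; the endpoint URL is a placeholder for my gateway, path-style addressing may need to be configured separately, and AWS_MAX_ATTEMPTS/AWS_RETRY_MODE require a reasonably recent CLI/botocore):

# Issue a single DeleteBucket request with client-side retries disabled
# and no socket timeouts, so the client just waits for the gateway.
export AWS_MAX_ATTEMPTS=1
export AWS_RETRY_MODE=standard
# LeoFS is usually addressed path-style; this may also be needed:
#   aws configure set default.s3.addressing_style path
aws s3api delete-bucket \
    --bucket bodytest \
    --endpoint-url http://192.168.3.52:8080 \
    --cli-connect-timeout 0 \
    --cli-read-timeout 0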

This is not typos (method head/fetch are used during a delete bucket operation internally).

No, not that one. The "mehtod" part is the typo. From here:
https://github.com/leo-project/leo_object_storage/blob/develop/src/leo_object_storage_event.erl#L55

It seems actually delete queues are freed up however the number mq-stats displays is non-zero. This kind of inconsistency between the actual items in a queue and the number mq-stats display can happen in case the restart happen on leo_storage. We will get over this inconsistency problem somehow.

I've restarted the managers; the queue size didn't change. Then I restarted storage_1, and the queue started to shrink. I got ~100% CPU usage on that node.

[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_
 leo_async_deletion_queue       |   idling    | 53556          | 1440           | 550            | async deletion of objs                      
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_
 leo_async_deletion_queue       |   idling    | 53556          | 1280           | 600            | async deletion of objs                      
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_
 leo_async_deletion_queue       |   idling    | 53556          | 800            | 750            | async deletion of objs                      
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_
 leo_async_deletion_queue       |   running   | 52980          | 480            | 850            | async deletion of objs                      
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_
 leo_async_deletion_queue       |   running   | 52800          | 480            | 850            | async deletion of objs                      

After a minute or two I got two errors in error.log of storage_1:

[W]	storage_1@192.168.3.54	2017-05-10 13:17:43.733429 +0300	1494411463	leo_storage_replicator:loop/6	216	[{method,delete},{key,<<"bodytest/2d/2b/d8/2d2bd8b1f701626d70f4f253384d748463a48f2007e61da5c104630267c049908977838a1028e268132585cbd268f3ca03c4020100000000.xz">>},{cause,timeout}]
[W]	storage_1@192.168.3.54	2017-05-10 13:17:44.732449 +0300	1494411464	leo_storage_replicator:replicate/5	123	[{method,delete},{key,<<"bodytest/2d/2b/d8/2d2bd8b1f701626d70f4f253384d748463a48f2007e61da5c104630267c049908977838a1028e268132585cbd268f3ca03c4020100000000.xz">>},{cause,timeout}]

The CPU usage went to 0 and the queue "froze" again. But half a minute later I see the same 100% CPU usage again. Then it goes to 0 again. Then high again. All this time, the "number of msgs" in the queue doesn't change, but the "interval" number changes:

[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_
 leo_async_deletion_queue       |   idling    | 52440          | 0              | 1450           | async deletion of objs                      
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_
 leo_async_deletion_queue       |   idling    | 52440          | 0              | 1700           | async deletion of objs                      
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_
 leo_async_deletion_queue       |   idling    | 52440          | 0              | 2200           | async deletion of objs                      
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_
 leo_async_deletion_queue       |   idling    | 52440          | 0              | 2900           | async deletion of objs                      
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_
 leo_async_deletion_queue       |   running   | 52440          | 0              | 2950           | async deletion of objs                      
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_
 leo_async_deletion_queue       |   idling    | 52440          | 0              | 3000           | async deletion of objs                      

After this, at some point the mq-stats command for this node started to respond really slowly, taking 7-10 seconds, if executed during a period of 100% CPU usage. Nothing else has appeared in the error logs all this time. I see the same values (52400 / 0 / 3000 for the async deletion queue, 0 / 0 / 0 for all the others), but it takes 10 seconds to respond. It's still fast during the 0% CPU usage periods, but since the node switches between these all the time now, it's pretty random.

I had debug logs enabled and saw lots of lines in the storage_1 debug log during this time. At first it was like this:

[D]	storage_1@192.168.3.54	2017-05-10 13:17:13.74131 +0300	1494411433	leo_storage_handler_object:delete/4	596	[{from,leo_mq},{method,del},{key,<<"bodytest/72/80/4f/72804f11dd276935ff759f28e4363761b6b2311ab33ffb969a41d33610c17a78e56971eeaa283bc5724ebff74c9797a27822010000000000.xz">>},{req_id,0},{cause,"Could not get redundancy"}]
[D]	storage_1@192.168.3.54	2017-05-10 13:17:13.74432 +0300	1494411433	leo_storage_handler_object:delete/3	582	[{from,leo_mq},{method,del},{key,<<"bodytest/5b/e0/39/5be039360a4f0050e39c44eafde1ba847bd54593885605f22e06f4ee351e081cf75e5820483bbb11e6350d7cd2853542c495000000000000.xz">>},{req_id,0}]
[D]	storage_1@192.168.3.54	2017-05-10 13:17:13.74707 +0300	1494411433	leo_storage_handler_object:delete/4	596	[{from,leo_mq},{method,del},{key,<<"bodytest/5b/e0/39/5be039360a4f0050e39c44eafde1ba847bd54593885605f22e06f4ee351e081cf75e5820483bbb11e6350d7cd2853542c495000000000000.xz">>},{req_id,0},{cause,"Could not get redundancy"}]
[D]	storage_1@192.168.3.54	2017-05-10 13:17:13.74915 +0300	1494411433	leo_storage_handler_object:delete/3	582	[{from,leo_mq},{method,del},{key,<<"bodytest/39/58/a6/3958a6e0b7e1f33eaec7c5634498bb65579d13b8dff8983943b9144359b74206627e398b44a7ab0a29eb00169a9651230600100000000000.xz">>},{req_id,0}]
[D]	storage_1@192.168.3.54	2017-05-10 13:17:13.75166 +0300	1494411433	leo_storage_handler_object:delete/4	596	[{from,leo_mq},{method,del},{key,<<"bodytest/39/58/a6/3958a6e0b7e1f33eaec7c5634498bb65579d13b8dff8983943b9144359b74206627e398b44a7ab0a29eb00169a9651230600100000000000.xz">>},{req_id,0},{cause,"Could not get redundancy"}]
[D]	storage_1@192.168.3.54	2017-05-10 13:17:13.75400 +0300	1494411433	leo_storage_handler_object:delete/3	582	[{from,leo_mq},{method,del},{key,<<"bodytest/2d/2b/d8/2d2bd8b1f701626d70f4f253384d748463a48f2007e61da5c104630267c049908977838a1028e268132585cbd268f3ca03c4020100000000.xz">>},{req_id,0}]

Then (note the gap in time! This - 13:25 - is a few minutes after the queue got "stuck" at 52404. Could it be that something restarted and the queue "unstuck" for a moment here?):

[D]	storage_1@192.168.3.54	2017-05-10 13:18:02.921132 +0300	1494411482	leo_storage_handler_object:delete/3	582	[{from,leo_mq},{method,del},{key,<<"bodytest/11/f3/aa/11f3aafb5d279afbcbb0ad9ff76a24f806c5fa1bd64eb54691629363dd0771394f81e4eb216e489d5169395736e80d992078020000000000.xz">>},{req_id,0}]
[D]	storage_1@192.168.3.54	2017-05-10 13:18:02.922308 +0300	1494411482	leo_storage_handler_object:delete/3	582	[{from,leo_mq},{method,del},{key,<<"bodytest/7a/e0/82/7ae0820cb42d3224fc9ac54b86e6f4c21ea567c81c91d65f524cd27e4777cb5fd3ff4d415ec8b2529c4da616f58b830ec844010000000000.xz">>},{req_id,0}]
[D]	storage_1@192.168.3.54	2017-05-10 13:27:18.952873 +0300	1494412038	null:null	0	Supervisor inet_gethost_native_sup started undefined at pid <0.10159.0>
[D]	storage_1@192.168.3.54	2017-05-10 13:27:18.953587 +0300	1494412038	null:null	0	Supervisor kernel_safe_sup started inet_gethost_native:start_link() at pid <0.10158.0>
[D]	storage_1@192.168.3.54	2017-05-10 13:27:52.990768 +0300	1494412072	leo_storage_handler_object:put/4	404	[{from,storage},{method,delete},{key,<<"bodytest/b1/28/81/b12881f64bd8bb9e7382dc33bad442cdc91b0372bcdbbf1dcbd9bacda421e9a2ee24d479dba47d346c0b89bc06e74dc62540010000000000.xz">>},{req_id,0}]
[D]	storage_1@192.168.3.54	2017-05-10 13:27:52.995161 +0300	1494412072	leo_storage_handler_object:put/4	404	[{from,storage},{method,delete},{key,<<"bodytest/8a/7a/71/8a7a715855dabae364d61c1c05a5872079a5ca82588e894fdc83c647530c50cb0c910981b2b4cf62ac9625983fee7661d840010000000000.xz">>},{req_id,0}]
[D]	storage_1@192.168.3.54	2017-05-10 13:27:52.998699 +0300	1494412072	leo_storage_handler_object:put/4	404	[{from,storage},{method,delete},{key,<<"bodytest/96/35/56/963556c85b8a97d1d6d6b3a5f33f649dcdd6c9d89729c7c517d364f8c498eb5e214c1af2d694299d50f504f42f31fd60a816010000000000.xz">>},{req_id,0}]
[D]	storage_1@192.168.3.54	2017-05-10 13:27:53.294 +0300	1494412073	leo_storage_handler_object:put/4	404	[{from,storage},{method,delete},{key,<<"bodytest/5a/3a/e0/5a3ae0c07352fdf97d3720e4afdec76ba4c3e2f60ede654f675ce68e9b5f749fd40e6bc1b3f5855c1c085402c0b3ece9a0ef000000000000.xz">>},{req_id,0}]

At some point (13:28:40 to be precise), the messages stopped appearing.

I've repeated the experiment with storage_2 and at first the situation was exactly the same, just with different numbers. However, unlike storage_1, there are other messages in the error log:

[E]	storage_2@192.168.3.55	2017-05-10 13:30:04.679350 +0300	1494412204	leo_watchdog_sub:handle_info/2	165	{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E]	storage_2@192.168.3.55	2017-05-10 13:30:06.182672 +0300	1494412206	null:null	0	Error in process <0.23852.0> on node 'storage_2@192.168.3.55' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_2@192.168.3.55	2017-05-10 13:30:06.232671 +0300	1494412206	null:null	0	Error in process <0.23853.0> on node 'storage_2@192.168.3.55' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_2@192.168.3.55	2017-05-10 13:30:09.680281 +0300	1494412209	leo_watchdog_sub:handle_info/2	165	{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E]	storage_2@192.168.3.55	2017-05-10 13:30:14.681474 +0300	1494412214	leo_watchdog_sub:handle_info/2	165	{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}

The last line repeats endlessly. I can't execute "mq-stats" for this node anymore: it returns instantly without any results (as happens when a node isn't running). However, its status is indeed "running":

[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_2@192.168.3.55
[root@leo-m0 ~]# /usr/local/bin/leofs-adm status |grep storage_2
  S    | storage_2@192.168.3.55      | running      | c1d863d0       | c1d863d0       | 2017-05-10 13:27:51 +0300
[root@leo-m0 ~]# /usr/local/bin/leofs-adm status storage_2@192.168.3.55
--------------------------------------+--------------------------------------
                Item                  |                 Value                
--------------------------------------+--------------------------------------
 Config-1: basic
--------------------------------------+--------------------------------------
                              version | 1.3.4
                     number of vnodes | 168
                    object containers | - path:[/mnt/avs], # of containers:8
                        log directory | /var/log/leofs/leo_storage/erlang
                            log level | debug
--------------------------------------+--------------------------------------
 Config-2: watchdog
--------------------------------------+--------------------------------------
 [rex(rpc-proc)]                      |
                    check interval(s) | 10
               threshold mem capacity | 33554432
--------------------------------------+--------------------------------------
 [cpu]                                |
                     enabled/disabled | disabled
                    check interval(s) | 10
               threshold cpu load avg | 5.0
                threshold cpu util(%) | 90
--------------------------------------+--------------------------------------
 [disk]                               |
                     enabled/disalbed | enabled
                    check interval(s) | 10
                threshold disk use(%) | 85
               threshold disk util(%) | 90
                    threshold rkb(kb) | 98304
                    threshold wkb(kb) | 98304
--------------------------------------+--------------------------------------
 Config-3: message-queue
--------------------------------------+--------------------------------------
                   number of procs/mq | 8
        number of batch-procs of msgs | max:3000, regular:1600
   interval between batch-procs (ms)  | max:3000, regular:500
--------------------------------------+--------------------------------------
 Config-4: autonomic operation
--------------------------------------+--------------------------------------
 [auto-compaction]                    |
                     enabled/disabled | disabled
        warning active size ratio (%) | 70
      threshold active size ratio (%) | 60
             number of parallel procs | 1
                        exec interval | 3600
--------------------------------------+--------------------------------------
 Config-5: data-compaction
--------------------------------------+--------------------------------------
  limit of number of compaction procs | 4
        number of batch-procs of objs | max:1500, regular:1000
   interval between batch-procs (ms)  | max:3000, regular:500
--------------------------------------+--------------------------------------
 Status-1: RING hash
--------------------------------------+--------------------------------------
                    current ring hash | c1d863d0
                   previous ring hash | c1d863d0
--------------------------------------+--------------------------------------
 Status-2: Erlang VM
--------------------------------------+--------------------------------------
                           vm version | 7.3
                      total mem usage | 158420648
                     system mem usage | 107431240
                      procs mem usage | 50978800
                        ets mem usage | 5926016
                                procs | 428/1048576
                          kernel_poll | true
                     thread_pool_size | 32
--------------------------------------+--------------------------------------
 Status-3: Number of messages in MQ
--------------------------------------+--------------------------------------
                 replication messages | 0
                  vnode-sync messages | 0
                   rebalance messages | 0
--------------------------------------+--------------------------------------

To conclude: it seems to me that it's not that the queue numbers are fake and the queues are actually empty - there is real stuff in the queues. Restarting makes them process again for a while, but pretty soon they get stuck once again. Plus, the situation seems different for storage_1 and storage_2.

@mocchira
Member

mocchira commented May 11, 2017

@vstax thanks for the detailed info.

No, not that one. "mehtod" part is a typo. From here:
https://github.com/leo-project/leo_object_storage/blob/develop/src/leo_object_storage_event.erl#L55

Oops. Got it :)

I've restarted managers, queue size didn't change. I've restarted storage_1 - and the queue started to reduce. I got ~100% CPU usage on that node.

The queue size didn't change and displays an invalid number that differs from the actual one, caused by #731.

The CPU usage went to 0 and the queue "froze" again. But about half a minute later I see the same 100% CPU usage again. Then it goes to 0 again. Then high again. All this time, the "number of msgs" in the queue doesn't change, but the "interval" number does:

It seems the half-a-minute cycle is caused by #728.
The CPU usage fluctuating between 100% and 0% repeatedly might imply there are some items that can't be consumed and keep existing for some reason. I will vet in detail.

EDIT: found the fault here: https://github.com/leo-project/leofs/blob/1.3.4/apps/leo_storage/src/leo_storage_mq.erl#L342-L363
After all, there are some items that can't be consumed in QUEUE_ID_ASYNC_DELETION.
They keep existing if the target object was already deleted.
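
To illustrate the idea (a minimal sketch only; the module and function names below are hypothetical placeholders, not the actual leo_storage_mq code): a message whose target object is already gone has to be treated as consumed as well, otherwise it stays in QUEUE_ID_ASYNC_DELETION forever.

-module(async_deletion_sketch).
-export([handle_message/1]).

%% Hypothetical consumer callback; delete_object/1 stands in for the real
%% object deletion call.
handle_message(Key) ->
    case delete_object(Key) of
        ok                 -> ok;              %% deleted: consume the message
        {error, not_found} -> ok;              %% already deleted: consume it too
        {error, Reason}    -> {error, Reason}  %% transient error: keep it for retry
    end.

%% Placeholder so the sketch compiles on its own.
delete_object(_Key) ->
    {error, not_found}.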

After this, at some point the mq-stats command for this node started to respond really slowly, taking 7-10 seconds, if executed during a period of 100% CPU usage.

A slow response from any command run through leofs-adm is one of the symptoms that the Erlang runtime is overloaded.
If you can reproduce it, would you execute https://github.com/leo-project/leofs_doctor against the overloaded node?
The output would make it easier for us to debug in detail.

To conclude: it seems to me that it's not that the queue numbers are fake and the queues are actually empty - there is real stuff in the queues. Restarting makes them process again for a while, but pretty soon they get stuck once again. Plus, the situation seems different for storage_1 and storage_2.

It seems the number is fake, and there is indeed stuff in the queues.

@yosukehara
Member

yosukehara commented May 11, 2017

@mocchira

Fix QUEUE_ID_ASYNC_DELETION to consume items properly even if the item was already deleted.

I've confirmed that leo_storage_mq has a bug in how it handles {error, not_found}.

I'll send a PR and its fix will be included in v1.3.5.

@vstax
Contributor Author

vstax commented May 11, 2017

@mocchira

It seems the half-a-minute cycle is caused by #728.

Well, it started to happen before I stopped storage_2, so all the nodes were running; also, 30 seconds was just a very rough estimate from looking at top. It could be something in the 10-20 second range as well, as I wasn't paying strict attention (it wasn't anything less than 10 seconds for sure). I might be misunderstanding #728, though.

After all, there are some items that can't be consumed in QUEUE_ID_ASYNC_DELETION.
They keep existing if the target object was already deleted.

Interesting find! I did tests with double-deletes and deletes of non-existent objects before, but that was on 1.3.2.1, before the changes to the queue mechanism.

If you can reproduce, would you like to execute https://github.com/leo-project/leofs_doctor against the overloaded node?

I will. That load (100-0-100-...) on storage_1 ended around 13:28, when the errors and messages in the debug log stopped appearing. The queue isn't being consumed, but the node itself is fine.

storage_2 is still in bad shape: it doesn't respond to the mq-stats command and spits out

[E]	storage_2@192.168.3.55	2017-05-11 10:43:35.310528 +0300	1494488615	leo_watchdog_sub:handle_info/2	165	{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}

every 5 seconds. Also, the errors that I've seen on storage_1 never appeared in the storage_2 log.

However, when I restart nodes I'll probably see something else.

A question: you've found a case where messages in the queue can stop being processed. But after I restarted a node, I clearly had > 1000 messages disappear from the queue on each node. Besides lots of cases of double messages like

info.20170510.13.1:[D]	storage_2@192.168.3.55	2017-05-10 13:27:50.400569 +0300	1494412070	leo_storage_handler_object:delete/3	582	[{from,leo_mq},{method,del},{key,<<"bodytest/6e/15/f6/6e15f6d4febdf823f6f8af7e1f0947ee05a5a905875c3748d12f472831421ce00eefc659d884cc998dadd2bc3d4fc1fd30cc000000000000.xz">>},{req_id,0}]
info.20170510.13.1:[D]	storage_2@192.168.3.55	2017-05-10 13:27:50.400833 +0300	1494412070	leo_storage_handler_object:delete/4	596	[{from,leo_mq},{method,del},{key,<<"bodytest/6e/15/f6/6e15f6d4febdf823f6f8af7e1f0947ee05a5a905875c3748d12f472831421ce00eefc659d884cc998dadd2bc3d4fc1fd30cc000000000000.xz">>},{req_id,0},{cause,"Could not get redundancy"}]

there were quite a few successful deletes like

info.20170510.13.1:[D]	storage_2@192.168.3.55	2017-05-10 13:28:40.717438 +0300	1494412120	leo_storage_handler_object:delete/3	582	[{from,leo_mq},{method,del},{key,<<"bodytest/fc/e3/a3/fce3a3f19655893ef1113627be71afe416987e6770337940e7d533662d7821fa8e74463d4c41ca1fdcd526c6ffb3a14e00ea090000000000.xz">>},{req_id,0}]
info.20170510.13.1:[D]	storage_2@192.168.3.55	2017-05-10 13:28:40.719168 +0300	1494412120	leo_storage_handler_object:delete/3	582	[{from,leo_mq},{method,del},{key,<<"bodytest/f5/6b/01/f56b019f9b473ccb07efbf5091d3ce257b1dcfce862669b2684be231c4f028ce92e8b4fc2dd1ac58248210ac99744ea60018000000000000.xz">>},{req_id,0}]
info.20170510.13.1:[D]	storage_2@192.168.3.55	2017-05-10 13:28:40.723881 +0300	1494412120	leo_storage_handler_object:delete/3	582	[{from,leo_mq},{method,del},{key,<<"bodytest/c4/3c/46/c43c46dd688723e79858c0af76107cc370ad7aebbac60c604de7a8bee450b9b78f3c8222272aefd3bc66579cf3fb12ca10c4000000000000.xz">>},{req_id,0}]

on both storage_1 and storage_2. So somehow a node restart makes part of the queue get processed, even though it wasn't being processed while the node was running.

I could upload work/queue/4 (around 60 MB for both nodes) somewhere, then restart the nodes and see if this (successful processing of part of the queue) happens again. Would the queue contents help you in debugging?

@mocchira
Member

@yosukehara #732

@yosukehara yosukehara modified the milestones: v1.3.5, 1.4.0 May 11, 2017
@mocchira
Member

@vstax

Well, it started to happen before I stopped storage_2, so all the nodes were running; also, 30 seconds was just a very rough estimate from looking at top. It could be something in the 10-20 second range as well, as I wasn't paying strict attention (it wasn't anything less than 10 seconds for sure). I might be misunderstanding #728, though.

Since there are multiple consumer processes/files (IIRC, 4 or 8 by default) per queue (in this case ASYNC_DELETION), the period can vary below 30 seconds.

Interesting find! I did tests with double-deletes and deletes of non-existent objects before, but that was on 1.3.2.1, before the changes to the queue mechanism.

Makes sense!

I will. That load (100-0-100-...) on storage_1 ended around 13:28, when the errors and messages in the debug log stopped appearing. The queue isn't being consumed, but the node itself is fine.

Thanks.

A question: you've found a case where messages in the queue can stop being processed. But after I restarted a node, I clearly had > 1000 messages disappear from the queue on each node. Besides lots of cases of double messages like
...
on both storage_1 and storage_2. So somehow a node restart makes part of the queue get processed, even though it wasn't being processed while the node was running.

Seems something I haven't noticed is still there.
Your queue files might help me debug further.

I could upload work/queue/4 (around 60 MB for both nodes) somewhere, then restart the nodes and see if this (successful processing of part of the queue) happens again. Would the queue contents help you in debugging?

Yes! Please share via anything you like.
(Off topic: previously you shared some stuff via https://cloud.mail.ru/ - that was amazingly fast, the fastest one I've ever used :)

@vstax
Contributor Author

vstax commented May 11, 2017

@mocchira I've packed the queues (the nodes were running but there was no processing) and uploaded them to https://www.dropbox.com/s/78uitcmohhuq3mq/storage-queues.tar.gz?dl=0 (it's not such a big file so dropbox should work, I think?).

Now, after I restarted storage_1... a miracle! The queue was fully consumed, without any errors in the logs or anything. The debug log was as usual:

[D]	storage_1@192.168.3.54	2017-05-11 22:44:09.681558 +0300	1494531849	leo_storage_handler_object:delete/4	596	[{from,leo_mq},{method,del},{key,<<"bodytest/76/74/02/767402b5880aa54206793cb197e3fccf4bacf4e516444cd6c88eeea8c9d25af461bb30bcb513041ac033c8db12e7e67e4c09010000000000.xz">>},{req_id,0},{cause,"Could not get redundancy"}]
[D]	storage_1@192.168.3.54	2017-05-11 22:44:09.681905 +0300	1494531849	leo_storage_handler_object:delete/3	582	[{from,leo_mq},{method,del},{key,<<"bodytest/2a/2e/88/2a2e88feb2ed55c266961a2fcfd80b9f5f02d48fd757e79e3ac9268d1c45139334492579bc98db2e8d53338097239f4e28fe010000000000.xz">>},{req_id,0}]
[D]	storage_1@192.168.3.54	2017-05-11 22:44:09.682166 +0300	1494531849	leo_storage_handler_object:delete/4	596	[{from,leo_mq},{method,del},{key,<<"bodytest/2a/2e/88/2a2e88feb2ed55c266961a2fcfd80b9f5f02d48fd757e79e3ac9268d1c45139334492579bc98db2e8d53338097239f4e28fe010000000000.xz">>},{req_id,0},{cause,"Could not get redundancy"}]
[D]	storage_1@192.168.3.54	2017-05-11 22:44:09.682426 +0300	1494531849	leo_storage_handler_object:delete/3	582	[{from,leo_mq},{method,del},{key,<<"bodytest/3d/e8/88/3de888009faa04a6860550b94d4bb2f19fe01958ad28229a38bf4eeafd399d5a569d4130b008b48ab6d51889add0aa2e2570010000000000.xz">>},{req_id,0}]
[..skipped..]
[D]	storage_1@192.168.3.54	2017-05-11 22:48:41.454128 +0300	1494532121	leo_storage_handler_object:put/4	404	[{from,storage},{method,delete},{key,<<"bodytest/58/15/b6/5815b6600a1d5aa3c46b00dffa3e0a9da7c50f7c75dc4058bbc503f6aca8c74396ce93889a7864ad14207c98445b914da443000000000000.xz">>},{req_id,0}]
[D]	storage_1@192.168.3.54	2017-05-11 22:48:41.455928 +0300	1494532121	leo_storage_handler_object:put/4	404	[{from,storage},{method,delete},{key,<<"bodytest/f0/d8/4d/f0d84d4f4b6cb071fb88f3107a00d87be6a849dc304ec7a738c9d7ac4f7e97f7e5ff30a6beff3536fe6267f8af26e57b3ce9000000000000.xz">>},{req_id,0}]

Error log - nothing to show. Queue state during these 4 minutes (removed some extra lines):

[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_
 leo_async_deletion_queue       |   running   | 51355          | 1600           | 500            | async deletion of objs                      
 leo_async_deletion_queue       |   running   | 46380          | 320            | 900            | async deletion of objs                      
 leo_async_deletion_queue       |   running   | 46200          | 0              | 1000           | async deletion of objs                      
 leo_async_deletion_queue       |   idling    | 37377          | 0              | 1400           | async deletion of objs                      
 leo_async_deletion_queue       |   idling    | 23740          | 0              | 1550           | async deletion of objs                      
 leo_async_deletion_queue       |   idling    | 13480          | 0              | 1750           | async deletion of objs                      
 leo_async_deletion_queue       |   idling    | 1814           | 0              | 2000           | async deletion of objs                      
 leo_async_deletion_queue       |   idling    | 0              | 0              | 2050           | async deletion of objs                      

I've restarted storage_2 as well. I got no '-decrease/3-lc$^0/1-0-' errors this time - none at all. At first the queue was processing, then eventually it got stuck:

[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_2@192.168.3.55|grep leo_async_deletion_
 leo_async_deletion_queue       |   running   | 135142         | 1440           | 550            | async deletion of objs                      
 leo_async_deletion_queue       |   running   | 131602         | 800            | 750            | async deletion of objs                      
 leo_async_deletion_queue       |   running   | 129353         | 0              | 1000           | async deletion of objs                      
 leo_async_deletion_queue       |   idling    | 129353         | 0              | 1700           | async deletion of objs                      
 leo_async_deletion_queue       |   idling    | 129353         | 0              | 1700           | async deletion of objs                      
 leo_async_deletion_queue       |   running   | 129353         | 0              | 2200           | async deletion of objs                      
 leo_async_deletion_queue       |   idling    | 129353         | 0              | 3000           | async deletion of objs                      

I got just this in the error log around the time it froze:

[W]	storage_2@192.168.3.55	2017-05-11 22:48:21.736404 +0300	1494532101	leo_storage_replicator:loop/6	216	[{method,delete},{key,<<"bodytest/b3/64/07/b36407b89a66f64c10e6298af4ba894cd6c2dc501dfd1b65f4567b182777f58c6485e8dc435e19af5b08960aceb946ed289e7d0000000000.xz">>},{cause,timeout}]
[W]	storage_2@192.168.3.55	2017-05-11 22:48:22.733389 +0300	1494532102	leo_storage_replicator:replicate/5	123	[{method,delete},{key,<<"bodytest/b3/64/07/b36407b89a66f64c10e6298af4ba894cd6c2dc501dfd1b65f4567b182777f58c6485e8dc435e19af5b08960aceb946ed289e7d0000000000.xz">>},{cause,timeout}]

Now it spends 10-20 seconds in a 100% CPU state, then switches back to 0, then 100% again and so on, just like storage_1 did during the last experiment. And like last time, "mq-stats" makes me wait if executed during a 100% CPU usage period. The whole situation seems quite random...

EDIT: 1 hour after the experiment, everything is still the same; 100% CPU usage alternating with 0% CPU usage. Nothing in the error logs (besides some disk watchdog messages, as I'm over 80% disk usage on the volume with AVS files). This is unlike the last experiment (#725 (comment)), when storage_1 wrote something about "Supervisor started" in its logs at some point and stopped consuming CPU soon after.

Leo_doctor logs: https://pastebin.com/y9RgXtEK, https://pastebin.com/rsxLCwDN and https://pastebin.com/PMFeRxFH
The first one, I think, was executed entirely during a 100% CPU usage period. The second one started during 100% CPU usage and its last 3 seconds or so fell in a near-0% CPU usage period. The third one was run without that "expected_svt" option (I don't know the difference, so I'm not sure which one you need); it started during 100% CPU usage and its last 4-5 seconds fell in a near-0% usage period.

EDIT: 21 hours after the experiment, the 100% CPU usage alternating with 0% CPU on storage_2 has stopped. Nothing related in the logs, really; not in error.log nor in erlang.log. No mention of restarts or anything - according to sar, at 19:40 on May 12 the load was there, and at 19:50 and from that point on it wasn't. The leo_async_deletion_queue queue is unchanged, 129353 / 0 / 3000 messages, just like it was at the moment it stopped processing. Just in case, leo_doctor logs from the current moment (note that there might be very light load on this node, plus the disk space watchdog triggered): https://pastebin.com/iUMn6uLX

@mocchira
Member

@vstax Still WIP, although I'd like to share what I've got at the moment.

Since the second one can be mitigated by reducing the number of consumers of leo_mq, I will add this workaround to #725 (comment).
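
For reference, that workaround would presumably just mean lowering mq.num_of_mq_procs in leo_storage.conf (the knob reported as "number of procs/mq" in the node status output above); the value here is only an example:

## leo_storage.conf - example only: fewer consumer processes per message queue
mq.num_of_mq_procs = 4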

@mocchira
Member

mocchira commented May 19, 2017

Design considerations for #725 (comment).

  • How to implement Queue-1
    • To make it easy to implement reliable replication, mnesia looks like a good choice to me as it runs on leo_manager(s)
  • Handle leo_manager(s) going down in the middle
    • Make the communication between leo_manager and leo_storage(s) async (leo_storage responds immediately after it has succeeded in storing the bucket info into Queue-2)
    • Implement a worker process polling leo_storage(s) to confirm whether each leo_storage has finished deleting a bucket (see the sketch after this list)
    • Once the worker process confirms the delete-bucket has completed on all leo_storage(s), delete the corresponding entry in Queue-1
  • About Queue-2
    • What to store: Bucket
    • When to store: received a delete-bucket request from leo_manager
    • When to delete: after succeeding in storing all object info under the bucket into Queue-3
  • About Queue-3
    • What to store: Key and AddrId; store those in a different queue from the existing async-deletion queue, as we can't tell whether all deletions have completed if they are mixed in
  • How to handle multiple delete-bucket requests in parallel
    • Need to manage one Queue-3 instance per delete-bucket request to distinguish the progress of each delete-bucket
    • Or allow ONLY one delete-bucket to run at once (accept multiple delete-bucket requests on leo_manager but proceed with only one delete-bucket operation on leo_storage(s))
  • Priority on background jobs (BJ)
    • We currently have no priority on BJs; however, rebalance/recover-node should obviously take priority over delete-bucket (maybe this should be filed as another issue)
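
A rough sketch of how the Queue-1 entry and the polling worker could fit together (illustration only: the table name, record fields and the remote del_bucket_stub:is_done/1 call are hypothetical placeholders, not existing LeoFS APIs; mnesia is assumed to run on the leo_manager node(s)):

-module(del_bucket_poller).
-export([init_table/0, enqueue/2, poll/0]).

%% Queue-1: one entry per delete-bucket request, kept on leo_manager(s) in mnesia.
-record(del_bucket_req, {bucket   :: binary(),
                         storages :: [node()]}).

init_table() ->
    mnesia:create_table(del_bucket_req,
                        [{attributes, record_info(fields, del_bucket_req)},
                         {disc_copies, [node()]}]).

%% Called when leo_manager accepts a delete-bucket request.
enqueue(Bucket, StorageNodes) ->
    mnesia:dirty_write(#del_bucket_req{bucket   = Bucket,
                                       storages = StorageNodes}).

%% Polling worker: ask every storage node whether it has finished deleting the
%% bucket; once all of them confirm, drop the entry from Queue-1.
poll() ->
    lists:foreach(
      fun(Bucket) ->
              [#del_bucket_req{storages = Nodes}] =
                  mnesia:dirty_read(del_bucket_req, Bucket),
              AllDone = lists:all(
                          fun(Node) ->
                                  %% del_bucket_stub:is_done/1 is a hypothetical callback
                                  rpc:call(Node, del_bucket_stub, is_done,
                                           [Bucket]) =:= true
                          end, Nodes),
              case AllDone of
                  true  -> mnesia:dirty_delete(del_bucket_req, Bucket);
                  false -> ok
              end
      end, mnesia:dirty_all_keys(del_bucket_req)).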

@yosukehara yosukehara modified the milestones: 1.4.0, v1.3.5 May 24, 2017
@vstax
Contributor Author

vstax commented May 26, 2017

I've repeated - or rather, tried to complete - this experiment by re-adding the "bodytest" bucket and removing it again on the latest dev version. I don't expect it to work perfectly, but I wanted to check how the already-fixed issues helped. Debug logs are disabled to make sure the leo_logger problems won't affect anything, and mq.num_of_mq_procs = 4 is set.

This time, I made sure to abort the s3cmd rb s3://bodytest command after it sent the "remove bucket" request once, so that it didn't try to repeat the request or anything. It's exactly the same system, but I estimate that over 60% of the original number of objects (1M) were still present in the "bodytest" bucket, so there was a lot of stuff to remove.

gateway logs:

[W]	gateway_0@192.168.3.52	2017-05-26 22:00:26.733769 +0300	1495825226	leo_gateway_s3_api:delete_bucket_2/3	1798	[{cause,timeout}]
[W]	gateway_0@192.168.3.52	2017-05-26 22:00:31.734812 +0300	1495825231	leo_gateway_s3_api:delete_bucket_2/3	1798	[{cause,timeout}]

storage_0 info log:

[I]	storage_0@192.168.3.53	2017-05-26 22:00:27.162670 +0300	1495825227	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,5354}]
[I]	storage_0@192.168.3.53	2017-05-26 22:00:34.240077 +0300	1495825234	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,7504}]
[I]	storage_0@192.168.3.53	2017-05-26 22:00:34.324375 +0300	1495825234	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,7162}]
[I]	storage_0@192.168.3.53	2017-05-26 22:00:42.957679 +0300	1495825242	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,8633}]
[I]	storage_0@192.168.3.53	2017-05-26 22:00:43.469667 +0300	1495825243	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,9229}]
[I]	storage_0@192.168.3.53	2017-05-26 22:00:50.241744 +0300	1495825250	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,7284}]
[I]	storage_0@192.168.3.53	2017-05-26 22:00:51.136573 +0300	1495825251	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,7667}]
[I]	storage_0@192.168.3.53	2017-05-26 22:00:59.20997 +0300	1495825259	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,7884}]
[I]	storage_0@192.168.3.53	2017-05-26 22:00:59.21352 +0300	1495825259	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bodytest/f9/95/c8/f995c8b60e77fe7a79658fd7dd4169ecf5eaabf6662796d8b6eef323c8c044e69e683a09ddee733409265144bccacd7778597a0000000000__.xz\n1">>},{processing_time,5700}]
[I]	storage_0@192.168.3.53	2017-05-26 22:01:20.242104 +0300	1495825280	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,30001}]
[I]	storage_0@192.168.3.53	2017-05-26 22:01:21.264304 +0300	1495825281	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bodytest/18/3e/75/183e754dac629b6fef37d66e7f76d2dd361b88683b7dc20ff4a798864659c49fb7a09161fc9c6a7c7a26ee0b1912391e8466000000000000.xz">>},{processing_time,26450}]
[I]	storage_0@192.168.3.53	2017-05-26 22:01:38.156285 +0300	1495825298	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,39136}]
[I]	storage_0@192.168.3.53	2017-05-26 22:01:38.156745 +0300	1495825298	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bodytest/00/e4/ce/00e4ce45fd1c5de4122221d44289f4ace93dc0d046fead4f4d3549b7b756af04621a4ab684a1c7db7b8d5f017555484d90d7000000000000.xz">>},{processing_time,12339}]
[I]	storage_0@192.168.3.53	2017-05-26 22:01:38.157114 +0300	1495825298	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bodytest/59/49/0f/59490fdb17b17ce75b31909675e7262db9b01a84f04792cbe2f7858d114c48efc5d2f1cf98190dcf9af96a12679cbdccf8e89a0000000000.xz\n2">>},{processing_time,10976}]
[I]	storage_0@192.168.3.53	2017-05-26 22:01:38.157429 +0300	1495825298	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bodytest/63/29/b9/6329b983cb8e8ea323181e34d2d3b64403ff79671f6850a268406daab8fcf772009d994e804b5fc9611f0773a96d6cde94e3020100000000.xz\n3">>},{processing_time,10018}]
[I]	storage_0@192.168.3.53	2017-05-26 22:01:38.158711 +0300	1495825298	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,delete},{key,<<"bodytest/18/3e/75/183e754dac629b6fef37d66e7f76d2dd361b88683b7dc20ff4a798864659c49fb7a09161fc9c6a7c7a26ee0b1912391e8466000000000000.xz">>},{processing_time,16894}]

Error log:

[E]	storage_0@192.168.3.53	2017-05-26 21:58:54.581809 +0300	1495825134	leo_backend_db_eleveldb:first_n/2	282	{badarg,[{eleveldb,async_iterator,[#Ref<0.0.6029313.23172>,<<>>,[]],[]},{eleveldb,iterator,2,[{file,"src/eleveldb.erl"},{line,200}]},{leo_backend_db_eleveldb,fold,4,[{file,"src/leo_backend_db_eleveldb.erl"},{line,373}]},{leo_backend_db_eleveldb,first_n,2,[{file,"src/leo_backend_db_eleveldb.erl"},{line,273}]},{leo_backend_db_server,handle_call,3,[{file,"src/leo_backend_db_server.erl"},{line,335}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,629}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,661}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}
[E]	storage_0@192.168.3.53	2017-05-26 21:58:54.582525 +0300	1495825134	leo_mq_server:handle_call/3	287	{badarg,[{eleveldb,async_iterator,[#Ref<0.0.6029313.23172>,<<>>,[]],[]},{eleveldb,iterator,2,[{file,"src/eleveldb.erl"},{line,200}]},{leo_backend_db_eleveldb,fold,4,[{file,"src/leo_backend_db_eleveldb.erl"},{line,373}]},{leo_backend_db_eleveldb,first_n,2,[{file,"src/leo_backend_db_eleveldb.erl"},{line,273}]},{leo_backend_db_server,handle_call,3,[{file,"src/leo_backend_db_server.erl"},{line,335}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,629}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,661}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}
[E]	storage_0@192.168.3.53	2017-05-26 21:58:54.924670 +0300	1495825134	leo_backend_db_eleveldb:first_n/2	282	{badarg,[{eleveldb,async_iterator,[#Ref<0.0.6029313.23850>,<<>>,[]],[]},{eleveldb,iterator,2,[{file,"src/eleveldb.erl"},{line,200}]},{leo_backend_db_eleveldb,fold,4,[{file,"src/leo_backend_db_eleveldb.erl"},{line,373}]},{leo_backend_db_eleveldb,first_n,2,[{file,"src/leo_backend_db_eleveldb.erl"},{line,273}]},{leo_backend_db_server,handle_call,3,[{file,"src/leo_backend_db_server.erl"},{line,335}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,629}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,661}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}
[E]	storage_0@192.168.3.53	2017-05-26 21:58:54.927313 +0300	1495825134	leo_mq_server:handle_call/3	287	{badarg,[{eleveldb,async_iterator,[#Ref<0.0.6029313.23850>,<<>>,[]],[]},{eleveldb,iterator,2,[{file,"src/eleveldb.erl"},{line,200}]},{leo_backend_db_eleveldb,fold,4,[{file,"src/leo_backend_db_eleveldb.erl"},{line,373}]},{leo_backend_db_eleveldb,first_n,2,[{file,"src/leo_backend_db_eleveldb.erl"},{line,273}]},{leo_backend_db_server,handle_call,3,[{file,"src/leo_backend_db_server.erl"},{line,335}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,629}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,661}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}
[E]	storage_0@192.168.3.53	2017-05-26 21:58:55.42756 +0300	1495825135	leo_backend_db_eleveldb:first_n/2	282	{badarg,[{eleveldb,async_iterator,[#Ref<0.0.6029313.24459>,<<>>,[]],[]},{eleveldb,iterator,2,[{file,"src/eleveldb.erl"},{line,200}]},{leo_backend_db_eleveldb,fold,4,[{file,"src/leo_backend_db_eleveldb.erl"},{line,373}]},{leo_backend_db_eleveldb,first_n,2,[{file,"src/leo_backend_db_eleveldb.erl"},{line,273}]},{leo_backend_db_server,handle_call,3,[{file,"src/leo_backend_db_server.erl"},{line,335}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,629}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,661}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}
[E]	storage_0@192.168.3.53	2017-05-26 21:58:55.43297 +0300	1495825135	leo_mq_server:handle_call/3	287	{badarg,[{eleveldb,async_iterator,[#Ref<0.0.6029313.24459>,<<>>,[]],[]},{eleveldb,iterator,2,[{file,"src/eleveldb.erl"},{line,200}]},{leo_backend_db_eleveldb,fold,4,[{file,"src/leo_backend_db_eleveldb.erl"},{line,373}]},{leo_backend_db_eleveldb,first_n,2,[{file,"src/leo_backend_db_eleveldb.erl"},{line,273}]},{leo_backend_db_server,handle_call,3,[{file,"src/leo_backend_db_server.erl"},{line,335}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,629}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,661}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}
[W]	storage_0@192.168.3.53	2017-05-26 22:01:24.864263 +0300	1495825284	leo_storage_replicator:loop/6	216	[{method,delete},{key,<<"bodytest/18/3e/75/183e754dac629b6fef37d66e7f76d2dd361b88683b7dc20ff4a798864659c49fb7a09161fc9c6a7c7a26ee0b1912391e8466000000000000.xz">>},{cause,timeout}]
[W]	storage_0@192.168.3.53	2017-05-26 22:01:25.816594 +0300	1495825285	leo_storage_replicator:replicate/5	123	[{method,delete},{key,<<"bodytest/18/3e/75/183e754dac629b6fef37d66e7f76d2dd361b88683b7dc20ff4a798864659c49fb7a09161fc9c6a7c7a26ee0b1912391e8466000000000000.xz">>},{cause,timeout}]
[W]	storage_0@192.168.3.53	2017-05-26 22:02:08.375387 +0300	1495825328	leo_storage_replicator:loop/6	216	[{method,delete},{key,<<"bodytest/0f/c3/90/0fc390382e880d2abde1858f795349a3a0ec549cc61e59009b84433b7ec4d98a771000d7e4acf3cb827f4c90942f21e5989b040000000000.xz">>},{cause,timeout}]
[W]	storage_0@192.168.3.53	2017-05-26 22:02:09.382327 +0300	1495825329	leo_storage_replicator:replicate/5	123	[{method,delete},{key,<<"bodytest/0f/c3/90/0fc390382e880d2abde1858f795349a3a0ec549cc61e59009b84433b7ec4d98a771000d7e4acf3cb827f4c90942f21e5989b040000000000.xz">>},{cause,timeout}]

storage_1 info log:

[I]	storage_1@192.168.3.54	2017-05-26 22:00:26.993464 +0300	1495825226	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,5221}]
[I]	storage_1@192.168.3.54	2017-05-26 22:00:34.450344 +0300	1495825234	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,7456}]
[I]	storage_1@192.168.3.54	2017-05-26 22:00:34.899198 +0300	1495825234	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,8167}]
[I]	storage_1@192.168.3.54	2017-05-26 22:00:34.900451 +0300	1495825234	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bodytest/9f/c1/62/9fc1621aacf06357ccd85ce7e43f4dc17eafe60bce3cb8cf864487f61ea4667ac7eded91411ee9e1fc0b7180119f29670400000000000000.xz">>},{processing_time,5424}]
[I]	storage_1@192.168.3.54	2017-05-26 22:00:46.351992 +0300	1495825246	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,11453}]
[I]	storage_1@192.168.3.54	2017-05-26 22:00:46.352702 +0300	1495825246	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bodytest/92/e4/fc/92e4fcc551dc03361f41f59e37eca7161d4dfb23fff803200bf1b990a2e81e0ed909d74c2613e90259c48c4a385702b1dc51010000000000.xz">>},{processing_time,8778}]
[I]	storage_1@192.168.3.54	2017-05-26 22:01:00.258646 +0300	1495825260	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,25808}]
[I]	storage_1@192.168.3.54	2017-05-26 22:01:00.259186 +0300	1495825260	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bodytest/15/2f/82/152f825e8fe6bfe5546cd463005871b5aa45abdcfd37b3457d58fbf9c0da8a1f993665cb6a68db8d624cd8a47f0d5e078656000000000000.xz">>},{processing_time,15039}]
[I]	storage_1@192.168.3.54	2017-05-26 22:01:21.291575 +0300	1495825281	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,34940}]
[I]	storage_1@192.168.3.54	2017-05-26 22:01:21.292084 +0300	1495825281	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bodytest/15/2f/82/152f825e8fe6bfe5546cd463005871b5aa45abdcfd37b3457d58fbf9c0da8a1f993665cb6a68db8d624cd8a47f0d5e078656000000000000.xz">>},{processing_time,29112}]
[I]	storage_1@192.168.3.54	2017-05-26 22:01:21.292789 +0300	1495825281	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bodytest/2b/5c/f3/2b5cf31eaffd8e884937240a026abec1c6a48f66b042c08cca9b80250e9a58dd2216871bdb0dddbbaae4d6e7eb0896538498000000000000.xz">>},{processing_time,5069}]
[I]	storage_1@192.168.3.54	2017-05-26 22:01:21.294835 +0300	1495825281	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,delete},{key,<<"bodytest/15/2f/82/152f825e8fe6bfe5546cd463005871b5aa45abdcfd37b3457d58fbf9c0da8a1f993665cb6a68db8d624cd8a47f0d5e078656000000000000.xz">>},{processing_time,21036}]
[I]	storage_1@192.168.3.54	2017-05-26 22:01:30.189080 +0300	1495825290	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,29930}]
[I]	storage_1@192.168.3.54	2017-05-26 22:01:30.189895 +0300	1495825290	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bodytest/01/9c/aa/019caaf22c84f6e77c5f5597810faa55ef57c71a38a133cbe9d38c631e40d11434ff449f989d77d408571af4c06e11aeb475000000000000.xz">>},{processing_time,28628}]
[I]	storage_1@192.168.3.54	2017-05-26 22:02:00.189674 +0300	1495825320	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,30001}]
[I]	storage_1@192.168.3.54	2017-05-26 22:02:08.370447 +0300	1495825328	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bodytest/0f/c3/90/0fc390382e880d2abde1858f795349a3a0ec549cc61e59009b84433b7ec4d98a771000d7e4acf3cb827f4c90942f21e5989b040000000000.xz">>},{processing_time,30001}]
[I]	storage_1@192.168.3.54	2017-05-26 22:02:23.574818 +0300	1495825343	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bodytest/0f/df/87/0fdf870a9a237805c0282ba71c737966f2630124921b5c8709b6f470754b3e187eebdd30e80d404ccb700be646bc3c03bfa6020100000000.xz\n1">>},{processing_time,29259}]
[I]	storage_1@192.168.3.54	2017-05-26 22:02:23.575370 +0300	1495825343	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bodytest/12/ed/21/12ed21380e8cb085deb10aa161feb131b581553ab1ead52e24ed88619b2ec7709d59b9e69b3d7bb0febc5930048bb1a0d8a2020100000000.xz\n2">>},{processing_time,17599}]
[I]	storage_1@192.168.3.54	2017-05-26 22:02:23.575744 +0300	1495825343	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bodytest/29/95/03/2995035be6f7fbe86d6f4f76eba845bfc50338bd40535d9947e473779538a5ba6de5534672c3b5146fb5768b9e905a4318fa7b0000000000.xz\n1">>},{processing_time,40674}]
[I]	storage_1@192.168.3.54	2017-05-26 22:02:23.576122 +0300	1495825343	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bodytest/4d/93/47/4d934795bb3006b7e35d99ca7485bfaa1b9cc1b8878fe11f260e0ffedb8e1d97f66221bfbb048ac5ce8298ae93e922be46e8020100000000.xz\n1">>},{processing_time,24915}]
[I]	storage_1@192.168.3.54	2017-05-26 22:02:23.576518 +0300	1495825343	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bodytest/59/89/e7/5989e7825beeb82933706f559ab737cfe0eb88156471a29e0c6f6ae04c00576f0b0c5462f6714d2387a1856f99cdf3fc89ab040100000000.xz\n1">>},{processing_time,9456}]
[I]	storage_1@192.168.3.54	2017-05-26 22:02:23.576883 +0300	1495825343	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bodytest/2f/99/d1/2f99d1ffa377ceda4341d1c0a85647f17fade7e8e375eafb1b8e1a17bd794fa9683a0546ed594ce2a18944c3e817498f00821a0100000000.xz\n1">>},{processing_time,38070}]
[I]	storage_1@192.168.3.54	2017-05-26 22:02:23.578804 +0300	1495825343	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bodytest/63/7f/75/637f7568ee27aa13f0ccabc34d68faac4500535cb4c3f34b4b5d4349d80a6a96de46bcc04522f76debd1060647083a4850955c0000000000.xz\n1">>},{processing_time,8954}]
[I]	storage_1@192.168.3.54	2017-05-26 22:02:23.579637 +0300	1495825343	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bodytest/07/7c/11/077c11796ee67c7a15027cf21b749ffbfd244c06980bf98a945acdd92b3404feb56609b8a0b177cd205d309e0d8310a6b0df5b0000000000.xz\n1">>},{processing_time,37267}]
[I]	storage_1@192.168.3.54	2017-05-26 22:02:23.580231 +0300	1495825343	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bodytest/49/c8/3f/49c83ff8341d50259f4138707688613860802327ebb2e75d9019bda193c8ab82a3b66b4f7e92d4d9dc0f3d39c082010e5694370100000000.xz\n2">>},{processing_time,35610}]
[I]	storage_1@192.168.3.54	2017-05-26 22:02:23.581187 +0300	1495825343	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bodytest/1a/9b/07/1a9b073aafa182620e4bb145507a097320bb4097ebec0dfddee3936a96e0cb83fc10ed7a7bfcd3f20456a3cdf0a373be3026700000000000.xz\n1">>},{processing_time,35500}]
[I]	storage_1@192.168.3.54	2017-05-26 22:02:23.581593 +0300	1495825343	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bodytest/22/8c/64/228c648d769e51472f79670cbb804f9bee23d8d9ea6612ee4a21ea11b901ef60732e3657a2e4fb68ce26b745525ada7ab0b5790000000000.xz\n1">>},{processing_time,34513}]
[I]	storage_1@192.168.3.54	2017-05-26 22:02:23.582178 +0300	1495825343	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bodytest/42/29/fb/4229fb92dac335eb214e0eef2d2bd59d25685ae9ace816f44eb4d37147921ad66b5be7ccc97938aacfdfc64c1e721f1ed2a1020100000000.xz\n2">>},{processing_time,20930}]
[I]	storage_1@192.168.3.54	2017-05-26 22:02:23.582963 +0300	1495825343	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bodytest/14/d7/9b/14d79b2fd3b666cf511d1c4e55dddf2b44f998312dc0103cd26dd7227dba14ce0ddfe0e8e87a64d30e49f788081cd75a39bc000000000000.xz">>},{processing_time,14189}]
[I]	storage_1@192.168.3.54	2017-05-26 22:02:23.583762 +0300	1495825343	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bodytest/39/b5/e3/39b5e371ed1f857e881725e5b491810862a291268efb395e948da6d83934bc19d3ef8fc7c5a9584bcd18bd174c3e080dfba2020100000000.xz\n1">>},{processing_time,13991}]

Error log:

[W]	storage_1@192.168.3.54	2017-05-26 22:01:15.220528 +0300	1495825275	leo_storage_replicator:loop/6	216	[{method,delete},{key,<<"bodytest/15/2f/82/152f825e8fe6bfe5546cd463005871b5aa45abdcfd37b3457d58fbf9c0da8a1f993665cb6a68db8d624cd8a47f0d5e078656000000000000.xz">>},{cause,timeout}]
[W]	storage_1@192.168.3.54	2017-05-26 22:01:16.221833 +0300	1495825276	leo_storage_replicator:replicate/5	123	[{method,delete},{key,<<"bodytest/15/2f/82/152f825e8fe6bfe5546cd463005871b5aa45abdcfd37b3457d58fbf9c0da8a1f993665cb6a68db8d624cd8a47f0d5e078656000000000000.xz">>},{cause,timeout}]

storage_2 info log:

[I]	storage_2@192.168.3.55	2017-05-26 22:00:34.873903 +0300	1495825234	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,8110}]
[I]	storage_2@192.168.3.55	2017-05-26 22:00:35.352063 +0300	1495825235	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,8615}]
[I]	storage_2@192.168.3.55	2017-05-26 22:00:35.359634 +0300	1495825235	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bodytest/9f/c1/62/9fc1621aacf06357ccd85ce7e43f4dc17eafe60bce3cb8cf864487f61ea4667ac7eded91411ee9e1fc0b7180119f29670400000000000000.xz">>},{processing_time,5868}]
[I]	storage_2@192.168.3.55	2017-05-26 22:00:46.957075 +0300	1495825246	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,11605}]
[I]	storage_2@192.168.3.55	2017-05-26 22:00:46.958526 +0300	1495825246	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bodytest/92/e4/fc/92e4fcc551dc03361f41f59e37eca7161d4dfb23fff803200bf1b990a2e81e0ed909d74c2613e90259c48c4a385702b1dc51010000000000.xz">>},{processing_time,9393}]
[I]	storage_2@192.168.3.55	2017-05-26 22:00:46.958917 +0300	1495825246	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bodytest/f9/95/c8/f995c8b60e77fe7a79658fd7dd4169ecf5eaabf6662796d8b6eef323c8c044e69e683a09ddee733409265144bccacd7778597a0000000000__.xz\n2">>},{processing_time,7222}]
[I]	storage_2@192.168.3.55	2017-05-26 22:01:04.874732 +0300	1495825264	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,30001}]
[I]	storage_2@192.168.3.55	2017-05-26 22:01:05.757004 +0300	1495825265	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bodytest/15/2f/82/152f825e8fe6bfe5546cd463005871b5aa45abdcfd37b3457d58fbf9c0da8a1f993665cb6a68db8d624cd8a47f0d5e078656000000000000.xz">>},{processing_time,20530}]
[I]	storage_2@192.168.3.55	2017-05-26 22:01:18.153498 +0300	1495825278	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,31196}]
[I]	storage_2@192.168.3.55	2017-05-26 22:01:18.154052 +0300	1495825278	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,head},{key,<<"bodytest/15/2f/82/152f825e8fe6bfe5546cd463005871b5aa45abdcfd37b3457d58fbf9c0da8a1f993665cb6a68db8d624cd8a47f0d5e078656000000000000.xz">>},{processing_time,25974}]
[I]	storage_2@192.168.3.55	2017-05-26 22:01:18.159729 +0300	1495825278	leo_object_storage_event:handle_event/2	54	[{cause,"slow operation"},{method,delete},{key,<<"bodytest/15/2f/82/152f825e8fe6bfe5546cd463005871b5aa45abdcfd37b3457d58fbf9c0da8a1f993665cb6a68db8d624cd8a47f0d5e078656000000000000.xz">>},{processing_time,12403}]

Error log - empty.

Queue states:
For storage_0, within 30 seconds after the delete-bucket operation, the queue reached this number:

[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_0@192.168.3.53|grep leo_async_deletion
 leo_async_deletion_queue       |   idling    | 80439          | 1600           | 500            | async deletion of objs                      

which was dropping pretty fast

 leo_async_deletion_queue       |   running   | 25950          | 3000           | 0              | async deletion of objs                      

and reached 0 about 2-3 minutes after the start of the operation:

 leo_async_deletion_queue       |   idling    | 0              | 1600           | 500            | async deletion of objs                      

For storage_1, the queue likewise got to this number within 30 seconds, but its status was "suspending":

 leo_async_deletion_queue       | suspending  | 171957         | 0              | 1700           | async deletion of objs                      

it was "suspending" all the time during experiment. It barely dropped and stays at this number even now:

 leo_async_deletion_queue       | suspending  | 170963         | 0              | 1500           | async deletion of objs                      

For storage_2, the number was this within 30 seconds after the start:

 leo_async_deletion_queue       |   idling    | 34734          | 0              | 2400           | async deletion of objs                      

it was dropping slowly (quite unlike storage_0) and had reached this number by the time it stopped going down:

 leo_async_deletion_queue       |   idling    | 29448          | 0              | 3000           | async deletion of objs                      

At this point the system is stable: nothing is going on, there is no load, most objects from "bodytest" still aren't removed, and the queues on storage_1 and storage_2 are stalled at the numbers above. There is nothing else in the log files.

I stop storage_2 and make a backup of its queues (just in case). After I start it, the queue count is the same at first; 20-30 seconds after the node starts, it begins to drop. New messages appear in the error log:

[W]	storage_2@192.168.3.55	2017-05-26 22:36:53.397898 +0300	1495827413	leo_storage_replicator:loop/6	216	[{method,delete},{key,<<"bodytest/e0/79/34/e079343a8d0f00c9374bbebe2c35eaa257511cc452e016f237ba54cf3458a817b3f57cad72a96859fb0e11c80b2e1ade03f1040100000000.xz">>},{cause,timeout}]
[W]	storage_2@192.168.3.55	2017-05-26 22:36:54.377776 +0300	1495827414	leo_storage_replicator:replicate/5	123	[{method,delete},{key,<<"bodytest/e0/79/34/e079343a8d0f00c9374bbebe2c35eaa257511cc452e016f237ba54cf3458a817b3f57cad72a96859fb0e11c80b2e1ade03f1040100000000.xz">>},{cause,timeout}]

The queue eventually drains to 0.

I stop storage_1, make a backup of its queues, and start it again. Right after the start the queue begins processing:

 leo_async_deletion_queue       |   running   | 168948         | 1600           | 500            | async deletion of objs                      

Then the CPU load on the node goes very high, the "mq-stats" command starts to hang, and I see this in the error log:

[W]	storage_1@192.168.3.54	2017-05-26 22:49:59.603367 +0300	1495828199	leo_storage_replicator:loop/6	216	[{method,delete},{key,<<"bodytest/3d/8c/78/3d8c78839ebc79cba43a1b57f138e1e3d4c422269f8aa522a55242b49cdc2ffca756d4d799f58dc0b6009f0f2e7a4638482a680000000000.xz">>},{cause,timeout}]
[W]	storage_1@192.168.3.54	2017-05-26 22:50:00.600085 +0300	1495828200	leo_storage_replicator:replicate/5	123	[{method,delete},{key,<<"bodytest/3d/8c/78/3d8c78839ebc79cba43a1b57f138e1e3d4c422269f8aa522a55242b49cdc2ffca756d4d799f58dc0b6009f0f2e7a4638482a680000000000.xz">>},{cause,timeout}]
[E]	storage_1@192.168.3.54	2017-05-26 22:50:52.705543 +0300	1495828252	leo_mq_server:handle_call/3	287	{timeout,{gen_server,call,[leo_async_deletion_queue_message_0,{first_n,0},30000]}}
[E]	storage_1@192.168.3.54	2017-05-26 22:50:53.757233 +0300	1495828253	null:null	0	gen_fsm leo_async_deletion_queue_consumer_2_1 in state idling terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E]	storage_1@192.168.3.54	2017-05-26 22:50:54.262824 +0300	1495828254	null:null	0	["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_2_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E]	storage_1@192.168.3.54	2017-05-26 22:50:54.264280 +0300	1495828254	null:null	0	Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_2_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 2) at <0.316.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E]	storage_1@192.168.3.54	2017-05-26 22:51:03.461439 +0300	1495828263	null:null	0	gen_fsm leo_async_deletion_queue_consumer_1_1 in state running terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E]	storage_1@192.168.3.54	2017-05-26 22:51:03.461919 +0300	1495828263	null:null	0	gen_fsm leo_async_deletion_queue_consumer_4_1 in state running terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E]	storage_1@192.168.3.54	2017-05-26 22:51:03.462275 +0300	1495828263	null:null	0	["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_1_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E]	storage_1@192.168.3.54	2017-05-26 22:51:03.462926 +0300	1495828263	null:null	0	Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_1_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 1) at <0.318.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E]	storage_1@192.168.3.54	2017-05-26 22:51:03.481700 +0300	1495828263	null:null	0	["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_4_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E]	storage_1@192.168.3.54	2017-05-26 22:51:03.482332 +0300	1495828263	null:null	0	Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_4_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 4) at <0.312.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E]	storage_1@192.168.3.54	2017-05-26 22:51:24.823088 +0300	1495828284	null:null	0	gen_fsm leo_async_deletion_queue_consumer_2_1 in state idling terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E]	storage_1@192.168.3.54	2017-05-26 22:51:24.823534 +0300	1495828284	null:null	0	["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_2_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E]	storage_1@192.168.3.54	2017-05-26 22:51:24.825905 +0300	1495828284	null:null	0	Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_2_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 2) at <0.20880.1> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E]	storage_1@192.168.3.54	2017-05-26 22:51:34.85988 +0300	1495828294	null:null	0	gen_fsm leo_async_deletion_queue_consumer_4_1 in state idling terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E]	storage_1@192.168.3.54	2017-05-26 22:51:34.87305 +0300	1495828294	null:null	0	["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_4_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E]	storage_1@192.168.3.54	2017-05-26 22:51:34.95578 +0300	1495828294	null:null	0	Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_4_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 4) at <0.27909.1> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E]	storage_1@192.168.3.54	2017-05-26 22:51:34.522235 +0300	1495828294	null:null	0	gen_fsm leo_async_deletion_queue_consumer_1_1 in state idling terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E]	storage_1@192.168.3.54	2017-05-26 22:51:34.525223 +0300	1495828294	null:null	0	["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_1_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E]	storage_1@192.168.3.54	2017-05-26 22:51:34.539198 +0300	1495828294	null:null	0	Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_1_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 1) at <0.27892.1> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E]	storage_1@192.168.3.54	2017-05-26 22:51:34.539694 +0300	1495828294	null:null	0	Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_1_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 1) at <0.27892.1> exit with reason reached_max_restart_intensity in context shutdown
[E]	storage_1@192.168.3.54	2017-05-26 22:51:34.541076 +0300	1495828294	null:null	0	Supervisor leo_redundant_manager_sup had child undefined started with leo_mq_sup:start_link() at <0.220.0> exit with reason shutdown in context child_terminated
[E]	storage_1@192.168.3.54	2017-05-26 22:51:34.976730 +0300	1495828294	null:null	0	Error in process <0.20995.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:51:35.122748 +0300	1495828295	null:null	0	Error in process <0.20996.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:51:35.140676 +0300	1495828295	null:null	0	Error in process <0.20997.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:51:35.211716 +0300	1495828295	null:null	0	Error in process <0.20998.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:51:35.367975 +0300	1495828295	leo_watchdog_sub:handle_info/2	165	{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E]	storage_1@192.168.3.54	2017-05-26 22:51:36.17706 +0300	1495828296	null:null	0	Error in process <0.21002.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:51:36.68751 +0300	1495828296	null:null	0	Error in process <0.21005.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:51:36.273259 +0300	1495828296	null:null	0	Error in process <0.21011.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:51:37.246142 +0300	1495828297	null:null	0	Error in process <0.21018.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:51:37.625651 +0300	1495828297	null:null	0	Error in process <0.21022.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:51:38.192580 +0300	1495828298	null:null	0	Error in process <0.21024.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:51:38.461708 +0300	1495828298	null:null	0	Error in process <0.21025.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:51:38.462431 +0300	1495828298	null:null	0	Error in process <0.21026.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:51:39.324727 +0300	1495828299	null:null	0	Error in process <0.21033.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:51:39.851241 +0300	1495828299	null:null	0	Error in process <0.21043.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:51:40.5627 +0300	1495828300	null:null	0	Error in process <0.21049.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:51:40.369284 +0300	1495828300	leo_watchdog_sub:handle_info/2	165	{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E]	storage_1@192.168.3.54	2017-05-26 22:51:40.523795 +0300	1495828300	null:null	0	Error in process <0.21050.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:51:41.56663 +0300	1495828301	null:null	0	Error in process <0.21052.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:51:41.317741 +0300	1495828301	null:null	0	Error in process <0.21057.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:51:42.785978 +0300	1495828302	null:null	0	Error in process <0.21069.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:51:42.812650 +0300	1495828302	null:null	0	Error in process <0.21070.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:51:42.984686 +0300	1495828302	null:null	0	Error in process <0.21071.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:51:43.815766 +0300	1495828303	null:null	0	Error in process <0.21078.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:51:44.817129 +0300	1495828304	null:null	0	Error in process <0.21085.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:51:45.370117 +0300	1495828305	leo_watchdog_sub:handle_info/2	165	{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E]	storage_1@192.168.3.54	2017-05-26 22:51:46.199487 +0300	1495828306	null:null	0	Error in process <0.21097.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:51:46.502452 +0300	1495828306	null:null	0	Error in process <0.21099.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:51:47.770769 +0300	1495828307	null:null	0	Error in process <0.21103.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:51:47.987768 +0300	1495828307	null:null	0	Error in process <0.21108.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:51:48.516769 +0300	1495828308	null:null	0	Error in process <0.21112.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:51:48.524799 +0300	1495828308	null:null	0	Error in process <0.21113.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:51:48.813618 +0300	1495828308	null:null	0	Error in process <0.21114.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:51:50.370898 +0300	1495828310	leo_watchdog_sub:handle_info/2	165	{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E]	storage_1@192.168.3.54	2017-05-26 22:51:50.872671 +0300	1495828310	null:null	0	Error in process <0.21136.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:51:55.372095 +0300	1495828315	leo_watchdog_sub:handle_info/2	165	{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E]	storage_1@192.168.3.54	2017-05-26 22:52:00.373178 +0300	1495828320	leo_watchdog_sub:handle_info/2	165	{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E]	storage_1@192.168.3.54	2017-05-26 22:52:05.373913 +0300	1495828325	leo_watchdog_sub:handle_info/2	165	{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E]	storage_1@192.168.3.54	2017-05-26 22:52:10.375174 +0300	1495828330	leo_watchdog_sub:handle_info/2	165	{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E]	storage_1@192.168.3.54	2017-05-26 22:52:15.375872 +0300	1495828335	leo_watchdog_sub:handle_info/2	165	{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E]	storage_1@192.168.3.54	2017-05-26 22:52:20.376915 +0300	1495828340	leo_watchdog_sub:handle_info/2	165	{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E]	storage_1@192.168.3.54	2017-05-26 22:52:25.377929 +0300	1495828345	leo_watchdog_sub:handle_info/2	165	{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E]	storage_1@192.168.3.54	2017-05-26 22:52:30.378945 +0300	1495828350	leo_watchdog_sub:handle_info/2	165	{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E]	storage_1@192.168.3.54	2017-05-26 22:52:35.379846 +0300	1495828355	leo_watchdog_sub:handle_info/2	165	{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E]	storage_1@192.168.3.54	2017-05-26 22:52:40.381247 +0300	1495828360	leo_watchdog_sub:handle_info/2	165	{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E]	storage_1@192.168.3.54	2017-05-26 22:52:45.381901 +0300	1495828365	leo_watchdog_sub:handle_info/2	165	{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E]	storage_1@192.168.3.54	2017-05-26 22:52:50.383154 +0300	1495828370	leo_watchdog_sub:handle_info/2	165	{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}

The node isn't working at this point. I restart it, and the queue starts processing:

 leo_async_deletion_queue       |   running   | 122351         | 160            | 950            | async deletion of objs                      

The error log looks typical at first:

[W]	storage_1@192.168.3.54	2017-05-26 22:54:44.83565 +0300	1495828484	leo_storage_replicator:loop/6	216	[{method,delete},{key,<<"bodytest/d3/71/30/d37130689a5bb04e1270e85a0442d9944112eb84949360e0732c7313b91eaaf1ccbce0e74a0b9f88917377fe9d08127c38935f0000000000.xz">>},{cause,timeout}]
[W]	storage_1@192.168.3.54	2017-05-26 22:54:44.690582 +0300	1495828484	leo_storage_replicator:loop/6	216	[{method,delete},{key,<<"bodytest/dc/ad/01/dcad01a27ba985514931ae379940fcd8021ecaab7d47e948eb41b3bdeac6808c305a2fd15fbe015dc4c2a542be000846107f830000000000.xz">>},{cause,timeout}]
[W]	storage_1@192.168.3.54	2017-05-26 22:54:45.79657 +0300	1495828485	leo_storage_replicator:replicate/5	123	[{method,delete},{key,<<"bodytest/d3/71/30/d37130689a5bb04e1270e85a0442d9944112eb84949360e0732c7313b91eaaf1ccbce0e74a0b9f88917377fe9d08127c38935f0000000000.xz">>},{cause,timeout}]
[W]	storage_1@192.168.3.54	2017-05-26 22:54:45.689791 +0300	1495828485	leo_storage_replicator:replicate/5	123	[{method,delete},{key,<<"bodytest/dc/ad/01/dcad01a27ba985514931ae379940fcd8021ecaab7d47e948eb41b3bdeac6808c305a2fd15fbe015dc4c2a542be000846107f830000000000.xz">>},{cause,timeout}]

but then the mq-stats command starts to freeze, and I get this:

[E]	storage_1@192.168.3.54	2017-05-26 22:55:35.421877 +0300	1495828535	null:null	0	gen_fsm leo_async_deletion_queue_consumer_4_1 in state running terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E]	storage_1@192.168.3.54	2017-05-26 22:55:35.670420 +0300	1495828535	leo_mq_server:handle_call/3	287	{timeout,{gen_server,call,[leo_async_deletion_queue_message_0,{first_n,0},30000]}}
[E]	storage_1@192.168.3.54	2017-05-26 22:55:35.851418 +0300	1495828535	null:null	0	gen_fsm leo_async_deletion_queue_consumer_3_1 in state idling terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E]	storage_1@192.168.3.54	2017-05-26 22:55:35.964534 +0300	1495828535	null:null	0	["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_3_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E]	storage_1@192.168.3.54	2017-05-26 22:55:35.966858 +0300	1495828535	null:null	0	["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_4_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E]	storage_1@192.168.3.54	2017-05-26 22:55:35.967659 +0300	1495828535	null:null	0	Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_3_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 3) at <0.331.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E]	storage_1@192.168.3.54	2017-05-26 22:55:35.968591 +0300	1495828535	null:null	0	Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_4_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 4) at <0.329.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E]	storage_1@192.168.3.54	2017-05-26 22:55:40.252705 +0300	1495828540	null:null	0	gen_fsm leo_async_deletion_queue_consumer_2_1 in state idling terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E]	storage_1@192.168.3.54	2017-05-26 22:55:40.273471 +0300	1495828540	null:null	0	["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_2_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E]	storage_1@192.168.3.54	2017-05-26 22:55:40.274015 +0300	1495828540	null:null	0	Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_2_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 2) at <0.333.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E]	storage_1@192.168.3.54	2017-05-26 22:55:45.382167 +0300	1495828545	null:null	0	gen_fsm leo_async_deletion_queue_consumer_1_1 in state running terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E]	storage_1@192.168.3.54	2017-05-26 22:55:45.383698 +0300	1495828545	null:null	0	["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_1_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E]	storage_1@192.168.3.54	2017-05-26 22:55:45.384491 +0300	1495828545	null:null	0	Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_1_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 1) at <0.335.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E]	storage_1@192.168.3.54	2017-05-26 22:56:06.248006 +0300	1495828566	null:null	0	gen_fsm leo_async_deletion_queue_consumer_4_1 in state idling terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E]	storage_1@192.168.3.54	2017-05-26 22:56:06.248610 +0300	1495828566	null:null	0	["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_4_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E]	storage_1@192.168.3.54	2017-05-26 22:56:06.249397 +0300	1495828566	null:null	0	Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_4_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 4) at <0.14436.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E]	storage_1@192.168.3.54	2017-05-26 22:56:06.618153 +0300	1495828566	null:null	0	gen_fsm leo_async_deletion_queue_consumer_3_1 in state idling terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E]	storage_1@192.168.3.54	2017-05-26 22:56:06.618743 +0300	1495828566	null:null	0	["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_3_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E]	storage_1@192.168.3.54	2017-05-26 22:56:06.619501 +0300	1495828566	null:null	0	Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_3_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 3) at <0.14435.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E]	storage_1@192.168.3.54	2017-05-26 22:56:06.619996 +0300	1495828566	null:null	0	Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_3_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 3) at <0.14435.0> exit with reason reached_max_restart_intensity in context shutdown
[E]	storage_1@192.168.3.54	2017-05-26 22:56:06.620377 +0300	1495828566	null:null	0	Supervisor leo_redundant_manager_sup had child undefined started with leo_mq_sup:start_link() at <0.222.0> exit with reason shutdown in context child_terminated
[E]	storage_1@192.168.3.54	2017-05-26 22:56:07.236507 +0300	1495828567	null:null	0	Error in process <0.14718.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:56:07.395666 +0300	1495828567	null:null	0	Error in process <0.14719.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:56:07.589406 +0300	1495828567	null:null	0	Error in process <0.14721.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:56:08.34491 +0300	1495828568	null:null	0	Error in process <0.14722.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:56:08.553459 +0300	1495828568	null:null	0	Error in process <0.14724.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:56:08.699552 +0300	1495828568	null:null	0	Error in process <0.14726.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:56:08.750870 +0300	1495828568	null:null	0	Error in process <0.14727.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:56:09.395709 +0300	1495828569	null:null	0	Error in process <0.14741.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:56:09.429783 +0300	1495828569	null:null	0	Error in process <0.14742.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:56:09.536674 +0300	1495828569	null:null	0	Error in process <0.14743.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:56:09.670552 +0300	1495828569	null:null	0	Error in process <0.14748.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:56:10.239008 +0300	1495828570	null:null	0	Error in process <0.14754.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:56:10.395451 +0300	1495828570	null:null	0	Error in process <0.14755.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:56:10.872669 +0300	1495828570	leo_watchdog_sub:handle_info/2	165	{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E]	storage_1@192.168.3.54	2017-05-26 22:56:11.79527 +0300	1495828571	null:null	0	Error in process <0.14758.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:56:11.89153 +0300	1495828571	null:null	0	Error in process <0.14760.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:56:11.93206 +0300	1495828571	null:null	0	Error in process <0.14761.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:56:11.291948 +0300	1495828571	null:null	0	Error in process <0.14762.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:56:11.336069 +0300	1495828571	null:null	0	Error in process <0.14763.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:56:11.608531 +0300	1495828571	null:null	0	Error in process <0.14769.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:56:12.78531 +0300	1495828572	null:null	0	Error in process <0.14770.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:56:12.461563 +0300	1495828572	null:null	0	Error in process <0.14772.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:56:12.689473 +0300	1495828572	null:null	0	Error in process <0.14773.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:56:12.812491 +0300	1495828572	null:null	0	Error in process <0.14774.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:56:14.902513 +0300	1495828574	null:null	0	Error in process <0.14793.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:56:15.250434 +0300	1495828575	null:null	0	Error in process <0.14800.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:56:15.266418 +0300	1495828575	null:null	0	Error in process <0.14801.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E]	storage_1@192.168.3.54	2017-05-26 22:56:15.873589 +0300	1495828575	leo_watchdog_sub:handle_info/2	165	{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}

(the last line starts to repeat at this point)

Note that the empty lines in the log file are really there.
In other words, the current problems are:

  1. Queue processing still freezes without any directly related errors (as shown by storage_2).
  2. Something scary is going on with storage_1 (EDIT: fixed names of nodes). Judging from the logs, it looks like the queue consumers hit reached_max_restart_intensity and were shut down, after which the watchdog's leo_mq_api:decrease calls keep failing with badarg because the consumer processes no longer exist.
  3. The delete process isn't finished; there are still objects in the bucket (however, I assume this is expected, given the badargs in eleveldb:async_iterator early on).

Problem 3) is probably fine for now given that #725 (comment) isn't implemented yet, I suppose, but 1) and 2) worry me as I don't see any currently open issues related to these problems...
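
(A small tooling aside: since mq-stats simply hangs when a node gets into this state, wrapping the call in coreutils timeout, assuming it is available, makes polling a bit less painful, e.g.:

timeout 15 /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54 || echo "mq-stats did not answer within 15 seconds"

This is just a convenience wrapper; it doesn't change anything on the node.)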

@yosukehara

yosukehara commented May 29, 2017

I've updated the diagram of the deletion bucket processing, which covers #725 (comment)

[diagram: leofs-deletion-bucket-proc]

@mocchira

@yosukehara Thanks for updating and taking my comments into account.

Some comments.

  • Check the state of the deletion bucket

    • "Notify its message to manager(s)" can fail, and leo_storage can also go down for some reason before notifying the manager(s). So removing a delete-bucket message from Q2 should be done only after the notification has succeeded.
  • Let me confirm "How to handle multiple delete-bucket requests in parallel" mentioned in the above comment.

    • "Run at once" or "Allow multi"? If the diagram means "Run at once" through Q2, then it makes sense to me.
  • Let me confirm whether Q3 (async_deletion) is the queue that already exists.

    • If so: since the async_deletion queue is used not only for delete-bucket requests but also for retrying delete-object requests through leo_storage_replicator, we can't tell whether a delete-bucket request has really completed just by checking the number of messages in that queue (please correct me if I'm missing something).
    • If you mean a new queue will be created, then it makes sense to me.
  • How about "Priority on background jobs (BJ)" mentioned in the above comment?

    • IMHO this should be filed as a separate issue, so I'd like to file it if that's fine with you.

It seems the other concerns have been covered by the above diagram.
Thanks for your hard work.

@mocchira

@vstax thanks for testing.

As you suspected, problems 1 and 2 give me the impression that there is something we have not covered yet.
I will dig into this further later (I now have one hypothesis that could explain both problems 1 and 2).

Note:
If you find reached_max_restart_intensity in an error.log, that means something has gone pretty wrong (some Erlang processes that are supposed to exist have gone down permanently because the number of restarts exceeded a certain threshold within a specific time window). Please restart the server if you face such a case in production. We'd like to tackle this problem (e.g. restarting automatically without human intervention) as a separate issue.
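
Until then, a crude stopgap is to watch the error log for that marker and restart the node when it shows up. This is only a sketch; the log path and the restart step are placeholders, since both depend on how LeoFS was installed:

ERROR_LOG=/path/to/leo_storage/error.log    # placeholder - adjust to your installation
if grep -q reached_max_restart_intensity "$ERROR_LOG"; then
    echo "queue consumer supervisor gave up on this node - restart leo_storage"
    # restart leo_storage here with whatever service manager your installation uses
fi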

@mocchira

@vstax I guess #744 could be the root cause of problems 1 and 2 here, so please try the same test with the latest leo_mq if you can spare the time.

@mocchira

mocchira commented Aug 10, 2017

@vstax Thanks for sorting out the remaining problems.

A much higher number of messages during deletion of two buckets at once, compared to deleting them one after another. There are details in #725 (comment); the fix from #793 did not help.

Unfortunately we still haven't been able to reproduce this issue. However, it should not be a critical one, so we will file it under the 1.4 milestone.
EDIT: filed as #803.

I think that "enqueuing" temporarily gets kind-of-stuck, especially visible during the same experiment (deleting two similar-sized large buckets at once), for one bucket state is switched to "monitoring" very fast, but for another it gets stuck at "enqueuing" for hour or so, even though the queue growth stops for both buckets at about the same time. Only when both buckets are nearly deleted on storage nodes, "enqueuing" changes to "monitoring" for second bucket as well. Both buckets switch to "finished" state at around the same time, it's only "enqueuing->monitoring" switch which seem to happen much later than you'd think it should. But I cannot say if this really is a bug or just behaves this way for some reason, either way, it doesn't lead to consistency problems.

Same status as above; we will file this issue under the 1.4 milestone.
EDIT: filed as #804.

The state at the manager is stuck at "enqueuing" (need to confirm this one again)

After some discussion with other members, we've decided this is the intended behavior, as described below.

If leo_storage goes down while it's enqueuing, the state of the corresponding leo_storage at the manager remains "enqueuing" (it does not change to "pending", since it will become "enqueuing" again immediately once leo_storage comes back).

Let us know if this behavior troubles you.

Not all objects from the bucket are deleted on the storage node in the end (confirmed)

The two commits below should fix this problem, so please give it a try with 1.3.5 once you're back.

It's possible for queues to get stuck on a storage node after it's started; it consumes some messages, but the numbers aren't at 0 in the end (need to confirm this one again)

The commit below should fix this problem, so please give it a try with 1.3.5 once you're back.

Unfortunately, for RL reasons I won't be able to help with testing anything in the next 2 weeks. I think bucket deletion is currently working (or almost working; the first two bugs are still present, but they don't cause any consistency problems) for the cases when nothing unusual happens to the storage nodes. Maybe leave the problems caused by stopping storage nodes for the next release, if a 2-week delay plus who knows how much more time to fix it all is an unacceptable delay for 1.3.5? I'd really like to help test this functionality so it has no known bugs; we can return to this later.

We will release 1.3.5 today, as the problems caused by stopping storage nodes have now been fixed. Let us know if anything doesn't work for you with 1.3.5, aside from the problems we assigned to the 1.4 milestone.

@vstax

vstax commented Aug 10, 2017

@mocchira Thank you, I'll test these fixes when I'm able to.
Regarding being unable to reproduce the issues when deleting two buckets at once: I just remembered that one of these buckets ("body", the one which also switches from enqueuing to monitoring much later) has all of its objects in the "old metadata" format. I was able to delete it only after #754 was implemented. Could these two issues be related to that?

@mocchira

@vstax

Regarding being unable to reproduce the issues when deleting two buckets at once: I just remembered that one of these buckets ("body", the one which also switches from enqueuing to monitoring much later) has all of its objects in the "old metadata" format. I was able to delete it only after #754 was implemented. Could these two issues be related to that?

Maybe. We will try to reproduce it with one bucket in the "old metadata" format later.
Thanks!

@mocchira

Progress note: we have not succeeded in reproducing the issues yet, even with one bucket in the old metadata format. We will look into other factors that might affect reproducibility.

@vstax

vstax commented Aug 20, 2017

@mocchira

Progress note: we have not succeeded in reproducing the issues yet, even with one bucket in the old metadata format. We will look into other factors that might affect reproducibility.

Thank you. There are lots of timeout errors in my log files (because there are lots of objects to delete, I suppose); I wonder if you're getting similar amounts in your tests, because they might be a factor here.

Either way, I'll be able to return to this around the end of the week. I was thinking of doing 3 experiments on the latest version (delete bucket "body", then "bodytest", one by one; then the same in reverse order; then both at once) while gathering the output of delete-bucket-stats and mq-stats for all nodes every minute or 30 seconds, to closely monitor and time the moments of state switching and the numbers in the queues, to see if there really is anything abnormal. I could add some other stats gathering here, but I'm not sure what else would help (the debug logs don't output anything relevant, at least).
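
Roughly something like this for the gathering part (just a sketch: the node names are the ones from this cluster, the 30-second interval is arbitrary, and I'm assuming delete-bucket-stats can be called here the same way as mq-stats):

while true; do
    date
    /usr/local/bin/leofs-adm delete-bucket-stats
    for n in storage_0@192.168.3.53 storage_1@192.168.3.54 storage_2@192.168.3.55; do
        echo "== $n =="
        /usr/local/bin/leofs-adm mq-stats "$n"
    done
    sleep 30
done >> delete-bucket-monitoring.log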

@mocchira

@vstax Thanks for replying.

Thank you. There are lots of timeout errors in my log files (because there are lots of objects to delete, I suppose); I wonder if you're getting similar amounts in your tests, because they might be a factor here.

Yes, we also see a non-negligible number of timeouts due to #764; however, the actual number might be (very) different from yours. I will look into the problem while considering how a large number of timeouts affects the behavior of multiple delete-bucket operations at once.

Either way, I'll be able to return to this around the end of the week. I was thinking of doing 3 experiments on the latest version (delete bucket "body", then "bodytest", one by one; then the same in reverse order; then both at once) while gathering the output of delete-bucket-stats and mq-stats for all nodes every minute or 30 seconds, to closely monitor and time the moments of state switching and the numbers in the queues, to see if there really is anything abnormal. I could add some other stats gathering here, but I'm not sure what else would help (the debug logs don't output anything relevant, at least).

Thanks! Looking forward to seeing your results.

@vstax

vstax commented Aug 24, 2017

@mocchira WIP; however, I can now see that I get too many messages when deleting a single bucket as well. It doesn't seem to be caused by deleting two buckets at once, but by something else. A long time ago I never got such numbers when deleting a single bucket, so it must be either some change between the original delete-bucket implementation and the latest changes, or some third factor that didn't exist back then but plays a role now.

[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_0@192.168.3.53
              id                |    state    | number of msgs | batch of msgs  |    interval    |                 description                 
--------------------------------+-------------+----------------|----------------|----------------|---------------------------------------------
 leo_async_deletion_queue       |   running   | 8              | 1600           | 500            | async deletion of objs                      
 leo_comp_meta_with_dc_queue    |   idling    | 0              | 1600           | 500            | compare metadata w/remote-node              
 leo_delete_dir_queue_1         |   running   | 1276860        | 1600           | 500            | deletion bucket #1                          
 leo_delete_dir_queue_2         |   idling    | 0              | 1600           | 500            | deletion bucket #2                          
 leo_delete_dir_queue_3         |   idling    | 0              | 1600           | 500            | deletion bucket #3                          
 leo_delete_dir_queue_4         |   idling    | 0              | 1600           | 500            | deletion bucket #4                          
 leo_delete_dir_queue_5         |   idling    | 0              | 1600           | 500            | deletion bucket #5                          
 leo_delete_dir_queue_6         |   idling    | 0              | 1600           | 500            | deletion bucket #6                          
 leo_delete_dir_queue_7         |   idling    | 0              | 1600           | 500            | deletion bucket #7                          
 leo_delete_dir_queue_8         |   idling    | 0              | 1600           | 500            | deletion bucket #8                          
 leo_per_object_queue           |   idling    | 0              | 1600           | 500            | recover inconsistent objs                   
 leo_rebalance_queue            |   idling    | 0              | 1600           | 500            | rebalance objs                              
 leo_recovery_node_queue        |   idling    | 0              | 1600           | 500            | recovery objs of node                       
 leo_req_delete_dir_queue       |   idling    | 0              | 1600           | 500            | request removing directories                
 leo_sync_by_vnode_id_queue     |   idling    | 0              | 1600           | 500            | sync objs by vnode-id                       
 leo_sync_obj_with_dc_queue     |   idling    | 0              | 1600           | 500            | sync objs w/remote-node                     

[root@leo-m0 ~]# /usr/local/bin/leofs-adm du storage_0@192.168.3.53
 active number of objects: 1249230
  total number of objects: 1354898
   active size of objects: 174361023360
    total size of objects: 191667623380
     ratio of active size: 90.97%
    last compaction start: ____-__-__ __:__:__
      last compaction end: ____-__-__ __:__:__

Only a single bucket is being deleted here; the two buckets hold roughly equal numbers of objects, so I don't expect the number of messages in the queue to go higher than 1354898 / 2 ~= 677000.

EDIT: OK, this looks totally wrong. Here is the state after deleting the "body" bucket completed ("bodytest" was never touched!):

[root@leo-m0 shm]# /usr/local/bin/leofs-adm du storage_0@192.168.3.53
 active number of objects: 0
  total number of objects: 1354898
   active size of objects: 0
    total size of objects: 191667623380
     ratio of active size: 0.0%
    last compaction start: ____-__-__ __:__:__
      last compaction end: ____-__-__ __:__:__

Deleting the bucket "body" completely removed all objects from both "body" and "bodytest". How is that possible? If that really is the case, it would explain both problems (removing "body" taking much more time than "bodytest", and too many messages in the queues when deleting both "body" and "bodytest" at once). Just in case: both buckets are owned by the same user, "body". Right now the bucket "bodytest" still exists, but all of its objects have been removed from the nodes.
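
One way this could happen (purely a guess on my side): if the delete-bucket scan selects keys by the bare bucket name as a prefix rather than by "<bucket>/", then every "bodytest/..." key would also match "body". A quick shell illustration of that assumption (hypothetical key layout, not LeoFS internals):

# Hypothetical key layout: objects stored under "<bucket>/<path>".
printf '%s\n' 'body/img/1.jpg' 'bodytest/img/1.jpg' > keys.txt
grep -c '^body' keys.txt    # prints 2 - a bare-prefix match hits both buckets
grep -c '^body/' keys.txt   # prints 1 - requiring the "/" delimiter hits only "body"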

@mocchira
Member

Deleting the bucket "body" completely removed all objects from both "body" and "bodytest". How is that possible? If that really is the case, it would explain both problems (removing "body" taking much more time than "bodytest", and too many messages in the queues when deleting both "body" and "bodytest" at once). Just in case: both buckets are owned by the same user, "body". Right now the bucket "bodytest" still exists, but all of its objects have been removed from the nodes.

@vstax Oops. Thanks, that totally makes sense! We will fix it ASAP.

@yosukehara
Member

@mocchira We need to fix this issue quickly, then release v1.3.6, which will contain the fix.

@mocchira
Member

@vstax #808 should work for you so please give it a try.

@vstax
Contributor Author

vstax commented Aug 25, 2017

@mocchira Thank you, this fix works.

I did tests with deleting buckets one by one and both at once and can confirm that the issues #803 and #804 are now gone and can be closed.

I can still see two minor problems. Even during deletion of a single bucket, the number of messages in the queue at its peak still gets somewhat higher than the number of remaining objects. E.g.

[root@leo-m0 ~]# /usr/local/bin/leofs-adm du storage_0@192.168.3.53
 active number of objects: 511330
  total number of objects: 1354898
   active size of objects: 67839608148
    total size of objects: 191667623380
     ratio of active size: 35.39%
    last compaction start: ____-__-__ __:__:__
      last compaction end: ____-__-__ __:__:__

[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_0@192.168.3.53
              id                |    state    | number of msgs | batch of msgs  |    interval    |                 description
--------------------------------+-------------+----------------|----------------|----------------|---------------------------------------------
 leo_async_deletion_queue       |   idling    | 1              | 1600           | 500            | async deletion of objs
 leo_comp_meta_with_dc_queue    |   idling    | 0              | 1600           | 500            | compare metadata w/remote-node
 leo_delete_dir_queue_1         |   running   | 549813         | 1600           | 500            | deletion bucket #1

The numbers eventually get closer and closer (and both finish at 0, of course):

[root@leo-m0 ~]# /usr/local/bin/leofs-adm du storage_0@192.168.3.53
 active number of objects: 156991
  total number of objects: 1354898
   active size of objects: 16764286563
    total size of objects: 191667623380
     ratio of active size: 8.75%
    last compaction start: ____-__-__ __:__:__
      last compaction end: ____-__-__ __:__:__

[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_0@192.168.3.53
              id                |    state    | number of msgs | batch of msgs  |    interval    |                 description
--------------------------------+-------------+----------------|----------------|----------------|---------------------------------------------
 leo_async_deletion_queue       |   idling    | 0              | 1600           | 500            | async deletion of objs
 leo_comp_meta_with_dc_queue    |   idling    | 0              | 1600           | 500            | compare metadata w/remote-node
 leo_delete_dir_queue_1         |   running   | 157986         | 1600           | 500            | deletion bucket #1

The second problem (even more minor) is premature clearing of the delete-bucket state. It definitely isn't available for "minutes" after bucket deletion is done. Here are a few examples; I was executing delete-bucket-stats every 30 seconds, and catching it in the "finished" state is nearly impossible - the state disappears almost right away as soon as all queues get to 0:

16:40:45	{"de_bucket_state":[{"node":"storage_0@192.168.3.53","state":"3","state_str":"monitoring","timestamp":"2017-08-25 16:20:10 +0300"},{"node":"storage_1@192.168.3.54","state":"3","state_str":"monitoring","timestamp":"2017-08-25 16:23:27 +0300"},{"node":"storage_2@192.168.3.55","state":"3","state_str":"monitoring","timestamp":"2017-08-25 16:21:23 +0300"}]}
16:41:15	{"de_bucket_state":[{"node":"storage_0@192.168.3.53","state":"3","state_str":"monitoring","timestamp":"2017-08-25 16:20:10 +0300"},{"node":"storage_1@192.168.3.54","state":"3","state_str":"monitoring","timestamp":"2017-08-25 16:23:27 +0300"},{"node":"storage_2@192.168.3.55","state":"3","state_str":"monitoring","timestamp":"2017-08-25 16:21:23 +0300"}]}
16:41:45	{"error":"Delete-bucket's stats not found"}
17:17:45	{"de_bucket_state":[{"node":"storage_0@192.168.3.53","state":"3","state_str":"monitoring","timestamp":"2017-08-25 16:52:56 +0300"},{"node":"storage_1@192.168.3.54","state":"3","state_str":"monitoring","timestamp":"2017-08-25 17:03:54 +0300"},{"node":"storage_2@192.168.3.55","state":"3","state_str":"monitoring","timestamp":"2017-08-25 16:59:44 +0300"}]}
17:18:15	{"de_bucket_state":[{"node":"storage_0@192.168.3.53","state":"3","state_str":"monitoring","timestamp":"2017-08-25 16:52:56 +0300"},{"node":"storage_1@192.168.3.54","state":"3","state_str":"monitoring","timestamp":"2017-08-25 17:03:54 +0300"},{"node":"storage_2@192.168.3.55","state":"9","state_str":"finished","timestamp":"2017-08-25 17:18:15 +0300"}]}
17:18:45	{"error":"Delete-bucket's stats not found"}

The next spreadsheets show a combination of mq-stats and delete-bucket-stats. "none" in the bucket state column means that the delete stats were "not found".

Experiment with deleting body bucket first, then bodytest
Experiment with deleting body and bodytest at once

As for the third experiment ("bodytest", then "body") - I didn't finish it; as far as I checked, it went exactly like the "body, then bodytest" one.

Initial state of nodes, and the state after "body" bucket was deleted in the first experiment, storage_0:

 active number of objects: 1348208
  total number of objects: 1354898
   active size of objects: 191655130794
    total size of objects: 191667623380
     ratio of active size: 99.99%
    last compaction start: ____-__-__ __:__:__
      last compaction end: ____-__-__ __:__:__

 active number of objects: 674344
  total number of objects: 1354898
   active size of objects: 95969028848
    total size of objects: 191667623380
     ratio of active size: 50.07%
    last compaction start: ____-__-__ __:__:__
      last compaction end: ____-__-__ __:__:__

storage_1:

 active number of objects: 1502466
  total number of objects: 1509955
   active size of objects: 213408666322
    total size of objects: 213421617475
     ratio of active size: 99.99%
    last compaction start: ____-__-__ __:__:__
      last compaction end: ____-__-__ __:__:__

 active number of objects: 750866
  total number of objects: 1509955
   active size of objects: 106709199844
    total size of objects: 213421617475
     ratio of active size: 50.0%
    last compaction start: ____-__-__ __:__:__
      last compaction end: ____-__-__ __:__:__

storage_2:

 active number of objects: 1486212
  total number of objects: 1493689
   active size of objects: 210039336410
    total size of objects: 210052283999
     ratio of active size: 99.99%
    last compaction start: ____-__-__ __:__:__
      last compaction end: ____-__-__ __:__:__

 active number of objects: 743230
  total number of objects: 1493689
   active size of objects: 104871681352
    total size of objects: 210052283999
     ratio of active size: 49.93%
    last compaction start: ____-__-__ __:__:__
      last compaction end: ____-__-__ __:__:__

At the end, all objects were removed from all nodes. The buckets "body" and "bodytest" are almost the same ("bodytest" might contain slightly more objects than "body", but the difference should be less than 1%).

To summarize the experiment results: deleting the first bucket took 26 minutes, the second (after a pause of a few minutes) 34 minutes. The queues on storage_1 and storage_2 filled ("enqueuing") significantly more slowly for the second bucket, but processing them ("monitoring") went at about the same speed. Deleting both buckets at once took 39 minutes and the performance was identical for both buckets (however, queue filling for the second bucket lagged by about a minute, despite both deletes being executed at about the same time). I don't think there are any issues here.

@vstax
Contributor Author

vstax commented Aug 25, 2017

I've checked what happens if a node is stopped during bucket deletion, and the problem with remaining objects is still present.

First, I executed "delete-bucket body" and "delete-bucket bodytest".
A minute or so later, when the queues started to fill, I stopped storage_0:

[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_0@192.168.3.53
              id                |    state    | number of msgs | batch of msgs  |    interval    |                 description
--------------------------------+-------------+----------------|----------------|----------------|---------------------------------------------
 leo_async_deletion_queue       |   idling    | 0              | 1600           | 500            | async deletion of objs
 leo_comp_meta_with_dc_queue    |   idling    | 0              | 1600           | 500            | compare metadata w/remote-node
 leo_delete_dir_queue_1         |   idling    | 52380          | 1600           | 500            | deletion bucket #1

Some time later, when all other nodes switched state to "monitoring":

[root@leo-m0 ~]# /usr/local/bin/leofs-adm delete-bucket-stats
- Bucket: body
node                         | state            | timestamp
-----------------------------+------------------+-----------------------------
storage_0@192.168.3.53       | enqueuing        | 2017-08-25 19:44:26 +0300
storage_1@192.168.3.54       | monitoring       | 2017-08-25 19:55:26 +0300
storage_2@192.168.3.55       | monitoring       | 2017-08-25 19:54:18 +0300


- Bucket: bodytest
node                         | state            | timestamp
-----------------------------+------------------+-----------------------------
storage_0@192.168.3.53       | enqueuing        | 2017-08-25 19:43:26 +0300
storage_1@192.168.3.54       | monitoring       | 2017-08-25 19:53:21 +0300
storage_2@192.168.3.55       | monitoring       | 2017-08-25 19:52:38 +0300


[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_2@192.168.3.55
              id                |    state    | number of msgs | batch of msgs  |    interval    |                 description
--------------------------------+-------------+----------------|----------------|----------------|---------------------------------------------
 leo_async_deletion_queue       |   running   | 216616         | 1600           | 500            | async deletion of objs
 leo_comp_meta_with_dc_queue    |   idling    | 0              | 1600           | 500            | compare metadata w/remote-node
 leo_delete_dir_queue_1         |   running   | 512792         | 1600           | 500            | deletion bucket #1
 leo_delete_dir_queue_2         |   running   | 501945         | 1600           | 500            | deletion bucket #2

I started storage_0.

Eventually all queues on all nodes dropped to 0 and bucket deletion finished:

[root@leo-m0 ~]# /usr/local/bin/leofs-adm delete-bucket-stats
[ERROR] Delete-bucket's stats not found

But objects remain on storage_0:

[root@leo-m0 ~]# /usr/local/bin/leofs-adm du storage_0@192.168.3.53
 active number of objects: 2574
  total number of objects: 1354898
   active size of objects: 9810347778
    total size of objects: 191667623380
     ratio of active size: 5.12%
    last compaction start: ____-__-__ __:__:__
      last compaction end: ____-__-__ __:__:__

[root@leo-m0 ~]# /usr/local/bin/leofs-adm du storage_1@192.168.3.54
 active number of objects: 0
  total number of objects: 1509955
   active size of objects: 0
    total size of objects: 213421617475
     ratio of active size: 0.0%
    last compaction start: ____-__-__ __:__:__
      last compaction end: ____-__-__ __:__:__

[root@leo-m0 ~]# /usr/local/bin/leofs-adm du storage_2@192.168.3.55
 active number of objects: 0
  total number of objects: 1493689
   active size of objects: 0
    total size of objects: 210052283999
     ratio of active size: 0.0%
    last compaction start: ____-__-__ __:__:__
      last compaction end: ____-__-__ __:__:__

Also, the info log on manager_0 is missing the "dequeued" entry for the "body" bucket; there is one only for "bodytest":

[I]	manager_0@192.168.3.50	2017-08-25 19:43:14.883159 +0300	1503679394	leo_manager_del_bucket_handler:handle_call/3 - enqueue	134	[{"bucket_name",<<"body">>},{"node",'storage_0@192.168.3.53'}]
[I]	manager_0@192.168.3.50	2017-08-25 19:43:14.883463 +0300	1503679394	leo_manager_del_bucket_handler:handle_call/3 - enqueue	134	[{"bucket_name",<<"body">>},{"node",'storage_1@192.168.3.54'}]
[I]	manager_0@192.168.3.50	2017-08-25 19:43:14.883648 +0300	1503679394	leo_manager_del_bucket_handler:handle_call/3 - enqueue	134	[{"bucket_name",<<"body">>},{"node",'storage_2@192.168.3.55'}]
[I]	manager_0@192.168.3.50	2017-08-25 19:43:17.745669 +0300	1503679397	leo_manager_del_bucket_handler:handle_call/3 - enqueue	134	[{"bucket_name",<<"bodytest">>},{"node",'storage_0@192.168.3.53'}]
[I]	manager_0@192.168.3.50	2017-08-25 19:43:17.746028 +0300	1503679397	leo_manager_del_bucket_handler:handle_call/3 - enqueue	134	[{"bucket_name",<<"bodytest">>},{"node",'storage_1@192.168.3.54'}]
[I]	manager_0@192.168.3.50	2017-08-25 19:43:17.746231 +0300	1503679397	leo_manager_del_bucket_handler:handle_call/3 - enqueue	134	[{"bucket_name",<<"bodytest">>},{"node",'storage_2@192.168.3.55'}]
[I]	manager_0@192.168.3.50	2017-08-25 19:43:24.855532 +0300	1503679404	leo_manager_del_bucket_handler:notify_fun/3	280	[{"node",'storage_0@192.168.3.53'},{"bucket_name",<<"body">>}]
[I]	manager_0@192.168.3.50	2017-08-25 19:43:24.858732 +0300	1503679404	leo_manager_del_bucket_handler:notify_fun/3	280	[{"node",'storage_1@192.168.3.54'},{"bucket_name",<<"body">>}]
[I]	manager_0@192.168.3.50	2017-08-25 19:43:24.865141 +0300	1503679404	leo_manager_del_bucket_handler:notify_fun/3	280	[{"node",'storage_2@192.168.3.55'},{"bucket_name",<<"body">>}]
[I]	manager_0@192.168.3.50	2017-08-25 19:43:24.868522 +0300	1503679404	leo_manager_del_bucket_handler:notify_fun/3	280	[{"node",'storage_0@192.168.3.53'},{"bucket_name",<<"bodytest">>}]
[I]	manager_0@192.168.3.50	2017-08-25 19:43:24.871346 +0300	1503679404	leo_manager_del_bucket_handler:notify_fun/3	280	[{"node",'storage_1@192.168.3.54'},{"bucket_name",<<"bodytest">>}]
[I]	manager_0@192.168.3.50	2017-08-25 19:43:24.874775 +0300	1503679404	leo_manager_del_bucket_handler:notify_fun/3	280	[{"node",'storage_2@192.168.3.55'},{"bucket_name",<<"bodytest">>}]
[I]	manager_0@192.168.3.50	2017-08-25 20:15:12.982604 +0300	1503681312	leo_manager_api:brutal_synchronize_ring_1/2	1944	node:'storage_0@192.168.3.53'
[I]	manager_0@192.168.3.50	2017-08-25 20:49:30.316532 +0300	1503683370	leo_manager_del_bucket_handler:after_completion/1	308	[{"msg: dequeued and removed",<<"bodytest">>}]

This is despite all three storage nodes having "dequeued" entries in their logs for both buckets:

[I]	storage_0@192.168.3.53	2017-08-25 20:30:32.142338 +0300	1503682232	leo_storage_handler_del_directory:run/5	661	[{"msg: dequeued and removed (bucket)",<<"body">>}]
[I]	storage_0@192.168.3.53	2017-08-25 20:31:29.699986 +0300	1503682289	leo_storage_handler_del_directory:run/5	661	[{"msg: dequeued and removed (bucket)",<<"bodytest">>}]

[I]	storage_1@192.168.3.54	2017-08-25 20:49:11.193615 +0300	1503683351	leo_storage_handler_del_directory:run/5	661	[{"msg: dequeued and removed (bucket)",<<"body">>}]
[I]	storage_1@192.168.3.54	2017-08-25 20:49:20.183256 +0300	1503683360	leo_storage_handler_del_directory:run/5	661	[{"msg: dequeued and removed (bucket)",<<"bodytest">>}]

[I]	storage_2@192.168.3.55	2017-08-25 20:49:10.91084 +0300	1503683350	leo_storage_handler_del_directory:run/5	661	[{"msg: dequeued and removed (bucket)",<<"body">>}]
[I]	storage_2@192.168.3.55	2017-08-25 20:49:19.238982 +0300	1503683359	leo_storage_handler_del_directory:run/5	661	[{"msg: dequeued and removed (bucket)",<<"bodytest">>}]

I executed compaction on storage_0 and lots of objects indeed remain; there are objects from both "bodytest" and "body" present on storage_0 (in roughly equal amounts). I uploaded the logs from all storage nodes to https://www.dropbox.com/s/gisr1ujoipu9tdz/storage-logs-objects-remaining.tar.gz?dl=0

I've also noticed a (maybe interesting) thing when looking at the numbers "active number of objects: 2574, active size of objects: 9810347778" and "total number of objects: 1354898, total size of objects: 191667623380". The average object size for the node is 191667623380 / 1354898 ~= 140 KB. However, for the objects that weren't deleted, the average size is 9810347778 / 2574 ~= 3.8 MB. Isn't that a bit strange? (The "active number / size" numbers can be trusted; they didn't change after compaction either.)
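
For reference, those averages can be reproduced directly from the du numbers above with plain shell arithmetic:

# Average object size over everything ever written vs. over the objects
# that survived the delete (numbers taken from the du output above).
echo $(( 191667623380 / 1354898 ))   # 141462 bytes, i.e. ~140 KB
echo $(( 9810347778 / 2574 ))        # 3811323 bytes, i.e. ~3.8 MB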

@mocchira
Member

@vstax thanks for confirming the fix and reporting in detail as always.

I did tests with deleting buckets one by one and both at once and can confirm that the issues #803 and #804 are now gone and can be closed.

OK. I will close later.

Even during deletion of a single bucket, the number of messages in the queue at its peak still gets somewhat higher than the number of remaining objects.

Reproduced; however, it's still not clear what causes this problem, so we will file it as another issue and assign the 1.4.0 milestone, as it's not a critical one.

The second problem (even more minor) is premature clearing of the delete-bucket state. It definitely isn't available for "minutes" after bucket deletion is done. Here are a few examples; I was executing delete-bucket-stats every 30 seconds, and catching it in the "finished" state is nearly impossible - the state disappears almost right away as soon as all queues get to 0:

Depending on the timing, the finished status can disappear within a dozen seconds; strictly speaking it ranges from 10 to 30 seconds according to https://github.com/leo-project/leofs/blob/develop/apps/leo_manager/include/leo_manager.hrl#L371-L383. Let us know if this behavior bothers you; we'd like to consider increasing the period.
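
In the meantime, polling faster than that retention window should be enough to catch the short-lived "finished" state; a minimal sketch, assuming leofs-adm is available on the manager host:

# Poll delete-bucket-stats every 5 seconds (well inside the 10-30 second
# retention window) and log any line that mentions the "finished" state.
while true; do
    /usr/local/bin/leofs-adm delete-bucket-stats 2>&1 | grep -i finished \
        | while read -r line; do echo "$(date '+%H:%M:%S') $line"; done
    sleep 5
done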

To summarize the experiment results: deleting the first bucket took 26 minutes, the second (after a pause of a few minutes) 34 minutes. The queues on storage_1 and storage_2 filled ("enqueuing") significantly more slowly for the second bucket, but processing them ("monitoring") went at about the same speed. Deleting both buckets at once took 39 minutes and the performance was identical for both buckets (however, queue filling for the second bucket lagged by about a minute, despite both deletes being executed at about the same time). I don't think there are any issues here.

Thanks for summarizing. I'm not confident, but maybe the length of a bucket name can affect the performance of the enqueuing phase (basically, prefix matches happen on leveldb); I will confirm whether my assumption is correct when I can spare the time.
(Or, IIRC the AVS/metadata format of one bucket in your env is the older one; this can affect performance as the additional logic for converting the format takes place. I will also check how this factor affects the throughput of the enqueuing phase.)

I've checked what happens if a node is stopped during bucket deletion, and the problem with remaining objects is still present.

I still couldn't reproduce this one on my dev-box; however, it turned out how this could happen according to the log files you shared in the previous comment. In short, rollback_to_pending (https://github.com/leo-project/leofs/blob/develop/apps/leo_storage/src/leo_storage_handler_del_directory.erl#L375-L379) can fail if leo_backend_db has already finished its termination processing. I will file this one as another issue and get in touch with you once I've fixed it.

@vstax
Contributor Author

vstax commented Aug 29, 2017

@mocchira

Let us know if this behavior bothers you; we'd like to consider increasing the period.

It does not bother me per se; it just doesn't quite match the documentation, which claims that it will be available for minutes :)

AVS/metadata format of one bucket in your env is the older one; this can affect performance

The old format is in the "body" bucket, and no test (after the latest changes) shows that removing it is any slower than removing "bodytest" with the new metadata. So I don't think it plays any role here.

maybe the length of a bucket name can affect the performance of the enqueuing phase (basically, prefix matches happen on leveldb)

Maybe. The difference between s0 and s1/s2 (as you can see from the du output) is that s1 and s2 hold about 10-11% more data than s0, both in object count and in raw data size (each AVS file on s0 is about 10% smaller than each AVS file on s1/s2). That difference can be seen in the "body, then bodytest" graph for the removal of the first ("body") bucket, and in the "body and bodytest" graph. It's just that when removing "bodytest" in the "body, then bodytest" case, the difference between s0 and s1/s2 is much greater than those 10%.
It's hard to say if there is a real problem just from these tests. I'll try to repeat the experiment to see if this is reproducible, however I can't promise it will be anytime soon.

I will file this one as another issue and get in touch with you once I've fixed it.

OK, great. Just in case, here is the listing from diagnose-start of one of the AVS files on storage_0. Most of the remaining objects are first parts of multipart objects, plus lots of big objects in general. This is quite different from the real distribution of objects before deletion, as only a small % of objects were that big.

@mocchira
Member

mocchira commented Aug 30, 2017

@vstax

It does not bother me per se; it just doesn't quite match the documentation, which claims that it will be available for minutes :)

Got it. documentation has been fixed with #811.

The old format is in the "body" bucket, and no test (after the latest changes) shows that removing it is any slower than removing "bodytest" with the new metadata. So I don't think it plays any role here.

OK.

It's hard to say if there is a real problem just from these tests. I'll try to repeat the experiment to see if this is reproducible, however I can't promise it will be anytime soon.

OK. Please file an issue if you find a way to reproduce it (I will also try if I can spare the time).

As you may notice, remaining issues about delete-bucket are filed on

I will ask you to test once those issues get fixed.

@mocchira
Member

@vstax #812 has already been solved; OTOH #813 still remains, however it's not a big deal, so I'd like to ask you to test the delete-bucket related changes when you have time. (I will set the milestone to 1.4.1 for the same reason as #931 (comment).)

@mocchira mocchira modified the milestones: 1.4.0, 1.4.1 Feb 16, 2018
@mocchira mocchira modified the milestones: 1.4.1, 1.4.3 Mar 30, 2018
@mocchira mocchira modified the milestones: 1.4.3, 1.5.0 Sep 20, 2018
@yosukehara yosukehara removed the v1.4 label Feb 25, 2019
@yosukehara yosukehara modified the milestones: 1.5.0, 1.6.0 Feb 27, 2019