Many problems during recovery #425

furkanmustafa · 2018-02-05T12:35:13Z

I am using v1.0.1, compiled from github tag, on ubuntu14.04, with corosync. 7~10 TB total md size.

I had 7 nodes on my experimental setup. One of the nodes suddenly failed, and automatic-recovery (reweight) kicked in, which caused lots of IO on other nodes. And that caused some other nodes to crash.
My nodes are set to restart after crash. But now my cluster is in recovery loop that never succeeds.

Also it continuously increases disk usage until disk is full on some nodes (in .stall folder). If the disk is full, the node cannot start again, and other nodes cannot see the objects those nodes have, and cannot properly recover.

Lot's of points could be suggested to improve at this point, like;

read-only node startup for only providing some objects to other nodes for recovery.
in this recovery-loop-hell dog cluster recovery disable does not work until the cluster is healthy. An option to disable recovery on init would be helpful. Otherwise everytime a node dies/comes-back, the whole operation is restarting, which takes hours and hours to recover each node.
sheep process dies without saying anything when disk is full.
sheep process should be able to start even if disk is full, otherwise recovering other nodes are sometimes impossible.
some more (that I'll list down in this issue when I remember, if that's okay)

I'm currently still trying to recover the cluster (for the last 30~40 hours).

Any suggestions would be really helpful.

The text was updated successfully, but these errors were encountered:

furkanmustafa · 2018-02-05T13:14:49Z

adding or removing (or crashing) gateway only nodes are also resetting recovery process. I believe there is no reason for any gateway node to affect recovery/reweight process.

furkanmustafa · 2018-02-05T13:15:57Z

Switched all cluster from corosync to zookeeper. That definitely made it more robust & responsive.
Yet, problems still occur and I cannot get out of recovery loop.

furkanmustafa · 2018-02-05T13:18:56Z

Just a moment ago gateway node ( sheep -g ) died after a failing dog vdi list command.

[root@mizumushi sheepdog]# dog vdi list
failed to read a response
Failed to read object 800f2da500000000
Failed to read inode header
failed to read a response
Failed to read object 8011310f00000000
Failed to read inode header

and it crashed, /tmp/obj/sheep.log;

19 Segmentation fault      (core dumped)

and systemd-journal logged coredump, rather small;

Feb 05 22:12:33 mizumushi systemd[1]: Started Process Core Dump (PID 22254/UID 0).                                     
-- Subject: Unit systemd-coredump@92-22254-0.service has finished start-up                                             
-- Defined-By: systemd                                     
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel                                               
--                                                         
-- Unit systemd-coredump@92-22254-0.service has finished starting up.                                                  
--                                                         
-- The start-up result is done.                            
Feb 05 22:12:34 mizumushi systemd-coredump[22255]: Process 18420 (sheep) of user 0 dumped core.                        

                                                   Stack trace of thread 1002:                                         
                                                   #0  0x00007f7d6d29d1fb n/a (/lib/x86_64-linux-gnu/libpthread-2.19.so)                                                                                                                       
-- Subject: Process 18420 (sheep) dumped core              
-- Defined-By: systemd                                     
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel                                               
-- Documentation: man:core(5)                              
--                                                         
-- Process 18420 (sheep) crashed and dumped core.          
--                                                         
-- This usually indicates a programming error in the crashing program and                                              
-- should be reported to its vendor as a bug.

furkanmustafa · 2018-02-07T00:37:17Z

Recovery finally succeeded this morning. With no crashes yet.
In the process, because of high IO load, I have lost 2 disks. I have inserted 4 new 1 TB disks to some nodes. (disk full nodes and one new node). Moved existing data to the new disks (except the new node).
This also helped the recovery.

Now the disks are working, but throughput has lowered down drastically. It does not saturate any of CPU/Disk/Network. Sometimes looks like does not even move, when I'm trying to read the vdi. As a result VMs are extremely slow (but working).

I had no performance problems before the outage. Using -n (nosync), but no object cache or journalling.

Is there a specific reason for losing cluster performance? and is there a way to fix it?

vtolstov · 2018-02-07T09:25:27Z

have you tune number of threads and throttling of recovery process?

furkanmustafa · 2018-02-07T09:38:01Z

After realizing that recovery was hard on disks, yes. I have enabled throttling. But I was late for it.

Also another update; cluster performance is back improving now.

I did not touch any thread numbers.

furkanmustafa · 2018-02-10T09:28:06Z

3 days later, sheep processes starting crashing (segfaults) randomly here and there and cluster is down again.

furkanmustafa · 2018-02-10T09:59:00Z

I've seen a new behavior this time, first time since I've started using sheepdog;
I say, any dog command, for example;

dog node list

It just hangs there. (on all nodes of the cluster. None of them answers this question.)

furkanmustafa · 2018-02-10T10:06:37Z

Only one node showing a different, but not less strange behavior.

It says; "There are no active sheep daemons"

root@server:~# ps axf                                         
  PID TTY      STAT   TIME COMMAND                         
 2099 ?        Ss     0:00 bash                            
 2122 ?        R+     0:00  \_ ps axf                      
    1 ?        Ss     0:00 /bin/bash /root/run.sh sheep    
   22 ?        Sl    10:39 /usr/sbin/sheep -c zookeeper:10.0.42.202:2181,10.0.42.200:2181,10.0.42.201:2181 -R max=100 interval=1000 /store/server -p 7000 -P /var/run/sheep.pid -f
   28 ?        Ss     0:00  \_ /usr/sbin/sheep -c zookeeper:10.0.42.202:2181,10.0.42.200:2181,10.0.42.201:2181 -R max=100 interval=1000 /store/server -p 7000 -P /var/run/sheep.pid -f
   25 ?        S      0:00 tail -F /var/log/sheep*.log /var/log/corosync/corosync.log /store/server/sheep.log
 2121 ?        S      0:00 sleep 1                         
root@server:~# dog node list                                  
There are no active sheep daemons                          
root@server:~# tail /store/server/sheep.log               
Feb 10 09:27:16   WARN [main] do_add_vdi_state(379) caution, updating parent VID of VID: f2da5 (old parent VID: f2da4, new parent VID: f2da4)
Feb 10 09:27:16   WARN [main] do_add_vdi_state(379) caution, updating parent VID of VID: da4e6a (old parent VID: 9f82b2, new parent VID: 9f82b2)
Feb 10 09:27:16   WARN [main] do_add_vdi_state(379) caution, updating parent VID of VID: c42b26 (old parent VID: c42b25, new parent VID: c42b25)
Feb 10 09:27:16   WARN [main] do_add_vdi_state(379) caution, updating parent VID of VID: 17875d (old parent VID: bc085e, new parent VID: bc085e)
Feb 10 09:27:16   WARN [main] do_add_vdi_state(379) caution, updating parent VID of VID: bc085f (old parent VID: bc085e, new parent VID: bc085e)
Feb 10 09:27:16   WARN [main] do_add_vdi_state(379) caution, updating parent VID of VID: 9f82b3 (old parent VID: 9f82b2, new parent VID: 9f82b2)
Feb 10 09:27:16   WARN [main] do_add_vdi_state(379) caution, updating parent VID of VID: d4693a (old parent VID: bc085e, new parent VID: bc085e)
Feb 10 09:27:16  ERROR [main] check_host_env(571) failed to set nofile to suggested 6144000, Operation not permitted
Feb 10 09:27:16  ERROR [main] check_host_env(572) please increase nofile via sysctl fs.nr_open
Feb 10 09:27:16   INFO [main] main(1108) sheepdog daemon (version 1.0.1) started
root@server:~#

furkanmustafa · 2018-02-10T14:38:08Z

Ok. This time looks like zookeeper went crazy. nodes were blocked on their init while trying to join the cluster, thus not answering cli calls.

All the troubles in the world concentrated here.

So, I have replaced zookeeper with etcd + zetcd. That (for now) gave me a robust/working/lightweight cluster and my nodes are happy for now. (recovering...)

furkanmustafa · 2018-02-10T16:10:58Z

A question at this point; Is this huge recovery process caused by automatic vnode assignment?
If the vnode-assignment was done manually from the beginning, would this recovery process be only minimal, for only lost objects, etc.?

furkanmustafa · 2018-02-11T01:38:16Z

Status is the morning; I have managed to get dog client lockups again. I am suspecting this is probably due to some cluster locks being stucked in a locked state? And started thinking that was my probably problem with original zookeeper too?

After that, in etcd logs I have spotted time difference between nodes, does this have a big impact?
(I have fixed it now using ntpdate)

furkanmustafa · 2018-02-11T01:42:32Z

Ok. on one of the nodes, I have this =>

Feb 11 01:41:49  ERROR [main] listen_handler(1036) failed to accept a new connection: Too many open files

I'll try to fix that and see what happens.

furkanmustafa · 2018-02-11T01:49:52Z

Yes. Looks like that problem in single node in cluster blocked the whole cluster. When I restarted that node, everything started to move again. Lost object counts increased a bit for some volumes.

(Graph of recovery; lost object count per volume)

furkanmustafa · 2018-02-11T14:35:13Z

Another constant problem I am experiencing is;

about to finish a huge recovery
all nodes are recovered except one
that is also about to finish %95+ somewhere
that last process goes crazy is memory and gets killed.
the WHOLE recovery is trash. starting from again...

after 8~9 hours of promising recovery, everything lost.

furkanmustafa · 2018-02-11T16:18:54Z

After this last restart, recovery is almost not proceeding at all..

getting this error in some of the nodes;

Feb 11 16:15:27  ERROR [main] check_request_epoch(157) old node version 1166, 1164 (READ_PEER)

after checking all, seems like some are lacking behind, like below;

Feb 11 16:17:21  ERROR [main] check_request_epoch(165) new node version 1124, 1164 (READ_PEER)

Restarting the ones lacking behind...

Ok. Question; If some nodes realize that they're old (have old cluster epoch, whatever), why wouldn't they kill themselves, or just upgrade their knowledge about the cluster somehow?
Or why would they get behind in the first place?

vtolstov · 2018-02-11T17:59:42Z

sorry, i don't think that sheepdog stable enough to handle unexpected failures and recovery.
Sometimes ago i'm lost all of my data after power down two nodes with copies=2.
Because recovery delete all of my data moving it into stale and removed after recovery finished.

furkanmustafa · 2018-02-11T20:10:00Z

Because recovery delete all of my data moving it into stale and removed after recovery finished.

Was it because of this? => #371

vtolstov · 2018-02-11T20:28:28Z

I think that this is not only one issue that can damage data.
In my case i can't start sheep because it says that somethink wrong with epoch.

furkanmustafa · 2018-02-12T23:21:22Z

the amount of space that .stale folder eats is ENDLESS. 4TB in one node. 2.6TB in another node. I did not even have that much data in my cluster.

furkanmustafa · 2018-02-12T23:48:16Z

I understand the file names are in .stale folder mostly have .{epoch} at their name. like; 002e36e8000003a3_1.56

For example in this file, epoch number is 56. But my last working, completely recovered cluster epoch is higher than that. Would that be OKAY to delete that file in this case?

furkanmustafa · 2018-02-13T03:08:57Z

A bit more digging into .stale, I have found this;

[root@my-server .stale]# find . -name "00905fdd00000ec1_1*" | tail | xargs -n 1 md5sum
b27345e0fa8ac51fac3032eeddb82e5d  ./00905fdd00000ec1_1.56
b27345e0fa8ac51fac3032eeddb82e5d  ./00905fdd00000ec1_1.180
b27345e0fa8ac51fac3032eeddb82e5d  ./00905fdd00000ec1_1.528
b27345e0fa8ac51fac3032eeddb82e5d  ./00905fdd00000ec1_1.681
b27345e0fa8ac51fac3032eeddb82e5d  ./00905fdd00000ec1_1.790
b27345e0fa8ac51fac3032eeddb82e5d  ./00905fdd00000ec1_1.816
b27345e0fa8ac51fac3032eeddb82e5d  ./00905fdd00000ec1_1.950
b27345e0fa8ac51fac3032eeddb82e5d  ./00905fdd00000ec1_1.1124
b27345e0fa8ac51fac3032eeddb82e5d  ./00905fdd00000ec1_1.1168

The very same object with same contents, but copies for multiple epochs in .stale folder.
Reading the source code to be sure that I can delete older copies. Please confirm if you know.

furkanmustafa · 2018-02-13T03:12:32Z

Oh please

sheepdog/dog/vdi.c

Lines 2082 to 2085 in 929d401

    
           if (memcmp(obj, info->vcw[d + k].buf, len) != 0) { 
        
           	/* TODO repair the inconsistency */ 
        
           	sd_err("object %016"PRIx64" is inconsistent", oid); 
        
           	goto out;

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Many problems during recovery #425

Many problems during recovery #425

furkanmustafa commented Feb 5, 2018

furkanmustafa commented Feb 5, 2018

furkanmustafa commented Feb 5, 2018 •

edited

Loading

furkanmustafa commented Feb 5, 2018

furkanmustafa commented Feb 7, 2018 •

edited

Loading

vtolstov commented Feb 7, 2018

furkanmustafa commented Feb 7, 2018

furkanmustafa commented Feb 10, 2018

furkanmustafa commented Feb 10, 2018 •

edited

Loading

furkanmustafa commented Feb 10, 2018

furkanmustafa commented Feb 10, 2018 •

edited

Loading

furkanmustafa commented Feb 10, 2018

furkanmustafa commented Feb 11, 2018

furkanmustafa commented Feb 11, 2018

furkanmustafa commented Feb 11, 2018

furkanmustafa commented Feb 11, 2018

furkanmustafa commented Feb 11, 2018 •

edited

Loading

vtolstov commented Feb 11, 2018

furkanmustafa commented Feb 11, 2018

vtolstov commented Feb 11, 2018

furkanmustafa commented Feb 12, 2018

furkanmustafa commented Feb 12, 2018

furkanmustafa commented Feb 13, 2018

furkanmustafa commented Feb 13, 2018

Many problems during recovery #425

Many problems during recovery #425

Comments

furkanmustafa commented Feb 5, 2018

furkanmustafa commented Feb 5, 2018

furkanmustafa commented Feb 5, 2018 • edited Loading

furkanmustafa commented Feb 5, 2018

furkanmustafa commented Feb 7, 2018 • edited Loading

vtolstov commented Feb 7, 2018

furkanmustafa commented Feb 7, 2018

furkanmustafa commented Feb 10, 2018

furkanmustafa commented Feb 10, 2018 • edited Loading

furkanmustafa commented Feb 10, 2018

furkanmustafa commented Feb 10, 2018 • edited Loading

furkanmustafa commented Feb 10, 2018

furkanmustafa commented Feb 11, 2018

furkanmustafa commented Feb 11, 2018

furkanmustafa commented Feb 11, 2018

furkanmustafa commented Feb 11, 2018

furkanmustafa commented Feb 11, 2018 • edited Loading

vtolstov commented Feb 11, 2018

furkanmustafa commented Feb 11, 2018

vtolstov commented Feb 11, 2018

furkanmustafa commented Feb 12, 2018

furkanmustafa commented Feb 12, 2018

furkanmustafa commented Feb 13, 2018

furkanmustafa commented Feb 13, 2018

furkanmustafa commented Feb 5, 2018 •

edited

Loading

furkanmustafa commented Feb 7, 2018 •

edited

Loading

furkanmustafa commented Feb 10, 2018 •

edited

Loading

furkanmustafa commented Feb 10, 2018 •

edited

Loading

furkanmustafa commented Feb 11, 2018 •

edited

Loading