[[TOC]]

EES HA Troubleshooting

This article provides an incomplete list of abnormal situations that may occur on an EES cluster and recommendations on how to deal with them. The intended audience is EES support engineers.

An early version of this article was included in the EES Troubleshooting Guide.

pcs status

High availability (HA) of an EES cluster is based on the Pacemaker resource manager.

The pcs status command shows current information about the cluster and its resources. You'll use this command often. Familiarize yourself with its output, and be on the lookout for anomalies.

Sample output of pcs status
$ sudo pcs status

Cluster name: cortx_cluster
Stack: corosync
Current DC: srvnode-2 (version 1.1.20-5.el7_7.2-3c4c782f70) - partition with quorum
Last updated: Thu May 28 16:57:46 2020
Last change: Thu May 28 14:12:55 2020 by hacluster via crmd on srvnode-1

2 nodes configured
65 resources configured

Online: [ srvnode-1 srvnode-2 ]

Full list of resources:

 Clone Set: ClusterIP-clone [ClusterIP] (unique)
     ClusterIP:0	(ocf::heartbeat:IPaddr2):	Started srvnode-1
     ClusterIP:1	(ocf::heartbeat:IPaddr2):	Started srvnode-2
 stonith-c1	(stonith:fence_ipmilan):	Started srvnode-2
 stonith-c2	(stonith:fence_ipmilan):	Started srvnode-1
 Clone Set: lnet-clone [lnet]
     Started: [ srvnode-1 srvnode-2 ]
 Resource Group: c1
     ip-c1	(ocf::heartbeat:IPaddr2):	Started srvnode-1
     consul-c1	(systemd:hare-consul-agent-c1):	Started srvnode-1
     lnet-c1	(ocf::eos:lnet):	Started srvnode-1
     hax-c1	(systemd:hare-hax-c1):	Started srvnode-1
     mero-confd-c1	(systemd:m0d@0x7200000000000001:0x9):	Started srvnode-1
     mero-ios-c1	(systemd:m0d@0x7200000000000001:0xc):	Started srvnode-1
 Resource Group: c2
     ip-c2	(ocf::heartbeat:IPaddr2):	Started srvnode-2
     consul-c2	(systemd:hare-consul-agent-c2):	Started srvnode-2
     lnet-c2	(ocf::eos:lnet):	Started srvnode-2
     hax-c2	(systemd:hare-hax-c2):	Started srvnode-2
     mero-confd-c2	(systemd:m0d@0x7200000000000001:0x52):	Started srvnode-2
     mero-ios-c2	(systemd:m0d@0x7200000000000001:0x55):	Started srvnode-2
 Clone Set: mero-kernel-clone [mero-kernel]
     Started: [ srvnode-1 srvnode-2 ]
 Clone Set: ldap-clone [ldap]
     Started: [ srvnode-1 srvnode-2 ]
 Clone Set: s3auth-clone [s3auth]
     Started: [ srvnode-1 srvnode-2 ]
 Clone Set: els-search-clone [els-search]
     Started: [ srvnode-1 srvnode-2 ]
 Clone Set: statsd-clone [statsd]
     Started: [ srvnode-1 srvnode-2 ]
 haproxy-c1	(systemd:haproxy):	Started srvnode-1
 haproxy-c2	(systemd:haproxy):	Started srvnode-2
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ srvnode-1 srvnode-2 ]
 s3backcons-c1	(systemd:s3backgroundconsumer):	Started srvnode-1
 s3backcons-c2	(systemd:s3backgroundconsumer):	Started srvnode-2
 s3backprod	(systemd:s3backgroundproducer):	Started srvnode-1
 s3server-c1-1	(systemd:s3server@0x7200000000000001:0x22):	Started srvnode-1
 s3server-c1-2	(systemd:s3server@0x7200000000000001:0x25):	Started srvnode-1
 s3server-c1-3	(systemd:s3server@0x7200000000000001:0x28):	Started srvnode-1
 s3server-c1-4	(systemd:s3server@0x7200000000000001:0x2b):	Started srvnode-1
 s3server-c1-5	(systemd:s3server@0x7200000000000001:0x2e):	Started srvnode-1
 s3server-c1-6	(systemd:s3server@0x7200000000000001:0x31):	Started srvnode-1
 s3server-c1-7	(systemd:s3server@0x7200000000000001:0x34):	Started srvnode-1
 s3server-c1-8	(systemd:s3server@0x7200000000000001:0x37):	Started srvnode-1
 s3server-c1-9	(systemd:s3server@0x7200000000000001:0x3a):	Started srvnode-1
 s3server-c1-10	(systemd:s3server@0x7200000000000001:0x3d):	Started srvnode-1
 s3server-c1-11	(systemd:s3server@0x7200000000000001:0x40):	Started srvnode-1
 s3server-c2-1	(systemd:s3server@0x7200000000000001:0x6b):	Started srvnode-2
 s3server-c2-2	(systemd:s3server@0x7200000000000001:0x6e):	Started srvnode-2
 s3server-c2-3	(systemd:s3server@0x7200000000000001:0x71):	Started srvnode-2
 s3server-c2-4	(systemd:s3server@0x7200000000000001:0x74):	Started srvnode-2
 s3server-c2-5	(systemd:s3server@0x7200000000000001:0x77):	Started srvnode-2
 s3server-c2-6	(systemd:s3server@0x7200000000000001:0x7a):	Started srvnode-2
 s3server-c2-7	(systemd:s3server@0x7200000000000001:0x7d):	Started srvnode-2
 s3server-c2-8	(systemd:s3server@0x7200000000000001:0x80):	Started srvnode-2
 s3server-c2-9	(systemd:s3server@0x7200000000000001:0x83):	Started srvnode-2
 s3server-c2-10	(systemd:s3server@0x7200000000000001:0x86):	Started srvnode-2
 s3server-c2-11	(systemd:s3server@0x7200000000000001:0x89):	Started srvnode-2
 mero-free-space-mon	(systemd:mero-free-space-monitor):	Started srvnode-2
 Master/Slave Set: sspl-master [sspl]
     Masters: [ srvnode-2 ]
     Slaves: [ srvnode-1 ]
 Resource Group: csm-kibana
     kibana-vip	(ocf::heartbeat:IPaddr2):	Started srvnode-2
     kibana	(systemd:kibana):	Started srvnode-2
     csm-web	(systemd:csm_web):	Started srvnode-2
     csm-agent	(systemd:csm_agent):	Started srvnode-2
 uds	(systemd:uds):	Started srvnode-2

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

All systems green. Zarro boogs seen.


Note that some (most?) pcs commands require superuser privileges.

$ pcs status
Error: cluster is not currently running on this node
$
$ sudo pcs status
Cluster name: cortx_cluster
Stack: corosync
Current DC: srvnode-2 (version 1.1.20-5.el7_7.2-3c4c782f70) - partition with quorum
[...]

$ pcs status groups
$
$ sudo pcs status groups
c1: ip-c1 consul-c1 lnet-c1 hax-c1 mero-confd-c1 mero-ios-c1
c2: ip-c2 consul-c2 lnet-c2 hax-c2 mero-confd-c2 mero-ios-c2
csm-kibana: kibana-vip kibana csm-web csm-agent
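
The anomalies described below tend to show up as a few tell-tale keywords in the pcs status output. As an informal shortcut (not an official check), you can scan for all of them at once:

sudo pcs status | grep -Ei 'stopped|failed|disabled|standby|unclean'

No output means none of these keywords were found; any match deserves a closer look in the corresponding section below.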

HA Anomalies

Disabled Resources

Check:

sudo pcs status | grep disabled

Solution:

sudo pcs resource enable <resource>...
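
For example, if the uds resource had been disabled (a hypothetical scenario; the resource name and the output line are only for illustration), the check and the fix would look roughly like this:

$ sudo pcs status | grep disabled
 uds	(systemd:uds):	Stopped (disabled)

$ sudo pcs resource enable uds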

Failed Resource Actions

Check:

$ sudo pcs status | sed -n '/Failed Resource Actions/,$ p'

Failed Resource Actions:
* s3server-c2-4_monitor_60000 on srvnode-2 'not running' (7): call=585, status=complete, exitreason='',
   last-rc-change='Tue May 26 10:38:45 2020', queued=0ms, exec=0ms
* s3server-c2-7_monitor_60000 on srvnode-2 'not running' (7): call=588, status=complete, exitreason='',
   last-rc-change='Tue May 26 10:38:46 2020', queued=0ms, exec=0ms
* s3server-c2-1_monitor_60000 on srvnode-2 'not running' (7): call=582, status=complete, exitreason='',
   last-rc-change='Tue May 26 10:38:45 2020', queued=0ms, exec=0ms
* s3server-c2-11_monitor_60000 on srvnode-2 'not running' (7): call=592, status=complete, exitreason='',
   last-rc-change='Tue May 26 10:38:46 2020', queued=0ms, exec=0ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Failed _monitor_ actions are harmless and can be ignored.

⚠️ Watch out for failed _start_ actions. If a resource isn't starting, the usual causes are a misconfiguration of the resource (which can be debugged via the system log), a constraint preventing the resource from starting, or the resource being disabled.

Solution:

$ sudo pcs resource cleanup

Cleaned up all resources on all nodes
Waiting for 4 replies from the CRMd.... OK

⚠️ This may cause resources with failed actions to restart, and any dependent services may be restarted as well. Give the cluster a few minutes to settle.

If _start_ actions are still failing after the cleanup, collect the logs and open a bug report.
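
pcs resource cleanup also accepts a resource name, which limits the cleanup (and any resulting restarts) to that resource alone. For the s3server-c2-4 failure shown above, a more targeted cleanup would be:

$ sudo pcs resource cleanup s3server-c2-4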

Node in Standby Mode

A node in standby mode cannot host resources.

Check:

$ sudo pcs status | grep standby
Node srvnode-1: standby

Possible cause: the cluster administrator put the node into standby mode and forgot to take it out of standby.

Solution:

sudo pcs node unstandby <node>

⚠️ This will trigger the failback. Give it 3–4 minutes to complete.
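
For the example above, where srvnode-1 is in standby, the sequence would be (output is illustrative):

$ sudo pcs node unstandby srvnode-1
$
$ sudo pcs status | grep standby
$

Once the failback completes, the standby grep returns nothing, and pcs status groups should show the resource groups spread across both nodes again.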

Failover

Clue: resource groups c1 and c2 are started on the same node.

In this example, c1 and c2 both run on srvnode-2:

Resource Group: c1
     ip-c1      (ocf::heartbeat:IPaddr2):       Started srvnode-2
     consul-c1  (systemd:hare-consul-agent-c1): Started srvnode-2
     lnet-c1    (ocf::eos:lnet):        Started srvnode-2
     hax-c1     (systemd:hare-hax-c1):  Started srvnode-2
     mero-confd-c1      (systemd:m0d@0x7200000000000001:0x9):   Started srvnode-2
     mero-ios-c1        (systemd:m0d@0x7200000000000001:0xc):   Started srvnode-2
 Resource Group: c2
     ip-c2      (ocf::heartbeat:IPaddr2):       Started srvnode-2
     consul-c2  (systemd:hare-consul-agent-c2): Started srvnode-2
     lnet-c2    (ocf::eos:lnet):        Started srvnode-2
     hax-c2     (systemd:hare-hax-c2):  Started srvnode-2
     mero-confd-c2      (systemd:m0d@0x7200000000000001:0x52):  Started srvnode-2
     mero-ios-c2        (systemd:m0d@0x7200000000000001:0x55):  Started srvnode-2

Solution: XXX TBD

Reporting Bugs

First of all, collect the forensic data:

$ hctl reportbug
Created /tmp/hare/hare_smc19-m10.tar.gz

Attach this archive file to your bug report.
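
If the cluster state itself is part of the problem, it can also help to capture the full Pacemaker status at the time of failure and attach it alongside the archive (an optional extra step; hctl reportbug does not produce this file, and the file name is just a suggestion):

$ sudo pcs status --full > /tmp/pcs-status-$(hostname).txt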

Information to include:

  • environment
  • steps to reproduce
  • expected result
  • actual result