Fix: DC lost during wait #1483
Conversation
crmsh/constants.py
```diff
@@ -454,4 +454,9 @@

 # Commands that are deprecated and hidden from UI
 HIDDEN_COMMANDS = {'ms'}

+# pacemaker crm/common/results.h
```
Just hit another one:

```
Stderr: WARNING: Unknown return code from crmadmin: 113
WARNING: DC lost during wait
```

which is

```c
CRM_EX_UNSATISFIED = 113, //!< Requested item does not satisfy constraints
```

and it seems this case is mapped in `./lib/common/results.c` in pacemaker:

```c
case EAGAIN:
case EBUSY:
    return CRM_EX_UNSATISFIED;
```

So it's worth ignoring that RC and giving it another try.
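As a minimal sketch of that "give it another try" idea (plain `subprocess` calls; the function name, retry count, and delay are hypothetical, not crmsh's actual code):

```python
import subprocess
import time

CRM_EX_UNSATISFIED = 113  # from pacemaker's crm/common/results.h

def crmadmin_dc_with_retry(retries=3, delay=1.0):
    """Run `crmadmin -D -t 1`, retrying while it exits with 113.

    Pacemaker maps EAGAIN/EBUSY to CRM_EX_UNSATISFIED, so 113 usually
    means "DC election still in progress", not a hard failure.
    """
    proc = None
    for _ in range(retries):
        proc = subprocess.run(
            ["crmadmin", "-D", "-t", "1"],
            capture_output=True, text=True,
        )
        if proc.returncode != CRM_EX_UNSATISFIED:
            break
        time.sleep(delay)  # transient: give the election another try
    return proc
```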
So just reusing the original version of `get_dc`, ignoring the exit status, is good enough.
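For illustration, a sketch of such a `get_dc` under the assumption that only stdout matters (the regex and helper shape are mine, not necessarily crmsh's exact code):

```python
import re
import subprocess

def get_dc():
    """Return the DC node name, or None if no DC is currently elected.

    Deliberately ignores crmadmin's exit status: transient codes such
    as CRM_EX_UNSATISFIED (113) only mean "no DC right now".
    """
    proc = subprocess.run(
        ["crmadmin", "-D", "-t", "1"],
        capture_output=True, text=True,
    )
    # Expected stdout: "Designated Controller is: <node>"
    match = re.search(r"Designated Controller is:\s*(\S+)", proc.stdout)
    return match.group(1) if match else None
```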
crmsh/utils.py
```python
if not ServiceManager().service_is_active("pacemaker.service", remote_addr=node):
    raise ValueError("Pacemaker is not running. No DC.")
dc_deadtime = get_property("dc-deadtime", peer=node) or str(constants.DC_DEADTIME_DEFAULT)
dc_timeout = int(dc_deadtime.strip('s')) + 5
```
The `dc-deadtime` value takes a time value, and its unit could be `s`|`m`|`h`, and maybe even `d` (days). A code sketch of unit-aware parsing follows below.
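To make the unit issue concrete, here is a minimal `crm_msec`-style parser, loosely modeled on pacemaker's `crm_get_msec()`; the name, unit table, and error convention are illustrative assumptions, not crmsh's exact code:

```python
import re

# Unit table loosely modeled on pacemaker's crm_get_msec(); a bare
# number defaults to seconds. (Sketch only; whether days are supported
# is exactly the open question raised above.)
_MSEC_PER_UNIT = {
    "ms": 1, "msec": 1,
    "s": 1000, "sec": 1000, "": 1000,
    "m": 60 * 1000, "min": 60 * 1000,
    "h": 60 * 60 * 1000, "hr": 60 * 60 * 1000,
}

def crm_msec_sketch(value):
    """Parse a time string like '20s', '1m', or '300' into milliseconds."""
    m = re.match(r"\s*(\d+)\s*([a-z]*)\s*$", str(value).lower())
    if not m or m.group(2) not in _MSEC_PER_UNIT:
        return -1  # mirror crm_get_msec's error convention
    return int(m.group(1)) * _MSEC_PER_UNIT[m.group(2)]
```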
The initial error that we faced during our online storage upgrade:

```
Command: x Exascaler Install: apply_lustre_params,create_udev_rules,email,emf_agent,emf_node_manager,ha,hosts,ipmi,kdump,logging,lustre,lvm,mdt_backup,modprobe,nics,ntp,os,ost_pools,restart_network,serial,start_cluster,sync_exa_toml (Config ver. 1) failed
User: api
Job: x es-install --steps start_cluster on node5 failed
Step: x Run config-pacemaker on node5 failed (took: 12s 534ms 171us 586ns)
Result (Error): Bad Exit Code: 1.
Started: 2024-02-07T03:26:16.158Z
Ended: 2024-02-07T03:26:28.692Z
Stdout: Running Command: config-pacemaker --unmanaged-emf
Stderr: x Command has failed. Code: exit status: 1
Stdout: INFO: cib.commit: committed '5e8558de-1ceb-46c2-bd70-1ab4d8504c9f' shadow CIB to the cluster
Stderr: WARNING: DC lost during wait
```

Basically, the source of our problems is shown below (case 3: DC election or voting during cluster recalculation):

```
[root@es-1-virt1 ~]# crmadmin -D -t 1; echo $?
Designated Controller is: es-2-virt1
0
[root@es-1-virt1 ~]# crm cluster stop
INFO: The cluster stack stopped on es-1-virt1
[root@es-1-virt1 ~]# crmadmin -D -t 1; echo $?
error: Could not connect to controller: Connection refused
error: Command failed: Connection refused
102
[root@es-1-virt1 ~]# crm cluster start
INFO: The cluster stack started on es-1-virt1
[root@es-1-virt1 ~]# crmadmin -D -t 1; echo $?
error: No reply received from controller before timeout (1000ms)
error: Command failed: Connection timed out
124
```

Potentially, we have an infinite loop in dc_waiter, but that also means pacemaker is in the same state, and in the worst case the wait should not take longer than 'dc-deadtime'.
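A sketch of the bounded wait this reasoning implies, reusing the hypothetical `get_dc()` above (the function name and 1-second poll interval are assumptions):

```python
import time

def wait_for_dc(dc_timeout):
    """Poll get_dc() until a DC is elected or dc_timeout (seconds) expires.

    The bound matters: a transient "no DC" phase during an election must
    not turn the wait into an infinite loop, and per the reasoning above
    it should resolve within roughly dc-deadtime.
    """
    deadline = time.monotonic() + dc_timeout
    while time.monotonic() < deadline:
        dc = get_dc()
        if dc is not None:
            return dc
        time.sleep(1)
    raise TimeoutError("no DC elected within %s seconds" % dc_timeout)
```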
I think that if nothing goes wrong with `dc_timeout = crm_msec(dc_deadtime) // 1000 + 5`, then this should work as expected.
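For example, with the sketch parser from above (input values are illustrative):

```python
# dc-deadtime -> dc_timeout in seconds, via the sketch parser above
for deadtime in ("20s", "1m", "300"):
    dc_timeout = crm_msec_sketch(deadtime) // 1000 + 5
    print(deadtime, "->", dc_timeout)  # 20s -> 25, 1m -> 65, 300 -> 305
```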
Thanks!
This is a refined version of #1475.
`crmsh.utils.wait4dc` traces the DC transition until it becomes stable. However, at certain steps during the transition there is no DC in the cluster; `wait4dc` does not handle this situation and fails.
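One way to read the fix, combining the sketches above into an outline (a sketch only: `wait4dc_tolerant` and `transition_status` are hypothetical stand-ins for however `wait4dc` actually inspects the transition, and they reuse `get_dc()` and `wait_for_dc()` from earlier):

```python
import re
import subprocess
import time

def transition_status(dc):
    """Hypothetical helper: query the DC's FSM state via `crmadmin -S`."""
    proc = subprocess.run(["crmadmin", "-S", dc],
                          capture_output=True, text=True)
    m = re.search(r"(S_[A-Z_]+)", proc.stdout)
    return m.group(1) if m else None

def wait4dc_tolerant(dc_timeout):
    """Trace the transition, treating a momentarily missing DC as part
    of the transition (an election in progress), not as a fatal error."""
    while True:
        dc = get_dc()
        if dc is None:
            # No DC at this step: wait (bounded by dc_timeout) for the
            # election to finish instead of failing immediately.
            dc = wait_for_dc(dc_timeout)
        if transition_status(dc) == "S_IDLE":  # stable: transition done
            return True
        time.sleep(1)
```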