Fix: DC lost during wait #1483

Merged
merged 6 commits into master from refine-pr-1475
Jul 23, 2024

Conversation

nicholasyang2022 (Collaborator)

This is a refined version of #1475.

crmsh.utils.wait4dc traces the DC transition until it becomes stable. However, at certain steps during the transition there is no DC in the cluster; wait4dc does not handle this situation and fails.
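As a rough illustration of the fix's intent (a minimal sketch with a hypothetical helper name, not the actual crmsh code): poll for a DC up to a deadline, and treat a temporarily missing DC as "transition still in progress" rather than a fatal error.

```python
import subprocess
import time


def wait_for_dc(timeout_s: int = 20, interval_s: float = 2.0) -> str:
    """Poll `crmadmin -D` until a DC is reported or the deadline expires.

    A poll where no DC is reported is treated as "election still in
    progress" and retried; only hitting the overall deadline is an error.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        result = subprocess.run(["crmadmin", "-D", "-t", "1"],
                                capture_output=True, text=True)
        # Ignore the exit status here: during a transition crmadmin may
        # fail with codes such as 113 (CRM_EX_UNSATISFIED, see below).
        if result.returncode == 0 and ":" in result.stdout:
            # e.g. "Designated Controller is: es-2-virt1"
            return result.stdout.rsplit(":", 1)[-1].strip()
        time.sleep(interval_s)
    raise TimeoutError(f"no DC elected within {timeout_s}s")
```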

@@ -454,4 +454,9 @@

# Commands that are deprecated and hidden from UI
HIDDEN_COMMANDS = {'ms'}

# pacemaker crm/common/results.h
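The comment above points at pacemaker's exit-code definitions; the hunk is truncated here, so the constants the PR actually adds are not visible. Purely for illustration, such constants might look like this (CRM_EX_OK is 0 by definition, and 113 is the value quoted later in this thread):

```python
# Illustration only: two exit codes from pacemaker's crm/common/results.h
# (crm_exit_e). The exact constants added by this PR are not shown in the
# truncated hunk above; 113 is the value quoted later in this thread.
CRM_EX_OK = 0             # success
CRM_EX_UNSATISFIED = 113  # requested item does not satisfy constraints
```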
Contributor

Just hit another one:

      Stderr: WARNING: Unknown return code from crmadmin: 113
      WARNING: DC lost during wait

which is
CRM_EX_UNSATISFIED = 113, //!< Requested item does not satisfy constraints
and it seems this case is mapped in ./lib/common/results.c in pacemaker:

        case EAGAIN:
        case EBUSY:
            return CRM_EX_UNSATISFIED;

So it is worth ignoring that return code and giving it another try.
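A sketch of that suggestion (hypothetical helper, not necessarily what the PR ended up with): classify such return codes as transient and retry instead of failing. The codes are the ones that appear in this thread's logs; whether the PR retries on exactly these codes is not shown here.

```python
# Return codes seen in this thread while the cluster is still settling;
# whether the PR retries on exactly these codes is an assumption.
TRANSIENT_RCS = {
    102,  # could not connect to the controller (right after stop/start)
    113,  # CRM_EX_UNSATISFIED, mapped from EAGAIN/EBUSY in results.c
    124,  # no reply from the controller before the timeout
}


def should_retry(rc: int) -> bool:
    """Treat these non-zero crmadmin exit codes as "try again later"."""
    return rc in TRANSIENT_RCS
```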

Collaborator (Author)

So, just reusing the original version of get_dc, ignoring the exit status, is good enough.
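A sketch of what "ignoring the exit status" means in practice (hypothetical code, not the actual crmsh get_dc): rely only on whether stdout names a controller.

```python
import subprocess
from typing import Optional


def get_dc(timeout: str = "1") -> Optional[str]:
    """Return the DC's node name, or None if no DC is currently reported.

    The exit status of crmadmin is deliberately ignored: while the cluster
    is recalculating, crmadmin may fail with various codes even though the
    condition is only temporary.
    """
    result = subprocess.run(["crmadmin", "-D", "-t", timeout],
                            capture_output=True, text=True)
    out = result.stdout.strip()
    # Expected output on success: "Designated Controller is: <node>"
    if "Designated" in out and ":" in out:
        return out.rsplit(":", 1)[-1].strip()
    return None
```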

crmsh/utils.py (Outdated)
if not ServiceManager().service_is_active("pacemaker.service", remote_addr=node):
    raise ValueError("Pacemaker is not running. No DC.")
dc_deadtime = get_property("dc-deadtime", peer=node) or str(constants.DC_DEADTIME_DEFAULT)
dc_timeout = int(dc_deadtime.strip('s')) + 5
Contributor

The dc-deadtime value is a duration: its unit can be s, m, h, or maybe even d (days), so stripping only the trailing 's' is not enough.
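A minimal sketch of a unit-aware parser in the spirit of pacemaker's crm_msec (hypothetical helper; the unit map and the bare-number-means-seconds default are assumptions, not the code this PR actually uses):

```python
import re

# Hypothetical crm_msec-like helper: convert a pacemaker duration string
# such as "20s", "2min", "1h" or "1d" into milliseconds.
_UNIT_MS = {
    "": 1000, "ms": 1, "msec": 1,       # assumption: bare number = seconds
    "s": 1000, "sec": 1000,
    "m": 60 * 1000, "min": 60 * 1000,
    "h": 60 * 60 * 1000, "hr": 60 * 60 * 1000,
    "d": 24 * 60 * 60 * 1000,
}


def crm_msec(value: str) -> int:
    match = re.fullmatch(r"\s*(\d+)\s*([a-zA-Z]*)\s*", value)
    if not match:
        raise ValueError(f"invalid duration: {value!r}")
    number, unit = match.groups()
    if unit.lower() not in _UNIT_MS:
        raise ValueError(f"unknown time unit in {value!r}")
    return int(number) * _UNIT_MS[unit.lower()]
```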

@nicholasyang2022 nicholasyang2022 force-pushed the refine-pr-1475 branch 3 times, most recently from 2af3a86 to cd6f99b Compare July 11, 2024 09:55
freishutz and others added 3 commits July 12, 2024 14:54
The initial error that we faced during our online storage upgrade:
```
Command: x Exascaler Install: apply_lustre_params,create_udev_rules,email,emf_agent,emf_node_manager,ha,hosts,ipmi,kdump,logging,lustre,lvm,mdt_backup,modprobe,nics,ntp,os,ost_pools,restart_network,serial,start_cluster,sync_exa_toml (Config ver. 1) failed
User: api

  Job: x es-install --steps start_cluster on node5 failed

    Step: x Run config-pacemaker on node5 failed (took: 12s 534ms 171us 586ns)
    Result (Error):
      Bad Exit Code: 1.
    Started: 2024-02-07T03:26:16.158Z
    Ended: 2024-02-07T03:26:28.692Z
    Stdout:
      Running Command: config-pacemaker --unmanaged-emf
    Stderr:
      x Command has failed.
      Code: exit status: 1
      Stdout: INFO: cib.commit: committed '5e8558de-1ceb-46c2-bd70-1ab4d8504c9f' shadow CIB to the cluster

      Stderr: WARNING: DC lost during wait
```

Basically, the source of our problems is shown below (case 3: DC election or voting during cluster recalculation):

```
[root@es-1-virt1 ~]# crmadmin -D -t 1; echo $?
Designated Controller is: es-2-virt1
0

[root@es-1-virt1 ~]# crm cluster stop
INFO: The cluster stack stopped on es-1-virt1
[root@es-1-virt1 ~]# crmadmin -D -t 1; echo $?
error: Could not connect to controller: Connection refused
error: Command failed: Connection refused
102

[root@es-1-virt1 ~]# crm cluster start
INFO: The cluster stack started on es-1-virt1
[root@es-1-virt1 ~]# crmadmin -D -t 1; echo $?
error: No reply received from controller before timeout (1000ms)
error: Command failed: Connection timed out
124
```
Potentially we have a dead loop in dc_waiter, but that would also mean pacemaker is stuck in the same state, and in the worst case the wait should not take longer than 'dc-deadtime'.

codecov bot commented Jul 12, 2024

Codecov Report

Attention: Patch coverage is 80.00000% with 6 lines in your changes missing coverage. Please review.

Project coverage is 69.57%. Comparing base (72c64e6) to head (2465ff3).

Additional details and impacted files
Flag Coverage Δ
integration 54.18% <80.00%> (+<0.01%) ⬆️
unit 52.49% <20.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
crmsh/prun/runner.py 67.59% <100.00%> (ø)
crmsh/ui_cluster.py 77.96% <100.00%> (+0.32%) ⬆️
crmsh/ui_context.py 58.46% <100.00%> (ø)
crmsh/ui_configure.py 44.15% <0.00%> (ø)
crmsh/ui_history.py 29.43% <0.00%> (ø)
crmsh/ui_resource.py 70.43% <0.00%> (ø)
crmsh/utils.py 68.67% <87.50%> (+0.04%) ⬆️

☔ View full report in Codecov by Sentry.

@nicholasyang2022 nicholasyang2022 marked this pull request as ready for review July 12, 2024 07:51
@freishutz (Contributor) left a comment

I think that if nothing happens within dc_timeout = crm_msec(dc_deadtime) // 1000 + 5, this should work as expected.
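For illustration, plugging an example value into that formula, using a crm_msec-style conversion as sketched earlier (the dc-deadtime value here is an example, not taken from the PR):

```python
dc_deadtime = "20s"                             # example value only
dc_timeout = crm_msec(dc_deadtime) // 1000 + 5  # 20000 // 1000 + 5 == 25
```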

@liangxin1300 (Collaborator) left a comment

Thanks!

@liangxin1300 liangxin1300 merged commit 6803994 into master Jul 23, 2024
34 checks passed