Fix: DC lost during wait #1483

Merged
merged 6 commits into master from refine-pr-1475
Jul 23, 2024

Conversation

nicholasyang2022 (Collaborator)

This is a refined version of #1475.

crmsh.utils.wait4dc traces the DC transition until it becomes stable. However, at certain steps during the transition there is no DC in the cluster; wait4dc does not handle this situation and fails.
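As a rough illustration of the fix's intent (a minimal sketch with a hypothetical helper name, not the actual crmsh code): poll for a DC up to a deadline, and treat a temporarily missing DC as "transition still in progress" rather than a fatal error.

```python
import subprocess
import time


def wait_for_dc(timeout_s: int = 20, interval_s: float = 2.0) -> str:
    """Poll `crmadmin -D` until a DC is reported or the deadline expires.

    A poll where no DC is reported is treated as "election still in
    progress" and retried; only hitting the overall deadline is an error.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        result = subprocess.run(["crmadmin", "-D", "-t", "1"],
                                capture_output=True, text=True)
        # Ignore the exit status here: during a transition crmadmin may
        # fail with codes such as 113 (CRM_EX_UNSATISFIED, see below).
        if result.returncode == 0 and ":" in result.stdout:
            # e.g. "Designated Controller is: es-2-virt1"
            return result.stdout.rsplit(":", 1)[-1].strip()
        time.sleep(interval_s)
    raise TimeoutError(f"no DC elected within {timeout_s}s")
```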

@@ -454,4 +454,9 @@

# Commands that are deprecated and hidden from UI
HIDDEN_COMMANDS = {'ms'}

# pacemaker crm/common/results.h
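The comment above points at pacemaker's exit-code definitions; the hunk is truncated here, so the constants the PR actually adds are not visible. Purely for illustration, such constants might look like this (CRM_EX_OK is 0 by definition, and 113 is the value quoted later in this thread):

```python
# Illustration only: two exit codes from pacemaker's crm/common/results.h
# (crm_exit_e). The exact constants added by this PR are not shown in the
# truncated hunk above; 113 is the value quoted later in this thread.
CRM_EX_OK = 0             # success
CRM_EX_UNSATISFIED = 113  # requested item does not satisfy constraints
```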
Contributor

Just hit another one:

      Stderr: WARNING: Unknown return code from crmadmin: 113
      WARNING: DC lost during wait

which is
CRM_EX_UNSATISFIED = 113, //!< Requested item does not satisfy constraints
and it seems this case is mapped in ./lib/common/results.c in pacemaker:

        case EAGAIN:
        case EBUSY:
            return CRM_EX_UNSATISFIED;

So it is worth ignoring that return code and giving it another try.
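A sketch of that suggestion (hypothetical helper, not necessarily what the PR ended up with): classify such return codes as transient and retry instead of failing. The codes are the ones that appear in this thread's logs; whether the PR retries on exactly these codes is not shown here.

```python
# Return codes seen in this thread while the cluster is still settling;
# whether the PR retries on exactly these codes is an assumption.
TRANSIENT_RCS = {
    102,  # could not connect to the controller (right after stop/start)
    113,  # CRM_EX_UNSATISFIED, mapped from EAGAIN/EBUSY in results.c
    124,  # no reply from the controller before the timeout
}


def should_retry(rc: int) -> bool:
    """Treat these non-zero crmadmin exit codes as "try again later"."""
    return rc in TRANSIENT_RCS
```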

Collaborator (Author)

So, just reusing the original version of get_dc, ignoring the exit status, is good enough.
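A sketch of what "ignoring the exit status" means in practice (hypothetical code, not the actual crmsh get_dc): rely only on whether stdout names a controller.

```python
import subprocess
from typing import Optional


def get_dc(timeout: str = "1") -> Optional[str]:
    """Return the DC's node name, or None if no DC is currently reported.

    The exit status of crmadmin is deliberately ignored: while the cluster
    is recalculating, crmadmin may fail with various codes even though the
    condition is only temporary.
    """
    result = subprocess.run(["crmadmin", "-D", "-t", timeout],
                            capture_output=True, text=True)
    out = result.stdout.strip()
    # Expected output on success: "Designated Controller is: <node>"
    if "Designated" in out and ":" in out:
        return out.rsplit(":", 1)[-1].strip()
    return None
```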

crmsh/utils.py (Outdated)
if not ServiceManager().service_is_active("pacemaker.service", remote_addr=node):
    raise ValueError("Pacemaker is not running. No DC.")
dc_deadtime = get_property("dc-deadtime", peer=node) or str(constants.DC_DEADTIME_DEFAULT)
dc_timeout = int(dc_deadtime.strip('s')) + 5
Contributor

The dc-deadtime value is a duration: its unit can be s, m, h, or maybe even d (days), so stripping only the trailing 's' is not enough.
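A minimal sketch of a unit-aware parser in the spirit of pacemaker's crm_msec (hypothetical helper; the unit map and the bare-number-means-seconds default are assumptions, not the code this PR actually uses):

```python
import re

# Hypothetical crm_msec-like helper: convert a pacemaker duration string
# such as "20s", "2min", "1h" or "1d" into milliseconds.
_UNIT_MS = {
    "": 1000, "ms": 1, "msec": 1,       # assumption: bare number = seconds
    "s": 1000, "sec": 1000,
    "m": 60 * 1000, "min": 60 * 1000,
    "h": 60 * 60 * 1000, "hr": 60 * 60 * 1000,
    "d": 24 * 60 * 60 * 1000,
}


def crm_msec(value: str) -> int:
    match = re.fullmatch(r"\s*(\d+)\s*([a-zA-Z]*)\s*", value)
    if not match:
        raise ValueError(f"invalid duration: {value!r}")
    number, unit = match.groups()
    if unit.lower() not in _UNIT_MS:
        raise ValueError(f"unknown time unit in {value!r}")
    return int(number) * _UNIT_MS[unit.lower()]
```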

@nicholasyang2022 nicholasyang2022 force-pushed the refine-pr-1475 branch 3 times, most recently from 2af3a86 to cd6f99b Compare July 11, 2024 09:55
freishutz and others added 3 commits July 12, 2024 14:54
The initial error that we faced during our online storage upgrade:
```
Command: x Exascaler Install: apply_lustre_params,create_udev_rules,email,emf_agent,emf_node_manager,ha,hosts,ipmi,kdump,logging,lustre,lvm,mdt_backup,modprobe,nics,ntp,os,ost_pools,restart_network,serial,start_cluster,sync_exa_toml (Config ver. 1) failed
User: api

  Job: x es-install --steps start_cluster on node5 failed

    Step: x Run config-pacemaker on node5 failed (took: 12s 534ms 171us 586ns)
    Result (Error):
      Bad Exit Code: 1.
    Started: 2024-02-07T03:26:16.158Z
    Ended: 2024-02-07T03:26:28.692Z
    Stdout:
      Running Command: config-pacemaker --unmanaged-emf
    Stderr:
      x Command has failed.
      Code: exit status: 1
      Stdout: INFO: cib.commit: committed '5e8558de-1ceb-46c2-bd70-1ab4d8504c9f' shadow CIB to the cluster

      Stderr: WARNING: DC lost during wait
```

Basically, the source of our problems is shown below (case 3: DC election or voting during cluster recalculation):

```
[root@es-1-virt1 ~]# crmadmin -D -t 1; echo $?
Designated Controller is: es-2-virt1
0

[root@es-1-virt1 ~]# crm cluster stop
INFO: The cluster stack stopped on es-1-virt1
[root@es-1-virt1 ~]# crmadmin -D -t 1; echo $?
error: Could not connect to controller: Connection refused
error: Command failed: Connection refused
102

[root@es-1-virt1 ~]# crm cluster start
INFO: The cluster stack started on es-1-virt1
[root@es-1-virt1 ~]# crmadmin -D -t 1; echo $?
error: No reply received from controller before timeout (1000ms)
error: Command failed: Connection timed out
124
```
Potentially we have a dead loop in dc_waiter, but that would also mean pacemaker is stuck in the same state, and in the worst case the wait should not take longer than 'dc-deadtime'.

codecov bot commented Jul 12, 2024

Codecov Report

Attention: Patch coverage is 80.00000% with 6 lines in your changes missing coverage. Please review.

Project coverage is 69.57%. Comparing base (72c64e6) to head (2465ff3).

Additional details and impacted files
Flag Coverage Δ
integration 54.18% <80.00%> (+<0.01%) ⬆️
unit 52.49% <20.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
crmsh/prun/runner.py 67.59% <100.00%> (ø)
crmsh/ui_cluster.py 77.96% <100.00%> (+0.32%) ⬆️
crmsh/ui_context.py 58.46% <100.00%> (ø)
crmsh/ui_configure.py 44.15% <0.00%> (ø)
crmsh/ui_history.py 29.43% <0.00%> (ø)
crmsh/ui_resource.py 70.43% <0.00%> (ø)
crmsh/utils.py 68.67% <87.50%> (+0.04%) ⬆️

☔ View full report in Codecov by Sentry.

@nicholasyang2022 nicholasyang2022 marked this pull request as ready for review July 12, 2024 07:51
@freishutz (Contributor) left a comment

I think that if nothing happens within dc_timeout = crm_msec(dc_deadtime) // 1000 + 5, this should work as expected.
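For illustration, plugging an example value into that formula, using a crm_msec-style conversion as sketched earlier (the dc-deadtime value here is an example, not taken from the PR):

```python
dc_deadtime = "20s"                             # example value only
dc_timeout = crm_msec(dc_deadtime) // 1000 + 5  # 20000 // 1000 + 5 == 25
```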

@liangxin1300 (Collaborator) left a comment

Thanks!

@liangxin1300 liangxin1300 merged commit 6803994 into master Jul 23, 2024
34 checks passed