DAOS-16702 rebuild: restart rebuild for a massive failure case #15343

liuxuezhao · 2024-10-18T10:46:13Z

In a special massive-failure case:

1. Some engines go down and trigger a rebuild.
2. One engine that is participating in the rebuild goes down again before the rebuild finishes; the number of failures now exceeds the pool RF, so the pool map is not changed.
3. The administrator restarts that engine.

In this case the rebuild task on that engine should be recovered; to keep things simple, for now we just abort and retry the global rebuild task.
The issue does not occur with the typical recovery approach, which restarts the whole system including the PS leader.
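The abort-and-retry decision described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: `struct sketch_rebuild`, `sketch_rank_is_wip`, and `sketch_restart_rebuild_if_rank_wip` are hypothetical names, and the real rebuild state in DAOS is considerably richer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for per-pool rebuild state. */
#define SKETCH_MAX_RANKS 8

struct sketch_rebuild {
	bool     in_progress;                /* a global rebuild task is running */
	bool     rank_wip[SKETCH_MAX_RANKS]; /* rank joined and has not finished */
	unsigned restart_count;              /* times the task was aborted and retried */
};

/* Return true if `rank` was taking part in a rebuild that has not finished. */
static bool
sketch_rank_is_wip(const struct sketch_rebuild *rb, uint32_t rank)
{
	assert(rb != NULL);
	return rb->in_progress && rank < SKETCH_MAX_RANKS && rb->rank_wip[rank];
}

/*
 * Called when a rank is reported alive again. If that rank still had rebuild
 * work in progress, its local task state was lost in the restart, so the
 * whole (global) task is aborted and retried instead of being recovered on
 * that one engine. Returns true if a restart was triggered.
 */
static bool
sketch_restart_rebuild_if_rank_wip(struct sketch_rebuild *rb, uint32_t rank)
{
	if (!sketch_rank_is_wip(rb, rank))
		return false;

	/* Abort + retry is modelled here as a counter bump only. */
	rb->restart_count++;
	return true;
}
```

A rank that was not part of the unfinished rebuild leaves the task untouched, which is the property the review discussion below is concerned with.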

Before requesting gatekeeper:

Two review approvals and any prior change requests have been resolved.
Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Commit messages follow the guidelines outlined here.
Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

github-actions · 2024-10-18T10:46:30Z

Ticket title is 'Rebuilding cannot be completed after restarting ranks in cases of massive failures.'
Status is 'In Progress'
https://daosio.atlassian.net/browse/DAOS-16702

daosbuild1 · 2024-10-19T00:19:19Z

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15343/2/execution/node/1397/log

src/pool/srv_pool.c

Skip-nlt: true Signed-off-by: Xuezhao Liu <xuezhao.liu@intel.com>

wangshilong · 2024-10-21T07:44:26Z

src/pool/srv_pool.c

@@ -1489,6 +1526,9 @@ handle_event(struct pool_svc *svc, struct pool_svc_event_set *event_set)

 		if (event->psv_src != CRT_EVS_SWIM || event->psv_type != CRT_EVT_ALIVE)
 			continue;
+
+		pool_restart_rebuild_if_rank_wip(svc->ps_pool, event->psv_rank);


[Question] As @liw mentioned last week, if this ALIVE event means the rank state went from SUSPECT to ALIVE while a rebuild is running and has not finished, will we abort the rebuild unexpectedly?

I don't know of a way to differentiate or avoid that case; even an engine that crashes and restarts within a short time may show up as SUSPECT followed by ALIVE.
It also seems that the SWIM protocol is not required to report an ALIVE event when a rank leaves SUSPECT status.

wangshilong

I ran the test locally; somehow the rebuild finished but with 2007 errors, and it retries forever.

liuxuezhao · 2024-10-21T08:03:14Z

> I ran the test locally; somehow the rebuild finished but with 2007 errors, and it retries forever.

On my side, I tested that DER_STALE is retried and the rebuild then finishes. I'll check the details with you offline.
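For illustration, the difference between "retries and finishes" and "retries forever" can be sketched with a bounded retry loop. Everything here is hypothetical: `SKETCH_DER_STALE` assumes that the "2007 errors" above correspond to DER_STALE, and `sketch_rebuild_with_retry`, `sketch_stale_then_ok`, and the retry cap are illustrative names and values, not DAOS code.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: the "2007 errors" in this thread are DER_STALE. */
#define SKETCH_DER_STALE   2007
#define SKETCH_MAX_RETRIES 5

typedef int (*sketch_rebuild_fn)(void *arg);

/*
 * Run one rebuild attempt at a time. A stale error triggers another attempt,
 * any other result (success or a different error) ends the loop. The cap
 * keeps a persistent stale condition from retrying forever.
 */
static int
sketch_rebuild_with_retry(sketch_rebuild_fn fn, void *arg, unsigned *attempts)
{
	int      rc = 0;
	unsigned n;

	assert(fn != NULL);
	for (n = 1; n <= SKETCH_MAX_RETRIES; n++) {
		rc = fn(arg);
		if (attempts != NULL)
			*attempts = n;
		if (rc != -SKETCH_DER_STALE)
			break;
	}
	return rc;
}

/* Hypothetical attempt: reports stale until the counter reaches zero. */
static int
sketch_stale_then_ok(void *arg)
{
	unsigned *remaining = arg;

	if (*remaining > 0) {
		(*remaining)--;
		return -SKETCH_DER_STALE;
	}
	return 0;
}
```

With two stale results the loop succeeds on the third attempt; if the stale condition never clears, the loop stops at the cap and returns the stale error to the caller.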

wangshilong · 2024-10-21T13:51:30Z

> > I ran the test locally; somehow the rebuild finished but with 2007 errors, and it retries forever.

> On my side, I tested that DER_STALE is retried and the rebuild then finishes. I'll check the details with you offline.

See logs here:

https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15354/1/artifact/Functional%20on%20EL%208.8/control/dmg_pool_query_ranks.py/job.log/*view*/

dmg pool query timed out after restarting the rank, and the rebuild did not finish either.

liuxuezhao · 2024-10-21T17:07:57Z

> https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15354/1/artifact/Functional%20on%20EL%208.8/control/dmg_pool_query_ranks.py/job.log/*view*/

> dmg pool query timed out after restarting the rank, and the rebuild did not finish either.

I ran several tests on wolf with similar steps and was able to reproduce a rebuild timeout issue.
The problem is that the ALIVE event comes from CRT_EVS_GRPMOD rather than CRT_EVS_SWIM, so CRT_EVS_GRPMOD events cannot be ignored.
I changed handle_event a little; @liw, please check whether it looks good to you.
@wangshilong, please retest with the new version. Thanks.
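To make the filtering problem concrete, here is a small sketch that contrasts the check in the diff hunk above with a relaxed check that also accepts ALIVE events from group modification. The enums are local stand-ins named after the CaRT identifiers mentioned in this thread, with arbitrary values, and both predicates are hypothetical; the actual change to handle_event may be structured differently.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins for the CaRT event source/type identifiers. */
enum sketch_event_source { SKETCH_EVS_UNKNOWN, SKETCH_EVS_GRPMOD, SKETCH_EVS_SWIM };
enum sketch_event_type   { SKETCH_EVT_ALIVE, SKETCH_EVT_DEAD };

/* Check as in the diff hunk: only ALIVE events reported by SWIM pass. */
static bool
sketch_accept_swim_only(enum sketch_event_source src, enum sketch_event_type type)
{
	return src == SKETCH_EVS_SWIM && type == SKETCH_EVT_ALIVE;
}

/*
 * Relaxed check: an ALIVE event passes whether it was reported by SWIM or
 * by a group modification, so a rank restart that only shows up as a
 * GRPMOD ALIVE event is no longer skipped.
 */
static bool
sketch_accept_alive(enum sketch_event_source src, enum sketch_event_type type)
{
	return type == SKETCH_EVT_ALIVE &&
	       (src == SKETCH_EVS_SWIM || src == SKETCH_EVS_GRPMOD);
}
```

Under the first predicate a GRPMOD ALIVE event is dropped, which matches the reproduced timeout: the restart check never runs for the restarted rank.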

liuxuezhao marked this pull request as ready for review October 18, 2024 10:55

liuxuezhao requested review from a team as code owners October 18, 2024 10:55

liuxuezhao removed request for a team October 18, 2024 10:55

liuxuezhao force-pushed the lxz/massive_rb branch from 5e68d53 to 967b84e Compare October 18, 2024 13:30

liuxuezhao requested review from wangshilong and wangdi1 October 18, 2024 13:31

liuxuezhao requested a review from liw October 21, 2024 01:03

liw requested changes Oct 21, 2024

liuxuezhao requested a review from liw October 21, 2024 01:57

liuxuezhao force-pushed the lxz/massive_rb branch from f76cdb6 to 01c8b62 Compare October 21, 2024 01:58

liw reviewed Oct 21, 2024

DAOS-16702 rebuild: address comment

a952a14

Skip-nlt: true Signed-off-by: Xuezhao Liu <xuezhao.liu@intel.com>

liuxuezhao force-pushed the lxz/massive_rb branch from 01c8b62 to a952a14 Compare October 21, 2024 02:05

liuxuezhao requested a review from liw October 21, 2024 02:07

liw previously approved these changes Oct 21, 2024

wangshilong reviewed Oct 21, 2024

DAOS-16702 rebuild: CRT_EVT_ALIVE possibly from CRT_EVS_GRPMOD

0a34490

refine handle_event. Skip-nlt: true Signed-off-by: Xuezhao Liu <xuezhao.liu@intel.com>

liuxuezhao dismissed liw’s stale review via 0a34490 October 21, 2024 17:05
