This repository has been archived by the owner on May 3, 2024. It is now read-only.

CORTX-33899 : Delayed Motr service start makes cortx-rgw container restart #2115

Merged
merged 4 commits into from
Sep 8, 2022
Changes from 2 commits
41 changes: 38 additions & 3 deletions motr/client_init.c
@@ -190,7 +190,8 @@ struct m0_sm_state_descr initlift_phases[] = {
[IL_IDX_SERVICE] = {
.sd_name = "init/fini-resource-manager",
.sd_allowed = M0_BITS(IL_ROOT_FID,
IL_LAYOUT_DB),
IL_LAYOUT_DB,
IL_IDX_SERVICE),
.sd_in = initlift_idx_service,
},
[IL_ROOT_FID] = {
@@ -247,6 +248,9 @@ struct m0_sm_trans_descr initlift_trans[] = {
{"initialising-index-service",
IL_LAYOUT_DB,
IL_IDX_SERVICE},
{"retry-initialising-index-service",
IL_IDX_SERVICE,
IL_IDX_SERVICE},
{"retrieving-root-fid", IL_IDX_SERVICE,
IL_ROOT_FID},
{"initialising-addb2", IL_ROOT_FID,
@@ -356,6 +360,18 @@ static int initlift_get_next_floor(struct m0_client *m0c)
return M0_RC(rc);
}

/**
* Helper function to get the value of the current floor.
*
* @param m0c the client instance we are working with.
* @return the current state/floor.
*/
static int initlift_get_cur_floor(struct m0_client *m0c)
{
M0_PRE(m0c != NULL);
return M0_RC(m0c->m0c_initlift_sm.sm_state);
}

/**
* Helper function to move the initlift onto its next state in the
* direction of travel.
@@ -1222,12 +1238,14 @@ static int initlift_layouts(struct m0_sm *mach)
return M0_RC(initlift_get_next_floor(m0c));
}

#define MAX_CLIENT_INIT_RETRIES 1000
static int initlift_idx_service(struct m0_sm *mach)
{
int rc = 0;
struct m0_client *m0c;
struct m0_idx_service *service;
struct m0_idx_service_ctx *ctx;
static int retry_count = 0;

M0_ENTRY();
M0_PRE(mach != NULL);
@@ -1250,8 +1268,25 @@ static int initlift_idx_service(struct m0_sm *mach)
rc = service->is_svc_ops->iso_init((void *)ctx);
m0_sm_group_lock(&m0c->m0c_sm_group);

if (rc != 0)
initlift_fail(rc, m0c);
if (rc != 0) {
/*
* Added retry logic to handle out of
* order startup of data and server
* PODs. Ref: Jira ID Cortx-33899
*/
if (retry_count < MAX_CLIENT_INIT_RETRIES
&& M0_IN(rc, (-EIO, -EPROTO))) {
retry_count += 1;

Do we need to add some sleep()/delay() here?

Do we have some kind of pod/service dependencies?
For example, the "server pods" can only be started after the "data pods" are successfully started.

Author

I added a sleep in my draft, but Madhav wants to avoid it. @madhavemuri, please share your feedback.
Yes, the server pod has a dependency on the data pod.

Contributor

During the first deployment, all the server pods wait until the data pods' runtime containers are started.
I think this issue is seen mostly when a cluster restart is done.

Author

Yes, the issue is seen with cluster restart only.

Author

Introduced a sleep between retries with exponential backoff. Now m0_client_init() will retry IDX service initialization for around 4 minutes in total and return an error after that.
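
For readers following the thread, here is a minimal, standalone sketch of a retry loop with exponential backoff of the kind described above. It is not the code that was merged: the names (`retry_with_backoff`, `fake_init`, `fake_sleep`, `struct fake_service`) and the constants are hypothetical, chosen so that the total wait is 255 seconds, roughly the four minutes mentioned. Only the -EIO/-EPROTO check is taken from the diff. The sleep is passed in as a callback so the loop can be exercised without actually waiting.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limits for illustration; the merged patch may use others. */
enum {
	RETRY_MAX_ATTEMPTS  = 8,    /* retries allowed after the first attempt */
	RETRY_BASE_DELAY_MS = 1000, /* delay before the first retry */
};

/* Hypothetical callback types: the operation to retry and a sleep hook. */
typedef int (*retry_op_t)(void *arg);
typedef void (*retry_sleep_t)(uint64_t ms, void *arg);

/* Retry only transient errors, mirroring the -EIO/-EPROTO check in the diff. */
static int retry_is_transient(int rc)
{
	return rc == -EIO || rc == -EPROTO;
}

/* Delay before retry number `attempt` (0-based): base * 2^attempt. */
static uint64_t retry_delay_ms(unsigned attempt)
{
	return (uint64_t)RETRY_BASE_DELAY_MS << attempt;
}

/*
 * Call op() once, then retry while it keeps failing with a transient
 * error, doubling the delay each time. Returns the last result of op().
 */
int retry_with_backoff(retry_op_t op, retry_sleep_t sleep_fn, void *arg)
{
	unsigned attempt;
	int      rc;

	assert(op != NULL && sleep_fn != NULL);
	rc = op(arg);
	for (attempt = 0;
	     attempt < RETRY_MAX_ATTEMPTS && retry_is_transient(rc);
	     attempt++) {
		sleep_fn(retry_delay_ms(attempt), arg);
		rc = op(arg);
	}
	return rc;
}

/* Stand-in for a service that fails a fixed number of times, then starts. */
struct fake_service {
	int      failures_left;
	uint64_t slept_ms; /* total time the caller "slept" */
};

int fake_init(void *arg)
{
	struct fake_service *svc = arg;

	if (svc->failures_left > 0) {
		svc->failures_left--;
		return -EIO;
	}
	return 0;
}

/* Records the requested delay instead of sleeping. */
void fake_sleep(uint64_t ms, void *arg)
{
	struct fake_service *svc = arg;

	svc->slept_ms += ms;
}
```

With a service that fails three times, the loop waits 1 s, 2 s and 4 s before the fourth attempt succeeds. Keeping the attempt counter local to the call, rather than in a `static` variable, also keeps the count from being shared between client instances.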

M0_LOG(M0_ERROR, "client init \
failed with %d. Retrying.", rc);
return M0_RC(initlift_get_cur_floor(m0c));
} else {
retry_count = 0;
initlift_fail(rc, m0c);
}
} else {
retry_count = 0;
}
madhavemuri marked this conversation as resolved.
} else {
service = ctx->isc_service;
M0_ASSERT(service != NULL &&