CORTX-33899 : Delayed Motr service start makes cortx-rgw container restart #2115
failing on startup
Issue: During startup, if data PODs are delayed, then server PODs are restarting with error "Couldn't init storage provider"
Root Cause: During startup, radosgw calls the m0_client_init() API to initialize the motr client. This API initializes different components in order, one of which is IL_IDX_SERVICE. Initialization of this service is failing with -EIO or -EPROTO, as data PODs are not yet up.
Solution: During startup, if PODs are started out of order, then the server POD should not crash. Added a retry mechanism in initialization of IL_IDX_SERVICE.
Signed-off-by: Naga Kishore Kommuri <nagakishore.kommuri@seagate.com>
motr/client_init.c
	 */
	if (retry_count < MAX_CLIENT_INIT_RETRIES
	    && (rc == -EIO || rc == -EPROTO)) {
		retry_count += 1;
Do we need to add some sleep()/delay() here?
Do we have some kind of pod/service dependencies?
For example, the "server pods" can only be started after the "data pods" are successfully started.
I added a sleep in my draft, but Madhav wants to avoid it. @madhavemuri please share your feedback.
Yes, the server pod has a dependency on the data pod.
During the first deployment, all the server pods wait until the data pods' runtime containers are started.
I think this issue is seen mostly when a cluster restart is done.
Yes, the issue is seen with cluster restart only.
Introduced a sleep between retries with exponential backoff. Now m0_client_init() will retry IDX service initialization for around 4 min in total, and return an error after that.
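The retry behaviour described in this comment can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `init_with_backoff()`, `is_transient()`, `struct fake_service`, and the constants `MAX_RETRIES` and `INITIAL_DELAY_S` are hypothetical names and values. The PR does not state its actual schedule; 8 retries starting at 1 s are assumed here because the delays 1 + 2 + ... + 128 sum to 255 s, which is roughly the 4 min mentioned. The sleep is passed in as a callback so the logic can be exercised without actually waiting.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical values; the PR does not state its actual constants. */
enum {
	MAX_RETRIES     = 8, /* retries after the first attempt */
	INITIAL_DELAY_S = 1  /* first delay, doubled after each retry */
};

typedef int  (*init_fn_t)(void *ctx);
typedef void (*sleep_fn_t)(unsigned secs);

/* Only errors that suggest "peer not up yet" are worth retrying. */
static int is_transient(int rc)
{
	return rc == -EIO || rc == -EPROTO;
}

/*
 * Calls init(ctx) and, while it fails with a transient error, sleeps
 * with exponentially growing delays and tries again. Returns the last
 * return code; *total_wait (if non-NULL) receives the seconds slept.
 */
static int init_with_backoff(init_fn_t init, void *ctx,
			     sleep_fn_t do_sleep, unsigned *total_wait)
{
	unsigned delay  = INITIAL_DELAY_S;
	unsigned waited = 0;
	int      retry;
	int      rc;

	assert(init != NULL && do_sleep != NULL);

	rc = init(ctx);
	for (retry = 0; retry < MAX_RETRIES && is_transient(rc); retry++) {
		do_sleep(delay);
		waited += delay;
		delay  *= 2;
		rc = init(ctx);
	}
	if (total_wait != NULL)
		*total_wait = waited;
	return rc;
}

/* Sleep callback for real use. */
static void real_sleep(unsigned secs)
{
	sleep(secs);
}

/* No-op sleep callback, so the logic can be tried out instantly. */
static void no_sleep(unsigned secs)
{
	(void)secs;
}

/* Stand-in for a service that fails a fixed number of times. */
struct fake_service {
	int failures_left;
	int fail_rc;
};

static int fake_init(void *ctx)
{
	struct fake_service *svc = ctx;

	if (svc->failures_left > 0) {
		svc->failures_left--;
		return svc->fail_rc;
	}
	return 0;
}
```

With these assumed values, a service that comes up after three failed attempts costs 1 + 2 + 4 = 7 s of waiting, while a non-transient error such as -EINVAL is returned immediately without any sleep. In real use `real_sleep` would be passed as the callback; `no_sleep` and `fake_init` exist only to try the loop out.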
macro, instead of manual comparison. Signed-off-by: Naga Kishore Kommuri <nagakishore.kommuri@seagate.com>
manner to retry for at most 4 min Signed-off-by: Naga Kishore Kommuri <nagakishore.kommuri@seagate.com>
Jenkins CI Result: Motr#1730
Motr Test Summary
CppCheck Summary
Cppcheck: No new warnings found 👍
Issue: During startup, if data PODs are delayed, then server PODs are
restarting with error "Couldn't init storage provider"
Root Cause: During startup, radosgw calls the m0_client_init() API to
initialize the motr client. This API initializes different components in
order, one of which is IL_IDX_SERVICE. Initialization of this service is
failing with -EIO or -EPROTO, as data PODs are not yet up.
Solution: During startup, if PODs are started out of order, then the
server POD should not crash. Added a retry mechanism in initialization
of IL_IDX_SERVICE.
Signed-off-by: Naga Kishore Kommuri <nagakishore.kommuri@seagate.com>