This repository has been archived by the owner on May 3, 2024. It is now read-only.

CORTX-33899 : Delayed Motr service start makes cortx-rgw container restart #2115

Merged
merged 4 commits into from
Sep 8, 2022

Conversation

nkommuri

@nkommuri nkommuri commented Sep 1, 2022

Issue: During startup, if the data pods are delayed, the server pods
restart with the error "Couldn't init storage provider".

Root Cause: During startup, radosgw calls the m0_client_init() API to
initialize the Motr client. This API initializes several components in
order, one of which is IL_IDX_SERVICE. Initialization of this service
fails with -EIO or -EPROTO because the data pods are not yet up.

Solution: If pods start out of order during startup, the server pod
should not crash. Added a retry mechanism to the initialization of
IL_IDX_SERVICE.

Signed-off-by: Naga Kishore Kommuri <nagakishore.kommuri@seagate.com>
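
The retry described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: `init_with_retry()`, `is_retryable()`, `init_fn`, and `fake_init()` are hypothetical names, and the value of `MAX_CLIENT_INIT_RETRIES` is assumed. Only -EIO and -EPROTO are treated as transient; any other error is returned immediately.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed value; the real limit is not shown on this page. */
#define MAX_CLIENT_INIT_RETRIES 3

/* Errors reported while the data pods are still coming up. */
static bool is_retryable(int rc)
{
	return rc == -EIO || rc == -EPROTO;
}

/* Hypothetical stand-in for one initialization step. */
typedef int (*init_fn)(void *ctx);

/* Call init() again on transient errors, up to the retry limit. */
static int init_with_retry(init_fn init, void *ctx)
{
	int retry_count = 0;
	int rc;

	assert(init != NULL);
	for (;;) {
		rc = init(ctx);
		if (rc == 0 || !is_retryable(rc) ||
		    retry_count >= MAX_CLIENT_INIT_RETRIES)
			return rc;
		retry_count += 1;
	}
}

/* Fake step: fails with -EIO until its counter reaches zero. */
static int fake_init(void *ctx)
{
	int *failures_left = ctx;

	if (*failures_left > 0) {
		(*failures_left)--;
		return -EIO;
	}
	return 0;
}
```

With the fake step set to fail twice, `init_with_retry(fake_init, &failures)` returns 0. If the failures outlast the retry budget, the last -EIO is returned to the caller.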

Problem Statement

  • Problem statement

Design

  • For Bug, Describe the fix here.
  • For Feature, Post the link for design

Coding

Checklist for Author

  • Coding conventions are followed and code is consistent

Testing

Checklist for Author

  • Unit and System Tests are added
  • Test Cases cover Happy Path, Non-Happy Path and Scalability
  • Testing was performed with RPM

Impact Analysis

Checklist for Author/Reviewer/GateKeeper

  • Interface change (if any) are documented
  • Side effects on other features (deployment/upgrade)
  • Dependencies on other component(s)

Review Checklist

Checklist for Author

  • JIRA number/GitHub Issue added to PR
  • PR is self reviewed
  • Jira and state/status is updated and JIRA is updated with PR link
  • Check if the description is clear and explained

Documentation

Checklist for Author

  • Changes done to WIKI / Confluence page / Quick Start Guide

failing on startup

Issue: During startup, if the data pods are delayed, the server pods
restart with the error "Couldn't init storage provider".

Root Cause: During startup, radosgw calls the m0_client_init() API to
initialize the Motr client. This API initializes several components in
order, one of which is IL_IDX_SERVICE. Initialization of this service
fails with -EIO or -EPROTO because the data pods are not yet up.

Solution: If pods start out of order during startup, the server pod
should not crash. Added a retry mechanism to the initialization of
IL_IDX_SERVICE.

Signed-off-by: Naga Kishore Kommuri <nagakishore.kommuri@seagate.com>
*/
if (retry_count < MAX_CLIENT_INIT_RETRIES
    && (rc == -EIO || rc == -EPROTO)) {
	retry_count += 1;

Do we need to add some sleep()/delay() here?

Do we have some kind of pod/service dependencies?
For example, the "server pods" can only be started after the "data pods" are successfully started.

Author

I added a sleep in my draft, but Madhav wants to avoid it. @madhavemuri, please share your feedback.
Yes, the server pod has a dependency on the data pod.

Contributor

During the first deployment, all server pods wait until the data pods' runtime containers have started.
I think this issue is mostly seen when a cluster restart is done.

Author

Yes, the issue is seen with cluster restart only.

Author

Introduced a sleep between retries with exponential backoff. m0_client_init() now retries IDX service initialization for around 4 minutes in total and returns an error after that.
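
A rough sketch of how a doubling delay adds up to about 4 minutes. The constants here are assumptions, not the values in motr/client_init.c: starting at 1 second and doubling over 8 retries gives 1 + 2 + ... + 128 = 255 seconds. All function names are hypothetical, and the sleep is passed in as a callback so the schedule can be checked without actually waiting.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Assumed values, picked so the total wait is roughly 4 minutes. */
#define INITIAL_DELAY_SEC 1u
#define MAX_RETRIES       8u

/* Delay before retry number `attempt` (0-based): 1, 2, 4, ... seconds. */
static unsigned backoff_delay_sec(unsigned attempt)
{
	return INITIAL_DELAY_SEC << attempt;
}

/* Total sleeping time if every one of `retries` retries is used. */
static unsigned backoff_total_sec(unsigned retries)
{
	unsigned total = 0;
	unsigned i;

	for (i = 0; i < retries; i++)
		total += backoff_delay_sec(i);
	return total;
}

typedef int (*init_fn)(void *ctx);
typedef void (*sleep_fn)(unsigned sec);

/* Retry init() on -EIO/-EPROTO, sleeping longer before each retry. */
static int init_with_backoff(init_fn init, void *ctx, sleep_fn do_sleep)
{
	unsigned attempt;
	int rc;

	for (attempt = 0; ; attempt++) {
		rc = init(ctx);
		if (rc == 0 || (rc != -EIO && rc != -EPROTO) ||
		    attempt >= MAX_RETRIES)
			return rc;
		do_sleep(backoff_delay_sec(attempt));
	}
}

/* Dry-run sleep: records the requested delay instead of blocking. */
static unsigned slept_sec;

static void record_sleep(unsigned sec)
{
	slept_sec += sec;
}

/* Fake step that never succeeds, to exercise the whole schedule. */
static int always_eio(void *ctx)
{
	(void)ctx;
	return -EIO;
}
```

In a real caller the callback would wrap a blocking sleep. The recording version only sums the delays, which is enough to confirm that the error is returned after 255 seconds of waiting.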

@rkothiya
Contributor

rkothiya commented Sep 1, 2022

Jenkins CI Result : Motr#1678

Motr Test Summary

Test Result | Count | Info
❌ Failed | 1 | tests listed below

01motr-single-node/00userspace-tests

🏁 Skipped | 32 | tests listed below

01motr-single-node/28sys-kvs
01motr-single-node/35m0singlenode
01motr-single-node/04initscripts
01motr-single-node/37protocol
02motr-single-node/51kem
02motr-single-node/20rpc-session-cancel
02motr-single-node/10pver-assign
02motr-single-node/21fsync-single-node
02motr-single-node/13dgmode-io
02motr-single-node/14poolmach
02motr-single-node/11m0t1fs
02motr-single-node/26motr-user-kernel-tests
02motr-single-node/08spiel
03motr-single-node/06conf
03motr-single-node/36spare-reservation
04motr-single-node/34sns-repair-1n-1f
04motr-single-node/08spiel-sns-repair-quiesce
04motr-single-node/28sys-kvs-kernel
04motr-single-node/11m0t1fs-rconfc-fail
04motr-single-node/08spiel-sns-repair
04motr-single-node/19sns-repair-abort
04motr-single-node/22sns-repair-ios-fail
05motr-single-node/18sns-repair-quiesce
05motr-single-node/12fwait
05motr-single-node/16sns-repair-multi
05motr-single-node/07mount-fail
05motr-single-node/15sns-repair-single
05motr-single-node/23sns-abort-quiesce
05motr-single-node/17sns-repair-concurrent-io
05motr-single-node/07mount
05motr-single-node/07mount-multiple
05motr-single-node/12fsync

✔️ Passed | 44 | tests listed below

01motr-single-node/43m0crate
01motr-single-node/05confgen
01motr-single-node/06hagen
01motr-single-node/52motr-singlenode-sanity
01motr-single-node/01net
01motr-single-node/01kernel-tests
01motr-single-node/03console
01motr-single-node/02rpcping
02motr-single-node/07m0d-fatal
02motr-single-node/67fdmi-plugin-multi-filters
02motr-single-node/53clusterusage-alert
02motr-single-node/41motr-conf-update
03motr-single-node/61sns-repair-motr-1n-1f
03motr-single-node/72spiel-sns-motr-repair-quiesce
03motr-single-node/08spiel-multi-confd
03motr-single-node/69sns-repair-motr-quiesce
03motr-single-node/62sns-repair-motr-mf
03motr-single-node/70sns-failure-after-repair-quiesce
03motr-single-node/63sns-repair-motr-1k-1f
03motr-single-node/60sns-repair-motr-1f
03motr-single-node/66sns-repair-motr-abort-quiesce
03motr-single-node/24motr-dix-repair-lookup-insert-spiel
03motr-single-node/68sns-repair-motr-shutdown
03motr-single-node/64sns-repair-motr-ios-fail
03motr-single-node/71spiel-sns-motr-repair
03motr-single-node/24motr-dix-repair-lookup-insert-m0repair
03motr-single-node/04sss
03motr-single-node/65sns-repair-motr-abort
04motr-single-node/73motr-io-small-disks
04motr-single-node/48motr-raid0-io
04motr-single-node/74motr-di-corruption-detection
04motr-single-node/49motr-rpc-cancel
04motr-single-node/25m0kv
04motr-single-node/44motr-rm-lock-cc-io
04motr-single-node/45motr-rmw
05motr-single-node/23dix-repair-m0repair
05motr-single-node/43motr-sync-replication
05motr-single-node/42motr-utils
05motr-single-node/45motr-sns-repair-N-1
05motr-single-node/40motr-dgmode
05motr-single-node/23dix-repair-quiesce-m0repair
05motr-single-node/23spiel-dix-repair-quiesce
05motr-single-node/44motr-sns-repair
05motr-single-node/23spiel-dix-repair

Total | 77

CppCheck Summary

   Cppcheck: No new warnings found 👍

macro, instead of manual comparison.

Signed-off-by: Naga Kishore Kommuri <nagakishore.kommuri@seagate.com>
Naga Kishore Kommuri added 2 commits September 7, 2022 04:02
manner to retry for at most 4 min

Signed-off-by: Naga Kishore Kommuri <nagakishore.kommuri@seagate.com>
@rkothiya
Contributor

rkothiya commented Sep 7, 2022

Jenkins CI Result : Motr#1730

Motr Test Summary

Test Result | Count | Info
❌ Failed | 0

🏁 Skipped | 31 | tests listed below

01motr-single-node/28sys-kvs
01motr-single-node/35m0singlenode
01motr-single-node/37protocol
02motr-single-node/51kem
02motr-single-node/20rpc-session-cancel
02motr-single-node/10pver-assign
02motr-single-node/21fsync-single-node
02motr-single-node/13dgmode-io
02motr-single-node/14poolmach
02motr-single-node/11m0t1fs
02motr-single-node/26motr-user-kernel-tests
02motr-single-node/08spiel
03motr-single-node/06conf
03motr-single-node/36spare-reservation
04motr-single-node/34sns-repair-1n-1f
04motr-single-node/08spiel-sns-repair-quiesce
04motr-single-node/28sys-kvs-kernel
04motr-single-node/11m0t1fs-rconfc-fail
04motr-single-node/08spiel-sns-repair
04motr-single-node/19sns-repair-abort
04motr-single-node/22sns-repair-ios-fail
05motr-single-node/18sns-repair-quiesce
05motr-single-node/12fwait
05motr-single-node/16sns-repair-multi
05motr-single-node/07mount-fail
05motr-single-node/15sns-repair-single
05motr-single-node/23sns-abort-quiesce
05motr-single-node/17sns-repair-concurrent-io
05motr-single-node/07mount
05motr-single-node/07mount-multiple
05motr-single-node/12fsync

✔️ Passed | 171 | tests listed below

01motr-single-node/00userspace-tests_rpc-packet-encdec-ut
01motr-single-node/00userspace-tests_fdmi-sd-ut
01motr-single-node/00userspace-tests_net-prov-ut
01motr-single-node/00userspace-tests_rpc-link-ut
01motr-single-node/00userspace-tests_fom-stats-ut
01motr-single-node/00userspace-tests_fdmi-pd-ut
01motr-single-node/00userspace-tests_rpc-formation-ut
01motr-single-node/00userspace-tests_balloc-ut
01motr-single-node/00userspace-tests_cob-ut
01motr-single-node/00userspace-tests_net-module
01motr-single-node/00userspace-tests_spiel-ci-ut
01motr-single-node/00userspace-tests_io-nw-xfer-ut
01motr-single-node/00userspace-tests_rpc-session-ut
01motr-single-node/00userspace-tests_spiel-conf-ut
01motr-single-node/00userspace-tests_addb2-storage
01motr-single-node/00userspace-tests_rm-ut
01motr-single-node/00userspace-tests_helpers-ufid-ut
01motr-single-node/00userspace-tests_io-req-ut
01motr-single-node/00userspace-tests_buffer_pool_ut
01motr-single-node/00userspace-tests_be-ut
01motr-single-node/43m0crate
01motr-single-node/00userspace-tests_dix-cm-iter
01motr-single-node/00userspace-tests_fis-ut
01motr-single-node/00userspace-tests_rpc-conn-pool-ut
01motr-single-node/00userspace-tests_idx-ut
01motr-single-node/00userspace-tests_ff2c-ut
01motr-single-node/05confgen
01motr-single-node/00userspace-tests_conf-ut
01motr-single-node/00userspace-tests_cob-foms-ut
01motr-single-node/00userspace-tests_addb2-consumer
01motr-single-node/00userspace-tests_capa-ut
01motr-single-node/00userspace-tests_isc-api-ut
01motr-single-node/00userspace-tests_snscm_storage-ut
01motr-single-node/00userspace-tests_net-bulk-mem
01motr-single-node/00userspace-tests_obj-ut
01motr-single-node/00userspace-tests_confc-ut
01motr-single-node/00userspace-tests_reqh-fop-allow-ut
01motr-single-node/00userspace-tests_mdservice-ut
01motr-single-node/00userspace-tests_reqh-ut
01motr-single-node/00userspace-tests_ios-bufferpool-ut
01motr-single-node/00userspace-tests_io-ut
01motr-single-node/00userspace-tests_rpc-rcv-session-ut
01motr-single-node/00userspace-tests_dtm-dtx-ut
01motr-single-node/00userspace-tests_idx-dix
01motr-single-node/00userspace-tests_io-req-fop-ut
01motr-single-node/00userspace-tests_net-bulk-if
01motr-single-node/00userspace-tests_net-misc-ut
01motr-single-node/00userspace-tests_parity_math_ssse3-ut
01motr-single-node/00userspace-tests_addb2-net
01motr-single-node/00userspace-tests_failure_domains_tree-ut
01motr-single-node/00userspace-tests_fdmi-filterc-ut
01motr-single-node/00userspace-tests_cm-cp-ut
01motr-single-node/00userspace-tests_conf-pvers-ut
01motr-single-node/00userspace-tests_addb2-base
01motr-single-node/00userspace-tests_rpc-at
01motr-single-node/00userspace-tests_dtm0-ut
01motr-single-node/00userspace-tests_parity_math-ut
01motr-single-node/00userspace-tests_net-test
01motr-single-node/00userspace-tests_spiel-ut
01motr-single-node/00userspace-tests_stats-ut
01motr-single-node/00userspace-tests_addb2-sys
01motr-single-node/00userspace-tests_fdmi-filter-eval-ut
01motr-single-node/00userspace-tests_snscm_xform-ut
01motr-single-node/00userspace-tests_ms-fom-ut
01motr-single-node/00userspace-tests_fop-lock-ut
01motr-single-node/00userspace-tests_dtm-nucleus-ut
01motr-single-node/00userspace-tests_xcode_bufvec_fop-ut
01motr-single-node/00userspace-tests_conf-diter-ut
01motr-single-node/00userspace-tests_libconsole-ut
01motr-single-node/00userspace-tests_sm-ut
01motr-single-node/00userspace-tests_di-ut
01motr-single-node/00userspace-tests_rpc-item-source-ut
01motr-single-node/00userspace-tests_addb2-histogram
01motr-single-node/00userspace-tests_isc-service-ut
01motr-single-node/00userspace-tests_sss-ut
01motr-single-node/00userspace-tests_dix-client-ut
01motr-single-node/00userspace-tests_ha-state-ut
01motr-single-node/00userspace-tests_poolmach-ut
01motr-single-node/00userspace-tests_dtm0-log-ut
01motr-single-node/06hagen
01motr-single-node/00userspace-tests_rpc-machine-ut
01motr-single-node/00userspace-tests_rm-rcredits-ut
01motr-single-node/00userspace-tests_snscm_net-ut
01motr-single-node/00userspace-tests_bytecount-ut
01motr-single-node/00userspace-tests_rconfc-ut
01motr-single-node/00userspace-tests_stob-ut
01motr-single-node/00userspace-tests_reqh-service-ut
01motr-single-node/00userspace-tests_rpc-lib-ut
01motr-single-node/00userspace-tests_failure_domains-ut
01motr-single-node/00userspace-tests_sync-ut
01motr-single-node/00userspace-tests_pi_ut
01motr-single-node/00userspace-tests_net-lnet
01motr-single-node/00userspace-tests_udb-ut
01motr-single-node/00userspace-tests_layout-ut
01motr-single-node/04initscripts
01motr-single-node/00userspace-tests_rpc-connection-ut
01motr-single-node/00userspace-tests_io-pargrp-ut
01motr-single-node/00userspace-tests_rpc-item-ut
01motr-single-node/00userspace-tests_fdmi-fol-fini-ut
01motr-single-node/00userspace-tests_confstr-ut
01motr-single-node/00userspace-tests_module-ut
01motr-single-node/00userspace-tests_cm-ut
01motr-single-node/00userspace-tests_sns-file-lock-ut
01motr-single-node/00userspace-tests_rm-rwlock-ut
01motr-single-node/00userspace-tests_conf-validation-ut
01motr-single-node/00userspace-tests_m0d-ut
01motr-single-node/00userspace-tests_dtm-transmit-ut
01motr-single-node/00userspace-tests_idx-dix-mt
01motr-single-node/00userspace-tests_xcode-ut
01motr-single-node/00userspace-tests_cas-client
01motr-single-node/00userspace-tests_bulk-server-ut
01motr-single-node/00userspace-tests_ha-ut
01motr-single-node/00userspace-tests_conf-walk-ut
01motr-single-node/00userspace-tests_client-ut
01motr-single-node/00userspace-tests_conf-load-ut
01motr-single-node/00userspace-tests_sns-cm-repair-ut
01motr-single-node/00userspace-tests_libm0-ut
01motr-single-node/00userspace-tests_btree-ut
01motr-single-node/00userspace-tests_reqh-service-ctx-ut
01motr-single-node/00userspace-tests_fom-timedwait-ut
01motr-single-node/52motr-singlenode-sanity
01motr-single-node/00userspace-tests_fdmi-fol-ut
01motr-single-node/00userspace-tests_conf-glob-ut
01motr-single-node/00userspace-tests_storage-dev-ut
01motr-single-node/00userspace-tests_cas-service
01motr-single-node/01net
01motr-single-node/00userspace-tests_bulk-client-ut
01motr-single-node/00userspace-tests_layout-access-plan-ut
01motr-single-node/01kernel-tests
01motr-single-node/03console
01motr-single-node/00userspace-tests_dtm0-clk-src-ut
01motr-single-node/02rpcping
01motr-single-node/00userspace-tests_fol-ut
01motr-single-node/00userspace-tests_fit-ut
01motr-single-node/00userspace-tests_libfab-ut
02motr-single-node/07m0d-fatal
02motr-single-node/67fdmi-plugin-multi-filters
02motr-single-node/53clusterusage-alert
02motr-single-node/41motr-conf-update
03motr-single-node/61sns-repair-motr-1n-1f
03motr-single-node/72spiel-sns-motr-repair-quiesce
03motr-single-node/08spiel-multi-confd
03motr-single-node/69sns-repair-motr-quiesce
03motr-single-node/62sns-repair-motr-mf
03motr-single-node/70sns-failure-after-repair-quiesce
03motr-single-node/63sns-repair-motr-1k-1f
03motr-single-node/60sns-repair-motr-1f
03motr-single-node/66sns-repair-motr-abort-quiesce
03motr-single-node/24motr-dix-repair-lookup-insert-spiel
03motr-single-node/68sns-repair-motr-shutdown
03motr-single-node/64sns-repair-motr-ios-fail
03motr-single-node/71spiel-sns-motr-repair
03motr-single-node/24motr-dix-repair-lookup-insert-m0repair
03motr-single-node/04sss
03motr-single-node/65sns-repair-motr-abort
04motr-single-node/73motr-io-small-disks
04motr-single-node/48motr-raid0-io
04motr-single-node/74motr-di-corruption-detection
04motr-single-node/49motr-rpc-cancel
04motr-single-node/25m0kv
04motr-single-node/44motr-rm-lock-cc-io
04motr-single-node/45motr-rmw
05motr-single-node/23dix-repair-m0repair
05motr-single-node/43motr-sync-replication
05motr-single-node/42motr-utils
05motr-single-node/45motr-sns-repair-N-1
05motr-single-node/40motr-dgmode
05motr-single-node/23dix-repair-quiesce-m0repair
05motr-single-node/23spiel-dix-repair-quiesce
05motr-single-node/44motr-sns-repair
05motr-single-node/23spiel-dix-repair

Total | 202

CppCheck Summary

   Cppcheck: No new warnings found 👍

@madhavemuri madhavemuri merged commit d8c8491 into Seagate:main Sep 8, 2022
@nkommuri nkommuri deleted the cortx-33899 branch September 14, 2022 08:46
6 participants