This repository has been archived by the owner on Jun 23, 2022. It is now read-only.

fix(backup_policy): do not try again when got fs errors #807

Merged: 5 commits, Apr 8, 2021

@zhangyifan27 zhangyifan27 commented Apr 1, 2021

Prior to this patch, if an error occurred during a backup, the meta server would retry indefinitely: the backup never stopped, and the policy could not be disabled until a backup succeeded.
This PR marks a policy as failed when a file system (fs) error occurs during backup. Once the policy is marked failed, it can be disabled.
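
The behaviour change described above can be sketched roughly as follows. This is a minimal illustration, not the code in meta_backup_service.cpp: the types `backup_error` and `policy_state`, the functions `should_retry`, `on_backup_reply_sketch` and `can_disable`, and the `timeout` error code are all hypothetical names introduced here. Only the idea (stop retrying on an fs error such as ERR_LOCAL_APP_FAILURE, and allow disabling once the backup is marked failed) comes from this PR.

```cpp
// Hypothetical error codes for illustration. Only the "local app failure"
// case corresponds to an error seen in this PR (ERR_LOCAL_APP_FAILURE).
enum class backup_error { ok, local_app_failure, timeout };

// Hypothetical per-policy bookkeeping.
struct policy_state {
    bool backup_failed = false; // set once a non-retryable error is seen
    int retries_scheduled = 0;  // how many retries have been requested
};

// Assumption: fs-level failures are treated as non-retryable,
// everything else that is not success is retried.
inline bool should_retry(backup_error err)
{
    return err != backup_error::ok && err != backup_error::local_app_failure;
}

// Handle one partition's backup response.
inline void on_backup_reply_sketch(policy_state &state, backup_error err)
{
    if (err == backup_error::ok) {
        return; // nothing to do for a successful partition
    }
    if (!should_retry(err)) {
        state.backup_failed = true; // mark failed, do not try again
        return;
    }
    ++state.retries_scheduled; // transient error: schedule another attempt
}

// A policy can be disabled when no backup is running, or when the
// running backup has already been marked failed.
inline bool can_disable(const policy_state &state, bool backup_in_progress)
{
    return !backup_in_progress || state.backup_failed;
}
```

With this shape, a transient error keeps the old retry behaviour, while an fs error ends the retry loop and makes the policy eligible for disabling.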

I tested this on a real cluster, setting fds_burst_size to 500 bytes to simulate FDS errors:

-_write_token_bucket.reset(new folly::TokenBucket(FLAGS_fds_write_limit_rate << 20, std::numeric_limits<double>::max()));
+_write_token_bucket.reset(new folly::TokenBucket(FLAGS_fds_write_limit_rate << 20,500));
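
To see why a 500-byte burst size can force write failures, here is a simplified token bucket. It is a stand-in written for this example, not folly::TokenBucket, and `simple_token_bucket` with its methods is a hypothetical name. It assumes an all-or-nothing consume and that the write path asks for more than 500 tokens at once. The available tokens never exceed the burst size, so such a request can never succeed no matter how long the bucket refills.

```cpp
#include <algorithm>

// Simplified token bucket for illustration only (NOT folly::TokenBucket).
class simple_token_bucket
{
public:
    // rate: tokens added per second; burst: maximum tokens the bucket can hold.
    simple_token_bucket(double rate, double burst)
        : _rate(rate), _burst(burst), _tokens(burst)
    {
    }

    // Refill for the given elapsed time, capped at the burst size.
    void advance(double seconds)
    {
        _tokens = std::min(_burst, _tokens + _rate * seconds);
    }

    // All-or-nothing: succeed only if n tokens are currently available.
    bool consume(double n)
    {
        if (n > _tokens) {
            return false;
        }
        _tokens -= n;
        return true;
    }

private:
    double _rate;
    double _burst;
    double _tokens;
};
```

For example, with a burst of 500, a request for 4096 tokens is rejected even after an hour of refilling, while a request for exactly 500 is accepted.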

test steps:

  1. add a backup policy
  2. query the policy
>>> query_backup_policy -p test_fds
policy_info:
    name                  : test_fds
    backup_provider_type  : fds_wq  
    backup_interval       : 86400s  
    app_ids               : {1}     
    start_time            : 17:25   
    status                : enabled 
    backup_history_count  : 2       

backup_infos:
[1]
    id          : 1617701020841      
    start_time  : 2021-04-06 17:23:40
    end_time    : -                  
    app_ids     : {1}                

query backup policy succeed

The backup had started but did not finish, so I checked the meta server logs:

D2021-04-06 18:07:30.939 (1617703650939435182 183307)   meta.default3.0201000500000010: meta_backup_service.cpp:493:on_backup_reply(): test_fds@1617701020841: receive backup response for partition 1.6 from server 10.132.15.13:35801.
E2021-04-06 18:07:30.939 (1617703650939451019 183307)   meta.default3.0201000500000010: meta_backup_service.cpp:542:on_backup_reply(): test_fds@1617701020841: backup got error ERR_LOCAL_APP_FAILURE for partition 1.6 from 10.132.15.13:35801, don't try again when got this error.
  3. try to disable the policy
>>> disable_backup_policy -p test_fds
disable policy result: ERR_OK

The policy could be disabled because the backup had been marked as failed.
Querying the policy again shows that it has been disabled:

>>> query_backup_policy -p test_fds
policy_info:
    name                  : test_fds
    backup_provider_type  : fds_wq  
    backup_interval       : 86400s  
    app_ids               : {1}     
    start_time            : 17:25   
    status                : disabled
    backup_history_count  : 2       

backup_infos:
[1]
    id          : 1617701020841      
    start_time  : 2021-04-06 17:23:40
    end_time    : -                  
    app_ids     : {1}                

query backup policy succeed
  4. try to enable the policy
>>> enable_backup_policy -p test_fds
enable policy result: ERR_OK

The policy could be enabled again, but the backup will still fail unless the burst size is changed.
