Availability: Adds logic to avoid bad replica during cache refresh #3127

Merged (3 commits) on Apr 1, 2022

@j82w (Contributor) commented Apr 1, 2022

Pull Request Template

Description

Current design:
If the SDK gets a 410 or another failure signaling that a replica has moved to a different machine, an address cache refresh is triggered. The cache keeps returning the stale addresses until the new address list arrives, because the 3 other replicas should still be valid and can complete requests. This still leaves a 25% chance that a new request goes to the bad replica, where the connection can take multiple seconds to time out.

The solution:
Each address in the GatewayAddressCache has an unhealthy flag. When a cache refresh is requested, the bad replica is marked as unhealthy. When the SDK picks a random replica, it always moves the unhealthy replicas to the end of the list. When the results from the gateway return, the health state is reset, so the replica is only avoided during the call that gets the new addresses from the gateway.

The "unhealthy" state is reset when a Gateway refresh response comes back (whether the addresses changed or not) or after 1 minute, whichever comes first. So the throughput SLA regression risk (temporarily using only 3 out of 4 replicas) applies for at most 1 minute.
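
The behavior described above can be sketched roughly as follows. This is a minimal illustration in Python, not the SDK's actual implementation: `ReplicaHealthTracker`, its method names, and `UNHEALTHY_TTL_SECONDS` are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
import random

# Assumption: models the "after 1 minute" reset from the description.
UNHEALTHY_TTL_SECONDS = 60.0


class ReplicaHealthTracker:
    """Hypothetical helper; not the SDK's GatewayAddressCache."""

    def __init__(self, addresses):
        self.addresses = list(addresses)
        self._unhealthy_since = {}  # address -> time it was marked unhealthy

    def mark_unhealthy(self, address, now):
        # Called when a refresh is requested because this replica failed.
        self._unhealthy_since[address] = now

    def is_unhealthy(self, address, now):
        marked_at = self._unhealthy_since.get(address)
        return marked_at is not None and now - marked_at < UNHEALTHY_TTL_SECONDS

    def on_refresh_completed(self, new_addresses):
        # Health state is reset whether or not the addresses changed.
        self.addresses = list(new_addresses)
        self._unhealthy_since.clear()

    def ordered_replicas(self, now, rng=random):
        # Shuffle first, then stable-sort so healthy replicas keep their
        # random order and unhealthy ones move to the end of the list.
        shuffled = list(self.addresses)
        rng.shuffle(shuffled)
        return sorted(shuffled, key=lambda a: self.is_unhealthy(a, now))
```

With replicas `[1, 2, 3, 4]` and replica 2 marked unhealthy, a caller that takes the first entry of `ordered_replicas` never picks replica 2 until the refresh completes or the one-minute window expires.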

// Cache refresh design

sequenceDiagram
    participant Request1
    participant Request2
    Request1->>+GatewayAddressCache: Get partition key range 0
    GatewayAddressCache->>-Request1: Returns replicas[1,2,3,4]
    Request1->>+Replica2: Get item
    Replica2->>-Request1: Gone (410), indicating the replica moved
    Request1->>GatewayAddressCache: Start background refresh of addresses
    Request1->>+Replica1: Get item
    Replica1->>-Request1: Returns item
    GatewayAddressCache->>+Cosmos Gateway: Get addresses for range 0 with ForceRefresh
    Request2->>+GatewayAddressCache: Get partition key range 0 
    GatewayAddressCache->>-Request2: Stale [1,2,3,4]
    Request2->>+Replica3: Replica2 has a 0% chance of being picked since the refresh is still in progress. It used to be 25%.
    Replica3->>-Request2: Returns item
    Cosmos Gateway->>-GatewayAddressCache: Return addresses [1,5,3,4]

Type of change

Please delete options that are not relevant.

  • [] Bug fix (non-breaking change which fixes an issue)
  • [] New feature (non-breaking change which adds functionality)
  • [] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [] This change requires a documentation update

Closing issues

To automatically close an issue: closes #IssueNumber

@FabianMeiswinkel (Member) left a comment

LGTM - Thanks

@ealsur (Member) left a comment

LGTM, just a nit on the diagram. Request2 has a 410 response also on Replica3, should it be a 200/201?

@j82w j82w enabled auto-merge (squash) April 1, 2022 13:56
@j82w j82w merged commit 0aa0456 into master Apr 1, 2022
@j82w j82w deleted the users/jawilley/ha/avoidUnhealthyReplica branch April 1, 2022 14:08