
Retry Logic Overview (WIP)

Jim Borden edited this page Sep 13, 2016 · 5 revisions

This document is the authoritative reference on retrying requests over a network, which applies in several places during the replication process. It covers what should happen in the event of both transient and permanent errors. A transient error is one that is expected to pass within a relatively short period of time (such as a connection timeout or a 503). A permanent error is the opposite (such as a 401 or 404) and is not likely to recover without intervention. This document does not cover other replication logic such as "going offline."

Flow

The replication retry flow is as follows:

  1. Replication attempts to start
  2. Replication attempts to continue
  3. A connection error occurs
    • 3a The connection error indicates a lack of connectivity, go to 6
    • 3b The connection error is transient, go to 4
    • 3c The connection error is permanent, stop the replication
  4. Retry according to the applied retry strategy (not customizable on all platforms)
    • 4a The retry strategy fails, go to 5
    • 4b The retry strategy succeeds, go to 2
  5. At this point the request in question has failed to send and/or get a response
    • 5a The replication is continuous. Switch to idle, set last error, enter long delay (~60 sec) and go to 1
    • 5b The replication is non-continuous. Set last error, give up and stop the replication
  6. The endpoint is not reachable
    • 6a The device has no network connection. Switch to offline, set last error, and wait for network connection change.
    • 6b The device has a network connection. Switch to offline, set last error, enter long delay (~60 sec) and go to 1
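
The numbered flow above can be sketched as a single decision function. This is only an illustration: the names ErrorKind, NextAction, RetryFlow, Decide, and LongDelay are hypothetical and do not come from any platform's implementation, and the 60-second value simply mirrors the "~60 sec" figure in steps 5a and 6b.

```csharp
using System;

// Hypothetical types for illustration; none of these names come from the real code base.
public enum ErrorKind { Connectivity, Transient, Permanent }

public enum NextAction
{
    RetryRequest,          // step 4: hand the request back to the retry strategy
    IdleThenRestart,       // step 5a: idle, set last error, long delay, go to 1
    Stop,                  // steps 3c and 5b: set last error and stop
    OfflineWaitForNetwork, // step 6a: offline until the network changes
    OfflineThenRestart     // step 6b: offline, long delay, go to 1
}

public static class RetryFlow
{
    // Assumed value for the "long delay" in steps 5a and 6b.
    public static readonly TimeSpan LongDelay = TimeSpan.FromSeconds(60);

    public static NextAction Decide(ErrorKind kind, bool continuous,
                                    bool canRetry, bool hasNetwork)
    {
        switch (kind)
        {
            case ErrorKind.Connectivity: // 3a -> 6
                return hasNetwork
                    ? NextAction.OfflineThenRestart
                    : NextAction.OfflineWaitForNetwork;
            case ErrorKind.Permanent:    // 3c
                return NextAction.Stop;
            default:                     // 3b -> 4
                if (canRetry)
                {
                    return NextAction.RetryRequest;
                }
                // 4a -> 5: the retry strategy is exhausted
                return continuous ? NextAction.IdleThenRestart : NextAction.Stop;
        }
    }
}
```

For example, a transient error on a continuous replication whose strategy is exhausted yields IdleThenRestart, while the same error on a non-continuous replication yields Stop.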

Examples

Start non-continuous replication
Initial connection reports 401 (Unauthorized)
Replication stops, with callbacks for both the error and the stopped status (two notifications)

Start non-continuous replication
Halfway through, a 503 error is encountered (Service Unavailable)
Error is transient, so retry
Retry succeeds, replication continues

Start non-continuous replication
Halfway through, a connection timeout occurs
Error is transient, so retry
Retry fails, so replication stops

Start a continuous replication
A 404 error is encountered on the endpoint
Permanent error, so stop the replication

Pseudocode Algorithm

void ErrorEncountered(Exception e)
{
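    if(IsTransient(e)) {
       // 3b -> 4: retry while the strategy still allows it
       if(_strategy.CanRetry) {
           _strategy.Retry();
           return;
       }
    }

    // 3a, 3c, or 4a: nothing is left to retry at the request level
    HandleErrorEndgame(e);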
}

void HandleErrorEndgame(Exception e)
{
    if(IsContinuous && IsTransient(e)) {
        // 4a -> 5a
        EnterRetryLoop();
        return;
    }

    if(IsOfflineError(e)) {
        // 3a -> 6b
        EnterOfflineLoop();
    } else {
        // 3c, or 4a -> 5b for a non-continuous replication
        StopReplication();
    }
}
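
The pseudocode relies on a `_strategy` object without saying how it decides whether another retry is allowed. The sketch below is one possible shape, assuming exponential backoff with a fixed retry budget. That choice is an assumption for illustration, not the strategy any platform actually ships, and ExponentialBackoffStrategy, NextDelay, and Reset are hypothetical names. Only CanRetry matches a name used in the pseudocode.

```csharp
using System;

// Hypothetical retry strategy: exponential backoff with a capped number of attempts.
public sealed class ExponentialBackoffStrategy
{
    private readonly int _maxRetries;
    private readonly TimeSpan _initialDelay;
    private int _attempts;

    public ExponentialBackoffStrategy(int maxRetries, TimeSpan initialDelay)
    {
        if (maxRetries < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(maxRetries));
        }

        _maxRetries = maxRetries;
        _initialDelay = initialDelay;
    }

    // True while the retry budget has not been used up (checked at step 4).
    public bool CanRetry => _attempts < _maxRetries;

    // Records one attempt and returns how long to wait before resending.
    // The delay doubles on every call: initial, 2x initial, 4x initial, ...
    public TimeSpan NextDelay()
    {
        if (!CanRetry)
        {
            throw new InvalidOperationException("Retry budget exhausted");
        }

        double factor = Math.Pow(2, _attempts);
        _attempts++;
        return TimeSpan.FromMilliseconds(_initialDelay.TotalMilliseconds * factor);
    }

    // Call after a successful request so the next failure starts from scratch.
    public void Reset()
    {
        _attempts = 0;
    }
}
```

With a budget of three retries and an initial delay of one second, NextDelay returns one, two, and four seconds, after which CanRetry is false and the caller would fall through to HandleErrorEndgame.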

Error Judgement

Determining whether an error is connectivity related, permanent, or transient is a big task. This section will accumulate the rules used so far (using .NET for reference).

Exceptions

  • IOException, TimeoutException, and TaskCanceledException (which the library throws during async timeouts on HTTP requests) are all considered transient and are not analyzed further.
  • For SocketException, the socket error code is analyzed:
    • AccessDenied = 10013 = Permanent,
    • AddressAlreadyInUse = 10048 = Permanent,
    • AddressFamilyNotSupported = 10047 = Permanent,
    • AddressNotAvailable = 10049 = Permanent,
    • AlreadyInProgress = 10037 = Transient,
    • ConnectionAborted = 10053 = Transient,
    • ConnectionRefused = 10061 = Connectivity,
    • ConnectionReset = 10054 = Transient,
    • DestinationAddressRequired = 10039 = Permanent,
    • Disconnecting = 10101 = Permanent,
    • Fault = 10014 = Permanent,
    • HostDown = 10064 = Connectivity,
    • HostNotFound = 11001 = Permanent,
    • HostUnreachable = 10065 = Permanent,
    • InProgress = 10036 = Transient,
    • Interrupted = 10004 = Transient,
    • InvalidArgument = 10022 = Permanent,
    • IOPending = 997 = Transient,
    • IsConnected = 10056 = Transient,
    • MessageSize = 10040 = Permanent,
    • NetworkDown = 10050 = Connectivity,
    • NetworkReset = 10052 = Transient,
    • NetworkUnreachable = 10051 = Permanent,
    • NoBufferSpaceAvailable = 10055 = Permanent,
    • NoData = 11004 = Permanent,
    • NoRecovery = 11003 = Permanent,
    • NotConnected = 10057 = Connectivity,
    • NotInitialized = 10093 = Permanent,
    • NotSocket = 10038 = Permanent,
    • OperationAborted = 995 = Transient,
    • OperationNotSupported = 10045 = Permanent,
    • ProcessLimit = 10067 = Transient,
    • ProtocolFamilyNotSupported = 10046 = Permanent,
    • ProtocolNotSupported = 10043 = Permanent,
    • ProtocolOption = 10042 = Permanent,
    • ProtocolType = 10041 = Permanent,
    • Shutdown = 10058 = Transient,
    • SocketError = -1 = Permanent,
    • SocketNotSupported = 10044 = Permanent,
    • SystemNotReady = 10091 = Transient,
    • TimedOut = 10060 = Transient,
    • TooManyOpenSockets = 10024 = Transient,
    • TryAgain = 11002 = Transient,
    • TypeNotFound = 10109 = Permanent,
    • VersionNotSupported = 10092 = Permanent,
    • WouldBlock = 10035 = Transient
  • For WebException, the type of failure is analyzed first
    • ConnectFailure, Timeout, ConnectionClosed, and RequestCanceled are transient
    • Other failures are considered permanent unless they carry an HTTP status code
      • HTTP 408, 500, 502, 503, and 504 are transient
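
The rules above can be collected into one classifier. The following is a minimal sketch rather than the actual implementation: ErrorCategory, ErrorJudge, and the method names are hypothetical. Two behaviors are assumptions the rules do not state explicitly: exception types not listed above are treated as permanent, and so are HTTP status codes outside the transient set.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

public enum ErrorCategory { Connectivity, Transient, Permanent }

public static class ErrorJudge
{
    private static readonly HashSet<SocketError> ConnectivityCodes = new HashSet<SocketError>
    {
        SocketError.ConnectionRefused, SocketError.HostDown,
        SocketError.NetworkDown, SocketError.NotConnected
    };

    private static readonly HashSet<SocketError> TransientCodes = new HashSet<SocketError>
    {
        SocketError.AlreadyInProgress, SocketError.ConnectionAborted,
        SocketError.ConnectionReset, SocketError.InProgress, SocketError.Interrupted,
        SocketError.IOPending, SocketError.IsConnected, SocketError.NetworkReset,
        SocketError.OperationAborted, SocketError.ProcessLimit, SocketError.Shutdown,
        SocketError.SystemNotReady, SocketError.TimedOut,
        SocketError.TooManyOpenSockets, SocketError.TryAgain, SocketError.WouldBlock
    };

    private static readonly HashSet<int> TransientStatuses =
        new HashSet<int> { 408, 500, 502, 503, 504 };

    // Every socket code not listed as connectivity or transient is permanent.
    public static ErrorCategory ClassifySocketError(SocketError code)
    {
        if (ConnectivityCodes.Contains(code)) return ErrorCategory.Connectivity;
        if (TransientCodes.Contains(code)) return ErrorCategory.Transient;
        return ErrorCategory.Permanent;
    }

    // The failure type is checked first, then the HTTP status code if there is one.
    public static ErrorCategory ClassifyWebFailure(WebExceptionStatus status, int? httpStatus)
    {
        switch (status)
        {
            case WebExceptionStatus.ConnectFailure:
            case WebExceptionStatus.Timeout:
            case WebExceptionStatus.ConnectionClosed:
            case WebExceptionStatus.RequestCanceled:
                return ErrorCategory.Transient;
        }

        // Assumption: status codes outside the transient set count as permanent.
        return httpStatus.HasValue && TransientStatuses.Contains(httpStatus.Value)
            ? ErrorCategory.Transient
            : ErrorCategory.Permanent;
    }

    public static ErrorCategory Classify(Exception e)
    {
        var socketError = e as SocketException;
        if (socketError != null) return ClassifySocketError(socketError.SocketErrorCode);

        var webError = e as WebException;
        if (webError != null)
        {
            var response = webError.Response as HttpWebResponse;
            int? code = response == null ? (int?)null : (int)response.StatusCode;
            return ClassifyWebFailure(webError.Status, code);
        }

        if (e is IOException || e is TimeoutException || e is TaskCanceledException)
        {
            return ErrorCategory.Transient;
        }

        // Assumption: exception types not covered by the rules are permanent.
        return ErrorCategory.Permanent;
    }
}
```

Keeping the connectivity and transient socket codes in sets, with permanent as the fallback, means a newly reclassified code only requires moving one enum member between sets.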