Ensure that exceptions during Discovery are correctly handled #348

emmacasolin · 2022-02-25T00:22:00Z

Specification

                      ┌─────┐        ┌─────┐
                      │ N 1 │        │ N 2 │
                      └──┬──┘        └──┬──┘
  Discovery Queue        │              │
┌────────────────┐       │              │
│ T4  T3  T2  T1 ├─────► T1      PolykeyAgent.stop()
└────────────────┘       │              │
                         │              │
                         │              X
                   Discovery.stop()
                         │
                         │
                         │
                         T1 ───────────►
                         │      GRPCClient.createClient()
                         │          Retries for 20s
                         │
               ErrorNodeConnectionTimeout
                         │
                         │
                         │
             NodeConnectionManager.stop()
                         │
                         │
                         │
                  PolykeyAgent.stop()

When stopping the Discovery domain, you need to await the current task T1 (i.e. one iteration of the discovery queue, where we discover a node/identity and its linked nodes/identities). This is because we don't yet have the ability to abort in-flight asynchronous side-effectful tasks; that ability is scheduled in #297. The task itself involves establishing a node connection to the remote agent N 2. However, an edge case that we have not fully considered is one where N 2 has shut down and is no longer running. In such a situation, the connection timeout, which is passed from NodeConnectionManager to NodeConnection to GRPCClientAgent to GRPCClient, determines how long to wait for connection readiness (and thus how long until we can catch an error and exit the discovery process). This timeout is set to 20s for NodeConnectionManager, and it is propagated to all connection timeouts.
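To make the stop behaviour concrete, here is a minimal sketch of a queue whose stop() waits for the in-flight iteration and starts no further tasks. DiscoveryQueueSketch, sleepSketch, pendingCount and the errors array are hypothetical names for illustration only; this is not the actual Discovery implementation.

```typescript
// Hypothetical stand-in for the Discovery queue; not Polykey's actual class.
type TaskSketch = () => Promise<void>;

// Helper used to simulate a slow task such as a connection attempt.
const sleepSketch = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

class DiscoveryQueueSketch {
  // Exceptions thrown by tasks are recorded here instead of being lost.
  public readonly errors: unknown[] = [];
  private readonly queue: TaskSketch[] = [];
  private running = false;
  private loop: Promise<void> = Promise.resolve();

  public push(task: TaskSketch): void {
    this.queue.push(task);
  }

  public pendingCount(): number {
    return this.queue.length;
  }

  public start(): void {
    if (this.running) return;
    this.running = true;
    this.loop = this.run();
  }

  // Without cancellation, stop() can only wait for the current task.
  // If that task is blocked on a connection timeout, stop() blocks too.
  public async stop(): Promise<void> {
    this.running = false;
    await this.loop;
  }

  private async run(): Promise<void> {
    while (this.running) {
      const task = this.queue.shift();
      if (task === undefined) break;
      try {
        await task();
      } catch (e) {
        this.errors.push(e);
      }
    }
    this.running = false;
  }
}
```

With T1 and T2 queued, calling stop() right after start() resolves only once T1 has finished, and T2 stays in the queue. This is why the connection timeout inside T1 directly bounds how long stopping takes.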

In instances of this behaviour, you'll see retried attempts to connect through the proxy. Then the ErrorGRPCClientTimeout should be thrown, which is then rethrown as ErrorNodeConnectionTimeout. You should get this exception on withConnF, which is used by requestChainData in NodeManager, which is called by Discovery.
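The expected chain of exceptions can be sketched as follows. The two error names come from this issue, but the class bodies here are simplified stand-ins, and connectSketch, withConnFSketch, discoverNodeSketch and the original field are hypothetical; the real signatures in NodeConnectionManager and Discovery may differ.

```typescript
// Simplified stand-ins for the real error classes (assumption: plain Error subclasses).
class ErrorGRPCClientTimeout extends Error {
  constructor(message: string) {
    super(message);
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = 'ErrorGRPCClientTimeout';
  }
}

class ErrorNodeConnectionTimeout extends Error {
  // Keeps the lower-level error so the trace shows where the timeout came from.
  public readonly original: Error;
  constructor(message: string, original: Error) {
    super(message);
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = 'ErrorNodeConnectionTimeout';
    this.original = original;
  }
}

// Simulates connecting to a node; an unreachable node fails after timeoutMs.
function connectSketch(reachable: boolean, timeoutMs: number): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (reachable) {
      resolve();
      return;
    }
    setTimeout(
      () => reject(new ErrorGRPCClientTimeout(`No connection after ${timeoutMs}ms`)),
      timeoutMs,
    );
  });
}

// Rethrows the client-level timeout as a node-connection-level timeout.
async function withConnFSketch<T>(
  reachable: boolean,
  timeoutMs: number,
  f: () => Promise<T>,
): Promise<T> {
  try {
    await connectSketch(reachable, timeoutMs);
  } catch (e) {
    if (e instanceof ErrorGRPCClientTimeout) {
      throw new ErrorNodeConnectionTimeout('Timed out connecting to node', e);
    }
    throw e;
  }
  return await f();
}

// One discovery step: logs the timeout and recovers instead of crashing.
async function discoverNodeSketch(
  reachable: boolean,
  timeoutMs: number,
  log: string[],
): Promise<boolean> {
  try {
    await withConnFSketch(reachable, timeoutMs, async () => 'chain data');
    return true;
  } catch (e) {
    if (e instanceof ErrorNodeConnectionTimeout) {
      log.push(`${e.name}: ${e.message} (caused by ${e.original.name})`);
      return false;
    }
    throw e;
  }
}
```

Catching only ErrorNodeConnectionTimeout in the discovery step keeps unexpected exceptions visible, while the recorded original error gives a rough approximation of what proper error chaining would provide.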

We need to ensure that this is indeed the sequence of events in practice, and we need to ensure that errors are correctly caught and logged out.

Additional context

Example output from an instance of this behaviour: CLI and Client & Agent Service test splitting #311 (comment)

Tasks

In our Discovery, the default timeout shouldn't be 20s; that's too long. The withConnF method should be able to override the default timeout set in NodeConnectionManager, for example by providing a value as a parameter.
We need Asynchronous Promise Cancellation with Cancellable Promises, AbortController and Generic Timer #297 so that we can actually stop T1 when we stop the discovery instead of waiting for it to finish. In this case, if T1 finishes even after stopping, ensure that T1 is removed from the DB so that the work is not redone.
We need Integrate Error Chaining in a js-errors or @matrixai/errors package #304 so that we have a clearer error trace and can more easily see how the exceptions form. There is a possibility that more edge cases will be exposed by this.
The discovery must log every exception that occurs, even if it recovers from it, similar to how the network proxies report exceptions.
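For the first task, one possible shape of a per-call timeout override is sketched below. NodeConnectionManagerSketch, resolveTimeout, withTimeoutSketch and the trailing timeoutMs parameter are assumptions made for illustration; the actual withConnF signature is not shown in this issue and may differ.

```typescript
// 20s default, as described in this issue.
const DEFAULT_CONNECTION_TIMEOUT_MS = 20000;

// Rejects if the given promise does not settle within ms milliseconds.
function withTimeoutSketch<T>(p: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`Timed out after ${ms}ms`)),
      ms,
    );
    p.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}

// Hypothetical stand-in; not the real NodeConnectionManager.
class NodeConnectionManagerSketch {
  private readonly defaultTimeoutMs: number;

  constructor(defaultTimeoutMs: number = DEFAULT_CONNECTION_TIMEOUT_MS) {
    this.defaultTimeoutMs = defaultTimeoutMs;
  }

  // A per-call value wins; otherwise fall back to the manager-wide default.
  public resolveTimeout(overrideMs?: number): number {
    return overrideMs ?? this.defaultTimeoutMs;
  }

  // connect stands in for establishing the node connection.
  public async withConnF<T>(
    connect: () => Promise<void>,
    f: () => Promise<T>,
    timeoutMs?: number,
  ): Promise<T> {
    await withTimeoutSketch(connect(), this.resolveTimeout(timeoutMs));
    return await f();
  }
}
```

Under this sketch, Discovery would pass a short timeoutMs on its own calls while other callers keep the 20s default, so the manager-wide setting does not need to change.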

emmacasolin · 2022-02-27T22:56:13Z

This issue was too vague and has subsequently been split into two separate issues:

Reduce the timeout for establishing a Node Connection within the Discovery domain (by adding timer override to NodeConnectionManager) #353 - for being able to reduce the startup timeout for node connections created by the discovery domain
Refactor error handling of failed Node Connections created from the Discovery domain #354 - for logging out the exceptions that occur during discovery

Comments have also been added to the descriptions of #297 and #304 with respect to how those issues relate to this one.

Closing this issue now.

emmacasolin added the development Standard development label Feb 25, 2022

emmacasolin closed this as completed Feb 27, 2022

CMCDragonkai added the r&d:polykey:core activity 3 Peer to Peer Federated Hierarchy label Jul 24, 2022

CMCDragonkai assigned CMCDragonkai and emmacasolin and unassigned CMCDragonkai Jul 10, 2023
