Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate Error Chaining in a js-errors or @matrixai/errors package #304

Closed
12 tasks done
CMCDragonkai opened this issue Dec 3, 2021 · 81 comments · Fixed by MatrixAI/js-errors#1 or #374
Closed
12 tasks done
Assignees
Labels
development Standard development epic Big issue with multiple subissues

Comments

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Dec 3, 2021

Specification

Now that @matrixai/errors is ready, we can use it to incorporate error chaining into PK. This will involve a series of steps:

Chaining

In all places where we are catching an error and then throwing a new error in its place, we need to be including the original error as the cause of the new error, e.g.

} catch (e) {
  throw new SomePolykeyError(e.message, { cause: e });
}

Note that the cause should always be an exception/error. If not specified it defaults to unknown. When a error is thrown, the top-level error should contain the full instances of the errors in its cause chain, as one big nested object. This data structure can be serialised and deserialised recursively, where every error in the chain has a single cause, and that cause is contained as a property within it.

It is important to realise that the replacer will remove any entry if the returned value is undefined. Unless it is an array, it will replace that value with null. This means that because our cause may be undefined, it may not exist during deserialisation. Any usage of fromJSON must be aware that the cause property may not exist.

Client Server Error Architecture

toError: ServiceError -> ErrorPolykeyRemote                         fromError: Error -> ServerStatusResponse

                                                                         ┌─────────────────────────────┐
      ┌───────────────────────────────┐                                  │ NodeId 1 - Server           │
      │ Client                        │             Request              │                             │
      │                               │                                  │ ┌─────────────────────────┐ │
      │ Receives toError ServiceError ├──────────────────────────────────┼─► Service Handlers        │ │
      │ Throws ErrorPolykeyRemote     │                                  │ │                         │ │
      │ cause: ErrorX | ErrorUnknown  ◄──────────────────────────────────┼─┤ Sends fromError ErrorX  │ │
      │ data: { nodeID: 1, ... }      │                                  │ └─────────────────────────┘ │
      └───────────────────────────────┘      Response with error         │                             │
                                             JSON string in metadata     └─────────────────────────────┘
                                          error: { type: ErrorX, ... }

It's important to realise:

  • Only the client constructs ErrorPolykeyRemote
  • The ErrorPolykeyRemote should contain nodeId: NodeId, host, port and other connection information in its own data which can aid debugging which server responded with this error
  • The ErrorPolykeyRemote should also have information about which call triggered this, and perhaps even request information, it can have a lot of information which can add later.
    • Additional client/connection metadata should be set in the data property of ErrorPolykeyRemote
    • Give data a more specific type like data: ClientMetadata & POJO so that it forces that ClientMetadata must be available
    • We can use type ClientMetadata = { nodeId: NodeId; host: Host; port: Port }. And additional information can be provided later. Like what the call was.
    • The ClientMetadata type should be in src/types.ts to avoid import loops, since it has to use the NodeId type from nodes and Host and Port from the the network domains.
    • This keeps the type along with ErrorPolykeyRemote together at the top level src.
  • The server only serialises the ErrorX that ErrorX may be any kind of error. The server must filter the information it sends to the client to ensure it is not leaking sensitive information. This can include the stack since the stack is rarely required by the client.
  • Even when the server sends the error back to the client, it must also log out this error, but only if the error is considered a "server error". That is if the is a "client error" such as 4xx errors in HTTP, then it should not be logged out. This includes:
    • Authentication
    • Validation
    • Precondition
    • Concurrency Conflict
    • Anything else coming from HTTP 4xx
    • Example: missing object/resource
  • If the client encounters unknown data during deserialisation, it only returns ErrorUnknown at the root of the JSON data. Any other unknown data should be returned as-is.
  • This means all remote errors are always ErrorPolykeyRemote, but the cause may be ErrorUnknown or ErrorX. Where ErrorX may contain anything after that (as long as the runtime schema checking during fromJSON calls work)
  • The GRPC metadata key for the serialised error JSON should just be error, in the future if we switch to using JSON RPC, this error JSON will probably be encoded as part of the JSON RPC protocol - see https://www.jsonrpc.org/specification#error_object and https://eth.wiki/json-rpc/json-rpc-error-codes-improvement-proposal
  • GRPC itself only supports "HTTP2" kind of errors, but these codes aren't really being used, all errors from the remote node is considered one kind of HTTP error, that being of UNKNOWN. We use that to represent an "application" error. This basically means we don't use most of GRPC/HTTP2's standard error codes, as our "application layer errors" are on top. This is because GRPC/HTTP2's standard error codes were designed more for the HTTP layer. Although some of the HTTP error codes are still used internally by GRPC.
  • This is the same philosophy taken by JSON RPC over HTTP: http://www.simple-is-better.org/json-rpc/transport_http.html, where success or error at the application layer is still 200 response code at the HTTP layer. HTTP errors are specific to the HTTP layer.
    200 OK
    for responses (both for Response and Error objects)
    204 No Response / 202 Accepted
    for empty responses, i.e. as response to a Notification
    307 Temporary Redirect / 308 Permanent Redirect
    for HTTP-redirections (note that the request may not be automatically retransmitted)
    405 Method Not Allowed
    if HTTP GET is not supported: for all HTTP GET requests
    if HTTP GET is supported: if the client tries to call a non-safe/non-indempotent method via HTTP GET
    415 Unsupported Media Type
    if the Content-Type is not application/json
    others
    for HTTP-errors or HTTP-features outside of the scope of JSON-RPC
    

JSON serialisation/deserialisation

Our gRPC toError and fromError utilities will need to be modified to be able to serialise and deserialise the error chain when sending errors between clients/agents or agents/agents.

  • These can utilise JSON's replacer/reviver properties of stringify/parse - this is non-recursive and processes objects in fragments, with everything being orchestrated by stringify/parse
  • The replacer needs to serialise errors into a standardised structure { type: 'ClassName', data: { ... } } and also filter out sensitive data (the stack) when errors are being sent to an agent (as opposed to a client) - it works top to bottom, creating a structure first and filtering out unwanted fields afterwards
  • The reviver needs to covert this structure back into its original data type - it works bottom to top, converting fields to their correct types and then finally constructing the finished data type
    • If the received data can't be deserialised into a known error type it should be converted to an ErrorPolykeyUnknown, with the unknown data in the data field
  • We also no longer need to separate errors in the gRPC metadata into name/message/data and this is now just one data structure - this can just be one error field
  • Need to filter out sensitive information when sent from the service handlers, this includes stack and other special data. This would a case by case basis for specific exception classes. That is, the replacer will need to execute a filter by checking against specific exception classes and filtering them out. Filter rules should be acquired from src/client and src/agent.

Logging of Errors at the Service Handlers

  • Loggers should be added to the service handlers in order to log out errors, since these do not shut down the agent process
    • Not all errors should be reported though - only validation/authentication/other precondition errors
    • This logger is separate from the logger used for grpc internals
    • How to determine if an error is for the client or for the server? This is a case by case basis, unlike HTTP, we do not have 4xx and 5xx codes emitted at the beginning, all exceptions assume programmatic usage. Therefore there is a smaller amount of client errors then there are server errors. The filter should just check that if part of "client errors" set, then don't emit. Create set of "client errors", this set can be defined in both src/client and src/server (code duplication is fine here), since each service can define their own set of what is considered a "client error".

Reporting the Error to the user at the Root Exception Handler

We have 2 root exception handlers, one at the client and one at the agent. This addresses the client side, however unless things are different, the same applies to the agent side.

On the client side, this is done in src/bin/polykey.ts.

When an exception is received, we must interpret 3 things:

  • The cause chain
  • The exitCode
  • The metadata in data

According to issue #323, the --format json doesn't currently affect the STDERR. This needs to be done now because the errors being reported are quite complex, and during testing, we expectation utilities should be parsing JSON to make it easier to test.

Therefore binUtils.outputFormatter will need to ensure that error type is a human formatted JSON, while json type can now be the JSON output for exceptions. This is doable now due to the toJSON utility function inherited from AbstractError.

Furthermore in order to acquire the options passed into the command, use rootCommand.opts(), this method was added into to PolykeyCommand to enable us to acquire the options.

const opts = rootCommand.opts();
if (opts.format === 'json') {
  process.stderr.write(
    binUtils.outputFormatter({
      type: 'json',
      data: e.toJSON(),
    });
  );
} else {
  // use `error` format
  process.stderr.write(
    binUtils.outputFormatter({
      type: 'error',
      // ...
    });
  );
}

The desired format for human format should be something like:

ErrorName: description - message
  K\tV
  K\tV
  cause ErrorName: description - message
    K\tV
    K\tV

Note the usage of \t for separation while spaces are used for indentation. We can change this format later.

The exitCode is currently a bit ambiguous, it was originally intended to mean that if this exception was the last exception caught by the process, this code should be used for the process exit code.

It seems more ideal to allow the process to decide what the exit code should be based on the family of exceptions. But since we have built our exit code this way, we need to continue to use it like this.

However now that we have a cause chain, we have to decide how to deal with the exit code when we are getting ErrorPolykeyRemote. Because of this scenario:

C -> A1 -> A2

We must differentiate exceptions that originate from the client, or the first agent or the second agent. If an exception comes from A2, we may see ErrorPolykeyRemote wrapping another ErrorPolykeyRemote.

So right now a policy can be made:

  1. Find the first cause property that is not ErrorPolykeyRemote, and use that exitCode
  2. Otherwise get the exitCode of the first exception.

This only works because ErrorPolykeyRemote's cause type is limited to ErrorPolykey, it cannot be undefined or anything else.

Additional context

Tasks

  • 1. Convert all errors from CustomError to new AbstractError<T> from @matrixai/errors
  • 2. Give all errors a static description and errCode
  • 3. In all places where one error is caught and a different one is thrown, include the original error as the cause
  • 4. Modify the validation utils to use new error chaining
  • 5. Modify gRPC fromError and toError functions to properly handle chained errors
  • 6. Include a timestamp when any error is thrown
  • 7. Add error logging to service handlers
  • 8. Update all packages relying on js-errors to 1.1.0 to use the fromJSON and toJSON
  • 9. PK should adapt its replacer and reviver functionality and bring in "rules" from both src/client and src/agent. Can use DI in the client service vs the agent service to change the serialisation/deserialisation.
  • 10. Logging should also use separate rules, one for client service, one for agent service.
  • 11. Handle the error chain at the root exception handler for src/bin/polykey.ts and src/bin/polykey-agent.ts
  • 12. Ensure that AbstractError.fromJSON can handle non-existent cause property
@CMCDragonkai CMCDragonkai added the development Standard development label Dec 3, 2021
@CMCDragonkai
Copy link
Member Author

In our root exception handler, or our ExitHandlers, when we finally log out the error, we may wish to use JSON stringification of the exception to provide full context. Note that we don't use logger.error, it goes straight to STDERR.

@CMCDragonkai
Copy link
Member Author

Error chaining would be useful in a range of areas. Cause right now we are losing error context. One example of this is in the networking when an error event is emitted to the TLS socket, this gets printed out, but it's not clear what this actual error is and where it is truly coming from.

@CMCDragonkai
Copy link
Member Author

Here's an example problem from: #321 (comment)

In our GRPC fromError, we are sending over serialised ErrorPolykey instances.

However not all errors are going to be ErrorPolykey. Therefore to deal with those errors, we should wrap them in ErrorPolykey. It would be ideal to be able to construct an error chain here, so that it is possible to tell the client:

  1. An error occurred on a remote node during a call
  2. Here is the error that actually occurred on the remote node.

In fact, this could be done for ErrorPolykey instances and child instances as well in order to also capture network/node metadata so it's obvious that remote errors are different from local errors.

Meaning the resulting exception being thrown might actually be a specific exception like ErrorPolykeyRemote. This can allow us to wrap additional data here like nodeId and handler name/location, maybe even Polykey version.

And it's the chain itself dispatched in toError to all the possible error instances.

@CMCDragonkai
Copy link
Member Author

CMCDragonkai commented Feb 3, 2022

One issue is the serialisation of the cause chain.

The metadata doesn't allow any kind of nested structure. However all the metadata properties can be multi-values.

This means that one could flatten the chain so that each error property goes into the array to represent the error chain.

for (const e of errorChain) {
  metadata.add('name', e.name);
  metadata.add('message', e.message);
  // This needs to be considered as well
  if (e.data != null) {
    metadata.add('data', JSON.stringify(e.data));
  }
}

@CMCDragonkai
Copy link
Member Author

Remember that grpc domain is our "usage" of GRPC, not a general GRPC library, so we can always make changes there to suit our usecase of GRPC.

@emmacasolin
Copy link
Contributor

One issue is the serialisation of the cause chain.

The metadata doesn't allow any kind of nested structure. However all the metadata properties can be multi-values.

This means that one could flatten the chain so that each error property goes into the array to represent the error chain.

for (const e of errorChain) {
  metadata.add('name', e.name);
  metadata.add('message', e.message);
  // This needs to be considered as well
  if (e.data != null) {
    metadata.add('data', JSON.stringify(e.data));
  }
}

I think this could work, and then in toError() instead of doing

const errorName = e.metadata.get('name')[0] as string;
const errorMessage = e.metadata.get('message')[0] as string;
const errorData = e.metadata.get('data')[0] as string;

We could just iterate through them all, returning an array of errors rather than a single one. I'm not sure if this would maintain the correct order though. It would then be up to #323 to decide how this array of errors is displayed.

@CMCDragonkai
Copy link
Member Author

Yea that should maintain the chain order because the multivalues are iterated in order of insertion.

@CMCDragonkai
Copy link
Member Author

Slightly different from the ValidationError problem though cause this required a special way of encoding.

Another way is to use JSON encode for all error properties rather than spreading it into the GRPC metadata. This way we could use toJSON magic method to specialise serialised encoding for specific exceptions where applicable.

@CMCDragonkai
Copy link
Member Author

Will be done here: https://github.com/MatrixAI/js-errors

@CMCDragonkai
Copy link
Member Author

From this: #357 (comment)

It looks like it would be useful debugging to have a timestamp set for when an exception is constructed. Ideally also when it is thrown, but there's no hook for this in the JS runtime I can find.

This is because there's the time when an exception is constructed, an arbitrary delay, when it was thrown, an arbitrary delay, and finally when an error is actually reported using the js-logger.

We would need high precision for this, and not just unixtime. We can use performance.now() to help with this rather than just Date.now().

@CMCDragonkai
Copy link
Member Author

Being resolved in MatrixAI/js-errors#1

@CMCDragonkai
Copy link
Member Author

Not resolved until js-errors is actually integrated into PK.

@CMCDragonkai
Copy link
Member Author

CMCDragonkai commented Apr 5, 2022

Integrated into @matrixai/db now as 3.2.0.

Still to do:

  1. @matrixai/async-locks - Integrated @matrixai/errors and setup error chaining js-async-locks#4
  2. @matrixai/async-init - should be done along with async-locks - Integrated @matrixai/errors and setup error chaining and async locks js-async-init#8
  3. @matrixai/workers - should be done along with async-init - Integrated @matrixai/errors and setup error chaining and async init js-workers#2
  4. encryptedfs - should be done along with js-resources integration - Upgrading @matrixai/async-init, @matrixai/async-locks, @matrixai/db, @matrixai/errors, @matrixai/workers and integrating @matrixai/resources js-encryptedfs#63

@emmacasolin
Copy link
Contributor

I'm noticing that the vaults service handlers aren't using validateSync to do validation, so they would be throwing ErrorParse rather than ErrorValidation. Will fix those up to use validateSync.

emmacasolin added a commit that referenced this issue Jun 3, 2022
We only want to log out server errors on the server side, so we now apply filters on errors before logging them.

#304
@emmacasolin
Copy link
Contributor

While I was debugging some issues with the vaults tests I was able to see the new error chain - it was definitely helpful!

    ERROR:createService:ErrorPolykeyRemote: Vault ID must be multibase base58btc encoded strings
    ErrorPolykeyRemote: Remote error from RPC call - Vault ID must be multibase base58btc encoded strings
      command   vaultsPull
      nodeId    v1tevljtp9f6g7rlc8mgmfcka2kqp9oduu034852he3d3ubsrn6o0
      host      127.0.0.1
      port      45327
      timestamp Mon Jun 06 2022 11:45:31 GMT+1000 (Australian Eastern Standard Time)
      cause: ErrorPolykeyRemote: Remote error from RPC call - Vault ID must be multibase base58btc encoded strings
        command vaultsGitInfoGet
        nodeId  vdfnugqe23b7ce64ruo6hfbkop4k7e5br98mml107ntue04sauusg
        host    127.0.0.1
        port    52426
        timestamp       Mon Jun 06 2022 11:45:31 GMT+1000 (Australian Eastern Standard Time)
        cause: ErrorValidation: Input data failed validation - Vault ID must be multibase base58btc encoded strings
          exitCode      65
          timestamp     Mon Jun 06 2022 11:45:31 GMT+1000 (Australian Eastern Standard Time)

I realised though that in this case the original error is intended for the client, not the server, but since it's wrapped in an ErrorPolykeyRemote it doesn't get caught as a client error and gets logged out by the agent. The source of the error is invalid data coming from the client, but since it doesn't get picked up until it reaches a call on a remote agent maybe it could still be relevant to the server? Not sure on this one.

@CMCDragonkai
Copy link
Member Author

So this is an error from client to agent 1 to agent 2.

A validation error of agent 2 to agent 1.

I argue that agent 1 should have validated the data before passing to agent 2. This means it would have been a client error.

If agent 2 validates again and says it's wrong, then agent 1 is at fault and that's a server error.

@emmacasolin
Copy link
Contributor

So this is an error from client to agent 1 to agent 2.

A validation error of agent 2 to agent 1.

I argue that agent 1 should have validated the data before passing to agent 2. This means it would have been a client error.

If agent 2 validates again and says it's wrong, then agent 1 is at fault and that's a server error.

Agent 1 can't really validate the data, since it's possible to pass in either a Vault Name or a Vault Id. Since Agent 1 doesn't know the names of Agent 2's vaults it can't know whether a random string that is passed in is a valid vault name or not. It's not until it reaches Agent 2 and it can check that the string is not the name of any of its vaults that it can check if it's a valid Vault Id instead.

@CMCDragonkai
Copy link
Member Author

CMCDragonkai commented Jun 6, 2022

Ok so it's more of the case that this vault ID is missing. This is because, we first check if there is a matching vault name on agent 2, and then check if it is a base58 encoded string. If it is not, then it's really just a missing vault. The exception should be changed to state that it is simply a missing vault.

In that case, it is a client error, and therefore agent 1 should use the special filter to say that it is in fact a client error. Right now I believe it's just an array of exception classes. I imagine to do this check, you would need to change the array of exception classes to also optionally take a chain.

Array<typeof Error | Array<typeof Error>>

That way if it is a single error, you just check that, if there are multiple errors, you check it in a chain.

isClientError(e, [ErrorOtherClientError, [ErrorPolykeyRemote, ErrorVaultsVaultMissing]])

Then you would first check if it is ErrorPolykeyRemote, and if true, check if cause is ErrorVaultsVaultMissing.

Alternatively... it maybe better that the service handler on agent 1 does a catch, and checks if it is in fact ErrorVaultsVaultMissing, and then rethrows that exception as its own, thus representing an exception override.

@emmacasolin
Copy link
Contributor

Ok so it's more of the case that this vault ID is missing. This is because, we first check if there is a matching vault name on agent 2, and then check if it is a base58 encoded string. If it is not, then it's really just a missing vault. The exception should be changed to state that it is simply a missing vault.

In that case, it is a client error, and therefore agent 1 should use the special filter to say that it is in fact a client error. Right now I believe it's just an array of exception classes. I imagine to do this check, you would need to change the array of exception classes to also optionally take a chain.

Array<typeof Error | Array<typeof Error>>

That way if it is a single error, you just check that, if there are multiple errors, you check it in a chain.

isClientError(e, [ErrorOtherClientError, [ErrorPolykeyRemote, ErrorVaultsVaultMissing]])

Then you would first check if it is ErrorPolykeyRemote, and if true, check if cause is ErrorVaultsVaultMissing.

Alternatively... it maybe better that the service handler on agent 1 does a catch, and checks if it is in fact ErrorVaultsVaultMissing, and then rethrows that exception as its own, thus representing an exception override.

Rather than having to check against an array of errors, it might be better to just iterate through the chain of the error and check each cause against the set of client errors. Otherwise it seems too narrow of a check - if the underlying error is wrapped by more than one ErrorPolykeyRemote then the check would fail.

emmacasolin added a commit that referenced this issue Jun 6, 2022
Sometimes an error is only considered a client error when part of the chain of another error (for example ErrorPolykeyRemote) so it is now possible to specify this array.

#304
emmacasolin added a commit that referenced this issue Jun 6, 2022
The `data` POJO that was originally supplied in the constructor as the second parameter is now part of the `options` parameter and most places in the code have been updated to match this.

#304
emmacasolin added a commit that referenced this issue Jun 6, 2022
When any error is thrown as the result of another error occurring, the original error is now contained within the `cause` property of the new error.

#304
emmacasolin added a commit that referenced this issue Jun 6, 2022
…or chaining

Our gRPC `toError` and `fromError` utils are now able to serialise and deserialise Polykey and non-Polykey errors (as well as non-errors), including the entire error chain if this exists. Also includes the ability to filter out sensitive data, for example when the error is being sent to another agent.

Errors sent over the network in this way are now additionally wrapped on the receiving side in an `ErrorPolykeyRemote` to make the source of the error more clear.

#304
emmacasolin added a commit that referenced this issue Jun 6, 2022
All of the agent and client service handlers are now passed a shared logger that is used to log errors to stderr.

#304
emmacasolin added a commit that referenced this issue Jun 6, 2022
Sometimes an error is only considered a client error when part of the chain of another error (for example ErrorPolykeyRemote) so it is now possible to specify this array.

#304
emmacasolin added a commit that referenced this issue Jun 7, 2022
The `data` POJO that was originally supplied in the constructor as the second parameter is now part of the `options` parameter and most places in the code have been updated to match this.

#304
emmacasolin added a commit that referenced this issue Jun 7, 2022
When any error is thrown as the result of another error occurring, the original error is now contained within the `cause` property of the new error.

#304
emmacasolin added a commit that referenced this issue Jun 7, 2022
…or chaining

Our gRPC `toError` and `fromError` utils are now able to serialise and deserialise Polykey and non-Polykey errors (as well as non-errors), including the entire error chain if this exists. Also includes the ability to filter out sensitive data, for example when the error is being sent to another agent.

Errors sent over the network in this way are now additionally wrapped on the receiving side in an `ErrorPolykeyRemote` to make the source of the error more clear.

#304
emmacasolin added a commit that referenced this issue Jun 7, 2022
All of the agent and client service handlers are now passed a shared logger that is used to log errors to stderr.

#304
emmacasolin added a commit that referenced this issue Jun 7, 2022
Sometimes an error is only considered a client error when part of the chain of another error (for example ErrorPolykeyRemote) so it is now possible to specify this array.

#304
emmacasolin added a commit that referenced this issue Jun 9, 2022
The `data` POJO that was originally supplied in the constructor as the second parameter is now part of the `options` parameter and most places in the code have been updated to match this.

#304
emmacasolin added a commit that referenced this issue Jun 9, 2022
When any error is thrown as the result of another error occurring, the original error is now contained within the `cause` property of the new error.

#304
emmacasolin added a commit that referenced this issue Jun 9, 2022
…or chaining

Our gRPC `toError` and `fromError` utils are now able to serialise and deserialise Polykey and non-Polykey errors (as well as non-errors), including the entire error chain if this exists. Also includes the ability to filter out sensitive data, for example when the error is being sent to another agent.

Errors sent over the network in this way are now additionally wrapped on the receiving side in an `ErrorPolykeyRemote` to make the source of the error more clear.

#304
emmacasolin added a commit that referenced this issue Jun 9, 2022
All of the agent and client service handlers are now passed a shared logger that is used to log errors to stderr.

#304
emmacasolin added a commit that referenced this issue Jun 9, 2022
Sometimes an error is only considered a client error when part of the chain of another error (for example ErrorPolykeyRemote) so it is now possible to specify this array.

#304
@CMCDragonkai
Copy link
Member Author

It's better to be explicit than implicit in this case though.

At any case this is merged into staging now, so we should consider this to be done.

Further improvements exists in structured logging.

@CMCDragonkai
Copy link
Member Author

This is done when it was merged into staging. Further work in structured logging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment