
Upgrading lib dependencies and node.js version #374

Merged
merged 74 commits into from
Jun 7, 2022

Conversation

tegefaulkes
Contributor

@tegefaulkes tegefaulkes commented May 20, 2022

Description

This PR has been copied from #366 under a new branch to allow for branch and CI/CD protections.

Several core libraries have been updated and so they need to be integrated into PK. Refer to MatrixAI/js-encryptedfs#63 to see how this was handled for EFS and repeat here.

  1. @matrixai/async-init - no major changes
  2. @matrixai/async-locks - all instances of Mutex should change to use Lock or RWLockWriter
  3. @matrixai/db - the DB now supports proper DB transactions and has introduced an iterator; however, there are no more sublevels (use KeyPath/LevelPath instead)
  4. @matrixai/errors - we make all of our errors extend AbstractError<T> and also provide static descriptions to all of them, as well as use the cause chain
  5. @matrixai/workers - no major changes here
  6. node.js - we should upgrade to Node 16 in order to integrate promise cancellation, as well as using Promise.any for ICE protocol
  7. @matrixai/resources - since the @matrixai/db no longer does any locking, the acquisition of the DBTransaction and Lock has to be done together with withF or withG

Along the way, we can explore how to deal with indexing #188 and #257, which should be easier now that the DB has root levels for data and transactions.

Locking changes

There were three important types of locks brought in by js-async-locks:

  1. Lock (this is the equivalent of Mutex which we were using previously)
  2. RWLockWriter (a write-preferring read/write lock)
  3. LockBox (a collection of related locks)

In most cases, we are removing locks in favour of using optimistic concurrency control (OCC). This means that most domain locks should be removed with a few exceptions:

  • Counters/serialised data - places where we have strict serialisation, such as counters and lexicographically ordered keys, require locking the DB in order to maintain this order. This applies to the keys used by the Discovery Queue in Discovery.ts, the NotificationIds and message count used by NotificationsManager.ts, the ClaimIds and sequence numbers used by Sigchain.ts, and if the set/ping node or refresh buckets queues introduced in Testnet Deployment #326 become persistent then we would need locking there as well. This could be achieved for these cases by introducing a LockBox to the affected domains and locking the relevant keys when we access the db, e.g. withF([this.db.transaction(), this.locks.lock(...lockRequests)], async ([tran]) => {}) where this.locks is a LockBox and ...lockRequests is an array of [KeyPath, Lock] (see EFS INodeManager.ts).
  • Object maps - in NodeConnectionManager and VaultManager we need to lock groups of objects; this can be done using a LockBox, where we reference the same ID each time we want to lock a specific object.

Everywhere else we expect that conflicts will be rare, so we don't use locks in order to simplify our codebase. In the case of a conflict, we can either retry (if safe) or bubble up the error to the user.
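The keyed-locking idea behind LockBox can be sketched with a minimal self-contained mutex map. This is illustrative only; the real @matrixai/async-locks Lock/LockBox API differs (it composes with withF/withG as shown above).

```typescript
// Minimal keyed-lock sketch, illustrative only; the real
// @matrixai/async-locks Lock/LockBox API differs.
class KeyedLocks {
  protected tails: Map<string, Promise<void>> = new Map();

  // Acquire the lock for `key`, run `f`, then release
  public async withLock<T>(key: string, f: () => Promise<T>): Promise<T> {
    const prev = this.tails.get(key) ?? Promise.resolve();
    let release!: () => void;
    const curr = new Promise<void>((resolve) => {
      release = resolve;
    });
    const tail = prev.then(() => curr);
    this.tails.set(key, tail);
    await prev; // wait until it is our turn for this key
    try {
      return await f(); // critical section for this key
    } finally {
      release();
      // Clean up if nobody queued behind us for this key
      if (this.tails.get(key) === tail) this.tails.delete(key);
    }
  }
}
```

With something like this, serialised updates such as a notification counter can be guarded per key, while operations on unrelated keys proceed concurrently.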

Errors changes

The new js-errors allows us to bring in error chaining, along with more standardised JSON serialisation/deserialisation (for sending errors across gRPC). With this error chaining ability, there are now three ways that we can handle/propagate errors:

  1. Error re-raise/re-throw: This re-raises the same error; in this case it just means you couldn't handle the error.
try {
  throw new ErrorPolykey('something happened');
} catch (e) {
  throw e;
}
  2. Error override: This overrides the error with a whole new error. Use this when you are wrapping a library of internal errors and presenting transformed external errors to the user.
try {
  throw new ErrorPolykey('something happened');
} catch (e) {
  throw new ErrorPolykey2(e.message);
}
  3. Error chain: This raises a new error and chains the new error with the old error. Use this when your code represents an onion of different layers each capable of handling different errors.
try {
  throw new ErrorPolykey('something happened');
} catch (e) {
  throw new ErrorPolykey2(e.message, { cause: e });
}

In all places where we are catching one error and throwing a different error in its place, we should be using approach 3 (error chain). If we just want to bubble the original exception upwards then use approach 1 (re-raise/re-throw). Finally, if we want to hide the original error from the user (perhaps it contains irrelevant implementation details or could be confusing and thus requires additional context) we can use approach 2 (error override). There is a fourth approach that exists in Python for errors that occur as a direct result of handling another error, however, this does not exist in TypeScript (in such a case we would use approach 3). When using approach 2 (and in some cases approach 3) you may want to log out the original error in addition to throwing the new error.

JSON serialisation/deserialisation

When sending errors between agents/from client to agent we need to serialise errors (including the error chain if this exists). Then, on the receiving side, we need to be able to deserialise back into the original error types.

We are able to do this using JSON.stringify() (serialisation) and JSON.parse() (deserialisation). These methods allow us to pass in a replacer/reviver to aid with converting our error data structure, as well as being combined with toJSON() and fromJSON() utility methods on the error class itself. These are implemented on AbstractError from js-errors, however, we need to extend these to work with ErrorPolykey in order to handle the additional exitCode property. While toJSON() can simply call super.toJSON() and add in the extra field, fromJSON() needs to be completely reimplemented (although this can be copied from AbstractError for the most part). Similarly, the replacer and reviver can be based on the replacer and reviver used in the js-errors tests.
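The round-trip described above can be sketched with a cut-down error class. This is illustrative only: SketchError is a stand-in, and the real AbstractError/ErrorPolykey implementations differ (they also carry static descriptions, timestamps, and a data property, and js-errors additionally uses a replacer for values without their own toJSON()).

```typescript
// Illustrative sketch only: a cut-down error class showing the
// toJSON()/fromJSON() pattern with an extra exitCode field, plus a
// reviver for JSON.parse(); not the real AbstractError/ErrorPolykey
class SketchError extends Error {
  public exitCode: number;
  public cause?: SketchError;

  constructor(
    message: string,
    options: { cause?: SketchError; exitCode?: number } = {},
  ) {
    super(message);
    this.name = this.constructor.name;
    this.cause = options.cause;
    this.exitCode = options.exitCode ?? 1;
  }

  // JSON.stringify() calls this automatically for each error in the chain
  public toJSON(): object {
    return {
      type: this.name,
      data: {
        message: this.message,
        exitCode: this.exitCode,
        cause: this.cause, // serialised recursively via its own toJSON()
      },
    };
  }

  public static fromJSON(json: any): SketchError {
    return new SketchError(json.data.message, {
      cause: json.data.cause,
      exitCode: json.data.exitCode,
    });
  }
}

// The reviver runs bottom-up, so a nested cause is already revived by
// the time the outer error object is processed
function reviver(key: string, value: any): any {
  if (value != null && typeof value === 'object' && value.type === 'SketchError') {
    return SketchError.fromJSON(value);
  }
  return value;
}
```

A chained error then survives `JSON.parse(JSON.stringify(err), reviver)` with its cause chain and exitCode intact.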

ErrorPolykeyRemote

Errors are propagated between agents and clients as follows:

toError: ServiceError -> ErrorPolykeyRemote                         fromError: Error -> ServerStatusResponse

                                                                         ┌─────────────────────────────┐
      ┌───────────────────────────────┐                                  │ NodeId 1 - Server           │
      │ Client                        │             Request              │                             │
      │                               │                                  │ ┌─────────────────────────┐ │
      │ Receives toError ServiceError ├──────────────────────────────────┼─► Service Handlers        │ │
      │ Throws ErrorPolykeyRemote     │                                  │ │                         │ │
      │ cause: ErrorX | ErrorUnknown  ◄──────────────────────────────────┼─┤ Sends fromError ErrorX  │ │
      │ data: { nodeID: 1, ... }      │                                  │ └─────────────────────────┘ │
      └───────────────────────────────┘      Response with error         │                             │
                                             JSON string in metadata     └─────────────────────────────┘
                                          error: { type: ErrorX, ... }

ErrorPolykeyRemote is constructed on the client side (not the server side) inside our gRPC toError() utility. After the received error is deserialised, it is wrapped as the cause property of a new ErrorPolykeyRemote, which should also contain the nodeId, host, and port of the agent that originally sent the error in its data property. In order to access this information, it needs to be passed through from wherever the client/agent method is called (this would be bin commands for the client service and domain methods for the agent service). The data can then be passed through to the promisify...() methods, which in turn call toError().
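A hypothetical sketch of this client-side wrapping; RemoteErrorSketch, NodeMetadata, and toErrorSketch are stand-in names for illustration, not the real grpc utilities.

```typescript
// Hypothetical stand-ins for ErrorPolykeyRemote and the grpc toError()
// utility; names and shapes here are assumptions for illustration
type NodeMetadata = { nodeId: string; host: string; port: number };

class RemoteErrorSketch extends Error {
  public cause?: Error;
  constructor(
    public data: NodeMetadata,
    options: { cause?: Error } = {},
  ) {
    super(`Remote error from ${data.nodeId}`);
    this.name = 'RemoteErrorSketch';
    this.cause = options.cause;
  }
}

// The deserialised error from the response metadata becomes the cause,
// and the sending agent's details go into the data property
function toErrorSketch(
  deserialised: Error,
  meta: NodeMetadata,
): RemoteErrorSketch {
  return new RemoteErrorSketch(meta, { cause: deserialised });
}
```

The key design point is that the wrapping happens on the client side, so the metadata identifying the remote agent has to be threaded through from the call site into the promisified call.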

Testing

Now that we have an error chain, we need to adjust our tests to be able to perform checks on these. In many cases where we were originally expecting some specific ErrorPolykey, we will now be receiving an ErrorPolykeyRemote with the original error in its cause property. For simple cases like this it is enough to perform the existing checks on the cause property of the received error rather than the top-level error; however, this approach becomes more complicated for longer error chains. Additionally, we may want to perform checks on the top-level ErrorPolykeyRemote (such as checking the metadata for the sending agent).

In this case, it would be useful to create an expectation utility that allows one to perform checks on the entire error chain, from the top-level error to the final error in the chain. This could look something like this:

function expectErrorChain(thrownError: any, expectedErrors: Array<any>) {
  let checkingError = thrownError;
  for (const expected of expectedErrors) {
    // Check expected type first
    expect(checkingError).toBeInstanceOf(expected.constructor);
    // Match against every expected property
    expect(checkingError.message).toBe(expected.message);
    if (expected instanceof ErrorPolykey) {
      expect(checkingError.data).toEqual(expected.data);
      // ...
    }
    // Move along to the next in the cause chain
    checkingError = checkingError.cause;
  }
}

We could also think about using the toJSON() method on each error, which would let us use jest's expect().toMatchObject() matcher rather than having to check every error property individually. We could also include parameters to specify which properties of the error should and should not be checked against.

Additional context

Database changes

Changes include but are not limited to

  1. Updating any transact wrappers by removing them or using withF, withG locking directly.
  2. In all cases where there is a conflict, just throw it up the stack. We expect to handle conflicts within the handlers, or will look deeper into this later.
  3. The ErrorDBTransactionConflict error should never be seen by the user. We should catch and override it with a more descriptive error for the context.
  4. Transactions should be started within the handlers and passed all the way down to where they are needed. The idea is to make each handler atomic.
  5. Concurrency testing should be introduced, but only after SI transactions have been implemented.
  6. All usage of DB should be updated to use the new API. This means removing sublevels and utilising LevelPaths and KeyPaths instead.
  7. All usage of db streams should be replaced with the db iterator.
  8. All instances of db.put, db.get and db.del should be using transactions via tran.put/get/del.

This applies to all domains that make use of DB, or domains that depend on others that make use of DB. The goal here is to make every operation, starting from the handlers, atomic.
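The pattern of passing transactions down from the handlers can be sketched as follows. FakeDB, DBTransaction, and SomeDomain here are stand-ins; the real API uses withF([db.transaction()], ...) from @matrixai/resources.

```typescript
// Sketch of the `tran?: DBTransaction` pattern; FakeDB and the types
// here are stand-ins, not the real js-db/@matrixai/resources API
type DBTransaction = {
  put: (key: string, value: string) => Promise<void>;
};

class FakeDB {
  public store: Map<string, string> = new Map();
  // Run `f` within a "transaction"; a real DB would buffer and commit
  public async withTransactionF<T>(
    f: (tran: DBTransaction) => Promise<T>,
  ): Promise<T> {
    const tran: DBTransaction = {
      put: async (key, value) => {
        this.store.set(key, value);
      },
    };
    return await f(tran);
  }
}

class SomeDomain {
  public constructor(protected db: FakeDB) {}
  // If no transaction is passed in, create our own transaction context;
  // otherwise run inside the caller's (i.e. the handler's) transaction
  public async setThing(value: string, tran?: DBTransaction): Promise<void> {
    if (tran == null) {
      return this.db.withTransactionF((tran) => this.setThing(value, tran));
    }
    await tran.put('thing', value);
  }
}
```

A handler can then open one transaction and pass it into several domain calls, making the whole handler a single atomic operation, while the optional parameter keeps each method usable on its own.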

There are limitations to this, however. Since a transaction can fail if there are overlapping edits between transactions, we can't really include changes to the db that will commonly (or always) conflict. Examples of this are counters and commonly updated fields. So far this has been seen in:

  1. NotificationsManager - makes use of a counter, so any transactions that add or remove a notification WILL conflict. Reads also update metadata, so concurrently reading the same message WILL conflict.
  2. More to follow?

In some cases we will need to make use of locking along with a transaction. A good example of this is in the NotificationsManager, where we lock the counter update. When this is the case we need to take extra care with the locking: unless the lock wraps the whole transaction, it is still possible to conflict on the transaction. We can't compose operations that rely on this locking with larger transactions.

An example of this problem:

start tran1
start tran2

start lock
	tran1 update counter
end lock

start lock
	tran2 update counter
end lock

end tran1
end tran2 // conflict!

To avoid this, tran2 has to start after tran1 ends. We need to force serialisation of the transactions here. As a result, we can't compose with a transaction outside of the lock.

This means that some operations or domains can't be composed with larger transactions. It has yet to be seen whether this will cause an issue; more testing is required to confirm any problem. I suppose this means we can't mix pessimistic and optimistic transactions. So far it seems it will be a problem for the following domains.

  1. Vaults domain - Lifecycle and editing of vaults relies heavily on locks.
  2. NotificationsManager - locking is needed for the counter and to prevent other conflicts.

Node.js changes

After upgrading to Node v16 we will be able to bring in some new features.

  1. Promise.any - we are currently using Promise.all in our pingNode() method in NodeConnectionManager (this is being done in Testnet Deployment #326) but this should change to using Promise.any. This is because Promise.all waits for every promise to resolve/reject, however we only care about whichever finishes first and we want to cancel the rest. This change can be made in Testnet Deployment #326 after this PR is merged.
  2. AggregateError - this error is emitted by Promise.any if all of the given promises are rejected. In our case this would mean that we were unable to ping the desired node via direct connection or signalling (and eventually relaying once this is implemented), so we may want to catch this and re-throw some other error to represent this. We will also need to add this into our error serialisation/deserialisation.
  3. AbortController - there are a number of places that have been identified in Asynchronous Promise Cancellation with Cancellable Promises, AbortController and Generic Timer #297 where we could use the new AbortController/AbortSignal. This can be done in a separate PR when rebasing Testnet Deployment #326.

Issues Fixed

Tasks

  • 1. Upgrade all exceptions to use AbstractError and use cause chain and static descriptions
  • 2. Remove all sublevel object maintenance
  • 3. Update all usages of db.get and db.put and db.del to use KeyPath
  • 4. Remove all uses of batch and domain locking and instead migrate to using DBTransaction and the withF and withG from @matrixai/resources
  • 5. Replace all uses of createReadStream, createKeyStream and createValueStream with db.iterator
  • 6. All domain methods that involve DB state should now take a tran?: DBTransaction optional parameter in the last parameter to allow one to compose a transaction context
  • 7. If possible, get EFS to use the new DB as well, which can be done by making DB take a "prefix", thus recreating sublevels, but only for this usecase. This would mean both EFS and PK use the same DB, with EFS completely controlling a lower level. Put on hold; requires an update to js-db and is low priority. Make a new issue for this?
  • 8. See where RWLock can be used instead when doing concurrency control on the DBTransaction to raise the isolation level to avoid non-repeatable reads, or phantom reads or lost-updates. - SI means that non-repeatable reads and phantom reads are not possible, so that waits on SI merge, while lost-updates is where the counter race condition occurred and that was resolved with just the Lock.
  • 9. Upgrade to Node 16 (follow procedure from Upgrading to node 16.x LTS (and pkgs.nix upgrade to 21.11 or later) TypeScript-Demo-Lib#37)
  • 10. Integrate @matrixai/async-locks to use the Lock class instead of Mutex from async-mutex
  • 11. Use @matrixai/resources withF and withG to replace transactions
  • 12. Clean up left over commentary from migration
  • 13. Update NodeConnectionManager and VaultManager to use LockBox for the object maps.
  • 14. Update NodeConnectionManager withConnF to properly use the new resources API features. This includes ResourceAcquire making use of error handling.
  • 15. Update usage of DB.iterator/tran.iterator as well as keyAsBuffer/valueAsBuffer
  • 16. Update to latest version of EFS with fixed decrypt and latest DB
  • 17. Filter error logging on the agent and client service handlers to only log out server errors
  • 18. Fix formatting of CLI errors to display as much (unmodified) info as possible
  • 19. Clean up commits
  • 20. Move ClientMetadata and ErrorPolykeyRemote under the grpc domain
  • 21. Concurrency checks and DB conflict retry in service handlers (pending Introduce Snapshot Isolation OCC to DBTransaction js-db#19)
  • 22. Replace the utils.promisify here with Introduce Snapshot Isolation OCC to DBTransaction js-db#19 (comment)

Tasks are divided by domain for conversion. Assignments are as follows. Domains labelled 'last' depend on a lot of other domains and will be divided later.

  • acl - Emma
  • agent - last
  • bootstrap - last
  • client - last
  • discovery - Brian
  • gestalts - Emma
  • identities - Emma
  • nodes - Emma, depends on sigchain
  • notifications - Brian
  • sessions - Emma
  • sigchain - Brian
  • vaults - Brian
  • polykeyAgent - last

Final checklist

  • Domain specific tests
  • Full tests
  • Updated inline-comment documentation
  • Lint fixed
  • Squash and rebased
  • Sanity check the final build

TBD spec:

Error changes:

Testing for an error with a cause.

  1. Pending changes from JSON serialisation/deserialisation from the new AbstractError
  2. ErrorPolykeyRemote needs to have remote node metadata
  3. Work out how ErrorPolykeyRemote is propagated from agent to agent to client.
  4. Create new expectation utilities that can standardise how we expect things to fail and the checks on those exceptions, as well as the cause property
  5. Bring in the standards of exception override vs exception chain into the spec

DB changes:

domains - Each domain used to have a transact or withTransaction wrapper. Unless this wrapper actually needs some additional context, and isn't just calling withF and withG directly, avoid it and just replace it with direct usage of the withF and withG utilities. We should identify domains that require additional context; in those cases we continue to use a with-transaction wrapper.

transaction, and its utilisation of SI transactions

withF([db.transaction()], async ([tran]) => {
    await tran.put(k1, ...);
    await tran.put(k2, ...);
});

withF([db.transaction()], async ([tran]) => {
    await tran.put(k2, ...);
    await tran.put(k3, ...);
});
ErrorDBTransactionConflict
  .data => [k1, k2, k3]

In all cases, you just throw this exception up, and propagate it all the way to the client.

The pk client however will change the message to be more specific to the user's action.

Retrying this can be identified and sprinkled into the codebase afterwards.

  1. Work out a way to write a user-friendly message when a ErrorDBTransactionConflict occurs in the PolykeyClient (at the CLI)

  2. Identify where automatic retries can occur, and make those catch the ErrorDBTransactionConflict and resubmit the transaction

  3. Identify write-skews; this is where we must use solutions for dealing with write skews => locking and materialising the write conflict

  4. Identify serialisation requirements - counter updates, and where PCC is demanded as part of the API, in those situations use the LockBox or lock

Handlers now need to start the transaction, for both the agent and client services.

We have tran?: DBTransaction, which allows public methods to create their own transaction context.

But this is only for debugging and convenience.

During production, the transaction context is meant to be setup at the handler level.

This means that each handler is its own atomic operation.

By default they start out at the handler level; this will make it easier for us to identify our atomic operations and to compare them.

Tests - introduce concurrency tests; consider some new concurrency expectation combinators that allow us to more easily say one of these results has to be X, and the rest have to be Y.

Concurrency tests can be introduced domain by domain. - Use Promise.allSettled
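Such a combinator could be sketched as follows. settlePartition is a hypothetical helper, not something in the codebase: it settles all promises and partitions the results so a test can assert, for example, that exactly one operation committed and the rest conflicted.

```typescript
// Sketch of a concurrency expectation combinator (hypothetical helper,
// not in the codebase): settle all promises and partition the results
async function settlePartition<T>(
  promises: Array<Promise<T>>,
): Promise<{ fulfilled: Array<T>; rejected: Array<unknown> }> {
  const results = await Promise.allSettled(promises);
  const fulfilled: Array<T> = [];
  const rejected: Array<unknown> = [];
  for (const result of results) {
    if (result.status === 'fulfilled') {
      fulfilled.push(result.value);
    } else {
      rejected.push(result.reason);
    }
  }
  return { fulfilled, rejected };
}
```

A concurrency test would then assert on the sizes of the two partitions, e.g. "one of these results has to be a commit, and the rest have to be ErrorDBTransactionConflict".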

We can however focus our concurrency testing at the handler level, because we expect that to be the unit of atomic operations.

All db ops removed; changed to using KeyPath and LevelPath.

In the iterator:

  • For tran.iterator you will always have the key no matter what because it is being used for matching between the overlay and underlying db data.
  • The keyAsBuffer and valueAsBuffer options are by default true; if you change them to false, it will decode the two and return them as decoded values. If you need both, use dbUtils.deserialize().
  • More docs for tran.iterator
  • Eliminate all subleveldown; we are no longer using that (can be brought back later; would only be useful to share the db between EFS and the PK DB)
  • Eliminate all streams
withF([db.transaction()], async ([tran]) => {
    await tran.put('k1', 'adfdsf');
    const i = tran.iterator();
    const r1 = await tran.get('k1');
    i.seek('k1');
    const r2 = await i.next();
    // r1 !== r2 (possible)
    // with SI r1 === r2
});

Just think of the iterator as having the same snapshot as the original transaction.

Don't start writing concurrency tests until we have SI transactions. Serial tests are still fine.

Nodev16 Changes:

  1. Promise.any - for ICE protocol for the DC and signalling
  2. AggregateError - it's in our serialisation and deserialisation, but it's only emitted by Promise.any
  3. AbortController - separate PR (when rebasing testnet deployment) - identify where these may be changed polling

Locking Changes:

All async-mutex locks are to be replaced with js-async-locks.

In the js-async-locks:

  1. Lock - mutex
  2. RWLockWriter - read/write
  3. LockBox - locking collection

Convert object-map pattern to using LockBox by referencing the same ID.

Use INodeManager in EFS as a guide.

@tegefaulkes
Contributor Author

This PR needs to target the staging branch when that gets created.

@emmacasolin
Contributor

The vaults client service tests are the only ones that aren't split like the others into separate test files. They also set up the client service differently from the other tests and are missing some parameters due to the errors changes. Going to split and align them with the other tests to try and get them running.

@tegefaulkes
Contributor Author

tegefaulkes commented May 24, 2022

I'm looking into merging the vaults' EFS db with the main DB. At first blush it seems easy: we can just provide the DB to the EFS constructor. However, I don't think that the DB supports prefixes/sublevels anymore. Do we need to re-implement that on the DB before we can continue with this? @CMCDragonkai

@CMCDragonkai
Member

CMCDragonkai commented May 24, 2022

It would require the DB to maintain knowledge of a levelpath to be prepended to any queries.

But other questions remain, what about the root paths? Does the subDB maintain their own transactions and indexes? Or are they shared with the top DB?

Anyway, it's not important for now; we can address it later.

@tegefaulkes
Contributor Author

Tests in tests/notifications/NotificationsManager.test.ts are currently failing randomly. I've narrowed it down to the function getNotificationIds returning a malformed NotificationId in the list. For example, normally the NotificationId should be 16 bytes, but we are getting undersized ones.

    IdInternal(9) [Uint8Array] [
        0, 187,  28,   5,
        0, 119, 190, 113,
      109
    ]

If I had to guess, there is a problem with the iterator reading the keys out properly. It could be that a malformed ID is used when adding the message to the DB, but it's also possible there is a bug with the DB's handling of the path delimiter. This could happen if the ID itself contains the delimiter and that wasn't handled properly.

I'd start with checking if the ID is malformed before it's put into the database. If it's the delimiter problem then the IDs that fail should always contain the same character.

@CMCDragonkai
Member

Tests in tests/notifications/NotificationsManager.test.ts are currently failing randomly. I've narrowed it down to the function getNotificationIds returning a malformed NotificationId in the list. For example, normally the NotificationId should be 16 bytes, but we are getting undersized ones.

    IdInternal(9) [Uint8Array] [
        0, 187,  28,   5,
        0, 119, 190, 113,
      109
    ]

If I had to guess, there is a problem with the iterator reading the keys out properly. It could be that a malformed ID is used when adding the message to the DB, but it's also possible there is a bug with the DB's handling of the path delimiter. This could happen if the ID itself contains the delimiter and that wasn't handled properly.

I'd start with checking if the ID is malformed before it's put into the database. If it's the delimiter problem then the IDs that fail should always contain the same character.

The delimiter in the database is the \x00 null byte. There are tests in js-db that cover this usecase to ensure that it's properly escaping keys that contain the \x00 null byte.

I'm not sure if the js-id case is covered well enough, but @emmacasolin you can check the js-db tests.

Just make sure that this branch is actually using the latest js-db (check package-lock.json and node_modules/@matrixai/db/package.json).

@emmacasolin
Contributor

emmacasolin commented May 24, 2022

The vaults client service tests are the only ones that aren't split like the others into separate test files. They also set up the client service differently from the other tests and are missing some parameters due to the errors changes. Going to split and align them with the other tests to try and get them running.

Spec for this:

I've currently split all of the vaults client service tests (not secrets), however there were no existing tests for some of the handlers so those are currently todos.

Tests that have been split into separate files and are passing:

  • vaultsCreate, vaultsDelete, and vaultsList (all combined into a single file since they're interdependent)
  • vaultsLog
  • vaultsPermissionSet, vaultsPermissionUnset, and vaultsPermissionGet (again, combined due to interdependence)
  • vaultsRename
  • vaultsVersion

Tests that have been split into separate files but are still todo:

  • vaultsClone
  • vaultsPull
  • vaultsScan

Tests that are written but are yet to be split:

  • vaultsSecretsNew/vaultsSecretsDelete/vaultsSecretsGet
  • vaultsSecretsNewDir/vaultsSecretsList
  • vaultsSecretsMkdir
  • vaultsSecretsEdit
  • vaultsSecretsRename
  • vaultsSecretsStat

@emmacasolin
Contributor

emmacasolin commented May 24, 2022

I'm noticing a pattern with the malformed ids.

If you look at the valid ids, they all look something like this:

   6,  40, 197, 143, 103, 104, 112,   0, 187, 239, 222, 246, 239, 105, 136, 223
   6,  40, 197, 143, 103, 194, 112,   0, 153, 154,  25,  17,  55,  81,   7, 147
   6,  40, 197, 143, 105,  71, 112,   0, 153,  51, 165, 184,  95, 177,  94, 116
   6,  40, 197, 143, 108, 192, 112,   0, 148,  42, 166, 251,  64, 233, 161,  23
   6,  40, 197, 143, 108, 208, 112,   0, 158, 229,  40,  54, 129, 244,  55, 177
   6,  40, 197, 143, 108, 221, 112,   0, 131, 130,  32,  78, 126, 110,  85, 120

Note that they all only contain one 0.

Looking at the malformed ids, these all have a second 0 somewhere in the array:

                                      0, 188, 217, 193,  78,   0, 120, 201, 211
                       0,  12, 112,   0, 134, 242, 214, 168, 251, 149, 166, 127
                                      0, 164,  67, 242,   2,  10, 126,   0,  44
                            0, 112,   0, 139,  57,  29,  46,  18, 182, 188, 110
                                      0, 179, 100, 199,   0, 123,  31,  30, 180
                                      0, 142,   9,   0, 136, 186, 123, 100,  47

Since 0 is the delimiter, it's likely that these extra zeros are being misinterpreted. Notification Ids are just randomly generated lexicographic integers, and these can contain zeros. This is the same case for Discovery Queue Ids, so this is likely why I was seeing occasional random failures in those tests in the past as well. I'll have to take a look at the ids before they ever get put into the db in order to confirm this theory.


Supporting this idea, the ids are indeed not malformed when they are created and put into the db. Here's an example of one of them:

# Before going into the db
   6,  40, 199,   7, 225, 174, 112,   0, 178,  28,   0, 153, 149, 157,  24, 131

# The key buffer read out of the db for this key
<Buffer 00 b2 1c 00 99 95 9d 18 83>

# The id after being read from the db
                                      0, 178,  28,   0, 153, 149, 157,  24, 131

@tegefaulkes
Contributor Author

tegefaulkes commented May 24, 2022

I had a look at js-db and I'm guessing the problem is in parseLevel, src/utils.ts:155 in js-db. I can't confirm but I feel that it should be handling the if (b === esc[0]) case before the if (b === sep[0]) case. That is, it should be checking for escaped separators before handling the separators.

Referring to this part of the code.

  for (let i = 0; i < buf.byteLength; i++) {
    const b = buf[i];
    if (b === sep[0]) {
      // Note that `buf` is a subarray offset from the input
      // therefore the `sepEnd` must be offset by the same length
      sepEnd = i + (sepStart + 1);
      break;
    } else if (b === esc[0]) {
      const n = buf[i + 1];
      // Even if undefined
      if (n !== esc[0] && n !== sep[0]) {
        throw new errors.ErrorDBParseKey('Invalid escape sequence');
      }
      // Push the n
      levelBytes.push(b, n);
      // Skip the n
      i++;
    } else {
      levelBytes.push(b);
    }
  }
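The escaping requirement under discussion can be illustrated with a minimal self-contained sketch. This is illustrative only: the escape byte here is an assumption for the sketch and the real js-db encoding differs, but it shows why separator bytes occurring inside an ID must be escaped on write and unescaped (before separator handling) on parse.

```typescript
// Minimal escaping sketch, illustrative only; the escape byte and the
// real js-db key encoding differ. A key part may itself contain the
// 0x00 separator, so bare separators in parts must be escaped.
const SEP = 0x00; // level separator
const ESC = 0x01; // escape byte assumed for this sketch

function escapePart(part: Uint8Array): Uint8Array {
  const out: Array<number> = [];
  for (const b of part) {
    // Escape both the separator and the escape byte itself
    if (b === SEP || b === ESC) out.push(ESC);
    out.push(b);
  }
  return Uint8Array.from(out);
}

function unescapePart(encoded: Uint8Array): Uint8Array {
  const out: Array<number> = [];
  for (let i = 0; i < encoded.length; i++) {
    // An escape byte means "take the next byte literally"
    if (encoded[i] === ESC) i++;
    out.push(encoded[i]);
  }
  return Uint8Array.from(out);
}
```

With this scheme, an ID containing 0x00 bytes (like the failing NotificationIds above) round-trips intact, and a parser that handles the escape case before the separator case never misreads an escaped 0x00 as a level boundary.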

@CMCDragonkai
Member

CMCDragonkai commented May 24, 2022

@emmacasolin you should be able to go to js-db's tests/utils.test.ts. Look for the test parse key paths.

In there, there are several tests for different key possibilities.

You can take generated IDs, and stick them there. Put one of the failing IDs in.

The test checks that encoding it to a key and then decoding it back to a keypath is equivalent.

Do something like:

const keyPaths: Array<KeyPath> = [
  // ...existing keys being checked
  [generatedId1],
  [generatedId2],
  [generatedId3, generatedId4],
  [failingGeneratedId]
];
// ... continue

You can generate the ids directly in that test, or just copy-paste the buffers.

@tegefaulkes
Contributor Author

Out of the 5 failing proxy tests, 4 of them pass if we comment out the failing expect that checks for an error. The error itself is a 'write after end' error. I'm not sure if the error is a problem if we ignore it. I haven't had any luck finding the write that is the source of the error, but I haven't looked too deeply at the forward/reverse connections.

The last error is puzzling. I'm getting a type error from code internal to node.js.

Cannot read properties of null (reading 'finishShutdown')
TypeError: Cannot read properties of null (reading 'finishShutdown')
    at JSStreamSocket.finishShutdown (node:internal/js_stream_socket:160:12)
    at node:internal/js_stream_socket:147:14
    at processTicksAndRejections (node:internal/process/task_queues:78:11)
    at runNextTicks (node:internal/process/task_queues:65:3)
    at processImmediate (node:internal/timers:437:9)

I can't find anything that could be causing this. It seems to only be a problem with that one test. The test itself checks if the connection times out properly. It seems that the test is passing fine, and the error itself only happens after the `test()` and `afterEach()` complete.

I can't see any problem with the timeout mechanism in the `ReverseConnection`. I couldn't tell you where this is coming from, except that `js_stream_socket` is possibly related to the node.js `net` internal library.

Running theory is that we're not handling the shutdown of some network resource properly in this specific test case. I need to think on it some more.

@emmacasolin
Contributor

> Out of the 5 failing proxy tests. 4 of them pass if we comment out the failing expect that is checking for an error. [...] Running theory is that we're not handling the shutdown of some network resource properly in this specific test case.

I'm not sure if this is related to this issue, but when running npm install I get the following errors (these errors only show up in the terminal, they aren't caught by my IDE)

src/http/utils.ts:23:19 - error TS2345: Argument of type 'Duplex' is not assignable to parameter of type 'Socket'.
  Type 'Duplex' is missing the following properties from type 'Socket': connect, setTimeout, setNoDelay, setKeepAlive, and 8 more.

23       sockets.add(socket);
                     ~~~~~~

src/http/utils.ts:25:24 - error TS2345: Argument of type 'Duplex' is not assignable to parameter of type 'Socket'.

25         sockets.delete(socket);
                          ~~~~~~


Found 2 errors in the same file, starting at: src/http/utils.ts:23
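A likely cause is that newer @types/node versions type the stream handed to `'upgrade'`/`'connection'` handlers as `Duplex` rather than `net.Socket`. One possible fix is to narrow the type before tracking; this is a hedged sketch, not necessarily what `src/http/utils.ts` should do:

```typescript
import type { Duplex } from 'stream';
import * as net from 'net';

// Track only streams that really are TCP sockets
const sockets = new Set<net.Socket>();

function trackSocket(socket: Duplex): void {
  // Narrow Duplex to net.Socket so the Set's element type is satisfied
  if (socket instanceof net.Socket) {
    sockets.add(socket);
    // Stop tracking once the socket is closed
    socket.once('close', () => sockets.delete(socket));
  }
}

trackSocket(new net.Socket());
console.log(sockets.size); // prints 1
```

The `instanceof` guard both satisfies the compiler and skips any non-socket duplex streams at runtime.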

@emmacasolin
Contributor

> @emmacasolin you should be able to go to js-db's tests/utils.test.ts. Look for the test parse key paths. [...] You can generate the ids directly in that test, or just copy-paste the buffers.

The failing ids are indeed failing in this test. The problem occurs when you have two separators in the key part (this doesn't happen for the level part or when using escapes) that are both in front of other data. For example,

[Buffer.concat([utils.sep, Buffer.from('foobar'), utils.sep])] // passes - separators are not in between data
[Buffer.concat([Buffer.from('foobar'), utils.sep, Buffer.from('foobar')])] // passes - only one separator
[Buffer.concat([Buffer.from('foobar'), utils.sep, Buffer.from('foobar'), utils.sep])] // passes - only one of the separators is in front of data
[Buffer.concat([utils.sep, Buffer.from('foobar'), utils.sep, Buffer.from('foobar')])] // FAILS

With that last one, the original keypath is ['foobarfoobar'] but after being encoded and decoded it's ['foobar', 'foobar']
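The ambiguity can be reproduced with a toy parser. This naive split-on-separator parse (illustrative only, not js-db's actual `parseKey`) cannot distinguish a separator inside the terminal key part from a separator between parts:

```typescript
const sep = Buffer.from([0x00]); // illustrative separator byte

// Naive parse: split on every separator byte, with no awareness of where
// the terminal key part begins
function naiveParse(key: Buffer): Buffer[] {
  const parts: Buffer[] = [];
  let start = 0;
  for (let i = 0; i < key.length; i++) {
    if (key[i] === sep[0]) {
      parts.push(key.subarray(start, i));
      start = i + 1;
    }
  }
  parts.push(key.subarray(start));
  return parts;
}

// The failing case: one key part that happens to contain separator bytes
const key = Buffer.concat([
  sep,
  Buffer.from('foobar'),
  sep,
  Buffer.from('foobar'),
]);

// The single original part comes back split into multiple parts:
console.log(naiveParse(key).map((p) => p.toString())); // [ '', 'foobar', 'foobar' ]
```

Any fix has to either escape separator bytes inside parts or tell the parser where the terminal part starts, which is why the resolution ended up being an API change in js-db.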

@tegefaulkes
Contributor Author

I have stop-gap fixes for the problematic network tests. As I've said before, 4 of the 5 tests are fixed by just ignoring the error. It doesn't seem to cause a problem with the rest of the code base and may just be an issue specific to the tests.
The other test with the weird error mentioned here was fixed by setting the proxyHost when starting the proxy.

As it stands, all of the proxy tests are passing now. I can look deeper into the cause of the problems, but I feel at this point it is low priority. If we want to explore them further, I can make a new issue for it to be addressed later.

@CMCDragonkai
Member

Need to verify:

    "@grpc/grpc-js": "1.3.7",

Change to internals

    "@grpc/grpc-js": "1.6.7",

@CMCDragonkai
Member

CMCDragonkai commented May 25, 2022

There are 2 issues here that I need to have a look at:

  1. For the DB key parsing issue Updating keyPathToKey to escape key parts js-db#22 - the existence of the separator in the terminal part of the key path causes problems during parseKey, it gets parsed as 2 separate parts. This is because the parseKey doesn't know when to stop parsing. It needs to know when it's the last part and stop parsing any separators. Escaping the last part is not a good solution as this affects all iteration usage including outside the DB.
  2. The network tests are still an issue (Upgrading lib dependencies and node.js version #374 (comment)); I'm in the middle of copying back from the master branch what the existing network tests were and finding out the exact problem here. Furthermore, the network tests are still not using direct imports; this needs to be changed to avoid cross-cutting concerns during testing.

@CMCDragonkai
Member

CMCDragonkai commented May 25, 2022

The adding of the ClientMetadata in src/types.ts ends up causing src/types.ts to depend on the internals. This is a loop, and we need to prevent loops. Internal domains depend on higher levels, but high-level types cannot depend on internal domain types (which means we have to change where ClientMetadata goes).

@CMCDragonkai
Member

CMCDragonkai commented May 25, 2022

I'm getting type errors on expected number of arguments for the tests/grpc/utils. @emmacasolin can you update them according to your changes to GRPC:

 FAIL  tests/network/index.test.ts
  ● Test suite failed to run

    tests/grpc/utils/GRPCClientTest.ts:76:22 - error TS2554: Expected 3 arguments, but got 2.

    76     return grpcUtils.promisifyUnaryCall<utilsPB.EchoMessage>(
                            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    77       this.client,
       ~~~~~~~~~~~~~~~~~~
    78       this.client.unary,
       ~~~~~~~~~~~~~~~~~~~~~~~~
    79     )(...args);
       ~~~~~

      src/grpc/utils/utils.ts:369:3
        369   f: (...args: any[]) => ClientUnaryCall,
              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        An argument for 'f' was not provided.
    tests/grpc/utils/GRPCClientTest.ts:84:22 - error TS2554: Expected 3 arguments, but got 2.

    84     return grpcUtils.promisifyReadableStreamCall<utilsPB.EchoMessage>(
                            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    85       this.client,
       ~~~~~~~~~~~~~~~~~~
    86       this.client.serverStream,
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    87     )(...args);
       ~~~~~

      src/grpc/utils/utils.ts:470:3
        470   f: (...args: any[]) => ClientReadableStream<TRead>,
              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        An argument for 'f' was not provided.
    tests/grpc/utils/GRPCClientTest.ts:92:22 - error TS2554: Expected 3 arguments, but got 2.

     92     return grpcUtils.promisifyWritableStreamCall<
                             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     93       utilsPB.EchoMessage,
        ~~~~~~~~~~~~~~~~~~~~~~~~~~
    ... 
     97       this.client.clientStream,
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     98     )(...args);
        ~~~~~

      src/grpc/utils/utils.ts:542:3
        542   f: (...args: any[]) => ClientWritableStream<TWrite>,
              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        An argument for 'f' was not provided.
    tests/grpc/utils/GRPCClientTest.ts:103:22 - error TS2554: Expected 3 arguments, but got 2.

    103     return grpcUtils.promisifyDuplexStreamCall<
                             ~~~~~~~~~~~~~~~~~~~~~~~~~~
    104       utilsPB.EchoMessage,
        ~~~~~~~~~~~~~~~~~~~~~~~~~~
    ... 
    108       this.client.duplexStream,
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    109     )(...args);
        ~~~~~

      src/grpc/utils/utils.ts:653:3
        653   f: (...args: any[]) => ClientDuplexStream<TWrite, TRead>,
              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        An argument for 'f' was not provided.
    tests/grpc/utils/GRPCClientTest.ts:114:22 - error TS2554: Expected 3 arguments, but got 2.

    114     return grpcUtils.promisifyUnaryCall<utilsPB.EchoMessage>(
                             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    115       this.client,
        ~~~~~~~~~~~~~~~~~~
    116       this.client.unaryAuthenticated,
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    117     )(...args);
        ~~~~~

      src/grpc/utils/utils.ts:369:3
        369   f: (...args: any[]) => ClientUnaryCall,
              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        An argument for 'f' was not provided.
    tests/grpc/utils/GRPCClientTest.ts:127:22 - error TS2554: Expected 3 arguments, but got 2.

    127     return grpcUtils.promisifyReadableStreamCall<utilsPB.EchoMessage>(
                             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    128       this.client,
        ~~~~~~~~~~~~~~~~~~
    129       this.client.serverStreamFail,
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    130     )(...args);
        ~~~~~

      src/grpc/utils/utils.ts:470:3
        470   f: (...args: any[]) => ClientReadableStream<TRead>,
              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        An argument for 'f' was not provided.
    tests/grpc/utils/GRPCClientTest.ts:135:22 - error TS2554: Expected 3 arguments, but got 2.

    135     return grpcUtils.promisifyUnaryCall<utilsPB.EchoMessage>(
                             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    136       this.client,
        ~~~~~~~~~~~~~~~~~~
    137       this.client.unaryFail,
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    138     )(...args);
        ~~~~~

      src/grpc/utils/utils.ts:369:3
        369   f: (...args: any[]) => ClientUnaryCall,
              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        An argument for 'f' was not provided.

I think you didn't update these when the errors metadata was meant to be put in.

Alternatively, to make them backwards compatible, we can default the parameter to be an empty object.
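Defaulting a newly added trailing parameter is the usual trick for this. A hypothetical sketch (the names and signature below are illustrative, not the actual grpcUtils API):

```typescript
// Hypothetical promisify-style helper: `metadata` is the newly added
// parameter; giving it a default keeps older two-argument call sites compiling.
function promisifyCallSketch(
  client: { name: string },
  f: () => string,
  metadata: Record<string, unknown> = {},
): () => Promise<string> {
  return async () =>
    `${client.name}:${f()}:meta(${Object.keys(metadata).length})`;
}

// Old two-argument call site still type-checks and runs:
const call = promisifyCallSketch({ name: 'client' }, () => 'unary');
call().then((result) => console.log(result)); // prints client:unary:meta(0)
```

The trade-off is that callers can silently omit metadata, which is why the thread leans towards updating the call sites instead.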

@CMCDragonkai
Member

Experiment with network fixes is here: #376; I'll get to this after the js-db issues are fixed. In the meantime, @tegefaulkes, after all the other things are fixed, it's best to have a look at it.

@tegefaulkes
Contributor Author

All except for 2 tests are passing. Almost done.

@CMCDragonkai
Member

CMCDragonkai commented May 25, 2022

@emmacasolin when you're fixing up the vaults or other test domains please replace imports to be more specific where you see fit.

These 2 will still need to be fixed:


Quick update regarding js-db and the key-parsing issue. Fixing the ability to have separators in the terminal part will involve an API change to how iterators work. Details are now in MatrixAI/js-db#19. I'll have to cherry-pick what I have there into staging since the changes are also combined with the ability to use keyAsBuffer and valueAsBuffer. This will end up affecting EFS and PK where you are using iterators.

@emmacasolin
Contributor

@tegefaulkes I've replaced the places in the tests where you had used {} as ClientMetadata with actual/mocked data, as this information is required in order for ErrorPolykeyRemotes to be constructed. Therefore, in those tests the NodeId, Host, Port, and Command name are now either taken from constants already used in the tests or mocked. Note that I didn't fix this for rpcVaults since I'm splitting those anyway, and part of doing that is constructing the client in the same way as the other tests (which does not require the metadata to be set manually since it uses an actual client).

Also, there were some failing tests in grpc/utils.test.ts due to more recent errors changes. Specifically, the descriptions for Polykey errors were not being serialised/deserialised; that is fixed now. There was also a very brittle test expecting error messages to be empty for ErrorPolykeyUnknown; however, these messages are constructed in our reviver (and could change), so I removed the tests for the message and just added a note above the test explaining how the unknown data is encoded (because this is also implemented in the reviver).

@tegefaulkes
Contributor Author

Cool. I think with that last commit all of the tests are passing, except for the random failures due to the DB bug.

I think @CMCDragonkai mentioned that we could make the ClientMetadata optional, but since everything is fixed to be using it properly now, I'm not sure it's needed.

I should mention that I changed how the metadata was converted in the ErrorPolykeyRemote toJSON and fromJSON methods. Some tests were breaking because metadata wasn't revived properly when it was coming from a remote agent, e.g. a chain of ErrorPolykeyRemote > ErrorPolykeyRemote > someError.

I'm going to go look at the proxy network errors for now.
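The metadata conversion described above roughly amounts to serialising and reviving the cause chain recursively. A minimal sketch, assuming a simplified error shape (this is not the actual ErrorPolykeyRemote implementation):

```typescript
interface RemoteMeta {
  nodeId: string;
  host: string;
  port: number;
}

// Simplified stand-in for ErrorPolykeyRemote with metadata and a cause
class RemoteErrorSketch extends Error {
  constructor(
    public metadata: RemoteMeta,
    message: string,
    public causeError?: Error,
  ) {
    super(message);
  }

  // Serialise metadata and recurse into nested remote errors
  toJSON(): any {
    return {
      metadata: this.metadata,
      message: this.message,
      cause:
        this.causeError instanceof RemoteErrorSketch
          ? this.causeError.toJSON()
          : this.causeError?.message,
    };
  }

  // Revive the chain, reconstructing nested remote errors recursively
  static fromJSON(json: any): RemoteErrorSketch {
    let cause: Error | undefined;
    if (json.cause != null && typeof json.cause === 'object') {
      cause = RemoteErrorSketch.fromJSON(json.cause);
    } else if (json.cause != null) {
      cause = new Error(json.cause);
    }
    return new RemoteErrorSketch(json.metadata, json.message, cause);
  }
}

// A chain like ErrorPolykeyRemote > ErrorPolykeyRemote > someError
const inner = new RemoteErrorSketch(
  { nodeId: 'b', host: '127.0.0.2', port: 2 },
  'inner remote',
  new Error('someError'),
);
const outer = new RemoteErrorSketch(
  { nodeId: 'a', host: '127.0.0.1', port: 1 },
  'outer remote',
  inner,
);
const revived = RemoteErrorSketch.fromJSON(
  JSON.parse(JSON.stringify(outer.toJSON())),
);
console.log(revived.causeError instanceof RemoteErrorSketch); // prints true
```

Without the recursion in `fromJSON`, the nested remote error would come back as a plain object and lose its metadata, which is the breakage the tests were hitting.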

emmacasolin and others added 26 commits June 7, 2022 11:43
The problem was with the tests themselves, so I've fixed the tests. If any similar issues crop up later, we can explore the underlying cause in depth. However, it is low priority for now.
There were places where ClientMetadata was not being provided to grpc calls (`{} as ClientMetadata`) however this information is required in order to construct ErrorPolykeyRemote. Where metadata is needed in tests it is either now taken from constants used in the tests or mocked. There was also an issue with error descriptions not being serialised/deserialised which is now fixed. Also some brittle test cases for error messages were removed.
Had to fix the usage in 5 domains.
It was just a problem with trying to connect to '0.0.0.0' with the UTP socket.
Having `ClientMetadata` in the top-level types caused an import loop. By moving it and `ErrorPolykeyRemote` to the grpc domain this loop is resolved. This was the best option since now all usage of `ClientMetadata` is inside the grpc domain.
We only want to log out server errors on the server side, so we now apply filters on errors before logging them.
Sometimes an error is only considered a client error when part of the chain of another error (for example ErrorPolykeyRemote) so it is now possible to specify this array.

#304
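The error-filtering commits above amount to walking an error's cause chain before deciding whether to log it server-side. A hedged sketch (the class names and the cause-chain shape here are assumptions, not Polykey's actual error hierarchy):

```typescript
// Toy error classes with an explicit cause chain
class ChainErrorSketch extends Error {
  constructor(message: string, public causeError?: Error) {
    super(message);
  }
}
class RemoteSketch extends ChainErrorSketch {}
class ClientBadRequestSketch extends ChainErrorSketch {}

type ErrorClass = new (...args: any[]) => Error;

// An error is treated as a client error if it, or anything in its cause
// chain, is an instance of one of the listed client error classes
function isClientError(e: Error, clientClasses: ErrorClass[]): boolean {
  let current: Error | undefined = e;
  while (current !== undefined) {
    const cur = current;
    if (clientClasses.some((C) => cur instanceof C)) return true;
    current = cur instanceof ChainErrorSketch ? cur.causeError : undefined;
  }
  return false;
}

const wrapped = new RemoteSketch('remote', new ClientBadRequestSketch('bad'));
console.log(isClientError(wrapped, [ClientBadRequestSketch])); // prints true
console.log(isClientError(new RemoteSketch('remote'), [ClientBadRequestSketch])); // prints false
```

Chain-awareness is the point: a bad request wrapped inside a remote error should still be filtered out of the server's logs.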
@emmacasolin emmacasolin force-pushed the feature-dep-upgrades branch from 8c80f02 to c37eddb Compare June 7, 2022 01:43
@emmacasolin emmacasolin marked this pull request as ready for review June 7, 2022 01:48
@emmacasolin emmacasolin merged commit 631a0da into staging Jun 7, 2022