Upgrading lib dependencies and node.js version #374
Conversation
This PR needs to target the staging branch when that gets created.
The vaults client service tests are the only ones that aren't split like the others into separate test files. They also set up the client service differently from the other tests and are missing some parameters due to the errors changes. Going to split and align them with the other tests to try and get them running.
I'm looking into merging the vault's EFS db with the main DB. At first blush it seems easy: we can just provide the DB to the EFS constructor. However, I don't think that the DB supports prefixes/sublevels anymore. Do we need to re-implement that on the DB before we can continue with this? @CMCDragonkai
It would require the DB to maintain knowledge of a levelpath to be prepended to any queries. But other questions remain: what about the root paths? Do the sub-DBs maintain their own transactions and indexes, or are they shared with the top DB? Anyway, it's not important for now; we can address it later.
Tests in
If I had to guess, there is a problem with the iterator reading the keys out properly. It could be that a malformed ID is used when adding the message to the DB. But it's also possible there is a bug with the DB's handling of the path delimiter. This could happen if the ID itself contains the delimiter and that wasn't handled properly. I'd start with checking if the ID is malformed before it's put into the database. If it's the delimiter problem then the IDs that fail should always contain the same character.
The delimiter in the database is a 0 byte. I'm not sure if the js-id case is covered well enough, but @emmacasolin you can check the js-db tests. Just make sure that this branch is actually using the latest js-db.
Spec for this: I've currently split all of the vaults client service tests (not secrets); however, there were no existing tests for some of the handlers, so those are currently todos. Tests that have been split into separate files and are passing:
Tests that have been split into separate files but are still todo:
Tests that are written but are yet to be split:
I'm noticing a pattern with the malformed ids. If you look at the valid ids, they all look something like this:
Note that they all only contain one 0. Looking at the malformed ids, these all have a second 0 somewhere in the array:
Since 0 is the delimiter, it's likely that these extra zeros are being misinterpreted. Notification IDs are just randomly generated lexicographic integers, and these can contain zeros. The same applies to Discovery Queue IDs, so this is likely why I was seeing occasional random failures in those tests in the past as well. I'll have to take a look at the IDs before they ever get put into the db in order to confirm this theory. Supporting this idea, the IDs are indeed not malformed when they are created and put into the db. Here's an example of one of them:
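To sanity-check that theory, here's a quick sketch that counts delimiter bytes in a generated ID buffer before it goes into the db (the generator call is hypothetical):

```ts
// Hypothetical check: count 0 bytes in a generated ID buffer.
// Valid IDs above contain a single 0; malformed ones contain a second 0
// that collides with the key delimiter.
function countZeroBytes(id: Uint8Array): number {
  let zeros = 0;
  for (const byte of id) {
    if (byte === 0) zeros++;
  }
  return zeros;
}

// Usage sketch: flag suspicious IDs before they are put into the db
// const id = generateNotificationId(); // hypothetical generator
// if (countZeroBytes(id) > 1) console.warn('ID contains an extra delimiter byte');
```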
I had a look at the key parsing in js-db. Referring to this part of the code:

```ts
for (let i = 0; i < buf.byteLength; i++) {
const b = buf[i];
if (b === sep[0]) {
// Note that `buf` is a subarray offset from the input
// therefore the `sepEnd` must be offset by the same length
sepEnd = i + (sepStart + 1);
break;
} else if (b === esc[0]) {
const n = buf[i + 1];
// Even if undefined
if (n !== esc[0] && n !== sep[0]) {
throw new errors.ErrorDBParseKey('Invalid escape sequence');
}
// Push the n
levelBytes.push(b, n);
// Skip the n
i++;
} else {
levelBytes.push(b);
}
}
```
@emmacasolin you should be able to go to js-db's tests. In there, there are several tests for different key possibilities. You can take generated IDs and stick them there. Put one of the failing IDs in. The test checks if encoding it to a key and then decoding it back to a keypath is equivalent. Do something like:

```ts
const keyPaths: Array<KeyPath> = [
// ...existing keys being checked
[generatedId1],
[generatedId2],
[generatedId3, generatedId4],
[failingGeneratedId]
];
// ... continue
```

You can generate the IDs directly in that test, or just copy-paste the buffers.
Out of the 5 failing proxy tests, 4 of them pass if we comment out the failing check. The last error is puzzling: I'm getting a type error from code internal to Node.js.
I'm not sure if this is related to this issue, but when running
The failing IDs are indeed failing in this test. The problem occurs when you have two separators in the key part (this doesn't happen for the level part or when using escapes) that are both in front of other data. For example:

```ts
[Buffer.concat([utils.sep, Buffer.from('foobar'), utils.sep])] // passes - separators are not in between data
[Buffer.concat([Buffer.from('foobar'), utils.sep, Buffer.from('foobar')])] // passes - only one separator
[Buffer.concat([Buffer.from('foobar'), utils.sep, Buffer.from('foobar'), utils.sep])] // passes - only one of the separators is in front of data
[Buffer.concat([utils.sep, Buffer.from('foobar'), utils.sep, Buffer.from('foobar')])] // FAILS
```

With that last one, the original keypath doesn't round-trip correctly.
I have stopgap fixes for the problematic network tests. As I've said before, 4 of the 5 tests are fixed by just ignoring the error. It doesn't seem to cause a problem with the rest of the code base and may just be an issue specific to the tests. As it stands, all of the proxy tests are passing now. I can look deeper into the cause of the problems, but I feel at this point it is low priority. If we want to explore them deeper then I can make a new issue for it to be addressed later.
Need to verify:
Change to internals
There are 2 issues here that I need to have a look at:
The adding of the
I'm getting type errors on expected number of arguments for the
I think you didn't update these when the errors metadata was meant to be put in. Alternatively, to make them backwards compatible, we can default the parameter to an empty object.
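A minimal sketch of that backwards-compatible default (the function and type here are hypothetical stand-ins):

```ts
// Hypothetical stand-ins: defaulting `metadata` keeps older call sites
// (which omit the argument) compiling, while newer call sites pass it in.
type ClientMetadata = { nodeId?: string; host?: string; port?: number };

function makeRemoteError(
  message: string,
  metadata: ClientMetadata = {},
): Error {
  // `metadata` is always an object here, even when the caller omits it
  return new Error(`${message} ${JSON.stringify(metadata)}`);
}
```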
Experiment with network fixes is here: #376. I'll get to this after the js-db issues are fixed. In the meantime, @tegefaulkes, after all the other things are fixed, best to have a look at it.
All except for 2 tests are passing. Almost done.
@emmacasolin, when you're fixing up the vaults or other test domains, please replace imports to be more specific where you see fit. These 2 will still need to be fixed:
Quick update regarding the js-db and the key-parsing issue. Fixing the ability to have separators in the terminal part will involve an API change to how iterators work. Details are now in MatrixAI/js-db#19. I'll have to cherry-pick what I have there into staging since the changes are also combined with the ability to use
@tegefaulkes I've replaced the places in the tests where you had used. Also there were some failing tests in
Cool. I think with that last commit all of the tests are passing, except for the random failures due to the DB bug. I think @CMCDragonkai mentioned that we could make the… I should mention that I changed how the metadata was converted in the… I'm going to go look at the proxy network errors for now.
The problem was with the tests themselves, so I've fixed the tests. If any similar issues crop up later, we can explore the underlying cause in depth. However, it is low priority for now.
There were places where `ClientMetadata` was not being provided to grpc calls (`{} as ClientMetadata`); however, this information is required in order to construct `ErrorPolykeyRemote`. Where metadata is needed in tests, it is now either taken from constants used in the tests or mocked. There was also an issue with error descriptions not being serialised/deserialised, which is now fixed. Also, some brittle test cases for error messages were removed.
Had to fix the usage in 5 domains.
It was just a problem with trying to connect to '0.0.0.0' with the UTP socket.
Having `ClientMetadata` in the top-level types caused an import loop. Moving it and `ErrorPolykeyRemote` to the grpc domain resolves this loop. This was the best option, since now all usage of `ClientMetadata` is inside the grpc domain.
We only want to log out server errors on the server side, so we now apply filters on errors before logging them.
Sometimes an error is only considered a client error when it is part of the chain of another error (for example `ErrorPolykeyRemote`), so it is now possible to specify this array. #304
Force-pushed from 8c80f02 to c37eddb.
Description
This PR has been copied from #366 under a new branch to allow for branch and CI/CD protections.
Several core libraries have been updated and so they need to be integrated into PK. Refer to MatrixAI/js-encryptedfs#63 to see how this was handled for EFS and repeat here.
- `@matrixai/async-init` - no major changes
- `@matrixai/async-locks` - all instances of `Mutex` should change to use `Lock` or `RWLockWriter`
- `@matrixai/db` - the DB now supports proper DB transactions and has introduced `iterator`, however there are no more sublevels (use `KeyPath`/`LevelPath` instead)
- `@matrixai/errors` - we make all of our errors extend `AbstractError<T>` and also provide static descriptions to all of them, as well as use the `cause` chain
- `@matrixai/workers` - no major changes here
- node.js - we should upgrade to Node 16 in order to integrate promise cancellation, as well as using `Promise.any` for the ICE protocol
- `@matrixai/resources` - since `@matrixai/db` no longer does any locking, the acquisition of the `DBTransaction` and `Lock` has to be done together with `withF` or `withG`
Along the way, we can explore how to deal with indexing #188 and #257, which should be easier now that DB has root levels of `data` and `transactions`.

Locking changes
There were three important types of locks brought in by `js-async-locks`: `Lock` (equivalent to the `Mutex` which we were using previously), `RWLockReader`, and `RWLockWriter`. In most cases, we are removing locks in favour of using optimistic concurrency control (OCC). This means that most domain locks should be removed with a few exceptions:
- The queue used by `Discovery.ts`, the `NotificationId`s and message count used by `NotificationsManager.ts`, the `ClaimId`s and sequence numbers used by `Sigchain.ts`, and, if the set/ping node or refresh buckets queues introduced in Testnet Deployment #326 become persistent, then we would need locking there as well. This could be achieved for these cases by introducing a `LockBox` to the affected domains and locking the relevant keys when we access the db, e.g. `withF([this.db.transaction(), this.locks.lock(...lockRequests)], async ([tran]) => {})` where `this.locks` is a `LockBox` and `...lockRequests` is an array of `[KeyPath, Lock]` (see EFS `INodeManager.ts`); a sketch of this pattern follows below.
- In `NodeConnectionManager` and `VaultManager` we need to lock groups of objects; this can be done using a `LockBox` where we are referencing the same ID each time we want to lock a specific object.

Everywhere else we expect that conflicts will be rare, so we don't use locks in order to simplify our codebase. In the case of a conflict, we can either retry (if safe) or bubble up the error to the user.
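A minimal sketch of the `LockBox` pattern described above, assuming the `withF`/`LockBox` APIs from `@matrixai/resources` and `@matrixai/async-locks` and a hypothetical counter key:

```ts
import { withF } from '@matrixai/resources';
import { LockBox, Lock } from '@matrixai/async-locks';
import type { DB } from '@matrixai/db';

class NotificationsManagerSketch {
  protected locks: LockBox<Lock> = new LockBox();

  constructor(protected db: DB) {}

  public async incrementCount(): Promise<void> {
    // Acquire the transaction and the key lock together, so counter
    // updates are serialised while everything else relies on OCC
    await withF(
      [this.db.transaction(), this.locks.lock(['notifications/count', Lock])],
      async ([tran]) => {
        const count = (await tran.get<number>(['notifications', 'count'])) ?? 0;
        await tran.put(['notifications', 'count'], count + 1);
      },
    );
  }
}
```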
Errors changes
The new `js-errors` allows us to bring in error chaining, along with more standardised JSON serialisation/deserialisation (for sending errors across gRPC). With this error chaining ability, there are now three ways that we can handle/propagate errors:

1. Re-raise/re-throw - catch the error and throw the original error again.
2. Error override - catch the error and throw a new error in its place.
3. Error chain - catch the error and throw a new error with the original error set as its `cause`.

In all places where we are catching one error and throwing a different error in its place, we should be using approach 3 (error chain). If we just want to bubble the original exception upwards then use approach 1 (re-raise/re-throw). Finally, if we want to hide the original error from the user (perhaps it contains irrelevant implementation details or could be confusing and thus requires additional context) we can use approach 2 (error override). There is a fourth approach that exists in Python for errors that occur as a direct result of handling another error; however, this does not exist in TypeScript (in such a case we would use approach 3). When using approach 2 (and in some cases approach 3) you may want to log out the original error in addition to throwing the new error.
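A minimal sketch of the three approaches (the error class here is a hypothetical stand-in for an `ErrorPolykey` subclass):

```ts
class ErrorVaultsReadFailed extends Error {} // hypothetical stand-in

async function doRead(): Promise<void> {
  throw new Error('low level failure');
}

async function approach1(): Promise<void> {
  try {
    await doRead();
  } catch (e) {
    throw e; // 1. re-raise/re-throw: bubble the original error upwards
  }
}

async function approach2(): Promise<void> {
  try {
    await doRead();
  } catch (e) {
    // 2. error override: hide the original error behind a new one
    throw new ErrorVaultsReadFailed('vault read failed');
  }
}

async function approach3(): Promise<void> {
  try {
    await doRead();
  } catch (e) {
    // 3. error chain: keep the original error as the `cause`
    throw new ErrorVaultsReadFailed('vault read failed', { cause: e });
  }
}
```

Note that the `{ cause }` constructor option requires Node 16.9+ (or the `cause` support provided by `AbstractError`).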
JSON serialisation/deserialisation
When sending errors between agents/from client to agent we need to serialise errors (including the error chain if this exists). Then, on the receiving side, we need to be able to deserialise back into the original error types.
We are able to do this using `JSON.stringify()` (serialisation) and `JSON.parse()` (deserialisation). These methods allow us to pass in a replacer/reviver to aid with converting our error data structure, as well as being combined with `toJSON()` and `fromJSON()` utility methods on the error class itself. These are implemented on `AbstractError` from `js-errors`; however, we need to extend these to work with `ErrorPolykey` in order to handle the additional `exitCode` property. While `toJSON()` can simply call `super.toJSON()` and add in the extra field, `fromJSON()` needs to be completely reimplemented (although this can be copied from `AbstractError` for the most part). Similarly, the replacer and reviver can be based on the replacer and reviver used in the `js-errors` tests.
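As a hedged sketch of the `toJSON()` extension (the exact JSON shape produced by `AbstractError` is assumed here, so treat the `data` field as illustrative):

```ts
import { AbstractError } from '@matrixai/errors';

class ErrorPolykeySketch<T> extends AbstractError<T> {
  static description = 'Polykey error';
  public exitCode: number = 1;

  public toJSON(): any {
    const json = super.toJSON();
    // Layer the Polykey-specific field on top of AbstractError's output
    json.data.exitCode = this.exitCode;
    return json;
  }
}
```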
ErrorPolykeyRemote
Errors are propagated between agents and clients as follows: `ErrorPolykeyRemote` is constructed on the client side (not the server side) inside our gRPC `toError()` utility. After the received error is deserialised, it is wrapped as the cause property of a new `ErrorPolykeyRemote`, which should also contain the `nodeId`, `host`, and `port` of the agent that originally sent the error in its data property. In order to access this information, it needs to be passed through from wherever the client/agent method is called (this would be bin commands for the client service and domain methods for the agent service). The data can then be passed through to the `promisify...()` methods, which in turn call `toError()`.
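A rough sketch of that client-side wrapping (all helpers here are hypothetical stand-ins for the real grpc utilities):

```ts
type ClientMetadata = { nodeId: string; host: string; port: number };

// Hypothetical stand-in for the real ErrorPolykeyRemote
class ErrorPolykeyRemote extends Error {
  constructor(
    public data: ClientMetadata,
    message: string,
    options: { cause?: Error } = {},
  ) {
    super(message, options);
  }
}

// Hypothetical stand-in for the reviver-based deserialisation
function deserializeError(serialized: string): Error {
  return new Error(JSON.parse(serialized).message);
}

// The received error becomes the `cause` of a new ErrorPolykeyRemote
// carrying the sending agent's nodeId/host/port in its data property
function toError(serialized: string, metadata: ClientMetadata): Error {
  const remoteError = deserializeError(serialized);
  return new ErrorPolykeyRemote(metadata, remoteError.message, {
    cause: remoteError,
  });
}
```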
Testing
Now that we have an error chain, we need to adjust our tests to be able to perform checks on these. In many cases where we were originally expecting some specific `ErrorPolykey` we will now be receiving an `ErrorPolykeyRemote` with the original error in its cause property. For simple cases like this it is simple enough to just perform the existing checks on the cause property of the received error rather than the top-level error; however, this approach becomes more complicated for longer error chains. Additionally, we may want to perform checks on the top-level `ErrorPolykeyRemote` (such as checking the metadata for the sending agent). In this case, it would be useful to create an expectation utility that allows one to perform checks on the entire error chain, from the top-level error to the final error in the chain. This could look something like this:
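Since the original snippet was not preserved here, this is one possible shape for such a utility, as a hedged sketch using jest:

```ts
type ErrorExpectation = {
  errorClass: new (...args: Array<any>) => Error;
  message?: string;
  properties?: Record<string, unknown>;
};

// Walk the `cause` chain, checking each error against an expectation
function expectErrorChain(
  error: unknown,
  expectations: Array<ErrorExpectation>,
): void {
  let current: unknown = error;
  for (const expectation of expectations) {
    expect(current).toBeInstanceOf(expectation.errorClass);
    const e = current as Error & { cause?: unknown };
    if (expectation.message != null) {
      expect(e.message).toBe(expectation.message);
    }
    if (expectation.properties != null) {
      expect(e).toMatchObject(expectation.properties);
    }
    current = e.cause;
  }
}

// Usage sketch: top-level ErrorPolykeyRemote with the original error as cause
// expectErrorChain(err, [
//   { errorClass: ErrorPolykeyRemote, properties: { nodeId } },
//   { errorClass: ErrorVaultsVaultUndefined },
// ]);
```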
We could also think about using the `toJSON()` method on each error to allow us to use jest's `expect().toMatchObject()` matcher rather than having to check every error property individually. Also potentially including parameters to specify which properties of the error you do and don't want to check against.

Additional context
- `js-errors`
Database changes
Changes include but are not limited to:

- Use `withF`/`withG` locking directly.
- The `ErrorDBTransactionConflict` error should never be seen by the user. We should catch and override it with a more descriptive error for the context.
- No more sublevels; use `LevelPath` and `KeyPath`s instead.
- `db.put`, `db.get` and `db.del` should be using transactions via `tran.put`/`tran.get`/`tran.del`.

This applies to all domains that make use of DB, OR domains that depend on others that make use of DB. The goal here is to make any operation, even those starting from the handlers, atomic.
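A minimal sketch of the migration direction (the domain method and key paths are hypothetical):

```ts
import { withF } from '@matrixai/resources';
import type { DB, DBTransaction } from '@matrixai/db';

// Before: `await db.put(['vaults', vaultId, 'name'], name)` directly.
// After: operate through the transaction so callers can compose atomically.
async function renameVault(
  db: DB,
  vaultId: string,
  name: string,
  tran?: DBTransaction,
): Promise<void> {
  if (tran == null) {
    // Convenience path: create our own transaction context
    return withF([db.transaction()], async ([tran]) =>
      renameVault(db, vaultId, name, tran),
    );
  }
  await tran.put(['vaults', vaultId, 'name'], name);
}
```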
There are limitations to this, however. Since a transaction can fail if there are overlapping edits between transactions, we can't really include changes to the db that will commonly or always conflict. Examples of this are counters or commonly updated fields. So far this has been seen in:

- `NotificationsManager` - makes use of a counter, so any transactions that include adding or removing a notification WILL conflict. Reads also update metadata, so concurrently reading the same message WILL conflict.

In some cases we will need to make use of locking along with a transaction. A good example of this is in the `NotificationsManager`, where we are locking the counter update. When this is the case we need to take extra care with the locking: unless the lock wraps the whole transaction, it is still possible to conflict on the transaction. We can't compose operations that rely on this locking with larger transactions. An example of this problem is sketched below.
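A hedged sketch of that composition hazard (names and key paths are hypothetical):

```ts
import { withF } from '@matrixai/resources';
import { Lock } from '@matrixai/async-locks';
import type { DB } from '@matrixai/db';

const countLock = new Lock();

// Safe on its own: the lock wraps the whole transaction, so two calls
// serialise and never conflict on the counter key
async function incrementCount(db: DB): Promise<void> {
  await countLock.withF(async () => {
    await withF([db.transaction()], async ([tran]) => {
      const count = (await tran.get<number>(['count'])) ?? 0;
      await tran.put(['count'], count + 1);
    });
  });
}

// Hazardous composition: the enclosing transaction commits after the lock
// is released, so a concurrent incrementCount() can still conflict with it
async function largerOperation(db: DB): Promise<void> {
  await withF([db.transaction()], async ([tran]) => {
    await countLock.withF(async () => {
      const count = (await tran.get<number>(['count'])) ?? 0;
      await tran.put(['count'], count + 1);
    });
    // ... more work inside the same (still-uncommitted) transaction
  });
}
```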
This means that some operations or domains can't be composed with larger transactions. It has yet to be seen if this will cause an issue, since more testing is required to confirm any problem. I suppose this means we can't mix pessimistic and optimistic transactions. So far it seems it will be a problem with the following domains.
Node.js changes
After upgrading to Node v16 we will be able to bring in some new features.
- `Promise.any` - we are currently using `Promise.all` in our `pingNode()` method in `NodeConnectionManager` (this is being done in Testnet Deployment #326) but this should change to using `Promise.any`. This is because `Promise.all` waits for every promise to resolve/reject, however we only care about whichever finishes first and we want to cancel the rest. This change can be made in Testnet Deployment #326 after this PR is merged.
- `AggregateError` - this error is emitted by `Promise.any` if all of the given promises are rejected. In our case this would mean that we were unable to ping the desired node via direct connection or signalling (and eventually relaying once this is implemented), so we may want to catch this and re-throw some other error to represent this. We will also need to add this into our error serialisation/deserialisation.
- `AbortController` - there are a number of places that have been identified in Asynchronous Promise Cancellation with Cancellable Promises, AbortController and Generic Timer #297 where we could use the new `AbortController`/`AbortSignal`. This can be done in a separate PR when rebasing Testnet Deployment #326.
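A small sketch of the `Promise.any`/`AggregateError` flow described above (the attempts array is hypothetical):

```ts
async function pingNodeSketch(attempts: Array<Promise<void>>): Promise<void> {
  try {
    // Resolves as soon as the first attempt succeeds; the remaining
    // attempts should then be cancelled rather than awaited
    await Promise.any(attempts);
  } catch (e) {
    if (e instanceof AggregateError) {
      // Every attempt failed: direct connection, signalling, etc.
      throw new Error('could not ping node', { cause: e });
    }
    throw e;
  }
}
```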
Issues Fixed

- `VaultInternal` `with` context API functions with `NodeConnectionManager` #356
- `NodeGraph` bucket operations #244
- `vaults` domain #257
- `@matrixai/js-file-locks` to introduce RWlocks in IPC #290 - this will only be important for file locking related operations for Inter-Process communication

Tasks
- All errors extend `AbstractError` and use `cause` chain and static descriptions
- `db.get` and `db.put` and `db.del` to use `KeyPath`
- No more `batch` and domain locking; instead migrate to using `DBTransaction` and the `withF` and `withG` from `@matrixai/resources`
- Replace `createReadStream`, `createKeyStream` and `createValueStream` with `db.iterator`
- `tran?: DBTransaction` optional parameter in the last parameter to allow one to compose a transaction context
- If possible, get EFS to use the new DB as well, which can be done by making `DB` take a "prefix", thus recreating sublevels, but only for this usecase. This would mean both EFS and PK use the same DB, with EFS completely controlling a lower level. Put on hold: requires an update to `js-db` and is low priority. Make a new issue for this?
- `RWLock` can be used instead when doing concurrency control on the `DBTransaction` to raise the isolation level to avoid non-repeatable reads, phantom reads, or lost updates. SI means that non-repeatable reads and phantom reads are not possible, so that waits on the SI merge, while lost updates are where the counter race condition occurred, and that was resolved with just the `Lock`.
- `@matrixai/async-locks` to use the `Lock` class instead of `Mutex` from `async-mutex`
- `@matrixai/resources` `withF` and `withG` to replace transactions
- `ResourceAcquire` making use of error handling.
- `DB.iterator`/`tran.iterator` as well as `keyAsBuffer`/`valueAsBuffer`
- `decrypt` and latest DB
- `ClientMetadata` and `ErrorPolykeyRemote` under the `grpc` domain
- `utils.promisify` here with Introduce Snapshot Isolation OCC to DBTransaction js-db#19 (comment)

Tasks are divided by domain for conversion. Assignments are as follows. Domains labelled 'last' depend on a lot of other domains and will be divided later.
Final checklist
TBD spec:
Error changes:
Testing for an error with a cause.
DB changes:
domains - Each domain used to have a `transact` or `withTransaction` wrapper. Unless this wrapper actually needs some additional context, and it is just calling `withF` and `withG` directly, then avoid this and just replace it with direct usage of the `withF` and `withG` utilities. We should identify domains that require additional context; in those cases we continue to use a with-transaction wrapper.
The transaction, and its utilisation of SI transactions:
In all cases, you just throw this exception up, and propagate it all the way to the client.
The pk client however will change the message to be more specific to the user's action.
Retrying this can be identified and sprinkled into the codebase afterwards.
Work out a way to write a user-friendly message when an `ErrorDBTransactionConflict` occurs in the PolykeyClient (at the CLI).
Identify where automatic retries can occur, and make those catch the `ErrorDBTransactionConflict` and resubmit the transaction.
Identify write skews; this is where we must use solutions for dealing with write skews => locking and materialising the write conflict.
Identify serialisation requirements - counter updates, and where PCC is demanded as part of the API, in those situations use the LockBox or lock
Handlers need to now start the transaction, for both agent and client.
We have `tran?: DBTransaction`, which allows public methods to create their own transaction context, but this is only for debugging and convenience.
During production, the transaction context is meant to be set up at the handler level.
This means that each handler is its own atomic operation.
By default they start out at the handler level; this will make it easier for us to identify our atomic operations and to compare them.
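A hedged sketch of the handler-level transaction context (handler and domain method names are hypothetical):

```ts
import { withF } from '@matrixai/resources';
import type { DB, DBTransaction } from '@matrixai/db';

// Domain method: receives the transaction, never creates one in production
async function notificationsClear(tran: DBTransaction): Promise<void> {
  await tran.del(['notifications', 'count']);
}

// Handler: owns the transaction, so the whole handler is one atomic operation
async function notificationsClearHandler(db: DB): Promise<void> {
  await withF([db.transaction()], async ([tran]) => {
    await notificationsClear(tran);
    // any other domain calls made here share the same atomic transaction
  });
}
```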
Tests - introduce concurrency tests; consider some new concurrency expectation combinators that allow us to more easily say one of these results has to be X, and the rest have to be Y.
Concurrency tests can be introduced domain by domain - use `Promise.allSettled`.
We can however focus our concurrency testing at the handler level, because we expect that to be the unit of atomic operations.
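One possible shape for such a combinator, as a hedged sketch on top of `Promise.allSettled`:

```ts
// Assert that exactly one concurrent operation fulfils while the rest
// reject (e.g. with ErrorDBTransactionConflict)
async function expectOneFulfilled<T>(
  promises: Array<Promise<T>>,
): Promise<void> {
  const results = await Promise.allSettled(promises);
  const fulfilled = results.filter((r) => r.status === 'fulfilled');
  const rejected = results.filter((r) => r.status === 'rejected');
  expect(fulfilled).toHaveLength(1);
  expect(rejected).toHaveLength(promises.length - 1);
}
```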
All ops removed; changed to using keypath and levelpath.
In the iterator `tran.iterator` you will always have the `key` no matter what, because it is being used for matching between the overlay and underlying db data. `keyAsBuffer` and `valueAsBuffer` are by default `true`; if you change them to `false`, it will decode the two and return them as decoded values. If you need both, use `dbUtils.deserialize()`. Just think of the iterator as having the same snapshot as the original transaction.
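A small sketch of those iterator options (assuming an `iterator(options, levelPath)` signature and a hypothetical level):

```ts
import type { DBTransaction } from '@matrixai/db';

// Keys always come back; with valueAsBuffer false the values are decoded
async function listNames(tran: DBTransaction): Promise<Array<string>> {
  const names: Array<string> = [];
  for await (const [key, value] of tran.iterator(
    { valueAsBuffer: false },
    ['vaults'],
  )) {
    names.push(String(value));
  }
  return names;
}
```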
Don't start writing concurrency tests until we have SI transactions. Serial tests are still fine.
Node v16 changes:
Locking Changes:
All async locks and async mutex usage is to be replaced with `js-async-locks`.
In `js-async-locks`: convert the object-map pattern to using `LockBox` by referencing the same ID.
Use `INodeManager` in EFS as a guide.