Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementing BIP39 recovery, Unattended Bootstrap and Agent Start and Status #283

Merged
merged 8 commits into from
Dec 6, 2021

Conversation

tegefaulkes
Copy link
Contributor

@tegefaulkes tegefaulkes commented Nov 1, 2021

Description

This PR adds in the BIP39 recovery seed functionality

Issues Fixed

checklist

  • Generate the keypair from the BIP menmonic
    • Generate a BIP mnemonic
    • generate keypair from menmonic
  • update keyManager to use new key generation method.
  • add a check that the mnemonic matches the nodeId for the node. if so we update the encrypted rootkey with the new password.
    • create a litmus message when a new keypair is generated. this includes a utility function for writing it and one verifying a keypair with it.
    • create a static recoverKeyManager method for resetting the password with a mnemonic.
  • Update generateKeyPair to use the worker manager properlly.
  • tests
    • generating a keypair.
    • mnemonic deterministically recreates keypair.
    • can tell if mnemonic matches keypair/nodeId
    • can create new node from scratch, recreates nodeId and keypair.
    • can recover a keynode.
    • Optimise tests, can't construct a KeyManager for each tests, it's too slow.
  • recoverKeyManager respects lockfile.
  • env varaibles implementation
  • Add a flag to PolykeyAgent.createPolykeyAgent to do faster keypair generation to speed up tests.
  • Create Status domain for replacing the agent locking mechanism based on the Session in https://gitlab.com/MatrixAI/Engineering/Polykey/js-polykey/-/merge_requests/213
    • convert to @startstop
    • add running starting, running and stopping status to the info.
  • Remove all uses of lockfile and proper-lockfile Implementing BIP39 recovery, Unattended Bootstrap and Agent Start and Status #283 (comment)
  • Refactor Status tests to match the new API, make sure to test all the relevant error conditions as well as the waitFor function
  • Adapt the new Status into the bootstrapState in bootstrap/utils.ts
  • Refactor bootstrap tests with the new bootstrapState function
  • bin bootstrap command is returning the recovery key and add a test for that.
  • Remove all keypair generation overrides, a fix has been applied to the deterministic key generation which should mean that keypair generation should be fast now, as before there were doubled-up PBKDF2 executions, benchmark this now, as on my system it's only 3 to 4 seconds, which can be adequately dealt with in beforeAll. This means all keyPair override parameters PLUS keyPair environment variables are not needed
    • Separate branch here, first remove all keypair overrides, and create separate branch for it
  • Rename all uses of mnemonic to recoveryCode.
  • Consider that bootstrapState can be called in the CLI process and therefore not require process IPC to communicate the recovery key being used. However this only works with pk bootstrap and not pk agent start. If we use pk agent start, then bootstrapState again can be used at the CLI process instead, but when calling pk agent start, it will be essential to run it with the root password that was used. This can eliminate any usage of child process IPC, and simplify our execution stack, and means our code will be more portable.
  • Need to catch fs errors when they are used to read the --password-file and --recovery-code-file
  • The signal handler need to be expanded to deal with exception exit conditions, all errors are caught by the root exception handler, upon finally exiting, a standard exit run should be done. Alternative to this is the usage of finally at the end to close the polykey client, but if we are adding exit handling to the signal handlers we might as well do it for the normal exit.
  • Update all commands to use the new binProcessors instead of binParsers.
  • Inline all uses of parseFilePath Implementing BIP39 recovery, Unattended Bootstrap and Agent Start and Status #283 (comment)
  • Use ErrorCLIFileRead for general file reading Implementing BIP39 recovery, Unattended Bootstrap and Agent Start and Status #283 (comment)
  • Integrate resource management including finally clause and the exit handler registration for pkClient usage
  • [ ] Use jest mocking on child process fork so that it actually uses ts-node directly when executing polykeyAgent.ts - not required due to using pkSpawn
  • Use jest mocking on the ExitHandler when using src/polykey.ts

Final checklist

  • Domain specific tests
  • Full tests
  • Lint fixed
  • Squash and rebased
  • Sanity check the final build

Review checklist

@tegefaulkes tegefaulkes self-assigned this Nov 1, 2021
@tegefaulkes tegefaulkes changed the title Generating Mnemonic and keypair. Implementing BIP39 recovery Nov 1, 2021
@tegefaulkes
Copy link
Contributor Author

The new keygeneration functions like so.

  • new keynode. generate mnemonic and key and returns the mnemonic to the user somehow.
  • new keynode started with mnemonic generates they keypair from that mnemonic
  • An existing keynode just loads everything like normal.
  • recovering a keynode needs to regenerate the keypair and then check that the keypair matches the old one.

We don't want to overwrite the keypair with a different one since that will corrupt the keynode. to check if the new keypair matches we can check that the NodeIds match. more importantly we need to check that the keypair matches. Roger mentioned we can do this with a signed message. we can store a message that contains the NodeId and sign that. and verify it with the new recovered keypair. though we need to validate the private key not the public key so we likely need to decrypt the message and check that the NodeID matches.

@tegefaulkes
Copy link
Contributor Author

Small problem with the generateDeterministicKeyPair() function. It's very slow. It takes 15 seconds to create a keyPair from the mnemonic.

@CMCDragonkai
Copy link
Member

That makes a worthy candidate for WorkerManager. The KeyManager is supposed to use it too.

@CMCDragonkai
Copy link
Member

@tegefaulkes please update the task list for this PR, copy initial task list from #202 and then expand it as you go along.

This is something @joshuakarp can help with tomorrow?

@tegefaulkes
Copy link
Contributor Author

The keyManager does use the workerManager but that is currently set after creation so I'll need to make some changes to that. Also, will the worker manager actually speed this up at all?

I made a checklist in the Issue. I can copy that over here.

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Nov 1, 2021 via email

@tegefaulkes
Copy link
Contributor Author

I am now checking the root cert to see if the keyPair generated from a mnemonic matches the node.

@joshuakarp
Copy link
Contributor

I am now checking the root cert to see if the keyPair generated from a mnemonic matches the node.

Is this the process you're using?:

  1. Use generateDeterministicKeyPair with provided mnemonic to create key
  2. Create node ID from this keypair
  3. Retrieve node ID from cert, and compare

I'm assuming then that the certificate is in plaintext, and we can just query the OID fields?

@tegefaulkes
Copy link
Contributor Author

That pretty much sums it up.

@tegefaulkes
Copy link
Contributor Author

Looks like pki.rsa.generateKeyPair function used in generateDeterministicKeyPair can take a workers parameter to speed up generation but it is still very slow.

It's not so much a problem when using polykey. generating the keypair only happens when creating the keynode or resetting/renewing the keypair. but it's blowing out the KeyManager test times. I need to add an optional keyPair parameter to the createKeyManager so we can override the generation just for testing.

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Nov 2, 2021

Will have some changes regarding async-init coming from https://gitlab.com/MatrixAI/Engineering/Polykey/js-polykey/-/merge_requests/213. Will be merging MR 213 prior to this and #266.

The CLI is situation is quite messy and has to go through a proper review. So I'm bundling my CLI review with the sessions MR merge. So some of the points discussed with @joshuakarp I'll include in that MR.

@tegefaulkes
Copy link
Contributor Author

tegefaulkes commented Nov 2, 2021

I added the ability to override the keypair generation when starting the KeyManager. this is mainly for testing so we can provide a keyPair generated without the mnemonic to speed up the testing. most tests should run just as fast except for the keyManager.test.ts since it needs to do renew/reset keypair tests.

We do need to see if we can speed up deterministic key generation at all. 15secs is not the end of the world when it only happens when creating a new node but it's still pretty bad.

@tegefaulkes
Copy link
Contributor Author

While updating the keyPair files doesn't have any locking applied, the function I'm using to do it writes to temp files and then does an atomic rename. this is what's being used already when creating the files and when doing a keyPair renew/reset. However that is expected to be protected by the PolykeyAgent lockfile. so the recovery function still needs to check for that and do it safely.

src/PolykeyAgent.ts Outdated Show resolved Hide resolved
src/config.ts Outdated Show resolved Hide resolved
src/bin/utils.ts Outdated Show resolved Hide resolved
src/bin/utils.ts Outdated Show resolved Hide resolved
src/bin/agent/start.ts Outdated Show resolved Hide resolved
src/bin/options.ts Outdated Show resolved Hide resolved
src/bin/options.ts Outdated Show resolved Hide resolved
@CMCDragonkai
Copy link
Member

Does this PR also apply the env variables to pk bootstrap command as well? If not, it should be an empty checklist item.

Any other things here to be done?

@joshuakarp
Copy link
Contributor

Does this PR also apply the env variables to pk bootstrap command as well? If not, it should be an empty checklist item.

Any other things here to be done?

Yeah - there's already support for acquiring both password and recovery code on pk bootstrap. As far as I'm aware, the networking environment variables are not necessary for pk bootstrap.

However, I can't seem to see whether the usage of the recovery code has been integrated into pk bootstrap. My expectation is the following:

pk bootstrap --fresh # bootstrap a new node state

# this we can do later
pk recover --recovery-code-file=./rcf --password-file=./pf # just recovers the key (but doesn't start the agent)

# FOR: empty node state

# deterministic node state generation
pk agent start --recovery-code-file=./rcf
# should also be done with bootstrapping
pk bootstrap --recovery-code-file=./rcf

# generate key pair, and then set the password
pk agent start --recovery-code-file=./rcf --password-file=./pf
# should also be done with bootstrapping
pk bootstrap --recovery-code-file=./rcf --password-file=./pf

# FOR: non-empty node state

# generate a keypair deterministically, check if the public key is correct, and if so, prompt for password (this is the new password)
pk agent start --recovery-code-file=./rcf
# unnecessary for pk bootstrap - only occurs on empty node state

# recover the key pair, if the public key is correct, set the password
pk agent start --recovery-code-file=./rcf --password-file=./pf 
# unnecessary for pk bootstrap - only occurs on empty node state

# the --fresh option ignores existing state (but not any existing lock)
pk agent start --recovery-code-file=./rcf --password-file=./pf --fresh
# should also be done with bootstrapping - force renewal (fails if agent running)
pk bootstrap --recovery-code-file=./rcf --password-file=./pf --fresh

Originally from #202 (comment)

@tegefaulkes am I missing this somewhere?

Still in process of reviewing.

@tegefaulkes
Copy link
Contributor Author

I'm looking at it and fixing it now.

@tegefaulkes
Copy link
Contributor Author

I had to do a small fix to it, but it's getting the env variables at line 54

  const password = await binUtils.getPassword(options.passwordFile);
  const recoveryCode = await binUtils.getRecoveryCode(options.recoveryFile);
  const fresh = !!options.fresh;

@CMCDragonkai
Copy link
Member

The options.fresh should already be a boolean, no need to cast. You just make sure it's an option without a value.

@CMCDragonkai
Copy link
Member

I've added --root-key-pair-bits option, so that tests can override it to 1024. This reduces it to about 9 seconds for the tests. However I remember seeing that it can sometimes double up. So the default time out is still 20s, and this is * 2 to 40s.

@CMCDragonkai
Copy link
Member

I came across a situation where if we had structured error logging we would better see what the problem is...

Right now starting in the background with pkSpawn works except this error:

{"name":"ErrorRootKeysParse","description":"Polykey error","message":"Only 8, 16, 24, or 32 bits supported: 888","exitCode":1,"data":{},"stack":"ErrorRootKeysParse: Only 8, 16, 24, or 32 bits supported: 888\n    at Object.readRootKeyPair (/home/cmcdragonkai/Projects/js-polykey/src/keys/KeyManager.ts:665:13)\n    at async Object.setupRootKeyPair (/home/cmcdragonkai/Projects/js-polykey/src/keys/KeyManager.ts:597:23)\n    at async Object.start (/home/cmcdragonkai/Projects/js-polykey/src/keys/KeyManager.ts:164:35)\n    at async Object.start (/home/cmcdragonkai/Projects/js-polykey/node_modules/@matrixai/async-init/src/CreateDestroyStartStop.ts:75:20)\n    at async Function.createKeyManager (/home/cmcdragonkai/Projects/js-polykey/src/keys/KeyManager.ts:90:5)\n    at async Function.createPolykeyAgent (/home/cmcdragonkai/Projects/js-polykey/src/PolykeyAgent.ts:169:8)\n    at async main (/home/cmcdragonkai/Projects/js-polykey/src/bin/polykey-agent.ts:49:15)\n    at async /home/cmcdragonkai/Projects/js-polykey/src/bin/polykey-agent.ts:128:5"}

@CMCDragonkai
Copy link
Member

The reason is because this was missing:

const nodePath = new commander.Option(
  '-np, --node-path <path>',
  'Path to Node State',
).env('PK_NODE_PATH')
 .default(config.defaults.nodePath);

So it was attempting to use the ~/.local/share/polykey as the node path.

@CMCDragonkai
Copy link
Member

This means this Exception: ErrorRootKeysParse may be thrown with a variety of messages if the password is incorrect. We should look into improving this messaging to be better in the future for UX @joshuakarp.

@joshuakarp
Copy link
Contributor

It'd be worthwhile to do a whole review of the error message displays as a whole when we get to that stage. There's similar ambiguities with thrown exceptions that wouldn't necessarily correlate to the actions of a user (e.g. ErrorNodeGraphEmptyDatabase when no nodes have been added to the database).

@CMCDragonkai
Copy link
Member

Accidentally wrote the decription instead of description in Status errors. Fixing that up.

…nd and concurrent coalescing, propagate log level to background agent
@CMCDragonkai
Copy link
Member

Pushed up my latest test fixes for agent/start.test.ts. This commit also has the new pkAgent utility as well in tests/bin/utils.test.ts that can be used to create our common agent. See if that works @joshuakarp.

@CMCDragonkai
Copy link
Member

Current state now:


[nix-shell:~/Projects/js-polykey]$ npm test -- tests/bin/agent/start.test.ts 

> @matrixai/polykey@1.0.0 test /home/cmcdragonkai/Projects/js-polykey
> jest "tests/bin/agent/start.test.ts"

Determining test suites to run...
GLOBAL SETUP
 PASS  tests/bin/agent/start.test.ts (49.938 s)
  start
    ✓ start in foreground with parameters (9793 ms)
    ✓ start in foreground with environment variables (9145 ms)
    ✓ start in background with environment variables (14456 ms)
    ✓ concurrent starts are coalesced (6561 ms)
    ✓ concurrent bootstrap is coalesced (6730 ms)

Test Suites: 1 passed, 1 total
Tests:       5 passed, 5 total
Snapshots:   0 total
Time:        49.974 s
Ran all test suites matching /tests\/bin\/agent\/start.test.ts/i.
GLOBAL TEARDOWN

@CMCDragonkai
Copy link
Member

The pk agent tests are getting pretty comprehensive now. I think a few other ones:

  • Test that SIGINT, SIGTERM, SIGQUIT SIGHUP all do graceful termination.
  • Test what happens when wrong password is used
  • Test the usage of recovery code
  • Test that if you cancel in the middle of starting (as in bootstrapping), that you can bootstrap again perfectly as long as you use --fresh.
  • Test the usage of --fresh as well

I'll fix up the bootstrap tests too, and then this will be merged. All other test issues are to be fixed in downstream PRs.

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Dec 6, 2021

So I'm removing:

    // This checks the if number of directory entries is greater than 1 due to status.json
    if (!fresh && (await fs.promises.readdir(nodePath)).length > 1) {
      throw new bootstrapErrors.ErrorBootstrapExistingState();
    }

From bootstrapState.

The main reason is that agent start cannot function in the same way because all of the state initialisation logic across the entire codebase is "partial" and "forgiving" to some extent.

So if there are files there, but nothing conflicting the pk states, they are just ignored.

If there are stuff that does conflict, the bootstrap and agent start will fail in a variety of exceptions. But that should signal to the user to have an empty directory or to use the --fresh.

I actually reversed this decision. This does make sense for the pk bootstrap command. However pk agent start it doesn't quite make sense. If existing state exists for pk agent start, it will just try to start it, and potentially create state when it does start. The only factor is existing conflicting state. But this is an ok trade off.

I've also added the fresh option with only --fresh, since -f is occupied by --format which is a more common option.

@CMCDragonkai
Copy link
Member

Instead of having a separate test for each signal, I've just embedded them into each individual test to save some time. So some tests I'm using SIGTERM, another SIGINT... etc.

@CMCDragonkai
Copy link
Member

While testing the recovery code system, I came across a problem.

If the root key already exists, and the recovery code is passed in.

The way we check if our recovery code is correct is by doing:

        // Recover the key pair with the recovery code
        // Check if the generated key pair matches
        const rootKeyPairCheck = await this.generateKeyPair(bits, recoveryCode);
        if (!(await this.matchRootKeyPair(rootKeyPairCheck))) {
          throw new keysErrors.ErrorKeysRecoveryCodeIncorrect();
        }

But this relies on the bits. Which by default is 4096.

If the user selected a different bit size, they would have to remember the bit size and the recovery code together.

Instead of doing this, we need to instead save the bit size, or derive it from the saved public key.

Another problem with this mechanism is that if the root key files is in fact lost, and not just the password was forgotten, then this actually goes into generating a new root key pair, which would then be attempted to be used to decrypt the other key files like the db key and the vault key. In this scenario, if the recovery code is in fact wrong, then it would generate the wrong root key pair, and there would likely to be a decryption failure that occurs with the db key or vault key subsequently. This of course requires that both db key and vault key still exist on the filesystem.

In #164 an old issue, we explored the idea of hierarchical keys. This basically means the seed (recovery code) is used to generate all the keys in a hierarchical way which means we can recover all keys. However right now db keys and vault keys are just random separated keys, so the recovery code has limited utility. We'll have to address this later.

So basically the issue is that:

  1. If you pass in the wrong bit size, or don't pass it in, but a non-default bit size is used, you'll get a ErrorKeysRecoveryCodeIncorrect.

  2. If you don't have the root keypair anymore, then you won't be able to know what bit size was really used. So you'll just have to be lucky with the default bit size.

  3. requires revisiting Hierarchical Key Generation #164. But 1. we can solve this by persisting or deriving the bit size somehow. Let's see.

@CMCDragonkai
Copy link
Member

Turns out by reading the public key pem, we can acquire the public key bit size using our utility function that we already have in keys/utils:

function publicKeyBitSize(publicKey: PublicKey): number {
  return publicKey.n.bitLength();
}

@CMCDragonkai
Copy link
Member

This is solved by using the above function and having a protected method called recoverRootKeyPair in KeyManager.

@CMCDragonkai
Copy link
Member

[nix-shell:~/Projects/js-polykey]$ npm test -- tests/bin/agent/start.test.ts 

> @matrixai/polykey@1.0.0 test /home/cmcdragonkai/Projects/js-polykey
> jest "tests/bin/agent/start.test.ts"

Determining test suites to run...
GLOBAL SETUP
 PASS  tests/bin/agent/start.test.ts (122.537 s)
  start
    ✓ start in foreground (9612 ms)
    ✓ start in background (14891 ms)
    ✓ concurrent starts are coalesced (9850 ms)
    ✓ concurrent bootstrap is coalesced (6947 ms)
    ✓ start with existing state (17818 ms)
    ✓ start when interrupted, requires fresh on next start (21816 ms)
    ✓ start from recovery code (38039 ms)

Test Suites: 1 passed, 1 total
Tests:       7 passed, 7 total
Snapshots:   0 total
Time:        122.576 s
Ran all test suites matching /tests\/bin\/agent\/start.test.ts/i.
GLOBAL TEARDOWN

@CMCDragonkai
Copy link
Member

Bootstrap tests all done:


[nix-shell:~/Projects/js-polykey]$ npm test -- tests/bin/bootstrap.test.ts 

> @matrixai/polykey@1.0.0 test /home/cmcdragonkai/Projects/js-polykey
> jest "tests/bin/bootstrap.test.ts"

Determining test suites to run...
GLOBAL SETUP
 PASS  tests/bin/bootstrap.test.ts (37.507 s)
  bootstrap
    ✓ bootstraps node state (5154 ms)
    ✓ bootstrapping occupied node state (9685 ms)
    ✓ concurrent bootstrapping are coalesced (7267 ms)
    ✓ bootstrap when interrupted, requires fresh on next bootstrap (11185 ms)

Test Suites: 1 passed, 1 total
Tests:       4 passed, 4 total
Snapshots:   0 total
Time:        37.548 s, estimated 38 s
Ran all test suites matching /tests\/bin\/bootstrap.test.ts/i.
GLOBAL TEARDOWN

Important that when testing concurrent work, do not use pkStdio as the mocks are global and conflict with each other. Could have used pkExec in my concurrent tests, but copied from my agent/start.test.ts to use pkSpawn.

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Dec 6, 2021

Both agent/start tests and bootstrap tests are passing. The only thing left is the additional agent subcommands to be refactored in tests/bin/. Will do this in a later PR. Which includes #286 (comment).

This is being merged now! @joshuakarp @emmacasolin rebase on top of master!

@CMCDragonkai CMCDragonkai marked this pull request as ready for review December 6, 2021 13:02
@CMCDragonkai CMCDragonkai merged commit 4ca847b into master Dec 6, 2021
@CMCDragonkai CMCDragonkai deleted the BIP39_Recovery branch December 6, 2021 13:06
@joshuakarp
Copy link
Contributor

FYI @CMCDragonkai running tests/bin/agent/start.test.ts on the new laptops:

josh@matrix-vostro-5410-1> npm test -- tests/bin/agent/start                                                                  ~/Documents/Projects/js-polykey

> @matrixai/polykey@1.0.0 test /home/josh/Documents/Projects/js-polykey
> jest "tests/bin/agent/start"

Determining test suites to run...
GLOBAL SETUP
Creating global.binAgentDir: /tmp/polykey-test-bin
 PASS  tests/bin/agent/start.test.ts (140.284 s)
  start
    ✓ start in foreground (10829 ms)
    ✓ start in background (15737 ms)
    ✓ concurrent starts are coalesced (13062 ms)
    ✓ concurrent bootstrap is coalesced (11122 ms)
    ✓ start with existing state (18255 ms)
    ✓ start when interrupted, requires fresh on next start (20339 ms)
    ✓ start from recovery code (38224 ms)
    ✓ start with network configuration (10550 ms)

Test Suites: 1 passed, 1 total
Tests:       8 passed, 8 total
Snapshots:   0 total
Time:        140.31 s
Ran all test suites matching /tests\/bin\/agent\/start/i.
GLOBAL TEARDOWN
Destroying global.binAgentDir: /tmp/polykey-test-bin

@joshuakarp
Copy link
Contributor

And the same on my own laptop:

[nix-shell:~/Documents/MatrixAI/js-polykey]$ npm test -- tests/bin/agent/start.test.ts 

> @matrixai/polykey@1.0.0 test /home/josh/Documents/MatrixAI/js-polykey
> jest "tests/bin/agent/start.test.ts"

Determining test suites to run...
GLOBAL SETUP
Creating global.binAgentDir: /run/user/1000/polykey-test-bin
 PASS  tests/bin/agent/start.test.ts (151.772 s)
  start
    ✓ start in foreground (11874 ms)
    ✓ start in background (16135 ms)
    ✓ concurrent starts are coalesced (10801 ms)
    ✓ concurrent bootstrap is coalesced (10876 ms)
    ✓ start with existing state (20352 ms)
    ✓ start when interrupted, requires fresh on next start (23724 ms)
    ✓ start from recovery code (42431 ms)
    ✓ start with network configuration (11073 ms)

Test Suites: 1 passed, 1 total
Tests:       8 passed, 8 total
Snapshots:   0 total
Time:        151.827 s
Ran all test suites matching /tests\/bin\/agent\/start.test.ts/i.
GLOBAL TEARDOWN
Destroying global.binAgentDir: /run/user/1000/polykey-test-bin

@CMCDragonkai
Copy link
Member

Ah great that timing is the same as mine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment