This repository has been archived by the owner on Nov 6, 2020. It is now read-only.

SecretStore: auto-migration #7271

Closed
wants to merge 30 commits

Conversation

@svyatonik (Collaborator) commented Dec 12, 2017

see #7271 (comment)
depends on https://github.com/paritytech/contracts/pull/102

What is this PR about?

The ServersSetChange session, which was introduced recently, allows us to 'migrate' secret shares from servers_set1 to servers_set2, adding and removing shares where required. It was designed to be started manually, via the HTTP API, by the holder of a special (administrator) key that all servers are configured with. So the scenario for running this session was (without details):

  1. make sure all nodes from servers_set1 && servers_set2 are up
  2. configure SecretStore to connect to all servers from servers_set1 && servers_set2
  3. run ServersSetChange session
  4. configure SecretStore to connect to all servers from servers_set2 only
  5. turn off nodes from servers_set1 [optional]

This PR is about minimizing the number of steps ^^^ for the case when the set of SecretStore key servers is configured using the special KeyServerSetWithMigration contract from here: https://github.com/paritytech/contracts/pull/102/files#diff-ebbb8d3338d9c365c0e8f5258894b5afR29 . When it is used, the scenario is narrowed to:

  1. make sure all nodes from servers_set1 && servers_set2 are up
  2. call addKeyServer/removeKeyServer on this contract
  3. turn off nodes from servers_set1 when the MigrationCompleted() event is fired [optional]

KeyServerSetWithMigration contract

Currently, a single-owned version is presented in the linked PR. A multi-owned version (see notes on Kovan integration below) will likely be added later.

The contract holds 3 sets of nodes: the current set, the migration set && the new set:
current set - the current state of the SecretStore
new set - equal to the current set at first; addKeyServer/removeKeyServer add/remove nodes to/from here
migration set - empty, unless a migration is active

A migration represents the process of migrating secret shares from the 'current set' to the 'new set' (i.e. running the ServersSetChange session). It is started when startMigration(id) is called by one of the 'current set' + 'new set' nodes. At the time of this call, nodes from the 'new set' are copied to the 'migration set'.

A migration ends when all nodes from the 'migration set' have confirmed it with a confirmMigration(id) call. The migration id is just a random number used to make sure that all nodes are running the same migration. After the migration is completed, the 'migration set' is cleared && everything starts over.

IMPORTANT: there's a special 'initialization' phase just after this contract is deployed. During that phase, all changes to the 'new set' are simultaneously replicated to the 'current set', thus making migration impossible. It is intended to resolve different configuration issues, like:

  1. previously servers were configured using configuration files && now we're moving to the contract-based version
  2. moving an existing SecretStore between different networks
    etc.

The initialization phase ends when completeInitialization() is called. A rough sketch of the whole interface follows.
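
To make the above concrete, here is a rough Rust-side sketch of the contract interface as a key server might see it. Only the calls mentioned above are listed; the names follow this description, but the exact signatures (e.g. the ip parameter) are assumptions - the authoritative ABI lives in the linked contracts PR.

```rust
use std::collections::BTreeSet;

/// Placeholder types; the real client uses its own key-server id / ethereum types.
type NodeId = Vec<u8>;        // key server public key
type MigrationId = [u8; 32];  // random id chosen by the migration master

/// Rough shape of the KeyServerSetWithMigration contract (sketch, not the real ABI).
trait KeyServerSetWithMigration {
	// --- owner-side administration ---
	/// Adds a node to the 'new set'. During the initialization phase the change
	/// is mirrored into the 'current set' as well.
	fn add_key_server(&mut self, public: NodeId, ip: String);
	/// Removes a node from the 'new set' (mirrored during initialization).
	fn remove_key_server(&mut self, public: NodeId);
	/// Ends the initialization phase; from now on the sets can diverge and
	/// migrations become possible.
	fn complete_initialization(&mut self);

	// --- key-server-side migration protocol ---
	/// Called by one of the 'current set' + 'new set' nodes; copies the
	/// 'new set' into the 'migration set'.
	fn start_migration(&mut self, id: MigrationId);
	/// Called by every node of the 'migration set' once its shares are in place.
	/// When all have confirmed, the 'migration set' is cleared && everything
	/// starts over.
	fn confirm_migration(&mut self, id: MigrationId);

	// --- views used by the listener ---
	fn current_set(&self) -> BTreeSet<NodeId>;
	fn new_set(&self) -> BTreeSet<NodeId>;
	fn migration_set(&self) -> BTreeSet<NodeId>;
}
```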

KeyServerSetWithMigration listener

WARNING: the new version of the listener won't work with the previous type of KeyServerSet contract. You should upgrade your contract to the newer version (via the registry).

The main issue with the listener was synchronizing three sets of KeyServers:

  1. the set from the contract
  2. the set we are currently connected to (or should be)
  3. the set for which the active ServersSetChange session (if any) is running

This synchronization had to be re-entrant + survive unexpected node shutdowns/disconnects. It is achieved with the help of the KeyServerSetWithMigration contract + the idempotence of the ServersSetChange session.
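
As a minimal sketch (assumed names, not necessarily the structs used in this PR), the snapshot read from the contract and the three coarse states derived from it could look like this:

```rust
use std::collections::BTreeSet;

type NodeId = Vec<u8>;        // placeholder for a key server public key
type MigrationId = [u8; 32];

/// Snapshot of the KeyServerSetWithMigration contract state, re-read whenever
/// the contract signals a change.
struct ServerSetSnapshot {
	current_set: BTreeSet<NodeId>,
	new_set: BTreeSet<NodeId>,
	/// None unless a migration is active.
	migration: Option<Migration>,
}

struct Migration {
	id: MigrationId,
	set: BTreeSet<NodeId>,
	master: NodeId,
}

enum ServerSetState {
	/// current_set == new_set && no migration is running.
	Idle,
	/// current_set != new_set && no migration has been started yet.
	MigrationRequired,
	/// A migration is recorded in the contract.
	MigrationActive,
}

impl ServerSetSnapshot {
	fn state(&self) -> ServerSetState {
		match (&self.migration, self.current_set == self.new_set) {
			(Some(_), _) => ServerSetState::MigrationActive,
			(None, true) => ServerSetState::Idle,
			(None, false) => ServerSetState::MigrationRequired,
		}
	}
}
```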

So here is how it works (a sketch of steps 2 and 3 follows this list):

  1. when the contract raises a ServerAdded/ServerRemoved event, the KeyServerSet component reads the state of the servers set contract (a snapshot)
  2. the ConnectionTrigger component checks the state of the snapshot. Without details:
    2.1) if the state is Idle (current_set == new_set && !migration) => it just keeps us connected to nodes from current_set
    2.2) if the state is MigrationRequired (current_set != new_set && !migration) => it checks if it should be 'the master of migration' && if so, calls the startMigration() method of the contract
    2.3) if the state is MigrationActive (migration == true) => it:
    2.3.1) connects to all nodes from both current_set && migration_set
    2.3.2) checks if it is the master of migration && starts the ServersSetChange session if required
  3. when a node receives the ServersSetChange session initialization message, it checks:
    3.1) that the sender is actually the migration master (read from the contract)
    3.2) that the migration has the same id (read from the contract)
    3.3) that the migration has the same migration set (read from the contract)
  4. when the session is completed:
    4.1) the node checks if it is a part of the new configuration (migration_set) && calls confirmMigration() if so
    4.2) the ConnectionTrigger goes back to connecting only to nodes from current_set
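
Building on the snapshot sketch above (same assumed types), steps 2 and 3 roughly boil down to a pure decision plus a validation check. The master-election rule below is hypothetical (the node with the smallest id), as are the helper and type names; the real code may differ.

```rust
use std::collections::BTreeSet;

/// What the ConnectionTrigger decides after reading a fresh snapshot (steps 2.1-2.3).
/// Keeping the decision pure makes it re-entrant: it can be recomputed from the
/// contract state after any restart or disconnect.
enum TriggerAction {
	/// 2.1: keep connections to exactly these nodes.
	Connect(BTreeSet<NodeId>),
	/// 2.2 (as master): also call startMigration(id) on the contract.
	ConnectAndStartMigration(BTreeSet<NodeId>, MigrationId),
	/// 2.3 (as master): also start the ServersSetChange session for this migration set.
	ConnectAndStartSession(BTreeSet<NodeId>, BTreeSet<NodeId>),
}

/// Hypothetical master-election rule: the node with the smallest id among the
/// current + new nodes acts as migration master.
fn is_migration_master(snapshot: &ServerSetSnapshot, this_node: &NodeId) -> bool {
	snapshot.current_set.union(&snapshot.new_set).min() == Some(this_node)
}

fn on_snapshot(snapshot: &ServerSetSnapshot, this_node: &NodeId, fresh_id: MigrationId) -> TriggerAction {
	match snapshot.state() {
		ServerSetState::Idle =>
			TriggerAction::Connect(snapshot.current_set.clone()),
		ServerSetState::MigrationRequired if is_migration_master(snapshot, this_node) =>
			TriggerAction::ConnectAndStartMigration(snapshot.current_set.clone(), fresh_id),
		ServerSetState::MigrationRequired =>
			TriggerAction::Connect(snapshot.current_set.clone()),
		ServerSetState::MigrationActive => {
			let migration = snapshot.migration.as_ref().expect("state is MigrationActive; qed");
			let mut nodes = snapshot.current_set.clone();
			nodes.extend(migration.set.iter().cloned());
			if &migration.master == this_node {
				TriggerAction::ConnectAndStartSession(nodes, migration.set.clone())
			} else {
				TriggerAction::Connect(nodes)
			}
		}
	}
}

/// Step 3: checks on an incoming ServersSetChange initialization message,
/// all performed against the state read from the contract.
fn validate_initialization(
	snapshot: &ServerSetSnapshot,
	sender: &NodeId,
	session_id: &MigrationId,
	session_set: &BTreeSet<NodeId>,
) -> bool {
	match &snapshot.migration {
		Some(m) => sender == &m.master     // 3.1: sender is the migration master
			&& session_id == &m.id         // 3.2: same migration id
			&& session_set == &m.set,      // 3.3: same migration set
		None => false,
	}
}
```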

Known TODOs

  1. after sending a startMigration/confirmMigration tx, we should wait at least several blocks before resubmitting (see SecretStore: tx retry pattern #7323)

Unresolved important problems (any advice is appreciated)

  1. key servers that are leaving the SecretStore should be online during the whole migration process. For Kovan this means that if we're excluding an authority, we should ask them not to turn off the node until MigrationCompleted() is raised. If it is shut down, the migration will never end. Possible solutions:
    1.1) manual - count this node as lost && call something like KeyServerSetWithMigration.forceRemoval() manually
    1.2) automatic - if during migration we cannot connect to the node for an interval T, call something like KeyServerSetWithMigration.forceRemoval() automatically. When a majority of nodes have called forceRemoval - remove it from the migration
    1.3) ???

Both solutions ^^^ (1.1) && (1.2) can lead to possible secret loss (if after removal there are fewer than t+1 shares of the secret left).
And the same is the essence of problem 2:

  2. initially the ServersSetChange session was designed to fail if there's at least 1 secret which would have fewer than t+1 shares in the new nodes set (i.e. the secret becomes unrecoverable). This is fine when we're running the session manually (we have time to find a solution && restart), but when migration is started automatically, there's no one to make a wise decision. I do not have a solution for this, but here are some thoughts on it (the recoverability condition is sketched after this list):
    2.1) since the threshold is specified by the caller during secret creation, we can just ignore that error && leave the secret unrecoverable. This means that the caller has invested more trust in SecretStore authorities than needed. Eventually the same situation could repeat even without auto-migration, when some authorities just turn off their nodes.
    2.2) there's a way (I believe so) to decrease the threshold after shares have been generated. So we might want to run a threshold-decreasing session when this happens (see SecretStore: threshold decrease PoC #7562)
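
For reference, the condition behind both problems is just the usual threshold rule of the sharing scheme: a key generated with threshold t needs at least t+1 shares to be recovered. A trivial sketch (assumed names):

```rust
/// A key with threshold `t` is recoverable from any `t + 1` of its shares.
/// If a migration/removal leaves fewer share holders than that, the key is lost.
fn is_recoverable(threshold: usize, remaining_share_holders: usize) -> bool {
	remaining_share_holders >= threshold + 1
}

// Example: a key generated with threshold 1 on 3 servers survives losing one
// holder (2 >= 2), but not two (1 < 2).
```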

Notes on integration with Kovan authorities list contract

Since the initial requirement was integrating the SecretStore with Kovan, here are some words on this:

  1. since the KeyServerSet contract needs to know the public keys + IPs of every KeyServer, ValidatorSet contracts are not enough. So after a new ValidatorSet is approved, someone needs to update the KeyServerSet contract as well (taking into account that all nodes must be online during auto-migration)
  2. although it is an additional action, it actually gives greater control over key servers than requiring every validator to run a KS. E.g. we could possibly ignore fresh validators for a while. Or simply start with an SS consisting of a single Parity node && still accessible via Kovan
  3. I'm planning to add (eventually) a contract which will allow validators to have shared ownership of the KeyServerSetWithMigration contract (where a majority should vote on inclusion/exclusion of a node from the SS) - I do not know if it is required right now, though


@svyatonik added A3-inprogress ⏳ Pull request is in progress. No review needed at this stage. and M4-core ⛓ Core client code / Rust. labels Dec 12, 2017
@5chdn added this to the 1.9 milestone Dec 12, 2017
@svyatonik added A0-pleasereview 🤓 Pull request needs code review. and removed A3-inprogress ⏳ Pull request is in progress. No review needed at this stage. labels Dec 13, 2017
@svyatonik added A3-inprogress ⏳ Pull request is in progress. No review needed at this stage. and removed A0-pleasereview 🤓 Pull request needs code review. labels Dec 29, 2017
@svyatonik added A0-pleasereview 🤓 Pull request needs code review. and removed A3-inprogress ⏳ Pull request is in progress. No review needed at this stage. labels Dec 29, 2017
@5chdn modified the milestones: 1.9, 1.10 Dec 30, 2017
@5chdn added A1-onice 🌨 Pull request is reviewed well, but should not yet be merged. and removed A1-onice 🌨 Pull request is reviewed well, but should not yet be merged. labels Dec 30, 2017
@5chdn modified the milestones: 1.10, 1.9 Jan 7, 2018
@svyatonik (Collaborator, Author)

ATTENTION TO POTENTIAL REVIEWERS: the contents of this PR were merged in #7323 (which was built on top of this PR), I guess because of confusion with the onice labels. So please do not merge this PR (I've left a conflict intentionally). Please leave comments (either on the code, or on the concept from the description) here - once there's at least one review, I'll make another PR with the same description && possible code changes.

@5chdn (Contributor) commented Jan 10, 2018

Sorry :)

@svyatonik (Collaborator, Author)

Response to 'unresolved 1': assume that KSs that are 'leaving' are either offline or malicious => they should be ignored in the auto-migration session. Will do that in the next PRs.

@debris (Collaborator) commented Feb 5, 2018

The PR looks good to me.

@debris closed this Feb 5, 2018
@svyatonik (Collaborator, Author)

Just logging the solution to unresolved #2 here: ignore this situation in the auto-migration session and run it to the end. So some keys can become irrecoverable after the auto-migration session. Will fix this in another PR.

Labels
A0-pleasereview 🤓 Pull request needs code review. M4-core ⛓ Core client code / Rust.