Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate js-db and concurrency testing #419

Merged
merged 14 commits into from
Aug 23, 2022
Merged

Conversation

CMCDragonkai
Copy link
Member

@CMCDragonkai CMCDragonkai commented Jul 15, 2022

Description

This integrates the new js-db 5.0.0 which brings in rocksdb. It now has correct implementation of optimistic transactions using Snapshot Isolation.

In bringing this, all of our concurrency issues should start with a correct foundation. This means any race conflict at the DB layer will throw an ErrorDBTransactionConflict exception. These exceptions will need to either be handled with an automatic retry where it is appropriate, or when it is not appropriate, it should be bubbled all the way the user interface (CLI and GUI) and the user can decide what to do.

Write skews can be resolved with new methods getForUpdate which materialises the read-write conflict into a write-write conflict. Where these may occur, transaction locks should also be used.

The DBTransaction supports both optimistic transactions and pessimistic transactions in the same transaction object. Locking within the transaction should only be done in order to guard against write-skew thrashing (like counter races).

Because the DBTransaction has its own lock, and most operations should be done with respect to the DB, most usages of in-memory locking can be replaced with reifying it to the database. This means some in-memory constructs will be replaced with in-DB structures instead.

Transaction mutual exclusion should start at the boundary of the program. This means at the GRPC service request handler level. A transaction will be created there, and this context will be shared by the transitive closure of all nested functions and methods. This should be sufficient for most usages.

The EFS will continue to use the old leveldb until it also transitions to using the new 5.0.0 js-db. The EFS runs its own database atm, so it's independent of the main PK database. Conversion of EFS to using the new db will be a separate PR, it requires separate testing because it has to remove its own locking in favour of using transaction-native locking.

While integrating this new DB, we will also start experimenting with more formal model-checking tests against our codebase in order to test concurrent phenomena.

Issues Fixed

Tasks

  • 1. keypath and options are swapped for relevant DB and trans methods.
  • 2. GRPC handlers need to begin the transaction contexts.
  • 3. Try and simplfy the if(tran == null) db.withTranF/G blocks. use tran => function() to avoid code bloat. For generators where possible just pass the generator down with a arrow function. Generally try and avoid wrapping it in a withTransactionG if possible.
  • 4. Any simple get/set operations can just use DB directly if tran isn't provided. for single operations a transaction is pointless.
  • 5. lockbox with re-enter-ency is provided, this can be used to avoid conflicts on counters and creation.
  • 6. Remove ACL and Sigchain implementation of withTransactionF with the db version of it. The Sigchain version needs to use the transaction's locking now.
  • Update DB, EFS and logger dependencies.
  • Remove any ts-ignore comments for logger usage.

Final checklist

  • Domain specific tests
  • Full tests
  • Updated inline-comment documentation
  • Lint fixed
  • Squash and rebased
  • Sanity check the final build

@CMCDragonkai CMCDragonkai self-assigned this Jul 15, 2022
@CMCDragonkai CMCDragonkai force-pushed the feature-concurrent-state branch 4 times, most recently from fb83c35 to 1974f1e Compare July 24, 2022 08:25
@ghost
Copy link

ghost commented Jul 24, 2022

👇 Click on the image for a new way to code review
  • Make big changes easier — review code in small groups of related files

  • Know where to start — see the whole change at a glance

  • Take a code tour — explore the change with an interactive tour

  • Make comments and review — all fully sync’ed with github

    Try it now!

Review these changes using an interactive CodeSee Map

Legend

CodeSee Map Legend

@CMCDragonkai CMCDragonkai changed the title WIP: Integrate js-db and integrate concurrency testing WIP: Integrate js-db and concurrency testing Jul 25, 2022
@CMCDragonkai
Copy link
Member Author

For better concurrency testing, as well as fast-check, we can bring in jest-extended. It's already starting to be used by EFS which has lot of concurrency behaviour to verify.

@CMCDragonkai CMCDragonkai force-pushed the feature-concurrent-state branch 2 times, most recently from 1a872fe to 716bcd3 Compare July 31, 2022 09:19
@CMCDragonkai
Copy link
Member Author

Rebased on master.

This has now jest-extended, fast-check, DB 5.0.1.

@CMCDragonkai
Copy link
Member Author

Transaction mutual exclusion should start at the boundary of the program. This means at the GRPC service request handler level. A transaction will be created there, and this context will be shared by the transitive closure of all nested functions and methods. This should be sufficient for most usages.

This means the DBTransaction object is passed into domain methods. So the methods of Queue for example might be:

public addSomething(...args, tran?: DBTransaction) {
}

We're using the last parameter as the optional argument here... but it may be rolled into a general options object too in the future.

But this is the design I've found that works as this also allows the domain objects to be used atomically as each method will create their own transaction context level.

@CMCDragonkai
Copy link
Member Author

@tegefaulkes @emmacasolin this PR is an exploration of the concepts of queue, db and tracing. Several PRs are likely to be split off from this PR which will target staging. I'll be setting up a framework from this PR, and then there will be pieces of work sliced out.

  • The DB integration will require reworking how transactions are used for atomicity.
  • Tracing can be its own domain that creates span that abstracts over how we currently use logger.getChild
  • Queue involves a lot of interesting ideas

@CMCDragonkai
Copy link
Member Author

CMCDragonkai commented Aug 5, 2022

By using raw: true, you can insert Buffer.from(lexi.pack(num)) directly as a key without first encoding it as a hex value. (Hex value encoding can still be useful though in case it has to maintain some defined set of values independent from other values when it is concatenated together).

Doing this currently with the timestamped level for the tasks.

@CMCDragonkai
Copy link
Member Author

When presenting task ids to the end user, in a similar vein to claim ids, we have to show multibase encoded IDs, as that's the only way for users to be able to easily copy-paste them around. But they are decoded to regular IdSortable when possible. If they are carried around in JSON and record keys they are carried around with string encoding. Something like TaskIdString.

TaskId - the regular `IdSortable` type
TaskIdString - string encoded to be used in record keys
TaskIdEncoded - multibase encoded

@CMCDragonkai
Copy link
Member Author

I'm using extractTs from IdSortable to acquire the timestamp portion of the TaskId, this way the ID and the timestamp that is being stored is the same. The timestamp can never change, but we could allow the delay to be dynamically updated, but this is not an essential feature for how we envision using the Queue.

@CMCDragonkai
Copy link
Member Author

This will be the PR where we do both the DB integration and Queue/Scheduler experiments initially, then depending on whichever one is ready, they can be separated out to be merged in.

@tegefaulkes
Copy link
Contributor

As discussed, I'll separate the changes here into two PRs, The queue changes will move into a new branch feature-queue.

@CMCDragonkai
Copy link
Member Author

You can separate the changes already here into several smaller commits too like introducing jest-extended, and moving PromiseDeconstructed out.

@CMCDragonkai
Copy link
Member Author

Don't forget to rebase on staging too. We are then making a stacked PR from this.

@tegefaulkes
Copy link
Contributor

Summary of changes that need to be made for DB 5.x.x integration.

  • keypath and options are swapped for relevant DB and trans methods.
  • GRPC handlers need to begin the transaction contexts.
  • Any transaction conflicts bubble up to the user for now while we assess how much conflicting may occur.
  • Try and simplfy the if(tran == null) db.withTranF/G blocks. use tran => function() to avoid code bloat. For generators where possible just pass the generator down with a arrow function. Generally try and avoid wrapping it in a withTransactionG if possible.
  • Any simple get/set operations can just use DB directly if tran isn't provided. for single operations a transaction is pointless.
  • lockbox with re-enter-ency is provided, this can be used to avoid conflicts on counters and creation.

@CMCDragonkai
Copy link
Member Author

For all the places where you are using getForUpdate, I'm thinking that this is only necessary during write skews. And during the instance methods for the managers. During the creation/destruction of singleton instances, it shouldn't be necessary since they are singletons.

It's only when multiple concurrent calls are likely, as in the creation/destruction of non-singleton instances, and during method invocations likely triggered from multiple RPC handlers.

@CMCDragonkai
Copy link
Member Author

Also make sure you're that you are sharing the same DBTransaction instance in all the lifecycle handlers.

@CMCDragonkai
Copy link
Member Author

Wherever it is relevant, we can always add locks to ensure serialisation. In fact one could say that all handlers are serialised in some special way, but it is often too complex, and something that we want to minimise to where it is needed. Instead we are going to bubble up the transaction conflict up the handlers. Then we can then come down later from the PK client and add auto-retry where appropriate.

@tegefaulkes
Copy link
Contributor

So If I understand the usage of getForUpdate, we'd want to use it where we get from key A and update key B or C based on the contents of A? In that case we'd want to use it for cases were we have relationships between data or indexing?

The example above is for renaming vaults were we update vaultId->vaultName, vaultName->vaultId and delete oldVaultName. Another example is the ACL where we have tight relationships between vaultId -> [nodeId, nodeId] --> permId -> permRef.

So we'd want to use getForUpdate to help enforce consistency for data we're trying to keep relationships between.

@CMCDragonkai
Copy link
Member Author

What's an example of a ACL write skew? But yes that's the general idea. Have a read on write skew on wikipedia again to refresh memory. It also depends on whether there are any consistency constraints that are being violated here.

This is kind of why I wanted SSI to avoid this problem entirely since it is a complicated edge case.

@tegefaulkes
Copy link
Contributor

For ACL there are two aspects.

  1. we're following a chain of nodeId -> permId -> permRef and end up updating the data in the permRef.
  2. While following this chain we're checking if the nodeIds are missing the permId or permRef and removing these nodes from the permRef data.

Here is a code snippet.

 /**
   * Gets the record of node ids to permission for a given vault id
   * The node ids in the record each represent a unique gestalt
   * If there are no permissions, then an empty record is returned
   */
  @ready(new aclErrors.ErrorACLNotRunning())
  public async getVaultPerm(
    vaultId: VaultId,
    tran?: DBTransaction,
  ): Promise<Record<NodeId, Permission>> {
    if (tran == null) {
      return this.db.withTransactionF((tran) =>
        this.getVaultPerm(vaultId, tran),
      );
    }
    const nodeIds = await tran.get<Record<NodeId, null>>(
      [...this.aclVaultsDbPath, vaultId.toBuffer()],
      false,
    );
    if (nodeIds == null) {
      return {};
    }
    const perms: Record<NodeId, Permission> = {};
    const nodeIdsGc: Set<NodeId> = new Set();
    for (const nodeIdString in nodeIds) {
      const nodeId: NodeId = IdInternal.fromString(nodeIdString);
      const permId = await tran.get(
        [...this.aclNodesDbPath, nodeId.toBuffer()],
        true,
      );
      if (permId == null) {
        // Invalid node id
        nodeIdsGc.add(nodeId);
        continue;
      }
      const permRef = (await tran.get(
        [...this.aclPermsDbPath, permId],
        false,
      )) as Ref<Permission>;
      if (!(vaultId in permRef.object.vaults)) {
        // Vault id is missing from the perm
        nodeIdsGc.add(nodeId);
        continue;
      }
      perms[nodeId] = permRef.object;
    }
    if (nodeIdsGc.size > 0) {
      // Remove invalid node ids
      for (const nodeId of nodeIdsGc) {
        delete nodeIds[nodeId];
      }
      await tran.put(
        [...this.aclVaultsDbPath, vaultId.toBuffer()],
        nodeIds,
        false,
      );
    }
    return perms;
  }

@CMCDragonkai
Copy link
Member Author

Ok we have identified where write skews may exist:

O <-- O
 ^
  \+- O

If it is meant to be a 1:1 relationship, then this application constraint is being violated. It's not a transaction conflict.

Solve by adding a lock to force serialisation. Then by adding a getForUpdate which would force a conflict if a lock wasn't there.

This applies to assigning a vault name to a vault id, and also creating a new claim on the sigchain, as the sigchain must be linear.

@tegefaulkes
Copy link
Contributor

I think this is good now but we'll have to see what the CI shows us. If it's good then I can start squashing and prepping to merge.

@tegefaulkes tegefaulkes force-pushed the feature-concurrent-state branch from 32ecc60 to 1b109d9 Compare August 22, 2022 09:29
@tegefaulkes
Copy link
Contributor

Should be good to merge now.

@CMCDragonkai CMCDragonkai changed the title WIP: Integrate js-db and concurrency testing Integrate js-db and concurrency testing Aug 22, 2022
@CMCDragonkai
Copy link
Member Author

CI failures, a timeout and a transaction conflict error.

@CMCDragonkai
Copy link
Member Author

Does this PR actually resolve #244?

@CMCDragonkai
Copy link
Member Author

For #244, the main idea is that NodeGraph should be taking a tran parameter as per the comment: #244 (comment).... so that way the node graph can be properly transactional and composed with other operations. I don't remember if there was anything else I missed, but @tegefaulkes took over the development of the node graph lately.

@CMCDragonkai
Copy link
Member Author

CMCDragonkai commented Aug 22, 2022

Long term we should remove all the object map locks in favour of the transaction locks. This is because locks don't compose, and all the object maps are just representations of the source of truth on disk. Ideally we should use the tran.queueSuccess and such that can do in-memory changes only after a commit occurs. This means local state inside a transactional function is temporary, and only after a commit succeeds, does global (domain state) get mutated.

Generally I reckon we shouldn't have any operations that is mutating the in-memory state without also involving the underlying disk, so there shouldn't be a situation where they can't coordinate with the transaction locks.

The transaction locks are far more useful generally because they are re-entrant, and they don't release until transaction destruction.

Issue created: #443

@CMCDragonkai
Copy link
Member Author

I reckon we still need to fix that the key paths used for locking and also investigate why:

    ✕ fails after receiving undefined singly signed claim (20523 ms)

This is taking longer than 20s. Seems like a possible deadlock.

@CMCDragonkai
Copy link
Member Author

I reckon we still need to fix that the key paths used for locking and also investigate why:

    ✕ fails after receiving undefined singly signed claim (20523 ms)

This is taking longer than 20s. Seems like a possible deadlock.

Actually I saw that other places also take about 20 seconds there. So I guess it's not a big deal. So we can merge without addressing this for now.

We got lots of things like queues, tasks, cancellation that is coming in to help with managing our asynchronous timelines and resources.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
2 participants