docs: DR for clustered data-plane
ndr-brt committed Oct 2, 2024
1 parent 5dab9df commit 3a67dca
Showing 2 changed files with 40 additions and 0 deletions.
@@ -0,0 +1,39 @@
# Clustered data-plane

## Decision

We will make the data-plane able to run in a clustered environment.

## Rationale

Currently, the data-plane cannot run effectively in a clustered environment because:
- there's no way to identify the specific replica that is running a data flow in order to terminate or suspend it
- there's no way to restart a data flow that was interrupted because its replica crashed

## Approach

We will provide this feature through the `DataPlaneStore` persistence layer.
A `runtimeId` field will be added to the `DataFlow`; when a data flow is started, the field will be set to the `runtimeId` of the replica that runs it.
There will be a configured duration `T`, in the range of milliseconds, or seconds at most (a good name for it still needs to be found).
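
As a rough illustration of the data model described above, the following is a minimal sketch and not the actual entity: `DataFlowSketch`, its string-typed `state`, the `RECEIVED` initial value, and the `isOwnedBy` helper are hypothetical names introduced here. Only the `runtimeId` and `updatedAt` fields and the `STARTED` state come from this record.

```java
import java.time.Instant;

// Hypothetical, simplified stand-in for the real DataFlow entity.
final class DataFlowSketch {
    final String id;
    String state = "RECEIVED";   // placeholder initial state (assumption)
    String runtimeId;            // null until a replica starts the flow
    Instant updatedAt;

    DataFlowSketch(String id) {
        this.id = id;
    }

    // Called by the replica that starts the flow: it stamps its own runtimeId.
    void start(String replicaRuntimeId, Instant now) {
        this.state = "STARTED";
        this.runtimeId = replicaRuntimeId;
        this.updatedAt = now;
    }

    // True when the flow is currently owned by the given replica.
    boolean isOwnedBy(String replicaRuntimeId) {
        return replicaRuntimeId.equals(runtimeId);
    }
}
```

Persisting the owner on the entity itself means any replica reading the store can tell which flows belong to it, without a separate coordination service.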

### Identify specific replica to suspend/terminate

The currently synchronous `suspend`/`terminate` operations will become asynchronous: they will put the `DataFlow` in an `-ING` state, namely:
- `SUSPENDING`
- `TERMINATING`

For termination (the same logic will be duplicated for suspension), two new `Processor`s will be registered in the
`DataPlaneManager` state machine:
- one filters by `TERMINATING` state and by the replica's own `runtimeId`: it will stop the data flow and transition it to `TERMINATED`
- one filters by `TERMINATING` state and by an `updatedAt` that lies at least 2/3 times `T` in the past: it will transition the data flow to
  `TERMINATED` (as cleanup for dangling data flows)
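
The two processors can be sketched roughly as follows. This is an in-memory illustration under stated assumptions, not the `DataPlaneManager` implementation: `TerminationSketch`, `Flow`, `terminateOwned`, `cleanUpDangling`, and the `staleFactor` multiplier of `T` are hypothetical names, and a plain `List` stands in for the `DataPlaneStore`.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

final class TerminationSketch {
    enum State { STARTED, TERMINATING, TERMINATED }

    // Hypothetical, simplified data flow record.
    static final class Flow {
        final String id;
        final String runtimeId;
        State state;
        Instant updatedAt;

        Flow(String id, String runtimeId, State state, Instant updatedAt) {
            this.id = id;
            this.runtimeId = runtimeId;
            this.state = state;
            this.updatedAt = updatedAt;
        }
    }

    private final String ownRuntimeId;
    private final Duration staleAfter;               // staleFactor * T (assumption)
    final List<String> stopped = new ArrayList<>();  // records which flows were actually stopped

    TerminationSketch(String ownRuntimeId, Duration t, int staleFactor) {
        this.ownRuntimeId = ownRuntimeId;
        this.staleAfter = t.multipliedBy(staleFactor);
    }

    // Processor 1: TERMINATING flows owned by this replica are stopped and terminated.
    void terminateOwned(List<Flow> store, Instant now) {
        for (Flow flow : store) {
            if (flow.state == State.TERMINATING && ownRuntimeId.equals(flow.runtimeId)) {
                stopped.add(flow.id); // stand-in for stopping the running transfer
                flow.state = State.TERMINATED;
                flow.updatedAt = now;
            }
        }
    }

    // Processor 2: TERMINATING flows that nobody touched for too long are cleaned up.
    void cleanUpDangling(List<Flow> store, Instant now) {
        Instant threshold = now.minus(staleAfter);
        for (Flow flow : store) {
            if (flow.state == State.TERMINATING && !flow.updatedAt.isAfter(threshold)) {
                flow.state = State.TERMINATED;
                flow.updatedAt = now;
            }
        }
    }
}
```

The second processor deliberately ignores `runtimeId`: its purpose is to finish flows whose owning replica is no longer around to run the first one.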

### Re-start interrupted data flow

Please consider `T` as a configured time duration (milliseconds, seconds at most).

A running data flow will need to update its `updatedAt` field every `T`.
The `DataPlaneManager` state machine will fetch items in `STARTED` state whose `runtimeId` differs from the replica's own
and whose `updatedAt` lies at least 2/3 times `T` in the past.
These data flows can then be started again.
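
The heartbeat and the restart check can be sketched as below. Again this is an in-memory illustration under assumptions: `RestartSketch`, `Flow`, `heartbeat`, `restartAbandoned`, and the `staleFactor` multiplier of `T` are hypothetical names, and taking ownership by overwriting `runtimeId` is this sketch's reading of "started again". A real store would also need to make the takeover atomic so that two replicas cannot restart the same flow.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

final class RestartSketch {
    enum State { STARTED, COMPLETED }

    // Hypothetical, simplified data flow record.
    static final class Flow {
        final String id;
        String runtimeId;
        State state;
        Instant updatedAt;

        Flow(String id, String runtimeId, State state, Instant updatedAt) {
            this.id = id;
            this.runtimeId = runtimeId;
            this.state = state;
            this.updatedAt = updatedAt;
        }
    }

    private final String ownRuntimeId;
    private final Duration staleAfter; // staleFactor * T (assumption)

    RestartSketch(String ownRuntimeId, Duration t, int staleFactor) {
        this.ownRuntimeId = ownRuntimeId;
        this.staleAfter = t.multipliedBy(staleFactor);
    }

    // Meant to run every T: refreshes updatedAt on the flows this replica is running.
    void heartbeat(List<Flow> store, Instant now) {
        for (Flow flow : store) {
            if (flow.state == State.STARTED && ownRuntimeId.equals(flow.runtimeId)) {
                flow.updatedAt = now;
            }
        }
    }

    // Picks up STARTED flows of other replicas whose heartbeat has gone stale.
    List<Flow> restartAbandoned(List<Flow> store, Instant now) {
        Instant threshold = now.minus(staleAfter);
        List<Flow> restarted = new ArrayList<>();
        for (Flow flow : store) {
            boolean foreign = !ownRuntimeId.equals(flow.runtimeId);
            boolean stale = !flow.updatedAt.isAfter(threshold);
            if (flow.state == State.STARTED && foreign && stale) {
                flow.runtimeId = ownRuntimeId; // this replica now owns the flow
                flow.updatedAt = now;
                restarted.add(flow);           // stand-in for starting the transfer again
            }
        }
        return restarted;
    }
}
```

Choosing the staleness threshold as a multiple of `T` rather than `T` itself leaves room for a healthy replica to miss a heartbeat or two without losing its flows.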

1 change: 1 addition & 0 deletions docs/developer/decision-records/README.md
@@ -62,3 +62,4 @@
- [2024-08-16 Policy_Validation](./2024-08-16-policy-validation)
- [2024-09-24 STS Accounts API](./2024-09-24-sts-accounts-api)
- [2024-09-25 Multiple Protocol Versions](./2024-09-25-multiple-protocol-versions)
- [2024-10-02 Clustered data-plane](./2024-10-02-clustered-data-plane/)
