docs: DR for clustered data-plane (#4522)
* docs: DR for clustered data-plane

* PR remark

* PR remarks
ndr-brt authored Oct 4, 2024
1 parent d2b2f90 commit b46a4c1
Showing 2 changed files with 45 additions and 0 deletions.
@@ -0,0 +1,44 @@
# Clustered data-plane

## Decision

We will make the data-plane able to run in a clustered environment.

## Rationale

Currently, the data-plane cannot run effectively in a clustered environment because:
- there's no way to identify a specific replica that is running a data flow and "terminate/suspend" it
- there's no way to re-start a data flow that was interrupted because the replica crashed

## Approach

We will provide this feature through the `DataPlaneStore` persistence layer.
A `runtimeId` field will be added to the `DataFlow`; when a data flow is started, it will be set to the `runtimeId` of the replica that starts it.
There will be a configured duration, `flowLease`, expected to be in the order of milliseconds, or seconds at most.
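
As a rough illustration of the ownership field, here is a minimal sketch. `DataFlowSketch` is a hypothetical, simplified stand-in and not the actual `DataFlow` entity; the state names and accessors are assumptions made for the example.

```java
import java.time.Instant;

// Hypothetical, simplified stand-in for the DataFlow entity (not the real class).
final class DataFlowSketch {
    enum State { RECEIVED, STARTED }

    private final String id;
    private State state = State.RECEIVED;
    private String runtimeId;   // null until a replica starts the flow
    private Instant updatedAt;  // null until the flow is first started

    DataFlowSketch(String id) {
        this.id = id;
    }

    // Called by the replica that starts the flow: records ownership and refreshes the timestamp.
    void start(String replicaRuntimeId, Instant now) {
        this.runtimeId = replicaRuntimeId;
        this.state = State.STARTED;
        this.updatedAt = now;
    }

    String id() { return id; }
    State state() { return state; }
    String runtimeId() { return runtimeId; }
    Instant updatedAt() { return updatedAt; }
}
```

With this shape, any replica reading the flow from the store can tell which replica owns it and when it was last touched.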

### Identify specific replica to suspend/terminate

The currently synchronous `suspend`/`terminate` operations will become asynchronous by putting the `DataFlow` in an `-ING` state:
- `SUSPENDING`
- `TERMINATING`

For termination (the same logic will be duplicated for suspension), two new `Processor`s will be registered in the
`DataPlaneManager` state machine:
- one filters by `TERMINATING` state and by the replica's own `runtimeId`: it will stop the data flow and transition it to `TERMINATED`
- one filters by `TERMINATING` state and by an `updatedAt` that is at least 2 to 3 times `flowLease` in the past: it will
  transition the data flow to `TERMINATED` (as cleanup for dangling data flows).
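
The two filters can be sketched as plain predicates. This is only an illustration under stated assumptions: `TerminationFilters`, `Flow`, and the `leaseFactor` parameter (the multiple of `flowLease` after which a flow counts as dangling) are hypothetical names, and real processors would query the `DataPlaneStore` rather than test objects in memory.

```java
import java.time.Duration;
import java.time.Instant;

final class TerminationFilters {
    enum State { STARTED, TERMINATING, TERMINATED }

    // Hypothetical minimal view of a persisted data flow.
    static final class Flow {
        final State state;
        final String runtimeId;
        final Instant updatedAt;

        Flow(State state, String runtimeId, Instant updatedAt) {
            this.state = state;
            this.runtimeId = runtimeId;
            this.updatedAt = updatedAt;
        }
    }

    // First processor: TERMINATING flows owned by this replica, which can actually stop the transfer.
    static boolean ownedAndTerminating(Flow flow, String ownRuntimeId) {
        return flow.state == State.TERMINATING && ownRuntimeId.equals(flow.runtimeId);
    }

    // Second processor: TERMINATING flows last updated at least leaseFactor * flowLease ago,
    // whoever the owner is (the owner is presumed to be gone).
    static boolean danglingAndTerminating(Flow flow, Instant now, Duration flowLease, int leaseFactor) {
        Duration age = Duration.between(flow.updatedAt, now);
        return flow.state == State.TERMINATING
                && age.compareTo(flowLease.multipliedBy(leaseFactor)) >= 0;
    }
}
```

Keeping the staleness multiple as a parameter leaves the exact threshold as a configuration choice.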

Note: once the "termination" message is sent from the control-plane to the data-plane and the ACK is received, the control-plane
will consider the `DataFlow` as terminated, and it will continue evaluating the termination logic on the `TransferProcess`
(send protocol message, transition to `TERMINATED`).
We consider this acceptable because `DataFlow` termination is generally a cleanup operation that shouldn't take too much time.

### Re-start interrupted data flow

Here, `flowLease` is a configured time duration (milliseconds, or seconds at most).

A running data flow will need to update its `updatedAt` field every `flowLease`.
The `DataPlaneManager` state machine will fetch items in `STARTED` state whose `runtimeId` differs from that of the current replica
and whose `updatedAt` is at least 2 to 3 times `flowLease` in the past.
These data flows can then be started again.
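
The lease renewal and takeover described above could look roughly like the following sketch. All names (`RestartSketch`, `Flow`, `renewLease`, `tryTakeOver`, `leaseFactor`) are hypothetical, and the example works on an in-memory object; an actual implementation would presumably rely on the `DataPlaneStore` so that only one replica can claim a given flow.

```java
import java.time.Duration;
import java.time.Instant;

final class RestartSketch {
    // Hypothetical minimal view of a STARTED data flow.
    static final class Flow {
        String runtimeId;
        Instant updatedAt;

        Flow(String runtimeId, Instant updatedAt) {
            this.runtimeId = runtimeId;
            this.updatedAt = updatedAt;
        }
    }

    // Heartbeat: the owning replica calls this every flowLease while the flow is running.
    static void renewLease(Flow flow, Instant now) {
        flow.updatedAt = now;
    }

    // A flow is considered abandoned when another replica owns it and its
    // updatedAt is at least leaseFactor * flowLease in the past.
    static boolean isAbandoned(Flow flow, String ownRuntimeId, Instant now,
                               Duration flowLease, int leaseFactor) {
        boolean ownedByOther = !ownRuntimeId.equals(flow.runtimeId);
        Duration age = Duration.between(flow.updatedAt, now);
        return ownedByOther && age.compareTo(flowLease.multipliedBy(leaseFactor)) >= 0;
    }

    // Claims an abandoned flow for this replica so it can be started again.
    // Returns false, leaving the flow untouched, when the flow is not abandoned.
    static boolean tryTakeOver(Flow flow, String ownRuntimeId, Instant now,
                               Duration flowLease, int leaseFactor) {
        if (!isAbandoned(flow, ownRuntimeId, now, flowLease, leaseFactor)) {
            return false;
        }
        flow.runtimeId = ownRuntimeId;
        flow.updatedAt = now;
        return true;
    }
}
```

Because the owner refreshes `updatedAt` once per `flowLease`, the staleness threshold has to be larger than one lease; otherwise a healthy flow could be claimed between two heartbeats.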

1 change: 1 addition & 0 deletions docs/developer/decision-records/README.md
@@ -62,3 +62,4 @@
- [2024-08-16 Policy_Validation](./2024-08-16-policy-validation)
- [2024-09-24 STS Accounts API](./2024-09-24-sts-accounts-api)
- [2024-09-25 Multiple Protocol Versions](./2024-09-25-multiple-protocol-versions)
- [2024-10-02 Clustered data-plane](./2024-10-02-clustered-data-plane/)
