feat: repair is rewarded #1022

Merged: 6 commits merged into master from feat/repair-reward on Dec 12, 2024

Conversation

AuHau (Member) commented on Dec 5, 2024:

Adds support for the contract changes that introduce the repair reward, as described in codex-storage/codex-contracts-eth#154.

There is one open question: what should the node do when it restarts (after an outage, for example) and finds its slot in the Repair state? For more, see below.

Comment on lines 53 to 57:

    of SlotState.Repair:
      # TODO: What to do? It could mean the node was offline for so long that it was kicked
      # out of the slot and lost all its collateral. Should we simply "give up" on the slot,
      # or try to "repair" it (there is some chance that the dataset is still saved locally)?
      discard

AuHau (Member, author):

What should happen here? Any suggestions?

emizzle (Contributor) replied:

In this case, the slot has been in the Repair state for some time after it was freed, so the node would have missed the SlotFreed event and would not have re-added the slot to the slot queue. Maybe we could add the slot back into the slot queue here, and also add a SaleSlotFreed state that simply logs that the slot was freed?

We might also want to cover the Repair case in sales/states/proving.proveLoop. That case is a bit different: the node will be running and presumably online, so it will have received the SlotFreed event, putting the slot id back in the slot queue. But we could still move to a SaleSlotFreed state to log that the slot was freed.

Moving to a new state I think will help us when we try to tackle "preserving" state history for the UI in #885.
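
To make the suggestion concrete, here is a minimal, self-contained Nim sketch of the decision being discussed: what the restart/load path or the prove loop could do when it observes the slot in the Repair state. The simplified SlotState values and the ProveLoopAction/actionFor names are illustrative only, not the nim-codex API.

    type
      SlotState = enum
        Filled, Repair, Finished, Failed   # simplified; the real contract has more states

      ProveLoopAction = enum
        keepProving   # slot is still filled: keep submitting proofs
        requeueSlot   # slot was freed / needs repair: re-add it to the slot queue
        stopProving   # request finished or failed: leave the loop

    proc actionFor(state: SlotState): ProveLoopAction =
      ## What the prove loop (or the restart/load path) should do for the
      ## slot state it has just observed.
      case state
      of SlotState.Filled: keepProving
      of SlotState.Repair: requeueSlot     # instead of silently discarding it
      of SlotState.Finished, SlotState.Failed: stopProving

    when isMainModule:
      doAssert actionFor(SlotState.Repair) == requeueSlot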

AuHau (Member, author) replied:

Ooooh, I forgot about one thing: when a slot is forcibly freed, it is also removed from the node's slot tracking, so when the node comes back online and starts "reloading state" from that tracking, it will not actually find out that it was kicked out of the slot. So I guess the problem is solved, but IMHO this is bad UX 😅 And it circles back to the big task of keeping a metadata store for the marketplace and not relying only on contract state. But that is another issue...

emizzle (Contributor) replied:

Good point! Should we still handle the proveLoop case in this PR?

AuHau (Member, author) replied:

> Good point! Should we still handle the proveLoop case in this PR?

I can't imagine this case happening (often), but yeah, I will add handling for it in the proving loop - good point!

AuHau (Member, author) replied:

> But we could still move to a SaleSlotFreed state to log that a slot was freed.

I have chosen not to create another new state, as it would simply transition to SaleErrored. Instead, I am logging the repair case and raising an exception that automatically transitions to SaleErrored. Maybe the question is what should happen regarding returnBytes and reprocessSlot? I don't think we need to set reprocessSlot, but maybe we should prevent returnBytes, as that would clean up the data from the node and potentially prevent "refilling"? 🤔 But at the same time, if the refill did not happen for any reason, there would be a "data leak", as the bytes would be garbage collected only after the original slot's duration expires...
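
A minimal sketch, in plain Nim, of the "log and raise" approach described here; the exception type and proc are invented for illustration, and in the actual PR it is the sales state machine's existing error handling that performs the transition to SaleErrored.

    type SlotFreedError = object of CatchableError

    proc onRepairDetected(slotId: string) =
      ## Log that the slot was freed / is awaiting repair, then raise; the
      ## caller's error handling is what moves the sale to SaleErrored.
      echo "slot is in repair state (it was freed): ", slotId
      raise newException(SlotFreedError,
        "slot " & slotId & " was freed and is awaiting repair")

    when isMainModule:
      try:
        onRepairDetected("0xabc")
      except SlotFreedError as e:
        echo "transitioning to SaleErrored: ", e.msg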

emizzle (Contributor) replied:

> it circles back to the big task of keeping a metadata store for the marketplace and not relying only on contract state

This is discussed in #885. We both agreed to keep myPastRequests in the contract until we find time to develop storing it locally.

> I have chosen not to create another new state, as it would simply transition to SaleErrored. Instead, I am logging the repair case and raising an exception that automatically transitions to SaleErrored.

This is one of those situations where we know that this sequence of events can happen, but is it really an error? Imo, we should treat this as a known sequence, and not an unexpected error. As I said before, it will also help when we implement #885, so that some kind of persistence of the final state can happen.

> Maybe the question is what should happen regarding returnBytes and reprocessSlot?

returnBytes only returns bytes to the Availability; it does not clean up data from the node. The data will not be cleaned up until the end of the storage request (this happens in SaleDownloading > onStore > updateExpiry). Imo, we should returnBytes to the Availability so we can refill the slot.

Should we reprocessSlot? Well, that's a bit trickier, because if we rely on the proveLoop (at the start of the next period) to realise the slot is in repair, we'll be later than other nodes who would have received the SlotFreed contract event earlier. However, we'll still have the data, so we have an advantage. And if we rely on the contract event only, then we bypass the state machine. So, we can combine the two (more work, I know):

  1. Listen for the SlotFreed event and interrupt the SaleProving state with an onSlotFreed handler, moving to the SaleRepairing state.
  2. (I'd still keep the check in the proveLoop for the repair case and move to the SaleRepairing state)
  3. In SaleRepairing, there are two options:
    1. Call onCleanUp(returnBytes: false, reprocessSlot: false) and return to the SaleInitialProving state. Remove the logic that re-adds the slot to the queue on SlotFreed.
    2. Call onCleanUp(returnBytes: true, reprocessSlot: true), relying on the SlotFreed contract event to re-add the slot to the queue. If there are spare workers, this re-add will be processed before the SaleRepairing step, so we would hope that there was enough Availability...

My vote would be option 1 as it's the easiest to understand (imo lol), removes hidden things happening (contract event reprocessing), and sets us up for #885. It's more work, but I think it will be much easier to follow logically. Likely should be done in a separate PR to keep changes small. In the meantime, for this PR, my vote is to have a SlotRepairing state that calls onCleanUp(returnBytes: true, reprocessSlot: false).
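
A tiny sketch of the interim clean-up behaviour proposed for this PR: a repairing state that returns the bytes to the Availability but leaves re-queuing to the SlotFreed contract event. The CleanUpFlags object and the proc name are illustrative, not the nim-codex API.

    type CleanUpFlags = object
      returnBytes: bool     # hand the reserved bytes back to the Availability
      reprocessSlot: bool   # re-add the slot to the slot queue ourselves

    proc slotRepairingCleanUp(): CleanUpFlags =
      ## Return the bytes so the Availability can be refilled, but rely on the
      ## SlotFreed contract event (not this state) to re-add the slot id to the queue.
      CleanUpFlags(returnBytes: true, reprocessSlot: false)

    when isMainModule:
      let flags = slotRepairingCleanUp()
      doAssert flags.returnBytes and not flags.reprocessSlot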

AuHau force-pushed the feat/repair-reward branch 2 times, most recently from f61a7ea to f568fc0 on December 9, 2024 at 16:27
AuHau marked this pull request as ready for review on December 11, 2024 at 08:33

emizzle (Contributor) left a review:

Approved, as there are no blockers, but maybe we should discuss further what should happen when a slot is in the Repair state. Imo, it should go to a SlotRepairing state (which can also call onCleanUp). Anyway, I commented more in the "conversation".

      # This should never really happen, as the slot is removed from the node's on-chain slot
      # tracking when the slot is freed. Hence, on a node restart, a slot in the Repair state
      # should not be loaded from the chain.
      discard

emizzle (Contributor) commented:

Maybe log an error?

AuHau force-pushed the feat/repair-reward branch from 7617067 to 87c3e07 on December 12, 2024 at 17:42
AuHau added this pull request to the merge queue on Dec 12, 2024
Merged via the queue into master with commit 19af797 Dec 12, 2024
17 checks passed
AuHau deleted the feat/repair-reward branch on December 12, 2024 at 21:42