feat: repair is rewarded #1022

Merged: 6 commits merged into master from feat/repair-reward on Dec 12, 2024

Conversation

AuHau (Member) commented on Dec 5, 2024:

Adds support for the contract changes that introduce the repair reward, as described in codex-storage/codex-contracts-eth#154.

There is one open question: what should the node do when it restarts (after an outage, for example) and finds its slot in the Repair state? For more, see below.

Comment on lines 53 to 57:

    of SlotState.Repair:
      # TODO: What to do? It could mean the node was offline for so long that it was kicked
      # out of the slot and lost all its collateral. Should we simply "give up" on the slot,
      # or try to "repair" it (there is some chance that the dataset is still saved locally)?
      discard

AuHau (Member, author):

What should happen here? Any suggestions?

emizzle (Contributor) replied:

In this case, the slot has been in the Repair state for some time after it was freed, so the node would have missed the SlotFreed event and would not have re-added the slot to the slot queue. Maybe we could add the slot back into the slot queue here, and also add a SaleSlotFreed state that simply logs that the slot was freed?

We might also want to cover the Repair case in sales/states/proving.proveLoop. That case is a bit different: the node will be running and presumably online, so it will have received the SlotFreed event, putting the slot id back in the slot queue. But we could still move to a SaleSlotFreed state to log that the slot was freed.

Moving to a new state I think will help us when we try to tackle "preserving" state history for the UI in #885.
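
To make the suggestion concrete, here is a minimal, self-contained Nim sketch of the decision being discussed: what the restart/load path or the prove loop could do when it observes the slot in the Repair state. The simplified SlotState values and the ProveLoopAction/actionFor names are illustrative only, not the nim-codex API.

    type
      SlotState = enum
        Filled, Repair, Finished, Failed   # simplified; the real contract has more states

      ProveLoopAction = enum
        keepProving   # slot is still filled: keep submitting proofs
        requeueSlot   # slot was freed / needs repair: re-add it to the slot queue
        stopProving   # request finished or failed: leave the loop

    proc actionFor(state: SlotState): ProveLoopAction =
      ## What the prove loop (or the restart/load path) should do for the
      ## slot state it has just observed.
      case state
      of SlotState.Filled: keepProving
      of SlotState.Repair: requeueSlot     # instead of silently discarding it
      of SlotState.Finished, SlotState.Failed: stopProving

    when isMainModule:
      doAssert actionFor(SlotState.Repair) == requeueSlot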

AuHau (Member, author) replied:

Ooooh, I forgot about one thing: when a slot is forcibly freed, it is also removed from the node's slot tracking, so when the node comes back online and starts "reloading state" from that tracking, it will not actually find out that it was kicked out of the slot. So I guess the problem is solved, but IMHO this is bad UX 😅 And it circles back to the big task of keeping a metadata store for the marketplace and not relying only on contract state. But that is another issue...

emizzle (Contributor) replied:

Good point! Should we still handle the proveLoop case in this PR?

AuHau (Member, author) replied:

> Good point! Should we still handle the proveLoop case in this PR?

I can't imagine this case happening (often), but yeah, I will add handling for it in the proving loop - good point!

AuHau (Member, author) replied:

> But we could still move to a SaleSlotFreed state to log that a slot was freed.

I have chosen not to create another new state, as it would simply transition to SaleErrored. Instead, I am logging the repair case and raising an exception that automatically transitions to SaleErrored. Maybe the question is what should happen regarding returnBytes and reprocessSlot? I don't think we need to set reprocessSlot, but maybe we should prevent returnBytes, as that would clean up the data from the node and potentially prevent "refilling"? 🤔 But at the same time, if the refill did not happen for any reason, there would be a "data leak", as the bytes would be garbage collected only after the original slot's duration expires...
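
A minimal sketch, in plain Nim, of the "log and raise" approach described here; the exception type and proc are invented for illustration, and in the actual PR it is the sales state machine's existing error handling that performs the transition to SaleErrored.

    type SlotFreedError = object of CatchableError

    proc onRepairDetected(slotId: string) =
      ## Log that the slot was freed / is awaiting repair, then raise; the
      ## caller's error handling is what moves the sale to SaleErrored.
      echo "slot is in repair state (it was freed): ", slotId
      raise newException(SlotFreedError,
        "slot " & slotId & " was freed and is awaiting repair")

    when isMainModule:
      try:
        onRepairDetected("0xabc")
      except SlotFreedError as e:
        echo "transitioning to SaleErrored: ", e.msg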

emizzle (Contributor) replied:

> it circles back to the big task of keeping a metadata store for the marketplace and not relying only on contract state

This is discussed in #885. We both agreed to keep myPastRequests in the contract until we find time to develop storing it locally.

> I have chosen not to create another new state, as it would simply transition to SaleErrored. Instead, I am logging the repair case and raising an exception that automatically transitions to SaleErrored.

This is one of those situations where we know that this sequence of events can happen, but is it really an error? Imo, we should treat this as a known sequence, and not an unexpected error. As I said before, it will also help when we implement #885, so that some kind of persistence of the final state can happen.

> Maybe the question is what should happen regarding returnBytes and reprocessSlot?

returnBytes only returns bytes to the Availability; it does not clean up data from the node. The data will not be cleaned up until the end of the storage request (this happens in SaleDownloading > onStore > updateExpiry). Imo, we should returnBytes to the Availability so we can refill the slot.

Should we reprocessSlot? Well, that's a bit trickier, because if we rely on the proveLoop (at the start of the next period) to realise the slot is in repair, we'll be later than other nodes who would have received the SlotFreed contract event earlier. However, we'll still have the data, so we have an advantage. And if we rely on the contract event only, then we bypass the state machine. So, we can combine the two (more work, I know):

  1. Listen for the SlotFreed event and interrupt the SaleProving state with an onSlotFreed handler, moving to the SaleRepairing state.
  2. (I'd still keep the check in the proveLoop for the repair case and move to the SaleRepairing state)
  3. In SaleRepairing, there are two options:
    1. Call onCleanUp(returnBytes: false, reprocessSlot: false) and return to the SaleInitialProving state. Remove the logic that re-adds the slot to the queue on SlotFreed.
    2. Call onCleanUp(returnBytes: true, reprocessSlot: true), relying on the SlotFreed contract event to re-add the slot to the queue. If there are spare workers, this re-add will be processed before the SaleRepairing step, so we would hope that there was enough Availability...

My vote would be option 1 as it's the easiest to understand (imo lol), removes hidden things happening (contract event reprocessing), and sets us up for #885. It's more work, but I think it will be much easier to follow logically. Likely should be done in a separate PR to keep changes small. In the meantime, for this PR, my vote is to have a SlotRepairing state that calls onCleanUp(returnBytes: true, reprocessSlot: false).
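
A tiny sketch of the interim clean-up behaviour proposed for this PR: a repairing state that returns the bytes to the Availability but leaves re-queuing to the SlotFreed contract event. The CleanUpFlags object and the proc name are illustrative, not the nim-codex API.

    type CleanUpFlags = object
      returnBytes: bool     # hand the reserved bytes back to the Availability
      reprocessSlot: bool   # re-add the slot to the slot queue ourselves

    proc slotRepairingCleanUp(): CleanUpFlags =
      ## Return the bytes so the Availability can be refilled, but rely on the
      ## SlotFreed contract event (not this state) to re-add the slot id to the queue.
      CleanUpFlags(returnBytes: true, reprocessSlot: false)

    when isMainModule:
      let flags = slotRepairingCleanUp()
      doAssert flags.returnBytes and not flags.reprocessSlot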

AuHau force-pushed the feat/repair-reward branch 2 times, most recently from f61a7ea to f568fc0 on December 9, 2024 at 16:27
AuHau marked this pull request as ready for review on December 11, 2024 at 08:33

emizzle (Contributor) left a review:

Approved, as there are no blockers, but maybe we should discuss further what should happen when a slot is in the Repair state. Imo, it should go to a SlotRepairing state (which can also call onCleanUp). Anyway, I commented more in the "conversation".

      # This should never really happen, as the slot is removed from the node's on-chain slot
      # tracking when the slot is freed. Hence, on a node restart, a slot in the Repair state
      # should not be loaded from the chain.
      discard

emizzle (Contributor) commented:

Maybe log an error?

AuHau force-pushed the feat/repair-reward branch from 7617067 to 87c3e07 on December 12, 2024 at 17:42
AuHau added this pull request to the merge queue on Dec 12, 2024
Merged via the queue into master with commit 19af797 Dec 12, 2024
17 checks passed
AuHau deleted the feat/repair-reward branch on December 12, 2024 at 21:42