feat: repair is rewarded #1022
Conversation
codex/sales/states/unknown.nim (outdated)
```nim
of SlotState.Repair:
  # TODO: What to do? It could really mean that the node was offline for so long that it was
  # kicked out of the slot and lost all the collateral. Should we simply "give up" on the slot?
  # Or try to "repair" the slot (there is some chance that the dataset is still saved locally)?
  discard
```
What should happen here? Any suggestions?
In this case, we've seen that the slot can be in a Repair state for some time after the slot was freed, so the node would have missed the SlotFreed event and not have re-added the slot to the slot queue. Maybe we could add the slot back into the slot queue here, and also add a SaleSlotFreed state that simply logs that the slot was freed?

We might want to consider that we cover the Repair case in sales/states/proving.proveLoop. This case is a bit different, since the node will be running and presumably online, so it will have received the SlotFreed event, putting the slot id back in the slot queue. But we could still move to a SaleSlotFreed state to log that a slot was freed.

Moving to a new state, I think, will help us when we try to tackle "preserving" state history for the UI in #885.
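As a rough illustration, a minimal sketch of what such a SaleSlotFreed state could look like; the imports, the SaleState/SalesAgent types, and the run signature are assumptions modelled on the shape of the existing sales states, not code from this PR:

```nim
import pkg/chronicles
import pkg/questionable
import ../salesagent
import ../statemachine

type SaleSlotFreed* = ref object of SaleState

method `$`*(state: SaleSlotFreed): string =
  "SaleSlotFreed"

method run*(state: SaleSlotFreed, machine: Machine): Future[?State] {.async.} =
  let data = SalesAgent(machine).data
  # purely informational: record that the slot was freed so state history
  # (see #885) can surface it, then let the sale end without raising an error
  info "Slot was freed", requestId = data.requestId, slotIndex = data.slotIndex
```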
Ooooh, I forgot about one thing: when a slot is forcibly freed, it is also removed from the node's slot tracking. So when the node comes back online and starts "reloading state" from this tracking, it will actually not find out that it was kicked out of a slot. So I guess the problem is solved, but IMHO this is bad UX 😅 And it circles back to the big task of keeping some metadata store of the marketplace and not relying only on the contracts' state. But that is another issue...
Good point! Should we still handle the proveLoop case in this PR?
> Good point! Should we still handle the proveLoop case in this PR?

I can't imagine that this case will happen (often), but yeah, I will add handling for it in the proving loop - good point!
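For illustration only, a hedged sketch of what that check inside sales/states/proving.proveLoop might look like; market, slotId, and SlotFreedError are names assumed from this discussion, not taken from the PR's diff:

```nim
# Sketch: inside the per-period proving loop, inspect the onchain slot state
# and stop proving once the slot has been forcibly freed into Repair.
let slotState = await market.slotState(slotId)
case slotState
of SlotState.Repair:
  # the slot was forcibly freed (e.g. too many missed proofs); raising here
  # lets the sales state machine react instead of proving for a lost slot
  raise newException(SlotFreedError, "Slot was freed and is awaiting repair")
else:
  discard # any other state: continue with the normal proving flow
```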
> But we could still move to a SaleSlotFreed state to log that a slot was freed.

I have chosen not to create another new state, as it would simply transition to SaleErrored; instead, I am logging the repair case and raising an exception that automatically transitions to SaleErrored. Maybe the question is what should happen regarding returnBytes and reprocessSlot? I don't think we need to set reprocessSlot, but maybe we should prevent returnBytes, as that would clean up the data from the node and prevent a potential "refilling"? 🤔 But at the same time, if the refilling did not happen for any reason, there would be a "data leak", as the bytes would be garbage collected only after the original slot's duration expires...
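For concreteness, a sketch of that approach inside the unknown state's dispatch over SlotState; the SlotFreedError type and the log fields are assumptions based on this thread, not necessarily the PR's exact code:

```nim
of SlotState.Repair:
  # the node was offline long enough to be kicked out of the slot; log the
  # situation and raise, which the state machine turns into SaleErrored
  error "Slot is in Repair state; the host was removed from the slot",
    requestId = data.requestId, slotIndex = data.slotIndex
  raise newException(SlotFreedError,
    "Slot was forcibly freed and the host was removed from it")
```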
> it circles back to the big task of keeping some metadata store of the marketplace and not relying only on contracts state

This is discussed in #885. We both agreed to keep myPastRequests in the contract until we find time to develop storing it locally.
> I have chosen not to create another new state, as it would simply transition to SaleErrored; instead, I am logging the repair case and raising an exception that automatically transitions to SaleErrored.
This is one of those situations where we know that this sequence of events can happen, but is it really an error? Imo, we should treat this as a known sequence, and not an unexpected error. As I said before, it will also help when we implement #885, so that some kind of persistence of the final state can happen.
> Maybe the question is what should happen regarding returnBytes and reprocessSlot?
returnBytes only returns bytes to the Availability; it does not clean up data from the node. The data will not be cleaned up until the end of the storage request (this happens in SaleDownloading > onStore > updateExpiry). Imo, we should returnBytes to the Availability so we can refill the slot.

Should we reprocessSlot? Well, that's a bit trickier, because if we rely on the proveLoop (at the start of the next period) to realise the slot is in repair, we'll be later than other nodes who would have received the SlotFreed contract event earlier. However, we'll still have the data, so we have an advantage. And if we rely on the contract event only, then we bypass the state machine. So, we can combine the two (more work, I know):
- Listen for the SlotFreed event and interrupt the SaleProving state with an onSlotFreed handler, moving to the SaleRepairing state.
- (I'd still keep the check in the proveLoop for the repair case and move to the SaleRepairing state.)
- In SaleRepairing, there are two options:
  1. Call onCleanUp(returnBytes: false, reprocessSlot: false) and return to the SaleInitialProving state. Remove the logic that re-adds the slot to the queue on SlotFreed.
  2. Call onCleanUp(returnBytes: true, reprocessSlot: true), relying on the SlotFreed contract event to re-add the slot to the queue. If there are spare workers, this re-add will be processed before the SaleRepairing step, so we would hope that there was enough Availability...
My vote would be option 1, as it's the easiest to understand (imo lol), removes hidden things happening (contract event reprocessing), and sets us up for #885. It's more work, but I think it will be much easier to follow logically. It should likely be done in a separate PR to keep changes small. In the meantime, for this PR, my vote is to have a SlotRepairing state that calls onCleanUp(returnBytes: true, reprocessSlot: false).
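To make that vote concrete, here is a rough sketch of such a repairing state (named SaleRepairing here, matching the other Sale* states), assuming the SaleState/SalesAgent conventions and an onCleanUp callback with returnBytes/reprocessSlot flags as discussed above; none of this is the PR's actual code:

```nim
import pkg/chronicles
import pkg/questionable
import ../salesagent
import ../statemachine

type SaleRepairing* = ref object of SaleState

method `$`*(state: SaleRepairing): string =
  "SaleRepairing"

method run*(state: SaleRepairing, machine: Machine): Future[?State] {.async.} =
  let agent = SalesAgent(machine)
  # return the reserved bytes to the Availability so the slot can be refilled,
  # but leave re-queueing to the SlotFreed contract event / proveLoop check
  await agent.onCleanUp(returnBytes = true, reprocessSlot = false)
```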
Force-pushed from f61a7ea to f568fc0.
Approved, as there are no blockers, but maybe we should discuss further what should happen when a slot is in the repair state. Imo, it should go to a SlotRepairing state (which can also call onCleanUp). Anyway, I commented more in the "conversation".
codex/sales/states/unknown.nim (outdated)
```nim
# This should never really happen, as the slot is removed from the node's onchain
# slot tracking upon freeing the slot. Hence, upon the node's restart, a slot in
# the Repair state should not be loaded from onchain.
discard
```
Maybe log an error?
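A hedged sketch of that suggestion, using chronicles-style logging (the log fields are assumptions):

```nim
of SlotState.Repair:
  # this branch is not expected to be reachable (see the comment above),
  # so make any occurrence loudly visible in the logs
  error "Slot in Repair state was unexpectedly loaded from onchain tracking",
    requestId = data.requestId, slotIndex = data.slotIndex
```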
Force-pushed from 7617067 to 87c3e07.
Adds support for changes in the contract that introduce a repair reward, as described in codex-storage/codex-contracts-eth#154.

There is one question to be answered about what the node should do when it restarts (after an outage, for example) and finds its slot in the Repair state. For more, see below.