-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Splitstore Enhanchements #6474
Splitstore Enhanchements #6474
Conversation
923eaeb
to
98c6530
Compare
rebased on master. |
this is necessary to avoid wearing clown shoes when the node stays offline for an extended period of time (more than 1 finality). Basically it gets quite slow if we do the full 2 finality walk, so we try to avoid it unless necessary. The conditions under which a full walk is necessary is if there is a sync gap (most likely because the node was offline) during which the tracking of writes is inaccurate because we have not yet delivered the HeadChange notification. In this case, it is possible to have actually hot blocks to be tracked before the boundary and fail to mark them accordingly. So when we detect a sync gap, we do the full walk; if there is no sync gap, we can just use the much faster boundary epoch walk.
for maximal safety.
Follow up issue for testing: #6725 |
so that we preclude the following scenario: Start compaction. Start view. Finish compaction. Start compaction. which would not wait for the view to complete.
Because stebalien has allergies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please address my final final review in #6718 before merging. But I'm going to give this a 👍 anyways because the change shouldn't be too difficult.
We can add after Wait is called, which is problematic with WaitGroups. This instead uses a mx/cond combo and waits while the count is > 0. The only downside is that we might needlessly wait for (a bunch) of views that started while the txn is active, but we can live with that.
Summoning @magik6k -- this is ready for you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't spot anything to be nitpicky about, I guess this means we can see if it floats.
quiet = true | ||
log.Warnf("error checking markset: %s", err) | ||
} | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still inconsistent with trackTxnRef.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in what way?
Are you concerned about an error not causing the transaction to abort?
Currently the only way the markset errors is if it has been closed, ie the transaction has been aborted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In trackTxnRef, we track the ref even when we error. Here, we skip when we error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, yes you are right; let me fix it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in 7785467
} | ||
s.txnViewsWaiting = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, we're waiting for all the views to end, but then we're not doing anything atomically. That can't be right.
Should we be waiting for all the non-tracked views to end? I.e., should we have a return at https://github.com/filecoin-project/lotus/pull/6474/files#diff-eac9e730a0594047de6e81aa421dcd33aac194e505a18945ca45e72db789687cR642?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wait, what? we have already started the transaction.
All the views are tracked now as they always increment the rlock.
The bariier just ensures that there is no view that started before the transaction that hasn't ended.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. This should be fine.
Ideally, we'd track views started after the transaction starts separately, but that's not strictly necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah.
To summarize our sync conversation: a long running view is a bug and i'd very much rather hang the compaction (which is something we'll see in the logs) than do a potentially catastrophic delete.
This PR enhances the logic in splitstore:
Has
.Has
. If the object was not marked as rechable and the access happened before purge, then it would result in a miss. We fix this by making the compaction transactional, in that during compaction accesses to objects are recorded so as to not purge live objects.Has
as an implicit (recursive) write to account for vm behaviour on Copy.Follow up: