KAFKA-8972 (2.4 blocker): bug fix for restoring task #7617
Merged
Hmm, this is interesting: for created and running tasks we call `task.close`, whereas for restoring tasks we only call `task.closeStateManager`. Is that intentional? If so, why? cc @ableegoldman
I asked you exactly that a few months ago :) You referenced some old PR but basically the takeaway was, a restoring task hasn't initialized anything but its state, therefore needs to close only the state manager
It's a little awkward, @abbccdda maybe we should at least put in a comment explaining this for now so we don't have to keep asking each other
Ultimately I've been thinking all these different task types should literally be different types, so that we can just call `task.close` and have it close whatever is needed for a restoring task, or suspended, etc. We could probably do some further cleanup: for example, on close all tasks call `suspend` then `closeSuspended`, but only active & running tasks are ever just suspended -- and we could probably also drop the double checkpointing (see KAFKA-9113, and please feel free to dump any of your own thoughts around safer task management on there).
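To make the "different types" idea concrete, here is a minimal sketch. Every name in it (`SketchTask`, `RestoringSketchTask`, `RunningSketchTask`) is hypothetical and not taken from the Kafka Streams code base, and the order in which a running task releases its resources is an assumption. The only point is that the caller invokes `close()` and each type decides what that means.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface: callers only ever see close().
interface SketchTask {
    void close();
}

// A restoring task has initialized only its state, so it releases only the state manager.
final class RestoringSketchTask implements SketchTask {
    final List<String> released = new ArrayList<>();

    @Override
    public void close() {
        released.add("stateManager");
    }
}

// A running task has initialized both topology and state (release order is an assumption).
final class RunningSketchTask implements SketchTask {
    final List<String> released = new ArrayList<>();

    @Override
    public void close() {
        released.add("topology");
        released.add("stateManager");
    }
}
```

With this shape the call site no longer needs to know which lifecycle state a task is in; it can loop over a mixed collection of `SketchTask` and call `close()` on each.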
I think I agree, but my confusion is why we call close on created tasks. Basically, the weird thing I noticed is that when closing tasks as zombies we call:

1. `close`
2. `closeStateMgr`
3. `close`
4. `closeSuspended`

3 and 4 make sense to me, but 1 and 2 are a bit weird: for created tasks we initialize neither the topology nor the state manager, yet we still call `close`, whereas for restoring tasks (where we have initialized the state manager but not the topology) we call `closeStateMgr`.

I took a closer look at the code base, and I think the reason is that in `closeNonRunningTasks` (which is triggered in `onPartitionsRevoked`) we actually trigger the following:

I.e., we treat it as an "unclean" close, and hence we won't write a checkpoint and, most importantly, when `closeTopology` throws (which it would, since the topology is not initialized at all) we just ignore it. So "accidentally" this becomes correct.

This is not a great pattern, and I like @ableegoldman's suggestion that we should consider making the "close" call more robust in a follow-up PR. For now let's stay with what it is.
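The "accidentally correct" behavior described in this comment can be sketched roughly as follows. This is not the actual Kafka Streams code: `CreatedTaskSketch`, its fields, and the single `clean` flag are hypothetical stand-ins, and it assumes that closing an uninitialized topology throws a runtime exception.

```java
// Hypothetical sketch of a created task whose topology was never initialized.
final class CreatedTaskSketch {
    boolean checkpointWritten = false;
    boolean closed = false;

    // Assumption: closing a topology that was never initialized fails.
    private void closeTopology() {
        throw new IllegalStateException("topology not initialized");
    }

    void close(boolean clean) {
        try {
            closeTopology();
        } catch (RuntimeException e) {
            if (clean) {
                throw e; // a clean close surfaces the failure
            }
            // unclean close: the failure is ignored and closing continues
        }
        if (clean) {
            checkpointWritten = true; // only a clean close writes a checkpoint
        }
        closed = true;
    }
}
```

Calling `close(false)` on such a task completes without a checkpoint and without an exception, which is why the unclean path happens to work for tasks that never initialized a topology.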
It's good to reveal these error-prone close call setups here. My take is that the comment on each function starting with `close` should state clearly what it tries to do. We should also avoid taking more than one boolean parameter, as that's hard to reason about. What about `cleanClose()` and `uncleanClose()`? Also, should we be more specific about the word `zombie`? It seems to be a really vague term too.
I think (and hope) the term `zombie` is not used in an unclear way -- but there might be some confusion in the translation from "clients-speak" to "streams-speak", since what the clients simply call "lost" we call a "zombie task". Another example along those lines is the confusion about what "revoked" means -- sometimes it's the partitions given up temporarily, sometimes only the partitions that are being reassigned and permanently given up.

But clear code can always help with unclear terminology: as an example of some progress, we used to have both "suspended" and "zombie" standbys, neither of which really makes sense for a standby. We cleaned up that code in some recent PRs so that it is, I feel, now quite clear.

@abbccdda I also like the idea of `cleanClose` and `unCleanClose`, where each specific task can make it clear what it needs to do in each case. A nice side effect is that we can clearly document what clean (or unclean) means at the top of each method, rather than trying to interpret every brief parameter description.
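A rough sketch of what that split could look like, under stated assumptions: the interface and class names are hypothetical, the method names follow the `cleanClose()` / `uncleanClose()` spelling proposed earlier in the thread, and the choice that a clean close checkpoints before closing the state manager is an illustration rather than the agreed design.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface replacing close(boolean, boolean)-style signatures.
interface ClosableTaskSketch {
    /** Clean close: write a checkpoint, release resources, propagate any failure. */
    void cleanClose();

    /** Unclean close: no checkpoint, release resources best-effort, never throw. */
    void uncleanClose();
}

// Hypothetical restoring task: only the state manager needs closing.
final class RestoringCloseSketch implements ClosableTaskSketch {
    final List<String> steps = new ArrayList<>();
    private final boolean stateManagerFails; // test hook, purely illustrative

    RestoringCloseSketch(boolean stateManagerFails) {
        this.stateManagerFails = stateManagerFails;
    }

    private void closeStateManager() {
        if (stateManagerFails) {
            throw new IllegalStateException("state manager close failed");
        }
        steps.add("closeStateManager");
    }

    @Override
    public void cleanClose() {
        steps.add("checkpoint");
        closeStateManager(); // failures reach the caller
    }

    @Override
    public void uncleanClose() {
        try {
            closeStateManager();
        } catch (RuntimeException e) {
            steps.add("ignored: " + e.getMessage()); // swallowed on purpose
        }
    }
}
```

Each method carries its own contract in its doc comment, so a reader does not have to decode a combination of boolean flags at the call site.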