[ML] Fix issues upgrading state leading to possible aborts #136
Conversation
@@ -355,6 +357,8 @@ bool upgradeTrendModelToVersion6p3(const core_t::TTime bucketLength,
     return true;
 }

const TSizeSizeMap SC_STATES_UPGRADING_TO_VERSION_6_3{{0, 0}, {1, 1}, {2, 1}, {3, 2}, {4, 3}};
Am I right that this is just a helper to fix the BWC issue? Do we need this on master?
It would be good to document. Maybe we can introduce some comment prefix that indicates possible cleanups at a later stage. If I am right, this can be cleaned up for 7.0.
This code can't be removed in 7.0. It's permissible to upgrade from 6.0 to 7.last as long as you do a full cluster restart.
It could possibly be removed in 8.0, but even then we'd have to do extra work on the Java side to disallow reverting to a model snapshot that dated back to 6.x.
OK, good point. I still think it's worth a follow-up. I assume there will be more reasons to raise the minimum upgradeable version, so independent of the decision to be made in future, a comment can help us find this code when it's time.
Note that the whole block containing this is labelled as relating to updating state.
I agree that it would be nice to remove the extra code we need for upgrading state over time. So far we haven't been disciplined about how we manage this process in this repo. Recently, I've introduced the formalism that we use completely separate code paths for each state version (where we need to upgrade) in the restore functions, which I think is a step in that direction.
I'd propose we do an audit of the code w.r.t. state upgrades (there are other cases than this one) and decide what constraints we need to impose to permit us to gradually remove this code. At that point, I think we should have definitive guidelines on how to implement state upgrades and how we track what can be removed when. I'd prefer to delay adding any further comments in the code until we've done that.
LGTM, although I think adding a comment about what the mappings are doing would make it easier to understand.
@@ -355,6 +357,8 @@ bool upgradeTrendModelToVersion6p3(const core_t::TTime bucketLength,
     return true;
 }

const TSizeSizeMap SC_STATES_UPGRADING_TO_VERSION_6_3{{0, 0}, {1, 1}, {2, 1}, {3, 2}, {4, 3}};
Might be worth adding a comment to say the intention is to map states from {"INITIAL", "SMALL_TEST", "REGULAR_TEST", "NOT_TESTING", "ERROR"} to the best equivalent in {"INITIAL", "TEST", "NOT_TESTING", "ERROR"}.
Yes, good point. I'll add a comment.
LGTM
 for (const auto& machines : m_Machines) {
     if (pos < machines.size()) {
         return machines[pos];
     }
     pos -= machines.size();
 }
-LOG_ABORT(<< "Invalid index '" << pos << "'");
+LOG_ABORT(<< "Invalid index '" << pos_ << "'");
Something for a follow-up? The abort is a drastic decision: as this issue shows, we effectively kill the whole process although only one feature broke. We are more forgiving in other areas, e.g. restore returns a bool.
Yes. In hindsight, this is better dealt with by trapping and reinitialising the offending object (which would have worked around this problem). I'll make that change in a follow up PR.
…detect process (elastic#136) Closes elastic#135.
We weren't properly upgrading aspects of the state from 6.2, which caused us to abort in some cases. Specifically, we should not be restoring the machine identifier, which is initialised correctly but was being restored with the wrong value, and we need to remap states where the set of states has changed.
This needs backporting to 6.3 ASAP. Discussing internally, we prefer not to request any delay to 6.3.1, which contains critical fixes, and instead advise people with real-time ML jobs to delay upgrading to 6.3 until 6.3.2 becomes available.