Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Fix issues upgrading state leading to possible aborts #136

Merged
merged 3 commits into from
Jul 2, 2018

Conversation

tveasey
Copy link
Contributor

@tveasey tveasey commented Jul 2, 2018

We weren't properly upgrading aspects of the state from 6.2 which triggered us to abort in some cases. Specifically, we should not be restoring the machine identifier, which is initialised correctly and restored with the wrong value and we need to update states where the set has changed.

This needs back porting to 6.3 asap. Discussing internally we prefer not to request any delay to 6.3.1 which contains critical fixes and instead advise people to delay updating to 6.3 if they have realtime ML jobs until 6.3.2 becomes available.

Closes #135.

@tveasey tveasey force-pushed the bug/state-upgrade branch from e36f3b7 to a341300 Compare July 2, 2018 11:08
@@ -355,6 +357,8 @@ bool upgradeTrendModelToVersion6p3(const core_t::TTime bucketLength,
return true;
}

const TSizeSizeMap SC_STATES_UPGRADING_TO_VERSION_6_3{{0, 0}, {1, 1}, {2, 1}, {3, 2}, {4, 3}};

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

am I right that this is just a helper to fix the BWC issue? Do we need this on master?

It would be good to document. Maybe we can introduce some comment prefix that indicates possible cleanups at a later stage. If I am right, this can be cleaned up for 7.0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code can't be removed in 7.0. It's permissible to upgrade from 6.0 to 7.last as long as you do a full cluster restart.

It could possibly be removed in 8.0, but even then we'd have to do extra work on the Java side to disallow reverting to a model snapshot that dated back to 6.x.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, good point. I think it's still worth a follow up. I assume there will be more reasons to cut the minimum upgradeable version. So independent of the decision to be made in future, a comment can help to find it when it's time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the whole block containing this is labelled as relating to updating state.

I agree that it would be nice to remove the extra code we need for upgrading state over time. So far we haven't been disciplined about how we manage this process in this repo. Recently, I've introduced the formalism that we use completely separate code paths for each state version (where we need to upgrade) in the restore functions, which I think is a step in that direction.

I'd propose we do an audit of the code w.r.t. state update (there are other cases than this) and decide what constraints we need to impose to permit us to gradually remove this code. At that point, I think we should have a definite guidelines on how to implement state updates and how we track what can be removed when. I'd prefer to delay any further comments in the code until we've done that.

Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, although I think adding a comment about what the mappings are doing would make it easier to understand.

@@ -355,6 +357,8 @@ bool upgradeTrendModelToVersion6p3(const core_t::TTime bucketLength,
return true;
}

const TSizeSizeMap SC_STATES_UPGRADING_TO_VERSION_6_3{{0, 0}, {1, 1}, {2, 1}, {3, 2}, {4, 3}};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth adding a comment to say the intention is to map states from {"INITIAL", "SMALL_TEST", "REGULAR_TEST", "NOT_TESTING", "ERROR"} to the best equivalent in {"INITIAL", "TEST", "NOT_TESTING", "ERROR"}.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good point. I'll add a comment.

Copy link

@hendrikmuhs hendrikmuhs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

for (const auto& machines : m_Machines) {
if (pos < machines.size()) {
return machines[pos];
}
pos -= machines.size();
}
LOG_ABORT(<< "Invalid index '" << pos << "'");
LOG_ABORT(<< "Invalid index '" << pos_ << "'");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something for a follow up? The abort is a drastic decision as in this issue we effectively kill the process although only a feature broke. We are more forgiving in other areas, e.g. restore returns a bool.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. In hindsight, this is better dealt with by trapping and reinitialising the offending object (which would have worked around this problem). I'll make that change in a follow up PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ML] Autodetect process crashes when trying to forecast on job upgraded from 6.2.4 to 6.3.1
3 participants