[ML] Fix issues upgrading state leading to possible aborts #136

tveasey · 2018-07-02T10:50:44Z

We weren't properly upgrading aspects of the state from 6.2, which caused us to abort in some cases. Specifically, we should not restore the machine identifier, which is initialised correctly and then overwritten with the wrong value on restore, and we need to remap states where the set of states has changed.

This needs backporting to 6.3 as soon as possible. Having discussed this internally, we prefer not to request any delay to 6.3.1, which contains critical fixes. Instead we will advise people with real-time ML jobs to delay upgrading to 6.3 until 6.3.2 becomes available.

Closes #135.

hendrikmuhs · 2018-07-02T11:23:05Z

lib/maths/CTimeSeriesDecompositionDetail.cc

@@ -355,6 +357,8 @@ bool upgradeTrendModelToVersion6p3(const core_t::TTime bucketLength,
    return true;
 }

+const TSizeSizeMap SC_STATES_UPGRADING_TO_VERSION_6_3{{0, 0}, {1, 1}, {2, 1}, {3, 2}, {4, 3}};
+


Am I right that this is just a helper to fix the BWC issue? Do we need this on master?

It would be good to document this. Maybe we can introduce a comment prefix that marks code for possible cleanup at a later stage. If I am right, this can be cleaned up for 7.0.

This code can't be removed in 7.0. It's permissible to upgrade from 6.0 to 7.last as long as you do a full cluster restart.

It could possibly be removed in 8.0, but even then we'd have to do extra work on the Java side to disallow reverting to a model snapshot that dated back to 6.x.

OK, good point. I think it's still worth a follow-up. I assume there will be more reasons to raise the minimum upgradeable version. So, independent of whatever decision is made in future, a comment can help us find this code when the time comes.

Note that the whole block containing this is labelled as relating to updating state.

I agree that it would be nice to remove the extra code we need for upgrading state over time. So far we haven't been disciplined about how we manage this process in this repo. Recently, I've introduced the formalism that we use completely separate code paths for each state version (where we need to upgrade) in the restore functions, which I think is a step in that direction.

I'd propose we do an audit of the code w.r.t. state update (there are other cases than this) and decide what constraints we need to impose to permit us to gradually remove this code. At that point, I think we should have definite guidelines on how to implement state updates and how we track what can be removed when. I'd prefer to delay any further comments in the code until we've done that.
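
For readers following the thread, the idea of a separate code path per state version in the restore functions can be sketched roughly as below. This is an illustration only, not the ml-cpp implementation: `SPersistedState`, `SModel`, the version strings and all three function names are hypothetical, and the remapping table simply reuses the values from the diff above.

```cpp
#include <cstddef>
#include <string>

// Hypothetical persisted form: a version tag plus a state index.
struct SPersistedState {
    std::string s_Version;
    std::size_t s_State;
};

// Hypothetical in-memory model with four states (indices 0 to 3).
struct SModel {
    std::size_t s_State = 0;
};

// Path for state tagged as version 6.3: indices are used as they are.
bool restoreVersion6p3(const SPersistedState& persisted, SModel& model) {
    if (persisted.s_State > 3) {
        return false;
    }
    model.s_State = persisted.s_State;
    return true;
}

// Path for older state: five states (0 to 4) are remapped onto four.
bool upgradeToVersion6p3(const SPersistedState& persisted, SModel& model) {
    static const std::size_t UPGRADED[]{0, 1, 1, 2, 3};
    if (persisted.s_State > 4) {
        return false;
    }
    model.s_State = UPGRADED[persisted.s_State];
    return true;
}

// The version check happens once, at the top of restore, so the two
// paths never share conditional logic.
bool restore(const SPersistedState& persisted, SModel& model) {
    if (persisted.s_Version == "6.3") {
        return restoreVersion6p3(persisted, model);
    }
    return upgradeToVersion6p3(persisted, model);
}
```

Keeping the upgrade logic in its own function means that, if the minimum upgradeable version is ever raised, the old path can be deleted without touching the current one.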

droberts195

LGTM, although I think adding a comment about what the mappings are doing would make it easier to understand.

droberts195 · 2018-07-02T11:35:08Z

lib/maths/CTimeSeriesDecompositionDetail.cc

@@ -355,6 +357,8 @@ bool upgradeTrendModelToVersion6p3(const core_t::TTime bucketLength,
    return true;
 }

+const TSizeSizeMap SC_STATES_UPGRADING_TO_VERSION_6_3{{0, 0}, {1, 1}, {2, 1}, {3, 2}, {4, 3}};


Might be worth adding a comment to say the intention is to map states from {"INITIAL", "SMALL_TEST", "REGULAR_TEST", "NOT_TESTING", "ERROR"} to the best equivalent in {"INITIAL", "TEST", "NOT_TESTING", "ERROR"}.

Yes, good point. I'll add a comment.
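
A commented version of the mapping might look something like the sketch below. The state names are those given in the review comment above. The `TSizeSizeMap` alias is assumed here to be a `std::map` (the real typedef may differ), and `upgradeState` is a hypothetical helper added only to show how the map would be consulted.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Assumption: the real TSizeSizeMap typedef may use a different container.
using TSizeSizeMap = std::map<std::size_t, std::size_t>;

// State names before 6.3 and from 6.3 onwards, as listed in the review.
const std::vector<std::string> OLD_STATES{"INITIAL", "SMALL_TEST", "REGULAR_TEST",
                                          "NOT_TESTING", "ERROR"};
const std::vector<std::string> NEW_STATES{"INITIAL", "TEST", "NOT_TESTING", "ERROR"};

// Maps each pre-6.3 state index to its best equivalent in 6.3:
//   INITIAL      (0) -> INITIAL     (0)
//   SMALL_TEST   (1) -> TEST        (1)
//   REGULAR_TEST (2) -> TEST        (1)
//   NOT_TESTING  (3) -> NOT_TESTING (2)
//   ERROR        (4) -> ERROR       (3)
const TSizeSizeMap SC_STATES_UPGRADING_TO_VERSION_6_3{{0, 0}, {1, 1}, {2, 1}, {3, 2}, {4, 3}};

// Hypothetical helper: returns the upgraded state index, or the supplied
// fallback if the old index is not in the map.
std::size_t upgradeState(std::size_t oldState, std::size_t fallback) {
    auto itr = SC_STATES_UPGRADING_TO_VERSION_6_3.find(oldState);
    return itr != SC_STATES_UPGRADING_TO_VERSION_6_3.end() ? itr->second : fallback;
}
```

Both old test states collapse onto the single new `TEST` state, which is why the map is not one-to-one.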

hendrikmuhs

LGTM

hendrikmuhs · 2018-07-02T11:43:57Z

lib/core/CStateMachine.cc

    for (const auto& machines : m_Machines) {
        if (pos < machines.size()) {
            return machines[pos];
        }
        pos -= machines.size();
    }
-    LOG_ABORT(<< "Invalid index '" << pos << "'");
+    LOG_ABORT(<< "Invalid index '" << pos_ << "'");


Something for a follow-up? Aborting is a drastic choice: in this issue we effectively kill the process although only one feature broke. We are more forgiving in other areas, e.g. restore returns a bool.

Yes. In hindsight, this is better dealt with by trapping and reinitialising the offending object (which would have worked around this problem). I'll make that change in a follow up PR.
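
As a rough illustration of the "trap and reinitialise" approach discussed here, the sketch below rejects an out-of-range persisted state and resets the object instead of aborting. `CSimpleStateMachine` and `restoreOrReinitialise` are hypothetical, simplified stand-ins and do not reflect the actual `CStateMachine` interface or the follow-up PR.

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for a state machine; not the ml-cpp class.
class CSimpleStateMachine {
public:
    explicit CSimpleStateMachine(std::vector<std::string> states)
        : m_States(std::move(states)) {}

    // Returns false, rather than aborting, if the persisted state is out of range.
    bool restore(std::size_t persistedState) {
        if (persistedState >= m_States.size()) {
            return false;
        }
        m_State = persistedState;
        return true;
    }

    // Puts the machine back into its initial state.
    void reset() { m_State = 0; }

    const std::string& state() const { return m_States[m_State]; }

private:
    std::vector<std::string> m_States;
    std::size_t m_State = 0;
};

// The caller traps the failure, logs it and reinitialises the offending
// object, so only the affected feature loses its state.
bool restoreOrReinitialise(CSimpleStateMachine& machine, std::size_t persistedState) {
    if (machine.restore(persistedState)) {
        return true;
    }
    std::cerr << "Invalid persisted state '" << persistedState << "', reinitialising\n";
    machine.reset();
    return false;
}
```

The boolean return lets the caller decide whether a reinitialised object is acceptable or whether the failure should propagate further.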

tveasey added >bug v7.0.0 :ml v6.4.0 labels Jul 2, 2018

tveasey requested a review from droberts195 July 2, 2018 10:50

Fix issues upgrading state leading to SIGSEGV

a341300

tveasey force-pushed the bug/state-upgrade branch from e36f3b7 to a341300 Compare July 2, 2018 11:08

Documentation

60706a7

droberts195 added the v6.3.2 label Jul 2, 2018

hendrikmuhs reviewed Jul 2, 2018

droberts195 approved these changes Jul 2, 2018

hendrikmuhs approved these changes Jul 2, 2018

hendrikmuhs reviewed Jul 2, 2018

Review comments

64da3c0

tveasey merged commit 06a0f99 into elastic:master Jul 2, 2018

tveasey added a commit to tveasey/ml-cpp-1 that referenced this pull request Jul 3, 2018

[ML] Fix issues upgrading state leading to possible abort of the auto…

d41c335

…detect process (elastic#136) Closes elastic#135.

droberts195 mentioned this pull request Jul 3, 2018

[ML] Autodetect process crashes when trying to forecast on job upgraded from 6.2.4 to 6.3.1 #135

Closed

tveasey mentioned this pull request Jul 3, 2018

[6.4][ML] Fix issues upgrading state leading to possible abort of the autodetect process #139

Merged

tveasey added a commit that referenced this pull request Jul 3, 2018

[6.4][ML] Fix issues upgrading state leading to possible abort of the…

4008a56

… autodetect process (#139) Backport #136.

tveasey added a commit to tveasey/ml-cpp-1 that referenced this pull request Jul 3, 2018

[ML] Fix issues upgrading state leading to possible abort of the auto…

0fa69d6

…detect process (elastic#136)

tveasey mentioned this pull request Jul 3, 2018

[6.3][ML] Fix issues upgrading state leading to possible abort of the autodetect process #140

Merged

tveasey added a commit that referenced this pull request Jul 3, 2018

[ML] Fix issues upgrading state leading to possible abort of the auto…

a6b0329

…detect process (#140) Backport #136.

tveasey deleted the bug/state-upgrade branch April 10, 2019 10:48
