
[ML] Autodetect process crashes when trying to forecast on job upgraded from 6.2.4 to 6.3.1 #135

Closed
dolaru opened this issue Jun 29, 2018 · 4 comments · Fixed by #136

@dolaru
Member

dolaru commented Jun 29, 2018

Found in 6.3.1. Can be reproduced in 6.3.0.

The issue can be reproduced by running a job using the high_count function on the dns_tunneling dataset with a 30m bucket span.

While running a 6.3.1 ES instance that was upgraded from version 6.2.4, if the user tries to forecast on the dns_tunneling job that was created in version 6.2.4, the autodetect process will crash as soon as the forecast request is sent.

ES logs show:

[2018-06-29T16:08:06,742][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [ml-2] Opening job [dns_tunneling1_20180629-1542_624_ga]
[2018-06-29T16:08:06,764][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [ml-2] [dns_tunneling1_20180629-1542_624_ga] Loading model snapshot [1530283391] with latest_record_timestamp [2016-02-11T23:52:14.000Z], job latest_record_timestamp [2016-02-11T23:52:14.000Z]
[2018-06-29T16:08:06,773][INFO ][o.e.x.m.j.p.a.NativeAutodetectProcessFactory] Restoring quantiles for job 'dns_tunneling1_20180629-1542_624_ga'
[2018-06-29T16:08:07,032][INFO ][o.e.x.m.j.p.l.CppLogMessageHandler] [dns_tunneling1_20180629-1542_624_ga] [autodetect/28425] [CResourceMonitor.cc@67] Setting model memory limit to 1024 MB
[2018-06-29T16:08:07,072][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [ml-2] Successfully set job state to [opened] for job [dns_tunneling1_20180629-1542_624_ga]
[2018-06-29T16:08:07,114][INFO ][o.e.x.m.j.p.l.CppLogMessageHandler] [dns_tunneling1_20180629-1542_624_ga] [autodetect/28425] [CAnomalyJob.cc@849] Processing is already complete to time 1455233400
[2018-06-29T16:08:07,126][INFO ][o.e.x.m.j.p.l.CppLogMessageHandler] [dns_tunneling1_20180629-1542_624_ga] [autodetect/28425] [CForecastRunner.cc@113] Start forecasting from 2016-02-11T23:30:00+0000 to 2016-02-25T23:30:00+0000
[2018-06-29T16:08:07,127][ERROR][o.e.x.m.j.p.l.CppLogMessageHandler] [dns_tunneling1_20180629-1542_624_ga] [autodetect/28425] [CTrendComponent.cc@347] Failed calculating confidence interval: Error in function boost::math::normal_distribution<double>::normal_distribution: Scale parameter is 0, but must be > 0 !, variance = 0, confidence = 95
[2018-06-29T16:08:07,190][ERROR][o.e.x.m.j.p.l.CppLogMessageHandler] [dns_tunneling1_20180629-1542_624_ga] [autodetect/28425] [CTrendComponent.cc@347] Failed calculating confidence interval: Error in function boost::math::normal_distribution<double>::normal_distribution: Scale parameter is 0, but must be > 0 !, variance = 0, confidence = 95 | repeated [671]
[2018-06-29T16:08:07,190][FATAL][o.e.x.m.j.p.l.CppLogMessageHandler] [dns_tunneling1_20180629-1542_624_ga] [autodetect/28425] [CStateMachine.cc@211] Invalid index '1'
[2018-06-29T16:08:08,044][INFO ][o.e.x.m.j.p.a.NativeAutodetectProcess] [dns_tunneling1_20180629-1542_624_ga] State output finished
[2018-06-29T16:08:08,052][ERROR][o.e.x.m.j.p.a.NativeAutodetectProcess] [dns_tunneling1_20180629-1542_624_ga] autodetect process stopped unexpectedly: Failed calculating confidence interval: Error in function boost::math::normal_distribution<double>::normal_distribution: Scale parameter is 0, but must be > 0 !, variance = 0, confidence = 95
Invalid index '1'
Fatal error: 'terminate called after throwing an instance of 'std::runtime_error''
Fatal error: '  what():  Ml Fatal Exception'
Fatal error: 'si_signo 6, si_code: -6, si_errno: 0, address: 0x7ff1c2ab0428, library: /lib/x86_64-linux-gnu/libc.so.6, base: 0x7ff1c2a7b000, normalized address: 0x35428'

[2018-06-29T16:08:08,052][WARN ][o.e.x.m.j.p.a.o.AutoDetectResultProcessor] [dns_tunneling1_20180629-1542_624_ga] some results not processed due to the termination of autodetect
[2018-06-29T16:08:08,087][ERROR][o.e.x.m.j.p.l.CppLogMessageHandler] [controller/28259] [CDetachedProcessSpawner.cc@184] Child process with PID 28425 was terminated by signal 6
[2018-06-29T16:08:08,094][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [ml-2] Successfully set job state to [failed] for job [dns_tunneling1_20180629-1542_624_ga]
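The repeated "Scale parameter is 0" errors in the log above come from constructing a `boost::math::normal_distribution` with a zero variance, which Boost rejects. A minimal sketch of the kind of guard that avoids this (assuming a fixed 95% two-sided interval; this is illustrative only, not the actual ml-cpp fix):

```cpp
#include <cmath>
#include <utility>

// Sketch only: boost::math::normal_distribution throws when its scale
// (standard deviation) parameter is 0, so check for a degenerate variance
// before building the distribution. Assumes z = 1.96 for a 95% interval.
std::pair<double, double> confidenceInterval95(double mean, double variance) {
    if (variance <= 0.0) {
        // Zero spread: collapse the interval to the mean instead of throwing.
        return {mean, mean};
    }
    double half = 1.96 * std::sqrt(variance);
    return {mean - half, mean + half};
}
```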

Steps to reproduce

  1. Start a version 6.2.4 ES instance and run a job using the high_count function on the dns_tunneling dataset with a 30m bucket span.
  2. Upgrade the instance to version 6.3.1.
  3. Try to run a forecast on the dns_tunneling job that was created in version 6.2.4.
  4. Notice that the autodetect process has crashed and the job state is now failed.
@dolaru
Member Author

dolaru commented Jul 3, 2018

I found that if the job is created in 5.6.10 and then upgraded straight to 6.3.1, the job fails to open.

ES log lines of the operation: 5610_job_in_631.log

tveasey added a commit to tveasey/ml-cpp-1 that referenced this issue Jul 3, 2018
@droberts195
Contributor

droberts195 commented Jul 3, 2018

The 5.6.10 upgrade is crashing in:

        for (std::size_t i = 0u; i < regressions.size(); ++i) {
            m_Buckets.emplace_back(regressions[i], variances[i], initialTime,
                                   lastUpdates[i]);
        }

So variances or lastUpdates had fewer entries than regressions.

In fact, lastUpdates didn't even exist in 5.6, so will always be empty when upgrading from 5.6 to 6.3.
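One way to make a restore like the snippet above tolerate 5.6 state is to default the missing entries before the loop indexes them. A minimal sketch, with hypothetical names mirroring the quoted code rather than the actual ml-cpp fix:

```cpp
#include <vector>

// Hypothetical structure mirroring the quoted restore loop.
struct Bucket {
    double regression;
    double variance;
    double initialTime;
    double lastUpdate;
};

// Sketch of a defensive restore: state written by 5.6 contains no
// lastUpdates at all, so pad the vector to the size of regressions
// before the loop reads lastUpdates[i].
std::vector<Bucket> restoreBuckets(const std::vector<double>& regressions,
                                   const std::vector<double>& variances,
                                   std::vector<double> lastUpdates,
                                   double initialTime) {
    lastUpdates.resize(regressions.size(), initialTime);
    std::vector<Bucket> buckets;
    for (std::size_t i = 0; i < regressions.size(); ++i) {
        buckets.push_back({regressions[i], variances[i], initialTime, lastUpdates[i]});
    }
    return buckets;
}
```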

This won't be fixed by #136. It requires a different fix.

@droberts195 reopened this Jul 3, 2018
@tveasey
Contributor

tveasey commented Jul 3, 2018

I think I'd prefer to address this under a separate issue. It affects a different upgrade path, i.e. jumping straight from 5.6 to 6.3; it works OK when upgrading via an intermediate version (so a workaround exists); and it is in a different area of the code.

@tveasey
Contributor

tveasey commented Jul 3, 2018

As per my last comment, I'm going to keep this issue closed. @dolaru, can you open a new issue for the 5.6.10 upgrade crash described above?
