Crashes on 2.11 (JsonRpcConnection) #7687

Closed
verboEse opened this issue Dec 4, 2019 · 4 comments
Labels:
  • area/distributed: Distributed monitoring (master, satellites, clients)
  • blocker: Blocks a release or needs immediate attention
  • core/crash: Shouldn't happen, requires attention
  • duplicate: This issue or pull request already exists

verboEse commented Dec 4, 2019

Behaviour

From time to time, our Icinga 2 in production (single master) crashes. It always happens at about 3 am, so I assume it has to do with logrotate or another scheduled job.

Crash Log

  Application version: r2.11.2-1

System information:
  Platform: Debian GNU/Linux
  Platform version: 9 (stretch)
  Kernel: Linux
  Kernel version: 4.9.0-11-amd64
  Architecture: x86_64

Build information:
  Compiler: GNU 6.3.0
  Build host: runner-LTrJQZ9N-project-298-concurrent-0

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid
Stacktrace:

        (0) libc.so.6: gsignal (+0xcf) [0x7f2807861fff]
        (1) libc.so.6: abort (+0x16a) [0x7f280786342a]
        (2) libc.so.6: <unknown function> (+0x2be67) [0x7f280785ae67]
        (3) libc.so.6: <unknown function> (+0x2bf12) [0x7f280785af12]
        (4) icinga2: icinga::JsonRpcConnection::HandleAndWriteHeartbeats(boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >) (+0xe20) [0x560f43cdce60]
        (5) icinga2: <unknown function> (+0x436b6d) [0x560f43d67b6d]
        (6) libboost_context.so.1.67.0: make_fcontext (+0x2f) [0x7f280a1f172f]

***
* This would indicate a runtime problem or configuration error. If you believe this is a bug in Icinga 2
* please submit a bug report at https://github.com/Icinga/icinga2 and include this stack trace as well as any other
* information that might be useful in order to reproduce this problem.
***

Failed to launch GDB: No such file or directory

My Environment

  • Enabled features (icinga2 feature list): api checker command ido-mysql influxdb mainlog notification statusdata
  • Icinga Web 2 version and modules (System - About):
    Icinga Web 2 Version: 2.7.3
    Loaded modules:
    beyondthepines | 0.5
    boxydash | 0.0.1
    businessprocess | 2.2.0
    cube | 1.1.0
    deployment | 2.0.0-beta1
    director | master
    doc | 2.7.3
    elasticsearch | 1.0.0
    globe | 1.0.4
    incubator | 0.5.0
    ipl | v0.3.0
    map | 1.1.0
    mapDatatype | 0.1.0
    monitoring | 2.7.3
    nordlicht | 1.0.0
    pdfexport | 0.9.1
    reactbundle | 0.7.0
    reporting | 0.0.0
    setup | 2.7.3
    unicorn | 1.0.2
verboEse (author) commented Dec 4, 2019

I just found #7569; since the log is similar, I assume it's related. We don't see the error that often, though:

-rw-r--r-- 1 nagios adm 1890 Okt 23 03:05 report.1571792755.471094
-rw-r--r-- 1 nagios adm 1890 Okt 28 03:05 report.1572228338.486929
-rw-r--r-- 1 nagios adm 1890 Okt 30 03:04 report.1572401081.731381
-rw-r--r-- 1 nagios adm 1890 Nov  6 03:05 report.1573005940.302770
-rw-r--r-- 1 nagios adm 1890 Nov 13 03:04 report.1573610680.912113
-rw-r--r-- 1 nagios adm 1890 Nov 20 03:05 report.1574215541.607461
-rw-r--r-- 1 nagios adm 1890 Nov 27 03:05 report.1574820346.100210
-rw-r--r-- 1 nagios adm 1890 Dez  2 03:04 report.1575252279.626116
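The numeric part of each report filename appears to be a Unix timestamp (seconds.microseconds), so the "always about 3 am" observation can be checked directly. The Python sketch below converts the timestamps to local time. It assumes the server runs on German time (CET/CEST), inferred from the "+0100" offset in the log excerpt and the German month names in the listing; the 2019 DST switch is hardcoded instead of being read from a time zone database.

```python
from datetime import datetime, timedelta, timezone

# Crash report names from the listing above. The part after "report." appears
# to be a Unix timestamp (seconds.microseconds) of when the report was written.
REPORTS = [
    "report.1571792755.471094",
    "report.1572228338.486929",
    "report.1572401081.731381",
    "report.1573005940.302770",
    "report.1573610680.912113",
    "report.1574215541.607461",
    "report.1574820346.100210",
    "report.1575252279.626116",
]

# Assumption: the host uses German local time. In 2019, CEST (UTC+2) ended on
# 27 October at 01:00 UTC; CET (UTC+1) applied from then on.
CEST = timezone(timedelta(hours=2), "CEST")
CET = timezone(timedelta(hours=1), "CET")
DST_END_2019 = datetime(2019, 10, 27, 1, 0, tzinfo=timezone.utc)


def report_local_time(name: str) -> datetime:
    """Convert a 'report.<seconds>.<microseconds>' name to German local time."""
    _, seconds, _ = name.split(".")
    utc = datetime.fromtimestamp(int(seconds), tz=timezone.utc)
    # Pick the offset that was in effect at that UTC instant.
    return utc.astimezone(CEST if utc < DST_END_2019 else CET)


for report in REPORTS:
    print(report, report_local_time(report).strftime("%Y-%m-%d %H:%M:%S %Z"))
```

Under that assumption, every report falls between 03:04 and 03:06 local time, matching the file modification times in the listing.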

dnsmichi (contributor) commented Dec 4, 2019

@Al2Klimov @lippserd please take this into account as well. Maybe @verboEse is up for testing a fix or can provide more insight into the matter.

@verboEse at the time the crash happens, are any other specific events logged right before it, such as notifications, certificate signing requests, or a logrotate run as you suspected?

verboEse (author) commented Dec 4, 2019

error.log shows:
icinga2: /usr/include/boost/smart_ptr/intrusive_ptr.hpp:199: T* boost::intrusive_ptr<T>::operator->() const [with T = icinga::Endpoint]: Assertion 'px != 0' failed.
The last line in icinga2.log was a received external command check.
On 27 November it was:
[2019-11-27 03:05:46 +0100] warning/ApiListener: Removing API client for endpoint 'a.client.server.name'. 0 API clients left.

So no hints there, I think. Our hourly cron jobs start 16 minutes past the hour and our daily cron jobs at 6 am, so OUR cron (and with it logrotate) is NOT the cause.

However, I found that our hosting provider starts its backup processes at 3 am (Icinga 2 runs on a VM in Hyper-V), and the kernel log repeatedly shows:
  INFO: task jbd2/dm-3-8:389 blocked for more than 120 seconds.
So this is probably at least related to the disk not responding fast enough.

dnsmichi added the area/distributed, core/crash, and blocker labels on Dec 5, 2019
dnsmichi added this to the 2.12.0 milestone on Dec 5, 2019
lippserd (member) commented Dec 6, 2019

Hi,

Thanks for your information so far. We do think that this is related to #7532. I'll close this one as duplicate. If anything further comes up, please do not hesitate to participate in #7532.

Best,
Eric

lippserd closed this as completed on Dec 6, 2019
lippserd removed this from the 2.12.0 milestone on Dec 6, 2019
lippserd added the duplicate label on Dec 6, 2019