Segfault in HA-Master setup since 2.11 #7624

Closed
BastiBr opened this issue Nov 13, 2019 · 10 comments
Labels
core/crash Shouldn't happen, requires attention

Comments


BastiBr commented Nov 13, 2019

Describe the bug

Hi all,

since we upgraded our two masters to 2.11, we have experienced segfaults on them a few times.
The problem has occurred several times now; each time the icinga2 process is killed by the kernel.

Syslog output:

kernel: [688664.077808] request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module net-pf-10, throttling...
kernel: [735013.899472] show_signal_msg: 1 callbacks suppressed
kernel: [735013.899475] icinga2[12965]: segfault at 7fc0bcf7f8e0 ip 00007fc0bcf7f8e0 sp 00007fc1254625b8 error 15
systemd[1]: icinga2.service: Main process exited, code=exited, status=139/n/a

To Reproduce

Occurs randomly; we have not found a reliable way to reproduce it.

Expected behavior

Running without being killed

Your Environment

  • Version used (icinga2 --version): (version: r2.11.2-1)

  • Operating System and version: Ubuntu 18.04.3 LTS (Bionic Beaver) / Kernel version: 4.15.0-66-generic

  • Enabled features (icinga2 feature list):
    Enabled features: api checker command ido-mysql mainlog notification

  • Icinga Web 2 version and modules (System - About):
    Icinga Web 2 Version: 2.7.3

businessprocess | 2.2.0
cube | 1.1.0
director | 1.7.1
doc | 2.7.3
elasticsearch | 1.0.0
grafana | 1.3.5
incubator | 0.5.0
ipl | v0.3.0
monitoring | 2.7.3
reactbundle | 0.7.0
x509 | 1.0.0

  • Config validation (icinga2 daemon -C):
[2019-11-13 13:14:27 +0100] information/cli: Icinga application loader (version: r2.11.2-1)
[2019-11-13 13:14:27 +0100] information/cli: Loading configuration file(s).
[2019-11-13 13:14:28 +0100] information/ConfigItem: Committing config item(s).
[2019-11-13 13:14:28 +0100] information/ApiListener: My API identity: op-icn2mas-p102.domain.int
[2019-11-13 13:14:34 +0100] information/ConfigItem: Instantiated 1 FileLogger.
[2019-11-13 13:14:34 +0100] information/ConfigItem: Instantiated 17682 Dependencies.
[2019-11-13 13:14:34 +0100] information/ConfigItem: Instantiated 4 NotificationCommands.
[2019-11-13 13:14:34 +0100] information/ConfigItem: Instantiated 1 NotificationComponent.
[2019-11-13 13:14:34 +0100] information/ConfigItem: Instantiated 23170 Notifications.
[2019-11-13 13:14:34 +0100] information/ConfigItem: Instantiated 1 IcingaApplication.
[2019-11-13 13:14:34 +0100] information/ConfigItem: Instantiated 35 HostGroups.
[2019-11-13 13:14:34 +0100] information/ConfigItem: Instantiated 1946 Hosts.
[2019-11-13 13:14:34 +0100] information/ConfigItem: Instantiated 2 EventCommands.
[2019-11-13 13:14:34 +0100] information/ConfigItem: Instantiated 103 Downtimes.
[2019-11-13 13:14:34 +0100] information/ConfigItem: Instantiated 14 Comments.
[2019-11-13 13:14:34 +0100] information/ConfigItem: Instantiated 1 CheckerComponent.
[2019-11-13 13:14:34 +0100] information/ConfigItem: Instantiated 1319 Zones.
[2019-11-13 13:14:34 +0100] information/ConfigItem: Instantiated 1321 Endpoints.
[2019-11-13 13:14:34 +0100] information/ConfigItem: Instantiated 1 ExternalCommandListener.
[2019-11-13 13:14:34 +0100] information/ConfigItem: Instantiated 7 ApiUsers.
[2019-11-13 13:14:34 +0100] information/ConfigItem: Instantiated 15 UserGroups.
[2019-11-13 13:14:34 +0100] information/ConfigItem: Instantiated 1 ApiListener.
[2019-11-13 13:14:34 +0100] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2019-11-13 13:14:34 +0100] information/ConfigItem: Instantiated 291 CheckCommands.
[2019-11-13 13:14:34 +0100] information/ConfigItem: Instantiated 9 TimePeriods.
[2019-11-13 13:14:34 +0100] information/ConfigItem: Instantiated 446 Users.
[2019-11-13 13:14:34 +0100] information/ConfigItem: Instantiated 19552 Services.
[2019-11-13 13:14:35 +0100] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2019-11-13 13:14:35 +0100] information/cli: Finished validating the configuration file(s).
  • If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes.
object Endpoint "op-icn2mas-p101.domain.int" {
        host = "IP"
}

object Endpoint "op-icn2mas-p102.domain.int" {
}

object Zone "master-zone" {
        endpoints = [ "op-icn2mas-p101.domain.int", "op-icn2mas-p102.domain.int" ]
}

object Zone "global-templates" {
        global = true
}

object Zone "director-global" {
        global = true
}

object Endpoint "op-icn2pci-p101.domain.int" {
        host = "IP"
}

object Endpoint "op-icn2pci-p102.domain.int" {
        host = "IP"
}

object Zone "pci-zone" {
        parent = "master-zone"
        endpoints = [ "op-icn2pci-p101.domain.int", "op-icn2pci-p102.domain.int" ]
}

/* EXTERNAL ZONE */
object Endpoint "op-icn2ext-p101.domain.int" {
        host = "IP"
}

object Endpoint "op-icn2ext-p102.domain.int" {
        host = "IP"
}

object Zone "ext-zone" {
        parent = "master-zone"
        endpoints = [ "op-icn2ext-p101.domain.int", "op-icn2ext-p102.domain.int" ]
}

/* INTERNAL ZONE */
object Endpoint "op-icn2css-p101.domain.int" {
        host = "IP"
}

object Endpoint "op-icn2css-p102.domain.int" {
        host = "IP"
}

object Zone "css-zone" {
        parent = "master-zone"
        endpoints = [ "op-icn2css-p101.domain.int", "op-icn2css-p102.domain.int" ]
}

Additional context

If I can help or provide more information, please let me know.
Thank you.

Best regards,
Basti

@dnsmichi dnsmichi added the core/crash Shouldn't happen, requires attention label Nov 14, 2019

BastiBr commented Nov 15, 2019

Another one on the other master. Directly after a director deployment:

kernel: [877041.929207] request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module net-pf-10, throttling...
kernel: [937452.545282] show_signal_msg: 1 callbacks suppressed
kernel: [937452.545285] icinga2[6688]: segfault at 646e616d6d6f ip 00007f06d9910207 sp 00007f064c94d0b0 error 4 in libc-2.27.so[7f06d9879000+1e7000]

dnsmichi (Contributor) commented:

Can you enable core dumps and generate one following these instructions, please?
https://icinga.com/docs/icinga2/latest/doc/21-development/#core-dump
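
For reference, a minimal sketch of what the linked instructions typically boil down to on a systemd-managed host (paths and pattern here are examples only; the documentation above is authoritative):

# 1) Let the icinga2 unit write core files: systemctl edit icinga2, then add
#      [Service]
#      LimitCORE=infinity
# 2) Point the kernel at a writable dump location (example path/pattern):
install -d -m 0755 /var/lib/cores
sysctl -w kernel.core_pattern='/var/lib/cores/core.%e.%p.%t'
# 3) Apply the changes:
systemctl daemon-reload
systemctl restart icinga2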

@dnsmichi dnsmichi added the needs feedback We'll only proceed once we hear from you again label Nov 15, 2019

BastiBr commented Nov 15, 2019

Of course. I will post the dump after the next crash.


lippserd commented Dec 6, 2019

Hi,

This might be related to #7532

@BastiBr Do you still face this issue and could provide a dump?

Best,
Eric


BastiBr commented Dec 9, 2019

Hi,

unfortunately we haven't had the same error again.
But in the meantime we have experienced exactly the same errors as mentioned in #7532 (often right after a deployment).

Unfortunately, these additional log entries are all we found.

icinga.log

[2019-12-06 15:18:47 +0100] warning/Process: PID 18058 ('/usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2' '--no-stack-rlimit' 'daemon' '--close-stdio' '-e' '/var/log/icinga2/error.log' '--validate' '--define' 'System.ZonesStageVarDir=/var/lib/icinga2/api/zones-stage/') died mysteriously: waitpid failed
...
[2019-12-06 15:18:54 +0100] critical/Process: Fork failed with error code 0 (Success)

Crash Report:

Application version: r2.11.2-1

System information:
  Platform: Ubuntu
  Platform version: 18.04.3 LTS (Bionic Beaver)
  Kernel: Linux
  Kernel version: 4.15.0-66-generic
  Architecture: x86_64

Build information:
  Compiler: GNU 8.3.0
  Build host: runner-LTrJQZ9N-project-298-concurrent-0

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid
Stacktrace:

        (0) libc.so.6: gsignal (+0xc7) [0x7f8553b73e97]
        (1) libc.so.6: abort (+0x141) [0x7f8553b75801]
        (2) libc.so.6: <unknown function> (+0x89897) [0x7f8553bbe897]
        (3) libc.so.6: <unknown function> (+0x9090a) [0x7f8553bc590a]
        (4) libc.so.6: cfree (+0x4dc) [0x7f8553bcce2c]
        (5) icinga2: <unknown function> (+0x6d9131) [0x555a216e8131]
        (6) icinga2: icinga::JsonRpcConnection::SendMessageInternal(boost::intrusive_ptr<icinga::Dictionary> const&) (+0x4f) [0x555a2161e6cf]
        (7) icinga2: <unknown function> (+0x5f6dec) [0x555a21605dec]
        (8) icinga2: boost::asio::detail::strand_service::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) (+0x75) [0x555a21662ee5]
        (9) icinga2: <unknown function> (+0x7580db) [0x555a217670db]
        (10) icinga2: icinga::IoEngine::RunEventLoop() (+0x5e) [0x555a2175bb7e]
        (11) libstdc++.so.6: <unknown function> (+0xbd66f) [0x7f8551a4e66f]
        (12) libpthread.so.0: <unknown function> (+0x76db) [0x7f85529346db]
        (13) libc.so.6: clone (+0x3f) [0x7f8553c5688f]

***
* This would indicate a runtime problem or configuration error. If you believe this is a bug in Icinga 2
* please submit a bug report at https://github.com/Icinga/icinga2 and include this stack trace as well as any other
* information that might be useful in order to reproduce this problem.
***

Failed to launch GDB: No such file or director

Br,
Basti


BastiBr commented Dec 9, 2019

Well, just now, another error after a director deployment, but no segfault this time:

Dec  9 10:19:07 icn2mas01 kernel: [ 1271.380426] do_general_protection: 1 callbacks suppressed
Dec  9 10:19:07 icn2mas01 kernel: [ 1271.380444] traps: icinga2[29083] general protection ip:55facd2f563e sp:7f1d3f8369b8 error:0 in icinga2[55faccc18000+a71000]
Dec  9 10:19:07 icn2mas01 systemd[1]: icinga2.service: Main process exited, code=exited, status=139/n/a

No dump written under /var/lib/cores. Should a dump have been written?

Br,
Basti

@dnsmichi dnsmichi pinned this issue Dec 13, 2019
@dnsmichi dnsmichi unpinned this issue Dec 13, 2019
lippserd (Member) commented:

@BastiBr Since you get this error right after a director deployment, could you please try to remove the stack size limit via systemd as mentioned here: #7532 (comment)? Also, if this does not help, it would be great if you could test our new snapshot packages.
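
A sketch of what such a systemd override could look like (LimitSTACK=infinity is an assumption of what "remove the stack size limit" would mean here; the linked #7532 comment has the exact directive):

# creates the same drop-in that 'systemctl edit icinga2' would open
mkdir -p /etc/systemd/system/icinga2.service.d
printf '[Service]\nLimitSTACK=infinity\n' > /etc/systemd/system/icinga2.service.d/override.conf
systemctl daemon-reload
systemctl restart icinga2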

Your error looks like a general protection fault and should have written a core dump. @dnsmichi Am I right here?


BastiBr commented Dec 16, 2019

I changed the systemd stack size limit option, but it seems that the parameter was already active before the change:

/usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2 daemon --no-stack-rlimit --close-stdio -e /var/log/icinga2/error.log

I will try to test the snapshot build in our staging environment, but that environment is very small compared to production.
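
For reference, a quick way to check which limits are actually in effect (a sketch, assuming the unit is named icinga2 and managed by systemd):

# limits systemd applies to the unit
systemctl show icinga2 -p LimitSTACK -p LimitCORE
# limits of the running main process
cat /proc/$(systemctl show icinga2 -p MainPID --value)/limits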

dnsmichi (Contributor) commented:

> Your error looks like a general protection fault and should have written a core dump. @dnsmichi Am I right here?

It could be that the core dump was not written because of permissions or file size limits.

I've also just learned that Ubuntu handles that differently 👀
https://askubuntu.com/a/1109747
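
In short: on Ubuntu, apport usually registers itself as a pipe handler for core dumps, so no plain core file shows up under a directory like /var/lib/cores. A quick check/workaround sketch (pattern is an example; apport re-applies its handler when it restarts):

# see who currently handles core dumps; a leading '|' means a pipe handler such as apport
cat /proc/sys/kernel/core_pattern
# temporarily write plain core files instead
sysctl -w kernel.core_pattern='/var/lib/cores/core.%e.%p'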

@lippserd
I think the stack limit direction is a blind guess; I strongly believe that the JSON library upgrade will solve the problem. Unfortunately, I have no proof yet.

dnsmichi (Contributor) commented:

I strongly believe that this is the same problem as reported in #7532.

@BastiBr please post your test results over there; I'd like to close this issue and reference it there.

@dnsmichi dnsmichi removed the needs feedback We'll only proceed once we hear from you again label Dec 16, 2019