Icinga 2.11rc1 unable to be connected & crash (after upgrade) #7470

Closed
Sec42 opened this issue Sep 3, 2019 · 7 comments
Labels
area/distributed Distributed monitoring (master, satellites, clients)

Comments

@Sec42

Sec42 commented Sep 3, 2019

Describe the bug

After upgrading from 2.10 to 2.11rc1 the resulting server was uncommunicative on the :5665 port.
Additionally the server crashed after ~15 min of runtime.

Details

To clarify the "uncommunicative" part: the server listened on 5665 and accepted connections, but did not respond to anything after the SSL handshake when credentials were supplied.

root@munnvmonpmac11:~#curl -v -k -u user:pass 'https://munnvmonpmac11:5665/'
* About to connect() to munnvmonpmac11 port 5665 (#0)
*   Trying x.x.x.x...
* Connected to munnvmonpmac11 (x.x.x.x) port 5665 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* skipping SSL peer certificate verification
* NSS: client certificate not found (nickname not specified)
* SSL connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
* Server certificate:
*       subject: CN=munnvmonpmac11,XXXX
*       start date: Sep 28 15:36:15 2018 GMT
*       expire date: Sep 25 15:36:15 2028 GMT
*       common name: munnvmonpmac11
*       issuer: OU=xxx
* Server auth using Basic with user 'user'
> GET / HTTP/1.1
> Authorization: Basic xxx
> User-Agent: curl/7.29.0
> Host: munnvmonpmac11:5665
> Accept: */*
>
* Empty reply from server
* Connection #0 to host munnvmonpmac11 left intact

Without credentials, the expected error message was given:

root@munnvmonpmac11:~#curl -k 'https://munnvmonpmac11:5665/' 
<h1>Unauthorized. Please check your user credentials.</h1>

icinga2 console --connect https://munnvmonpmac11:5665/ behaved similarly (returning a timeout error in response to any line)
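The two curl calls above can be repeated from a script to tell "empty reply" apart from a timeout or a normal HTTP answer. The following is only a minimal sketch using the Python standard library; the function names `probe` and `describe_failure` are made up for illustration, and certificate verification is switched off to mirror `curl -k`, which is acceptable for diagnostics only.

```python
import base64
import http.client
import socket
import ssl


def describe_failure(exc):
    """Map an exception raised during the request to a short label."""
    # Server closed the connection without sending a status line
    # (what curl reports as "Empty reply from server").
    if isinstance(exc, http.client.RemoteDisconnected):
        return "empty reply"
    # No answer within the configured timeout.
    if isinstance(exc, socket.timeout):
        return "timeout"
    return f"connection error: {type(exc).__name__}"


def probe(host, port=5665, user=None, password=None, path="/", timeout=10.0):
    """Send one GET request and return a label describing the outcome."""
    # Assumption: skip certificate checks, like `curl -k` in the report.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE

    headers = {"Accept": "*/*"}
    if user is not None:
        token = base64.b64encode(f"{user}:{password}".encode()).decode()
        headers["Authorization"] = f"Basic {token}"

    conn = http.client.HTTPSConnection(host, port, timeout=timeout, context=context)
    try:
        conn.request("GET", path, headers=headers)
        response = conn.getresponse()
        response.read()
        return f"http {response.status}"
    except (http.client.HTTPException, OSError) as exc:
        return describe_failure(exc)
    finally:
        conn.close()
```

Calling `probe("munnvmonpmac11")` and `probe("munnvmonpmac11", user="user", password="pass")` would correspond to the two curl invocations shown in this report.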

Additionally, the server crashed after some runtime, although I'm unsure if this is related to the other issue or not.

To Reproduce

This happened after a yum upgrade to 2.11rc1.

Environment

~1000 Hosts with a total of 15000 Services.
~200 Hosts directly connecting to the master, the rest via satellites.

Additional context

All our agents are configured to connect to the master server. Logs for one of the hosts looked like this:

[2019-09-03 13:27:19 +0000] information/ApiListener: New client connection for identity 'example-host' from [x.x.x.x]:54378
[2019-09-03 13:28:19 +0000] warning/ApiListener: Removing API client for endpoint 'example-host'. 2 API clients left.
[2019-09-03 13:28:35 +0000] information/ApiListener: Sending config updates for endpoint 'example-host' in zone 'example-host'.
[2019-09-03 13:28:35 +0000] information/ApiListener: Syncing configuration files for global zone 'global' to endpoint 'example-host'.
[2019-09-03 13:28:35 +0000] information/ApiListener: Finished sending config file updates for endpoint 'example-host' in zone 'example-host'.
[2019-09-03 13:28:35 +0000] information/ApiListener: Syncing runtime objects to endpoint 'example-host'.
[2019-09-03 13:28:35 +0000] information/ApiListener: Finished syncing runtime objects to endpoint 'example-host'.
[2019-09-03 13:28:35 +0000] information/ApiListener: Finished sending runtime config updates for endpoint 'example-host' in zone 'example-host'.
[2019-09-03 13:28:35 +0000] warning/ApiListener: Removing API client for endpoint 'example-host'. 1 API clients left.
[2019-09-03 13:28:49 +0000] information/JsonRpcConnection: No messages for identity 'example-host' have been received in the last 60 seconds.
[2019-09-03 13:28:49 +0000] warning/JsonRpcConnection: API client disconnected for identity 'example-host'
[2019-09-03 13:28:59 +0000] information/ApiListener: New client connection for identity 'example-host' from [x.x.x.x]:54450
[2019-09-03 13:30:29 +0000] information/JsonRpcConnection: No messages for identity 'example-host' have been received in the last 60 seconds.
[2019-09-03 13:30:29 +0000] warning/JsonRpcConnection: API client disconnected for identity 'example-host'
[2019-09-03 13:30:34 +0000] warning/ApiListener: Removing API client for endpoint 'example-host'. 1 API clients left.
[2019-09-03 13:30:39 +0000] information/ApiListener: New client connection for identity 'example-host' from [x.x.x.x]:54514
[2019-09-03 13:32:09 +0000] information/JsonRpcConnection: No messages for identity 'example-host' have been received in the last 60 seconds.
[2019-09-03 13:32:09 +0000] warning/JsonRpcConnection: API client disconnected for identity 'example-host'
[2019-09-03 13:32:19 +0000] information/ApiListener: New client connection for identity 'example-host' from [x.x.x.x]:54578
[2019-09-03 13:33:49 +0000] information/JsonRpcConnection: No messages for identity 'example-host' have been received in the last 60 seconds.
[2019-09-03 13:33:49 +0000] warning/JsonRpcConnection: API client disconnected for identity 'example-host'
[2019-09-03 13:33:55 +0000] warning/ApiListener: Removing API client for endpoint 'example-host'. 2 API clients left.
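To see how often an agent cycles through connect, idle timeout and disconnect, log lines in the format above can be counted per identity. This is a rough sketch that only knows the three message prefixes visible in this excerpt; the names `summarize` and `classify` are illustrative and not part of Icinga 2.

```python
import re
from collections import Counter

# "[timestamp] severity/Facility: message", as in the excerpt above.
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<severity>\w+)/(?P<facility>\w+):\s+(?P<message>.*)$"
)
# The peer name appears as identity '...' or endpoint '...'.
IDENTITY_RE = re.compile(r"(?:identity|endpoint) '(?P<name>[^']+)'")


def classify(message):
    """Return an event label for the message, or None if it is not tracked."""
    if message.startswith("New client connection"):
        return "connect"
    if message.startswith("API client disconnected"):
        return "disconnect"
    if message.startswith("No messages for identity"):
        return "idle_timeout"
    return None


def summarize(lines):
    """Count tracked events per identity across the given log lines."""
    counts = {}
    for line in lines:
        parsed = LINE_RE.match(line.strip())
        if parsed is None:
            continue
        event = classify(parsed["message"])
        identity = IDENTITY_RE.search(parsed["message"])
        if event is None or identity is None:
            continue
        counts.setdefault(identity["name"], Counter())[event] += 1
    return counts
```

Feeding the log file to `summarize(open(path))` returns one `Counter` per identity; a host with many `idle_timeout` events relative to `connect` events shows the reconnect loop described here.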

Crashlogs:

  Application version: 2.11.0-0.rc1.1

System information:
  Platform: CentOS Linux
  Platform version: 7 (Core)
  Kernel: Linux
  Kernel version: 3.10.0-957.21.3.el7.x86_64
  Architecture: x86_64

Build information:
  Compiler: GNU 4.8.5
  Build host: runner-LTrJQZ9N-project-322-concurrent-0

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid
Stacktrace:

	(0) libc.so.6: gsignal (+0x37) [0x7fc8d42d22c7]
	(1) libc.so.6: abort (+0x148) [0x7fc8d42d39b8]
	(2) libc.so.6: <unknown function> (+0x78e17) [0x7fc8d4314e17]
	(3) libc.so.6: <unknown function> (+0x81609) [0x7fc8d431d609]
	(4) /usr/lib64/icinga2/sbin/icinga2() [0x6447b6]
	(5) icinga2: icinga::JsonRpcConnection::SendMessageInternal(boost::intrusive_ptr<icinga::Dictionary> const&) (+0x6c) [0x998c0c]
	(6) /usr/lib64/icinga2/sbin/icinga2() [0x998d6f]
	(7) icinga2: boost::asio::detail::strand_service::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) (+0x75) [0x9b1145]
	(8) /usr/lib64/icinga2/sbin/icinga2() [0x6378c1]
	(9) icinga2: icinga::IoEngine::RunEventLoop() (+0x58) [0x910918]
	(10) libstdc++.so.6: <unknown function> (+0xb5070) [0x7fc8d4e52070]
	(11) libpthread.so.0: <unknown function> (+0x7dd5) [0x7fc8d4670dd5]
	(12) libc.so.6: clone (+0x6d) [0x7fc8d439a02d]

***
* This would indicate a runtime problem or configuration error. If you believe this is a bug in Icinga 2
* please submit a bug report at https://github.com/Icinga/icinga2 and include this stack trace as well as any other
* information that might be useful in order to reproduce this problem.
***

[New LWP 23490]

...

[New LWP 23048]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fc8d4360fad in nanosleep () from /lib64/libc.so.6

Thread 203 (Thread 0x7fc8ceb48700 (LWP 23048)):
#0  0x00007fc8d4674965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x000000000063775c in boost::asio::detail::scheduler::run(boost::system::error_code&) [clone .local.25172] ()
No symbol table info available.
#2  0x0000000000637c12 in boost::asio::detail::posix_thread::func<boost::asio::thread_pool::thread_function>::run() [clone .local.25170] ()
No symbol table info available.
#3  0x000000000082f2df in boost_asio_detail_posix_thread_function ()
No symbol table info available.
#4  0x00007fc8d4670dd5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#5  0x00007fc8d439a02d in clone () from /lib64/libc.so.6
No symbol table info available.

...

Thread 1 (Thread 0x7fc8d73c98c0 (LWP 23047)):
#0  0x00007fc8d4360fad in nanosleep () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007fc8d4360e44 in sleep () from /lib64/libc.so.6
No symbol table info available.
#2  0x000000000062d274 in icinga::Utility::Sleep(double) ()
No symbol table info available.
#3  0x0000000000b6aecd in icinga::Application::RunEventLoop() ()
No symbol table info available.
#4  0x0000000000b6bbdb in icinga::IcingaApplication::Main() ()
No symbol table info available.
#5  0x00000000006fbce9 in icinga::Application::Run() ()
No symbol table info available.
#6  0x0000000000bc3cb0 in StartUnixWorker(std::vector<std::string, std::allocator<std::string> > const&) [clone .499179.10614] ()
No symbol table info available.
#7  0x0000000000bc50d0 in icinga::DaemonCommand::Run(boost::program_options::variables_map const&, std::vector<std::string, std::allocator<std::string> > const&) const ()
No symbol table info available.
#8  0x0000000000bcf683 in Main() [clone .17173.16253] ()
No symbol table info available.
#9  0x00000000006026a8 in main ()
No symbol table info available.
@dnsmichi
Contributor

dnsmichi commented Sep 4, 2019

Please test that with the current snapshot packages, rc1 received quite a few fixes in this regard.
https://icinga.com/docs/icinga2/snapshot/doc/21-development/#rhelcentos

@dnsmichi dnsmichi added the needs feedback We'll only proceed once we hear from you again label Sep 4, 2019
@Sec42
Author

Sec42 commented Sep 4, 2019

Hi @dnsmichi, as these servers don't have direct internet access and are managed via Spacewalk, can you provide the repo URLs? The package listed is not very useful.

After checking http://packages.icinga.com/, I suspect the correct URL is http://packages.icinga.com/epel/$releasever/snapshot/?

Also, if I may suggest, if your response to a bug report is just telling people that rc1 is untrustworthy/buggy, remove that package, or at least add a note informing people about it.

Thanks.

@dnsmichi
Contributor

dnsmichi commented Sep 4, 2019

@dnsmichi
Contributor

Fixes for #7431 might influence this.

@Sec42
Author

Sec42 commented Oct 30, 2019

I have now tackled another upgrade to 2.11.1 and did not get the "uncommunicative" problem any more. So that seems to be fixed.

I still got long unresponsive periods and regular crashes with 2.11.1 and 2.11.2 on random master & satellite hosts after each reload.

This problem was greatly reduced after setting "log_duration=0" in all agent/client "Endpoint" definitions in the (top-down) configuration.
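To check which Endpoint objects still lack that setting, the configuration text can be scanned for `object Endpoint` blocks without `log_duration = 0`. This is a simplified sketch: it assumes each Endpoint is a flat block without nested braces or commented-out attributes, and the function name and the sample endpoint names in the usage note are hypothetical.

```python
import re

# Matches: object Endpoint "name" { ... } with no nested braces in the body.
ENDPOINT_RE = re.compile(
    r'object\s+Endpoint\s+"(?P<name>[^"]+)"\s*\{(?P<body>[^}]*)\}'
)
# log_duration set to zero, with or without a unit suffix such as "0s".
DISABLED_RE = re.compile(r"\blog_duration\s*=\s*0(?![\d.])")


def endpoints_with_replay_log(config_text):
    """Return names of Endpoint objects that do not set log_duration to 0."""
    return [
        match["name"]
        for match in ENDPOINT_RE.finditer(config_text)
        if not DISABLED_RE.search(match["body"])
    ]
```

For a config containing an endpoint "agent-a" with `log_duration = 0` and an endpoint "agent-b" without it, the function returns `["agent-b"]`.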

Not sure if you are interested in chasing/debugging those crashes.

@dnsmichi
Contributor

Thanks for the update. The hint with the replay log is something we may have seen with #7597 @lippserd

I'd assume that /var/log/icinga2/api/log is not cleared after replaying the logs to the parent instance?

@dnsmichi
Contributor

The crash log is the same as we see in #7532

There's a mitigation fix available in the snapshot packages. Would be awesome if you can test them and report back into #7532 @Sec42 - thanks :)

@dnsmichi dnsmichi added area/distributed Distributed monitoring (master, satellites, clients) and removed needs feedback We'll only proceed once we hear from you again labels Dec 16, 2019