Open sessions not closing, hitting ulimit #3048
Labels: theme/internal-cleanup, type/bug
Looks to be related to #3018
consul version
for both Client and Server
Client:
Consul v0.8.1
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
Server:
Consul v0.8.1
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
consul info
for both Client and Server
Client:
agent:
check_monitors = 0
check_ttls = 0
checks = 5
services = 5
build:
prerelease =
revision = '21f2d5a
version = 0.7.5
consul:
known_servers = 3
server = false
runtime:
arch = amd64
cpu_count = 1
goroutines = 1282
max_procs = 2
os = linux
version = go1.7.5
serf_lan:
encrypted = true
event_queue = 0
event_time = 4
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 9
members = 5
query_queue = 0
query_time = 1
Server:
agent:
check_monitors = 0
check_ttls = 0
checks = 2
services = 3
build:
prerelease =
revision = 'e9ca44d
version = 0.8.1
consul:
bootstrap = false
known_datacenters = 2
leader = true
leader_addr = 172.30.12.139:8300
server = true
raft:
applied_index = 274252
commit_index = 274252
fsm_pending = 0
last_contact = 0
last_log_index = 274252
last_log_term = 5
last_snapshot_index = 270360
last_snapshot_term = 5
latest_configuration = [{Suffrage:Voter ID:172.30.12.139:8300 Address:172.30.12.139:8300} {Suffrage:Voter ID:172.30.13.179:8300 Address:172.30.13.179:8300} {Suffrage:Voter ID:172.30.14.249:8300 Address:172.30.14.249:8300}]
latest_configuration_index = 1
num_peers = 2
protocol_version = 2
protocol_version_max = 3
protocol_version_min = 0
snapshot_version_max = 1
snapshot_version_min = 0
state = Leader
term = 5
runtime:
arch = amd64
cpu_count = 1
goroutines = 611
max_procs = 2
os = linux
version = go1.8.1
serf_lan:
encrypted = true
event_queue = 0
event_time = 4
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 9
members = 5
query_queue = 0
query_time = 1
serf_wan:
encrypted = true
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 35
members = 6
query_queue = 0
query_time = 1
Operating system and Environment details
cat /etc/os-release
NAME="Ubuntu"
VERSION="16.04.2 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.2 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
Description of the Issue (and unexpected/desired result)
Quick note, unrelated to this issue, but I noticed it in the output above: running 'consul version' and 'consul info' on the agent running as a client shows different build versions (0.8.1 and 0.7.5), while running them on the agent running as a server shows the same build version. How is that possible?
Anyway, the issue I am experiencing is that the consul agent is running out of open file descriptors and can't open a session.
lsof -p 24420
....
consul 24420 consul 1020u IPv6 10374425 0t0 TCP localhost:8500->localhost:35154 (ESTABLISHED)
consul 24420 consul 1021u IPv6 10384990 0t0 TCP localhost:8500->localhost:39126 (ESTABLISHED)
consul 24420 consul 1022u IPv4 12268580 0t0 TCP localhost:60878->localhost:8500 (ESTABLISHED)
consul 24420 consul 1023u IPv4 10407018 0t0 TCP localhost:47794->localhost:8500 (ESTABLISHED)
You can see that there are 1024 open FDs, the last of them being '1023u'.
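To watch the descriptor count approach the limit without reading the whole lsof listing, something like the following can help. It is only a sketch, assuming a Linux host with /proc mounted; the function names `count_fds` and `soft_nofile_limit` are made up for illustration, and reading another user's /proc/<pid>/fd requires running as that user or as root.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, Linux /proc is assumed.

# Count the open file descriptors of a process by listing /proc/<pid>/fd.
count_fds() {
    local pid="$1"
    ls "/proc/${pid}/fd" 2>/dev/null | wc -l
}

# Print the soft "Max open files" limit that applies to a process.
# The line in /proc/<pid>/limits looks like:
#   Max open files            1024                 4096                 files
soft_nofile_limit() {
    local pid="$1"
    awk '/^Max open files/ {print $4}' "/proc/${pid}/limits"
}

# Example against the current shell; substitute the agent's PID
# (24420 in the lsof output above) when checking consul.
echo "open fds:   $(count_fds $$)"
echo "soft limit: $(soft_nofile_limit $$)"
```

Running the two functions periodically against the agent's PID should show whether the count keeps climbing toward the soft limit.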
Reproduction steps and Log Fragments
Just sit and wait, and eventually it will run out. I believe the cause to be the following:
From client log:
2017/05/15 17:41:45 [ERR] agent: Failed to invoke watch handler '/usr/local/consul-scripts/restartPrometheus.sh': exit status 1
I can confirm that the script is never being called. This is due to the way we had our servers configured: we were using a Chef recipe that installed consul running as the 'consul' user, but that user's login shell in /etc/passwd is /bin/false:
consul:x:999:999:Service user for consul:/home/consul:/bin/false
With the login shell set to /bin/false, the script never gets called. When I changed the consul service account's shell to /bin/bash, the script was called and worked fine, and there were no more errors in the consul log. I haven't rolled out the login shell fix across our environment yet, but in a test environment where the consul user has a login shell, we haven't run out of sessions. I will post an update once I roll out the login shell fix.
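For anyone wanting to check other hosts for the same misconfiguration, here is a rough sketch that pulls the login shell out of a passwd-format line and flags shells that refuse logins. The function names and the list of "no login" shells are my own assumptions, not anything consul defines, and whether the watch handler depends on the login shell is based only on what I observed above.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the nologin list are assumptions.

# Extract the login shell (7th colon-separated field) from a passwd-format line.
shell_from_passwd_line() {
    printf '%s\n' "$1" | cut -d: -f7
}

# Succeed if the given shell path is one that refuses logins.
is_nologin_shell() {
    case "$1" in
        /bin/false|/usr/bin/false|/sbin/nologin|/usr/sbin/nologin) return 0 ;;
        *) return 1 ;;
    esac
}

# The passwd entry quoted above, used here as sample input.
line='consul:x:999:999:Service user for consul:/home/consul:/bin/false'
shell="$(shell_from_passwd_line "$line")"
if is_nologin_shell "$shell"; then
    echo "login shell is $shell: scripts that rely on it may never run"
fi
```

On a live host the input line can come from `getent passwd consul` instead of the hard-coded sample.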
As stated at the beginning, this is possibly related to #3018.