Open sessions not closing, hitting ulimit #3048

Closed

fortman opened this issue May 15, 2017 · 1 comment

Labels: theme/internal-cleanup, type/bug

fortman commented May 15, 2017

Looks to be related to #3018

consul version for both Client and Server

Client:
Consul v0.8.1
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

Server:
Consul v0.8.1
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

consul info for both Client and Server

Client:
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 5
    services = 5
build:
    prerelease =
    revision = '21f2d5a
    version = 0.7.5
consul:
    known_servers = 3
    server = false
runtime:
    arch = amd64
    cpu_count = 1
    goroutines = 1282
    max_procs = 2
    os = linux
    version = go1.7.5
serf_lan:
    encrypted = true
    event_queue = 0
    event_time = 4
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 9
    members = 5
    query_queue = 0
    query_time = 1

Server:
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 2
    services = 3
build:
    prerelease =
    revision = 'e9ca44d
    version = 0.8.1
consul:
    bootstrap = false
    known_datacenters = 2
    leader = true
    leader_addr = 172.30.12.139:8300
    server = true
raft:
    applied_index = 274252
    commit_index = 274252
    fsm_pending = 0
    last_contact = 0
    last_log_index = 274252
    last_log_term = 5
    last_snapshot_index = 270360
    last_snapshot_term = 5
    latest_configuration = [{Suffrage:Voter ID:172.30.12.139:8300 Address:172.30.12.139:8300} {Suffrage:Voter ID:172.30.13.179:8300 Address:172.30.13.179:8300} {Suffrage:Voter ID:172.30.14.249:8300 Address:172.30.14.249:8300}]
    latest_configuration_index = 1
    num_peers = 2
    protocol_version = 2
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Leader
    term = 5
runtime:
    arch = amd64
    cpu_count = 1
    goroutines = 611
    max_procs = 2
    os = linux
    version = go1.8.1
serf_lan:
    encrypted = true
    event_queue = 0
    event_time = 4
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 9
    members = 5
    query_queue = 0
    query_time = 1
serf_wan:
    encrypted = true
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 35
    members = 6
    query_queue = 0
    query_time = 1

Operating system and Environment details

cat /etc/os-release
NAME="Ubuntu"
VERSION="16.04.2 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.2 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial

Description of the Issue (and unexpected/desired result)

Quick note, unrelated to this issue, but I noticed in the output above that running 'consul version' and 'consul info' on the agent running as a client shows different build versions, while running them on the agent running as a server shows the same build version. How is that possible?

Anyway, the issue I am experiencing is that the consul agent is running out of open file descriptors and can't open a session.

lsof -p 24420
....
consul 24420 consul 1020u IPv6 10374425 0t0 TCP localhost:8500->localhost:35154 (ESTABLISHED)
consul 24420 consul 1021u IPv6 10384990 0t0 TCP localhost:8500->localhost:39126 (ESTABLISHED)
consul 24420 consul 1022u IPv4 12268580 0t0 TCP localhost:60878->localhost:8500 (ESTABLISHED)
consul 24420 consul 1023u IPv4 10407018 0t0 TCP localhost:47794->localhost:8500 (ESTABLISHED)

You can see that there are 1024 open FDs, the last of them being '1023u' (descriptors are numbered from 0, so FD 1023 means the default limit of 1024 has been reached).
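
For anyone watching the same thing happen, a quick way to track the descriptor count against the process limit is below (a sketch using the agent PID from the lsof output above; substitute your own PID):

ls /proc/24420/fd | wc -l                    # number of descriptors the agent currently has open
grep 'Max open files' /proc/24420/limits     # the per-process limit the agent is running under

Raising the limit (for example with LimitNOFILE= in the agent's systemd unit) would only delay the problem here, since the leaked connections just keep accumulating.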

Reproduction steps and Log Fragments

Just sit and wait, and eventually it will run out. I believe the cause to be the following.

From client log:
2017/05/15 17:41:45 [ERR] agent: Failed to invoke watch handler '/usr/local/consul-scripts/restartPrometheus.sh': exit status 1
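
For reference, that handler is fired by a Consul watch. Something like the following should let you exercise the same handler by hand (a sketch only; the -type and -service values are guesses, since our actual watch definition isn't shown here):

consul watch -type=service -service=prometheus /usr/local/consul-scripts/restartPrometheus.sh   # service name is a guess based on the handler script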

I can confirm that the script is never being called. This is due to the way we had our servers configured: we were using a Chef recipe that installed Consul running as the 'consul' user, but that user's entry in /etc/passwd is
consul:x:999:999:Service user for consul:/home/consul:/bin/false

With the login shell set to /bin/false, the script never gets called. When I updated the consul service account's shell to /bin/bash, the script was called and worked fine, and there were no more errors in the consul log. I haven't rolled out the login shell fix across our environment yet, but in a test environment where the consul user had a login shell, we haven't run out of sessions. I will post an update once I roll out the login shell fix.
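
In case it helps someone else, these are roughly the commands involved in checking and applying the fix (a sketch; the user name and handler path are from our setup and may differ in yours):

getent passwd consul                      # shows the service account's current login shell
sudo usermod -s /bin/bash consul          # give the account a usable shell
sudo -u consul /usr/local/consul-scripts/restartPrometheus.sh   # sanity-check the handler as that user

Depending on your security posture you may prefer /bin/sh or a restricted shell over /bin/bash; the key point is that the agent appears to invoke the handler through the account's shell, so /bin/false breaks it.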

As stated at the beginning, it is possibly related to #3018.

@slackpad added the type/bug and theme/internal-cleanup labels on May 25, 2017
preetapan (Contributor) commented

The fix for issue #3018 in PR #3195 should resolve this as well, so I will close this one out. @fortman, let us know if this happens again.
