Traefik crashing #458

Closed
rogeriollacerda opened this issue Jun 14, 2016 · 20 comments

@rogeriollacerda

rogeriollacerda commented Jun 14, 2016

traefik.toml.txt

I'm using Traefik at around 1,000 req/sec. It crashes many times a day with no error logs; all I can see is a goroutine dump with many goroutines in I/O wait. I'm running the official Docker image with the traefik executable.

sysctl parameters applied:

fs.file-max="9999999"
fs.nr_open="9999999"
net.core.netdev_max_backlog="4096"
net.core.rmem_max="16777216"
net.core.somaxconn="65535"
net.core.wmem_max="16777216"
net.ipv4.ip_local_port_range="1025 65535"
net.ipv4.tcp_fin_timeout="30"
net.ipv4.tcp_keepalive_time="30"
net.ipv4.tcp_max_syn_backlog="20480"
net.ipv4.tcp_max_tw_buckets="400000"
net.ipv4.tcp_no_metrics_save="1"
net.ipv4.tcp_syn_retries="2"
net.ipv4.tcp_synack_retries="2"
net.ipv4.tcp_tw_recycle="1"
net.ipv4.tcp_tw_reuse="1"
vm.min_free_kbytes="65536"
vm.overcommit_memory="1"

Traefik version:

2016/06/14 20:08:01 v1.0.0-rc1 built on the 2016-05-30_10:28:25PM

Ubuntu Server 14.04
traefik.log.txt

@emilevauge emilevauge added this to the 1.0 milestone Jun 15, 2016
@emilevauge
Member

emilevauge commented Jun 15, 2016

@rogeriollacerda thanks for reporting this.

  • Are you sure you don't have any other logs? It's weird, because there is no error log before the dump, and I cannot see where the issue is here...
  • Could you try with the new 1.0.0-rc2 version? I don't think it will change anything... but just to be sure.

@rogeriollacerda
Author

rogeriollacerda commented Jun 15, 2016

@emilevauge thanks for the answer.

I was using a Docker image built FROM traefik. Today, after your answer, I'm trying with this Dockerfile:

FROM scratch
COPY certs/ca-certificates.crt /etc/ssl/certs/
COPY traefik /
COPY traefik.toml /etc/traefik/traefik.toml
EXPOSE 80
ENTRYPOINT ["/traefik"]

I am now using 1.0.0-rc2. After testing it in the production environment, I will post the results here. We are now receiving 2,000 requests per second on a single Traefik instance.

@emilevauge
Member

FYI, you can also use the official rc2 Docker image :)

@emilevauge
Member

Probably linked to #462

@emilevauge
Member

emilevauge commented Jun 16, 2016

@rogeriollacerda could you try using a lower value for MaxIdleConnsPerHost? For example, 10000.

@rogeriollacerda
Author

@emilevauge I still have the same problem. I'll try changing MaxIdleConnsPerHost, but in my first Docker image I used the default value (200). Do you think it could be the request volume? Yesterday I also changed GOMAXPROCS to 10. Traefik is running alone on a dedicated 24-core server.

@emilevauge
Member

@rogeriollacerda: I also fixed a memory leak yesterday in PR #464 and think it could be linked to your issue: you can grab the docker image containous/traefik:pr-464 to test the fix.

@rogeriollacerda
Author

@emilevauge I have already changed MaxIdleConnsPerHost and deployed the version with the memory leak fix to the production environment. I'll send you feedback soon. Thanks a lot!

@rogeriollacerda
Author

@emilevauge FYI

Traefik docker stats after 22 minutes:

CONTAINER      CPU %     MEM USAGE / LIMIT     MEM %    NET I/O               BLOCK I/O
0e598b649f06   214.80%   926.6 MB / 8.892 GB   10.42%   24.64 GB / 23.89 GB   0 B / 0 B

No problems yet.

@rogeriollacerda
Author

@emilevauge

The same problem occurs with the new version. The error was:

fatal error: concurrent map read and map write

stderr file attached.
stderr.tar.gz

@emilevauge
Member

OK, you didn't give me the error part in your previous logs: fatal error: concurrent map read and map write. I will investigate.

@rogeriollacerda
Author

@emilevauge Sure, you are right. By default, Mesos deletes the logs of failed containers. Now I store all Traefik logs somewhere else. Sorry I didn't send you the file before. Thanks

@emilevauge
Member

OK, the funny part of this is that the crash is due to the health check...
The issue is in thoas/stats#13
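The crash comes from the Go runtime itself: when it detects one goroutine writing to a map while another reads it, it aborts the whole process with "fatal error: concurrent map read and map write", and this cannot be caught with recover(). The thread traces the race to the thoas/stats middleware behind the health/stats output. The usual fix for this class of bug is to guard the shared map with a sync.RWMutex and hand readers a copy. The sketch below illustrates that pattern with a hypothetical statusCounter type; it is not the actual patch for thoas/stats#13.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

// statusCounter is a hypothetical request-statistics collector. Request
// goroutines write to counts while a stats/health handler reads it, so all
// access goes through mu.
type statusCounter struct {
	mu     sync.RWMutex
	counts map[int]int
}

func newStatusCounter() *statusCounter {
	return &statusCounter{counts: make(map[int]int)}
}

// Record is called once per proxied request (a write, so exclusive lock).
func (s *statusCounter) Record(code int) {
	s.mu.Lock()
	s.counts[code]++
	s.mu.Unlock()
}

// Snapshot copies the map under a read lock, so callers (for example a
// JSON stats endpoint) never iterate the live map.
func (s *statusCounter) Snapshot() map[int]int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	out := make(map[int]int, len(s.counts))
	for k, v := range s.counts {
		out[k] = v
	}
	return out
}

func main() {
	s := newStatusCounter()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				s.Record(http.StatusOK)
				_ = s.Snapshot() // concurrent reads no longer race with writes
			}
		}()
	}
	wg.Wait()
	fmt.Println(s.Snapshot()[http.StatusOK]) // 100000
}
```

Without the mutex, running the same loop under `go run -race` would report the race, and under load the runtime may abort exactly as in the stderr log above.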

@rogeriollacerda
Author

@emilevauge

Is it possible to disable stats by removing TCP port 8080 from the configuration? Or will Traefik collect them anyway?

@emilevauge
Member

As a temporary workaround, you can change your health check to /api instead of /health.

@rogeriollacerda
Author

@emilevauge my container health check uses the / path. Is /api better than /?

@emilevauge
Member

Yep

@rogeriollacerda
Author

OK, thanks

@rogeriollacerda
Author

@emilevauge FYI, it seems OK with your workaround. Stats:

pid: 1,
uptime: "3h19m2.442664624s",
uptime_sec: 11942.442664624,
time: "2016-06-17 20:41:54.91559806 +0000 UTC",
unixtime: 1466196114,

total_status_code_count:
  200: 10795270,
  301: 6

@rogeriollacerda
Author

@emilevauge any news about this? Thanks!

@ldez ldez added the kind/bug/confirmed a confirmed bug (reproducible). label Apr 29, 2017
@traefik traefik locked and limited conversation to collaborators Sep 1, 2019