
# Consul Bug Report

**IMPORTANT NOTE:** Consul 0.6.x has fixed this particular problem. I have been testing it at more than four times the previous limit - the point where the cluster used to fall over - and it's fine.

Great work Hashicorp, @slackpad and @armon!

First, make sure you've set up the environment variables you need for AWS, then build the AMI:

```
packer build packer.json
```

Then you need to copy and update the variables file:

```
cp consul/variables.dist consul/variables.tf
vi consul/variables.tf
```

Then update the DD_API key in `consul/scripts/datadog.sh`. When the bootstrap and server nodes boot, make sure to add the `role:consul-server` tag to them in the Datadog web UI so that you can aggregate the data. You can sign up for a free account to test - click on "Get Started Free".
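If you'd rather script that edit, something like this works - a hedged sketch, assuming `datadog.sh` sets the key as a `DD_API=` variable on its own line (check the script first; the demo below runs against a throwaway copy):

```shell
# Demo on a throwaway file; point sed at consul/scripts/datadog.sh for real.
cat > /tmp/datadog-demo.sh <<'EOF'
DD_API=CHANGEME
EOF
# Replace whatever key is set with yours (GNU sed; BSD sed needs -i '').
sed -i 's/^DD_API=.*/DD_API=your-real-key/' /tmp/datadog-demo.sh
cat /tmp/datadog-demo.sh
```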

Boot the cluster with `cd consul && terraform apply`. Once the cluster is up and running, give it a few minutes to settle down. This is how it looks at rest:

http://shared.froese.org/2015/f01gy-13-10.jpg
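One way to tell the cluster has settled is to count alive members from a server node. The snippet below parses sample `consul members`-style output (the sample node names and addresses are made up so the block is self-contained); on a real node you'd pipe the actual command in:

```shell
# Sample lines shaped like `consul members` output: node, address, status, type.
# On a cluster node, replace the printf with: consul members
members='ip-10-0-1-10  10.0.1.10:8301  alive   server
ip-10-0-1-11  10.0.1.11:8301  alive   client
ip-10-0-1-12  10.0.1.12:8301  failed  client'
printf '%s\n' "$members" | awk '$3 == "alive" { n++ } END { printf "alive: %d\n", n }'
```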

NOTE: We saw problems at just over 100 nodes, so we spin up 120 by default. At roughly $0.02/hour per node, the full cluster costs about $2.40/hour - a very cheap test.

To get the cluster into an unstable mode, first ssh to a server or bootstrap node:

```
# Set up the KV values.
consulkv set services/bubs 1
consulkv set services/bunk 1
consulkv set services/cassandra 1
consulkv set services/consul 1
consulkv set services/context-server 1
consulkv set services/daniels 1
consulkv set services/delancie 1
consulkv set services/haproxy 1
consulkv set services/kafka 1
consulkv set services/lamar 1
consulkv set services/postgresql 1
consulkv set services/rawls 1
consulkv set services/redis 1
consulkv set services/spidly 1
consulkv set services/spiros 1
# Generate some CPU stress.
consul event -service datadog -name destress
consul event -service datadog -name stress
# Get Consul Template going.
consul exec -service datadog "cd /tmp && rm -f services.cfg"
consul exec -service datadog "cd /tmp && wget https://gist.githubusercontent.com/darron/bf8dd32540d1dc09dac3/raw/433910515fd7a7070cbfe5a932363ec9f43a3688/services.cfg"
consul exec -service datadog "cd /tmp && wget https://gist.githubusercontent.com/darron/22c88190b69b5f20095f/raw/66fba5b4fead6255589ba01fb8306671ddf428b0/services.ctmpl"
consul exec -service datadog "consul-template -config /tmp/services.cfg &"
consul exec -service datadog "cd /tmp && wget https://gist.githubusercontent.com/darron/a4e5a51325dc7d4feddf/raw/2673f9438c0a740125ad0071ea7efade8a477c49/consul-template.conf"
consul exec -service datadog "cd /tmp && chmod 644 consul-template.conf && sudo mv consul-template.conf /etc/init/"
consul exec -service datadog "cd /etc/init.d/ && sudo ln -s /lib/init/upstart-job consul-template"
consul exec -service datadog "sudo service consul-template start"
```

Pick three random client nodes and log in to each. Every minute, stop or start Consul on each node, rotating back and forth. This forces Consul Template to run at least once a minute.
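The flapping can be scripted per node. This is a hedged sketch, assuming Consul runs as the `consul` upstart service on the client nodes (mirroring the consul-template upstart job above); it echoes the commands instead of running them so you can eyeball it first:

```shell
# Alternate stop/start once a minute; run one copy per chosen client node.
toggle() {
  if [ "$1" = "running" ]; then echo stop; else echo start; fi
}
state="running"
for i in 1 2 3; do                     # use `while true; do` for the real run
  action=$(toggle "$state")
  echo "sudo service consul $action"   # drop the echo to actually flap Consul
  if [ "$state" = "running" ]; then state="stopped"; else state="running"; fi
  # sleep 60                           # uncomment for the one-minute cadence
done
```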

Doing this, I was able to make the cluster very unstable and lose quorum over and over:

http://shared.froese.org/2015/lw2mf-11-03.jpg

http://shared.froese.org/2015/cobi3-14-13.jpg

Once you've had enough debugging, you can calm it all down like this:

```
consul exec -service datadog sudo pkill consul-template
consul event -service datadog -name destress
```

Then a simple `terraform destroy` will remove your cluster.