This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

etcd2 cluster doesnt start #62

Closed
baldeynz opened this issue Nov 15, 2016 · 8 comments
Labels
triage/support Indicates an issue that is a support question.

Comments

@baldeynz

baldeynz commented Nov 15, 2016

Hi

I'm not sure whether I have something configured incorrectly or whether this is a bug, so apologies if this is posted as an issue and it's not. I have looked for docs around this config and can't see anything obviously wrong, but I'm happy to read more if someone can point me in the right direction.

The problem is that, using the default config, the etcd2 cluster won't start.

I'm using v0.9.1-rc.2.

Below is the etcd2 config in the userdata/cloud-config-etcd file, which is unchanged from what was generated by the kube-aws render command:

units:
  - name: etcd2.service
    drop-ins:
      - name: 20-etcd2-aws-cluster.conf
        content: |
          [Unit]
          Requires=decrypt-tls-assets.service
          After=decrypt-tls-assets.service

          [Service]
          Environment=ETCD_NAME=%H

          Environment=ETCD_PEER_TRUSTED_CA_FILE=/etc/etcd2/ssl/ca.pem
          Environment=ETCD_PEER_CERT_FILE=/etc/etcd2/ssl/etcd.pem
          Environment=ETCD_PEER_KEY_FILE=/etc/etcd2/ssl/etcd-key.pem

          Environment=ETCD_CLIENT_CERT_AUTH=true
          Environment=ETCD_TRUSTED_CA_FILE=/etc/etcd2/ssl/ca.pem
          Environment=ETCD_CERT_FILE=/etc/etcd2/ssl/etcd.pem
          Environment=ETCD_KEY_FILE=/etc/etcd2/ssl/etcd-key.pem

          Environment=ETCD_INITIAL_CLUSTER_STATE=new
          Environment=ETCD_INITIAL_CLUSTER={{.EtcdInitialCluster}}
          Environment=ETCD_DATA_DIR=/var/lib/etcd2
          Environment=ETCD_LISTEN_CLIENT_URLS=https://%H:2379
          Environment=ETCD_ADVERTISE_CLIENT_URLS=https://%H:2379
          Environment=ETCD_LISTEN_PEER_URLS=https://%H:2380
          Environment=ETCD_INITIAL_ADVERTISE_PEER_URLS=https://%H:2380
          PermissionsStartOnly=true
          ExecStartPre=/usr/bin/chown -R etcd:etcd /var/lib/etcd2
    enable: true
    command: start

My cluster.yaml contains:

hostedZoneId: "sandbox.testwaikato.kiwi"

and

etcdCount: 3

All other DNS and etcd config in it is default.

The problem is that when etcd tries to start with this config there is an error:

Nov 15 20:10:52 ip-172-19-76-198.sandbox.testwaikato.kiwi systemd[1]: Starting etcd2...
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=https://ip-172-19-76-198.sandbox.testwaikato.kiwi:2379
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: recognized and used environment variable ETCD_CERT_FILE=/etc/etcd2/ssl/etcd.pem
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: recognized and used environment variable ETCD_CLIENT_CERT_AUTH=true
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd2
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: recognized and used environment variable ETCD_ELECTION_TIMEOUT=1200
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=https://ip-172-19-76-198.sandbox.testwaikato.kiwi:2380
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: recognized and used environment variable ETCD_INITIAL_CLUSTER=ip-172-19-76-198.ap-southeast-2.compute.internal=https://ip-172-19-76-198.ap-southeast-2.compute.intern
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=new
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: recognized and used environment variable ETCD_KEY_FILE=/etc/etcd2/ssl/etcd-key.pem
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=https://ip-172-19-76-198.sandbox.testwaikato.kiwi:2379
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: recognized and used environment variable ETCD_LISTEN_PEER_URLS=https://ip-172-19-76-198.sandbox.testwaikato.kiwi:2380
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: recognized and used environment variable ETCD_NAME=ip-172-19-76-198.sandbox.testwaikato.kiwi
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: recognized and used environment variable ETCD_PEER_CERT_FILE=/etc/etcd2/ssl/etcd.pem
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: recognized and used environment variable ETCD_PEER_KEY_FILE=/etc/etcd2/ssl/etcd-key.pem
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: recognized and used environment variable ETCD_PEER_TRUSTED_CA_FILE=/etc/etcd2/ssl/ca.pem
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: recognized and used environment variable ETCD_TRUSTED_CA_FILE=/etc/etcd2/ssl/ca.pem
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: etcd Version: 2.3.7
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: Git SHA: fd17c91
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: Go Version: go1.7.1
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: Go OS/Arch: linux/amd64
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: setting maximum number of CPUs to 1, total number of available CPUs is 1
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: found invalid file/dir lost+found under data dir /var/lib/etcd2 (Ignore this if you are upgrading etcd)
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: peerTLS: cert = /etc/etcd2/ssl/etcd.pem, key = /etc/etcd2/ssl/etcd-key.pem, ca = , trusted-ca = /etc/etcd2/ssl/ca.pem, client-cert-auth = false
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: listening for peers on https://ip-172-19-76-198.sandbox.testwaikato.kiwi:2380
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: clientTLS: cert = /etc/etcd2/ssl/etcd.pem, key = /etc/etcd2/ssl/etcd-key.pem, ca = , trusted-ca = /etc/etcd2/ssl/ca.pem, client-cert-auth = true
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: listening for client requests on https://ip-172-19-76-198.sandbox.testwaikato.kiwi:2379
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: stopping listening for client requests on https://ip-172-19-76-198.sandbox.testwaikato.kiwi:2379
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: stopping listening for peers on https://ip-172-19-76-198.sandbox.testwaikato.kiwi:2380
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi etcd2[1265]: couldn't find local name "ip-172-19-76-198.sandbox.testwaikato.kiwi" in the initial cluster configuration
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi systemd[1]: etcd2.service: Main process exited, code=exited, status=1/FAILURE
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi systemd[1]: Failed to start etcd2.
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi systemd[1]: etcd2.service: Unit entered failed state.
Nov 15 20:10:54 ip-172-19-76-198.sandbox.testwaikato.kiwi systemd[1]: etcd2.service: Failed with result 'exit-code'.

Running hostname on each etcd server returns the non-AWS hostname, e.g.
ip-172-19-76-198.sandbox.testwaikato.kiwi

If I then go onto my etcd2 servers and change the /etc/systemd/system/etcd2.service.d/20-etcd2-aws-cluster.conf file, replacing the %H references with the AWS DNS values for the host,

e.g

Environment=ETCD_LISTEN_CLIENT_URLS=https://ip-172-19-76-198.ap-southeast-2.compute.internal:2379
Environment=ETCD_ADVERTISE_CLIENT_URLS=https://ip-172-19-76-198.ap-southeast-2.compute.internal:2379
Environment=ETCD_LISTEN_PEER_URLS=https://ip-172-19-76-198.ap-southeast-2.compute.internal:2380
Environment=ETCD_INITIAL_ADVERTISE_PEER_URLS=https://ip-172-19-76-198.ap-southeast-2.compute.internal:2380

and then start the service, it works.
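For reference, the manual substitution described above could be scripted as a small sketch. This is only an illustration, not kube-aws's own tooling: the metadata URL is the standard EC2 endpoint for the instance's AWS-provided private DNS name, and the drop-in path is the one mentioned in this thread.

```shell
# Replace every systemd %H specifier in the text on stdin with the
# hostname given as the first argument.
substitute_host() {
  sed "s/%H/$1/g"
}

# On a real etcd node (assuming the standard EC2 metadata endpoint and the
# drop-in path from this thread) the workaround would look something like:
#   AWS_HOST=$(curl -s http://169.254.169.254/latest/meta-data/local-hostname)
#   DROPIN=/etc/systemd/system/etcd2.service.d/20-etcd2-aws-cluster.conf
#   substitute_host "$AWS_HOST" < "$DROPIN" > "$DROPIN.tmp" && mv "$DROPIN.tmp" "$DROPIN"
#   systemctl daemon-reload && systemctl restart etcd2
```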

It looks to me like the line

Environment=ETCD_INITIAL_CLUSTER=ip-172-19-76-198.ap-southeast-2.compute.internal=https://ip-172-19-76-198.ap-southeast-2.compute.internal:2380,ip-172-19-77-197.ap-southeast-2.compute.internal=https://ip-172-19-77-197.ap-southeast-2.compute.internal:2380,ip-172-19-76-199.ap-southeast-2.compute.internal=https://ip-172-19-76-199.ap-southeast-2.compute.internal:2380

is using the AWS DNS entries, but because the user data uses %H, I get ip-172-19-76-199.sandbox.testwaikato.kiwi in my config; even though both names resolve, etcd won't start because of the mismatch.

So is this a bug, or is there some way to set the config to use either my local DNS names or the AWS hostnames in userdata/cloud-config-etcd?

@mumoshu mumoshu added the triage/support Indicates an issue that is a support question. label Nov 16, 2016
@mumoshu
Contributor

mumoshu commented Nov 16, 2016

@baldeynz Hi, thanks for trying kube-aws!

Unfortunately, kube-aws doesn't support hostnames coming from a Route53 private hosted zone (as opposed to the private DNS provided by default in AWS) for etcd nodes.

To provide some context, would you mind looking at https://github.com/coreos/kube-aws/blob/master/config/config.go#L254-L321, especially https://github.com/coreos/kube-aws/blob/master/config/config.go#L298-L319, to see how we compute the hostnames provided to {{.EtcdInitialCluster}} and hence ETCD_INITIAL_CLUSTER?

Anyway, I believe this is something we'd like to support in the future.

@baldeynz
Author

Thanks for the quick response :)

@cmcconnell1
Contributor

I was directed here via feedback from another issue: coreos/coreos-kubernetes#675 (comment). Quoting the feedback I received in that issue:

As you can see in that issue, you can workaround it by modifying cloud-config-etcd to use hostnames from the AWS default private DNS.

Disclaimer: I'm using the latest kube-aws version, v0.9.1-rc.3, and the above solution was for the previous RC2 version...

I tried to follow the steps noted above by @baldeynz.
First I manually edited the /etc/systemd/system/etcd2.service.d/20-etcd2-aws-cluster.conf file on my existing etcd node (I currently only deploy/have one). Then, after seeing the errors about the empty PEM files, I destroyed my stack with kube-aws destroy, modified the cloud-config-etcd file with the same changes using the resolvable AWS DNS name, and redeployed the stack. Unfortunately, this did not work for me; below is what I see after completing the steps and trying to restart etcd2.

The journalctl -xe output shows what appears to be a cert error now preventing the service from running. The critical error here seems to be tls: failed to find any PEM data in certificate input; as the listing further down shows, the .pem files are zero bytes on my etcd node.

Full output included below:
Nov 18 00:08:36 ip-10-1-10-5.terradatum.com etcd2[4138]: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=https://ip-10-1-10-5.us-west-1.compute.internal:2379
Nov 18 00:08:36 ip-10-1-10-5.terradatum.com etcd2[4138]: recognized and used environment variable ETCD_CERT_FILE=/etc/etcd2/ssl/etcd.pem
Nov 18 00:08:36 ip-10-1-10-5.terradatum.com etcd2[4138]: recognized and used environment variable ETCD_CLIENT_CERT_AUTH=true
Nov 18 00:08:36 ip-10-1-10-5.terradatum.com etcd2[4138]: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd2
Nov 18 00:08:36 ip-10-1-10-5.terradatum.com etcd2[4138]: recognized and used environment variable ETCD_ELECTION_TIMEOUT=1200
Nov 18 00:08:36 ip-10-1-10-5.terradatum.com etcd2[4138]: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=https://ip-10-1-10-5.us-west-1.compute.internal:2380
Nov 18 00:08:36 ip-10-1-10-5.terradatum.com etcd2[4138]: recognized and used environment variable ETCD_INITIAL_CLUSTER=ip-10-1-10-5.us-west-1.compute.internal=https://ip-10-1-10-5.us-west-1.compute.internal:2380
Nov 18 00:08:36 ip-10-1-10-5.terradatum.com etcd2[4138]: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=new
Nov 18 00:08:36 ip-10-1-10-5.terradatum.com etcd2[4138]: recognized and used environment variable ETCD_KEY_FILE=/etc/etcd2/ssl/etcd-key.pem
Nov 18 00:08:36 ip-10-1-10-5.terradatum.com etcd2[4138]: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=https://ip-10-1-10-5.us-west-1.compute.internal:2379
Nov 18 00:08:36 ip-10-1-10-5.terradatum.com etcd2[4138]: recognized and used environment variable ETCD_LISTEN_PEER_URLS=https://ip-10-1-10-5.us-west-1.compute.internal:2380
Nov 18 00:08:36 ip-10-1-10-5.terradatum.com etcd2[4138]: recognized and used environment variable ETCD_NAME=ip-10-1-10-5.us-west-1.compute.internal
Nov 18 00:08:36 ip-10-1-10-5.terradatum.com etcd2[4138]: recognized and used environment variable ETCD_PEER_CERT_FILE=/etc/etcd2/ssl/etcd.pem
Nov 18 00:08:36 ip-10-1-10-5.terradatum.com etcd2[4138]: recognized and used environment variable ETCD_PEER_KEY_FILE=/etc/etcd2/ssl/etcd-key.pem
Nov 18 00:08:36 ip-10-1-10-5.terradatum.com etcd2[4138]: recognized and used environment variable ETCD_PEER_TRUSTED_CA_FILE=/etc/etcd2/ssl/ca.pem
Nov 18 00:08:36 ip-10-1-10-5.terradatum.com etcd2[4138]: recognized and used environment variable ETCD_TRUSTED_CA_FILE=/etc/etcd2/ssl/ca.pem
Nov 18 00:08:36 ip-10-1-10-5.terradatum.com etcd2[4138]: etcd Version: 2.3.7
Nov 18 00:08:36 ip-10-1-10-5.terradatum.com etcd2[4138]: Git SHA: fd17c91
Nov 18 00:08:36 ip-10-1-10-5.terradatum.com etcd2[4138]: Go Version: go1.7.1
Nov 18 00:08:36 ip-10-1-10-5.terradatum.com etcd2[4138]: Go OS/Arch: linux/amd64
Nov 18 00:08:36 ip-10-1-10-5.terradatum.com etcd2[4138]: setting maximum number of CPUs to 1, total number of available CPUs is 1
Nov 18 00:08:36 ip-10-1-10-5.terradatum.com etcd2[4138]: found invalid file/dir lost+found under data dir /var/lib/etcd2 (Ignore this if you are upgrading etcd)
Nov 18 00:08:36 ip-10-1-10-5.terradatum.com etcd2[4138]: peerTLS: cert = /etc/etcd2/ssl/etcd.pem, key = /etc/etcd2/ssl/etcd-key.pem, ca = , trusted-ca = /etc/etcd2/ssl/ca.pem, client-cert-auth = false
Nov 18 00:08:36 ip-10-1-10-5.terradatum.com etcd2[4138]: tls: failed to find any PEM data in certificate input
Nov 18 00:11:51 ip-10-1-10-5.terradatum.com systemd[1]: etcd2.service: Main process exited, code=exited, status=1/FAILURE
Nov 18 00:11:51 ip-10-1-10-5.terradatum.com systemd[1]: Failed to start etcd2.
-- Subject: Unit etcd2.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit etcd2.service has failed.
--
-- The result is failed.
Nov 18 00:11:51 ip-10-1-10-5.terradatum.com systemd[1]: etcd2.service: Unit entered failed state.
Nov 18 00:11:51 ip-10-1-10-5.terradatum.com systemd[1]: etcd2.service: Failed with result 'exit-code'.

This appears to be the problem the etcd2 service is complaining about: the .pem files are 0 bytes/empty:

ls -lat /etc/etcd2/ssl/
total 76
drwxr-xr-x. 2 root root 4096 Nov 18 00:40 .
-rw-r--r--. 1 root root    0 Nov 18 00:40 etcd.pem
-rw-r--r--. 1 root root    0 Nov 18 00:39 etcd-key.pem
-rw-r--r--. 1 root root    0 Nov 18 00:38 etcd-client.pem
-rw-r--r--. 1 root root    0 Nov 18 00:37 etcd-client-key.pem
-rw-r--r--. 1 root root    0 Nov 18 00:36 ca.pem
drwxr-xr-x. 3 root root 4096 Nov 18 00:36 ..
-rw-r--r--. 1 root root 1215 Nov 18 00:36 ca.pem.enc
-rw-r--r--. 1 root root 1820 Nov 18 00:36 etcd-client-key.pem.enc
-rw-r--r--. 1 root root 1239 Nov 18 00:36 etcd-client.pem.enc
-rw-r--r--. 1 root root 1824 Nov 18 00:36 etcd-key.pem.enc
-rw-r--r--. 1 root root 1309 Nov 18 00:36 etcd.pem.enc
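As a quick way to spot this symptom (encrypted *.pem.enc files present but the decrypted *.pem files empty), a check along these lines could be used; the directory path is the one from the listing above.

```shell
# Print any zero-byte .pem files in the given directory. On a healthy node
# this prints nothing; on a node with the symptom above it prints each
# empty cert/key file.
empty_pems() {
  find "$1" -maxdepth 1 -name '*.pem' -size 0
}

# Usage on an etcd node:
#   empty_pems /etc/etcd2/ssl
```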

Unfortunately this process did not work for me with the latest RC3 version; perhaps I should try reverting to the previous RC2 version used above and see if we have better luck with it?
If this manual workaround ends up working for us, it could let us move forward on a temporary basis, provided we always know and can specify/configure the IP addresses (for the etcd node, etc.) in each cluster's cluster.yaml (separate git repos). However, this seems problematic for running multiple clusters/stacks (automated and deployed via kube-aws) in our shared VPC, which is the ideal scenario for us. I suppose we could write scripts to increment the ports and CIDR ranges for new kube clusters, similar to what we used to do for many Tomcat instances on the same host, if needed though.

Thanks @baldeynz and @mumoshu for your information and assistance.

@mumoshu
Contributor

mumoshu commented Nov 18, 2016

Hi @cmcconnell1, thanks as always!

First, there's no significant change between rc.2 and rc.3 in how etcd nodes are provisioned, so the issue might have no relation to the difference in versions.

Two things:

  • Would you mind looking into the credentials directory by e.g. running ls -lah credentials/?
    • Are the original credentials/etcd*pem and ca.pem files in your local disk with non-zero sizes?

    • What it looks like in my env is:

      $ ls -lh e2e/assets/kubeawstest1/credentials/
      total 96
      -rw-------  1 cw_kuoka  staff   1.6K 11 18 10:13 admin-key.pem
      -rw-------  1 cw_kuoka  staff   1.1K 11 18 10:13 admin.pem
      -rw-------  1 cw_kuoka  staff   1.6K 11 18 10:13 apiserver-key.pem
      -rw-------  1 cw_kuoka  staff   1.3K 11 18 10:13 apiserver.pem
      -rw-------  1 cw_kuoka  staff   1.6K 11 18 10:13 ca-key.pem
      -rw-------  1 cw_kuoka  staff   1.0K 11 18 10:13 ca.pem
      -rw-------  1 cw_kuoka  staff   1.6K 11 18 10:13 etcd-client-key.pem
      -rw-------  1 cw_kuoka  staff   1.1K 11 18 10:13 etcd-client.pem
      -rw-------  1 cw_kuoka  staff   1.6K 11 18 10:13 etcd-key.pem
      -rw-------  1 cw_kuoka  staff   1.1K 11 18 10:13 etcd.pem
      -rw-------  1 cw_kuoka  staff   1.6K 11 18 10:13 worker-key.pem
      -rw-------  1 cw_kuoka  staff   1.1K 11 18 10:13 worker.pem
      
  • Would you mind regenerating all the assets, including cluster.yaml, user-data/*, and credentials/*, by first removing them and then running kube-aws init and kube-aws render?

@mumoshu
Contributor

mumoshu commented Nov 18, 2016

@cmcconnell1 Also, could you check that decrypt-tls-assets.service is working? I suggest running systemctl status decrypt-tls-assets.service and journalctl -u decrypt-tls-assets.service. That's the service responsible for decrypting those TLS assets (*.pem files).
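In outline, a decrypt step like this turns each *.pem.enc into the matching *.pem. Below is a minimal sketch, not the actual kube-aws unit: the real service runs the AWS CLI (via a container) to do the KMS decryption, so here the decrypt command is left as a parameter and the `decrypt_all` helper name is hypothetical.

```shell
# For every *.pem.enc file in a directory, run the given decrypt command
# on it (reading stdin, writing stdout) and store the result under the
# same name with the .enc suffix stripped.
decrypt_all() {
  dir=$1; shift
  for enc in "$dir"/*.pem.enc; do
    [ -e "$enc" ] || continue
    "$@" < "$enc" > "${enc%.enc}"
  done
}

# With real KMS-encrypted blobs the decrypt command would be something
# along the lines of:
#   aws kms decrypt --ciphertext-blob "fileb://$enc" \
#     --output text --query Plaintext | base64 -d
```

If that step fails (for example because the node cannot reach the network to run the decryption), the *.pem files end up missing or empty while the *.pem.enc files remain, which matches the listings above.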

Putting full log files on gist.github.com would help, as we're still unsure which part of the whole system is failing (i.e. what's closest to the root cause).

@cmcconnell1
Contributor

Hello @mumoshu
For the initial creds/certs, below is the source credentials dir and contents (post kube-aws init/render) on my EC2 workstation using the latest RC4 revision:

ls -lh ./credentials/
total 48K
-rw-------. 1 cmcconnell cmcconnell 1.7K Nov 18 00:25 admin-key.pem
-rw-------. 1 cmcconnell cmcconnell 1.1K Nov 18 00:25 admin.pem
-rw-------. 1 cmcconnell cmcconnell 1.7K Nov 18 00:25 apiserver-key.pem
-rw-------. 1 cmcconnell cmcconnell 1.3K Nov 18 00:25 apiserver.pem
-rw-------. 1 cmcconnell cmcconnell 1.7K Nov 17 18:15 ca-key.pem
-rw-------. 1 cmcconnell cmcconnell 1.1K Nov 18 00:25 ca.pem
-rw-------. 1 cmcconnell cmcconnell 1.7K Nov 18 00:25 etcd-client-key.pem
-rw-------. 1 cmcconnell cmcconnell 1.1K Nov 18 00:25 etcd-client.pem
-rw-------. 1 cmcconnell cmcconnell 1.7K Nov 18 00:25 etcd-key.pem
-rw-------. 1 cmcconnell cmcconnell 1.2K Nov 18 00:25 etcd.pem
-rw-------. 1 cmcconnell cmcconnell 1.7K Nov 18 00:25 worker-key.pem
-rw-------. 1 cmcconnell cmcconnell 1.2K Nov 18 00:25 worker.pem

I also repeated the specified manual hack to modify the cloud-config-etcd file, hardcoding the expected etcd node's AWS DNS name, which resolves on the etcd node:

host ip-10-1-10-5.us-west-1.compute.internal
ip-10-1-10-5.us-west-1.compute.internal has address 10.1.10.5

Full output from the requested systemctl and journalctl commands is in the gists below. As for the "full log files", can you specify the locations and names of all the logs you would like to see?

systemctl status decrypt-tls-assets.service
https://gist.github.com/cmcconnell1/db5e23b9a619641a848dcb1f0fe7fd41

journalctl -u decrypt-tls-assets.service
https://gist.github.com/cmcconnell1/9cf03cfb91dd7511f47f15eb9590e6e5

journalctl -xe
https://gist.github.com/cmcconnell1/62d0f20456531e8f18cf4f7588cb1b92

on the etcd node
ip-10-1-10-5 ~ # date
Fri Nov 18 20:33:17 UTC 2016

ls -alt  /etc/etcd2/ssl/
total 76
drwxr-xr-x. 2 root root 4096 Nov 18 19:14 .
-rw-r--r--. 1 root root    0 Nov 18 19:13 etcd.pem
-rw-r--r--. 1 root root    0 Nov 18 19:12 etcd-key.pem
-rw-r--r--. 1 root root    0 Nov 18 19:12 etcd-client.pem
-rw-r--r--. 1 root root    0 Nov 18 19:11 etcd-client-key.pem
-rw-r--r--. 1 root root 1215 Nov 18 19:10 ca.pem.enc
-rw-r--r--. 1 root root 1824 Nov 18 19:10 etcd-client-key.pem.enc
-rw-r--r--. 1 root root 1239 Nov 18 19:10 etcd-client.pem.enc
-rw-r--r--. 1 root root 1820 Nov 18 19:10 etcd-key.pem.enc
-rw-r--r--. 1 root root 1309 Nov 18 19:10 etcd.pem.enc
-rw-r--r--. 1 root root    0 Nov 18 19:10 ca.pem
drwxr-xr-x. 3 root root 4096 Nov 18 18:30 ..

Thanks again

@pieterlange
Contributor

It seems your nodes are not able to connect to the internet in order to download the awscli image.

Nov 18 19:13:37 ip-10-1-10-5.terradatum.com decrypt-tls-assets[681]: docker: Error response from daemon: Get https://quay.io/v1/_ping: dial tcp 54.243.113.169:443: i/o timeout.

@mumoshu
Contributor

mumoshu commented Nov 22, 2016

Hi @cmcconnell1, thanks as always!

I agree with @pieterlange about a possible source of your issue.

More concretely, I suspect that:

  • your network ACL is forbidding use of the whole or part of the ephemeral port range (from 32768 to 61000) for outbound connections (like 443=https in your case),
  • or your security group is forbidding use of ephemeral ports in its outbound rules?

Excuse me if I'm repeating what you already know, but AFAIK ephemeral ports are used to "receive" packets from the other end of a TCP session. Forbidding their use in an ACL/SG outbound rule would end up dropping all that data, such as HTTP responses, hence the i/o timeout you've seen in the log.
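The range cited above matches the Linux default in /proc/sys/net/ipv4/ip_local_port_range (32768-61000). A tiny sketch, just to make the range concrete when auditing ACL rules:

```shell
# Succeeds when the given port falls inside the default Linux ephemeral
# port range (32768-61000). A stateless network ACL must allow this whole
# range for the return leg of outbound TCP connections.
is_ephemeral() {
  [ "$1" -ge 32768 ] && [ "$1" -le 61000 ]
}
```

For example, a node connecting out to quay.io:443 receives the HTTP response on an ephemeral source port, so blocking that range produces exactly the i/o timeout shown in the log above.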

I also have an experience trouble-shooting a slightly similar issue caused by blocking ephemeral ports in coreos/coreos-kubernetes#744 (comment).

tyrannasaurusbanks pushed a commit to tyrannasaurusbanks/kube-aws that referenced this issue Sep 14, 2018
…e-proxy-race-condition to hcom-flavour

* commit '1d3373d1c2d7a6db17df8dfcbc14606b8fa3c9ad':
  Fix for issue: kubernetes-retired#1424
  Fix for issue: kubernetes-retired#1424