Skip to content
This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

metrics server install-kube-system fix causing intermittent hang on new clusters #1424

Closed
davidmccormick opened this issue Jul 26, 2018 · 0 comments

Comments

@davidmccormick
Copy link
Contributor

davidmccormick commented Jul 26, 2018

It looks like there has been an issue within 'install-kube-system' when the metrics service is enabled which was documented and fixed in issue: #1039 (comment). We are now finding that when we spin up some new clusters that they get stuck within 'install-kube-system' which is not completing and allowing 'cfn-signal' to fire.

We see it get stuck in the logs...

Jul 26 09:41:37 ip-10-4-5-95.us-west-2.compute.internal retry[2152]: Sleeping 3 seconds.
Jul 26 09:41:40 ip-10-4-5-95.us-west-2.compute.internal retry[2152]: Sleeping 3 seconds.
Jul 26 09:41:43 ip-10-4-5-95.us-west-2.compute.internal retry[2152]: Sleeping 3 seconds.

Looking at 'install-kube-system' it looks to be stuck in this code: -

# See https://github.com/kubernetes-incubator/kube-aws/issues/1039#issuecomment-348978375
if ks get apiservice v1beta1.metrics.k8s.io && ! ps ax | grep '[h]yperkube proxy'; then
  echo "apiserver is up but kube-proxy isn't up. We have likely encountered #1039."
  echo "Temporary deleting the v1beta1.metrics.k8s.io apiservice as a work-around for #1039"
  ks delete apiservice v1beta1.metrics.k8s.io

  echo Waiting until controller-manager stabilizes and it creates a kube-proxy pod.
  until ps ax | grep '[h]yperkube proxy'; do
      echo Sleeping 3 seconds.
      sleep 3
  done
  echo kube-proxy stared. apiserver should be responsive again.
fi

It seems that this code executes before the code that installs the kube-proxy and so gets stuck in the inner loop waiting for the proxy to come up.

omar-nahhas pushed a commit to HotelsDotCom/kube-aws that referenced this issue Jul 26, 2018
Moving the installation of kube-proxy up, just after kukbe-system-ns
omar-nahhas added a commit to HotelsDotCom/kube-aws that referenced this issue Jul 26, 2018
The fix is to move the installation of kube-proxy up in the install-kube-system script, so it gets
a chance to start before the installation checks for it and goes into a loop.
omar-nahhas added a commit to HotelsDotCom/kube-aws that referenced this issue Jul 26, 2018
@mumoshu mumoshu closed this as completed in f40a82e Aug 3, 2018
mateiidavid pushed a commit to HotelsDotCom/kube-aws that referenced this issue Aug 14, 2018
…ubernetes-retired#1426)

The fix is to move the installation of kube-proxy up in the install-kube-system script, so it gets
a chance to start before the installation checks for it and goes into a loop.

Fixes kubernetes-retired#1424
tyrannasaurusbanks pushed a commit to tyrannasaurusbanks/kube-aws that referenced this issue Sep 14, 2018
The fix is to move the installation of kube-proxy up in the install-kube-system script, so it gets
a chance to start before the installation checks for it and goes into a loop.
tyrannasaurusbanks pushed a commit to tyrannasaurusbanks/kube-aws that referenced this issue Sep 14, 2018
tyrannasaurusbanks pushed a commit to tyrannasaurusbanks/kube-aws that referenced this issue Sep 14, 2018
…e-proxy-race-condition to hcom-flavour

* commit '1d3373d1c2d7a6db17df8dfcbc14606b8fa3c9ad':
  Fix for issue: kubernetes-retired#1424
  Fix for issue: kubernetes-retired#1424
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant