
[Help wanted] Ingress controller exiting/shutting down unexpectedly #612

Closed
transhapHigsn opened this issue Jul 1, 2020 · 12 comments

@transhapHigsn

I am using the HAProxy ingress controller with k3s, but I am seeing unexpected shutdowns/exits of the ingress controller. I have increased the verbosity level to see if there is any resource-level error, but I didn't find anything.

K3S version: v1.18.2+k3s1 (without traefik)
Ingress controller manifest:

apiVersion: v1
kind: Namespace
metadata:
  name: ingress-controller
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ingress-controller
  namespace: ingress-controller
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: ingress-controller
rules:
  - apiGroups:
      - ""
    resources:
      - configmaps
      - endpoints
      - nodes
      - pods
      - secrets
    verbs:
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - nodes
    verbs:
      - get
  - apiGroups:
      - ""
    resources:
      - services
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - "extensions"
    resources:
      - ingresses
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - events
    verbs:
      - create
      - patch
  - apiGroups:
      - "extensions"
    resources:
      - ingresses/status
    verbs:
      - update
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: Role
metadata:
  name: ingress-controller
  namespace: ingress-controller
rules:
  - apiGroups:
      - ""
    resources:
      - configmaps
      - pods
      - secrets
      - namespaces
    verbs:
      - get
  - apiGroups:
      - ""
    resources:
      - configmaps
    verbs:
      - get
      - update
  - apiGroups:
      - ""
    resources:
      - configmaps
    verbs:
      - create
  - apiGroups:
      - ""
    resources:
      - endpoints
    verbs:
      - get
      - create
      - update
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: ingress-controller
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: ingress-controller
subjects:
  - kind: ServiceAccount
    name: ingress-controller
    namespace: ingress-controller
  - apiGroup: rbac.authorization.k8s.io
    kind: User
    name: ingress-controller
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: RoleBinding
metadata:
  name: ingress-controller
  namespace: ingress-controller
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: ingress-controller
subjects:
  - kind: ServiceAccount
    name: ingress-controller
    namespace: ingress-controller
  - apiGroup: rbac.authorization.k8s.io
    kind: User
    name: ingress-controller
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: haproxy-ingress
  namespace: ingress-controller
data:

---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    run: haproxy-ingress
  name: haproxy-ingress
  namespace: ingress-controller
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      run: haproxy-ingress
  template:
    metadata:
      labels:
        run: haproxy-ingress
    spec:
      hostNetwork: true
      nodeSelector:
        groupRole: master
      serviceAccountName: ingress-controller
      containers:
      - name: haproxy-ingress
        image: quay.io/jcmoraisjr/haproxy-ingress
        args:
        - --configmap=ingress-controller/haproxy-ingress
        - --v=10
        ports:
        - name: http
          containerPort: 80
        - name: https
          containerPort: 443
        - name: stat
          containerPort: 1936
        - name: ingress-stats
          containerPort: 10254
        livenessProbe:
          httpGet:
            path: /healthz
            port: 10253
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
      tolerations:
      - operator: Exists

Exiting logs:

I0701 11:30:17.345102       7 queue.go:111] syncing haproxy/haproxy-ingress
I0701 11:30:17.345116       7 controller.go:335] Starting HAProxy update id=2
I0701 11:30:17.348214       7 instance.go:276] old and new configurations match
I0701 11:30:17.348279       7 controller.go:386] Finish HAProxy update id=2: ingress=0.756847ms total=0.756847ms
I0701 11:31:17.324449       7 queue.go:70] queuing item sync status
I0701 11:31:17.324535       7 queue.go:111] syncing sync status
I0701 11:31:17.333147       7 status.go:354] skipping update of Ingress haproxy/haproxy-ingress (no change)
I0701 11:32:17.324738       7 queue.go:70] queuing item sync status
I0701 11:32:17.324795       7 queue.go:111] syncing sync status
I0701 11:32:17.332959       7 status.go:354] skipping update of Ingress haproxy/haproxy-ingress (no change)
I0701 11:33:17.324998       7 queue.go:70] queuing item sync status
I0701 11:33:17.325050       7 queue.go:111] syncing sync status
I0701 11:33:17.331958       7 status.go:354] skipping update of Ingress haproxy/haproxy-ingress (no change)
I0701 11:33:52.427353       7 main.go:45] Shutting down with signal terminated
I0701 11:33:52.427398       7 controller.go:1566] shutting down controller queues
I0701 11:33:52.427445       7 status.go:124] updating status of Ingress rules (remove)
I0701 11:33:52.441101       7 status.go:143] removing address from ingress status ([15.206.58.70])
I0701 11:33:52.444739       7 status.go:365] updating Ingress haproxy/haproxy-ingress status to []
I0701 11:33:52.453166       7 main.go:38] Exiting (0)

@jcmoraisjr Can you help me out on this? I am not sure how to debug this either.

@jcmoraisjr
Owner

Hi, the liveness probe will send a SIGTERM to the controller if haproxy fails to answer the health check:

        livenessProbe:
          httpGet:
            path: /healthz
            port: 10253

Try removing the liveness probe, and also check whether haproxy is properly configured.
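
For illustration only, instead of removing the probe it can also be relaxed; the fields below are generic Kubernetes probe settings and the values are placeholders, not timings recommended in this thread:

        livenessProbe:
          httpGet:
            path: /healthz
            port: 10253
          initialDelaySeconds: 10
          timeoutSeconds: 5
          periodSeconds: 10
          failureThreshold: 3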

@transhapHigsn
Author

check also if haproxy is properly configured.

@jcmoraisjr I am using an empty config map for the IC. Where should I check this?

@transhapHigsn
Author

@jcmoraisjr Where should I check whether HAProxy is properly configured or not? If you can point to that in the manifest shared above, it would be a great help.

@Unichron
Contributor

Unichron commented Jul 2, 2020

You also didn't configure any resource requests/limits, so it could be that the pod is being killed because the node is out of memory or something (not necessarily because of this pod; it usually doesn't consume much of anything). This can be checked with kubectl describe daemonset/haproxy-ingress -n ingress-controller; if I remember correctly, such events are displayed there.
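
As a sketch, a resources block could be added to the haproxy-ingress container in the DaemonSet above; the numbers here are placeholders to be adjusted after observing actual usage, not measured requirements:

        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 256Mi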

@transhapHigsn
Author

Thanks @Unichron, I will fix that too. The only reason I haven't done it yet is that I'm a bit unsure about the resource usage of the HAProxy IC.

@Unichron
Contributor

Unichron commented Jul 2, 2020

@transhapHigsn Well, the controller itself doesn't consume much, except maybe if you have a huge number of kubernetes objects to track (e.g. ingresses, services, pods). The underlying haproxy is also extremely efficient, but the requirements can vary depending on your load. I would suggest looking at the relevant haproxy docs for guidance on this: http://cbonte.github.io/haproxy-dconv/2.0/management.html#6

@jcmoraisjr
Owner

I am using an empty config map for the IC. Where should I check this?

The daemonset object has a liveness probe which might be failing for any reason. A failing liveness probe will stop haproxy ingress pretty much like this. Check the events as well, either via kubectl get events in the ingress namespace or kubectl describe pod <controller-pod-name>. If the liveness probe is failing you should see an event there, and if so you can work around it by removing the liveness probe or by investigating why port 10253 cannot be reached.

@transhapHigsn
Author

Thanks @jcmoraisjr @Unichron for this. I will check out if this works for me.

@transhapHigsn
Author

transhapHigsn commented Jul 4, 2020

@jcmoraisjr @Unichron It just happened again. The events show the following error.

Liveness probe failed: Get http://10.0.101.100:10253/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

I have removed the liveness probe for now, but I have found that this usually happens when the request rate is higher than usual. Will scheduling the IC on multiple nodes help here?

Someone mentioned that the timeout issue could be due to this: k3s-io/k3s#1266. What do you think?

@jcmoraisjr
Owner

Hi, your haproxy proxies might be saturated and taking too much time to answer requests, including the health check. If you're not using v0.10 (beta, but good enough for production), give it a chance (check the changelog for any backward compatibility issues) and configure Prometheus, doc here. Scheduling a few more controllers should help. On the other hand, if your current ingress nodes have the same number of cores as, or fewer cores than, the number of threads (default of 2 since v0.8, doc here), you can upgrade your host spec. Note also that increasing the number of threads doesn't increase the maximum conn/s and req/s at the same rate; e.g. in our environment we can't see any gain with 5 threads or more, so we increase the number of controllers instead.
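
For reference, the thread count mentioned above is set through the controller's configmap; a minimal sketch assuming the nbthread configuration key from the haproxy-ingress docs (the value 4 is only an example):

apiVersion: v1
kind: ConfigMap
metadata:
  name: haproxy-ingress
  namespace: ingress-controller
data:
  nbthread: "4"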

@transhapHigsn
Author

@jcmoraisjr I will check this out, and get back.

@transhapHigsn
Author

@jcmoraisjr After making the above changes, the performance has been consistent and optimal, and I haven't seen any of the previous errors. However, today I observed that requests were not being forwarded to a newly spawned running pod in a deployment (of 2 replicas), which led to an increase in request timeouts for some time. I am not exactly sure what caused it, and I am not able to replicate it.

Thanks for all your help.
