Bug: ASO + CAPZ integration broken #4274

Closed
rbjorklin opened this issue Sep 16, 2024 · 11 comments
Labels
bug 🪲 Something isn't working

Comments

@rbjorklin

rbjorklin commented Sep 16, 2024

Version of Azure Service Operator: 2.8.0 (originally reported as 2.9.0), as bundled with version 1.16.1 of cluster-api-provider-azure.

Describe the bug
The capz-controller-manager pod is throwing errors when attempting to connect to azureserviceoperator-webhook-service.

To Reproduce
Steps to reproduce the behavior:
Deploy an AKS-ASO cluster with cluster-api.

Expected behavior
No TLS errors are thrown.

Screenshots

Additional context

E0916 22:22:15.403491       1 controller.go:329] "Reconciler error" err="failed to reconcile resources: failed to apply resource: conversion webhook for containerservice.azure.com/v1api20231001storage, Kind=ManagedCluster failed: Post \"https://azureserviceoperator-webhook-service.management-aks-aso.svc:443/convert?timeout=30s\": tls: failed to verify certificate: x509: certificate is valid for capi-kubeadm-control-plane-webhook-service.management-aks-aso.svc, capi-kubeadm-control-plane-webhook-service.management-aks-aso.svc.cluster.local, not azureserviceoperator-webhook-service.management-aks-aso.svc" controller="azureasomanagedcontrolplane" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureASOManagedControlPlane" AzureASOManagedControlPlane="management-aks-aso/management-aks-aso" namespace="management-aks-aso" name="management-aks-aso" reconcileID="3cff6256-7a50-4f65-8d6c-eb55d942c048"

Suggested fix:
Editing the azureserviceoperator-webhook-service Service in place and changing its selector to app.kubernetes.io/name: azure-service-operator fixes the problem.
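
As a rough sketch of that edit, assuming the change was made to spec.selector of the webhook Service (the original post links to the line rather than naming it), the Service would end up selecting only pods that carry the ASO name label. The names, namespace, and ports are taken from the manifests shared later in this thread; everything else is trimmed for brevity.

```yaml
# Sketch only: narrowed selector for the ASO webhook Service.
apiVersion: v1
kind: Service
metadata:
  name: azureserviceoperator-webhook-service
  namespace: management-aks-aso
spec:
  ports:
  - port: 443
    protocol: TCP
    targetPort: 9443
  selector:
    # Assumption: this replaces the generic "control-plane: controller-manager"
    # selector, so only the ASO pod (which carries this label) backs the Service.
    app.kubernetes.io/name: azure-service-operator
```

With the generic selector, any pod in the namespace labeled control-plane: controller-manager can back this Service, which would be consistent with the certificate for capi-kubeadm-control-plane-webhook-service showing up in the error above.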

@rbjorklin rbjorklin added the bug 🪲 Something isn't working label Sep 16, 2024
@matthchr matthchr changed the title from "Bug: <Brief description of bug>" to "Bug: ASO + CAPZ integration broken" Sep 17, 2024
@matthchr
Member

cc @nojnhuh, are you aware of this?

@matthchr
Member

matthchr commented Sep 17, 2024

Can you share the full ASO deployment and webhook service YAML as rendered in CAPZ?

@nojnhuh
Member

nojnhuh commented Sep 17, 2024

I haven't seen this before locally or in our tests.

@rbjorklin How are you installing CAPZ and ASO?

> Version of Azure Service Operator: 2.9.0 that comes with version 1.16.1 of cluster-api-provider-azure.

We are still bundling ASO v2.8.0 with CAPZ: https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/v1.16.1/config/aso/kustomization.yaml#L5. Where are you seeing evidence that it's actually ASO v2.9?

@matthchr
Member

In terms of changes in this space, I added two new labels in 2.9.0 (#3184), but all tests pass and the existing controller-manager labels weren't changed in any way.

@rbjorklin
Author

You're absolutely right and I can't read. It is indeed version 2.8.0. I'm following the Cluster API quick start guide using kind as a bootstrap cluster.

See the manifests below:

service.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: azure-service-operator
    app.kubernetes.io/version: v2.8.0
    cluster.x-k8s.io/provider: infrastructure-azure
    clusterctl.cluster.x-k8s.io: ""
  name: azureserviceoperator-webhook-service
  namespace: management-aks-aso
spec:
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - port: 443
    protocol: TCP
    targetPort: 9443
  selector:
    control-plane: controller-manager
  sessionAffinity: None
  type: ClusterIP

deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  labels:
    app.kubernetes.io/name: azure-service-operator
    app.kubernetes.io/version: v2.8.0
    cluster.x-k8s.io/provider: infrastructure-azure
    clusterctl.cluster.x-k8s.io: ""
    control-plane: controller-manager
  name: azureserviceoperator-controller-manager
  namespace: management-aks-aso
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      control-plane: controller-manager
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/default-container: manager
      creationTimestamp: null
      labels:
        aadpodidbinding: aso-manager-binding
        app.kubernetes.io/name: azure-service-operator
        app.kubernetes.io/version: v2.8.0
        control-plane: controller-manager
    spec:
      containers:
      - args:
        - --metrics-addr=0.0.0.0:8443
        - --secure-metrics=true
        - --profiling-metrics=false
        - --health-addr=:8081
        - --enable-leader-election
        - --v=2
        - --crd-pattern=
        - --webhook-port=9443
        - --webhook-cert-dir=/tmp/k8s-webhook-server/serving-certs
        env:
        - name: AZURE_CLIENT_ID
          valueFrom:
            secretKeyRef:
              key: AZURE_CLIENT_ID
              name: aso-controller-settings
        - name: AZURE_CLIENT_SECRET
          valueFrom:
            secretKeyRef:
              key: AZURE_CLIENT_SECRET
              name: aso-controller-settings
              optional: true
        - name: AZURE_TENANT_ID
          valueFrom:
            secretKeyRef:
              key: AZURE_TENANT_ID
              name: aso-controller-settings
        - name: AZURE_SUBSCRIPTION_ID
          valueFrom:
            secretKeyRef:
              key: AZURE_SUBSCRIPTION_ID
              name: aso-controller-settings
        - name: AZURE_CLIENT_CERTIFICATE
          valueFrom:
            secretKeyRef:
              key: AZURE_CLIENT_CERTIFICATE
              name: aso-controller-settings
              optional: true
        - name: AZURE_CLIENT_CERTIFICATE_PASSWORD
          valueFrom:
            secretKeyRef:
              key: AZURE_CLIENT_CERTIFICATE_PASSWORD
              name: aso-controller-settings
              optional: true
        - name: AZURE_AUTHORITY_HOST
          valueFrom:
            secretKeyRef:
              key: AZURE_AUTHORITY_HOST
              name: aso-controller-settings
              optional: true
        - name: AZURE_RESOURCE_MANAGER_ENDPOINT
          valueFrom:
            secretKeyRef:
              key: AZURE_RESOURCE_MANAGER_ENDPOINT
              name: aso-controller-settings
              optional: true
        - name: AZURE_RESOURCE_MANAGER_AUDIENCE
          valueFrom:
            secretKeyRef:
              key: AZURE_RESOURCE_MANAGER_AUDIENCE
              name: aso-controller-settings
              optional: true
        - name: AZURE_TARGET_NAMESPACES
          valueFrom:
            secretKeyRef:
              key: AZURE_TARGET_NAMESPACES
              name: aso-controller-settings
              optional: true
        - name: AZURE_OPERATOR_MODE
          valueFrom:
            secretKeyRef:
              key: AZURE_OPERATOR_MODE
              name: aso-controller-settings
              optional: true
        - name: AZURE_SYNC_PERIOD
          valueFrom:
            secretKeyRef:
              key: AZURE_SYNC_PERIOD
              name: aso-controller-settings
              optional: true
        - name: USE_WORKLOAD_IDENTITY_AUTH
          valueFrom:
            secretKeyRef:
              key: USE_WORKLOAD_IDENTITY_AUTH
              name: aso-controller-settings
              optional: true
        - name: AZURE_USER_AGENT_SUFFIX
          valueFrom:
            secretKeyRef:
              key: AZURE_USER_AGENT_SUFFIX
              name: aso-controller-settings
              optional: true
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        image: mcr.microsoft.com/k8s/azureserviceoperator:v2.8.0
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 8081
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: manager
        ports:
        - containerPort: 9443
          name: webhook-server
          protocol: TCP
        - containerPort: 8081
          name: health-port
          protocol: TCP
        - containerPort: 8443
          name: metrics-port
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /readyz
            port: 8081
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 500m
            memory: 512Mi
          requests:
            cpu: 200m
            memory: 256Mi
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/run/secrets/tokens
          name: azure-identity
          readOnly: true
        - mountPath: /tmp/k8s-webhook-server/serving-certs
          name: cert
          readOnly: true
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/os: linux
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: azureserviceoperator-default
      serviceAccountName: azureserviceoperator-default
      terminationGracePeriodSeconds: 10
      volumes:
      - name: cert
        secret:
          defaultMode: 420
          secretName: webhook-server-cert
      - name: azure-identity
        projected:
          defaultMode: 420
          sources:
          - serviceAccountToken:
              audience: api://AzureADTokenExchange
              expirationSeconds: 3600
              path: azure-identity

@nojnhuh
Member

nojnhuh commented Sep 17, 2024

CAPZ generally assumes it and ASO are being installed in the capz-system namespace. How are you installing it in the management-aks-aso namespace? Is the KubeadmControlPlane controller also running in that namespace?

@rbjorklin
Author

rbjorklin commented Sep 17, 2024

This is what I do:

kind create cluster \
        --name management-cluster \
        --kubeconfig ~/.kube/configs/kind-management-cluster.yaml

clusterctl init \
        --kubeconfig ~/.kube/configs/kind-management-cluster.yaml \
        --core cluster-api:v1.7.5 \
        --bootstrap kubeadm:v1.7.5 \
        --control-plane kubeadm:v1.7.5 \
        --infrastructure azure:v1.16.1 \
        --wait-providers \
        --target-namespace management-aks-aso

I just tried removing --target-namespace and I do indeed see a capz-system namespace now. So for my understanding: am I only supposed to have one capi + capz deployment per management cluster that then manages all workload clusters?

@rbjorklin
Author

> Is the KubeadmControlPlane controller also running in that namespace?

Yes:

k -n management-aks-aso get pods
NAME                                                             READY   STATUS    RESTARTS   AGE
azureserviceoperator-controller-manager-db79d7d5d-46xkc          1/1     Running   0          93s
capi-controller-manager-7cdf496cd5-rmv8t                         1/1     Running   0          95s
capi-kubeadm-bootstrap-controller-manager-5b99b66467-hpsz5       1/1     Running   0          95s
capi-kubeadm-control-plane-controller-manager-544d667cc9-rnsht   1/1     Running   0          95s
capz-controller-manager-7b77f56677-j66km                         1/1     Running   0          93s

@nojnhuh
Member

nojnhuh commented Sep 17, 2024

> So for my understanding: am I only supposed to have one capi + capz deployment per management cluster that then manages all workload clusters?

That's what I would recommend since that's what we do in all of our tests. It is theoretically possible to have multiple CAPI setups in the same management cluster, but I'm not sure that use case is supported by CAPI.

@rbjorklin
Author

Thanks for your time and assistance! Turns out this is not a bug then and I just need to RTFM:

To enable single controller multi-tenancy

@matthchr
Member

I've filed #4277 to make the ASO label selector more robust anyway, as I've never liked that we have such a generic one.

That doesn't mean CAPZ will work how you tried to install it, but it should mean ASO is a bit more resilient to these sorts of topologies if they do come up.
