keda not creating pod after 2nd message in sqs queue #5902

Closed
manjurshaikh1988 opened this issue Jun 21, 2024 · 3 comments
Labels
bug Something isn't working stale All issues that are marked as stale due to inactivity


manjurshaikh1988 commented Jun 21, 2024

Report

I have a KEDA + SQS + EKS setup.
When the 1st message arrives in the SQS queue, KEDA creates the 1st pod.
But when the 2nd message arrives in the SQS queue, KEDA does not create a 2nd pod.
If I send a 3rd message to the SQS queue, KEDA creates a pod.

https://keda.sh/docs/2.13/concepts/scaling-jobs/

apiVersion: v1
kind: Secret
metadata:
  name: keda-sqs-auth
  namespace: backend
type: Opaque  
data:
  # awsRoleArn: "xxxxx"  # echo -n "arn:aws:iam::xxx:role/keda-uat" | base64
  AWS_ACCESS_KEY_ID: xxxxx # Required.
  AWS_SECRET_ACCESS_KEY: xxxxx # Required.
---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: keda-trigger-auth-aws-credentials
  namespace: backend
spec:
  secretTargetRef:
  - parameter: awsAccessKeyID     # Required.
    name: keda-sqs-auth            # Required.
    key: AWS_ACCESS_KEY_ID        # Required.
  - parameter: awsSecretAccessKey # Required.
    name: keda-sqs-auth           # Required.
    key: AWS_SECRET_ACCESS_KEY    # Required.
---
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: unified-sqs-queue-scaledjob
  namespace: backend
spec:
  jobTargetRef:
    #parallelism: 2  # [max number of desired pods](https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#controlling-parallelism)
    #completions: 1    # [desired number of successfully finished pods](https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#controlling-parallelism)
    #activeDeadlineSeconds: 3600 #  Specifies the duration in seconds relative to the startTime that the job may be active before the system tries to terminate it; value must be positive integer
    backoffLimit: 0  # Specifies the number of retries before marking this job failed. Defaults to 6
    activeDeadlineSeconds: 16200  #900
    template:
      metadata:
        labels:
          app: unified
        annotations:
          # Add toleration for GPU SKU, preventing scheduling on nodes with the specified GPU SKU.
          scheduler.alpha.kubernetes.io/tolerate-until-node-unschedulable: "true"
      spec:
        restartPolicy: Never # Prevent pods from restarting
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: nodegroup ##k get nodes --show-labels
                  operator: In
                  values:
                  - gpu   
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - unified
              topologyKey: kubernetes.io/hostname
        tolerations:
          # Tolerate nodes with GPU SKU.
        - key: "dedicated"
          operator: "Equal"
          value: "gpupool"  #gpupool-apppool
          effect: "NoSchedule"
        serviceAccountName: s3irsa
        terminationGracePeriodSeconds: 600 # seconds the pod is given to shut down gracefully after a termination signal before it is force-killed
        containers:
          - name: unified
            image: xxx.dkr.ecr.ap-south-1.amazonaws.com/xx-unified:keda
            imagePullPolicy: Always
            env:
              - name: ALLOW_EMPTY_PASSWORD
                value: "yes"
            volumeMounts:
              - name: aws
                mountPath: /training
            resources:
              # requests:
              #   cpu: 7000m
              #   memory: 20000Mi
              # limits:
              #   cpu: 2500m
              #   memory: 20000Mi
            ports:
              - containerPort: 5000
                protocol: TCP
                name: unified
        volumes:
        - name: aws
          persistentVolumeClaim:
            #claimName: uat-training
            #claimName: s3-uatdatabs
            claimName: uat-efs
  pollingInterval: 30 # How often KEDA will check the SQS queue
  minReplicaCount: 0 # Minimum number of jobs that KEDA can create
  #maxReplicaCount: 1 # Maximum number of jobs that KEDA can create
  successfulJobsHistoryLimit: 2 # Number of successful jobs to keep
  failedJobsHistoryLimit: 2 # Number of failed jobs to keep
  # scalingStrategy:
  #   strategy: "accurate"  #"default" # Scaling strategy (default, custom, or accurate)
  #   pendingPodConditions:               
  #   - "Pending"
  #   - "ContainerCreating"
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.ap-south-1.amazonaws.com/xxxx/xx-unifiedservice.fifo
        queueLength: "1"
        awsRegion: "ap-south-1"
        scaleOnInFlight: "false"
      authenticationRef: 
        name: keda-trigger-auth-aws-credentials # Ensure this references your actual AWS credentials stored in K8s secrets

Expected Behavior

After the second message arrives in SQS, KEDA should create the 2nd pod.

Actual Behavior

I have a KEDA + SQS + EKS setup.
When the 1st message arrives in the SQS queue, KEDA creates the 1st pod.
But when the 2nd message arrives in the SQS queue, KEDA does not create a 2nd pod.
If I send a 3rd message to the SQS queue, KEDA creates a pod.

Steps to Reproduce the Problem

  1. Send the 1st message to SQS.
  2. Check whether a pod is created.
  3. Send the 2nd message to SQS.
  4. Check whether a 2nd pod is created (it should be).

Logs from KEDA operator

manjur@MacBook-Pro keda % kubectl logs -f keda-operator-7f5d566f89-2fk22 
2024/06/21 11:58:45 maxprocs: Updating GOMAXPROCS=1: determined from CPU quota
2024-06-21T11:58:45Z	INFO	setup	Starting manager
2024-06-21T11:58:45Z	INFO	setup	KEDA Version: 2.12.1
2024-06-21T11:58:45Z	INFO	setup	Git Commit: dc76ca70f19c22e8f0c806f84d95256d771f3dc9
2024-06-21T11:58:45Z	INFO	setup	Go Version: go1.20.8
2024-06-21T11:58:45Z	INFO	setup	Go OS/Arch: linux/amd64
2024-06-21T11:58:45Z	INFO	setup	Running on Kubernetes 1.28+	{"version": "v1.28.9-eks-036c24b"}
2024-06-21T11:58:45Z	INFO	starting server	{"kind": "health probe", "addr": "[::]:8081"}
I0621 11:58:45.933781       1 leaderelection.go:250] attempting to acquire leader lease keda-uat/operator.keda.sh...
2024-06-21T11:58:45Z	INFO	controller-runtime.metrics	Starting metrics server
2024-06-21T11:58:45Z	INFO	controller-runtime.metrics	Serving metrics server	{"bindAddress": ":8080", "secure": false}
I0621 11:59:23.015266       1 leaderelection.go:260] successfully acquired lease keda-uat/operator.keda.sh
2024-06-21T11:59:23Z	INFO	Starting EventSource	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "source": "kind source: *v1alpha1.ScaledObject"}
2024-06-21T11:59:23Z	INFO	Starting EventSource	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "source": "kind source: *v2.HorizontalPodAutoscaler"}
2024-06-21T11:59:23Z	INFO	Starting Controller	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject"}
2024-06-21T11:59:23Z	INFO	Starting EventSource	{"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication", "source": "kind source: *v1alpha1.TriggerAuthentication"}
2024-06-21T11:59:23Z	INFO	Starting Controller	{"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication"}
2024-06-21T11:59:23Z	INFO	Starting EventSource	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "source": "kind source: *v1alpha1.ScaledJob"}
2024-06-21T11:59:23Z	INFO	Starting Controller	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob"}
2024-06-21T11:59:23Z	INFO	Starting EventSource	{"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication", "source": "kind source: *v1alpha1.ClusterTriggerAuthentication"}
2024-06-21T11:59:23Z	INFO	Starting Controller	{"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication"}
2024-06-21T11:59:23Z	INFO	Starting EventSource	{"controller": "cert-rotator", "source": "kind source: *v1.Secret"}
2024-06-21T11:59:23Z	INFO	Starting EventSource	{"controller": "cert-rotator", "source": "kind source: *unstructured.Unstructured"}
2024-06-21T11:59:23Z	INFO	Starting EventSource	{"controller": "cert-rotator", "source": "kind source: *unstructured.Unstructured"}
2024-06-21T11:59:23Z	INFO	Starting Controller	{"controller": "cert-rotator"}
2024-06-21T11:59:23Z	INFO	cert-rotation	starting cert rotator controller
2024-06-21T11:59:23Z	INFO	cert-rotation	no cert refresh needed
2024-06-21T11:59:23Z	INFO	cert-rotation	certs are ready in /certs
2024-06-21T11:59:23Z	INFO	Starting workers	{"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication", "worker count": 1}
2024-06-21T11:59:23Z	INFO	Starting workers	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "worker count": 5}
2024-06-21T11:59:23Z	INFO	Starting workers	{"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication", "worker count": 1}
2024-06-21T11:59:23Z	INFO	Starting workers	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "worker count": 1}
2024-06-21T11:59:23Z	INFO	Reconciling ScaledJob	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "ScaledJob": {"name":"unified-sqs-queue-scaledjob","namespace":"backend"}, "namespace": "backend", "name": "unified-sqs-queue-scaledjob", "reconcileID": "42e024b8-00aa-4f40-8f0a-96959528d2d0"}
2024-06-21T11:59:23Z	INFO	Starting workers	{"controller": "cert-rotator", "worker count": 1}
2024-06-21T11:59:23Z	INFO	cert-rotation	no cert refresh needed
2024-06-21T11:59:23Z	INFO	cert-rotation	Ensuring CA cert	{"name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
2024-06-21T11:59:23Z	INFO	cert-rotation	Ensuring CA cert	{"name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService", "name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService"}
2024-06-21T11:59:23Z	INFO	cert-rotation	no cert refresh needed
2024-06-21T11:59:23Z	INFO	cert-rotation	Ensuring CA cert	{"name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
2024-06-21T11:59:23Z	INFO	cert-rotation	Ensuring CA cert	{"name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService", "name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService"}
2024-06-21T11:59:23Z	INFO	RolloutStrategy: immediate, Deleting jobs owned by the previous version of the scaledJob	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "ScaledJob": {"name":"unified-sqs-queue-scaledjob","namespace":"backend"}, "namespace": "backend", "name": "unified-sqs-queue-scaledjob", "reconcileID": "42e024b8-00aa-4f40-8f0a-96959528d2d0", "numJobsToDelete": 3}
2024-06-21T11:59:23Z	INFO	Initializing Scaling logic according to ScaledJob Specification	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "ScaledJob": {"name":"unified-sqs-queue-scaledjob","namespace":"backend"}, "namespace": "backend", "name": "unified-sqs-queue-scaledjob", "reconcileID": "42e024b8-00aa-4f40-8f0a-96959528d2d0"}
2024-06-21T11:59:23Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "unified-sqs-queue-scaledjob", "scaledJob.Namespace": "backend", "Number of running Jobs": 0}
2024-06-21T11:59:23Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "unified-sqs-queue-scaledjob", "scaledJob.Namespace": "backend", "Number of pending Jobs ": 0}
2024-06-21T11:59:23Z	INFO	scaleexecutor	Creating jobs	{"scaledJob.Name": "unified-sqs-queue-scaledjob", "scaledJob.Namespace": "backend", "Effective number of max jobs": 1}
2024-06-21T11:59:23Z	INFO	scaleexecutor	Creating jobs	{"scaledJob.Name": "unified-sqs-queue-scaledjob", "scaledJob.Namespace": "backend", "Number of jobs": 1}
2024-06-21T11:59:23Z	INFO	scaleexecutor	Created jobs	{"scaledJob.Name": "unified-sqs-queue-scaledjob", "scaledJob.Namespace": "backend", "Number of jobs": 1}
2024-06-21T11:59:24Z	INFO	cert-rotation	CA certs are injected to webhooks
2024-06-21T11:59:24Z	INFO	grpc_server	Starting Metrics Service gRPC Server	{"address": ":9666"}
2024-06-21T11:59:53Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "unified-sqs-queue-scaledjob", "scaledJob.Namespace": "backend", "Number of running Jobs": 1}
2024-06-21T11:59:53Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "unified-sqs-queue-scaledjob", "scaledJob.Namespace": "backend", "Number of pending Jobs ": 1}
2024-06-21T11:59:53Z	INFO	scaleexecutor	Creating jobs	{"scaledJob.Name": "unified-sqs-queue-scaledjob", "scaledJob.Namespace": "backend", "Effective number of max jobs": 0}
2024-06-21T11:59:53Z	INFO	scaleexecutor	Creating jobs	{"scaledJob.Name": "unified-sqs-queue-scaledjob", "scaledJob.Namespace": "backend", "Number of jobs": 0}
2024-06-21T11:59:53Z	INFO	scaleexecutor	Created jobs	{"scaledJob.Name": "unified-sqs-queue-scaledjob", "scaledJob.Namespace": "backend", "Number of jobs": 0}
2024-06-21T12:00:23Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "unified-sqs-queue-scaledjob", "scaledJob.Namespace": "backend", "Number of running Jobs": 1}
2024-06-21T12:00:23Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "unified-sqs-queue-scaledjob", "scaledJob.Namespace": "backend", "Number of pending Jobs ": 1}
2024-06-21T12:00:23Z	INFO	scaleexecutor	Creating jobs	{"scaledJob.Name": "unified-sqs-queue-scaledjob", "scaledJob.Namespace": "backend", "Effective number of max jobs": 0}
2024-06-21T12:00:23Z	INFO	scaleexecutor	Creating jobs	{"scaledJob.Name": "unified-sqs-queue-scaledjob", "scaledJob.Namespace": "backend", "Number of jobs": 0}
2024-06-21T12:00:23Z	INFO	scaleexecutor	Created jobs	{"scaledJob.Name": "unified-sqs-queue-scaledjob", "scaledJob.Namespace": "backend", "Number of jobs": 0}
2024-06-21T12:00:53Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "unified-sqs-queue-scaledjob", "scaledJob.Namespace": "backend", "Number of running Jobs": 1}
2024-06-21T12:00:53Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "unified-sqs-queue-scaledjob", "scaledJob.Namespace": "backend", "Number of pending Jobs ": 1}
2024-06-21T12:00:53Z	INFO	scaleexecutor	Creating jobs	{"scaledJob.Name": "unified-sqs-queue-scaledjob", "scaledJob.Namespace": "backend", "Effective number of max jobs": 0}
2024-06-21T12:00:53Z	INFO	scaleexecutor	Creating jobs	{"scaledJob.Name": "unified-sqs-queue-scaledjob", "scaledJob.Namespace": "backend", "Number of jobs": 0}
2024-06-21T12:00:53Z	INFO	scaleexecutor	Created jobs	{"scaledJob.Name": "unified-sqs-queue-scaledjob", "scaledJob.Namespace": "backend", "Number of jobs": 0}

KEDA Version

2.12.1

Kubernetes Version

1.28

Platform

Amazon Web Services

Scaler Details

AWS SQS

Anything else?

No response

@manjurshaikh1988 manjurshaikh1988 added the bug Something isn't working label Jun 21, 2024

stale bot commented Aug 22, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale All issues that are marked as stale due to inactivity label Aug 22, 2024
JorTurFer (Member) commented

Hello,
Sorry, I missed this message 😢
You have set scaleOnInFlight: false, so KEDA does not take in-flight messages into account. When you enqueue the first message, you have 1 pending message and 0 jobs, so KEDA creates a job. Your job starts processing the message, so the queue then has 0 pending messages (because in-flight messages are ignored).
When you enqueue the second message, you have 1 pending message and 1 job, which is correct because you have set queueLength: 1. When you enqueue your 3rd message, you have 2 pending messages and 1 job, so KEDA creates another job.

In your case, I'd set scaleOnInFlight: true to spawn a job per message, regardless of whether the messages are in progress or not.
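
As a rough sketch of that suggestion, only the scaleOnInFlight value in the trigger would change; every other field below is copied unchanged from the ScaledJob in the report. The assumption is that, with this setting, the scaler counts in-flight messages together with visible ones, so the second message would give a queue length of 2 against 1 running job. This has not been verified against the reporter's cluster.

```yaml
# Fragment of the ScaledJob spec from the report (goes under spec:).
triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.ap-south-1.amazonaws.com/xxxx/xx-unifiedservice.fifo
      queueLength: "1"
      awsRegion: "ap-south-1"
      scaleOnInFlight: "true"  # was "false"; in-flight messages now count toward the queue length
    authenticationRef:
      name: keda-trigger-auth-aws-credentials
```

Note that jobs would then be counted against messages that are still being processed, so each job should delete its message from the queue once it finishes.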


stale bot commented Aug 31, 2024

This issue has been automatically closed due to inactivity.

@stale stale bot closed this as completed Aug 31, 2024