gcp-pubsub throwing "could not find stackdriver metric" #5429

Closed
ddlenz opened this issue Jan 23, 2024 · 14 comments · Fixed by #5452
Labels
bug Something isn't working

Comments


ddlenz commented Jan 23, 2024

Report

After updating to 2.13.0, gcp_pub_sub_scaler repeatedly throws "error getting metric" and scale_handler throws "error getting scale decision" with "could not find stackdriver metric with query fetch pubsub_subscription". Messages remain unacked and the workload fails to scale.

Expected Behavior

keda scales application from zero

Actual Behavior

keda fails to scale application from zero

Steps to Reproduce the Problem

  1. create gcp-pubsub scaler
  2. publish a message to trigger the scaler
  3. check unacked messages and logs

Logs from KEDA operator

No response

KEDA Version

2.13.0

Kubernetes Version

Other

Platform

Google Cloud

Scaler Details

gcp pubsub

Anything else?

kubernetes version 1.28.3-gke.1203001

ddlenz added the bug (Something isn't working) label on Jan 23, 2024

JorTurFer commented Jan 24, 2024

Hello,
Does it work on previous versions? Could you share your ScaledObject?

Could you share KEDA operator logs as well?

Steps to Reproduce the Problem

  1. create gcp-pubsub scaler
  2. publish a message to trigger the scaler
  3. check unacked messages and logs

This case is already covered by e2e tests and it works. One thing that can happen: if you don't have any messages, you won't get a metric, because the API itself responds with an error (which is normal when there isn't any activity related to the queue, AFAIK, for Pub/Sub monitoring).

There is a change that could affect this, but I don't think it does, as the e2e tests still pass and the change kept the default behavior (and that's why I'm asking for more info xD).

@eremeevfd

First, thank you for an incredible and very useful product!

Unfortunately, we've encountered the same issue as well.
Logs from KEDA:

2024-01-25T10:20:42Z    ERROR    scale_handler    error getting scale decision    {"scaledObject.Namespace": "algorithms-selfie", "scaledObject.Name": "translucency", "scaler": "pubsubScaler", "error": "could not find stackdriver metric with query fetch pubsub_subscription | metric 'pubsub.googleapis.com/subscription/num_undelivered_messages' | filter (resource.project_id == 'sasuke-core-dev' && resource.subscription_id == 'selfie_v2.translucency-2.0.0') | within 1m"}
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScalerState
    /workspace/pkg/scaling/scale_handler.go:764
 github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScaledObjectState.func1
    /workspace/pkg/scaling/scale_handler.go:628

However, when I try to fetch the same metric in Google Metrics Explorer, I do get some results:
[screenshot: Metrics Explorer showing data points for the same query]
Here we can see that there were some messages earlier and the series is empty now, but does the API really return an error in that case? Shouldn't it return zero or some null value?
Could it also be a problem with TriggerAuthentication? Although in that case I would expect a different error, about permissions.

@FrancoisPoinsot

I am having the exact same issue:
Same KEDA version.
Kubernetes 1.26.
Similar error log.
Also with the gcp-pubsub scaler.
I can also run the query successfully in Google Metrics Explorer.

I had to roll back to 2.12.1 because some workloads were not scaling up. That is what we detected originally.


JorTurFer commented Jan 30, 2024

So, does it work in KEDA v2.12.1 and not in KEDA v2.13.0?
By any chance, do you have a way I can follow to replicate the issue? Something like: push a message somehow, check KEDA somehow, etc.
I have zero experience with GCP, so although I've checked the changes, I don't see anything significant (at least not yet), and having a way to reproduce it in our account would be awesome for comparing both versions. I mean, reproduction steps that work on v2.12.1 and don't work on v2.13.0.

@ekaputra07

Hi, first of all, thanks for this great project!

But I'm facing the same issue here. I'm using:

KEDA v2.13.0 / GKE Autopilot / Pub/Sub

My ScaledObject:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: some-name
spec:
  scaleTargetRef:
    name: some-name
  triggers:
    - type: gcp-pubsub
      authenticationRef:
        name: trigger-authentication-dev
        kind: ClusterTriggerAuthentication
      metadata:
        mode: NumUndeliveredMessages
        value: "5"
        activationValue: "0"
        subscriptionName: projects/my-project/subscriptions/my-sub

The error:

2024-01-31T04:53:00Z	ERROR	gcp_pub_sub_scaler	error getting metric	{"type": "ScaledObject", "namespace": "default", "name": "some-name", "metricType": "pubsub.googleapis.com/subscription/num_undelivered_messages", "error": "could not find stackdriver metric with query fetch pubsub_subscription | metric 'pubsub.googleapis.com/subscription/num_undelivered_messages' | filter (resource.project_id == 'my-project' && resource.subscription_id == 'my-sub') | within 1m"}

I copy-pasted the query from the error message and ran it in GCP's Metrics Explorer:
[screenshot: Metrics Explorer results for the query]

Things that I noticed are:

  • with within 1m, the query sometimes returns a result, but most of the time it doesn't
  • with a wider time range, for example within 5m, it always returns a result

The metric's documentation says:

subscription/num_undelivered_messages
Number of unacknowledged messages (a.k.a. backlog messages) in a subscription. Sampled every 60 seconds. After sampling, data is not visible for up to 120 seconds.

These might not be related, but I'm trying to provide as much data as possible, and hopefully it helps to debug the situation.


And based on the scaler code, it looks like we mark it as an error if Stackdriver doesn't return a value:

if err == iterator.Done {
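
For reference, here is a minimal, hedged sketch of how that path plays out with the Cloud Monitoring Go client; the helper name queryLastValue and the surrounding structure are illustrative, not KEDA's actual code. The point is only that when the MQL query returns no series in the window, the very first Next() call yields iterator.Done, and that gets reported as the "could not find stackdriver metric" error rather than as a zero value.

package main

import (
	"context"
	"fmt"

	monitoring "cloud.google.com/go/monitoring/apiv3/v2"
	"cloud.google.com/go/monitoring/apiv3/v2/monitoringpb"
	"google.golang.org/api/iterator"
)

// queryLastValue is an illustrative helper, not KEDA's implementation.
// It runs an MQL query and returns the first value of the first series.
func queryLastValue(ctx context.Context, client *monitoring.QueryClient, projectID, query string) (int64, error) {
	req := &monitoringpb.QueryTimeSeriesRequest{
		Name:  "projects/" + projectID,
		Query: query, // e.g. "fetch pubsub_subscription | metric '...' | within 1m"
	}
	it := client.QueryTimeSeries(ctx, req)

	series, err := it.Next()
	if err == iterator.Done {
		// No data points in the query window: this is the path that surfaces
		// as "could not find stackdriver metric with query ...".
		return -1, fmt.Errorf("could not find stackdriver metric with query %s", query)
	}
	if err != nil {
		return -1, err
	}
	pd := series.GetPointData()
	if len(pd) == 0 || len(pd[0].GetValues()) == 0 {
		return -1, fmt.Errorf("empty time series data")
	}
	// num_undelivered_messages is an int64 gauge, so read the int64 value.
	return pd[0].GetValues()[0].GetInt64Value(), nil
}

func main() {
	ctx := context.Background()
	client, err := monitoring.NewQueryClient(ctx) // uses application default credentials
	if err != nil {
		panic(err)
	}
	defer client.Close()

	query := "fetch pubsub_subscription" +
		" | metric 'pubsub.googleapis.com/subscription/num_undelivered_messages'" +
		" | filter (resource.project_id == 'my-project' && resource.subscription_id == 'my-sub')" +
		" | within 1m"
	fmt.Println(queryLastValue(ctx, client, "my-project", query))
}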


FrancoisPoinsot commented Jan 31, 2024

Some more context about what happened.

I have 8 gcp-pubsub ScaledObjects:
5 with minReplicaCount: 0.
3 with minReplicaCount: 1.

When we upgraded KEDA to 2.13.0, all 5 deployments targeted by a ScaledObject with minReplicaCount: 0 started to have scaling issues. They were scaled down to 0, no matter how many pods existed.

Those with minReplicaCount: 1 seem unaffected, so I am wondering if the problem is specifically related to the activation phase.

I am trying to get a reproduction scenario.

Here is a basic manifest:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: test-francois
  namespace: test-francois
spec:
  maxReplicaCount: 5
  minReplicaCount: 0
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: test-francois
  triggers:
  - authenticationRef:
      kind: ClusterTriggerAuthentication
      name: keda-clustertrigger-auth-gcp-credentials
    metadata:
      mode: SubscriptionSize
      subscriptionName: test-francois-sub
      value: "4"
    type: gcp-pubsub

You need to publish manually to the topic in question. There is no need to ack any message; just use the value to scale any random deployment up and down.
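
For the "publish manually" step, a minimal publisher sketch like the one below works, assuming application default credentials; the project ID and topic name are placeholders, with the topic being the one behind test-francois-sub.

package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()

	// Placeholder project and topic; use the topic that test-francois-sub is attached to.
	client, err := pubsub.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatalf("pubsub.NewClient: %v", err)
	}
	defer client.Close()

	topic := client.Topic("test-francois")
	defer topic.Stop()

	// Publish a few messages and never ack them, so num_undelivered_messages
	// rises above the scaler's target value.
	for i := 0; i < 10; i++ {
		res := topic.Publish(ctx, &pubsub.Message{Data: []byte(fmt.Sprintf("msg-%d", i))})
		if _, err := res.Get(ctx); err != nil {
			log.Fatalf("publish: %v", err)
		}
	}
	fmt.Println("published 10 messages")
}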

Using this I can reproduce the error log in keda-operator.
I also get a steady increase in the keda_scaler_errors metric.
Also, watching the generated HPA, this is what I see:

NAME                     REFERENCE                  TARGETS     MINPODS   MAXPODS   REPLICAS   AGE
keda-hpa-test-francois   Deployment/test-francois   2/4 (avg)   1         5         1          117m
keda-hpa-test-francois   Deployment/test-francois   <unknown>/4 (avg)   1         5         1          117m

Roughly 1/3 of the entries show the target as <unknown>.

However, this setup is not enough to reproduce the issue above: the deployment is not being scaled down to 0. But I wonder if that is just a matter of how frequently those errors show up.


FrancoisPoinsot commented Jan 31, 2024

Looking at https://github.com/kedacore/keda/pull/5246/files, which was merged for 2.13.0, I see the GetMetrics call has been replaced with QueryMetrics:

https://github.com/kedacore/keda/pull/5246/files#diff-aaa03b99f93c680bd727f6f0a3e9d932c34344ad25b3a254f9a56178c853fe3bR233

And GetMetrics queried 2 minutes back in time, instead of the 1m that is currently used:

startTime := time.Now().UTC().Add(time.Minute * -2)
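
In other words, the old path asked for an explicit interval reaching 2 minutes back, which tolerates the metric's "sampled every 60 seconds, visible after up to 120 seconds" delay, while the new MQL path only looks 1 minute back via within 1m. A rough side-by-side sketch (identifiers and the query assembly are illustrative, not KEDA's exact code):

package main

import (
	"fmt"
	"time"

	"cloud.google.com/go/monitoring/apiv3/v2/monitoringpb"
	"google.golang.org/protobuf/types/known/timestamppb"
)

func main() {
	projectID, subscriptionID := "my-project", "my-sub" // placeholders

	// Pre-2.13.0 (GetMetrics-style): an explicit interval that reaches
	// 2 minutes back, so the most recent sampled point is always inside it.
	startTime := time.Now().UTC().Add(time.Minute * -2)
	endTime := time.Now().UTC()
	interval := &monitoringpb.TimeInterval{
		StartTime: timestamppb.New(startTime),
		EndTime:   timestamppb.New(endTime),
	}
	fmt.Println("old window:", interval.EndTime.AsTime().Sub(interval.StartTime.AsTime()))

	// 2.13.0 (QueryMetrics-style): the MQL query only looks 1 minute back, so a
	// point sampled up to 60s ago and delayed up to 120s may not be visible yet.
	query := fmt.Sprintf(
		"fetch pubsub_subscription"+
			" | metric 'pubsub.googleapis.com/subscription/num_undelivered_messages'"+
			" | filter (resource.project_id == '%s' && resource.subscription_id == '%s')"+
			" | within 1m",
		projectID, subscriptionID,
	)
	fmt.Println("new query:", query)
}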


JorTurFer commented Feb 1, 2024

Nice research! I was thinking that maybe we changed a default behavior by mistake, and it looks like we did (and we have to fix it).

I'm thinking of adding the aggregation window as an optional parameter too (for the next version).

@JorTurFer

@FrancoisPoinsot, I've reverted the change in the default time horizon in this PR.

The generated image with that change is ghcr.io/kedacore/keda-test:pr-5452-c5cf46759c5691b29bb45c6bbb60e3be10cd9f7a. Would you be willing to test it?

@FrancoisPoinsot

I confirm that with ghcr.io/kedacore/keda-test:pr-5452-c5cf46759c5691b29bb45c6bbb60e3be10cd9f7a the error log is gone and keda_scaler_errors doesn't show any errors.
The HPA also behaves as expected.
That looks like a fix.

@JorTurFer

Do you see any increase in the goroutines now?

@FrancoisPoinsot

The goroutine count looks stable too.

@JorTurFer

Thanks for the feedback ❤️

Probably I was right and the issue with the goroutines was that the connection was not being closed properly. Now that the scaler isn't regenerated on each check, the issue is mitigated. I've also included proper closing of the connection as part of the PR: 4084ee0
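
For reference, a hedged sketch of what "closing the connection" in a scaler's Close() can look like; the struct fields and client type below are stand-ins, not the exact identifiers from that commit:

package gcppubsub

import (
	"context"

	monitoring "cloud.google.com/go/monitoring/apiv3/v2"
)

// pubsubScaler here is a stand-in with only the field relevant to this sketch.
type pubsubScaler struct {
	client *monitoring.QueryClient
}

// Close releases the underlying gRPC connection so its goroutines are not
// leaked when the scaler is refreshed or torn down.
func (s *pubsubScaler) Close(_ context.Context) error {
	if s.client != nil {
		return s.client.Close()
	}
	return nil
}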

github-project-automation bot moved this from To Triage to Ready To Ship in Roadmap - KEDA Core on Feb 12, 2024
@JoelDimbernat

For anyone still encountering this error, ensure that your service account is granted the roles/monitoring.viewer role on the project. It's necessary for accessing pubsub.googleapis.com/subscription/num_undelivered_messages, and I don't think it's documented anywhere.
