Azure Service Bus Scaler Sends GetAccess Requests on every getQueueLength message #4262

tshaiman · 2023-02-21T13:32:54Z

Keda Service Bus Scaler keeps asking for Arm GetAccess on Service Bus -> no reused of AdminClient

we saw a significant increase of getAcccess On Azure ARM Endpoint, aiming to authenticate that the request has access to the service bus.

This happens when all the pods are at scale zero
we have 16 queues , with 8 environments , with polling interval of 30 seconds => that is around 256 calls Per minute
Looking at the Keda Service bus Scaler code we see the following

azure_servicebus_scaler.go

func (s *azureServiceBusScaler) getAzureServiceBusLength(ctx context.Context) (int64, error) {
	// get adminClient
	adminClient, err := s.getServiceBusAdminClient()
	if err != nil {
		return -1, err
	}
	// switch case for queue vs topic here
	switch s.metadata.entityType {
	case queue:
		return getQueueLength(ctx, adminClient, s.metadata)
	case subscription:
		return getSubscriptionLength(ctx, adminClient, s.metadata)
	default:
		return -1, fmt.Errorf("no entity type")
	}
}

a servicebusAdminClient is created but it is not assigned to any varaible on "s" (azureServiceBusScaler)

func (s *azureServiceBusScaler) getServiceBusAdminClient() (*admin.Client, error) {
	if s.client != nil {
		return s.client, nil
	}
  .......//Creating the client
}

so each time a check is made to see if s.client is initialized , it is always be null , hence new client will be created over and over again.

Expected Behavior

The admin.Client needs to be stored in the azureServiceBusScaler object and not be created every time

Actual Behavior

The admin.Client is created on every "getQueueLength" request hence cause Azure ARM calls to be throttled with "429-Too many requests "
when we query Azure ARM Endpoint for investigation we found the following :

HttpIncomingRequests
| where TIMESTAMP > ago(1d)| where subscriptionId == "24237b72-8546-4da5-b204-8c3cb76dd930"
| summarize count() by operationName
| order by count_ desc

POST/SUBSCRIPTIONS/RESOURCEGROUPS/PROVIDERS/MICROSOFT.SERVICEBUS/NAMESPACES/QUEUES/PROVIDERS/MICROSOFT.AUTHORIZATION/CHECKACCESS - 709361 calls

so CheckAccess API has 709K calls per days , even when no pods are running.

Steps to Reproduce the Problem

run Keda with at least 160 queues and use polling interval of 30 seconds/ or use less queues with low polling interval
Goto Azure Portal and view the number of Requests on the Service Bus Queue :

image 1 : queue length is empty

image 2: number of calls is 250-300 per minute.

ensure all calls are
POST/SUBSCRIPTIONS/RESOURCEGROUPS/PROVIDERS/MICROSOFT.SERVICEBUS/NAMESPACES/QUEUES/PROVIDERS/MICROSOFT.AUTHORIZATION/CHECKACCESS

Logs from KEDA operator

no errros

KEDA Version

2.9.3

Kubernetes Version

1.24

Platform

Microsoft Azure

Scaler Details

Azure Service Bus

Anything else?

would like to submit a patch after this is confirmed to be a bug

The text was updated successfully, but these errors were encountered:

JorTurFer · 2023-02-23T12:38:40Z

Hello!
Nice catch! KEDA asks the queues even if there isn't any pod, but definitively the client should be reused in the scaler

Are you willing to contribute with the fix?

I think that a small refactor of this function could be enough.

Something like this could be enough:

func (s *azureServiceBusScaler) getServiceBusAdminClient() (*admin.Client, error) {
	if s.client != nil {
		return s.client, nil
	}
	var err error
	var client *admin.Client
	switch s.podIdentity.Provider {
	case "", kedav1alpha1.PodIdentityProviderNone:
		client, err = admin.NewClientFromConnectionString(s.metadata.connection, nil)
	case kedav1alpha1.PodIdentityProviderAzure, kedav1alpha1.PodIdentityProviderAzureWorkload:
		creds, chainedErr := azure.NewChainedCredential(s.podIdentity.IdentityID, s.podIdentity.Provider)
		if chainedErr != nil {
			return nil, err
		}
		client, err = admin.NewClient(s.metadata.fullyQualifiedNamespace, creds, nil)
	default:
		err = fmt.Errorf("incorrect podIdentity type")
	}

	s.client = client
	return client, err
}

zroubalik · 2023-02-23T12:47:53Z

Great find!

xoanmm · 2023-02-23T13:04:22Z

Hey,

I want to tackle this issue

tshaiman · 2023-02-23T18:06:04Z

oh man you are quick i was just about to submit a PR 😂

you snooze you loose

Relates to #4262 Signed-off-by: Tom Kerkhove <kerkhove.tom@gmail.com>

tshaiman · 2023-02-25T21:45:44Z

@JorTurFer / @tomkerkhove / @xoanmm : I'm afriad that the fix in #4273 is now causing another issue.

I took the fix for a test drive ( build custom docker image) and after 1 hour I started getting Pod Identity Exception
I think this is because Azure Tokens are valid for 1 hour and the code does not take this into account but keep using the cached admin client.

peeking at the code of Identity in Keda and with knowledge on how Azure Identity library work I got to say this entire Identity issue needs to be discussed.

having said that the fix is not working for a pod that is longed lived .

JorTurFer · 2023-02-25T21:49:04Z

What error do you see? In theory, the service bus SDK should call to GetToken when the token has expired. I mean, maybe there is other error, but I think that this implementation is correct as we should cache the client instead of creating a new one each time.
KEDA also recreates the scaler when there is any error getting the metrics, so even if the client is cached in the scaler, the whole scaler is regenerated on any error.
I think that the error should be in other place.

tshaiman · 2023-02-25T22:04:01Z

I see the same-old ,most common error that keda does not have a valid token to use.
I do agree that we should cache the client , but I was looking how tokens are being asked and used in Keda , which made me rethink about the entire issue we see here ( and in the infamous Workload Identity bugs we saw few months ago )
I started a discussion here :
#4279

I agree that the SDK needs to call GetToken behind the scenes , but as I opened the custom implementation/wrapper of Identity in keda , as described in the discussion , I'm not sure this will eventually happen since the usage of the well defined contract against Azure .Identity library is not met in this case.

JorTurFer · 2023-02-25T22:22:13Z

I'm deploying that change into my cluster right now, let's wait until tomorrow to check how it looks.
What are you using? Workload Identity or AAD-Pod-Identity?

tshaiman · 2023-02-25T22:26:25Z

pod-identity , as workload identity had a bug that related to their webhook few months ago

JorTurFer · 2023-02-26T01:46:39Z

I don't see any error for the moment (76 min):

I'll keep both apps (aad-pod-identity and wi) deployed some days to check if there is any error

JorTurFer · 2023-02-26T02:32:51Z

Nothing yet (125min), Let's see after 12-24 hours

tshaiman · 2023-02-26T06:22:17Z

what i did in my test was to replace only the fixed code ( saving admin client ) into v2.9.3
that could be insufficient with more fixes merged to master

tshaiman · 2023-02-26T08:01:59Z

nice tool ! is this Lens ?

…

On Sun, Feb 26, 2023 at 4:33 AM Jorge Turrado Ferrero < ***@***.***> wrote: Nothing yet (125min), Let's see after 12-24 hours [image: image] <https://user-images.githubusercontent.com/36899226/221389014-fa42c9bf-c1df-4cb8-9b89-d4b12ccd6297.png> — Reply to this email directly, view it on GitHub <#4262 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAD5FRI27ODBM7GKDHKLZ2LWZK6F5ANCNFSM6AAAAAAVDBX4V4> . You are receiving this because you authored the thread.Message ID: ***@***.***>

JorTurFer · 2023-02-26T11:58:08Z

is this Lens ?

yes, It's openlens (an open source fork of lens).
It isn't perfect and sometime I still need kubectl, but for majority of the case works realy nice

JorTurFer · 2023-02-26T12:00:16Z

Same behavior after 11 hours, I guess that we need to wait until 12/24 hours to ensure that the token is regenerated.

BTW, do you see any error in the aad-pod-identity pods? as it's an authentication proxy, maybe there is any error there that givesd more info about the problems

tshaiman · 2023-02-26T12:53:43Z

is it 2.9.3 with the fix , or is it s2.9.4 candidate.
what i did is to apply the fix on this bug to 2/.9.3 which may be incorrect.
can I get an image tag to test it ?

JorTurFer · 2023-02-26T13:08:11Z

I am using 'main' tag because it's the only that contains the fix for the moment ('main' tag contains the next version changes)

Be careful trying that image because it contains the certificarte management (a new feature introduced to improve the security), so you can't just update the tag, there are required changes in the RBAC to allow this management. Other option is to manage the certificates from your side, but that's more complicated. Next version docs explains it https://keda.sh/docs/2.10/operate/security/#use-your-own-tls-certificates
And we will release a blog post explaining how to do it with cert-manager after release

tshaiman · 2023-02-26T13:14:20Z

ok that is too much for now indeed
i will just save the auth client on my 2.9.3 tag and rebuild with ‘make container-build׳

i don’t want you to spend time on this if after 12 hours it works - most chances i didn’t build correctly .

will update shortly

JorTurFer · 2023-02-26T22:04:28Z

21H without any error:

Tomorrow morning I'll check if there is any error after 24h. I guess that if tomorrow this still works, the access token renewal is done correctly and I can close the issue again 😄
I created 2 ScaledObject, one for add-pod-identity and other one for aad-wi so both are covered with this test

tshaiman · 2023-02-26T23:07:02Z

@JorTurFer : conducted my tests as well and confirmed to be working correctly.
the problem was running against "workload-identity" which is still not fixed completely by workload-identity team with the webhook and its policy .

thanks for all your effots and clarifications

JorTurFer · 2023-02-27T08:07:30Z

Nice!

…re#4275) Relates to kedacore#4262 Signed-off-by: Tom Kerkhove <kerkhove.tom@gmail.com>

tshaiman added the bug Something isn't working label Feb 21, 2023

JorTurFer added this to Roadmap - KEDA Core Feb 21, 2023

github-project-automation bot moved this to To Triage in Roadmap - KEDA Core Feb 21, 2023

tomkerkhove assigned xoanmm Feb 23, 2023

tomkerkhove moved this from To Triage to To Do in Roadmap - KEDA Core Feb 23, 2023

xoanmm mentioned this issue Feb 23, 2023

fix: change way of returns service bus namespace object #4273

Merged

5 tasks

JorTurFer closed this as completed in #4273 Feb 23, 2023

github-project-automation bot moved this from To Do to Ready To Ship in Roadmap - KEDA Core Feb 23, 2023

tomkerkhove added a commit that referenced this issue Feb 24, 2023

docs(changelog): Clarify recent Azure Service Bus improvement

14005a7

Relates to #4262 Signed-off-by: Tom Kerkhove <kerkhove.tom@gmail.com>

tomkerkhove mentioned this issue Feb 24, 2023

docs(changelog): Clarify recent Azure Service Bus improvement #4275

Merged

2 tasks

JorTurFer pushed a commit that referenced this issue Feb 24, 2023

docs(changelog): Clarify recent Azure Service Bus improvement (#4275)

2a773a2

Relates to #4262 Signed-off-by: Tom Kerkhove <kerkhove.tom@gmail.com>

JorTurFer reopened this Feb 25, 2023

github-project-automation bot moved this from Ready To Ship to Proposed in Roadmap - KEDA Core Feb 25, 2023

JorTurFer closed this as completed Feb 25, 2023

github-project-automation bot moved this from Proposed to Ready To Ship in Roadmap - KEDA Core Feb 25, 2023

JorTurFer reopened this Feb 25, 2023

github-project-automation bot moved this from Ready To Ship to Proposed in Roadmap - KEDA Core Feb 25, 2023

tshaiman closed this as completed Feb 26, 2023

github-project-automation bot moved this from Proposed to Ready To Ship in Roadmap - KEDA Core Feb 26, 2023

JorTurFer moved this from Ready To Ship to Done in Roadmap - KEDA Core Mar 13, 2023

xoanmm pushed a commit to xoanmm/keda that referenced this issue Mar 22, 2023

docs(changelog): Clarify recent Azure Service Bus improvement (kedaco…

cf47cb1

…re#4275) Relates to kedacore#4262 Signed-off-by: Tom Kerkhove <kerkhove.tom@gmail.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Azure Service Bus Scaler Sends GetAccess Requests on every getQueueLength message #4262

Azure Service Bus Scaler Sends GetAccess Requests on every getQueueLength message #4262

tshaiman commented Feb 21, 2023 •

edited

Loading

JorTurFer commented Feb 23, 2023 •

edited

Loading

zroubalik commented Feb 23, 2023

xoanmm commented Feb 23, 2023

tshaiman commented Feb 23, 2023

tshaiman commented Feb 25, 2023 •

edited

Loading

JorTurFer commented Feb 25, 2023

tshaiman commented Feb 25, 2023

JorTurFer commented Feb 25, 2023 •

edited

Loading

tshaiman commented Feb 25, 2023

JorTurFer commented Feb 26, 2023

JorTurFer commented Feb 26, 2023

tshaiman commented Feb 26, 2023

tshaiman commented Feb 26, 2023 via email

JorTurFer commented Feb 26, 2023 •

edited

Loading

JorTurFer commented Feb 26, 2023

tshaiman commented Feb 26, 2023

JorTurFer commented Feb 26, 2023

tshaiman commented Feb 26, 2023

JorTurFer commented Feb 26, 2023 •

edited

Loading

tshaiman commented Feb 26, 2023

JorTurFer commented Feb 27, 2023

Azure Service Bus Scaler Sends GetAccess Requests on every getQueueLength message #4262

Azure Service Bus Scaler Sends GetAccess Requests on every getQueueLength message #4262

Comments

tshaiman commented Feb 21, 2023 • edited Loading

Keda Service Bus Scaler keeps asking for Arm GetAccess on Service Bus -> no reused of AdminClient

Expected Behavior

Actual Behavior

Steps to Reproduce the Problem

Logs from KEDA operator

KEDA Version

Kubernetes Version

Platform

Scaler Details

Anything else?

JorTurFer commented Feb 23, 2023 • edited Loading

zroubalik commented Feb 23, 2023

xoanmm commented Feb 23, 2023

tshaiman commented Feb 23, 2023

tshaiman commented Feb 25, 2023 • edited Loading

JorTurFer commented Feb 25, 2023

tshaiman commented Feb 25, 2023

JorTurFer commented Feb 25, 2023 • edited Loading

tshaiman commented Feb 25, 2023

JorTurFer commented Feb 26, 2023

JorTurFer commented Feb 26, 2023

tshaiman commented Feb 26, 2023

tshaiman commented Feb 26, 2023 via email

JorTurFer commented Feb 26, 2023 • edited Loading

JorTurFer commented Feb 26, 2023

tshaiman commented Feb 26, 2023

JorTurFer commented Feb 26, 2023

tshaiman commented Feb 26, 2023

JorTurFer commented Feb 26, 2023 • edited Loading

tshaiman commented Feb 26, 2023

JorTurFer commented Feb 27, 2023

tshaiman commented Feb 21, 2023 •

edited

Loading

JorTurFer commented Feb 23, 2023 •

edited

Loading

tshaiman commented Feb 25, 2023 •

edited

Loading

JorTurFer commented Feb 25, 2023 •

edited

Loading

JorTurFer commented Feb 26, 2023 •

edited

Loading

JorTurFer commented Feb 26, 2023 •

edited

Loading