
RabbitMQ connections are not closed when a cron trigger is active in the same ScaledObject #1413

Closed · Fixed by #1416 · Labels: bug

mboutet opened this issue Dec 11, 2020 · 5 comments

mboutet commented Dec 11, 2020

The RabbitMQ scaler does not seem to properly close connections, leading to KEDA being OOM-killed and putting unnecessary memory/CPU pressure on RabbitMQ. This happens when a ScaledObject has both an active cron trigger and a rabbitmq trigger (whether the latter is active or not).

Expected Behavior

The connection should be closed after use. Ideally, connections should be reused as suggested in #1121, since initiating an AMQP connection is really expensive: the client and the server have to exchange 7 TCP packets.
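For illustration, here is a minimal, self-contained sketch of that expected lifecycle using the streadway/amqp client (the URL and queue name are placeholders; this is not the actual KEDA scaler code): dial, inspect the queue, and close the connection before returning.

package main

import (
    "log"

    "github.com/streadway/amqp"
)

// queueLength opens a connection, inspects the queue and closes the
// connection again. Dialing is the expensive part (the AMQP handshake),
// which is why reusing connections (#1121) would be even better.
func queueLength(amqpURL, queueName string) (int, error) {
    conn, err := amqp.Dial(amqpURL)
    if err != nil {
        return 0, err
    }
    defer conn.Close() // expected behaviour: the connection never outlives the poll

    ch, err := conn.Channel()
    if err != nil {
        return 0, err
    }

    q, err := ch.QueueInspect(queueName)
    if err != nil {
        return 0, err
    }
    return q.Messages, nil
}

func main() {
    n, err := queueLength("amqp://guest:guest@localhost:5672/", "my-queue")
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("queue length: %d", n)
}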

Actual Behavior

More connections are being opened than being closed.

Here's an excerpt of a tcpdump capture. As you can see, many connections are opened and only rarely closed. In fact, I believe the connections are being closed by RabbitMQ, not by KEDA.

477	1.924740	10.0.54.158	10.244.0.95	AMQP	576	Connection.Start 
549	1.973281	10.0.54.158	10.244.0.95	AMQP	576	Connection.Start 
557	1.974241	10.0.54.158	10.244.0.95	AMQP	576	Connection.Start 
966	2.306129	10.0.54.158	10.244.0.95	AMQP	576	Connection.Start 
1242	2.902323	10.0.54.158	10.244.0.95	AMQP	576	Connection.Start 
1294	2.969788	10.0.54.158	10.244.0.95	AMQP	576	Connection.Start 
1345	2.988623	10.0.54.158	10.244.0.95	AMQP	576	Connection.Start 
1374	2.994603	10.0.54.158	10.244.0.95	AMQP	576	Connection.Start 
1441	3.020546	10.0.54.158	10.244.0.95	AMQP	576	Connection.Start 
1585	3.348450	10.0.54.158	10.244.0.95	AMQP	576	Connection.Start 
1630	3.376734	10.0.54.158	10.244.0.95	AMQP	576	Connection.Start 
1725	3.423396	10.244.0.95	10.0.54.158	AMQP	127	Connection.Close reply=Pooled connection closed due to inactivity 
1727	3.423435	10.244.0.95	10.0.54.158	AMQP	127	Connection.Close reply=Pooled connection closed due to inactivity 
1771	3.550319	10.0.54.158	10.244.0.95	AMQP	576	Connection.Start 
1823	3.645093	10.0.54.158	10.244.0.95	AMQP	576	Connection.Start 
1865	3.662555	10.0.54.158	10.244.0.95	AMQP	576	Connection.Start 
1907	3.680074	10.0.54.158	10.244.0.95	AMQP	576	Connection.Start 
1949	3.741359	10.0.54.158	10.244.0.95	AMQP	576	Connection.Start 
2324	6.057304	10.0.54.158	10.244.0.95	AMQP	576	Connection.Start 
2488	6.965303	10.0.54.158	10.244.0.95	AMQP	576	Connection.Start 
2545	7.245724	10.0.54.158	10.244.0.95	AMQP	576	Connection.Start 
2729	8.422790	10.244.0.95	10.0.54.158	AMQP	127	Connection.Close reply=Pooled connection closed due to inactivity 
2825	9.127867	10.0.54.158	10.244.0.95	AMQP	576	Connection.Start 
2876	9.250854	10.0.54.158	10.244.0.95	AMQP	576	Connection.Start 
2938	9.264355	10.0.54.158	10.244.0.95	AMQP	576	Connection.Start 
3726	12.057110	10.0.54.158	10.244.0.95	AMQP	576	Connection.Start 
3729	12.057499	10.0.54.158	10.244.0.95	AMQP	576	Connection.Start 
3755	12.059142	10.0.54.158	10.244.0.95	AMQP	576	Connection.Start 
4255	12.492415	10.0.54.158	10.244.0.95	AMQP	576	Connection.Start 
4297	12.513571	10.0.54.158	10.244.0.95	AMQP	576	Connection.Start 
4503	12.631002	10.0.54.158	10.244.0.95	AMQP	576	Connection.Start 
4554	12.683435	10.0.54.158	10.244.0.95	AMQP	576	Connection.Start 
4602	12.712576	10.0.54.158	10.244.0.95	AMQP	576	Connection.Start 
4910	13.236314	10.0.54.158	10.244.0.95	AMQP	576	Connection.Start 
4956	13.281010	10.0.54.158	10.244.0.95	AMQP	576	Connection.Start 
5045	13.423175	10.244.0.95	10.0.54.158	AMQP	127	Connection.Close reply=Pooled connection closed due to inactivity 
5046	13.423228	10.244.0.95	10.0.54.158	AMQP	127	Connection.Close reply=Pooled connection closed due to inactivity 
5047	13.423252	10.244.0.95	10.0.54.158	AMQP	127	Connection.Close reply=Pooled connection closed due to inactivity 
5048	13.423260	10.244.0.95	10.0.54.158	AMQP	127	Connection.Close reply=Pooled connection closed due to inactivity 
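For reference, a capture like the one above can be produced with something along these lines (the interface name and the default AMQP port 5672 are assumptions for this setup), then decoded as AMQP in Wireshark:

tcpdump -i eth0 -w keda-amqp.pcap 'port 5672'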

Memory usage of KEDA:
[screenshot, 2020-12-08]

Connection usage of RabbitMQ:
[screenshot, 2020-12-08]

Steps to Reproduce the Problem

  1. Create a ScaledObject using the RabbitMQ trigger and a cron trigger, such as:
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          policies:
            - periodSeconds: 60
              type: Percent
              value: 25
          stabilizationWindowSeconds: 60
        scaleUp:
          policies:
            - periodSeconds: 60
              type: Pods
              value: 1
          stabilizationWindowSeconds: 5
  cooldownPeriod: 600
  maxReplicaCount: 25
  minReplicaCount: 0
  pollingInterval: 30
  scaleTargetRef:
    name: my-deployment
  triggers:
    - metadata:
        desiredReplicas: '1'
        start: 0 0 * * *
        end: 0 23 * * *
        timezone: America/Toronto
      type: cron
    - authenticationRef:
        name: rabbitmq-secret
      metadata:
        queueLength: '30'
        queueName: my-queue
      type: rabbitmq
  2. Observe the KEDA memory and RabbitMQ connection count slowly increasing.

Additional Info

There have been some related issues in the past:

In particular, #318 seemed to have been addressed by #361 using a defer scaler.Close() in the scale loop. However, I cannot find this deferred call in the v2 codebase.
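For context, the shape of that #361 pattern was roughly the following. This is an illustrative sketch with made-up names, not the actual KEDA code: the point is that every scaler built for a polling cycle gets closed before the function returns, regardless of which trigger turned out to be active.

// Scaler is a made-up stand-in for KEDA's scaler interface (illustrative only).
type Scaler interface {
    IsActive() (bool, error)
    Close() error
}

func anyScalerActive(scalers []Scaler) bool {
    isActive := false
    for _, s := range scalers {
        // Deferred: runs once per scaler when anyScalerActive returns,
        // even if a later scaler errors or reports active first.
        defer s.Close()
        active, err := s.IsActive()
        if err != nil {
            continue
        }
        if active {
            isActive = true
        }
    }
    return isActive
}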

I initially introduced a connection pooler to sit between RabbitMQ and the clients. However, this did not help, as the connection pooler was then also being OOM-killed and not reusing connections. I eventually identified KEDA as the client opening so many connections. You can read the discussion in the issue I opened against the connection pooler: cloudamqp/amqproxy#43 (comment).

Specifications

  • KEDA Version: v2.0.0 (50bec80)
  • Platform & Version: AKS
  • Kubernetes Version:
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.4", GitCommit:"d360454c9bcd1634cf4cc52d1867af5491dc9c5f", GitTreeState:"clean", BuildDate:"2020-11-12T01:09:16Z", GoVersion:"go1.15.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-16T20:43:08Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
  • Scaler(s): RabbitMQ + cron: 20 ScaledObjects and 20 ScaledJobs. Each ScaledObject has 3 RabbitMQ triggers on average (i.e. it scales based on 3 different queues), whereas each ScaledJob has only 1 RabbitMQ trigger.

mboutet commented Dec 11, 2020

Just to show the waste of resources on RabbitMQ when the cluster is idle (i.e. no load):
[two screenshots of RabbitMQ metrics, 2020-12-11]
The drop is just after KEDA gets OOM killed.

balchua commented Dec 12, 2020

I'm not a RabbitMQ expert by any means, but it seems like we are not closing the channel; only the connection is being closed.

func (s *rabbitMQScaler) Close() error {
    if s.connection != nil {
        err := s.connection.Close()
        if err != nil {
            rabbitmqLog.Error(err, "Error closing rabbitmq connection")
            return err
        }
    }
    return nil
}

If you get the chance, could you try this out by doing an s.channel.Close()?
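A rough sketch of that suggestion (illustrative only; it assumes the scaler also stores the channel in an s.channel field, and it is not the actual KEDA code):

func (s *rabbitMQScaler) Close() error {
    // Close the channel first, then the connection, if both are stored.
    if s.channel != nil {
        if err := s.channel.Close(); err != nil {
            rabbitmqLog.Error(err, "Error closing rabbitmq channel")
            return err
        }
    }
    if s.connection != nil {
        if err := s.connection.Close(); err != nil {
            rabbitmqLog.Error(err, "Error closing rabbitmq connection")
            return err
        }
    }
    return nil
}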

mboutet commented Dec 12, 2020

According to https://pkg.go.dev/github.com/streadway/amqp#Connection.Close, one does not need to manually close the channels: closing the connection also closes all of the underlying channels. From my experience working with pika (an AMQP Python client), the behaviour is the same. This is because channels are multiplexed over the connection, if I recall correctly.

There may be one or more code paths where scaler.Close() is not called. I tried to navigate the code, but I'm not super familiar with Go nor with the codebase.

Two places I could find where the scalers are created and perhaps not closed:

ss, err := h.GetScalers(scalableObject)

func (h *scaleHandler) checkScalers(ctx context.Context, scalableObject interface{}, scalingMutex sync.Locker) {

The second one seems more suspect to me because it is called multiple times within the scale loop, whereas the first is only called once at startup (if I understood correctly).

EDIT: Never mind, the scalers are closed in:

func (h *scaleHandler) checkScaledObjectScalers(ctx context.Context, scalers []scalers.Scaler) bool {

func (h *scaleHandler) checkScaledJobScalers(ctx context.Context, scalers []scalers.Scaler, scaledJob *kedav1alpha1.ScaledJob) (bool, int64, int64) {

mboutet commented Dec 13, 2020

I discovered something interesting. I have around 10 ScaledObjects with a spec containing a cron trigger and one or more rabbitmq triggers, such as:

spec:
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          policies:
            - periodSeconds: 60
              type: Percent
              value: 25
          stabilizationWindowSeconds: 60
        scaleUp:
          policies:
            - periodSeconds: 60
              type: Pods
              value: 1
          stabilizationWindowSeconds: 5
  cooldownPeriod: 600
  maxReplicaCount: 25
  minReplicaCount: 0
  pollingInterval: 30
  scaleTargetRef:
    name: my-deployment
  triggers:
    - metadata:
        desiredReplicas: '1'
        start: 0 6 * * 1-5
        end: 0 22 * * 1-5
        timezone: America/Toronto
      type: cron
    - authenticationRef:
        name: rabbitmq-secret
      metadata:
        queueLength: '30'
        queueName: my-queue
      type: rabbitmq

In other words, my-deployment will be scaled to zero during nights and weekends. So, the interesting part is that when I deployed in my test environment today, I could not observe the excessive connections as I did during the week.

To convince myself, I set the cron schedule to 0 6 * * * -> 0 22 * * * and immediately noticed the connection count rising in RabbitMQ's metrics. The image below shows the connection count increasing as soon as I updated the cron schedule to be active, and then going flat once I reverted to 0 6 * * 1-5 -> 0 22 * * 1-5.
[screenshot: RabbitMQ connection count]
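For anyone wanting to reproduce this toggle, one hypothetical way to flip the cron window on a live ScaledObject (the object name and trigger index are placeholders for this setup):

kubectl patch scaledobject my-scaledobject --type=json \
  -p '[{"op": "replace", "path": "/spec/triggers/0/metadata/start", "value": "0 6 * * *"},
       {"op": "replace", "path": "/spec/triggers/0/metadata/end", "value": "0 22 * * *"}]'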

I updated the issue's description with this information.

mboutet changed the title from "RabbitMQ scaler does not seem to properly close connections" to "RabbitMQ connections are not closed when a cron trigger is active in the same ScaledObject" on Dec 13, 2020

zroubalik (Member) commented:

@mboutet thanks for the report and investigation!
