Replies: 4 comments 4 replies
-
The scheduler bind type is set to "db" by default.
I believe this operator uses k8s anti-affinity in order to prevent RabbitMQ pods from running on the same physical host. @mkuratczyk and @Zerpet should be able to provide more information about that.
This would be the appropriate place to open a pull request to mention that setting.
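For reference, a minimal sketch of what per-cluster pod anti-affinity can look like, assuming the Cluster Operator's spec.affinity pass-through field and the app.kubernetes.io/name label the operator puts on its pods (the operator does not necessarily apply such a rule by default):

```yaml
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: my-rabbit            # hypothetical cluster name
spec:
  replicas: 3
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        # Keep replicas of this cluster off the same Kubernetes node.
        - topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: my-rabbit
```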
-
@mkuratczyk if only one RabbitMQ node runs on a single Kubernetes node, isn't that a waste of resources? Do you deploy any other pods on those Kubernetes nodes?
-
I believe that using "db" makes sense when each RabbitMQ node runs on its own dedicated physical machine. However, in containerized environments, this approach might not be as appropriate. In containerization, even in production settings, each machine typically runs multiple pods; it's unlikely that an entire machine would be allocated to a single RabbitMQ replica.

Moreover, containerization often aligns with microservices architecture, where business data is largely isolated. For data security and isolation, it's improbable that a single RabbitMQ cluster would serve everything, so multiple RabbitMQ clusters need to be deployed. Each RabbitMQ cluster can be configured with anti-affinity to ensure that its replicas do not co-locate on the same machine. However, pods from different RabbitMQ clusters might still end up on the same server, because we don't want each RabbitMQ replica to occupy an entire machine, which would be too wasteful of resources.

In this scenario, if "db" is kept as the default, the pods from these two RabbitMQ clusters interfere with each other: if each is configured with 8 cores and 8GB, both always bind to CPU cores 0 through 7 and never utilize the other available cores. Besides, since Erlang 23 the runtime takes CPU quotas into account in containerized environments, which already limits the number of schedulers of every RabbitMQ node running in a pod. Using "unbound" seems more appropriate in this case.
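To make that scenario concrete, here is a rough sketch (cluster names and sizes are illustrative, not taken from this thread) of two independent clusters whose pods can legitimately land on the same 128-core node; with the default "db"/'tnnps' binding, both containers pin their schedulers to the same low-numbered cores:

```yaml
# Two unrelated RabbitMQ clusters; anti-affinity separates replicas within each
# cluster, but nothing prevents pods of different clusters from sharing a node.
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: orders-rabbitmq       # hypothetical
spec:
  replicas: 3
  resources:
    limits:
      cpu: 8                  # 8-core quota; since Erlang 23 this also caps the scheduler count
      memory: 8Gi
---
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: payments-rabbitmq     # hypothetical
spec:
  replicas: 3
  resources:
    limits:
      cpu: 8
      memory: 8Gi
```

If pods from both clusters end up on the same host with scheduler binding enabled, their schedulers compete for the same cores 0-7 while the remaining cores stay idle, which is exactly the overlap described above.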
-
Not too long ago we had many examples of the opposite problem: RabbitMQ using "too many" cores on Kubernetes. RabbitMQ was called names such as "a resource hog", "a CPU killer" and so on. Give it 128 cores and it uses 128, someone is very unhappy. Give it 128 cores and it uses 8, someone is very unhappy. This can be added as a note to the runtime guide as well as one of the Kubernetes Operator guides.
-
Is your feature request related to a problem? Please describe.
We used 7 machines, each with 128 cores and 320GB of RAM, to set up a new production environment. For RabbitMQ performance validation, we deployed a single-replica RabbitmqCluster with 8 cores and 8GB on one of the nodes. With 10 quorum queues and 30 producers, the incoming message rate could reach 60-70K/s.
We then deployed another single-replica RabbitmqCluster with the same configuration on that node and stress-tested all instances together, but the total incoming rate for the quorum queues remained at 60-70K/s.
After deploying 5 more single-replica RabbitmqClusters and conducting a stress test on all instances together, the total production/consumption was still only around 60-70K/s, showing no improvement.
Through multiple rounds of testing, we discovered that the CPU utilization of cores 0-7 was consistently >90%, while all other CPU cores remained <10%.
Describe the solution you'd like
We set the Erlang +stbt config to unbound and redid the stress test: the load spread across all the cores and total production reached 270K/s, though the throughput was not very stable.
Config as below:
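One possible way to pass that flag through the Cluster Operator, shown here only as a sketch (it assumes the operator's spec.override StatefulSet mechanism and the standard RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS environment variable, and is not necessarily the exact configuration used in the tests above):

```yaml
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: perf-test-rabbitmq    # hypothetical
spec:
  replicas: 1
  resources:
    limits:
      cpu: 8
      memory: 8Gi
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers:
              - name: rabbitmq
                env:
                  # Leave scheduler placement to the OS instead of binding to cores.
                  - name: RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS
                    value: "+stbt unbound"
```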
Additional context
In Kubernetes, it is quite common to deploy multiple RabbitMQ pods on the same node. If the default value of +stbt is kept as "db" (the actual value is 'tnnps'), it can significantly limit the overall performance of RabbitMQ.
Besides, the unbound value is not listed in the RabbitMQ doc, which I think should be added too.
Doc reference: