HTTP API endpoint that returns queue metrics runs into an exception #11886

Open
alexnkdev opened this issue Aug 2, 2024 · 5 comments

alexnkdev commented Aug 2, 2024

Describe the bug

We noticed an increasing number of these exceptions in our brokers.

2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>   crasher:
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>     initial call: cowboy_stream_h:request_process/3
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>     pid: <0.511016.0>
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>     registered_name: []
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>     exception error: no case clause matching
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                      {[{consumers,0},{consumers,0}],
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                       [{memory,7224},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {policy,
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                            <<"AWS-DEFAULT-CLASSIC-QUEUES-POLICY-CLUSTER-MULTI-AZ">>},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {operator_policy,
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                            <<"default_operator_policy_AWS_managed">>},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {effective_policy_definition,
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                            [{<<"ha-mode">>,<<"all">>},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                             {<<"ha-sync-mode">>,<<"automatic">>},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                             {<<"max-length">>,8000000},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                             {<<"overflow">>,<<"reject-publish">>},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                             {<<"queue-mode">>,<<"lazy">>},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                             {<<"queue-version">>,2}]},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {exclusive_consumer_tag,''},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {single_active_consumer_tag,''},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {consumer_utilisation,0},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {consumer_capacity,0},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {slave_nodes,
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                            ['rabbit@ip-10-0-13-3.eu-west-1.compute.internal']},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {synchronised_slave_nodes,[]},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {recoverable_slaves,
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                            ['rabbit@ip-10-0-13-3.eu-west-1.compute.internal']},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {garbage_collection,
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                            [{max_heap_size,0},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                             {min_bin_vheap_size,46422},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                             {min_heap_size,233},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                             {fullsweep_after,65535},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                             {minor_gcs,2}]},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {messages_ram,0},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {messages_ready_ram,0},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {messages_unacknowledged_ram,0},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {messages_persistent,0},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {message_bytes,0},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {message_bytes_ready,0},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {message_bytes_unacknowledged,0},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {message_bytes_ram,0},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {message_bytes_persistent,0},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {head_message_timestamp,''},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {storage_version,2},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {messages_paged_out,0},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {message_bytes_paged_out,0},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {idle_since,<<"2024-07-31T10:04:29.571+00:00">>},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {policy,
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                            <<"AWS-DEFAULT-CLASSIC-QUEUES-POLICY-CLUSTER-MULTI-AZ">>},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {operator_policy,
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                            <<"default_operator_policy_AWS_managed">>},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {effective_policy_definition,
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                            [{<<"ha-mode">>,<<"all">>},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                             {<<"ha-sync-mode">>,<<"automatic">>},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                             {<<"max-length">>,8000000},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                             {<<"overflow">>,<<"reject-publish">>},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                             {<<"queue-mode">>,<<"lazy">>},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                             {<<"queue-version">>,2}]},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {exclusive_consumer_tag,''},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {single_active_consumer_tag,''},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {consumer_utilisation,0},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {consumer_capacity,0},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {slave_nodes,
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                            ['rabbit@ip-10-0-22-45.eu-west-1.compute.internal',
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                             'rabbit@ip-10-0-6-78.eu-west-1.compute.internal']},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {synchronised_slave_nodes,
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                            ['rabbit@ip-10-0-6-78.eu-west-1.compute.internal',
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                             'rabbit@ip-10-0-22-45.eu-west-1.compute.internal']},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {recoverable_slaves,
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                            ['rabbit@ip-10-0-22-45.eu-west-1.compute.internal',
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                             'rabbit@ip-10-0-6-78.eu-west-1.compute.internal']},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {state,running},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {garbage_collection,
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                            [{max_heap_size,0},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                             {min_bin_vheap_size,46422},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                             {min_heap_size,233},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                             {fullsweep_after,65535},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                             {minor_gcs,36}]},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {messages_ram,0},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {messages_ready_ram,0},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {messages_unacknowledged_ram,0},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {messages_persistent,0},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {message_bytes,0},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {message_bytes_ready,0},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {message_bytes_unacknowledged,0},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {message_bytes_ram,0},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {message_bytes_persistent,0},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {head_message_timestamp,''},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {storage_version,2},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {messages_paged_out,0},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {message_bytes_paged_out,0},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {name,<<"workflows_v2.team.28974_queue">>},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {vhost,<<"/">>},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {durable,true},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {auto_delete,false},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {exclusive,false},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {owner_pid,none},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {arguments,#{}},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {pid,<14878.6221558.0>},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {type,classic},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {state,running},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {slave_nodes,
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                            ['rabbit@ip-10-0-22-45.eu-west-1.compute.internal',
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                             'rabbit@ip-10-0-6-78.eu-west-1.compute.internal']},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {synchronised_slave_nodes,
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                            ['rabbit@ip-10-0-6-78.eu-west-1.compute.internal',
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                             'rabbit@ip-10-0-22-45.eu-west-1.compute.internal']},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {node,'rabbit@ip-10-0-13-3.eu-west-1.compute.internal'},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {reductions,95303},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {reductions_details,[{rate,0.0}]},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {messages_ready,0},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {messages_ready_details,[{rate,0.0}]},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {messages_unacknowledged,0},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {messages_unacknowledged_details,[{rate,0.0}]},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {messages,0},
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>                        {messages_details,[{rate,0.0}]}]}
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>       in function  rabbit_mgmt_util:pget_bin/3 (rabbit_mgmt_util.erl, line 611)
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>       in call from rabbit_mgmt_util:sort_key/2 (rabbit_mgmt_util.erl, line 595)
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>       in call from rabbit_mgmt_util:'-sort_list/4-lc$^0/1-1-'/2 (rabbit_mgmt_util.erl, line 468)
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>       in call from rabbit_mgmt_util:'-sort_list/4-lc$^0/1-1-'/2 (rabbit_mgmt_util.erl, line 468)
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>       in call from rabbit_mgmt_util:sort_list/4 (rabbit_mgmt_util.erl, line 468)
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>       in call from rabbit_mgmt_util:sort/2 (rabbit_mgmt_util.erl, line 412)
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>       in call from rabbit_mgmt_util:run_augmentation/2 (rabbit_mgmt_util.erl, line 405)
2024-07-31 10:15:11.083037+00:00 [error] <0.511016.0>       in call from rabbit_mgmt_util:augment_resources0/6 (rabbit_mgmt_util.erl, line 394)

Reproduction steps

Unknown

Expected behavior

No crash

Additional context

The broker is running RabbitMQ 3.13.3.

Policy AWS-DEFAULT-CLASSIC-QUEUES-POLICY-CLUSTER-MULTI-AZ definition:

"definition": "{\"ha-mode\":\"all\",\"ha-sync-mode\":\"automatic\",\"max-length\":8000000,
\"overflow\":\"reject-publish\",
\"queue-mode\":\"lazy\"}"

alexnkdev added the bug label Aug 2, 2024
michaelklishin changed the title from "Case clause crash in mgmt plugin" to "HTTP API endpoint that returns queue metrics runs into an exception" Aug 2, 2024

@mkuratczyk
Contributor

Sounds similar to #10931. Perhaps this PR solved this problem for some, but not all metrics?

@dcorbacho
Contributor

@mkuratczyk There is a duplicated 'consumers' key, so this is not the same issue. #10931 deals with elements that contain lists and sorts them. I'm having a look at this one.
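
To illustrate why a duplicated key can surface as a "no case clause matching" error: the crash term in the log has the shape {Matching, Rest}, with both 'consumers' entries on the left. The Python sketch below mimics that failure mode under the assumption that the lookup partitions the list by key and only handles zero or one match. `lookup_strict` and `partition` are hypothetical names for this illustration, not the actual rabbit_mgmt_util code.

```python
def partition(pred, items):
    """Split items into (matching, rest), similar to Erlang's lists:partition/2."""
    matching = [x for x in items if pred(x)]
    rest = [x for x in items if not pred(x)]
    return matching, rest


def lookup_strict(key, pairs, default=None):
    """Hypothetical lookup that only expects zero or one entry per key."""
    matching, _rest = partition(lambda kv: kv[0] == key, pairs)
    if len(matching) == 1:
        return matching[0][1]
    if not matching:
        return default
    # Two or more entries: no branch handles this shape, so the lookup fails.
    raise LookupError(f"no clause matching {len(matching)} entries for {key!r}")


# Assumed sample data shaped like the crash term: 'consumers' appears twice.
queue = [("consumers", 0), ("memory", 7224), ("consumers", 0)]

if __name__ == "__main__":
    print(lookup_strict("memory", queue))
    try:
        lookup_strict("consumers", queue)
    except LookupError as err:
        print("lookup failed:", err)
```

Under that assumption, any sort on a column whose key is duplicated would fail the same way, regardless of whether the duplicated values agree.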

@SimonUnge
Member

SimonUnge commented Aug 5, 2024

@dcorbacho
We have seen similar issues with the metrics due to network partitions. Could that be related? We have seen occurrences where the mgmt queue API (/api/queues/VHOST/QUEUENAME) responds with duplicated entries/keys, such as more than one slave_nodes entry with different values, i.e. results like this:

{
  "consumer_details": [],
  "arguments": {
    "x-queue-type": "classic"
  },
  "auto_delete": false,
  "consumer_capacity": 0,
  "consumer_utilisation": 0,
  "consumers": 0,
  "deliveries": [],
  "slave_nodes": [nodeX, nodeY],
  "slave_nodes": [nodeY],
  ...
}

Which, of course, is faulty JSON and just weird. I assume something goes wrong somewhere in the logic that queries the different nodes (which, due to a partition, ended up with different world views) and merges the results, e.g. in rabbit_mgmt_db:augment_queues?
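
The hypothesis above can be pictured with a small Python sketch. It is purely illustrative and does not reflect how rabbit_mgmt_db:augment_queues actually merges results. It only shows that concatenating per-node key/value lists keeps both copies of a key when two nodes disagree, while a keyed merge keeps one. The function names and sample views are made up for the example.

```python
def merge_by_concatenation(*node_views):
    """Naive merge: append every node's (key, value) pairs as-is."""
    merged = []
    for view in node_views:
        merged.extend(view)
    return merged


def merge_by_key(*node_views):
    """Keyed merge: later views overwrite earlier ones, so keys stay unique."""
    merged = {}
    for view in node_views:
        merged.update(view)  # dict.update accepts an iterable of pairs
    return list(merged.items())


# Assumed views from two nodes that disagree after a partition.
view_a = [("name", "q1"), ("slave_nodes", ["nodeX", "nodeY"])]
view_b = [("slave_nodes", ["nodeY"])]

if __name__ == "__main__":
    print(merge_by_concatenation(view_a, view_b))
    print(merge_by_key(view_a, view_b))
```

"Last view wins" is only one possible resolution and is not necessarily the right one for conflicting node views. The point is just that the merge step has to decide, rather than pass both values through.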

Unfortunately, I have not been able to reproduce it, or debug it while it is actually happening.
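
One way to catch this while it is happening is to check raw API response bodies for repeated keys. Many JSON parsers silently keep the last value for a repeated key, which hides the problem. Python's json.loads does this by default, but its object_pairs_hook parameter exposes every pair before they are collapsed. A minimal sketch follows. Fetching the body from /api/queues/VHOST/QUEUENAME is left out, and the sample string is an assumption modelled on the output above.

```python
import json
from collections import Counter


def find_duplicate_keys(body):
    """Return the sorted keys that appear more than once in any JSON object."""
    duplicates = []

    def hook(pairs):
        # Called for every JSON object with its (key, value) pairs in order.
        counts = Counter(key for key, _ in pairs)
        duplicates.extend(k for k, n in counts.items() if n > 1)
        return dict(pairs)

    json.loads(body, object_pairs_hook=hook)
    return sorted(set(duplicates))


# Assumed sample body; real node names would replace the placeholders.
sample = (
    '{"consumers": 0, '
    '"slave_nodes": ["nodeX", "nodeY"], '
    '"slave_nodes": ["nodeY"]}'
)

if __name__ == "__main__":
    print(find_duplicate_keys(sample))
```

Running such a check periodically and saving the raw body and a timestamp whenever it reports a non-empty list could help correlate the duplicates with partition events in the broker logs.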

@dcorbacho
Contributor

@alexnkdev @SimonUnge If you can provide the steps to reproduce, we might be able to look into this. The metrics subsystem is quite complex, and it is not obvious to me where we are introducing the duplicates. I could not reproduce it either. Also, a patch would be very welcome ;)

@alexnkdev
Author

@dcorbacho We have more logs. I'll try to understand what happened and see if we can reproduce it.
