Instrument mod_mam #4224

chrzaszcz · 2024-02-15T09:27:52Z

This PR updates mod_mam to use mongoose_instrument and adds instrumentation tests.

`mod_mam`

All metrics are updated to use mongoose_instrument. The naming scheme is simple: they start with mod_mam_pm or mod_mam_muc, followed by the event name.

Measurements include some data like params and jid. This is mostly to differentiate between the events in big tests. IMO we could add items here on demand. So far I limited the additions, but there doesn't seem to be a any real cost, because they are not copied anywhere. I decided neither not to put them in separate metadata, nor to put the metrics in a separate sub-map, but we could do that if we see benefit.

Apart from that, I tried not to modify what is instrumented, but only how it's done - some parts of MAM could be instrumented in a better (more consistent) way.

Tests

Because it is the first time we are testing instrumentation, new test helpers were needed. The main helper module instrument_helper is started and stopped with the whole test suite.

In init_per_suite, you call instrument_helper:start(DeclaredEvents), providing a list of declared events ({EventName, Labels} tuples). It injects and starts instrument_event_table on the tested node (currently mim()). Two ETS tables are initialized:

instrument_event_table will collect declared events with measurements on mim(). Non-declared events are skipped.
instrument_event_status_table contains status (tested/untested) for each of the declared event tuples{EventName, Labels}.

In the tests, there are calls to instrument_helper:assert/3, which assert that a matching event with measurements is present, and mark {EventName, Labels} as tested.

In end_per_suite, you call instrument_helper:stop(), which checks the status of tested/untested events.

If any events were declared and logged (collected in instrument_event_table), but were not tested with assert/3, end_per_suite would fail.

Events that were declared, but not logged, indicate functionality, that is not covered by any tests. We have such functionality, and this is why currently there is no fail in this case. IMO, we could change this later. Alternatively, we could explicitly list all uncovered instrumentation, but it would be a bit tedious.

Metric tests, which checked exometer metrics replaced by the new instrumentation, are now removed.

Current test coverage

This is the current report from end_per_suite. There is no fail, because the untested events were not logged, i.e. the functionality is not covered by the tests.

[🔗](file:///.../big_tests.tests.mam_SUITE.logs/run.2024-03-18_09.58.00/mam_suite.end_per_suite.html#e-35252)

Tested instrumentation events:
 [{mod_mam_muc_archive_message,#{host_type => <<"localhost">>}},
  {mod_mam_muc_dropped_iq,#{host_type => <<"localhost">>}},
  {mod_mam_muc_flushed,#{host_type => <<"localhost">>}},
  {mod_mam_muc_lookup,#{host_type => <<"localhost">>}},
  {mod_mam_muc_remove_archive,#{host_type => <<"localhost">>}},
  {mod_mam_pm_archive_message,#{host_type => <<"localhost">>}},
  {mod_mam_pm_dropped_iq,#{host_type => <<"localhost">>}},
  {mod_mam_pm_flushed,#{host_type => <<"localhost">>}},
  {mod_mam_pm_get_prefs,#{host_type => <<"localhost">>}},
  {mod_mam_pm_lookup,#{host_type => <<"localhost">>}},
  {mod_mam_pm_remove_archive,#{host_type => <<"localhost">>}},
  {mod_mam_pm_set_prefs,#{host_type => <<"localhost">>}}]


*** User 2024-03-18 09:58:40.225 ***
[🔗](file://.../big_tests/ct_report/ct_run.test@m.2024-03-18_09.57.59/big_tests.tests.mam_SUITE.logs/run.2024-03-18_09.58.00/mam_suite.end_per_suite.html#e-35253)

Instrumentation events that were not logged - functionality not covered by tests:
[{mod_mam_muc_dropped,#{host_type => <<"localhost">>}},
 {mod_mam_muc_get_prefs,#{host_type => <<"localhost">>}},
 {mod_mam_muc_set_prefs,#{host_type => <<"localhost">>}},
 {mod_mam_pm_dropped,#{host_type => <<"localhost">>}}]

The following changes will be needed to cover all events, but they are excluded from this PR:

Testing prefs for MUC. The tests seem to be completely absent. It requires a separate story.
Sorting out and testing the dropped events. The events are inconsistent, e.g. they occur only for async writers. They should be amended and tested.

codecov · 2024-02-15T09:36:01Z

Codecov Report

Attention: Patch coverage is 72.22222% with 15 lines in your changes are missing coverage. Please review.

Project coverage is 84.43%. Comparing base (7f5ee86) to head (ce475da).

Files	Patch %	Lines
src/mam/mod_mam_muc.erl	63.15%	7 Missing ⚠️
src/mam/mod_mam_muc_rdbms_arch_async.erl	60.00%	2 Missing ⚠️
src/mam/mod_mam_pm.erl	91.30%	2 Missing ⚠️
src/mam/mod_mam_rdbms_arch_async.erl	60.00%	2 Missing ⚠️
src/mam/mod_mam_elasticsearch_arch.erl	0.00%	1 Missing ⚠️
src/mam/mod_mam_muc_elasticsearch_arch.erl	0.00%	1 Missing ⚠️

Additional details and impacted files

@@                  Coverage Diff                   @@
##           feature/instrument    #4224      +/-   ##
======================================================
- Coverage               84.44%   84.43%   -0.02%     
======================================================
  Files                     557      556       -1     
  Lines                   33677    33633      -44     
======================================================
- Hits                    28440    28399      -41     
+ Misses                   5237     5234       -3

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Testing procedure: 1. In init_per_suite, call instrument_helper:start(DeclaredEvents). The helper installs instrument_event_table on mim(), which will collect events with matching EventName and Labels. 2. In the tests, call instrument_helper:assert/3. The function fails if there is no matching event. 3. In end_per_suite, call instrument_helper:stop/0, which deletes the collected events, and logs a summary of event status (tested/untested). Don't delete

Unify event names: mod_mam_pm_* and mod_mam_muc_* No functional changes other than metric renaming

All instrumentation is done with mongoose_instrument now.

Purpose: - More information in logs - Ability tomatch specific events in big tests

The checks in mongoose_instrument would cause an exception.

Only events, that are already triggered, are instrumented. Archive deletion was triggered only in end_per_testcase, so I decided to add it to two test cases, which already tested instrumentation. The plan is to extend the tests to all events in a separate task.

Also: get rid of a debug printout

mongoose-im · 2024-03-18T09:11:14Z

elasticsearch_and_cassandra_26 / elasticsearch_and_cassandra_mnesia / ce475da
Reports root/ big
OK: 435 / Failed: 0 / User-skipped: 41 / Auto-skipped: 0

small_tests_25 / small_tests / ce475da
Reports root / small

small_tests_26_arm64 / small_tests / ce475da
Reports root / small

small_tests_26 / small_tests / ce475da
Reports root / small

ldap_mnesia_25 / ldap_mnesia / ce475da
Reports root/ big
OK: 2275 / Failed: 0 / User-skipped: 895 / Auto-skipped: 0

dynamic_domains_pgsql_mnesia_26 / pgsql_mnesia / ce475da
Reports root/ big
OK: 4522 / Failed: 0 / User-skipped: 105 / Auto-skipped: 0

ldap_mnesia_26 / ldap_mnesia / ce475da
Reports root/ big
OK: 2275 / Failed: 0 / User-skipped: 895 / Auto-skipped: 0

dynamic_domains_pgsql_mnesia_25 / pgsql_mnesia / ce475da
Reports root/ big
OK: 4522 / Failed: 0 / User-skipped: 105 / Auto-skipped: 0

dynamic_domains_mysql_redis_26 / mysql_redis / ce475da
Reports root/ big
OK: 4489 / Failed: 0 / User-skipped: 138 / Auto-skipped: 0

internal_mnesia_26 / internal_mnesia / ce475da
Reports root/ big
OK: 2415 / Failed: 0 / User-skipped: 755 / Auto-skipped: 0

mysql_redis_26 / mysql_redis / ce475da
Reports root/ big
OK: 4890 / Failed: 0 / User-skipped: 133 / Auto-skipped: 0

pgsql_cets_26 / pgsql_cets / ce475da
Reports root/ big
OK: 4439 / Failed: 0 / User-skipped: 174 / Auto-skipped: 0

dynamic_domains_mssql_mnesia_26 / odbc_mssql_mnesia / ce475da
Reports root/ big
OK: 4517 / Failed: 2 / User-skipped: 108 / Auto-skipped: 0

graphql_muc_light_SUITE:admin_http:admin_muc_light:end_per_group

{error,
 {{unregistering_failed,
   {amount,1},
   {unregistered_items,
  [{{<<"_admin_send_message_to_room_errors_1043">>,
     [{escalus_event_mgr,<0.24567.0>},
    {tc_name,admin_send_message_to_room_errors},
    {escalus_cleaner,<0.24566.0>},
    {watchdog,<0.24565.0>},
    {muc_light_host,<<"muclight.domain.example.com">>},
    {secondary_muc_light_host,<<"muclight.domain.example.org">>},
    {protocol,http},
    {schema_endpoint,admin},
    {listener_opts,
     #{module => mongoose_graphql_handler,path => "/api/graphql",
       host => "localhost",username => <<"admin">>,
       password => <<"secret">>,sse_idle_timeout => 3600000,
       schema_endpoint => admin}},
    {{ejabberd_cwd,mongooseim@localhost},
     "/home/circleci/project/_build/mim1/rel/mongooseim"},
    {preset,"odbc_mssql_mnesia"},
    {mim_data_dir,
     "/home/circleci/project/big_tests/tests/graphql_muc_light_SUITE_data"},
    {tc_logfile,
     "https://circleci-mim-results.s3.eu-central-1.amazonaws.com/PR/4224/212820/odbc_mssql_mnesia.26.1.2-amd64/big/ct_run.test%4096a0c1a898e6.2024-03-18_09.08.02/big_tests.tests.graphql_muc_light_SUITE.logs/run.2024-03-18_09.12.27/graphql_muc_light_suite.admin_send_message_to_room_errors.html"},
    {tc_group_properties,[{name,admin_muc_light}]},
    {tc_group_path,[[{name,admin_http}]]},
    {data_dir,
     "/home/circleci/project/big_tests/_build/default/lib/mongoose_tests/ebin/graphql_muc_light_SUITE_data/"},
    {priv_dir,
     "https://circleci-mim-results.s3.eu-central-1.amazonaws.com/PR/4224/212820/odbc_mssql_mnesia.26.1.2-amd64/big/ct_run.test%4096a0c1a898e6.2024-03-18_09.08.02/big_tests.te...

Report log

graphql_muc_light_SUITE:admin_cli:admin_muc_light:admin_create_room_with_unprepped_id

{error,
  {#{what => invalid_response_code,expected_type => ok,
     response_code => {exit_status,1}},
   [{graphql_helper,assert_response_code,2,
      [{file,"/home/circleci/project/big_tests/tests/graphql_helper.erl"},
       {line,256}]},
    {graphql_helper,get_ok_value,2,
      [{file,"/home/circleci/project/big_tests/tests/graphql_helper.erl"},
       {line,239}]},
    {graphql_muc_light_SUITE,admin_create_room_with_unprepped_id,1,
      [{file,
         "/home/circleci/project/big_tests/tests/graphql_muc_light_SUITE.erl"},
       {line,1156}]},
    {test_server,ts_tc,3,[{file,"test_server.erl"},{line,1793}]},
    {test_server,run_test_case_eval1,6,
      [{file,"test_server.erl"},{line,1302}]},
    {test_server,run_test_case_eval,9,
      [{file,"test_server.erl"},{line,1234}]}]}}

Report log

graphql_muc_light_SUITE:domain_admin:domain_admin_muc_light:admin_create_room_with_unprepped_id

{error,
  {{badmatch,null},
   [{graphql_muc_light_SUITE,admin_create_room_with_unprepped_id,1,
      [{file,
         "/home/circleci/project/big_tests/tests/graphql_muc_light_SUITE.erl"},
       {line,1155}]},
    {test_server,ts_tc,3,[{file,"test_server.erl"},{line,1793}]},
    {test_server,run_test_case_eval1,6,
      [{file,"test_server.erl"},{line,1302}]},
    {test_server,run_test_case_eval,9,
      [{file,"test_server.erl"},{line,1234}]}]}}

Report log

pgsql_mnesia_25 / pgsql_mnesia / ce475da
Reports root/ big
OK: 4911 / Failed: 0 / User-skipped: 112 / Auto-skipped: 0

pgsql_mnesia_26 / pgsql_mnesia / ce475da
Reports root/ big
OK: 4911 / Failed: 0 / User-skipped: 112 / Auto-skipped: 0

mssql_mnesia_26 / odbc_mssql_mnesia / ce475da
Reports root/ big
OK: 4908 / Failed: 0 / User-skipped: 115 / Auto-skipped: 0

dynamic_domains_mssql_mnesia_26 / odbc_mssql_mnesia / ce475da
Reports root/ big
OK: 4519 / Failed: 0 / User-skipped: 108 / Auto-skipped: 0

NelsonVides

A few small comments, but mostly reduced to already documenting any breaking change together with the PR introducing them (otherwise we're going to forget or potentially get confused down the road), and bypassing meck.

NelsonVides · 2024-03-19T16:39:49Z

src/mam/mod_mam_rdbms_arch_async.erl

+                                     #{time => Time,
+                                       time_per_message => round(Time / MessageCount),
+                                       count => MessageCount}


I'm thinking that this time_per_message key is too specific here, whoever will collect this data can already calculate so from the time and count keys, so maybe it is redundant, wdyt?

It might be. I just made the assumption of not removing anything that was there already, and I would like to do such changes separately. Of course it makes sense, and I thought about the same actually.

Sounds good to me, I just hope not to bypass any of the things we've discovered here and there once we arrive to the moment of touching it all. Make sure to have notes about these points.

We have follow-up stories for MAM already, so we can add this. But actually someone might use time_per_message, and it would be easier to switch the metric name than to also have to change the calculations? I think that "don't delete metrics - only rename or add" is still a good rule of thumb.

NelsonVides · 2024-03-19T16:47:02Z

src/mam/mod_mam_pm.erl

@@ -150,11 +147,13 @@ process_mam_iq(Acc, From, To, IQ, _Extra) ->
                    ?LOG_WARNING(#{what => mam_max_delay_reached,
                                   text => <<"Return max_delay_reached error IQ from MAM">>,
                                   action => Action, acc => Acc}),
-                    mongoose_metrics:update(HostType, modMamDroppedIQ, 1),
+                    mongoose_instrument:execute(mod_mam_pm_dropped_iq,


I understand that this will change already the name of the metric in exometer, right? Don't we want to document these kinds of breaking changes already with the PR introducing them?

I thought we could just document all metrics at once as well, because almost everything will change - that was my reasoning.

Same as above 👍🏽

We have the story for docs already, and it was my plan to do it there.

NelsonVides · 2024-03-19T19:09:11Z

big_tests/tests/mam_SUITE.erl

@@ -1851,10 +1861,6 @@ archived(Config) ->
        #forwarded_message{result_id=ArcId} = parse_forwarded_message(ArcMsg),
        ?assert_equal(Id, ArcId),
        ok


We could also remove that ok if we're already removing dead code here 🤷🏽‍♂️

NelsonVides · 2024-03-19T19:12:35Z

big_tests/tests/instrument_event_table.erl

+    meck:new(mongoose_instrument_log, [no_link, passthrough]),
+    meck:expect(mongoose_instrument_log, handle_event,


Hmm... There should be a better way than meck here. If we're already rpc-inserting a module here, and we're creating an ets table in the remote node, we can just rpc-insert a module that implements the mongoose_instrument behaviour already and does the ets inserts on its own, entirely bypassing meck.

There are many ways without meck, and I tried the following:

Implementing mongoose_instrument_* is not easy, because all instrumentation is set up on startup. You would need to restart mongooseim.

Adding a special list of global handlers. This adds extra code executed for each event, that is only for tests, so I don't like it.

Log handler. This required a lot of work to silence the debug logs by reconfiguring existing handlers, and looked really messy. But actually a debug handler is what I would do if I had to get rid of meck.

I implemented all three of them, and the current one is the fourth attempt. I really don't see a cleaner solution.

This is unfortunate. Points 2 and 3 aren't good at all from my point of view, so I'd discard them before even trying. Point 1 is the one that sounds best to me, wonder if there's a way to rework how handlers are initialised so that they can be injected dynamically without restarting MIM. I'll make a note for this one, it can be in a separate PR, it's well encapsulated in the instrumentation helper modules.

I didn't add this functionality, because it's not needed during normal operation, and I didn't want something just for tests. Also, we want to avoid the situation where configuration (mongoose_config) is different from actual setup (an ad-hoc added handler). So it would be something like dynamic_modules, but for handlers. It's doable, but do we really need it?

This was referenced Feb 15, 2024

Instrumentation PoC (WIP) #4216

Closed

Instrument PoC #4223

Merged