-
Notifications
You must be signed in to change notification settings - Fork 428
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Instrument mod_mam #4224
Instrument mod_mam #4224
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## feature/instrument #4224 +/- ##
======================================================
- Coverage 84.44% 84.43% -0.02%
======================================================
Files 557 556 -1
Lines 33677 33633 -44
======================================================
- Hits 28440 28399 -41
+ Misses 5237 5234 -3 ☔ View full report in Codecov by Sentry. |
be310ee
to
35204ed
Compare
5f6ebe4
to
184dad2
Compare
This comment was marked as outdated.
This comment was marked as outdated.
184dad2
to
034a519
Compare
This comment was marked as outdated.
This comment was marked as outdated.
582da60
to
277171b
Compare
034a519
to
3c333fb
Compare
This comment was marked as outdated.
This comment was marked as outdated.
3c333fb
to
5ab548c
Compare
This comment was marked as outdated.
This comment was marked as outdated.
72a514f
to
031fb81
Compare
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
5fcfd8c
to
337eb9c
Compare
This comment was marked as outdated.
This comment was marked as outdated.
337eb9c
to
59399cc
Compare
This comment was marked as outdated.
This comment was marked as outdated.
0970da4
to
c4bfdff
Compare
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
Testing procedure: 1. In init_per_suite, call instrument_helper:start(DeclaredEvents). The helper installs instrument_event_table on mim(), which will collect events with matching EventName and Labels. 2. In the tests, call instrument_helper:assert/3. The function fails if there is no matching event. 3. In end_per_suite, call instrument_helper:stop/0, which deletes the collected events, and logs a summary of event status (tested/untested). Don't delete
Unify event names: mod_mam_pm_* and mod_mam_muc_* No functional changes other than metric renaming
All instrumentation is done with mongoose_instrument now.
Purpose: - More information in logs - Ability tomatch specific events in big tests
The checks in mongoose_instrument would cause an exception.
Only events, that are already triggered, are instrumented. Archive deletion was triggered only in end_per_testcase, so I decided to add it to two test cases, which already tested instrumentation. The plan is to extend the tests to all events in a separate task.
Also: get rid of a debug printout
c4bfdff
to
ce475da
Compare
elasticsearch_and_cassandra_26 / elasticsearch_and_cassandra_mnesia / ce475da small_tests_25 / small_tests / ce475da small_tests_26_arm64 / small_tests / ce475da small_tests_26 / small_tests / ce475da ldap_mnesia_25 / ldap_mnesia / ce475da dynamic_domains_pgsql_mnesia_26 / pgsql_mnesia / ce475da ldap_mnesia_26 / ldap_mnesia / ce475da dynamic_domains_pgsql_mnesia_25 / pgsql_mnesia / ce475da dynamic_domains_mysql_redis_26 / mysql_redis / ce475da internal_mnesia_26 / internal_mnesia / ce475da mysql_redis_26 / mysql_redis / ce475da pgsql_cets_26 / pgsql_cets / ce475da dynamic_domains_mssql_mnesia_26 / odbc_mssql_mnesia / ce475da graphql_muc_light_SUITE:admin_http:admin_muc_light:end_per_group{error,
{{unregistering_failed,
{amount,1},
{unregistered_items,
[{{<<"_admin_send_message_to_room_errors_1043">>,
[{escalus_event_mgr,<0.24567.0>},
{tc_name,admin_send_message_to_room_errors},
{escalus_cleaner,<0.24566.0>},
{watchdog,<0.24565.0>},
{muc_light_host,<<"muclight.domain.example.com">>},
{secondary_muc_light_host,<<"muclight.domain.example.org">>},
{protocol,http},
{schema_endpoint,admin},
{listener_opts,
#{module => mongoose_graphql_handler,path => "/api/graphql",
host => "localhost",username => <<"admin">>,
password => <<"secret">>,sse_idle_timeout => 3600000,
schema_endpoint => admin}},
{{ejabberd_cwd,mongooseim@localhost},
"/home/circleci/project/_build/mim1/rel/mongooseim"},
{preset,"odbc_mssql_mnesia"},
{mim_data_dir,
"/home/circleci/project/big_tests/tests/graphql_muc_light_SUITE_data"},
{tc_logfile,
"https://circleci-mim-results.s3.eu-central-1.amazonaws.com/PR/4224/212820/odbc_mssql_mnesia.26.1.2-amd64/big/ct_run.test%4096a0c1a898e6.2024-03-18_09.08.02/big_tests.tests.graphql_muc_light_SUITE.logs/run.2024-03-18_09.12.27/graphql_muc_light_suite.admin_send_message_to_room_errors.html"},
{tc_group_properties,[{name,admin_muc_light}]},
{tc_group_path,[[{name,admin_http}]]},
{data_dir,
"/home/circleci/project/big_tests/_build/default/lib/mongoose_tests/ebin/graphql_muc_light_SUITE_data/"},
{priv_dir,
"https://circleci-mim-results.s3.eu-central-1.amazonaws.com/PR/4224/212820/odbc_mssql_mnesia.26.1.2-amd64/big/ct_run.test%4096a0c1a898e6.2024-03-18_09.08.02/big_tests.te... graphql_muc_light_SUITE:admin_cli:admin_muc_light:admin_create_room_with_unprepped_id{error,
{#{what => invalid_response_code,expected_type => ok,
response_code => {exit_status,1}},
[{graphql_helper,assert_response_code,2,
[{file,"/home/circleci/project/big_tests/tests/graphql_helper.erl"},
{line,256}]},
{graphql_helper,get_ok_value,2,
[{file,"/home/circleci/project/big_tests/tests/graphql_helper.erl"},
{line,239}]},
{graphql_muc_light_SUITE,admin_create_room_with_unprepped_id,1,
[{file,
"/home/circleci/project/big_tests/tests/graphql_muc_light_SUITE.erl"},
{line,1156}]},
{test_server,ts_tc,3,[{file,"test_server.erl"},{line,1793}]},
{test_server,run_test_case_eval1,6,
[{file,"test_server.erl"},{line,1302}]},
{test_server,run_test_case_eval,9,
[{file,"test_server.erl"},{line,1234}]}]}} graphql_muc_light_SUITE:domain_admin:domain_admin_muc_light:admin_create_room_with_unprepped_id{error,
{{badmatch,null},
[{graphql_muc_light_SUITE,admin_create_room_with_unprepped_id,1,
[{file,
"/home/circleci/project/big_tests/tests/graphql_muc_light_SUITE.erl"},
{line,1155}]},
{test_server,ts_tc,3,[{file,"test_server.erl"},{line,1793}]},
{test_server,run_test_case_eval1,6,
[{file,"test_server.erl"},{line,1302}]},
{test_server,run_test_case_eval,9,
[{file,"test_server.erl"},{line,1234}]}]}} pgsql_mnesia_25 / pgsql_mnesia / ce475da pgsql_mnesia_26 / pgsql_mnesia / ce475da mssql_mnesia_26 / odbc_mssql_mnesia / ce475da dynamic_domains_mssql_mnesia_26 / odbc_mssql_mnesia / ce475da |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few small comments, but mostly reduced to already documenting any breaking change together with the PR introducing them (otherwise we're going to forget or potentially get confused down the road), and bypassing meck.
#{time => Time, | ||
time_per_message => round(Time / MessageCount), | ||
count => MessageCount} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking that this time_per_message
key is too specific here, whoever will collect this data can already calculate so from the time
and count
keys, so maybe it is redundant, wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be. I just made the assumption of not removing anything that was there already, and I would like to do such changes separately. Of course it makes sense, and I thought about the same actually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me, I just hope not to bypass any of the things we've discovered here and there once we arrive to the moment of touching it all. Make sure to have notes about these points.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have follow-up stories for MAM already, so we can add this. But actually someone might use time_per_message
, and it would be easier to switch the metric name than to also have to change the calculations? I think that "don't delete metrics - only rename or add" is still a good rule of thumb.
@@ -150,11 +147,13 @@ process_mam_iq(Acc, From, To, IQ, _Extra) -> | |||
?LOG_WARNING(#{what => mam_max_delay_reached, | |||
text => <<"Return max_delay_reached error IQ from MAM">>, | |||
action => Action, acc => Acc}), | |||
mongoose_metrics:update(HostType, modMamDroppedIQ, 1), | |||
mongoose_instrument:execute(mod_mam_pm_dropped_iq, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that this will change already the name of the metric in exometer, right? Don't we want to document these kinds of breaking changes already with the PR introducing them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we could just document all metrics at once as well, because almost everything will change - that was my reasoning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above 👍🏽
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have the story for docs already, and it was my plan to do it there.
@@ -1851,10 +1861,6 @@ archived(Config) -> | |||
#forwarded_message{result_id=ArcId} = parse_forwarded_message(ArcMsg), | |||
?assert_equal(Id, ArcId), | |||
ok |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also remove that ok
if we're already removing dead code here 🤷🏽♂️
meck:new(mongoose_instrument_log, [no_link, passthrough]), | ||
meck:expect(mongoose_instrument_log, handle_event, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... There should be a better way than meck here. If we're already rpc-inserting a module here, and we're creating an ets table in the remote node, we can just rpc-insert a module that implements the mongoose_instrument
behaviour already and does the ets inserts on its own, entirely bypassing meck.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are many ways without meck, and I tried the following:
- Implementing
mongoose_instrument_*
is not easy, because all instrumentation is set up on startup. You would need to restartmongooseim
. - Adding a special list of global handlers. This adds extra code executed for each event, that is only for tests, so I don't like it.
- Log handler. This required a lot of work to silence the debug logs by reconfiguring existing handlers, and looked really messy. But actually a debug handler is what I would do if I had to get rid of meck.
I implemented all three of them, and the current one is the fourth attempt. I really don't see a cleaner solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unfortunate. Points 2 and 3 aren't good at all from my point of view, so I'd discard them before even trying. Point 1 is the one that sounds best to me, wonder if there's a way to rework how handlers are initialised so that they can be injected dynamically without restarting MIM. I'll make a note for this one, it can be in a separate PR, it's well encapsulated in the instrumentation helper modules.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't add this functionality, because it's not needed during normal operation, and I didn't want something just for tests. Also, we want to avoid the situation where configuration (mongoose_config
) is different from actual setup (an ad-hoc added handler). So it would be something like dynamic_modules
, but for handlers. It's doable, but do we really need it?
This PR updates
mod_mam
to usemongoose_instrument
and adds instrumentation tests.mod_mam
All metrics are updated to use
mongoose_instrument
. The naming scheme is simple: they start withmod_mam_pm
ormod_mam_muc
, followed by the event name.Measurements include some data like
params
andjid
. This is mostly to differentiate between the events in big tests. IMO we could add items here on demand. So far I limited the additions, but there doesn't seem to be a any real cost, because they are not copied anywhere. I decided neither not to put them in separate metadata, nor to put the metrics in a separate sub-map, but we could do that if we see benefit.Apart from that, I tried not to modify what is instrumented, but only how it's done - some parts of MAM could be instrumented in a better (more consistent) way.
Tests
Because it is the first time we are testing instrumentation, new test helpers were needed. The main helper module
instrument_helper
is started and stopped with the whole test suite.In
init_per_suite
, you callinstrument_helper:start(DeclaredEvents)
, providing a list of declared events ({EventName, Labels}
tuples). It injects and startsinstrument_event_table
on the tested node (currentlymim()
). Two ETS tables are initialized:instrument_event_table
will collect declared events with measurements onmim()
. Non-declared events are skipped.instrument_event_status_table
contains status (tested/untested) for each of the declared event tuples{EventName, Labels}
.In the tests, there are calls to
instrument_helper:assert/3
, which assert that a matching event with measurements is present, and mark{EventName, Labels}
as tested.In
end_per_suite
, you callinstrument_helper:stop()
, which checks the status of tested/untested events.If any events were declared and logged (collected in
instrument_event_table
), but were not tested withassert/3
,end_per_suite
would fail.Events that were declared, but not logged, indicate functionality, that is not covered by any tests. We have such functionality, and this is why currently there is no fail in this case. IMO, we could change this later. Alternatively, we could explicitly list all uncovered instrumentation, but it would be a bit tedious.
Metric tests, which checked exometer metrics replaced by the new instrumentation, are now removed.
Current test coverage
This is the current report from
end_per_suite
. There is no fail, because the untested events were not logged, i.e. the functionality is not covered by the tests.The following changes will be needed to cover all events, but they are excluded from this PR:
prefs
for MUC. The tests seem to be completely absent. It requires a separate story.dropped
events. The events are inconsistent, e.g. they occur only for async writers. They should be amended and tested.