Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Add metrics for event processing lag #3090

Merged
merged 9 commits into from
Apr 12, 2018
Merged

Conversation

erikjohnston
Copy link
Member

Adds metrics for tracking how far behind the appservice and federation sending loops are. There are threee:

  1. The stream positions of the last processed events, and stream position for last persisted event. This allows getting the lag in terms of delta of event stream positions.
  2. The received_ts of the last processed event. This allows us to get an estimate of where the processing is up to time wise.
  3. How long it took from being written to the db to being processed for the last event processed, i.e. its difference between the time the event was processed and its received_ts.

Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally looks good. a few minor comments.

# dict[list[str]]: value for each set of label values. the keys are the
# label values, in the same order as the labels in self.labels.
#
# (if the metric is a scalar, the (single) key is the empty list).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/list/tuple/, fwiw


self.guages[values] = v

def render(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feels like a bit of a shame not to share this with CounterMetric with a baseclass

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, though I think in this case its more hassle than its worth. Another option would be to make the counter and gauge class the same and have it support both set and inc, but then you lose the easy ability to see if a metric is meant to be a counter or gauge. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like them being separate. Happy for you to leave the code c&ped for now.

event_id (str)

Returns:
Deferred[int|None]: Timstamp in milliseconds, or None for events
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Timstamp


if events:
now = self.clock.time_msec()
ts = yield self.store.get_received_ts(events[-1].event_id)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems a shame to have to do a db hit for this, but I guess working around it is a faff.

@@ -239,6 +239,21 @@ def handle_room_events(events):
"events", next_token
)

synapse.metrics.event_processing_positions.set(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we move the events_processed_counter update here so that all the metrics updates are collected together?

@@ -111,6 +111,20 @@ def notify_interested_services(self, current_id):
events_processed_counter.inc_by(len(events))

yield self.store.set_appservice_last_pos(upper_bound)

synapse.metrics.event_processing_positions.set(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again with events_processed_counter

Returns:
Deferred[int|None]: Timstamp in milliseconds, or None for events
that were persisted before received_ts was implemented.
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens if the event id is unknown? exception, presumably?

@richvdh richvdh assigned erikjohnston and unassigned richvdh Apr 12, 2018
@erikjohnston erikjohnston assigned richvdh and unassigned erikjohnston Apr 12, 2018
@@ -52,13 +52,14 @@

class EventsWorkerStore(SQLBaseStore):
def get_received_ts(self, event_id):
"""Get received_ts (when it was persisted) for the event
"""Get received_ts (when it was persisted) for the event. Raises an
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could have done with being a separate paragraph, ftr, since the first line is slightly special (https://www.python.org/dev/peps/pep-0257/#multi-line-docstrings)

Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@richvdh richvdh assigned erikjohnston and unassigned richvdh Apr 12, 2018
@erikjohnston erikjohnston merged commit 0f13f30 into develop Apr 12, 2018
neilisfragile added a commit that referenced this pull request Apr 27, 2018
Changes in synapse v0.28.0-rc1 (2018-04-26)
===========================================

Bug Fixes:

* Fix quarantine media admin API and search reindex (PR #3130)
* Fix media admin APIs (PR #3134)

Changes in synapse v0.28.0-rc1 (2018-04-24)
===========================================

Minor performance improvement to federation sending and bug fixes.

(Note: This release does not include state resolutions discussed in matrix live)

Features:

* Add metrics for event processing lag (PR #3090)
* Add metrics for ResponseCache (PR #3092)

Changes:

* Synapse on PyPy (PR #2760) Thanks to @Valodim!
* move handling of auto_join_rooms to RegisterHandler (PR #2996) Thanks to @krombel!
* Improve handling of SRV records for federation connections (PR #3016) Thanks to @silkeh!
* Document the behaviour of ResponseCache (PR #3059)
* Preparation for py3 (PR #3061, #3073, #3074, #3075, #3103, #3104, #3106, #3107, #3109, #3110) Thanks to @NotAFile!
* update prometheus dashboard to use new metric names (PR #3069) Thanks to @krombel!
* use python3-compatible prints (PR #3074) Thanks to @NotAFile!
* Send federation events concurrently (PR #3078)
* Limit concurrent event sends for a room (PR #3079)
* Improve R30 stat definition (PR #3086)
* Send events to ASes concurrently (PR #3088)
* Refactor ResponseCache usage (PR #3093)
* Clarify that SRV may not point to a CNAME (PR #3100) Thanks to @silkeh!
* Use str(e) instead of e.message (PR #3103) Thanks to @NotAFile!
* Use six.itervalues in some places (PR #3106) Thanks to @NotAFile!
* Refactor store.have_events (PR #3117)

Bug Fixes:

* Return 401 for invalid access_token on logout (PR #2938) Thanks to @dklug!
* Return a 404 rather than a 500 on rejoining empty rooms (PR #3080)
* fix federation_domain_whitelist (PR #3099)
* Avoid creating events with huge numbers of prev_events (PR #3113)
* Reject events which have lots of prev_events (PR #3118)
@erikjohnston erikjohnston deleted the erikj/processed_event_lag branch September 20, 2018 13:58
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants