Add metric for find-syncs and fix reconciliation queue history #3378
whole thing is super clean, i like what you did with GAUGE_INC
sadly per my comment, i'm not sure this will scale with the cardinality of primary-secondary-result combinations
lmk what you think, but likely makes sense to simplify
@SidSethi thanks for the review! I responded to the cardinality issue -- thinking if we should try to at least expose some info about which nodes are causing the metrics. maybe logging when a metric is recorded would be a good middle ground to let us dive into why/where certain values happen without overloading Prometheus? lmk what you think and I'll make some changes 🙂
agreed - i think as best practice, every time a metric is recorded there should be a corresponding log, so that seems like the right place to log the primary-secondary info. wdyt?
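To make the cardinality concern concrete, here is a rough sketch of how the number of time series grows when node endpoints are used as labels. The helper name `timeSeriesCount` and the node count are hypothetical and chosen only for illustration; the result count matches the ten `result` values listed in the PR description.

```javascript
// Hypothetical helper: a metric can produce at most one time series per
// combination of label values, so the upper bound is the product of the
// number of distinct values each label can take.
function timeSeriesCount(labelValueCounts) {
  return Object.values(labelValueCounts).reduce((total, n) => total * n, 1)
}

// Assumed numbers for illustration only (not taken from the PR).
const NODE_COUNT = 50
const RESULT_COUNT = 10 // the result values listed in the PR description

// Upper bound when primary and secondary endpoints are labels.
const withNodeLabels = timeSeriesCount({
  primary: NODE_COUNT,
  secondary: NODE_COUNT - 1, // assumes a node is never its own secondary
  result: RESULT_COUNT
})

// Upper bound when only the result is a label.
const resultOnly = timeSeriesCount({ result: RESULT_COUNT })

console.log({ withNodeLabels, resultOnly })
```

Under these assumed numbers the gap is a few orders of magnitude, which is why keeping only `result` as a label and moving the endpoints into logs is the cheaper option.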
diff looks good - main thing is the cardinality / logging
accidentally got some reformatting in the findSyncRequests test file but it should be safe to ignore as long as it's passing. no big changes made to it -- just resolving merge conflicts
nice - glad tests are passing - if major changes, maybe worth running through PR description "Tests" again, but can do async, excited to get these released
Description
Adds the metric `audius_cn_find_sync_request_counts` for finding syncs. This will give us insight into when monitoring jobs find or don't find syncs for any of the following reasons:
- `not_checked`: Default value -- means the logic short-circuited before checking if the primary should sync to the secondary. This can be expected if this node wasn't the user's primary
- `no_sync_already_marked_unhealthy`: Sync not found because the secondary was marked unhealthy before being passed to the find-sync-requests job
- `no_sync_sp_id_mismatch`: Sync not found because the secondary's spID mismatched what the chain reported
- `no_sync_success_rate_too_low`: Sync not found because the success rate of syncing to this secondary is below the acceptable threshold
- `no_sync_secondary_data_matches_primary`: Sync not found because the secondary's clock value was greater than or equal to the primary's clock value
- `no_sync_unexpected_error`: Sync not found because some uncaught error was thrown
- `new_sync_request_enqueued_primary_to_secondary`: Sync was found from primary->secondary because all other conditions were met and primary clock value was greater than secondary
- `new_sync_request_enqueued_secondary_to_primary`: Sync was found from secondary->primary because all other conditions were met and secondary clock value was greater than primary
- `sync_request_already_enqueued`: Sync was found but a duplicate request has already been enqueued so no need to enqueue another
- `new_sync_request_unable_to_enqueue`: Sync was found but something prevented a new request from being created
Tests
Updated tests to pass and verified that metrics were found locally with a user whose replica set needed updating due to me taking down a node in its replica set. Context + what the metric screenshots show in this case:
User's replica set was primary=cn4, secondaries=cn2 and cn3
audius_cn_state_machine_update_replica_set_job_duration_seconds_bucket{le="30",uncaughtError="false",issuedReconfig="true",reconfigType="one_secondary"} 1
audius_cn_find_sync_request_counts{primary="http://cn4_creator-node_1:4003",secondary="http://cn1_creator-node_1:4000",result="no_sync_secondary_clock_gte_primary"} 3
Note: the metric does not include `primary` or `secondary` as labels because that would create too many time series -- see the review discussion for details.
Monitoring - How will this change be monitored? Are there sufficient logs / alerts?
- Check logs for `Metric label has invalid` or `Error processing job` errors
- Check the `/prometheus_metrics` endpoint to see what reasons syncs are being found or not found -- in the metric `audius_cn_find_sync_request_counts`
- Search logs for `Recorded findSyncRequests from` to see which primary/secondary caused the metric to be incremented
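
The pattern agreed on in review (count by `result` only, and log the primary/secondary each time the metric is recorded) could look roughly like the sketch below. This is not the PR's implementation: `makeFindSyncRecorder` is a hypothetical in-memory stand-in for the Prometheus metric, the error message is made up, and only the `Recorded findSyncRequests from` prefix of the log line comes from the description above; the rest of the message is an assumption.

```javascript
// Allowed values for the `result` label, copied from the PR description.
const FIND_SYNC_RESULTS = new Set([
  'not_checked',
  'no_sync_already_marked_unhealthy',
  'no_sync_sp_id_mismatch',
  'no_sync_success_rate_too_low',
  'no_sync_secondary_data_matches_primary',
  'no_sync_unexpected_error',
  'new_sync_request_enqueued_primary_to_secondary',
  'new_sync_request_enqueued_secondary_to_primary',
  'sync_request_already_enqueued',
  'new_sync_request_unable_to_enqueue'
])

// Hypothetical stand-in for a metric that is labelled only by `result`.
function makeFindSyncRecorder(logger) {
  const counts = new Map()
  return {
    record({ primary, secondary, result }) {
      // Reject unknown label values up front (error text is illustrative).
      if (!FIND_SYNC_RESULTS.has(result)) {
        throw new Error(`Unknown find-sync result: ${result}`)
      }
      counts.set(result, (counts.get(result) || 0) + 1)
      // Node endpoints go to the log, not to metric labels, so the number
      // of time series stays bounded by the number of result values.
      logger.info(
        `Recorded findSyncRequests from ${primary} to ${secondary} with result ${result}`
      )
    },
    getCount(result) {
      return counts.get(result) || 0
    }
  }
}

// Usage with a logger that collects lines in memory.
const logLines = []
const recorder = makeFindSyncRecorder({ info: (msg) => logLines.push(msg) })
recorder.record({
  primary: 'http://cn4_creator-node_1:4003',
  secondary: 'http://cn1_creator-node_1:4000',
  result: 'no_sync_secondary_data_matches_primary'
})
```

With this shape, the metric answers "how often does each result happen" while the log line answers "which nodes were involved", which matches the split described in the Monitoring notes.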