Add metric for find-syncs and fix reconciliation queue history #3378

Merged
merged 7 commits into from
Jul 6, 2022

@theoilie theoilie commented Jul 5, 2022

Description

  • Add a Prometheus gauge, audius_cn_find_sync_request_counts, for the find-sync-requests job. It shows how often monitoring jobs find or don't find a sync, broken down by one of the following results:
    • not_checked: Default value -- means the logic short-circuited before checking if the primary should sync to the secondary. This can be expected if this node wasn't the user's primary
    • no_sync_already_marked_unhealthy: Sync not found because the secondary was marked unhealthy before being passed to the find-sync-requests job
    • no_sync_sp_id_mismatch: Sync not found because the secondary's spID mismatched what the chain reported
    • no_sync_success_rate_too_low: Sync not found because the success rate of syncing to this secondary is below the acceptable threshold
    • no_sync_secondary_data_matches_primary: Sync not found because the secondary's clock value was greater than or equal to the primary's clock value
    • no_sync_unexpected_error: Sync not found because some uncaught error was thrown
    • new_sync_request_enqueued_primary_to_secondary: Sync was found from primary->secondary because all other conditions were met and primary clock value was greater than secondary
    • new_sync_request_enqueued_secondary_to_primary: Sync was found from secondary->primary because all other conditions were met and secondary clock value was greater than primary
    • sync_request_already_enqueued: Sync was found but a duplicate request has already been enqueued so no need to enqueue another
    • new_sync_request_unable_to_enqueue: Sync was found but something prevented a new request from being created
  • Fix reconciliation queue not deleting completed job history (this bug was introduced during a renaming)
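
The result values listed above can be read as a decision sequence. The following is a minimal sketch of one way to order those checks, not the code in this PR: `SecondaryCheckInput`, `classifySecondary`, and every field name are hypothetical, and the order of the checks is an assumption. The description overlaps when the secondary's clock is greater than the primary's, so this sketch assumes equal clocks mean "data matches" and a strictly greater secondary clock means a secondary->primary sync.

```typescript
// Result strings are copied from the PR description; everything else is hypothetical.
type FindSyncResult =
  | "not_checked"
  | "no_sync_already_marked_unhealthy"
  | "no_sync_sp_id_mismatch"
  | "no_sync_success_rate_too_low"
  | "no_sync_secondary_data_matches_primary"
  | "no_sync_unexpected_error"
  | "new_sync_request_enqueued_primary_to_secondary"
  | "new_sync_request_enqueued_secondary_to_primary"
  | "sync_request_already_enqueued"
  | "new_sync_request_unable_to_enqueue";

// Hypothetical input shape describing one primary/secondary pair.
interface SecondaryCheckInput {
  thisNodeIsPrimary: boolean;
  secondaryMarkedUnhealthy: boolean;
  spIdMatchesChain: boolean;
  syncSuccessRate: number; // fraction in [0, 1]
  minSuccessRate: number; // acceptable threshold
  primaryClock: number;
  secondaryClock: number;
  duplicateAlreadyEnqueued: boolean;
  enqueue: () => boolean; // returns false if no request could be created
}

function classifySecondary(input: SecondaryCheckInput): FindSyncResult {
  try {
    // Short-circuit: the default result when this node is not the user's primary.
    if (!input.thisNodeIsPrimary) return "not_checked";
    if (input.secondaryMarkedUnhealthy) return "no_sync_already_marked_unhealthy";
    if (!input.spIdMatchesChain) return "no_sync_sp_id_mismatch";
    if (input.syncSuccessRate < input.minSuccessRate) return "no_sync_success_rate_too_low";
    // Assumption: only equal clocks count as "data matches".
    if (input.secondaryClock === input.primaryClock) {
      return "no_sync_secondary_data_matches_primary";
    }
    // A sync is needed from here on; the remaining results describe the enqueue outcome.
    if (input.duplicateAlreadyEnqueued) return "sync_request_already_enqueued";
    if (!input.enqueue()) return "new_sync_request_unable_to_enqueue";
    return input.primaryClock > input.secondaryClock
      ? "new_sync_request_enqueued_primary_to_secondary"
      : "new_sync_request_enqueued_secondary_to_primary";
  } catch {
    // Any uncaught error (for example, enqueue throwing) maps to one result.
    return "no_sync_unexpected_error";
  }
}
```

Because every path returns exactly one result, each checked secondary increments exactly one value of the `result` label.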

Tests

Updated tests to pass and verified locally that the metrics were recorded, using a user whose replica set needed updating after I took down one of its nodes. Context, plus what the metric screenshots show in this case:
User's replica set was primary=CN4, secondaries=CN2 and CN3

  1. I took CN2 offline
  2. CN3 was first to notice the offline node and issued a reconfig
    • Verified by seeing this line in the screenshot of CN3 logs: audius_cn_state_machine_update_replica_set_job_duration_seconds_bucket{le="30",uncaughtError="false",issuedReconfig="true",reconfigType="one_secondary"} 1
  3. Sync was successful (clock_status route showed its value updated on CN1)
  4. CN4 looked for syncs and saw that it didn't need to sync to CN1 because its clock value was already updated
    • Verified by seeing this line in the screenshot of CN4 logs: audius_cn_find_sync_request_counts{primary="http://cn4_creator-node_1:4003",secondary="http://cn1_creator-node_1:4000",result="no_sync_secondary_clock_gte_primary"} 3
    • Note that this metric no longer has primary or secondary as labels because that would create too many time series -- see discussion below for details.

Monitoring - How will this change be monitored? Are there sufficient logs / alerts?

  • Monitor logs for "Metric label has invalid" or "Error processing job" errors
  • Monitor the /prometheus_metrics endpoint to see why syncs are or aren't being found, using the metric audius_cn_find_sync_request_counts
  • Debug any unexpected metric results by searching the logs for "Recorded findSyncRequests from" to see which primary/secondary caused the metric to be incremented
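
To make the second monitoring step concrete, here is a minimal sketch that totals the samples of one metric by its `result` label, given the text body returned by a Prometheus metrics endpoint. `countsByResult` is a hypothetical helper, not part of this PR, and it assumes the standard Prometheus text exposition format (`name{labels} value`); the parsing is deliberately simple and does not handle every escaping edge case.

```typescript
// Hypothetical helper: sums the samples of `metricName` per `result` label value.
function countsByResult(metricsText: string, metricName: string): Map<string, number> {
  const totals = new Map<string, number>();
  for (const line of metricsText.split("\n")) {
    const trimmed = line.trim();
    // Skip blank lines and "# HELP" / "# TYPE" comment lines.
    if (trimmed === "" || trimmed.startsWith("#")) continue;
    // Capture: metric name, optional {label set}, and the sample value.
    const match = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)/.exec(trimmed);
    if (match === null || match[1] !== metricName) continue;
    const labels = match[2] ?? "";
    const resultMatch = /(?:^|,)\s*result="((?:[^"\\]|\\.)*)"/.exec(labels);
    const result = resultMatch?.[1] ?? "(no result label)";
    const value = Number(match[3]);
    if (Number.isNaN(value)) continue;
    totals.set(result, (totals.get(result) ?? 0) + value);
  }
  return totals;
}
```

The input would be the response body of the /prometheus_metrics endpoint on a Content Node; the host and port to query depend on the deployment and are not specified here.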

@theoilie theoilie added the content-node Content Node (previously known as Creator Node) label Jul 5, 2022
@theoilie theoilie requested a review from SidSethi July 5, 2022 15:25

@SidSethi SidSethi left a comment

whole thing is super clean, i like what you did with GAUGE_INC
sadly per my comment, i'm not sure this will scale with the cardinality of primary-secondary-result combinations

lmk what you think, but likely makes sense to simplify


theoilie commented Jul 5, 2022

@SidSethi thanks for the review! I responded to the cardinality issue -- thinking if we should try to at least expose some info about which nodes are causing the metrics. maybe logging when a metric is recorded would be a good middle ground to let us dive into why/where certain values happen without overloading Prometheus? lmk what you think and I'll make some changes 🙂


SidSethi commented Jul 5, 2022

> @SidSethi thanks for the review! I responded to the cardinality issue -- thinking if we should try to at least expose some info about which nodes are causing the metrics. maybe logging when a metric is recorded would be a good middle ground to let us dive into why/where certain values happen without overloading Prometheus? lmk what you think and I'll make some changes 🙂

agreed - i think as best practice, every time a metric is recorded there should be a corresponding log, so that seems like the right place to log the primary-secondary info. wdyt?
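
The trade-off discussed here can be sketched as follows, under stated assumptions: `ResultGauge`, `recordFindSyncResult`, and `maxSeriesWithNodeLabels` are hypothetical names, the gauge is a small in-memory stand-in rather than a real Prometheus client, and the log line format is illustrative only. The idea is to keep `result` as the only metric label and put the primary/secondary detail in a log line written at the same moment.

```typescript
// In-memory stand-in for a labeled gauge (hypothetical, not a real client library).
class ResultGauge {
  private readonly allowedResults: readonly string[];
  private readonly counts = new Map<string, number>();

  constructor(allowedResults: readonly string[]) {
    this.allowedResults = allowedResults;
  }

  // Increment the series for one result value; unknown values are rejected.
  inc(result: string): void {
    if (!this.allowedResults.includes(result)) {
      throw new Error(`unknown result label value: ${result}`);
    }
    this.counts.set(result, (this.counts.get(result) ?? 0) + 1);
  }

  get(result: string): number {
    return this.counts.get(result) ?? 0;
  }

  // Number of distinct time series this gauge currently holds.
  seriesCount(): number {
    return this.counts.size;
  }
}

// Record the low-cardinality metric and log the high-cardinality detail alongside it.
function recordFindSyncResult(
  gauge: ResultGauge,
  log: (line: string) => void,
  primary: string,
  secondary: string,
  result: string
): void {
  gauge.inc(result);
  // Illustrative message format; the real log line is not reproduced here.
  log(`find-sync result recorded: primary=${primary} secondary=${secondary} result=${result}`);
}

// Upper bound on series if primary and secondary were labels:
// every ordered pair of distinct nodes times every result value.
function maxSeriesWithNodeLabels(nodeCount: number, resultCount: number): number {
  return nodeCount * (nodeCount - 1) * resultCount;
}
```

With only the `result` label, the series count is bounded by the number of result values (ten in the description), however many nodes exist. With node labels it grows roughly with the square of the node count.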


@SidSethi SidSethi left a comment

diff looks good - main thing is the cardinality / logging

@theoilie theoilie requested a review from SidSethi July 5, 2022 21:26
@pull-request-size pull-request-size bot added size/XL and removed size/L labels Jul 6, 2022

theoilie commented Jul 6, 2022

accidentally got some reformatting in the findSyncRequests test file but it should be safe to ignore as long as it's passing. no big changes made to it -- just resolving merge conflicts

@theoilie theoilie requested a review from SidSethi July 6, 2022 19:49

@SidSethi SidSethi left a comment

nice - glad tests are passing - if major changes, maybe worth running through PR description "Tests" again, but can do async, excited to get these released

@theoilie theoilie merged commit 5b6acdd into master Jul 6, 2022
@theoilie theoilie deleted the theo-add-find-syncs-metric branch July 6, 2022 20:26