Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CON-234] Enable write quorum #3408

Merged
merged 14 commits into from
Jul 13, 2022
Merged

[CON-234] Enable write quorum #3408

merged 14 commits into from
Jul 13, 2022

Conversation

theoilie
Copy link
Contributor

@theoilie theoilie commented Jul 8, 2022

Description

Libs / SDK

Add writeQuorumEnabled flag to libs creatorNodeConfig passed in via libs/SDK constructor. This flag is used to pass the Enforce-Write-Quorum header in every call from the libs/SDK CreatorNode API. The routes on Content Node can then decide how they want to consume the header (including just ignoring it when not applicable).

Content Node

  • Change enforceWriteQuorum env var default from false to true
  • Make file consumers of the issueAndWaitForSecondarySyncRequests not await it because discovery only indexes metadata, so write quorum isn't needed on files
    • This applies to the routes: /image_upload, /audius_users, /playlists, /tracks
  • Make metadata consumers of the issueAndWaitForSecondarySyncRequests middleware await it so that the middleware can actually block the request from succeeding until a 2/3 write quorum is reached
    • This makes writes await replication in all scenarios, but the write can still succeed if replication fails and write quorum is disabled
    • This applies to the routes: /audius_users/metadata, /playlists/metadata, /tracks/metadata
  • Change order of precedence for enforcing client header vs server env var:
    • If client passes Enforce-Write-Quorum header as false, disable write quorum regardless of the enforceWriteQuorum env var
    • If client passes Enforce-Write-Quorum header as true, enable write quorum regardless of the enforceWriteQuorum env var
    • If client doesn't pass Enforce-Write-Quorum header, write quorum = enforceWriteQuorum env var
  • Add audius_cn_write_quorum_duration_seconds histogram to track write quorum success/failure

Histogram data after changing profile picture and uploading a track once with write quorum enabled and then changing profile picture and uploading a track once with write quorum disabled:
Screen Shot 2022-07-12 at 10 09 10 PM

Tests

Env Var + Header Order of Precedence

I tested all the env var (true/false) + header (true/'true'/false/'false'/null/'null'/undefined) + ignoreWriteQuorum (true/false) combos to make sure the final enforceWriteQuorum variable used in the middleware is correct.

Manual E2E Test Setup

Before each test:

  1. (remote box) Disable snapback (set disableSnapback env var to true)
  2. (remote box) Disable refactored state monitoring (set stateMonitoringQueueRateLimitJobsPerInterval env var to 0) so that we can test only seeing manual syncs and not recurring
  3. (remote box) Check out this branch of the protocol and restart: git checkout theo-write-quorum && A down && A up
  4. (local) Check out this branch of the protocol and run libs so the client can consume its changes: git checkout theo-write-quorum && cd libs && npm link && npm run dev
  5. (local) Check out latest from client, link local libs, and start it:
    • cd audius-client/packages/web
    • npm link @audius/sdk
    • scp -r <REMOTE_BOX>:.audius ~/
    • npm run start:dev:cloud

Manual E2E Tests

Set local storage to make the feature flag enable/disable write quorum by entering the following into the Console:
localStorage.setItem('FeatureFlagOverride:write_quorum_enabled', 'enabled');
OR
localStorage.setItem('FeatureFlagOverride:write_quorum_enabled', 'disabled');

  1. Write quorum disabled + default processSync behavior:
    • POST request to /audius_users/metadata when uploading track enqueued+awaited manual sync without enforcing failing the POST request if the sync failed
    • Successfully logged completion before returning 200: issueAndWaitForSecondarySyncRequests - At least one secondary successfully replicated content for user 0xd7903b72f7e92f68ab14e61ef65689a201c4d741 in 534ms","time":"2022-07-08T00:20:17.617Z" "requestMethod":"POST","requestHostname":"cn4_creator-node_1","requestUrl":"/audius_users/metadata","requestWallet":"0xd7903b72f7e92f68ab14e61ef65689a201c4d741","requestBlockchainUserId":"2","duration":627,"statusCode":200
  2. Write quorum enabled + processSync temporarily changed to always throw:
    • Progress bar looks super glitchy – it bounced back and forth - and then eventually turns into heavy load error page
    • Primary eventually returned 500 for the metadata POST request with the expected error: Error processing request: Error: issueAndWaitForSecondarySyncRequests Error - Failed to reach 2/3 write quorum for user
  3. Write quorum enabled + manualSyncsDisabled=true:
    • Throws error issueAndWaitForSecondarySyncRequests Error - Cannot proceed due to manualSyncsDisabled
    • There haven't been any occurrences of this in Loggly for the past 7 days
  4. Write quorum enabled + missing req.session.wallet:
    • Throws error issueAndWaitForSecondarySyncRequests Error - req.session.wallet missing
    • There haven't been any occurrences of this in Loggly for the past 7 days
  5. Write quorum enabled + node is not primary or invalid creatorNodeEndpoints:
    • Throws error issueAndWaitForSecondarySyncRequests Error - Cannot process sync op - this node is not primary or invalid creatorNodeEndpoints
    • There haven't been any occurrences of this in Loggly for the past 7 days

Monitoring - How will this change be monitored? Are there sufficient logs / alerts?

  • Search for any errors starting with issueAndWaitForSecondarySyncRequests Error and read the test scenarios and issueAndWaitForSecondarySyncRequests code for more context
  • Keep an eye on the new audius_cn_write_quorum_duration_seconds metric -- look for the result label equaling anything other than succeeded

@theoilie theoilie added the content-node Content Node (previously known as Creator Node) label Jul 8, 2022
@theoilie theoilie requested review from dmanjunath and SidSethi July 8, 2022 07:47
@theoilie
Copy link
Contributor Author

theoilie commented Jul 8, 2022

just a draft to make sure this is on the right track. I bolded a couple TODO questions in the PR description if someone has an answer to those

Copy link
Contributor

@dmanjunath dmanjunath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first off thank you so much for testing this out and the great writeup. just had a few small suggestions but overall changes look good so far and i'm glad we're finally trying to get this out.

creator-node/src/config.js Show resolved Hide resolved
creator-node/src/middlewares.js Outdated Show resolved Hide resolved
creator-node/src/routes/audiusUsers.js Show resolved Hide resolved
@SidSethi
Copy link
Contributor

SidSethi commented Jul 8, 2022

thank you for putting in so much thought here and the thorough PR description 🙌

@pull-request-size pull-request-size bot added size/L and removed size/M labels Jul 11, 2022
@theoilie
Copy link
Contributor Author

note that there's a bug in the axios version of libs that changes headers from null to [object Object]. this appears to be fixed in v0.20.0, but I'd rather not risk updating since I've read other users having issues after upgrading. the workaround I'm using is to pass the null as a string

see axios solution PR and problem issue 1 and issue 2

@theoilie theoilie marked this pull request as ready for review July 13, 2022 05:17
@theoilie
Copy link
Contributor Author

@SidSethi or @dmanjunath ready for review! I did some more testing, updated the PR description, and added a metric so we can track the rollout on staging and then prod. since there are only 7 routes that use write quorum, I think it's worth the slightly higher cardinality to have the route label that I added. If you're strongly opposed to this, though, I can change it to just be an ignoreWriteQuorum label. note that the ignoreWriteQuorum and enforceWriteQuorum labels should be the same for the same route on each request, so that lack of permutation should keep the cardinality down to a reasonable number

also I'll look into why the tests are failing tomorrow. I suspect it's because I changed the other routes to await write quorum

Copy link
Contributor

@dmanjunath dmanjunath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great to me, thanks for making these changes. I'm actually fine with the cardinality, don't see a problem with that at all, I love the observability 😃 . I'll let @SidSethi give the final approve but nothing else on my end

Copy link
Contributor

@SidSethi SidSethi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!! excited to get this out - i appreciate the very detailed PR description and testing, and all the thought you put in to reason through all the underlying complexity here

also interesting point about the ignoreWriteQuorum thing - your point that metadata must be indexed but the associate routes don't need to be is a good one. my initial logic was that the associate routes are what actually cause a chain write which leads to indexing, but it makes sense that the chain write just writes a metadata CID which should already have been replicated by the first route. tricky to reason through, but i think it makes sense!

btw - i think our standard is to have separate PRs for each service e.g. libs and CN here (i know this defeats the point of a monorepo lol)
there'll also need to be a separate libs PR with new version (yeah i know)
idk if @dmanjunath cares

libs/src/services/creatorNode/CreatorNode.ts Outdated Show resolved Hide resolved
creator-node/src/middlewares.js Show resolved Hide resolved
@dmanjunath
Copy link
Contributor

@theoilie @SidSethi i think we can have libs code merged as part of this PR. we should just bump and merge the libs version separately

@theoilie
Copy link
Contributor Author

sounds good, just waiting to confirm that tests now pass on CI and then I'll merge!

@theoilie theoilie merged commit 10a85a8 into master Jul 13, 2022
@theoilie theoilie deleted the theo-write-quorum branch July 13, 2022 17:06
@theoilie theoilie mentioned this pull request Jul 13, 2022
hareeshnagaraj pushed a commit that referenced this pull request Jul 14, 2022
## Changelog

- #3463 by nicoback2 bump
- #3420 by nicoback2 Use
- #2988 by Cheran [INF-76]
- #3458 by Johannes Naylor
- #3456 by Sid Sethi
- libs by Johannes Naylor
- #3451 by Sebastian Klingler
- #3373 by Sebastian Klingler
- #3449 by Theo Ilie
- #3408 by Theo Ilie
- #3445 by Sebastian Klingler
- #3435 by Reed Added
- #3438 by Raymond Jacobson
- #3434 by Hareesh Nagaraj
- #3433 by Sebastian Klingler
- #3406 by nicoback2 Make
- #3402 by Hareesh Nagaraj
- #3394 by Raymond Jacobson
- #3392 by Raymond Jacobson
- #3363 by Michael Piazza
- #3366 by Joseph Lee
- #3354 by Joseph Lee
- #3362 by Michael Piazza
- #3361 by Raymond Jacobson
- #3360 by Raymond Jacobson
- #3347 by Raymond Jacobson
- #3345 by Raymond Jacobson
- #3337 by Sebastian Klingler
- #3336 by Sid Sethi
hareeshnagaraj pushed a commit that referenced this pull request Jul 14, 2022
## Changelog

- #3463 by nicoback2 bump
- #3420 by nicoback2 Use
- #2988 by Cheran [INF-76]
- #3458 by Johannes Naylor
- #3456 by Sid Sethi
- libs by Johannes Naylor
- #3451 by Sebastian Klingler
- #3373 by Sebastian Klingler
- #3449 by Theo Ilie
- #3408 by Theo Ilie
- #3445 by Sebastian Klingler
- #3435 by Reed Added
- #3438 by Raymond Jacobson
- #3434 by Hareesh Nagaraj
- #3433 by Sebastian Klingler
- #3406 by nicoback2 Make
- #3402 by Hareesh Nagaraj
- #3394 by Raymond Jacobson
- #3392 by Raymond Jacobson
- #3363 by Michael Piazza
- #3366 by Joseph Lee
- #3354 by Joseph Lee
- #3362 by Michael Piazza
- #3361 by Raymond Jacobson
- #3360 by Raymond Jacobson
- #3347 by Raymond Jacobson
- #3345 by Raymond Jacobson
- #3337 by Sebastian Klingler
- #3336 by Sid Sethi
hareeshnagaraj pushed a commit that referenced this pull request Jul 14, 2022
## Changelog

- #3463 by nicoback2 bump
- #3420 by nicoback2 Use
- #2988 by Cheran [INF-76]
- #3458 by Johannes Naylor
- #3456 by Sid Sethi
- libs by Johannes Naylor
- #3451 by Sebastian Klingler
- #3373 by Sebastian Klingler
- #3449 by Theo Ilie
- #3408 by Theo Ilie
- #3445 by Sebastian Klingler
- #3435 by Reed Added
- #3438 by Raymond Jacobson
- #3434 by Hareesh Nagaraj
- #3433 by Sebastian Klingler
- #3406 by nicoback2 Make
- #3402 by Hareesh Nagaraj
- #3394 by Raymond Jacobson
- #3392 by Raymond Jacobson
- #3363 by Michael Piazza
- #3366 by Joseph Lee
- #3354 by Joseph Lee
- #3362 by Michael Piazza
- #3361 by Raymond Jacobson
- #3360 by Raymond Jacobson
- #3347 by Raymond Jacobson
- #3345 by Raymond Jacobson
- #3337 by Sebastian Klingler
- #3336 by Sid Sethi
dmanjunath pushed a commit that referenced this pull request Jul 15, 2022
dmanjunath pushed a commit that referenced this pull request Jul 15, 2022
sliptype pushed a commit that referenced this pull request Sep 10, 2023
sliptype pushed a commit that referenced this pull request Sep 10, 2023
[a16abc2] Revert force mobile full staging build (#3564) nicoback2
[27f0a1b] [PAY-1386] DMs: Mark chat as read on client when new message from self comes through websocket (#3562) Marcus Pasell
[ab41a52] Force mobile build again (#3561) nicoback2
[ac383dc] DMs: Fix broken messages (#3559) Marcus Pasell
[29b49ee] Force mobile staging build - Android (#3560) nicoback2
[57ff5c5] Force mobile staging build (#3558) nicoback2
[5456271] Update plist to 1.1.66 (#3557) Marcus Pasell
[48c70b1] [PAY-1398] Fix ChatUserListScreen scrolling with keyboard (#3552) Reed
[ba19c1d] [PAY-1423] Fix empty chat message box with currently playing track (#3553) Reed
[970224f] Mobile ChatMessageListItem uses message id + selector (#3551) Reed
[49ff351] [PAY-1426] DMs: Simplify unfurl styles for mobile, fix border radius shadow (#3555) Marcus Pasell
[63faad0] [PAY-1403] Fix chat track unfurl reactions (#3550) Reed
[a8e92b7] [PAY-1425][PAY-1424] DMs: Fix image unfurls, fix purple halo around unfurls (#3554) Marcus Pasell
[940aebe] [PAY-1367][PAY-1400][PAY-1401][PAY-1402][PAY-1419] Fix chat tile unfurl bugs (#3541) Saliou Diallo
[61b6735] [PAY-1378] Cap image size on mobile DMs (#3548) Reed
[a495c35] [PAY-1409] DMs: Deep link into chat from notification (#3549) Marcus Pasell
[a4e6426] [PAY-1393] DMs: Hide message underneath popover (#3547) Marcus Pasell
[a6711f1] [C-2691] Carry values through the form fields (#3539) Andrew Mendelsohn
[d653f5c] [C-2747] Fix track title layout with offline + ai attribution (#3546) Andrew Mendelsohn
[8e54ff7] [PAY-1408][PAY-1351][PAY-1383][PAY-1417][PAY-1240][PAY-1345] DMs: Web UI polishes and bugfixes (#3543) Marcus Pasell
[cc1bc46] [C-2746] Fix favorites album tab styling (#3544) Kyle Shanks
[1dbb753] [PAY-1406][PAY-1374][PAY-1353][PAY-1411] Fix DMs UI Issues (#3542) Michael Piazza
[b76eef2] [PAY-1348] clean up ChatUserListScreen and use followers as default list (#3538) Randy Schott
[3ed58ac] DMs: Wait for tip confirmation then refetch permissions (#3537) Marcus Pasell
[bb4f317] [PAY-1366] Fix web chat tile artwork and mobile link preview bugs (#3528) Saliou Diallo
[ea47ecf] DMs: Fix deletions not causing perms rechecks (#3540) Marcus Pasell
[0b8c993] [PAY-1356] Mobile chats scroll to bottom when re-entering chat screen (#3536) Reed
[52f7b94] [PAY-1363] DMs: Add CTA to Create Chat user rows (#3530) Marcus Pasell
[4c22b12] [C-2461, C-2476] Prevent user from dragging hidden tracks to public playlists (#3535) Kyle Shanks
[f2ad1ce] [C-1873, C-2551, C-2552, C-2553] Add hidden track logic for playlists on web (#3529) Kyle Shanks
[083b0ff] [PAY-1315] Fix mobile chat search screen takes 2 presses (#3534) Reed
[da64d93] [PAY-1347] DMs: Don't show unread indicator for first messages (#3532) Marcus Pasell
[7ae88cb] [PAY-1394] DMs: Show global unread indicator if there's unread messages in a loaded chat (#3533) Marcus Pasell
[40abf58] [PAY-1350] Update send button styling on web (#3523) Michael Piazza
[cd1e238] [PAY-1149] Mobile chat reactions add android offset (#3531) Reed
[3595ce8] [PAY-1362] Mobile press chat message closes reaction popup (#3527) Reed
[06d10e2] [PAY-1361] Fix reactions button hit area (#3524) Michael Piazza
[e97ed82] [PAY-1309] Use proper tracking of hasMore for chats list (#3517) Randy Schott
[fe72398] [PAY-1365] Disable fullscreen gestures on mobile chat screen (#3526) Reed
[03cbcac] [PAY-1392] Mobile mark chat as read onscroll, on navigate away (#3522) Reed
[e6af9d7] [PAY-919][PAY-1312] Chat track and playlist tiles on mobile (#3455) Saliou Diallo
[e2896fe] [C-2719] Clean up and document non v1 usage (#3520) Raymond Jacobson
[2cf5bb7] Add reason to account slice, handle fetch account failure with sentry (#3525) Raymond Jacobson
[4451447] [PAY-1342][PAY-1360] DMs: Show users as chat unlocked by default, optimistic chat navigation (#3514) Marcus Pasell
[8262221] [PAY-1391] ensure all collections are fetched before attempting to download (#3521) Randy Schott
[e29d35f] [PAY-1385] Mobile chat send icon updates (#3516) Reed
[8cda7b2] [PAY-1390] fix inverted condition for computing track count (#3519) Randy Schott
[3a3b206] [PAY-1333] Mobile dms end of history only shows with >10 messages (#3515) Reed
[f1f5dc0] [PAY-1370] Fix reaction touch targets (#3513) Reed
[9b81487] Upgrade sdk to 2.0.3-beta.36 (#3512) Dylan Jeffers
[b48777a] Remove content node cid query route (#3408) Michelle Brier
[69e5656] [C-2717] Fix playlist create tile bug (#3510) Kyle Shanks
[828015f] Remove yield* in js files (#3511) Sebastian Klingler
[7d26233] Add mobile hidden track playlist logic (#3505) Kyle Shanks
[cf9c5d5] [PAY-1137][PAY-1132] Infinite loading of albums on mobile favorites screen (#3398) Randy Schott
@AudiusProject AudiusProject deleted a comment from linear bot Sep 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
content-node Content Node (previously known as Creator Node) size/L
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants