Track buckets in DB instead of in-memory #609
Conversation
I was thinking we could add a new buckets table with a bucket UUID; every CID would then be assigned to one of these buckets, which could either be a dedicated bucket (per user) or a global bucket.
Each bucket would then be processed like a batch job with different trigger parameters (see the sketch after this list):
1 - size of the bucket: once it reaches a threshold (3.5 GB), it gets processed.
2 - time: we can schedule it every day via a cron job.
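A rough sketch of what such a buckets table could look like, assuming GORM (which Estuary already uses). The type, field, and threshold names below are hypothetical illustrations of the proposal, not code from this PR:

```go
package model

import (
	"time"

	"github.com/google/uuid"
	"gorm.io/gorm"
)

// Bucket is a hypothetical row in the proposed buckets table. Every CID would
// be assigned to exactly one bucket, either dedicated (per user) or global.
type Bucket struct {
	gorm.Model
	UUID     uuid.UUID  `gorm:"type:uuid;uniqueIndex"`
	UserID   *uint      // nil for a global bucket, set for a dedicated per-user bucket
	Size     int64      // running total of the contents assigned to this bucket
	ClosedAt *time.Time // set once the bucket has been processed
}

// Trigger 1: size threshold (~3.5 GB). Trigger 2 would be a daily cron job
// that processes remaining buckets regardless of size.
const sizeThresholdBytes = int64(3_500_000_000)

// ReadyForProcessing reports whether the bucket has crossed the size trigger.
func (b *Bucket) ReadyForProcessing() bool {
	return b.Size >= sizeThresholdBytes
}
```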
Agreed we may need a global staging zone concept, and a dedicated zones table may end up being necessary. IMO the latter would also clean up some confusion for new contributors around the contents table. But for now, I want to keep this PR to the minimal change-set required to address the in-memory state issue.
Currently writing up a task to recompute staging zone sizes at startup and update the DB if they mismatch what is in the aggregate content's row. This is a migration task and is required for moving to tracking size incrementally in the DB, since staging zones will already exist when this gets deployed but their sizes will not have been accounted for by the incremental tracking.
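For reference, a minimal sketch of what that startup recompute could look like, assuming GORM and a contents table whose children reference their staging zone via an aggregated_in column. The stand-in Content type and column names below are assumptions for illustration, not the actual migration code:

```go
package main

import "gorm.io/gorm"

// Content is a minimal stand-in for Estuary's content model; the real type has
// many more fields. Column names here are assumptions for illustration.
type Content struct {
	ID           uint `gorm:"primarykey"`
	Size         int64
	Aggregate    bool // true for staging zones (aggregate contents)
	AggregatedIn uint // parent zone ID for child contents
}

// recomputeStagingZoneSizes re-derives each staging zone's size from its child
// contents and fixes the stored value if it drifted, so that incremental
// tracking starts from a correct baseline.
func recomputeStagingZoneSizes(db *gorm.DB) error {
	var zones []Content
	if err := db.Where("aggregate = ?", true).Find(&zones).Error; err != nil {
		return err
	}
	for _, z := range zones {
		var actual int64
		if err := db.Model(&Content{}).
			Where("aggregated_in = ?", z.ID).
			Select("coalesce(sum(size), 0)").
			Scan(&actual).Error; err != nil {
			return err
		}
		if actual != z.Size {
			// only touch rows whose stored size disagrees with the recomputed value
			if err := db.Model(&Content{}).
				Where("id = ?", z.ID).
				Update("size", actual).Error; err != nil {
				return err
			}
		}
	}
	return nil
}
```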
@en0ma I've validated that I can successfully consolidate and aggregate buckets across an API node and one shuttle. Currently the API node is not selectable as a consolidation destination in the code, so all the consolidations I tested went onto the shuttle. I included some restarts in the tests, tried adding data between consolidation and aggregation, and confirmed it re-consolidated and then eventually aggregated.
LGTM
Addresses #556 by removing the Buckets in-memory state from the ContentManager, and replacing it with DB queries
After #535, most of the staging zone state was removed. Staging zone size is now tracked incrementally as contents are added and removed. This allows readiness to be computed on the fly in constant time with a simple `size >= MinDealContentSize` check. The ContentManager now only tracks which zones are consolidating in memory. If this state is lost in a restart, all zones will be considered not consolidating, causing them to be reattempted.
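As a rough illustration of the incremental tracking and constant-time readiness check described above, assuming GORM; the helper names and the threshold value are placeholders, and only the `size >= MinDealContentSize` comparison comes from the PR description:

```go
package main

import "gorm.io/gorm"

// Placeholder threshold value for illustration only.
const MinDealContentSize = int64(3_500_000_000)

// addToZone bumps the zone's stored size atomically whenever a content is
// added; removals would pass a negative delta.
func addToZone(db *gorm.DB, zoneID uint, delta int64) error {
	return db.Table("contents").
		Where("id = ?", zoneID).
		Update("size", gorm.Expr("size + ?", delta)).Error
}

// zoneIsReady is the constant-time readiness check: no scan of the zone's
// children is needed because the size column is maintained incrementally.
func zoneIsReady(zoneSize int64) bool {
	return zoneSize >= MinDealContentSize
}
```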
Recompute task validation:
Uploaded via `dev`:
Checked out feature branch and started estuary:
(The first image also had hashes.txt; I just cut it off by accident in the screenshot.)
Uploaded 3 GB more of files and removed 1 GB on the feature branch to push it over the threshold; it moved into dealmaking successfully with the correct size metadata.
Follow Ups: