
Track buckets in DB instead of in-memory #609

Merged

Conversation

neelvirdy
Contributor

@neelvirdy neelvirdy commented Nov 28, 2022

Addresses #556 by removing the Buckets in-memory state from the ContentManager and replacing it with DB queries.

After #535, most of the staging zone state was removed. Staging zone size is now tracked incrementally as contents are added and removed. This allows readiness to be computed on the fly in constant time with a simple size >= MinDealContentSize check.

The ContentManager now only tracks which zones are consolidating in memory. If this state is lost in a restart, all zones will be considered not consolidating, causing them to be reattempted.
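As a rough illustration of the constant-time readiness check described above (a minimal sketch; the type and field names, and the threshold value, are assumptions for the example, not necessarily the identifiers used in the Estuary codebase):

```go
package main

import "fmt"

// Assumed threshold for the sketch; the real value comes from Estuary's deal configuration.
const MinDealContentSize = int64(3_500_000_000)

// StagingZone stands in for the aggregate content row whose Size column is
// updated incrementally as contents are added to or removed from the zone.
type StagingZone struct {
	ID   uint
	Size int64
}

// IsReady needs no scan over the zone's contents: because Size is kept up to
// date in the DB, readiness is a single comparison.
func (z *StagingZone) IsReady() bool {
	return z.Size >= MinDealContentSize
}

func main() {
	z := StagingZone{ID: 1, Size: 4_000_000_000}
	fmt.Println(z.IsReady()) // true: the zone can move on to dealmaking
}
```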


Recompute task validation:

Uploaded via dev:
[screenshot]

Checked out the feature branch and started Estuary:
[screenshot]

(The first screenshot also showed hashes.txt; it was accidentally cut off.)

Uploaded 3 GB more of files and removed 1 GB on the feature branch to push it over the threshold; it moved into dealmaking successfully with the correct size metadata.


Follow-ups:

  • Hide the recompute size task behind a startup flag so it doesn't run on every start; it should only ever need to run once (see the sketch after this list).
  • Readiness no longer needs to be checked on an interval - it's knowable in constant time as soon as content is added, so the zone can simply be added to a queue for the staging bucket worker to process.
  • The FE should remove its concept of readiness: since we no longer send any readiness metadata, it is always "No" with a blank reason in the FE after this change.
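A minimal sketch of the first follow-up, using Go's standard flag package for illustration; the flag name and the RecomputeZoneSizes helper are hypothetical, not actual Estuary options:

```go
package main

import (
	"flag"
	"log"
)

// RecomputeZoneSizes stands in for the migration task described in this PR:
// re-derive each staging zone's size from its contents and fix up the DB.
func RecomputeZoneSizes() error {
	// ... recompute and persist zone sizes ...
	return nil
}

func main() {
	recompute := flag.Bool("recompute-staging-zone-sizes", false,
		"run the one-time staging zone size recompute at startup")
	flag.Parse()

	if *recompute {
		if err := RecomputeZoneSizes(); err != nil {
			log.Fatalf("recomputing staging zone sizes: %v", err)
		}
	}
	// ... continue normal startup ...
}
```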

@neelvirdy neelvirdy requested a review from en0ma November 28, 2022 20:44
@neelvirdy neelvirdy marked this pull request as ready for review November 28, 2022 21:09
Contributor

@alvin-reyes alvin-reyes left a comment


I was thinking we could add a new buckets table, which would have a bucket UUID; every CID is then assigned to one of these buckets, which can either be a dedicated bucket (per user) or a global bucket.

The bucket is then processed like a batch job with different trigger parameters:
1 - size of the bucket: if it reaches a threshold (3.5 GB), it gets processed.
2 - time: we can schedule it every day as a cron job.
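To make the suggestion concrete, here is a rough GORM-style sketch of what such a buckets table and its triggers could look like; these types, fields, and the threshold constant are illustrative assumptions, not part of this PR or the existing schema:

```go
package model

import "time"

// Bucket groups CIDs for batch processing; a zero UserID could denote a
// global bucket, a non-zero one a dedicated per-user bucket.
type Bucket struct {
	ID        uint   `gorm:"primarykey"`
	UUID      string `gorm:"index"`
	UserID    uint
	Size      int64
	CreatedAt time.Time
}

// BucketContent assigns a CID to a bucket.
type BucketContent struct {
	ID       uint   `gorm:"primarykey"`
	BucketID uint   `gorm:"index"`
	CID      string
}

// Assumed ~3.5 GB size trigger from the comment above.
const bucketSizeThreshold = int64(3_500_000_000)

// ShouldProcess fires on either trigger: the size threshold is reached, or
// the bucket has aged past the daily cron interval.
func (b *Bucket) ShouldProcess(now time.Time) bool {
	return b.Size >= bucketSizeThreshold || now.Sub(b.CreatedAt) >= 24*time.Hour
}
```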

@neelvirdy
Contributor Author

I was thinking we could add a new buckets table, which would have a bucket UUID; every CID is then assigned to one of these buckets, which can either be a dedicated bucket (per user) or a global bucket.

The bucket is then processed like a batch job with different trigger parameters: 1 - size of the bucket: if it reaches a threshold (3.5 GB), it gets processed. 2 - time: we can schedule it every day as a cron job.

Agreed, we may need a global staging zone concept, and a dedicated zones table may end up being necessary. IMO the latter would also clean up some confusion for new contributors around the contents table. But for now, I want to keep this PR to the minimal changeset required to address the in-memory state issue.

@neelvirdy
Contributor Author

Currently writing up a task to recompute staging zone sizes at startup and update the DB if they don't match what is stored in the aggregate content's row. This is a migration task and is required to move to tracking size incrementally in the DB, since staging zones will already exist when this gets deployed but their sizes will not have been accounted for by the incremental tracking.
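A hedged sketch of that recompute task, assuming a gorm.io/gorm handle and a contents table in which child contents reference their zone via an aggregated_in column; the table, column, and field names are assumptions for illustration, not the exact Estuary schema:

```go
package main

import "gorm.io/gorm"

// recomputeZoneSizes re-derives each staging zone's size from its member
// contents and writes the total back to the aggregate content's row when it
// does not match what is stored.
func recomputeZoneSizes(db *gorm.DB) error {
	type zoneSize struct {
		ZoneID uint
		Total  int64
	}

	var sizes []zoneSize
	// Sum member content sizes grouped by the aggregate they belong to.
	if err := db.Table("contents").
		Select("aggregated_in AS zone_id, SUM(size) AS total").
		Where("aggregated_in > 0").
		Group("aggregated_in").
		Scan(&sizes).Error; err != nil {
		return err
	}

	// Fix up any aggregate rows whose stored size disagrees with the total.
	for _, s := range sizes {
		if err := db.Table("contents").
			Where("id = ? AND size <> ?", s.ZoneID, s.Total).
			Update("size", s.Total).Error; err != nil {
			return err
		}
	}
	return nil
}
```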

(Resolved review threads on contentmgr/gc.go, contentmgr/replication.go, and contentmgr/pinning.go)
@neelvirdy neelvirdy requested a review from en0ma November 30, 2022 15:49
@neelvirdy neelvirdy requested a review from en0ma November 30, 2022 17:01
(Resolved review threads on contentmgr/gc.go, contentmgr/replication.go, and pinning.go)
@neelvirdy
Contributor Author

@en0ma I've validated that I can successfully consolidate and aggregate buckets across an API node and one shuttle. Currently the API node is not selectable as a consolidation destination in the code, so all the consolidations I tested went onto the shuttle. I included some restarts in the tests, tried adding data between consolidation and aggregation, and confirmed it re-consolidated and then eventually aggregated.

Contributor

@en0ma en0ma left a comment


LGTM

@alvin-reyes alvin-reyes merged commit ed97da8 into application-research:dev Dec 5, 2022
@neelvirdy neelvirdy deleted the nvirdy/start-db-buckets branch December 5, 2022 16:54