
Journal compaction overhead with KeyChanges #306

Closed · martinsumner opened this issue Mar 6, 2020 · 7 comments

@martinsumner (Owner)

There is technical debt in leveled when using the retain recovery strategy. This permanently retains a history of an object in the Journal to help with tracking key changes (i.e. index updates).

When used in Riak this can be cleared up through handoffs (e.g. join/leave operations).

This garbage has more side effects than originally anticipated, especially when using spinning HDDs with regular journal compaction, where it creates a background load of random read activity.

This needs to be improved.

@martinsumner (Owner Author)

There is some initial mitigation for this issue.

#307

If a Journal evolves with large numbers of KeyChanges objects, then scoring is expensive. This is in part because the list of file positions to score within each file is unordered and randomly distributed: the average jump is beyond any read-ahead, and there is a 50% chance that the next jump will be backwards rather than forwards.

To improve this, rather than fetching 100 positions, 100 * SomeAdjustmentFactor positions could be selected, then sorted, and then an adjacent (within the sample) subset of 100 positions chosen. This would ensure that the next position is always ahead of the previous one, with the gap between positions shrinking in proportion to SomeAdjustmentFactor. Fetching positions takes more CPU, but fetching the keys for scoring then has the potential to require less disk head movement and to be more friendly to OS cache-filling behaviour.
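A minimal sketch of this sampling idea, assuming a plain list of integer file positions and using illustrative module/function names (this is not the leveled implementation):

```erlang
%% Over-sample positions, sort them, then score a contiguous window so
%% that every subsequent read within the sample moves forward in the file.
-module(score_sample_sketch).
-export([choose_positions/3]).

%% Positions: unordered file positions within a Journal file.
%% SampleSize: e.g. 100.  AdjustmentFactor: e.g. 4.
choose_positions(Positions, SampleSize, AdjustmentFactor) ->
    WideSample = take_random(Positions, SampleSize * AdjustmentFactor),
    Sorted = lists:sort(WideSample),
    %% Take a random contiguous window of SampleSize sorted positions,
    %% so the next position to read is always ahead of the last one.
    MaxStart = max(1, length(Sorted) - SampleSize + 1),
    Start = rand:uniform(MaxStart),
    lists:sublist(Sorted, Start, SampleSize).

%% Shuffle and take the first N elements.
take_random(List, N) ->
    Shuffled = [X || {_R, X} <- lists:sort([{rand:uniform(), X} || X <- List])],
    lists:sublist(Shuffled, N).
```

The extra positions cost CPU only; the disk benefit comes from the monotonic read order within the chosen window.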

@martinsumner (Owner Author)

There is also a PR to provide some further building blocks for improvement:

#308

This PR changes the penciller request that fetches the sequence number so that it returns current/replaced/missing rather than true/false, so that in the future alternative actions may be possible based on replaced/missing.
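As an illustration only (the names below are assumptions, not the actual penciller API), the three-valued result allows compaction to distinguish cases that a boolean could not:

```erlang
%% Sketch of acting on the three-valued SQN check rather than a boolean.
-module(sqn_status_sketch).
-export([compaction_action/1]).

-spec compaction_action(current | replaced | missing) ->
        retain_object | retain_key_changes.
compaction_action(current) ->
    retain_object;       %% this Journal entry holds the live value - keep it
compaction_action(replaced) ->
    retain_key_changes;  %% superseded - the retain strategy keeps the KeyChanges
compaction_action(missing) ->
    retain_key_changes.  %% today's behaviour; a future strategy could drop these
```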

There is also some tech debt related to rebuilding a ledger from a journal. In this case there is a cost per file in the Journal related to never updating the MinSQN, so that the process has to loop aggressively at the start of each file before the first SQN occurs in a batch. This is corrected here.

The tests have also been expanded in this PR to ensure that there is better coverage of the combinations of updates, deletes, compaction and rebuilds.

@martinsumner (Owner Author)

martinsumner commented Mar 10, 2020

In order to properly resolve this issue, there is potentially a need for another compaction type.

It is safe to compact away KeyChanges objects for a key once the key has been removed. However, it is only safe if all the history for a given key is removed in the same compaction event.

With partial removal, the problem would be that another part of the history may not be removed (due to not being in a compactable file, or due to a new object being created in the future with the same key). Then on a rebuild event (i.e. due to a corrupted ledger being wiped) there would be rogue 2i entries.

There are three ways forward here:

  • Address the scenario created by partial removal, and partially remove during standard compaction. Addressing the problem would mean some form of AAE between secondary index terms and object state.

  • Have an alternative compaction event which is a complete rebuild of the Journal, removing object history where the object is missing at a given snapshot. In this alternative compaction the process must first snapshot, then score based on the potential to clean history, then compact the full Journal, and then switch to the new Journal (see the sketch after this list).

  • Revisit the recalc model of journal compaction. Originally there were intended to be three mechanisms for Journal compaction - recovr, retain and recalc. The recalc mechanism would actually avoid this problem. It had been set aside as it requires the injection of logic from the domain into the store (i.e. how to create diff'd IndexSpecs); however, this may be better than the existing bodges. It would also resolve the issue for replaced objects as well as missing objects.
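A high-level sketch of that second option, with every helper a placeholder (the real flow would use leveled's own snapshot, scoring and journal-switch machinery, none of which is shown here):

```erlang
-module(missing_compact_sketch).
-export([maybe_run/1]).

maybe_run(JournalFiles) ->
    %% 1. Snapshot first, so "missing" is judged at a fixed point in time.
    LedgerSnapshot = placeholder_snapshot(),
    %% 2. Score on the potential to clean history for keys missing at the snapshot.
    Score = score_for_missing(JournalFiles, LedgerSnapshot),
    case Score > 0 of
        true ->
            %% 3. Compact the full Journal in one event, so a key's history
            %%    is removed entirely or not at all.
            NewJournal = rewrite_without_missing(JournalFiles, LedgerSnapshot),
            %% 4. Only switch to the new Journal once the rewrite is complete.
            {switch_to, NewJournal};
        false ->
            no_compaction
    end.

%% Placeholders standing in for real leveled operations.
placeholder_snapshot() -> ledger_snapshot.
score_for_missing(_Files, _Snapshot) -> 0.
rewrite_without_missing(Files, _Snapshot) -> Files.
```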

@martinsumner (Owner Author)

martinsumner commented Mar 10, 2020

Currently, of the three options, the alternative compaction seems the easiest to implement. The scenario where this overhead affects the system tends to take months to build up, so it shouldn't be necessary for this compaction to run frequently. It should be enough for there to be an additional config parameter like:

leveled.journal_missingcompact_perc, which should be an integer 0..100. This would determine the proportion of compaction events that would run the special "compact for missing objects" compaction, and could default to 0. If an operator has this problem, this percentage can be increased (and in most cases 1 would be a high enough value).

This would be a more expensive compaction (although scoring would cost the same), but run at a very low frequency and with natural protection against concurrent compaction events, it is expected that the overhead should not be excessive.
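A minimal sketch of how such a setting might be applied when a compaction event fires (leveled.journal_missingcompact_perc is the proposed name above; the module and function below are illustrative):

```erlang
-module(compaction_choice_sketch).
-export([choose_compaction_type/1]).

%% MissingCompactPerc is the proposed 0..100 setting, defaulting to 0.
-spec choose_compaction_type(0..100) -> standard | compact_for_missing.
choose_compaction_type(MissingCompactPerc) ->
    %% rand:uniform(100) returns 1..100, so 0 never selects the special
    %% compaction and 100 always does; 1 selects it roughly 1% of the time.
    case rand:uniform(100) =< MissingCompactPerc of
        true  -> compact_for_missing;
        false -> standard
    end.
```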

@martinsumner (Owner Author)

The recalc option is potentially the most complete solution.

Some issues:

  • It is possible to migrate forwards (it appears to be theoretically possible to switch from running a Journal in retain mode to recalc mode - as you simply have to ignore additional information in recalc mode), but not back (without relying on external anti-entropy).

  • The load time on startup, and on rebuild of the ledger, will be slower, as a ledger read would be required before each write.

@martinsumner (Owner Author)

When the recalc mechanism was originally set aside, this was in part driven by a desire for leveled to remain independent of Riak.

Since then, the need to merge Riak logic into the leveled database was deemed to be unavoidable, and so the leveled_head module was introduced - which allowed Riak logic to be merged, but also potentially other logic to be user-defined for other applications.

Implementing recalc may well be considered a natural progression from this change - it would simply require another function (for diffing index specs) in the leveled_head module.
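A hypothetical sketch of the shape such a function could take - a diff of index entries between the previous and new versions of an object. The module name, function name and data shapes here are assumptions, not the actual leveled_head API:

```erlang
-module(diff_indexspecs_sketch).
-export([diff_index_specs/2]).

%% Index entries represented as {IndexField, IndexTerm} pairs extracted
%% from the object (or its metadata).
-spec diff_index_specs([{binary(), binary()}], [{binary(), binary()}]) ->
        [{add | remove, binary(), binary()}].
diff_index_specs(OldIndexes, NewIndexes) ->
    OldSet = ordsets:from_list(OldIndexes),
    NewSet = ordsets:from_list(NewIndexes),
    Adds = [{add, F, T} || {F, T} <- ordsets:subtract(NewSet, OldSet)],
    Removes = [{remove, F, T} || {F, T} <- ordsets:subtract(OldSet, NewSet)],
    Adds ++ Removes.
```

With a diff of this kind, compaction could recalculate the index changes from the current object state rather than retaining the full KeyChanges history.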

@martinsumner (Owner Author)

#310
