FR: Retrospection and enhanced deduplication #25

Open
hcpadkins opened this issue Jul 6, 2023 · 0 comments
Labels
enhancement New feature or request

Overview

As highlighted by #17, some services may integrate events into their log streams at unknown and unpredictable intervals. This is not an issue if all log entries are delayed by a consistent amount of time. It becomes a challenge when only a subset of log entries is integrated into the log stream after an unpredictable delay.

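If a single pointer is used to track the status of collected logs, delayed log events may result in dropped or missed data, because the pointer may already have advanced past a delayed event by the time that event appears in the stream.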
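
The failure mode can be illustrated with a minimal sketch. Everything here is hypothetical and is not Grove's implementation: the `Event` type, the `collect` function, and the use of integer minutes for time are assumptions made only to show how one delayed event is lost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str
    occurred_at: int  # Minute at which the event happened.
    visible_at: int   # Minute at which the vendor integrated it into the stream.


def collect(stream, pointer, now):
    """Return events newer than the pointer that are visible at `now`."""
    batch = [
        e for e in stream
        if e.visible_at <= now and e.occurred_at > pointer
    ]
    # The single pointer advances to the newest event seen so far.
    new_pointer = max((e.occurred_at for e in batch), default=pointer)
    return batch, new_pointer


stream = [
    Event("a", occurred_at=1, visible_at=1),
    Event("b", occurred_at=2, visible_at=30),  # Integrated late by the vendor.
    Event("c", occurred_at=3, visible_at=3),
]

first, pointer = collect(stream, pointer=0, now=5)
second, pointer = collect(stream, pointer=pointer, now=60)
collected_ids = [e.event_id for e in first + second]
```

In this sketch, event `b` only becomes visible after the pointer has moved past it, so the second collection never returns it.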

To mitigate this for vendors where delayed events are known to occur, such as Google Workspace and GitHub, a lag parameter (#20) was previously added to Grove. This parameter delays the collection of all logs by the configured number of minutes, giving the vendor's log stream the opportunity to become consistent. However, the use of lag means that logs which are available immediately are delayed to account for the slowest log entries. In the case of Google Workspace, this delay may be on the order of hours for login events.
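
The trade-off introduced by lag can be sketched as a collection window whose upper bound trails the current time. This is an illustration under assumptions only: `collection_window` and `lag_minutes` are hypothetical names, and Grove's actual handling of lag may differ.

```python
from datetime import datetime, timedelta, timezone


def collection_window(pointer, now, lag_minutes):
    """Return the (start, end) window to collect, or None if nothing is due.

    Entries newer than `now - lag_minutes` are deliberately left for a later
    run, even if the vendor has already made them available.
    """
    upper = now - timedelta(minutes=lag_minutes)
    if upper <= pointer:
        return None
    return pointer, upper


now = datetime(2023, 7, 6, 12, 0, tzinfo=timezone.utc)
pointer = datetime(2023, 7, 6, 11, 0, tzinfo=timezone.utc)

short_lag = collection_window(pointer, now, lag_minutes=30)
long_lag = collection_window(pointer, now, lag_minutes=90)
```

With a 30 minute lag, the most recent half hour of entries is held back. With a 90 minute lag, nothing is collected on this run at all, which is the cost paid by every entry to accommodate the slowest ones.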

Proposal

To ensure that logs are collected when they become available, this feature request proposes implementing a retrospection feature in Grove.

This feature will allow periodic retrospection of collected logs, and deduplication of log events which have already been collected. This allows collection to be performed aggressively: logs which are made available immediately are collected as soon as possible, while "slow" log entries are still collected rather than missed.

Retrospection will be implemented as a generic feature which can be turned on or off for a given connector as required. This is to ensure that future vendors with these constraints can be handled consistently, and without the need for special "once-off" treatment.
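
One way the proposal could work is sketched below. This is not a design for Grove: the `collect_with_retrospection` function, the `lookback` parameter, the in-memory `seen` set, and the dictionary shape of each event are all assumptions for illustration. Each run re-reads a window behind the pointer and skips any entry whose identifier has already been collected.

```python
def collect_with_retrospection(stream, pointer, now, lookback, seen):
    """Collect unseen events, looking back `lookback` minutes behind the pointer."""
    start = pointer - lookback
    fresh = []
    newest = pointer

    for event in stream:
        # Skip events the vendor has not yet exposed, or which fall
        # outside of the retrospection window.
        if event["visible_at"] > now or event["occurred_at"] <= start:
            continue

        newest = max(newest, event["occurred_at"])

        # Deduplicate: only emit entries which were not collected previously.
        if event["id"] in seen:
            continue
        seen.add(event["id"])
        fresh.append(event)

    return fresh, newest


stream = [
    {"id": "a", "occurred_at": 1, "visible_at": 1},
    {"id": "b", "occurred_at": 2, "visible_at": 30},  # Integrated late.
    {"id": "c", "occurred_at": 3, "visible_at": 3},
]

seen = set()
first, pointer = collect_with_retrospection(stream, 0, now=5, lookback=10, seen=seen)
second, pointer = collect_with_retrospection(stream, pointer, now=60, lookback=10, seen=seen)
```

Here the first run collects `a` and `c` immediately, and the second run picks up the late entry `b` without emitting `a` or `c` again. In this sketch, the lookback must be at least as long as the longest delay expected from the vendor, otherwise late entries can still be missed.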

Considerations

Deduplication will need to be performed on a per-log-entry basis. This will increase the amount of data stored in the cache, and the volume of reads and writes to the cache.
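
The cache cost can be made concrete with a small sketch. The approach shown is an assumption, not something specified in this issue: each entry is reduced to a fixed-size SHA-256 digest of its canonical JSON form, and digests expire after a time-to-live so the cache does not grow without bound. The `entry_key` function and `DedupCache` class are hypothetical names.

```python
import hashlib
import json


def entry_key(entry):
    """Derive a fixed-size deduplication key from a log entry."""
    canonical = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class DedupCache:
    """In-memory stand-in for a cache, counting reads and writes."""

    def __init__(self, ttl):
        self.ttl = ttl
        self._expiry = {}
        self.reads = 0
        self.writes = 0

    def seen_before(self, entry, now):
        """Return True if the entry was recorded and has not expired."""
        key = entry_key(entry)
        self.reads += 1
        expires = self._expiry.get(key)
        if expires is not None and expires > now:
            return True

        # First sighting (or expired): record it for future runs.
        self._expiry[key] = now + self.ttl
        self.writes += 1
        return False

    def prune(self, now):
        """Drop expired keys to bound the amount of data stored."""
        self._expiry = {k: v for k, v in self._expiry.items() if v > now}


cache = DedupCache(ttl=60)
entry = {"id": 1, "user": "a"}

is_new = not cache.seen_before(entry, now=0)
is_duplicate = cache.seen_before({"user": "a", "id": 1}, now=10)
```

Every entry costs one cache read, and every new entry costs one write, which matches the increase in cache traffic noted above. In a sketch like this, the time-to-live would need to cover at least the retrospection window so that keys are not pruned before the entries they guard against are re-read.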
