Use of Google Audit reports as implemented presents a risk of missing logs. #17
Hey there,

Thank you for raising this issue, and for wanting to help make Grove better!

Reading the linked Google documentation, it's a little unclear whether lagged events will occur within a single event type, or between event types. As an example, as "calendar" events are slower than "admin" logs, a single pointer used across both types' logs would result in missed events. However, it's unclear to me whether events may be back-filled within the same log event type - where an "admin" event from 3 hours ago is injected into the existing "admin" log event stream 3 hours late, while other "admin" events have been arriving on schedule.

That said, the use of "connector instances" should assist in reducing the likelihood of events being missed under the "between event types" case. For delayed events within the same event type, I'll do a retrospective over the last few months to try to quantify how much of an issue this may be today. Although this would only apply to our usage patterns, it should at least provide us a window into this.

Additionally, thank you for the reference to the channel-based approach! Today Grove is designed to run in a 'batch' mode of operation, rather than as a long-lived service which accepts external notifications, so this may be a challenge to implement :)

Connector Instances. A sample of our configuration for collection of Google related audit events looks something like the following - with the most relevant part being that each event type is configured separately. As each configuration results in a connector instance, each event type is tracked independently:

Admin Events
Login Events
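To make the "connector instances" idea concrete, here is a minimal sketch (the class and method names are illustrative, not Grove's actual API) of tracking one collection pointer per event type, so a slow-to-replicate type never drags back or discards another type's progress:

```python
from datetime import datetime, timezone


class PerTypePointers:
    """Track one collection pointer per log event type.

    With independent pointers, a slow type (e.g. "calendar")
    cannot advance a shared cursor past events still pending
    for a fast type (e.g. "admin"), and vice versa.
    """

    def __init__(self):
        self._pointers = {}

    def get(self, event_type):
        # Default to the epoch so a new type collects from the start.
        return self._pointers.get(
            event_type, datetime(1970, 1, 1, tzinfo=timezone.utc)
        )

    def advance(self, event_type, seen_timestamps):
        # Only ever move a pointer forward, and only for its own type.
        latest = max(seen_timestamps)
        if latest > self.get(event_type):
            self._pointers[event_type] = latest


pointers = PerTypePointers()
pointers.advance("admin", [datetime(2023, 1, 2, tzinfo=timezone.utc)])
pointers.advance("login", [datetime(2023, 1, 1, tzinfo=timezone.utc)])
```

Here the "admin" and "login" pointers move independently: a lagged "login" batch never moves the "admin" pointer backwards or forwards.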
Wider Issues. This said, you've touched on an important point that may apply to many connectors across different vendors: we usually have little visibility into how vendors make their log data available. As a result, if a vendor's architecture results in logs being aggregated at different rates, but the vendor does not provide an additional filter which could be used to split collection into separate Grove connector instances, this problem may apply. However, this also extends to situations where a vendor delivers a single set of events late within an otherwise punctual stream of the same type - such that events from a given geographic region for "login" events take N minutes, where "login" events from another region only take N seconds.

Future Improvements. In order to assist with these edge cases, one feature we'd like to investigate adding to Grove in future is periodic reconciliation of events. This would allow a connector author to define a period of time over which events would be recollected and de-duplicated, with only previously unseen events being recorded. However, this comes at the cost of increased cache sizes and a greater number of API calls - which may pose a challenge with high-volume sources which have aggressive rate-limits. There is also the challenge of selecting an appropriate period for reconciliation, as not all vendors will make this information available, nor will it always be reliable; some events may still be missed in exceptional cases.

Please let me know if I've missed the mark with your originally posted issue, or if you'd like to discuss further.
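The reconciliation idea above can be sketched as a re-collection pass that de-duplicates against a cache of previously seen event identifiers. This is a hypothetical illustration rather than Grove code; the event shape and the `seen_ids` cache are assumptions:

```python
def reconcile(events, seen_ids):
    """Keep only events not previously recorded.

    `events` is the result of re-querying the vendor for the
    reconciliation window; `seen_ids` is a cache of identifiers
    already written out. The cost of this approach is the cache
    itself, plus the extra API calls to re-fetch the window.
    """
    new_events = [e for e in events if e["id"] not in seen_ids]
    seen_ids.update(e["id"] for e in new_events)
    return new_events


# "a1" and "a2" were collected on a previous run.
seen = {"a1", "a2"}

# Re-collecting the window later surfaces "a3", which the vendor
# back-filled after the original collection had already passed.
window = [{"id": "a2"}, {"id": "a3"}]
fresh = reconcile(window, seen)
```

Only the late-arriving `a3` event is recorded; the duplicate `a2` is dropped.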
Hi @hcpadkins, thanks for the detailed response. To better understand this issue it may be valuable to re-collect Google Workspace logs and compare sample sizes of "live" vs "historic" collections over a period of time. For example, I played around with Grove when raising this issue and it failed to collect 18.3% of the Google Workspace logs over a quiet weekend; this problem is worse during a working week. Please forgive the graph without much context, but this represents the maximum time lag per service that we have observed over the past 30 days.
This is the case even within a single event type: if two admin events occur on different Google shards, they will arrive out of order and be back-filled with no indication that this has occurred, as the event timestamp is the time the event occurred, not when it was eventually recorded.

A single pointer used between event types would result in a loss of data on services with a high replication time; Google Login, Drive & Calendar are the most susceptible to this.
Hi @gavinelder,

This is great information, thank you for providing this, and for such a detailed issue! Let me have a dig through our dataset to compare the findings with what we're seeing as well.

In the meantime, I'll open a task to investigate genericising and implementing a retrospection mechanism to catch missed events due to this sort of behaviour. Although Google push-notification-driven collection would resolve this for the Google connector, it is likely that this issue may occur with other products which Grove implements connectors for. As a result, it'd be good to understand whether we can do this in a way which can be applied to other connectors as well.

Thank you again for raising this issue, and for taking the time to report it in such detail.
In line with feedback in hashicorp-forge#17, GitHub also makes some logs available before others. As a result, a delay configuration has been added to allow operators to prefer consistency over collection time. In future, more in-depth retrospection and deduplication will be added, which should remove the need for this parameter. However, for now, this enables the operator to modify the delay based on their appetite for potentially missed logs - due to no lag timeframes being published by GitHub, and log entries being back-filled based on this unknown interval.
In line with feedback in hashicorp-forge#17, Google activities have a published 'lag' time before which logs cannot be considered consistent. As a result, a delay configuration has been added to allow operators to prefer consistency over collection time. In future, more in-depth retrospection and deduplication will be added, which should remove the need for this parameter.
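As a rough sketch of how such a delay might be applied (the function and parameter names are illustrative, not the actual configuration key), the upper bound of the collection window is simply held back by the configured lag:

```python
from datetime import datetime, timedelta, timezone


def collection_window(pointer, delay_minutes):
    """Return the (start, end) window to query from the vendor.

    `pointer` is the last point collected; the end of the window is
    held back by `delay_minutes` so that entries still inside the
    vendor's published lag window are not collected before they can
    be considered consistent.
    """
    end = datetime.now(timezone.utc) - timedelta(minutes=delay_minutes)
    # Never return a window that ends before it starts.
    return pointer, max(pointer, end)


start, end = collection_window(
    datetime(2023, 1, 1, tzinfo=timezone.utc), delay_minutes=240
)
```

The trade-off described in this thread is visible here: everything inside the delay window is deferred, even entries that were in fact available immediately.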
* Add configurable collection delay for GitHub. In line with feedback in #17, GitHub also makes some logs available before others. As a result, a delay configuration has been added to allow operators to prefer consistency over collection time. In future, more in-depth retrospection and deduplication will be added, which should remove the need for this parameter. However, for now, this enables the operator to modify the delay based on their appetite for potentially missed logs - due to no lag timeframes being published by GitHub, and log entries being back-filled based on this unknown interval.
* Add configurable collection delay for GSuite. In line with feedback in #17, Google activities have a published 'lag' time before which logs cannot be considered consistent. As a result, a delay configuration has been added to allow operators to prefer consistency over collection time. In future, more in-depth retrospection and deduplication will be added, which should remove the need for this parameter.
* Add delay to GSuite activities configuration example.
* GitHub query does not require 'AND'.
* Keep mypy happy. Non-optional fields are explicitly marked as 'type: ignore': although they're optional due to hydration from environment variables without defaults, validation and later tests enforce that they're not None.
Hey there,

Thank you again for raising this issue! We've previously integrated a set of changes which allow specification of a delay, holding back collection until the vendor's published lag window has passed.

The downside to this approach is that log entries which are available quickly are also lagged, despite being available for collection. There is also still an opportunity to miss logs, although a much narrower one, if they are delayed significantly longer than the measured and configured delay.

As this is an issue which has been found to affect two vendors so far, we have created a feature request to implement a general mechanism to better account for this (#25). This has been proposed over use of vendor notifications of when log data is available for two reasons:
I'll close this issue out for now as an interim workaround has been added, and as we have a feature-request item created for a more robust long-term solution. Thank you again for this detailed report, and for raising the issue to us, it's greatly appreciated! |
Looking at the Grove codebase, the current polling for Google Workspace audit logs using the Reports API presents a risk of audit logs being dropped or missed.

Currently, Google makes no promise that events are returned sequentially, so using the last timestamp of an event as a cursor means that log events which occurred before the cursor, but were made available at a later stage (as described in Google's "Data retention and lag times" documentation), would not be collected.
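To illustrate the failure mode, consider a naive collector that advances its cursor to the newest timestamp it has seen; a lagged event that is later back-filled behind the cursor is silently skipped (hypothetical example, not Grove's actual implementation):

```python
def poll(events, cursor):
    """Naive collection: return events newer than `cursor` and
    advance the cursor to the newest timestamp seen."""
    collected = [e for e in events if e["time"] > cursor]
    if collected:
        cursor = max(e["time"] for e in collected)
    return collected, cursor


# First poll sees events at t=1 and t=3; the cursor advances to 3.
batch, cursor = poll([{"time": 1}, {"time": 3}], cursor=0)

# A lagged event at t=2 is back-filled by the vendor afterwards...
late, cursor = poll([{"time": 1}, {"time": 2}, {"time": 3}], cursor)
# ...and is never collected, because it sits behind the cursor.
```

Because the back-filled event carries the timestamp of when it occurred, not when it was recorded, nothing in the stream signals that it was missed.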
The likelihood of missing logs depends on a number of factors, such as how geographically dispersed your workforce is, the frequency of events, and your polling rate.
For an alternative approach please see https://github.com/ryandeivert/terraform-aws-gsuite-reports-channeler