
Use of Google Audit reports as implemented presents a risk of missing logs. #17

Closed
gavinelder opened this issue Feb 24, 2023 · 4 comments
Assignees
Labels
enhancement New feature or request

Comments

@gavinelder

Looking at the Grove codebase the current polling for Google Workspace audit logs using the reports API presents a risk of audit logs being dropped or missed.

Currently, Google offers no guarantee that events are returned sequentially. Using the last timestamp of an event as a cursor means that log events which occurred before the cursor, but were only made available at a later stage (as described in Google's "Data retention and lag times" documentation), would not be collected.

The likelihood of missing logs depends on a number of factors, such as how geographically dispersed your workforce is, the frequency of events, and your polling rate.
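To illustrate the failure mode, here is a minimal sketch (not Grove's actual code; all names are hypothetical) of a collector that advances its cursor to the newest event timestamp seen in each poll, and silently drops a back-filled event:

```python
from datetime import datetime, timedelta

def poll(events, cursor):
    """Return events newer than the cursor, and the advanced cursor."""
    new = [e for e in events if e["time"] > cursor]
    if new:
        cursor = max(e["time"] for e in new)
    return new, cursor

t0 = datetime(2023, 2, 24, 12, 0)
cursor = t0

# First poll: only the 12:05 event has replicated so far.
seen, cursor = poll([{"id": "a", "time": t0 + timedelta(minutes=5)}], cursor)

# Second poll: a 12:02 event is back-filled late, alongside a 12:10 event.
second_batch = [
    {"id": "late", "time": t0 + timedelta(minutes=2)},  # lagged event
    {"id": "b", "time": t0 + timedelta(minutes=10)},
]
seen, cursor = poll(second_batch, cursor)

# The back-filled 12:02 event is dropped: it sorts before the cursor.
assert [e["id"] for e in seen] == ["b"]
```

The back-filled event carries the timestamp of when it occurred, not when it was made available, so a timestamp cursor has no way to notice it.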

For an alternative approach please see https://github.com/ryandeivert/terraform-aws-gsuite-reports-channeler

@hcpadkins hcpadkins added the enhancement New feature or request label Feb 27, 2023
@hcpadkins hcpadkins self-assigned this Feb 27, 2023
@hcpadkins
Contributor

hcpadkins commented Feb 27, 2023

Hey there,

Thank you for raising this issue, and for wanting to help make Grove better!

Reading the linked Google documentation, it's a little unclear whether lagged events occur within a single event type, or between event types. As an example, as "calendar" events are slower than "admin" logs, a single pointer used across both log types would result in missed events.

However, it's unclear to me whether events may be back-filled within the same log event type - where an "admin" event from 3 hours ago is injected into the existing "admin" event stream 3 hours late, while other "admin" events have been arriving on schedule.

That said, the use of "connector instances" should assist in reducing the likelihood of events being missed in the "between event types" case. For delayed events within the same event type, I'll do a retrospective over the last few months to try to quantify how much of an issue this may be today. Although this would only reflect our usage patterns, it should at least provide us a window into the problem.

Additionally, thank you for the reference to the channel-based approach! Today Grove is designed to run in a 'batch' mode of operation rather than as a long-lived service which accepts external notifications, so this may be a challenge to implement :)

Connector Instances.

A sample of our configuration for collection of Google related audit events looks something like the following - with the most relevant part being the operation field.

As each configuration results in a connector instance, a given gsuite_activities instance will only collect the log types requested in the operation field, as this field is used as an additional filter in requests to the API. As pointers for tracking the most recently seen events are unique to connector instances rather than connectors, the last seen admin event will be tracked independently of the last seen login event.

Admin Events

{
    "connector": "gsuite_activities",
    "encoding": {
        "key": "base64"
    },
    "identity": "<REMOVED>",
    "name": "<REMOVED>-admin",
    "operation": "admin",
    "secrets": {
        "key": "/path/to/key..."
    }
}

Login Events

{
    "connector": "gsuite_activities",
    "encoding": {
        "key": "base64"
    },
    "identity": "<REMOVED>",
    "name": "<REMOVED>-login",
    "operation": "login",
    "secrets": {
        "key": "/path/to/key..."
    }
}
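The pointer behaviour described above can be sketched as follows (a hypothetical illustration, not Grove's actual pointer implementation): pointers are keyed by connector instance rather than by connector, so each operation tracks its own most recently seen event independently.

```python
# Pointers keyed by (connector, operation) rather than by connector alone.
pointers = {}

def update_pointer(connector, operation, event_time):
    """Advance the pointer for a single connector instance."""
    key = (connector, operation)
    if key not in pointers or event_time > pointers[key]:
        pointers[key] = event_time

# ISO 8601 timestamps compare correctly as strings.
update_pointer("gsuite_activities", "admin", "2023-02-24T12:10:00Z")
update_pointer("gsuite_activities", "login", "2023-02-24T11:55:00Z")

# A lagged "login" stream neither holds back nor clobbers the "admin" pointer.
assert pointers[("gsuite_activities", "admin")] == "2023-02-24T12:10:00Z"
assert pointers[("gsuite_activities", "login")] == "2023-02-24T11:55:00Z"
```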

Wider Issues.

This said, you've touched on an important point which may apply to many connectors across different vendors: we usually have little visibility into how vendors make their log data available.

As a result, if a vendor's architecture causes logs to be aggregated at different rates, but the API does not provide an additional filter which can be used as a Grove operation to separate collection into a set of different instances, then events may be missed. That said, most of the APIs we work with today do provide this filtering mechanism.

However, this also extends to situations where a vendor delivers a single set of events late within an otherwise punctual stream of the same type - such that "login" events from one geographic region take N minutes to arrive, while "login" events from another region take only N seconds.

Future Improvements.

In order to assist with these edge cases, one feature we'd like to investigate adding to Grove in future is periodic reconciliation of events. This would allow a connector author to define a period of time over which events would be recollected and de-duplicated, with only previously unseen events being recorded.

However, this comes at the cost of increased cache sizes and a greater number of API calls, which may pose a challenge for high-volume sources with aggressive rate limits. There is also the challenge of selecting an appropriate reconciliation period, as not all vendors make this information available, nor will it always be reliable; some events may still be missed in exceptional cases.
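The reconciliation idea above can be sketched as follows (a hypothetical illustration under assumed names, not a Grove API): re-collect a trailing window of events and record only those not previously seen, using a cache of stable digests to de-duplicate.

```python
import hashlib

def reconcile(window_events, seen_hashes):
    """Return events in the window that have not been recorded before."""
    unseen = []
    for event in window_events:
        # Key each event by a stable digest of its identifying fields.
        digest = hashlib.sha256(repr(sorted(event.items())).encode()).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            unseen.append(event)
    return unseen

seen = set()
batch_1 = [{"id": "a", "time": "12:00"}, {"id": "b", "time": "12:01"}]
reconcile(batch_1, seen)  # both events are new and recorded

# A later pass re-collects the same window; only the back-filled event is
# recorded, and the previously seen events are discarded as duplicates.
batch_2 = batch_1 + [{"id": "late", "time": "11:58"}]
new = reconcile(batch_2, seen)
assert [e["id"] for e in new] == ["late"]
```

The cache-size cost mentioned above is visible here: the `seen` set must retain a digest for every event inside the reconciliation window.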

--

Please let me know if I've missed the mark with your originally posted issue, or if you'd like to discuss further, and thank you again for raising this issue! :)

@gavinelder
Author

gavinelder commented Feb 27, 2023

Hi @hcpadkins thanks for the detailed response.

To better understand this issue it may be valuable to re-collect Google Workspace logs and do a comparison of sample sizes of "live" vs "historic" collections over a period of time.

For example, I experimented with Grove when raising this issue and it failed to collect 18.3% of the Google Workspace logs over a quiet weekend; this problem is worse during a working week.

Please forgive the graph's lack of context, but it represents the maximum time lag per service that we have observed over the past 30 days.

[Graph: maximum observed time lag per service over the past 30 days]

> However, it's unclear to me whether events may be back-filled within the same log event type - where an "admin" event from 3 hours ago is injected into the existing "admin" event stream 3 hours late, while other "admin" events have been arriving on schedule.

This is the case: if two admin events occur on different Google shards, they will arrive out of order and be back-filled with no indication that this has occurred, as the event timestamp is the time the event occurred, not when it was eventually recorded.

> Reading the linked Google documentation, it's a little unclear whether lagged events occur within a single event type, or between event types. As an example, as "calendar" events are slower than "admin" logs, a single pointer used across both log types would result in missed events.

A single pointer used between event types would result in a loss of data for services with a high replication time; Login, Drive, and Calendar are the most susceptible to this.

@hcpadkins
Contributor

hcpadkins commented Feb 27, 2023

Hi @gavinelder ,

This is great information, thank you for providing this, and for such a detailed issue! Let me have a dig through our dataset to compare the findings with what we're seeing as well.

In the meantime, I'll open a task to investigate genericising and implementing a retrospection mechanism to catch events missed due to this sort of behaviour. Although Google's push-notification-driven collection would resolve this for the Google connector, this issue likely affects other products for which Grove implements connectors. As a result, it'd be good to understand whether we can solve it in a way which can be applied to other connectors as well.

Thank you again for raising this issue, and for the time taken to report it in such detail.

hcpadkins added a commit to hcpadkins/grove that referenced this issue Mar 7, 2023
In line with feedback in hashicorp-forge#17, GitHub also makes some logs available before others. As a result, a delay configuration has been added to allow operations to preference consistency over collection time. In future, more in-depth retrospection and deduplication will be added, which should remove the need for this parameter.

However, for now, this enables the operator to modify the delay based on their appetite for potentially missed logs - due to no lag time frames being provided by GitHub, and log entries being back-filled based on this unknown interval.
hcpadkins added a commit to hcpadkins/grove that referenced this issue Mar 7, 2023
In line with feedback in hashicorp-forge#17, Google activities have a published 'lag' time before which logs cannot be considered consistent. As a result, a delay configuration has been added to allow operations to preference consistency over collection time. In future, more in-depth retrospection and deduplication will be added, which should remove the need for this parameter.
hcpadkins added a commit that referenced this issue Mar 13, 2023
* Add configurable collection delay for Github.

In line with feedback in #17, GitHub also makes some logs available before others. As a result, a delay configuration has been added to allow operations to preference consistency over collection time. In future, more in-depth retrospection and deduplication will be added, which should remove the need for this parameter.

However, for now, this enables the operator to modify the delay based on their appetite for potentially missed logs - due to no lag time frames being provided by GitHub, and log entries being back-filled based on this unknown interval.

* Add configurable collection delay for GSuite.

In line with feedback in #17, Google activities have a published 'lag' time before which logs cannot be considered consistent. As a result, a delay configuration has been added to allow operations to preference consistency over collection time. In future, more in-depth retrospection and deduplication will be added, which should remove the need for this parameter.

* Add delay to GSuite activities configuration example.

* Github query does not require 'AND'

* Keep mypy happy.

Non-optional fields are explicitly marked as 'type: ignore': although they are typed as optional, due to hydration from environment variables without defaults, validation and later tests enforce that they are not None.
@hcpadkins
Contributor

Hey there,

Thank you again for raising this issue! We've previously integrated a set of changes which allows specification of a lag parameter in the connector configuration for affected log sources (#25). This has the effect of delaying the collection of log entries until the log has had an opportunity to become consistent.

The downside to this approach is that log entries which are available quickly are also delayed, despite being ready for collection. There is also still an opportunity to miss logs, although the window is much narrower, if entries are delayed significantly longer than the measured and configured lag time frames.
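The effect of the lag parameter can be sketched as follows (an illustrative example with assumed names, not Grove's actual configuration keys): the collection window is shifted back in time so that only entries old enough to be considered consistent are requested.

```python
from datetime import datetime, timedelta, timezone

def collection_window(pointer, now, lag_minutes):
    """Return the (start, end) window to request from the vendor's API."""
    end = now - timedelta(minutes=lag_minutes)
    # Never produce an inverted window if the pointer falls inside the lag.
    return pointer, max(pointer, end)

now = datetime(2023, 3, 13, 12, 0, tzinfo=timezone.utc)
pointer = now - timedelta(hours=1)

# With a 10-minute lag, collection stops 10 minutes short of "now", giving
# slow log entries time to become consistent before they are requested.
start, end = collection_window(pointer, now, lag_minutes=10)
assert end == now - timedelta(minutes=10)
```

This also makes the trade-off above concrete: every entry, fast or slow, is held back by the full configured lag.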

As this is an issue which has been found to affect two vendors so far, we have created a feature request to implement a general mechanism to better account for this (#25). This has been proposed over the use of vendor notifications of when log data is available for a few reasons:

  1. Not all vendors provide a streaming or notification mechanism for log data.
  2. Grove does not expose long lived socket listeners for receiving data from external services today.
  3. Log notification and delivery from certain vendors is treated as best effort with "no guarantees" or metrics provided around the success of the operation.

I'll close this issue out for now as an interim workaround has been added, and as we have a feature-request item created for a more robust long-term solution.

Thank you again for this detailed report, and for raising the issue to us, it's greatly appreciated!
