
[Cloud Posture] Deprecate Elasticsearch transform in Cloud security posture plugin #153875

Closed
3 of 11 tasks
opauloh opened this issue Mar 28, 2023 · 9 comments
Assignees
Labels
8.11 candidate Team:Cloud Security Cloud Security team related technical debt Improvement of the software architecture and operational architecture

Comments

@opauloh
Contributor

opauloh commented Mar 28, 2023

Summary

This is a proposal to deprecate the use of the Elasticsearch transform in the Cloud security posture plugin. Currently, the transform is used to generate the latest finding for each resource.id + rule.id pair, which is then stored in the logs-cloud_security_posture.findings_latest-* index.

However, using a transform adds a layer of complexity to testing and maintenance. In addition, we have been facing issues where the transform does not recover on its own when upgrading Elastic Stack versions.

Our transform has a max_age of 26h, with resource.id and rule.id as unique keys. However, we can achieve the same result by querying the logs-cloud_security_posture.findings-* index directly: filter on @timestamp and aggregate to group the findings by resource.id and rule.id, retrieving the latest finding for each group, as sketched below.
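A minimal sketch of what such a query could look like (field names are taken from this issue; the composite aggregation shape and sizes are illustrative assumptions, not a final implementation):

  // Sketch only: group findings by resource.id + rule.id and keep the latest doc per group
  const latestFindingsQuery = {
    size: 0,
    query: {
      bool: {
        // limit to the transform's current max_age window
        filter: [{ range: { '@timestamp': { gte: 'now-26h' } } }],
      },
    },
    aggs: {
      latest_by_resource_and_rule: {
        // paginate over every resource.id + rule.id combination
        composite: {
          size: 1000,
          sources: [
            { resource_id: { terms: { field: 'resource.id' } } },
            { rule_id: { terms: { field: 'rule.id' } } },
          ],
        },
        aggs: {
          // keep only the most recent finding per combination
          latest_finding: {
            top_hits: { size: 1, sort: [{ '@timestamp': 'desc' }] },
          },
        },
      },
    },
  };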

Benefits:

  • Simplify the Cloud security posture plugin codebase and reduce maintenance costs
  • Reduce the potential for issues when upgrading Elastic stack versions
  • Reduce the number of moving parts in the system, which could lead to increased reliability

Approach

There's one solution that wasn't explored yet: using a single hash field for rule.id + resource.id combined with the collapse option in the Elasticsearch queries. The benefit of using collapse is that, unlike grouping with aggregations, it doesn't affect sorting or other aggregations, so we don't have any regression in the experience we provide in the dashboard or the findings table.

Back in the AWP team, we used collapse in the session viewer plugin, as can be seen here, to aggregate Linux events by unique session, and it was a performant solution, working properly even with the index holding millions of records.

The only constraint is that collapse works on a single field, which is why we would need a new field to group on (in this case, a hash of rule.id + resource.id).

Suggestion: we can use event.code as the unique field.
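As a purely illustrative sketch, the grouping value could be derived with a hypothetical helper like the one below; in practice the field would be populated at ingest time, and the suggestion above is to reuse event.code for it:

  import { createHash } from 'crypto';

  // Hypothetical helper: collapse accepts a single field only, so combine
  // rule.id and resource.id into one deterministic keyword value.
  export const findingsGroupingKey = (ruleId: string, resourceId: string): string =>
    createHash('sha256').update(`${ruleId}:${resourceId}`).digest('hex');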

Tasks

@opauloh opauloh added technical debt Improvement of the software architecture and operational architecture Team:Cloud Security Cloud Security team related labels Mar 28, 2023
@elasticmachine
Contributor

Pinging @elastic/kibana-cloud-security-posture (Team:Cloud Security)

@CohenIdo
Contributor

@opauloh
Contributor Author

opauloh commented Apr 1, 2023

Hey @opauloh, we had a similar discussion recently, please go over the following issues:

Thanks @CohenIdo. After reading each issue carefully, I identified one solution that wasn't explored yet: using a single hash field for rule.id + resource.id combined with the collapse option in the Elasticsearch queries. The benefit of using collapse is that, unlike grouping with aggregations, it doesn't affect sorting or other aggregations, so we don't have any regression in the experience we provide in the dashboard or the findings table.

Back in the AWP team, we used collapse in the session viewer plugin, as can be seen here, to aggregate Linux events by unique session, and it was a performant solution, working properly even with the index holding millions of records.

The only constraint is that collapse works on a single field, which is why we would need a new field to group on (in this case, a hash of rule.id + resource.id).

@kfirpeled
Contributor

@opauloh when working on this I would say we have 3 big components to examine besides the happy flow:

  1. Grouping by resource
  2. The Dashboard
  3. Score calculation

@CohenIdo, @JordanSh am I missing something in regard to this task?

@eyalkraft
Contributor

Very interesting Paulo!
Deprecating the transforms will indeed be a great benefit in terms of reducing the complexity of our solution.
It could help us with namespaces, for example.

Can't wait to see the results!

@eyalkraft
Contributor

eyalkraft commented Aug 31, 2023

Depending on when we ship this, it could solve a problem we have with ILMs on serverless

@kfirpeled
Contributor

@opauloh can we also track backporting event.code creation to previous packages?

@opauloh
Contributor Author

opauloh commented Sep 8, 2023

I'm closing this ticket since we conducted a POC using collapse to query data directly from the data stream index and found a few issues with this approach.

Summary of our learnings:

The collapse API works great for tables, as it can collapse data by an identifier key:

Before collapse:

[screenshot]

After collapse:

  // collapse option added to the findings search request body
  collapse: {
    field: 'event.code',
    inner_hits: {
      name: 'latest_result_evaluation',
      size: 1,
      sort: [{ '@timestamp': 'desc' }],
    },
  },
[screenshot]

However, two problems were found:

Issue 1: Limit of aggregated data for dashboards and grouped table:

In order for our Dashboard to show the correct information, we need to perform an aggregation on the identifier key, and then a sub-aggregation on the top_hits of the latest event:

  // terms aggregation on the identifier key, with a top_hits
  // sub-aggregation that keeps only the latest event per key
  unique_event_code: {
    terms: {
      field: 'event.code',
      size: 65000,
    },
    aggs: {
      latest_result_evaluation: {
        top_hits: {
          _source: ['result.evaluation'],
          size: 1,
          sort: [{ '@timestamp': 'desc' }],
        },
      },
    },
  },

That query runs with a time range filter of now - 26 hours. However, here we hit the 65k-bucket limit on the first aggregation on event.code, and since we also need the latest hits for the dashboard to calculate the correct number of failed findings, we are limited to 65k findings in total (counting the duplicated records).

This means that when we inserted 70k findings records (51k of them unique findings), the ungrouped table worked as expected using collapse:

[screenshot]

But the dashboard and grouped by resource table didn't work:

[screenshot]

The following error appeared in the logs:

too_many_buckets_exception: Trying to create too many buckets. Must be less than or equal to: [65536] but this number of buckets was exceeded. This limit can be set by changing the [search.max_buckets] cluster level setting.

Issue 2: filtering by result.evaluation would not guarantee showing the most up-to-date data:

If there are multiple findings that were remediated, or that went from a passed state to a failed state, adding a filter could surface stale data.

Example: The most up-to-date finding for this unique key is failed:

[screenshot]

But when filtering for result.evaluation: passed, since the query now filters out the failed findings, it incorrectly shows an older finding record with the passed result:

[screenshot]
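To make the failure mode concrete, here is an illustrative request sketch (field names from this issue; the overall request shape is an assumption): because the term filter removes the newer failed documents before collapse runs, the top inner hit per event.code can be a stale passed finding.

  // Illustrative only: the term filter drops the newer `failed` document,
  // so collapse picks the latest among the remaining (stale) `passed` ones.
  const filteredLatestFindings = {
    query: {
      bool: {
        filter: [
          { range: { '@timestamp': { gte: 'now-26h' } } },
          { term: { 'result.evaluation': 'passed' } },
        ],
      },
    },
    collapse: {
      field: 'event.code',
      inner_hits: {
        name: 'latest_result_evaluation',
        size: 1,
        sort: [{ '@timestamp': 'desc' }],
      },
    },
    sort: [{ '@timestamp': 'desc' }],
  };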

Conclusion

These 2 issues are a showstopper for moving forward with the collapse approach. Even if we could come up with a solution for problem number 2, we already know from telemetry data that the limit of 65k findings in a 26-hour time range won't work for some users, and it would also prevent future enhancements such as adding more Grouped by visualizations to the findings.

The final conclusion is that, with the current model, we don't have a way of querying directly from the data stream index without hitting memory limits for large data sets, which means that using transforms remains the best approach for now.

The code used during the attempt is in this PR, which is now closed.

@kfirpeled
Contributor

Thank you @opauloh for taking the time to summarize your conclusions!
