[Cloud Posture] Deprecate Elasticsearch transform in Cloud security posture plugin #153875

opauloh · 2023-03-28T14:32:18Z

Summary

This is a proposal to deprecate the use of the Elasticsearch transform in the Cloud security posture plugin. Currently, the transform is used to generate the latest finding for each resource.id + rule.id pair, which is then stored in the logs-cloud_security_posture.findings_latest-* index.

However, the transform adds a layer of complexity that makes the plugin harder to test and maintain. In addition, we have been facing issues where the transform doesn't recover by itself when upgrading Elastic Stack versions.

Our transform has a max_age of 26h, with resource.id and rule.id as unique keys. However, we can achieve the same results using Elasticsearch queries directly in the logs-cloud_security_posture.findings-* index by using an @timestamp filter with aggregation to group the findings by resource.id and rule.id and retrieve the latest finding for each group.

Benefits:

Simplify the Cloud security posture plugin codebase and reduce maintenance costs
Reduce the potential for issues when upgrading Elastic stack versions
Reduce the number of moving parts in the system, which could lead to increased reliability

Approach

There's one solution that wasn't explored yet: using a single hash field for rule.id + resource.id, combined with the collapse query option in the Elasticsearch queries. The benefit of collapse is that, unlike grouping with aggregations, it doesn't interfere with sorting or with other aggregations, so we shouldn't have any regression in the experience we provide in the dashboard or the findings table.

Back on the AWP team, we used collapse in the session viewer plugin (as can be seen here) to aggregate Linux events by unique session. It was a performant solution that worked properly even with millions of records in the index.

The only requirement is that collapse works as desired with a single field only, which is why we would need a new field to group on (in this case a hash of rule.id + resource.id).

Suggestion: We can use event.code as the unique field.
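As a rough illustration of the single-field idea, the sketch below derives one key from rule.id and resource.id. The helper name `findingKey`, the choice of SHA-256, and the JSON encoding of the pair are assumptions made for this example; the thread does not say how or where the hash would actually be computed.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper (not taken from the plugin): derives a single value
// from rule.id and resource.id so that collapse can group on one field.
function findingKey(ruleId: string, resourceId: string): string {
  // JSON-encode the pair so that ("a", "bc") and ("ab", "c") cannot collide
  // the way plain string concatenation would.
  const canonical = JSON.stringify([ruleId, resourceId]);
  // SHA-256 is an assumption; any stable hash with a low collision rate would do.
  return createHash("sha256").update(canonical).digest("hex");
}

// The same pair always yields the same key, so every evaluation of one rule
// against one resource ends up in the same collapse group.
const exampleKey = findingKey("rule-1", "resource-1");
console.log(exampleKey.length); // 64 hex characters
```

The resulting value is what would be stored in the grouping field (event.code in the suggestion above).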

elasticmachine · 2023-03-28T14:32:36Z

Pinging @elastic/kibana-cloud-security-posture (Team:Cloud Security)

CohenIdo · 2023-03-28T15:45:16Z

Hey @opauloh, we had a similar discussion recently, please go over the following issues:

opauloh · 2023-04-01T09:33:45Z

Hey @opauloh, we had a similar discussion recently, please go over the following issues:

[Proposal] Replace Transform by indexing to the latest findings index security-team#5301

[Deprecate Transfrom POC] index latest findings using ingest-pipelines security-team#5495

[Deprecate Transfrom POC] Querying findings index security-team#5496

Thanks @CohenIdo. After reading each issue carefully, I identified one solution that wasn't explored yet: using a single hash field for rule.id + resource.id, combined with the collapse query option in the Elasticsearch queries. The benefit of collapse is that, unlike grouping with aggregations, it doesn't interfere with sorting or with other aggregations, so we shouldn't have any regression in the experience we provide in the dashboard or the findings table.

Back on the AWP team, we used collapse in the session viewer plugin (as can be seen here) to aggregate Linux events by unique session. It was a performant solution that worked properly even with millions of records in the index.

The only requirement is that collapse works as desired with a single field only, which is why we would need a new field to group on (in this case a hash of rule.id + resource.id).

kfirpeled · 2023-04-19T23:09:00Z

@opauloh when working on this I would say we have 3 big components to examine besides the happy flow:

Grouping by resource
The Dashboard
Score calculation

@CohenIdo, @JordanSh am I missing something regarding this task?

eyalkraft · 2023-07-17T11:51:47Z

Very interesting Paulo!
Deprecating the transforms will indeed have a great benefit in terms of reduced complexity of our solution.
It could help us with namespaces, for example.

Can't wait to see the results!

eyalkraft · 2023-08-31T07:28:01Z

Depending on when we ship this, it could solve a problem we have with ILMs on serverless

https://github.com/elastic/security-team/issues/7441
@CohenIdo

kfirpeled · 2023-09-05T15:32:21Z

@opauloh can we also track backporting event.code creation to previous packages?

opauloh · 2023-09-08T18:35:48Z

I'm closing this ticket: we conducted a POC that uses collapse to query data directly from the data stream index and found a few issues with this approach.

Summary of our learnings:

Collapse API works great for tables, as it can collapse data by an identifier key:

Before collapse:

After collapse:

  collapse: {
    field: 'event.code',
    inner_hits: {
      name: 'latest_result_evaluation',
      size: 1,
      sort: [{ '@timestamp': 'desc' }],
    },
  },
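To make the effect of this clause concrete, here is a small in-memory model of it, not a call to Elasticsearch. The `Finding` interface, the sample records, and the function name `latestPerEventCode` are assumptions for illustration; real finding documents carry many more fields.

```typescript
// Minimal shape of a finding for this sketch (assumed, heavily reduced).
interface Finding {
  "@timestamp": string; // ISO 8601 in UTC, so string order matches time order
  event: { code: string };
  result: { evaluation: "passed" | "failed" };
}

// Models what the inner_hits part of the clause yields: for each event.code,
// the single newest record (size: 1, sorted by @timestamp descending).
function latestPerEventCode(findings: Finding[]): Finding[] {
  const latest = new Map<string, Finding>();
  for (const finding of findings) {
    const current = latest.get(finding.event.code);
    if (!current || finding["@timestamp"] > current["@timestamp"]) {
      latest.set(finding.event.code, finding);
    }
  }
  return Array.from(latest.values());
}

const sampleFindings: Finding[] = [
  { "@timestamp": "2023-09-07T10:00:00Z", event: { code: "key-a" }, result: { evaluation: "passed" } },
  { "@timestamp": "2023-09-08T10:00:00Z", event: { code: "key-a" }, result: { evaluation: "failed" } },
  { "@timestamp": "2023-09-07T10:00:00Z", event: { code: "key-b" }, result: { evaluation: "passed" } },
];

// Three records collapse into two rows: key-a (failed, newest) and key-b (passed).
console.log(latestPerEventCode(sampleFindings).map((f) => `${f.event.code}:${f.result.evaluation}`));
```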

However, two problems were found:

Issue 1: Limit of aggregated data for dashboards and grouped table:

In order to have our Dashboard show the correct information we need to perform an aggregation on the identifier key, and then a sub aggregation on the top_hits of the latest event:

  unique_event_code: {
    terms: {
      field: 'event.code',
      size: 65000,
    },
    aggs: {
      latest_result_evaluation: {
        top_hits: {
          _source: ['result.evaluation'],
          size: 1,
          sort: [{ '@timestamp': 'desc' }],
        },
      },
    },
  },

That query runs with a time range filter of now - 26 hours. However, here we hit the limit of 65k buckets on the first aggregation (the terms aggregation on event.code), and since we also need the latest hit in each bucket for the dashboard to calculate the correct number of failed findings, we are limited to 65k findings in total (counting the duplicated records).
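Put together, the request might look roughly like the sketch below. The constant name `latestEvaluationRequest` is hypothetical, and the flattened layout (index, size, query, and aggs at the top level) is an assumption about how the request would be passed to the search client; only the aggregation, the index pattern, and the 26-hour window come from this thread.

```typescript
// Sketch only: combines the 26-hour window with the aggregation shown above.
const latestEvaluationRequest = {
  index: "logs-cloud_security_posture.findings-*",
  size: 0, // only the aggregation results are needed, not the raw hits
  query: {
    bool: {
      // Same window as the transform's max_age of 26h.
      filter: [{ range: { "@timestamp": { gte: "now-26h" } } }],
    },
  },
  aggs: {
    unique_event_code: {
      // One bucket per unique finding key; this is where the bucket limit bites.
      terms: { field: "event.code", size: 65000 },
      aggs: {
        latest_result_evaluation: {
          top_hits: {
            _source: ["result.evaluation"],
            size: 1,
            sort: [{ "@timestamp": "desc" }],
          },
        },
      },
    },
  },
};

console.log(JSON.stringify(latestEvaluationRequest, null, 2));
```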

For example, after inserting 70k finding records (51k of them unique), the ungrouped table worked as expected using collapse:

But the dashboard and the grouped-by-resource table didn't work:

The following error was thrown in the logs:

too_many_buckets_exception: Trying to create too many buckets. Must be less than or equal to: [65536] but this number of buckets was exceeded. This limit can be set by changing the [search.max_buckets] cluster level setting.

Issue 2: filtering by result.evaluation would not guarantee showing the most up-to-date data:

If there are multiple findings that were remediated, or that went from a passed state to a failed state, adding a filter could show outdated data.

Example: The most up-to-date finding for this unique key is failed:

But when filtering for result.evaluation: passed, since the query is now filtering out the failed findings, it would incorrectly show an old finding record with the passed finding:
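The ordering problem can be reproduced with a small in-memory model (not an Elasticsearch call). The `Finding` shape, the sample records, and the helper name `newestPerKey` are assumptions for illustration. The point is that the query filter runs before the collapse, while a correct view would need the filter applied after picking the latest record per key.

```typescript
// Minimal shape of a finding for this sketch (assumed, heavily reduced).
interface Finding {
  "@timestamp": string; // ISO 8601 in UTC, so string order matches time order
  event: { code: string };
  result: { evaluation: "passed" | "failed" };
}

// Keeps only the newest record for each event.code, as collapse would.
function newestPerKey(findings: Finding[]): Finding[] {
  const newest = new Map<string, Finding>();
  for (const finding of findings) {
    const current = newest.get(finding.event.code);
    if (!current || finding["@timestamp"] > current["@timestamp"]) {
      newest.set(finding.event.code, finding);
    }
  }
  return Array.from(newest.values());
}

// One unique key that was passing yesterday and is failing today.
const recordsForKey: Finding[] = [
  { "@timestamp": "2023-09-07T10:00:00Z", event: { code: "key-1" }, result: { evaluation: "passed" } },
  { "@timestamp": "2023-09-08T10:00:00Z", event: { code: "key-1" }, result: { evaluation: "failed" } },
];

const isPassed = (f: Finding) => f.result.evaluation === "passed";

// What the query does: filter first, then collapse. The stale passed record survives.
const filterThenCollapse = newestPerKey(recordsForKey.filter(isPassed));

// What the table should show: collapse first, then filter. key-1 is currently failed.
const collapseThenFilter = newestPerKey(recordsForKey).filter(isPassed);

console.log(filterThenCollapse.length); // 1 (the outdated passed record)
console.log(collapseThenFilter.length); // 0
```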

Conclusion

These 2 issues are a showstopper for moving forward with the collapse approach. Even if we can think of a solution for issue 2, telemetry data already tells us that a limit of 65k findings in a 26-hour time range won't work for some users, and it would also prevent future enhancements such as adding more Grouped by visualizations to the findings.

The final conclusion is that, with the current model, we don't have a way of querying directly from the data stream index without hitting memory limits for a large data set. This means that the use of transforms is currently the best approach.

The code used during the attempt is on this PR which is now closed.

kfirpeled · 2023-09-10T11:45:55Z

Thank you @opauloh for taking the time to summarize your conclusions!

opauloh added the technical debt and Team:Cloud Security labels Mar 28, 2023

kfirpeled mentioned this issue Apr 25, 2023

Move transforms from the CSP plugin to the CSP integration package #130086

Closed

2 tasks

kfirpeled added the 8.9 candidate label May 2, 2023

kfirpeled assigned opauloh May 2, 2023

kfirpeled added 8.10 candidate and removed 8.9 candidate labels May 22, 2023

CohenIdo mentioned this issue May 28, 2023

[Cloud Posture] Install Transforms using package assets #151860

Open

3 tasks

tehilashn added 8.11 candidate and removed 8.10 candidate labels Jul 17, 2023

kfirpeled linked a pull request Sep 5, 2023 that will close this issue

[Cloud Security] [WIP] Deprecate findings-latest-* transform #165543

Closed

opauloh closed this as completed Sep 8, 2023

opauloh mentioned this issue Sep 8, 2023

[Cloud Security] [WIP] Deprecate findings-latest-* transform #165543

Closed
