[Discuss] Elastic Security Indicator Match Rule tuning and optimizations #64746
Comments
That is the configurable part that we are wondering if we should expose to users to "fiddle with", but we are also wondering if there are conditions under which one setting of the "knobs" is better than another, or how we can tune this efficiently. Maybe there is even a way to query Elasticsearch node information to help auto-tune it? We are looking for any helpful advice on all of this. At very large scale we expect the following to happen every 5 minutes when our rule/alert runs.

For example, if you have 900k indicator/threat items which are IPs/ports/host names like the screenshot example above from @spong, they would look like this in the indicator/threat index. These represent bad things, the needles in the haystack we are looking for, which are rare finds:

```json
{
  "@timestamp": "2020-11-07T15:47:55.204Z",
  "source": { "ip": "127.0.0.1", "port": 1 },
  "host": { "name": "computer-1" }
},
{
  "@timestamp": "2020-11-07T15:48:55.204Z",
  "source": { "ip": "127.0.0.1", "port": 2 },
  "host": { "name": "computer-2" }
}
```

...and so on, up to 100, 1k, 10k, 100k, or ~500k threats/indicators being looked for. We want to see if we get a match against any of these threat/indicator records, OR'd together, against our large volume of source documents.

If we set the "knobs" like this (our current default):

```json
"concurrent_searches": 1,
"items_per_search": 9000
```

we will execute exactly 1 search using 9k of the indicator/threat items against the source documents within the 5 minutes, which could be thousands/millions/billions of documents, from Kibana to Elasticsearch. If we get 100 matches/signals, we stop, as that is our "circuit breaker". In reality we expect a well-tuned rule to find 0 or near 0. Once the search is done or times out, we return to our indicator/threat list, grab the next 9k indicator/threat items, and continue until we have worked through every item in the list.

If we want to, we can change these "knobs" to something else, such as:

```json
"concurrent_searches": 10,
"items_per_search": 100
```

Now we will execute 10 searches at once, each using 100 of the indicator/threat items, against the source documents within the 5 minutes, which again could be thousands/millions/billions of documents. Each search now has a limit of 100 items, meaning we could find up to 100 matches/signals per concurrent search, but again we expect a well-tuned rule to find near 0. Once each concurrent search is done or times out, we return to our indicator/threat list, grab the "next" 100 items, construct another 10 concurrent searches with 100 items each, and continue until we have worked through every indicator/threat item in the list.

To limit things, you can be as lightweight with the searches and items as this:

```json
"concurrent_searches": 1,
"items_per_search": 1
```

Now you are sending only 1 list item at a time, waiting to see whether it gets a positive match before sending the next indicator/threat list item. Obviously you are now searching the thousands/millions/billions of records one indicator/threat list item at a time, using the filter we construct from that single item, which would be 9k round trips from Kibana to Elasticsearch if you have 9k items in your indicator/threat list.
Pinging @elastic/es-search (Team:Search)
@jimczi @giladgal Thanks for taking the time to brainstorm with us and discuss options going forward. To summarize, our efforts here are going to be broken up into two parts:
We'll sync next once we complete the POC and can provide feedback on whether this would be a suitable mid/long-term solution for our needs (hopefully midway through the 7.12 feature development cycle). Thanks again! 🙂
What is the limitation of these indicator rules? I have 6 million indicators in total, and some of the indicator rules need to go through 1-2 million indicators. Just some quick math [I hope the math is correct]: if it takes 90 x [10 concurrent queries] to get through a single page, then it's going to take 19,800 x [10 concurrent queries] to get through all 2 million IOCs, which is roughly 198,000 queries in total. If my rules run every 5 minutes, Kibana needs to be able to finish these queries by the next run, otherwise it will just cascade out of control.
Yeah, I've had Kibana struggle when importing small datasets (30k indicators), albeit with a low-powered cluster.
I actually have a decent cluster with 6 dedicated data nodes (64 GB / 8 CPU each) and 2 Kibana nodes (16 GB / 4 CPU each). With the way I currently understand this works, I feel that it's actually an inefficient way of doing it. In this solution, we are querying the same set of SIEM data hundreds or thousands of times and filtering it with different items each time. Wouldn't we be better off getting the IOCs (with adjustability of how many docs you can store in pages), querying the SIEM data once, storing that in memory, and then working through the IOCs against the initial index results? This may not be technically feasible with the way Elasticsearch currently works, but there must be a better way of doing this.
I don't know the internals well enough to comment, but there must be a better way to do it.
@hilt86 @ayedem thank you! Indicator match rules are a relatively new rule type, and we're always looking at ways to iterate on and improve these features, so your feedback is incredibly helpful. I believe it's been implied above, but to state it explicitly: indicator match rules are currently optimized for large event datasets. Because of this, a situation with a large number of indicators but a relatively small set of events is not going to produce the most optimal query (which I believe was your assertion above, @ayedem?). For now, the biggest performance improvements will come from limiting the number of indicators your rule uses. A few examples:
We are looking at ways to optimize your use case as well. To that end, I have a few questions for you both (and for anyone else who happens upon this!):
Closing, as this issue was used for discussion and there is currently nothing left to do on the ES side.
In Elastic Security 7.10 we're introducing a new Detection Engine rule type called Indicator/Threat Matching (elastic/kibana#77395, elastic/kibana#78955), which will allow users to use the results of querying one index (the threat index) to filter/query data in a second index (the source index). This feature is being released as beta and can be quite resource intensive, so the hope here is to get a better understanding of what we can do to optimize and tune our current algorithm/search strategy for optimal performance on the Elasticsearch side.

I'll try to keep the detections/security language to a minimum, but the gist is that every 5 minutes a search is performed against the previous 5 minutes of data to see if a combination of fields from one index (threat index) exists in a second index (source index).
Input
The user configuration is essentially as follows: users have the ability to specify any number of Source Indices, a Source Query, any number of Threat Indices, a Threat Query, and a Threat Mapping object specifying the field mapping between the source and threat indices. A sample configuration would look like the following:
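A rough sketch of such a configuration, assuming rule parameters along the lines of those discussed in this thread (`threat_index`, `threat_query`, `threat_mapping`, `concurrent_searches`, `items_per_search`); the exact parameter names and shape of the rule API may differ:

```json
{
  "index": ["logs-*", "filebeat-*"],
  "query": "*:*",
  "threat_index": ["threat-index"],
  "threat_query": "*:*",
  "threat_mapping": [
    {
      "entries": [
        { "field": "source.ip", "type": "mapping", "value": "source.ip" },
        { "field": "source.port", "type": "mapping", "value": "source.port" },
        { "field": "host.name", "type": "mapping", "value": "host.name" }
      ]
    }
  ],
  "concurrent_searches": 10,
  "items_per_search": 10
}
```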
Search Strategy
We begin by querying the threat index and storing the results in memory within Kibana as we work our way through the list. Lists are batched into memory in buckets of 9000 documents at a time (a large threat list could be ~400k-600k documents).
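One way to page a large threat list into memory 9,000 documents at a time is `search_after` on a sorted query against the threat index (e.g. `POST threat-index/_search`). The sketch below assumes a hypothetical `event.id` keyword field as a sort tiebreaker and is not necessarily how Kibana implements the paging:

```json
{
  "size": 9000,
  "query": { "match_all": {} },
  "sort": [
    { "@timestamp": "asc" },
    { "event.id": "asc" }
  ],
  "search_after": ["2020-11-07T15:48:55.204Z", "indicator-0009000"]
}
```

The first page omits `search_after`; each subsequent request passes the sort values of the last hit from the previous page.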
Once the list is in memory, we use the above `items_per_search` and `concurrent_searches` settings to chunk the processing. For the above configuration, we'll create 10 queries, each with 10 threat items as filters, and then execute them all at once. Once all requests have returned, we check for results (which the majority of the time will be 0), and continue searching through the next block of 100 items (10x10), pulling more into memory as needed, until we've searched for all items.

Sample query with `items_per_search: 10`:
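A rough sketch of what one such chunked query could look like for the example mapping, abbreviated to 2 of the 10 items; the exact clause types and the way the source query is wrapped are assumptions (the query-time daterange filter discussed below is added to the `filter` array as well):

```json
{
  "bool": {
    "filter": [
      { "query_string": { "query": "*:*" } },
      {
        "bool": {
          "should": [
            {
              "bool": {
                "must": [
                  { "match": { "source.ip": "127.0.0.1" } },
                  { "match": { "source.port": 1 } },
                  { "match": { "host.name": "computer-1" } }
                ]
              }
            },
            {
              "bool": {
                "must": [
                  { "match": { "source.ip": "127.0.0.1" } },
                  { "match": { "source.port": 2 } },
                  { "match": { "host.name": "computer-2" } }
                ]
              }
            }
          ],
          "minimum_should_match": 1
        }
      }
    ]
  }
}
```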
Tuning/Optimizations
With the above algorithm, our open questions mostly lie around the usefulness of batching like this, and whether smaller batches of filters or one large batch per query would prove optimal (or if it really just depends on the data set, cluster configuration, etc.). This feature is also intended to be used with cross-cluster search (CCS).
Also in question is whether there is anything we can do to better leverage caching with regard to the time windows we're querying. As it stands, a daterange filter is constructed (below) with `to`/`from` being calculated at query time, so it is not static between queries (the result of some other upstream logic we'll need to address).

Daterange filter:
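A rough sketch of the shape such a filter could take, with illustrative timestamps:

```json
{
  "range": {
    "@timestamp": {
      "gte": "2020-11-07T15:42:55.204Z",
      "lte": "2020-11-07T15:47:55.204Z",
      "format": "strict_date_optional_time"
    }
  }
}
```

Because the concrete `gte`/`lte` values differ on every run, the request body differs between runs as well, which is the caching concern described above.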
Hopefully this is enough information to provide an idea of what we're doing here, and please do let me know if I can clarify any aspect of the above.