
investigate optimization of rule matching (May, 2024) #2063

Closed
williballenthin opened this issue May 6, 2024 · 15 comments
Labels
enhancement New feature or request gsoc Work related to Google Summer of Code project. performance Related to capa's performance

Comments

@williballenthin
Collaborator

As shown in #2061, perhaps 70% of capa runtime is spent evaluating rule logic. That means that if we want to make capa run faster, improvements to the logic evaluation code may have a bigger impact than changes to code analysis.

In this thread I'll record findings and ideas for performance enhancements. We can close out this issue when we feel we have a good handle on performance and whether it's worthwhile to make changes to capa.

@williballenthin williballenthin added the enhancement New feature or request label May 6, 2024
@williballenthin
Collaborator Author

Here's a speedscope trace captured by py-spy when using capa with the BinExport2 backend against mimikatz. Using this backend removes a bunch of noise related to vivisect analysis, which isn't relevant to this thread, and isn't something we can easily improve.

profile-be2.speedscope.zip

For example, use the sandwich diagram to identify routines that take up a lot of runtime:

[image: speedscope sandwich diagram highlighting the most expensive routines]

@williballenthin

This comment was marked as resolved.

@williballenthin

This comment was marked as outdated.

@williballenthin

This comment was marked as outdated.

@mike-hunhoff

This comment was marked as resolved.

@mike-hunhoff
Collaborator

@s-ff take a look at the research and discussion in this issue to get you thinking about our GSoC project. No action beyond reviewing (and posting any thoughts you have) is needed at this time, we'll discuss more in our upcoming meetings 😄 .

@williballenthin

This comment was marked as resolved.

@williballenthin
Collaborator Author

williballenthin commented May 8, 2024

slow mimikatz function

DEBUG:capa.capabilities.static:analyzed function 0x420a81 and extracted 325 features, 2 matches in 1.57s                                                                                            

Just ~325 features take over 1.5s to evaluate! Why is this so? Maybe if we investigate this one function we can make fixes that help all functions.


Ok, well that is gross:
[image: control flow graph of 0x420a81, dense with many small basic blocks]

Lots of small basic blocks means there are going to be many instruction scopes and many basic block scopes to evaluate before the function scope is evaluated.

6964 total features.
852 basic blocks!
4442 instructions.
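To see why this shape is expensive, here's a back-of-the-envelope model (not capa's actual code; the rules-per-scope counts below are assumptions for illustration) that multiplies scope instances by the rules evaluated at each scope:

```python
# Rough cost model for function 0x420a81: even when each FeatureSet is small,
# evaluating every scope-appropriate rule against every scope instance adds up.
# Instance counts come from the debug output above; the rules-per-scope
# numbers are ASSUMED for illustration, not measured from capa.
instances = {"instruction": 4442, "basic block": 852, "function": 1}
rules_per_scope = {"instruction": 100, "basic block": 50, "function": 150}

total = sum(instances[s] * rules_per_scope[s] for s in instances)
print(total)
```

Under these assumed rule counts, this single function triggers on the order of 487,000 rule evaluations, which would be consistent with the 1.57s observed above.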

@williballenthin
Collaborator Author

williballenthin commented May 8, 2024

FeatureSet size distributions

From mimikatz using the Ghidra BinExport2 backend. These plots show the distribution of FeatureSet sizes by scope. In summary, instructions usually have fewer than 10 features, basic blocks fewer than 20, and functions fewer than 100.

instruction

[image: histogram of instruction-scope FeatureSet sizes]

basic block

[image: histogram of basic-block-scope FeatureSet sizes]

function

[image: histogram of function-scope FeatureSet sizes]

We can also use this technique to investigate the number of rules selected to be evaluated at each of these scope instances (and then attempt to minimize these numbers).


(notes for future willi)

[image: gnuplot histogram of function-scope FeatureSet sizes]

bat /tmp/1.txt | grep "^perf: match" | grep "FUNC" | choose 4 | sort | uniq -c | awk '{print $2" "$1}' | sort --human-numeric-sort
gnuplot> plot "/tmp/func.txt" using 1:2 with boxes
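The same tally can be done in Python instead of the shell pipeline above. Note the log lines here are hypothetical, reverse-engineered from the grep patterns; the real "perf: match" format may differ:

```python
from collections import Counter

# Hypothetical "perf: match" debug lines; field layout is an assumption.
lines = [
    "perf: match FUNC 0x401000 87",
    "perf: match FUNC 0x402000 12",
    "perf: match FUNC 0x403000 87",
    "perf: match BB   0x401010 5",
]

# count how many function scopes saw each FeatureSet size,
# like the grep | choose | sort | uniq -c pipeline above
sizes = Counter(int(line.split()[-1]) for line in lines if "FUNC" in line)
for size, count in sorted(sizes.items()):
    print(size, count)
```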

@williballenthin

This comment was marked as outdated.

@williballenthin

This comment was marked as outdated.

@fariss

This comment was marked as outdated.

@mr-tz mr-tz added gsoc Work related to Google Summer of Code project. performance Related to capa's performance labels May 22, 2024
@mr-tz mr-tz moved this to In progress in @s-ff GSoC 2024 May 22, 2024
williballenthin added a commit that referenced this issue Jun 6, 2024
Implement the "tighten rule pre-selection" algorithm described here:
#2063 (comment)

In summary:

> Rather than indexing all features from all rules,
> we should pick and index the minimal set (ideally, one) of
> features from each rule that must be present for the rule to match.
> When we have multiple candidates, pick the feature that is
> probably most uncommon and therefore "selective".

This seems to work pretty well. Total evaluations when running against
mimikatz drop from 19M to 1.1M (wow!) and capa seems to match around
3x more functions per second (wow wow).

When doing large scale runs, capa is about 25% faster when using the
vivisect backend (analysis heavy) or 3x faster when using the
upcoming BinExport2 backend (minimal analysis).
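The pre-selection idea can be sketched as follows. This is a simplified illustration, not capa's implementation: real rules are boolean logic trees, so extracting a "must match" feature is more involved than the flat feature tuples used here, and the rule and feature names are invented:

```python
from collections import Counter

# Each rule maps to features that must ALL be present for it to match
# (a simplification; names are illustrative).
rules = {
    "r1": ("api(CreateFile)", "mnemonic(xor)"),
    "r2": ("api(VirtualAlloc)", "mnemonic(xor)"),
    "r3": ("api(CreateFile)", "number(0x40)"),
}

# how common each feature is across all rules: rarer means more selective
frequency = Counter(f for feats in rules.values() for f in feats)

# index each rule under its rarest required feature
index = {}
for name, feats in rules.items():
    selector = min(feats, key=lambda f: frequency[f])
    index.setdefault(selector, []).append(name)

def candidate_rules(feature_set):
    # only rules whose selector feature is present need full evaluation;
    # all other rules are skipped without touching their logic trees
    return [r for f in feature_set & index.keys() for r in index[f]]

print(candidate_rules({"mnemonic(xor)", "api(VirtualAlloc)"}))
```

With such an index, a feature set containing only common features selects few candidate rules for full evaluation, which is the mechanism behind the drop from 19M to 1.1M total evaluations described above.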
williballenthin added a commit that referenced this issue Jun 7, 2024
@williballenthin
Collaborator Author

Candidate enhancements have been broken out into their own issues. Closing this thread of investigation.

@github-project-automation github-project-automation bot moved this from In progress to Done in @s-ff GSoC 2024 Jul 23, 2024
@Dextera0007

capa uses a collection of rules to identify capabilities within a program; has anyone ever experienced capa timing out? Also, is there a configuration setting to invoke only the rules that match the file's features?

@williballenthin
Collaborator Author

@Dextera0007 would you mind creating a new issue so that we can have a separate thread of conversation?
