tighten rule pre-selection #2080
@property
def file_rules(self):
    return self.rules_by_scope[Scope.FILE]

@property
def process_rules(self):
    return self.rules_by_scope[Scope.PROCESS]

@property
def thread_rules(self):
    return self.rules_by_scope[Scope.THREAD]

@property
def call_rules(self):
    return self.rules_by_scope[Scope.CALL]

@property
def function_rules(self):
    return self.rules_by_scope[Scope.FUNCTION]

@property
def basic_block_rules(self):
    return self.rules_by_scope[Scope.BASIC_BLOCK]

@property
def instruction_rules(self):
    return self.rules_by_scope[Scope.INSTRUCTION]
These properties are kept for backwards compatibility. In a major version release, we can probably remove them in favor of rules_by_scope.
nice work, we should do extensive tests comparing the results before and after to ensure everything works as expected. the speedup looks promising!
I plan to run this implementation side by side with the |
CHANGELOG updated or no update needed, thanks! 😄
…to perf-rule-pre-selection
string_features = [
    feature
    for feature in features
    if isinstance(feature, (capa.features.common.Substring, capa.features.common.Regex))
]
bytes_features = [feature for feature in features if isinstance(feature, capa.features.common.Bytes)]
hashable_features = [
    feature
    for feature in features
    if not isinstance(
        feature, (capa.features.common.Substring, capa.features.common.Regex, capa.features.common.Bytes)
    )
]
Can this be optimized? We're looping and calling isinstance on every feature three times.
let me try and then run some benchmarks. I agree it looks wasteful, but I'm not sure if it has a real world effect.
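For reference, a single-pass version of the partition could look like the sketch below. The `Feature`, `Substring`, `Regex`, `Bytes`, and `Number` classes are hypothetical stand-ins for the `capa.features.common` types so the snippet runs on its own, and `partition_features` is an illustrative name, not something from this PR.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for the capa.features.common classes (assumption),
# defined here only so the sketch is self-contained.
@dataclass(frozen=True)
class Feature:
    value: object


class Substring(Feature): ...
class Regex(Feature): ...
class Bytes(Feature): ...
class Number(Feature): ...


def partition_features(features):
    """Split features into (string, bytes, hashable) lists in one pass."""
    string_features, bytes_features, hashable_features = [], [], []
    for feature in features:
        # At most two isinstance calls per feature, and a single loop.
        if isinstance(feature, (Substring, Regex)):
            string_features.append(feature)
        elif isinstance(feature, Bytes):
            bytes_features.append(feature)
        else:
            hashable_features.append(feature)
    return string_features, bytes_features, hashable_features
```

Whether this is measurably faster than three comprehensions would still need the benchmark mentioned above; comprehensions are well optimized in CPython, so the difference may be small.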
This looks great @williballenthin - I'm pumped about the improved efficiency. The logic and code that you've implemented here appears sound. Let's get this merged pending successful paranoid invocation across a wide range of samples
Yes let's rebase on master so we can get this to our users ASAP
amazing work! I've noted a few minor things, and the paranoid run will provide a lot of value
# We may want to try to pre-evaluate these strings, based on their presence in the file,
# to reduce the number of evaluations we do here.
# See: https://github.com/mandiant/capa/issues/2063#issuecomment-2095639672
#
# We may also want to specialize case-insensitive strings, which would enable them to
# be indexed, and therefore skip the scanning here, improving performance.
# This strategy is described here:
# https://github.com/mandiant/capa/issues/2063#issuecomment-2107083068
add TODOs for these notes?
yeah, and i'll spin off the original issue comments into dedicated issues we can use to track the idea.
Co-authored-by: Moritz <mr-tz@users.noreply.github.com>
Very good improvements, I just have a question below.
Unrelated to this PR, I think we can replace
Line 459 in 960ee86
b = codecs.decode(s.replace(" ", "").encode("ascii"), "hex")
with:
b = bytes.fromhex(s)
https://docs.python.org/3/library/stdtypes.html#bytes.fromhex
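A quick check that the two spellings agree on typical input, as a minimal sketch. The sample string `s` is made up for illustration; the calls themselves are standard library.

```python
import codecs

# Made-up sample input: hex byte pairs separated by spaces.
s = "01 02 0A ff"

# Current approach quoted above: strip spaces, then hex-decode.
via_codecs = codecs.decode(s.replace(" ", "").encode("ascii"), "hex")

# Proposed approach: bytes.fromhex skips spaces between byte pairs on its own.
via_fromhex = bytes.fromhex(s)

assert via_codecs == via_fromhex == b"\x01\x02\x0a\xff"
```

One difference to keep in mind: `bytes.fromhex` only skips whitespace between byte pairs, so an input like `"0 1"` raises `ValueError`, whereas the `replace`-based version would join it into `"01"` and decode it.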
paranoid linting succeeded!
So, this improves the performance of capa (with the vivisect backend) by about 30%. When using the BinExport2 backend, I think the performance improvement will be closer to 2-3x, since less time is spent doing analysis.
awesome, big performance improvement!
new PR that's rebased against master: #2125
closes #2074
ref #2063, particularly "tighten rule pre-selection" and "lots of time spent in instancecheck"
Stacked on #1950, so I've marked this as a PR onto that branch so the diff is sensible. I think we can probably rebase onto master, though, if necessary.
This PR implements the "tighten rule pre-selection" algorithm described here: #2063 (comment) . In summary:
This seems to work pretty well. Total evaluations when running against mimikatz drop from 19M to 1.1M (wow!) and capa seems to match around 3x more functions per second (wow wow). I did not expect such a good result - in fact, although the capa matches seem to be the same, I still wonder if something is broken 🤔. More tests needed.
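To give a rough intuition for why pre-selection cuts the evaluation count so sharply, here is a toy sketch of the general idea: index rules by the features they mention, then evaluate only rules that share a feature with the current scope. It is not the algorithm from the linked comment or capa's implementation. The rule names, feature strings, and the simplification that a rule is just a set of alternative features are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy rules: each rule is modeled as an OR over a set of features.
# Real capa rules are boolean trees; this simplification is an assumption.
RULES = {
    "create file": {"api(CreateFileA)", "api(CreateFileW)"},
    "open registry key": {"api(RegOpenKeyExA)"},
    "xor loop": {"characteristic(nzxor)"},
}


def build_index(rules):
    """Map each feature to the names of the rules that mention it."""
    index = defaultdict(set)
    for name, required in rules.items():
        for feature in required:
            index[feature].add(name)
    return index


def candidate_rules(index, features):
    """Return only the rules that share at least one feature with the scope."""
    candidates = set()
    for feature in features:
        candidates |= index.get(feature, set())
    return candidates


index = build_index(RULES)
found = {"api(CreateFileW)", "number(0x10)"}
print(sorted(candidate_rules(index, found)))  # prints ['create file']
```

In this toy model only one of the three rules is evaluated for the given scope; the other two are skipped with a dictionary lookup instead of a full evaluation.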
TODO: