-
-
Notifications
You must be signed in to change notification settings - Fork 373
Heuristics and their explanation
Diaphora uses, as of 3th January 2018, a total of 51 heuristics. Find below the explanation of each one in the specific order they are executed (as the order really matters).
Not strictly a heuristic, but used nevertheless. If all functions' attributes are equal, both databases are considered equal.
The Relative Virtual Address and the function's bytes hash (a MD5 hash of all the bytes of the function) is the same in both databases. This is considered a very good match.
Match quality: Very good. Performance: Very good.
The order of the function in both databases and the function's bytes hash (a MD5 hash of all the bytes of the function) is the same.
Match quality: Very good. Performance: Very good.
The function's hash (a MD5 hash of the non relative bytes) is the same in both databases.
Match quality: Very good. Performance: Very good.
The function's bytes hash (a MD5 hash of all the bytes of the function) and the import functions, global variable names, etc... are the same in both databases.
Match quality: Very good. Performance: Very good.
The function's bytes hash (a MD5 hash of all the bytes of the function) is the same in both databases.
Match quality: Very good. Performance: Very good.
The function names in both databases are the same. This is considered a perfect match.
Match quality: Good. Performance: Very good.
The assembly and/or pseudo-code is the same (for non trivial functions) in both databases.
Match quality: Very good. Performance: Very good.
The cleaned-up version of the assembly and/or the pseudo-code is the same in both databases. It will ignore sub_XXX and other similar IDA's autogenerated names.
Match quality: Very good. Performance: Very good.
The Relative Virtual Address, the number of nodes, edges and the instructions' mnemonics in the exact same order are the same in both databases.
Match quality: Very good. Performance: Very good.
The MD-Index of the functions is the same and is rare (1 or 2 appearances) in both databases.
Match quality: Good. Performance: Very good.
The MD-Index and the constants used in the function are the same.
Match quality: Good. Performance: Very good.
All or most function attributes (graph and non graph related ones) are the same.
Match quality: Good. Performance: Good.
The constants used by both functions and their order is the same in both databases.
Match quality: Good. Performance: Good.
The number of instructions, the RVA, the number of nodes, edges and the primes product generated by multiplying primes assigned to each different instruction of the whole architecture's instruction set are the same.
Match quality: Good. Performance: Good.
The constants, global variables, their order, as well as the MD-Index is the same in both databases.
Match quality: Good. Performance: Good.
The number of nodes, edges, the cyclomatic complexity (naturally), the mnemonics (in the actual order they appear), the names (constants and global variable names in the actual order they appear), the function's prototype and the in/out degrees are the same.
Match quality: "Good". Performance: Good.
The functions mnemonics, constants and global variable names, all of them in the exact same order they appear, are the same in both databases.
Match quality: "Good". Performance: Good.
The Small-Primes-Product of the functions mnemonics are the same in both databases.
Match quality: Good. Performance: Good.
Match functions by calculating the edit distance of "names" (constants and global variable names) for similar graphs.
Match quality: "Good", when it happens to find anything... Performance: "Good".
The main fuzzy hash generated by DeepToad is the same for both functions.
Match quality: Good. Performance: Good.
The pseudo-code for both functions is similar and the constants and global variable names are the same in both databases.
Match quality: Good. Performance: Good.
The topological sort (using the Tarjan's algorithm) is the same for both databases.
Match quality: Good. Performance: Good.
The cyclomatic complexity is the same and high for both functions, and the prototypes and constants and global variable names the same in both databases.
Match quality: Good. Performance: Good.
The Small-Primes-Product for the strongly connected components of the function's flow graph for functions with more than 10 basic blocks is the same.
Match quality: "Good". Performance: Good.
The Small-Primes-Product for the strongly connected components of the function's flow graph for functions with more than 10 basic blocks is the same and the constants and global variable names are the same.
Match quality: "Good". Performance: Good.
This heuristic finds, recursively, the callers and callees of the previously discovered "Best" and "Partial" matches.
Match quality: Good. Performance: Slow.
The switch instructions in both functions share the same number of cases and cases values.
Match quality: Good. Performance: Slow.
At least one of the 3 fuzzy hashes calculated by DeepToad are the same.
Match quality: "Good". Performance: Slow.
The pseudo-code for both functions is similar.
Match quality: "Good". Performance: Slow.
The pseudo-code's fuzzy AST (Abstract Syntax Tree) hash is the same for both databases.
Match quality: Good. Performance: Slow.
The first 16 bytes of any of the DeepToad generated fuzzy hashes is the same.
Match quality: "Good". Performance: Slow.
The cyclomatic complexity is the same and high for both functions, and the constants and global variable names the same in both databases.
Match quality: "Good". Performance: Slow.
The Small-Primes-Product for the strongly connected components of the function's flow graph is the same.
Match quality: "Good". Performance: Slow.
The number of loops for functions with at least 4 basic blocks and more than one single basic blocks is the same.
Match quality: Poor. Performance: Slow.
NOTE: Most of these heuristics were written somewhere around 2015. They made sense at that time, on my mind. They will likely be removed in the future.
The number of nodes, edges and the strongly connected components is the same for both functions.
Match quality: Poor. Performance: Good.
The pseudo-code for both functions is the same, but it matches even too small functions.
Match quality: Poor. Performance: Slow.
Same as the "Pseudo-code Fuzzy AST" hash but applied to functions of any size.
Match quality: Poor. Performance: Slow.
The non cleaned-up pseudo-code for small functions is the same.
Match quality: Very poor. Performance: Slow.
The low cyclomatic complexity is the same, as well as the function's prototype and the constants and global variable names.
Match quality: Poor. Performance: Good.
The low cyclomatic complexity is the same, as well as the constants and global variable names.
Match quality: Very poor. Performance: Good.
The value for the number of nodes, edges, in-degree, out-degree, cyclomatic complexity, strongly connected components, loops, Tarjan's algorithm calculated topologic sort, strongly connected components Small-Primes-Product for functions with more than 5 basic blocks, is the same.
Match quality: Good. Performance: Very slow.
NOTES: For large databases (>25k functions) it might fail, for a reason, with the following error: OperationalError: database or disk is full
.
Find good matches (ratio > 5.0) for every non matched function in both databases.
Match quality: Good. Performance: Slow.
The summatory of the function's bytes is the same in both databases.
Match quality: Unrealible. Performance: Good.
The strongly connected components of both functions are the same.
Match quality: Unrealible. Performance: Slow.
The number of loops for any function with at least one is the same.
Match quality: Unreliable. Performance: Slow.
Same as "Strongly connected components SPP and names" but applied to functions of any size.
Match quality: Unreliable. Performance: Good.