Heuristics and their explanation

Diaphora uses, as of 3th January 2018, a total of 51 heuristics. Find below the explanation of each one in the specific order they are executed (as the order really matters).

Default set of heuristics

Equal database

Not strictly a heuristic, but used nevertheless. If all functions' attributes are equal, both databases are considered equal.

Same RVA and hash

The Relative Virtual Address and the function's bytes hash (a MD5 hash of all the bytes of the function) is the same in both databases. This is considered a very good match.

Match quality: Very good. Performance: Very good.

Same order and hash

The order of the function in both databases and the function's bytes hash (a MD5 hash of all the bytes of the function) is the same.

Match quality: Very good. Performance: Very good.

Function hash

The function's hash (a MD5 hash of the non relative bytes) is the same in both databases.

Match quality: Very good. Performance: Very good.

Bytes hash and names

The function's bytes hash (a MD5 hash of all the bytes of the function) and the import functions, global variable names, etc... are the same in both databases.

Match quality: Very good. Performance: Very good.

Bytes hash

The function's bytes hash (a MD5 hash of all the bytes of the function) is the same in both databases.

Match quality: Very good. Performance: Very good.

Same name

The function names in both databases are the same. This is considered a perfect match.

Match quality: Good. Performance: Very good.

Equal assembly or pseudo-code

The assembly and/or pseudo-code is the same (for non trivial functions) in both databases.

Match quality: Very good. Performance: Very good.

Same cleaned up assembly or pseudo-code

The cleaned-up version of the assembly and/or the pseudo-code is the same in both databases. It will ignore sub_XXX and other similar IDA's autogenerated names.

Match quality: Very good. Performance: Very good.

Same address, nodes, edges and mnemonics

The Relative Virtual Address, the number of nodes, edges and the instructions' mnemonics in the exact same order are the same in both databases.

Match quality: Very good. Performance: Very good.

Same rare MD Index

The MD-Index of the functions is the same and is rare (1 or 2 appearances) in both databases.

Match quality: Good. Performance: Very good.

Same MD Index and constants

The MD-Index and the constants used in the function are the same.

Match quality: Good. Performance: Very good.

All or most attributes

All or most function attributes (graph and non graph related ones) are the same.

Match quality: Good. Performance: Good.

Same constants

The constants used by both functions and their order is the same in both databases.

Match quality: Good. Performance: Good.

Same address, nodes, edges and primes (re-ordered instructions)

The number of instructions, the RVA, the number of nodes, edges and the primes product generated by multiplying primes assigned to each different instruction of the whole architecture's instruction set are the same.

Match quality: Good. Performance: Good.

Import names hash

The constants, global variables, their order, as well as the MD-Index is the same in both databases.

Match quality: Good. Performance: Good.

Nodes, edges, complexity, mnemonics, names, prototype, in-degree and out-degree

The number of nodes, edges, the cyclomatic complexity (naturally), the mnemonics (in the actual order they appear), the names (constants and global variable names in the actual order they appear), the function's prototype and the in/out degrees are the same.

Match quality: "Good". Performance: Good.

Mnemonics and names

The functions mnemonics, constants and global variable names, all of them in the exact same order they appear, are the same in both databases.

Match quality: "Good". Performance: Good.

Mnemonics small-primes-product

The Small-Primes-Product of the functions mnemonics are the same in both databases.

Match quality: Good. Performance: Good.

Small names difference

Match functions by calculating the edit distance of "names" (constants and global variable names) for similar graphs.

Match quality: "Good", when it happens to find anything... Performance: "Good".

Pseudo-code fuzzy hash

The main fuzzy hash generated by DeepToad is the same for both functions.

Match quality: Good. Performance: Good.

Similar pseudo-code and names

The pseudo-code for both functions is similar and the constants and global variable names are the same in both databases.

Match quality: Good. Performance: Good.

Topological sort hash

The topological sort (using the Tarjan's algorithm) is the same for both databases.

Match quality: Good. Performance: Good.

Same high complexity, prototype and names

The cyclomatic complexity is the same and high for both functions, and the prototypes and constants and global variable names the same in both databases.

Match quality: Good. Performance: Good.

Strongly connected components

The Small-Primes-Product for the strongly connected components of the function's flow graph for functions with more than 10 basic blocks is the same.

Match quality: "Good". Performance: Good.

Strongly connected components SPP and names

The Small-Primes-Product for the strongly connected components of the function's flow graph for functions with more than 10 basic blocks is the same and the constants and global variable names are the same.

Match quality: "Good". Performance: Good.

Callgraph matches

This heuristic finds, recursively, the callers and callees of the previously discovered "Best" and "Partial" matches.

Match quality: Good. Performance: Slow.

Slow heuristics

Switch structures

The switch instructions in both functions share the same number of cases and cases values.

Match quality: Good. Performance: Slow.

Pseudo-code fuzzy hashes

At least one of the 3 fuzzy hashes calculated by DeepToad are the same.

Match quality: "Good". Performance: Slow.

Similar pseudo-code

The pseudo-code for both functions is similar.

Match quality: "Good". Performance: Slow.

Pseudo-code fuzzy AST hash

The pseudo-code's fuzzy AST (Abstract Syntax Tree) hash is the same for both databases.

Match quality: Good. Performance: Slow.

Partial pseudo-code fuzzy hash

The first 16 bytes of any of the DeepToad generated fuzzy hashes is the same.

Match quality: "Good". Performance: Slow.

Same high complexity and names

The cyclomatic complexity is the same and high for both functions, and the constants and global variable names the same in both databases.

Match quality: "Good". Performance: Slow.

Strongly connected components

The Small-Primes-Product for the strongly connected components of the function's flow graph is the same.

Match quality: "Good". Performance: Slow.

Loop count

The number of loops for functions with at least 4 basic blocks and more than one single basic blocks is the same.

Match quality: Poor. Performance: Slow.

Experimental heuristics

NOTE: Most of these heuristics were written somewhere around 2015. They made sense at that time, on my mind. They will likely be removed in the future.

Same nodes, edges and strongly connected components

The number of nodes, edges and the strongly connected components is the same for both functions.

Match quality: Poor. Performance: Good.

Similar small pseudo-code

The pseudo-code for both functions is the same, but it matches even too small functions.

Match quality: Poor. Performance: Slow.

Small pseudo-code fuzzy AST hash

Same as the "Pseudo-code Fuzzy AST" hash but applied to functions of any size.

Match quality: Poor. Performance: Slow.

Equal small pseudo-code

The non cleaned-up pseudo-code for small functions is the same.

Match quality: Very poor. Performance: Slow.

Same low complexity, prototype and names

The low cyclomatic complexity is the same, as well as the function's prototype and the constants and global variable names.

Match quality: Poor. Performance: Good.

Same low complexity and names

The low cyclomatic complexity is the same, as well as the constants and global variable names.

Match quality: Very poor. Performance: Good.

Same graph

The value for the number of nodes, edges, in-degree, out-degree, cyclomatic complexity, strongly connected components, loops, Tarjan's algorithm calculated topologic sort, strongly connected components Small-Primes-Product for functions with more than 5 basic blocks, is the same.

Match quality: Good. Performance: Very slow.

NOTES: For large databases (>25k functions) it might fail, for a reason, with the following error: OperationalError: database or disk is full.

Brute-forcing

Find good matches (ratio > 5.0) for every non matched function in both databases.

Match quality: Good. Performance: Slow.

Unreliable heuristics

Bytes sum

The summatory of the function's bytes is the same in both databases.

Match quality: Unrealible. Performance: Good.

Strongly connected components

The strongly connected components of both functions are the same.

Match quality: Unrealible. Performance: Slow.

Loop count

The number of loops for any function with at least one is the same.

Match quality: Unreliable. Performance: Slow.

Strongly connected components SPP and names

Same as "Strongly connected components SPP and names" but applied to functions of any size.

Match quality: Unreliable. Performance: Good.

Heuristics and their explanation

Default set of heuristics

Equal database

Same RVA and hash

Same order and hash

Function hash

Bytes hash and names

Bytes hash

Same name

Equal assembly or pseudo-code

Same cleaned up assembly or pseudo-code

Same address, nodes, edges and mnemonics

Same rare MD Index

Same MD Index and constants

All or most attributes

Same constants

Same address, nodes, edges and primes (re-ordered instructions)

Import names hash

Nodes, edges, complexity, mnemonics, names, prototype, in-degree and out-degree

Mnemonics and names

Mnemonics small-primes-product

Small names difference

Pseudo-code fuzzy hash

Similar pseudo-code and names

Topological sort hash

Same high complexity, prototype and names

Strongly connected components

Strongly connected components SPP and names

Callgraph matches

Slow heuristics

Switch structures

Pseudo-code fuzzy hashes

Similar pseudo-code

Pseudo-code fuzzy AST hash

Partial pseudo-code fuzzy hash

Same high complexity and names

Strongly connected components

Loop count

Experimental heuristics

Same nodes, edges and strongly connected components

Similar small pseudo-code

Small pseudo-code fuzzy AST hash

Equal small pseudo-code

Same low complexity, prototype and names

Same low complexity and names

Same graph

Brute-forcing

Unreliable heuristics

Bytes sum

Strongly connected components

Loop count

Strongly connected components SPP and names

Clone this wiki locally