Internals
libsigscan was developed to provide a highly optimized way of scanning for binary signatures, e.g. those used in the headers of file formats. Initially the ideas behind libsigscan were implemented in Python and used in the dfVFS and Plaso projects. Since Python was not the most suitable language for this purpose, it was rewritten as a C library with Python bindings.
In libsigscan terminology (a brief illustrative sketch follows this list):
- pattern: a binary string containing a "pattern" of bytes unique for the (file) format, e.g. "\x89PNG\x0d\x0a\x1a\x0a" for PNG;
- signature: a binary pattern, sometimes with a fixed location, that identifies a specific (file) format;
- bounded signature: a signature that has a fixed location relative to either the start or the end of the (file) data;
- unbounded signature: a signature that has no fixed location and can appear anywhere in the (file) data.
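To make the terminology concrete, the sketch below expresses the PNG example above as a bounded signature and a generic unbounded signature as plain Python data. The field names and the unbounded pattern bytes are illustrative assumptions, not the libsigscan API.

```python
# Illustrative data only; the field names are not the libsigscan API.

# A bounded signature: the pattern has a fixed location, here relative
# from the start of the (file) data.
PNG_SIGNATURE = {
    "identifier": "png",
    "pattern": b"\x89PNG\x0d\x0a\x1a\x0a",
    "pattern_offset": 0,
    "relative_from": "start",
}

# An unbounded signature: the pattern can appear anywhere in the (file) data,
# so no offset is defined. The pattern bytes here are made up for illustration.
UNBOUNDED_SIGNATURE = {
    "identifier": "example",
    "pattern": b"\xde\xad\xbe\xef",
    "pattern_offset": None,
    "relative_from": None,
}
```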
There are various approaches to signature scanning; this section discusses several that were involved in the design of libsigscan.
The brute-force signature scanning approach consists of doing a per signature byte-by-byte comparison of the (file) data.
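As a point of reference, a minimal Python sketch of this brute-force approach could look as follows; the function and variable names are illustrative only.

```python
def brute_force_scan(data, signatures):
    """Compares every signature byte-by-byte at every data offset (sketch)."""
    matches = []
    for identifier, pattern in signatures:
        # Slide the pattern over the data one byte at a time.
        for data_offset in range(len(data) - len(pattern) + 1):
            for pattern_index, byte_value in enumerate(pattern):
                if data[data_offset + pattern_index] != byte_value:
                    break
            else:
                matches.append((identifier, data_offset))
    return matches
```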
A string search algorithm more efficient than byte-by-byte comparison is that of Boyer-Moore-Horspool search. It uses a skip value to ignore byte values that do not need to be compared. For very small string sizes the Boyer-Moore-Horspool search offers no advantage over a byte-by-byte comparison.
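A minimal sketch of Boyer-Moore-Horspool for a single pattern, to illustrate the skip table; this is a generic textbook version, not the libsigscan implementation.

```python
def boyer_moore_horspool_search(data, pattern):
    """Returns the offset of the first match of pattern in data, or -1."""
    pattern_length = len(pattern)
    if not pattern_length or pattern_length > len(data):
        return -1

    # Skip table: how far we can shift when the byte aligned with the last
    # pattern position does not lead to a match.
    skip_table = {
        byte_value: pattern_length - index - 1
        for index, byte_value in enumerate(pattern[:-1])}

    data_offset = 0
    while data_offset <= len(data) - pattern_length:
        # Compare from the end of the pattern backwards.
        pattern_index = pattern_length - 1
        while (pattern_index >= 0 and
               data[data_offset + pattern_index] == pattern[pattern_index]):
            pattern_index -= 1
        if pattern_index < 0:
            return data_offset
        # Shift by the skip value of the byte under the last pattern position;
        # byte values not in the pattern allow a full pattern-length skip.
        last_byte = data[data_offset + pattern_length - 1]
        data_offset += skip_table.get(last_byte, pattern_length)
    return -1
```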
The scanning tree is used to scan for all signatures at once. The idea of the scanning tree is similar to that of a suffix tree (or trie as used in regular expression engines).
Let’s take the file header signatures of Windows NT Registry File (REGF) format, the Extensible Storage Engine (ESE) Database File (EDB) format, the Windows Shortcut File (LNK) format and some imaginary format we name XYZ.
Note: For signature scanning we do not necessarily have to restrict ourselves to actual file format signatures, e.g. in this example we include the LNK (file) header size 0x4c (L) as part of the signature.
|       | 0 | 1    | 2    | 3    | 4    | 5    | 6    | 7    | 8    | 9    | 10   | 11   | 12   | 13   | 14   | 15   |
|-------|---|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|
| regf  | r | e    | g    | f    |      |      |      |      |      |      |      |      |      |      |      |      |
| esedb |   |      |      |      | 0xef | 0xcd | 0xab | 0x89 |      |      |      |      |      |      |      |      |
| lnk   | L | 0x00 | 0x00 | 0x00 | 0x01 | 0x14 | 0x02 | 0x00 | 0x00 | 0x00 | 0x00 | 0x00 | 0xc0 | 0x00 | 0x00 | 0x00 |
| XYZ   |   |      |      | f    |      |      | 0x02 | 0x89 |      |      |      |      |      | 0x00 |      |      |
Note that XYZ contains gaps in its pattern; this is primarily for the purpose of this example and is assumed to occur only rarely in actual (file) formats.
There are multiple aspects of the signatures we could use in scanning:
- the signature offset;
- the signature pattern;
- the likelihood of a byte value in the pattern to appear in the (file) data.
The relevance of a pattern byte value and offset can be determined using the following weights (a sketch of how such weights could be computed follows the list):
- Similarity weight: a weight that indicates how often a specific pattern byte value at a specific pattern offset is shared across the signatures;
- Occurrence weight: a weight that indicates how often a specific pattern offset is shared across the signatures;
- Byte value weight: a weight that indicates how "common" a byte value is likely to be, e.g. byte values used in text [A-Za-z0-9 \t\n\r] or byte values like 0x00, 0x01 and 0xff are considered more "common" and thus get a lower weight.
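The sketch below shows one possible way to compute weights in this spirit for the example signatures; the exact calculation and scaling that libsigscan applies may differ from the values shown in the table further down.

```python
from collections import Counter, defaultdict

# The example signatures from the table above, as pattern offset -> byte value.
EXAMPLE_SIGNATURES = {
    'regf': {0: ord('r'), 1: ord('e'), 2: ord('g'), 3: ord('f')},
    'esedb': {4: 0xef, 5: 0xcd, 6: 0xab, 7: 0x89},
    'lnk': dict(enumerate(
        b'L\x00\x00\x00\x01\x14\x02\x00\x00\x00\x00\x00\xc0\x00\x00\x00')),
    'XYZ': {3: ord('f'), 6: 0x02, 7: 0x89, 13: 0x00},
}

# Byte values considered "common" in (file) data get a lower byte value weight.
COMMON_BYTE_VALUES = set(
    b'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 \t\n\r')
COMMON_BYTE_VALUES.update((0x00, 0x01, 0xff))


def calculate_weights(signatures):
    """Calculates (similarity, occurrence, byte value) weights per offset."""
    byte_values_per_offset = defaultdict(list)
    for byte_values in signatures.values():
        for pattern_offset, byte_value in byte_values.items():
            byte_values_per_offset[pattern_offset].append(byte_value)

    weights = {}
    for pattern_offset, byte_values in byte_values_per_offset.items():
        # Similarity weight: how many signatures share the most frequent byte
        # value at this offset, or 0 if no value is shared.
        most_common_count = Counter(byte_values).most_common(1)[0][1]
        similarity_weight = most_common_count if most_common_count > 1 else 0

        # Occurrence weight: how many signatures define a byte value at this
        # offset, or 0 if only a single signature does.
        occurrence_weight = len(byte_values) if len(byte_values) > 1 else 0

        # Byte value weight: uncommon byte values are more distinguishing.
        byte_value_weight = sum(
            1 for byte_value in set(byte_values)
            if byte_value not in COMMON_BYTE_VALUES)

        weights[pattern_offset] = (
            similarity_weight, occurrence_weight, byte_value_weight)

    return weights
```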
Based on these weights we could state that the higher the weight of a specific pattern byte value and offset, the more relevant that value is when comparing the (file) data, e.g.:
| Offset | Value(s)         | Similarity weight | Occurrence weight | Byte value weight |
|--------|------------------|-------------------|-------------------|-------------------|
| 0      | r, L             | 0                 | 2                 | 0                 |
| 1      | e, 0x00          | 0                 | 2                 | 0                 |
| 2      | g, 0x00          | 0                 | 2                 | 0                 |
| 3      | f, 0x00, f       | 2                 | 2                 | 0                 |
| 4      | 0xef, 0x01       | 0                 | 2                 | 1                 |
| 5      | 0xcd, 0x14       | 0                 | 2                 | 2                 |
| 6      | 0xab, 0x02, 0x02 | 2                 | 2                 | 2                 |
| 7      | 0x89, 0x00, 0x89 | 2                 | 2                 | 1                 |
| 8      | 0x00             | 0                 | 0                 | 0                 |
| 9      | 0x00             | 0                 | 0                 | 0                 |
| 10     | 0x00             | 0                 | 0                 | 0                 |
| 11     | 0x00             | 0                 | 0                 | 0                 |
| 12     | 0xc0             | 0                 | 0                 | 1                 |
| 13     | 0x00, 0x00       | 2                 | 0                 | 0                 |
| 14     | 0x00             | 0                 | 0                 | 0                 |
| 15     | 0x00             | 0                 | 0                 | 0                 |
Use the similarity weight, occurrence weight, byte value weight and pattern offset, in that order, to determine the highest weight. The pattern byte value and offset with the highest weights are considered the most relevant. In our example this would be the pattern byte values at offset 6.
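Continuing the sketch above, selecting the most relevant pattern offset from such weights could look like this; whether a lower or a higher offset wins a tie is an assumption here (lower is used below).

```python
def most_relevant_pattern_offset(weights):
    """Picks the most relevant pattern offset from the calculated weights."""
    return min(
        weights,
        key=lambda pattern_offset: (
            -weights[pattern_offset][0],   # similarity weight, highest first
            -weights[pattern_offset][1],   # occurrence weight, highest first
            -weights[pattern_offset][2],   # byte value weight, highest first
            pattern_offset))               # then the pattern offset itself
```

For the example signatures this selects pattern offset 6.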
We could now make a "decision tree" based on the weights, e.g. for the pattern byte values at offset 6:
ScanTreeNode {
pattern_offset = 6
byte_values = [
0x02: ['lnk', 'XYZ'],
0xab: ['esedb']
]
no_match = ['regf']
}
Where:
- if the (file) data matches 0x02 we only have to consider lnk and XYZ being present, and can rule out esedb and regf;
- if the (file) data matches 0xab we only have to consider esedb being present;
- if none of the pattern byte values match we only have to consider regf being present.
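Rendered as runnable Python, such a node and the way it narrows down the candidate signatures could look as follows; this mirrors the structure shown above rather than the actual C implementation.

```python
class ScanTreeNode(object):
    def __init__(self, pattern_offset, byte_values, no_match):
        self.pattern_offset = pattern_offset
        # Maps a byte value to the remaining candidates, which can be a list
        # of signature identifiers or a deeper scan tree node.
        self.byte_values = byte_values
        self.no_match = no_match

    def scan(self, data):
        """Returns the remaining candidates for the given (file) data."""
        return self.byte_values.get(data[self.pattern_offset], self.no_match)


# The example node for pattern offset 6.
node = ScanTreeNode(
    pattern_offset=6,
    byte_values={0x02: ['lnk', 'XYZ'], 0xab: ['esedb']},
    no_match=['regf'])

# The LNK file header from the table above has 0x02 at offset 6.
lnk_header = bytes([
    0x4c, 0x00, 0x00, 0x00, 0x01, 0x14, 0x02, 0x00,
    0x00, 0x00, 0x00, 0x00, 0xc0, 0x00, 0x00, 0x00])
print(node.scan(lnk_header))  # ['lnk', 'XYZ']
```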
After building the first node we can split the list of signatures into matching and non-matching signatures.
If there are more than 2 signatures left in the list we repeat the weighing process to determine the most relevant pattern byte value and offset in our new lists.
Note: Signatures that are not covered by nodes in the scan tree so far have to be considered in subsequent byte value checks.
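Building on the earlier sketches (calculate_weights, most_relevant_pattern_offset and ScanTreeNode), a simplified recursive construction of the scan tree could look like this; it omits details such as pattern validation chains and re-checking signatures that are not covered by a node.

```python
def build_scan_tree(signatures):
    """Builds a simplified scan tree.

    signatures: dict of identifier -> dict of pattern offset -> byte value.
    """
    if len(signatures) <= 1:
        # A single remaining signature is checked with a pattern validation
        # chain (see below); represented here just by its identifier.
        return list(signatures)

    weights = calculate_weights(signatures)
    pattern_offset = most_relevant_pattern_offset(weights)

    # Split the signatures by their byte value at the chosen pattern offset.
    byte_values = {}
    no_match = {}
    for identifier, signature_byte_values in signatures.items():
        byte_value = signature_byte_values.get(pattern_offset)
        # The chosen offset is not reused further down this branch.
        remaining = {
            offset: value for offset, value in signature_byte_values.items()
            if offset != pattern_offset}
        if byte_value is None:
            # Not covered by this node; a full implementation also has to
            # consider such signatures in subsequent byte value checks.
            no_match[identifier] = remaining
        else:
            byte_values.setdefault(byte_value, {})[identifier] = remaining

    return ScanTreeNode(
        pattern_offset=pattern_offset,
        byte_values={
            byte_value: build_scan_tree(matching_signatures)
            for byte_value, matching_signatures in byte_values.items()},
        no_match=build_scan_tree(no_match))


# For the example signatures the root node ends up at pattern offset 6.
scan_tree = build_scan_tree(EXAMPLE_SIGNATURES)
```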
If there is 1 signature left we create a pattern validation chain in which the longest possible pattern is checked at once.
ScanTreePattern {
pattern = [0xef, 0xcd, 0xab, 0x89]
pattern_offset = 4
next = None
}
Note: Signatures that contain patterns with gaps should be "chained" together, hence the term pattern validation chain.
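As an illustration, the XYZ signature with its gaps ('f' at offset 3, 0x02 0x89 at offsets 6-7 and 0x00 at offset 13) could be rendered as such a chain; the class below simply mirrors the ScanTreePattern structure above and is not the internal libsigscan representation.

```python
class ScanTreePattern(object):
    def __init__(self, pattern, pattern_offset, next_pattern=None):
        self.pattern = pattern
        self.pattern_offset = pattern_offset
        self.next_pattern = next_pattern

    def matches(self, data):
        """Checks this pattern and any chained patterns against the data."""
        end_offset = self.pattern_offset + len(self.pattern)
        if data[self.pattern_offset:end_offset] != self.pattern:
            return False
        return self.next_pattern is None or self.next_pattern.matches(data)


# The XYZ signature expressed as a pattern validation chain.
xyz_chain = ScanTreePattern(
    pattern=b'f', pattern_offset=3,
    next_pattern=ScanTreePattern(
        pattern=b'\x02\x89', pattern_offset=6,
        next_pattern=ScanTreePattern(
            pattern=b'\x00', pattern_offset=13)))
```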
TODO
In bounded signature scanning only the parts of the (file) data where the signatures are expected to be located are scanned.
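A minimal sketch of this idea, assuming bounded signatures carry their fixed offset and whether that offset is relative from the start or the end of the data; the tuple layout and offset-from-end semantics are assumptions for illustration.

```python
def scan_bounded(data, bounded_signatures):
    """Compares bounded signatures only at their expected locations.

    bounded_signatures: list of (identifier, pattern, pattern_offset,
    relative_from_end) tuples; pattern_offset is treated as the distance from
    the start, or from the end, of the data to the start of the pattern.
    """
    matches = []
    data_size = len(data)
    for identifier, pattern, pattern_offset, relative_from_end in bounded_signatures:
        if relative_from_end:
            data_offset = data_size - pattern_offset
        else:
            data_offset = pattern_offset
        if 0 <= data_offset <= data_size - len(pattern):
            # Only this fixed range of the (file) data is read and compared.
            if data[data_offset:data_offset + len(pattern)] == pattern:
                matches.append(identifier)
    return matches
```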