An algorithm to calculate malware phylogeny, an example and counterexample

Source code for malware can easily be had by googling, searching on pastebin or running a WordPress honey pot. Much of it is interpreted languages, like PHP or Perl.

I will use "Web Shell by oRb" (a.k.a "WSO", "FilesMan") and the omnipresent IRC Perl bot (Perl/Shellbot) as examples.

"Web Shell by oRb" (also called "WSO", "FilesMan" and "AntiChat shell") is by far the most popular web shell downloaded to hacked CMS web sites today. It's often reformatted, obfuscated or otherwise slightly customized before hiding it in a WordPress installation. Many distinct versions exist, including one under active development.

The omnipresent IRC Perl bot is referred to by antivirus firms as "Perl/Shellbot" or some form of that indicating a derivative. It's a piece of Perl code that signs in to specific IRC servers and channels, and listens for commands from its botherder. A typical variant can be had from Pastebin.

Casual reading of several variants of either WSO or Perl/Shellbot will convince you that the variants are "related". It seems that one variant "evolved" into another via a process of spammers making small changes to suit their needs. Perhaps a selection of the fittest happens when one script kiddy copys the succssful code of another script kiddy. Collection and comparison of variants should allow the construction of a "family tree" of variants. This project is a description and demonstration of one method of constructing the family tree, or phylogeny, of WSO, and less succesfully, of Perl/Shellbot.

Source code malware evolution

Ordinary biological evolution analogies are ill-suited to reconstructing malware phylogenies. Biological phylogenies assume binary splits between two clades, and assume that we know the sister clades, but can't identify the the common ancestor. Malware can split many times from a single, identifiable common ancester. Biological phylogenies assume a single common ancestor. Malware often, maybe almost always, has more than a single ancestor.

Souce code malware seems to evolve by feature addition or subtraction. People modifiying malware add small chunks of code, code that might phone home with URL, user ID and password, or just evaluate some code when executed with a special CGI-BIN parameter. Sometimes entire functions or subsystems are transplanted from another program.

When the "source code" (DNA) of biological families is available, it rarely shows evolution by adding or subtracting entire "features". Biological evolution appears to happen by gradual changes over large extents of a clade's DNA. The closest biological analogy of the way source code malware evolves would seem to be horizontal gene transfer.

Biology-influenced phylogenies

Traditional biology phylogenies are sometimes calculated with a "distance" metric, and tree construction algorithms like Neighbor Joining. This section illustrates such a calculation.

Normalized Compression Distance

I used the NCD algorithm from Common Pitfalls Using The Normalized Compression Distance it seems to give closer-to-zero "distance" of a piece of code to itself. The code is in distance_calc.php

NCD(x,y) = max(C(xy) - C(x), C(yx) - C(y)}/max(C(x), C(y))

NCD(x,y) is the normalized compression distance from string x to string y, max(a, b) is the numerical maximum of numbers a and b, and C(e) is the size in bytes of compressed string e. xy is string x with string y appended. I used PHP's bzcompress() function. File distance_calc.php implements the NCD calculations.

The following phylogeny is for a group of versions of "Web Shell by oRb" (WSO), described futher below, except for lfiscanner.php and fx29, which are secondary ancestors of some versions of WSO. All WSO versions are files in exemplars/, PHP source code, normalized by pretty printing, and removing ASCII carriage return characters. I did an NCD calculation on all files in exemplars. Using the neighbor joining algorithm, I calculated the following phylogeny. You pick an "outgroup" to get a rooted tree from neighbor joining. I chose lfiscanner.php.

Horizontal lengths of branches are proportional to the normalized compression distance separating source files. You can re-generate the above image by invoking make nj_phylogeny.png. The file makefile describes how to rebuild nj_phylogeny.png from scratch.

NCD and neighbor joining get some parts of the phylogeny correct. Although I chose lfiscanner as the out group, it correctly found the fx29 web shell as the next most distantly-related piece of code. It gets the basic split between WSO 2.4 and WSO 2.5 clades, but mixes version 2.8 and 3.1 in the 2.4-based clade - version 2.8 is clearly a minor hack of version 2.5.1, differing only by yet another phone home. Version 3.1 descends from 2.5 by the verbatim incorporation of a local-file-include scanner from lfiscanner.php.

The assumptions inherent in neighbor joining make version 2.4 into an outgroup of a clade. In addition to comprising the immediate ancestor of a number of species, version 2.4 should appear as the ancestor of a linear progression of 2.5 to 2.5.1 to 2.8. The entire issue of multiple ancestry doesn't appear at all. fx29 and lfiscanner.php should show up as auxilliary ancestors to multiple variants that have a WSO variant as a primary ancestor.

All told, this phylogeny is roughly correct, but has too many flaws to be more than a general guide.

Algorithm

Because malware evolves by feature addition and subtraction, and not by changes in genomes, I tried to use the pieces of source code that differ between species, and the pieces of source code that remain common between all members of a clade to derive a phylogeny. I explictly allow for multiple ancestry.

Determine traits

We need to distinguish between feature and trait for the following. I use "feature" to mean some distinguishable action the source code carries out, for example storing application state in an HTTP cookie. I use "trait" for the code fragment chosen to represent this feature: function WSOsetcookie.

Determining which traits to use is important, and not amenable to automation. The algorithm needs at least one trait (line or lines of source code) common between an ancestor and all of its desendants. It will calculate multiple independent trees if this condition is not met.

The algorithm needs each pair of species to differ in at least one feature, and descendants must differ from immediate ancestors by at least one feature.

Description

Data Structure

Species are represented by instances of a data structure that can contain a set of traits, and allow each trait to be marked. Sets of traits in two species must be comparable to determine a subset relationship. The data structure must be able to keep references to immediate descendant species. The data structure must be capable of being marked to indicate that it has at least one ancestor. This project has such a data structure in file phylogeny/Species.php

Details of algorithm

Start with an empty set of root species, and a separate empty list of species.
Order species in a list by decreasing number of traits.
Remove species with largest number of traits from ordered species list. This species becomes the current species.
Compare current species featuries to next species' traits in decreasing order of number of traits.
When a species on the list has traits that are a strict subset of current species' traits list, and contain a trait previously unmarked in the current species, assign current species as child of other species.
Mark all traits of the current species that are common to the new parent.
If more than one trait is unmarked, repeat from (3)
If no more than one trait is unmarked, go to (2).
If you have compared traits of the current species to all species left on the list, and the current species hasn't been added to any species as a child, add the current species to the set of root species. Go to (2)
When the ordered list is empty, stop.
Starting from the set of root species created in step (8), use each root node's children to draw the resulting phylogeny.

I implemented the algorithm in file phylogeny/p2.php in this repository.

Performance

The algorithm above always terminates. Every species gets popped off ordered list of species, and there's only a finite number of species on that ordered list. There's only a finite number of comparisons to other species made for each current species.

Worst case

Consider a phylogenetic tree with one root, many children, and no grandchildren. Only one trait differs between any two child species.
The first of the N species gets compared to N - 1 child species before being assigned to the root species. The second species gets compared to N - 2 species, the third species gets compared to N - 3 species, and so forth. The number of species-to-species comparisons is:

(N-1)+(N-2)+(N-3)+...+(N - (N-2))(N - (N-1))

This is just N things taken two at a time, which is: N!/(2!(N - 2)!) This formula reduces to N(N-1)/2 or (N² - N)/2. Thus the algorithm takes O(N²) in the worst case.

Effect of ill-chosen traits

Two species without at least one trait to differentiate them show up in a line of descent. The order is dependent on the sort algorithm used in step (1) above. Correctly chosing traits could cause them to show up as sister species.

Using a trait that's not in the real ancester for one of two sister species can lead to inversions of descent.

Mistaking a primitive trait as a trait added later can cause false sister clades, when in reality, one species of one of the clades is the ancestor of the other clade.

Example: Web Shell by oRb

Description

WSO is a "web shell", a single PHP file, invoked by a web browser, that can perform system administration tasks. It can examine system configuration, manipulate files, upload or download files. It also has "hacker" oriented features, like MySQL and Postgres database manipulation, remote code execution, password guessing and string format manipulation.

One of the interesting things about WSO code is that the original author chose to include a version string, "2.0" or "2.5" or "3.1". This gives a natural trait to do initial categorization of files into versions. The code versions 2.0 through 2.5 looks like a diligent developer wrote it. Each version has bugs fixed and features added over the version before. The versions other than 2.0 through 2.5 are quite different. They don't include bug fixes, but rather include whole chunks of code. These chunks range from quietly emailing home whenever that version is run, to "mass defacing" or "local file include" functions. No bugs get fixed, but the version numbering miraculously stays independent, in that there aren't two clades of WSO both version numbered 2.8, for example. Evolution by feature addition is obviously the mode of development outside of the version 2.0 through 2.5 progression. Version 2.5 is the only code I have confidence of possessing the original.

Above is a phylogeny for WSO versions using the algorithm in this project.

You should be able to reproduce the phylogeny above by cloning this repository, and issuing a make wso_phylogeny.png command.

Compare this phylogeny to the one calculated with NCD and neighbor joining above. Version 2.4 constitutes the ancestor of many versions. The waw version is revealed as a modification of version 2.1, rather than as a sister clade of all other versions.

My selection and enumeration of features of WSO is in the script find_traits. I determined the features by careful manual code examination. If you read find_traits and the WSO exemplars carefully, you will notice that I've done some clever trait selection. Every version and variant up to 2.4 (and 2.4 descendants not related to 2.5) keeps information in a PHP session variable, which resides on the server. Versions 2.5 and descendants (2.5.1, 2.8, 3.1) keep the same information in HTTP cookies, presumably to avoid leaving traces behind on the exploited server. I did not use "use PHP session" as a feature, but I did use "use HTTP cookies" as a feature. This works around the algorithm not accounting for feature deletions, just additions. Workarounds may not be possible in all such situations.

Why does WSO appear in so many versions? "oRb" wrote the code in a mostly modular fashion. "oRb" actually designed the program. This makes it easier to add backdoors or phone-homes, change password, modify page headers and footers, or add new functions. This is in contrast to other web shells like fx29, c99 or r57. These other web shells are an unholy mess inside, with little or no modularity, and no easy way to add functionality. "oRb" also shepherded WSO through a corporate-style linear improvement process from versions 2.0 through 2.5

Counter example: Perl/Shellbot

Perl/Shellbot is a piece of Perl that understands the IRC protocol. Specially formatted messages from the botherder cause the bot to do things, like perform a UDP packet flood, run a command via the system's shell, or look up "dorks" in every search engine on the planet. The code arrives in a dazzling number of variants, some with function names and commentary in Portugese, others with function names & etc in English. Every botherder seems to do minor customization.

Biology inspired Shellbot phylogeny

Above, a biology-inspired phylogeny using NCD and neighbor joining for 39 Perl/Shellbot variants I collected, or had downloaded to a honey pot. I did select the v0.1 variant as the outgroup to construct this tree. Neighbor joining did catch perlbot.pl.txt as the next most ancestral version, which seems correct, as perlbot.pl.txt seems almost identical, barring some extra comments.

The really wrong thing that NCD/NJ does is mix the English variants in with the Portugese variants. English versions new.txt.2, STMIK2.txt and sqltes.txt.1 all get placed near the base of the phylogeny. Careful examination of the Shellbot variants convinced me that the Portugese commented and named variants are primitive, while the English variants are derivative. Moreover, the English variants are quite a bit more uniform than the Portugese variants, all containing some variant of the search engine "dork" lookup code. All English variants also lack the package DCC code that's in some of the Portugese variants, including the most primitive.

Feature Addition Algorithm Shellbot phylogeny

I combed through all 39 of the Perl/Shellbot variants for features and traits. The simplest variants seem to have Portugese function names and variables names, but at the same time, seem to be patched together from other programs and modules. For instance, sub fixaddr has a Portugese comment that has it coming from Perl module Net::IRC::DCC.

The variants with English language comments and function names seem far more uniform. Every version (except byr.jpg) includes code for searching Google and other search engines for text fragments. The Portugese variants exhibit a lot more peculiarities.

Above is the best phylogeny I could find, derived from close comparison of the variants and careful trait selection. It's a disjointed mess. It does seperate the English language variants into their own lines of descent, but on the whole the result is unsatisfactory.

Addition of hypothetical lost species

It is possible to make a non-disjoint tree phylogeny with the addition of 3 extra species. The first species is just a Last Common Ancestor. In my view, it was very similar to the code in v0.1 or perlbot.pl.txt. This doesn't seem too controversial.

The second added species is labeled "english" in the tree below. That would be the first program with function names and comments translated from Portugese to English. Because all the english variants are so similar, this hypothetical species almost certainly existed.

The third added species is labeled "hypothetical". In reality, the oval labled "hypothetical" may have comprised a line of descent with gradual modifications, but at the very least that line of descent lost package DCC and had some code worked over to eliminate references to and use of package DCC. A single reference to a variable $dcc_sel, remains in every version of Perl/Shellbot, one reason why I consider presence of package DCC as the primitive condition.

With 3 hypothetical but not unrealistic additional species, the phylogeny becomes a tree. The number of immediate descendants of hypothetical makes it unrealistic, given the lines of descent attested to by the existing versions.

The above tree still does not contain the complexity of descent of this code. For example, a line select( undef, undef, undef, 0.01 ); appears in 18 of the versions, some English-language, some Portugese, all of the versions at least somewhat derived. This is one instance of a trait appearing without apparent ancestors, unlike biological "horizontal gene transfer" and more like "intelligent design".

Perl/Shellbot occurs in even more versions than WSO does. It's only other competitor is the pBot family of PHP IRC bots. I suspect this is because it's been around longer than any web shell, and you have to modify the source code to configure it. Once the botherder has taken the trouble to configure a server name and master alias, why not go even further? The modifications tend to be in non-IRC parts of the code, sub sendraw, sub conectar/connector and sub parse all stay much the same from the most primitive to the most derived versions.

Conclusion

Malware that has source code available evolves by feature addition and subtraction. An algorithm that reconstructs phylogenies of source code malware must take this evolution into account. Algorithms considering only feature addition won't be able to reconstruct complicated phylogenies of some real world source code malware.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

An algorithm to calculate malware phylogeny, an example and counterexample

Source code malware evolution

Biology-influenced phylogenies

Normalized Compression Distance

Algorithm

Determine traits

Description

Data Structure

Details of algorithm

Performance

Worst case

Effect of ill-chosen traits

Example: Web Shell by oRb

Description

Counter example: Perl/Shellbot

Biology inspired Shellbot phylogeny

Feature Addition Algorithm Shellbot phylogeny

Addition of hypothetical lost species

Conclusion

Files

README.md

Latest commit

History

README.md

File metadata and controls

An algorithm to calculate malware phylogeny, an example and counterexample

Source code malware evolution

Biology-influenced phylogenies

Normalized Compression Distance

Algorithm

Determine traits

Description

Data Structure

Details of algorithm

Performance

Worst case

Effect of ill-chosen traits

Example: Web Shell by oRb

Description

Counter example: Perl/Shellbot

Biology inspired Shellbot phylogeny

Feature Addition Algorithm Shellbot phylogeny

Addition of hypothetical lost species

Conclusion