
Scoring software vulnerabilities

This project evolved as a collection of tools for analyzing software vulnerability data. It is largely a set of command-line utilities. Each script focuses on a single unit of work, the aim being that more complex processing pipelines are built via composition. This design allows other command-line utilities to be leveraged while keeping the API surface minimal.

One of the main intended uses is training ML models for the exploit prediction problem. Please see the paper references for more background.

System requirements

The utilities target Python 3 (tested against 3.5-3.7). See requirements.txt for the Python dependencies.

jq 1.5+ is required for essentially all data processing tasks. (See the data workflow below.) You can download the latest stable version for your target platform; on Linux systems it can also be installed via the system package manager.

Data workflow

Exploit prediction is a supervised learning problem. Most machine learning workflows start by marshaling the data into a tabular format--an N-by-D feature matrix, together with an additional column for the labels--and perform all cleaning and feature engineering steps from there. The DataFrame structures in R and Pandas are designed around this.

The tools here emphasize a different, "opinionated" workflow whose point of departure is the fact that raw vulnerability data is most readily available in a hierarchically structured format like XML or JSON instead of flat tables. The target format for the data is a line-delimited file of JSON records -- the so-called JSONL format. Each data cleaning or feature engineering step consumes a JSONL file and emits a new one, thereby building a pipeline of processing steps with checkpoints along the way.
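
For concreteness, a single step in this style can be as small as the following Python sketch (purely illustrative, not a script in this repository): it reads JSONL from stdin, applies a per-record transform, and writes JSONL to stdout.

import json
import sys

def transform(record):
    """Per-record cleaning or feature engineering; the identity here is a placeholder."""
    return record

# Consume a JSONL stream row-wise and emit a new JSONL stream.
for line in sys.stdin:
    record = json.loads(line)
    print(json.dumps(transform(record)))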

One of the design decisions that is best made explicit early on involves the preferred way of defining and encoding features. Suppose that the input records have a top-level property called "foo," each being an object of categorical attributes:

{..., "foo": {"type": "debian", "version": "1.2", "affected": true}, ...}
{..., "foo": {"type": "legacy", ""affected": false}, ...}
...

One possible approach is to create a feature for each of the paths foo.type, foo.version, foo.affected, etc., each of which would be a categorical variable with its own one-hot encoding. Instead, the preferred approach is to use a bag-of-words encoding for the top-level property. Its vocabulary is the space of all key-value paths within the object, e.g., type.debian, version.1.2, etc., so that the preprocessed records become:

{..., "foo": ["type.debian", "version.1.2", "affected.true"], ...}
{..., "foo": ["type.legacy", "affected.false"], ...}
...

The two approaches are mathematically equivalent. However, the latter helps keep the data wrangling steps simpler: for each data set, one only needs to specify the transforms and encodings for a bounded set of top-level properties.
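
As a sketch (assuming the one-level structure shown above; the helper below is illustrative, not the repository's preprocessor), the flattening amounts to joining each key with its stringified value:

def to_bag_of_tokens(obj):
    # Map {"type": "debian", "affected": True} -> ["type.debian", "affected.true"].
    return ["{}.{}".format(key, str(value).lower()) for key, value in obj.items()]

record = {"foo": {"type": "debian", "version": "1.2", "affected": True}}
record["foo"] = to_bag_of_tokens(record["foo"])
# record["foo"] == ["type.debian", "version.1.2", "affected.true"]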

The data cleaning and feature engineering steps of the workflow operate on the data one record at a time (or "row-wise"), and then the final encoding step transforms the data into a columnar format consumable for ML training and evaluation. That target format is a Python dictionary associating feature names (the keys) to 2D numpy arrays. A given array will have shape (N, K), where N is the number of records, and K is the dimensionality of the vector encoding for that feature. Note that the term "feature" here is applied loosely, as it may include the class labels for a supervised learning problem, in which case K=1.
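
Continuing the "foo" example, the encoded dictionary for those two records could look like the sketch below. The sorted five-token vocabulary, the binary multi-hot encoding, and the "label" key are assumptions made for illustration only.

import numpy as np

# Sorted vocabulary for "foo":
# ["affected.false", "affected.true", "type.debian", "type.legacy", "version.1.2"]
encoded = {
    # one row per record, one column per token: shape (N, K) = (2, 5)
    "foo": np.array([[0., 1., 1., 0., 1.],
                     [1., 0., 0., 1., 0.]]),
    # class labels count as a "feature" with K = 1: shape (2, 1)
    "label": np.array([[1.],
                       [0.]]),
}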

Workflow outline

  1. Create a file of JSON records, where all records have the same set of keys corresponding to the "features" of interest. A basic walk-through of data acquisition illustrates this.

  2. Apply the preprocessing script to the raw data, creating another file of JSON records with the same top-level keys, but where the corresponding values are either arrays of strings (literally, bags of tokens) or numeric values.

  3. Apply the encoding script to transform the preprocessed records into the target dictionary of numpy arrays.

Command line API

This section documents the preprocessing and encoding scripts in more detail. Each of these scripts consumes and emits files as part of a data pipeline that can be summarized as follows:

preprocess.py

Argument    State            Description
config      required input   JSON configuration file.
rawdata     required input   JSONL file of raw features.
processed   output           JSONL file of processed features.
vocabulary  optional output  JSON file of vocabularies.

encode.py

Argument    State            Description
config      required input   JSON configuration file.
vocabulary  required input   JSON file of vocabularies.
processed   required input   JSONL file of processed features.
encoded     output           Dictionary of numpy arrays.

config schema

Both scripts take a config argument that defines all of the preprocessing and encoding methods applied to each feature. It is a JSON array of objects, one per feature, with the following schema:

[
  {
    "feature":      // key name in JSON input records.
    "preprocessor": // reference in preprocess.PREPROCESSORS
    "encoder":      // reference in encode.ENCODERS
    // optional key-word arguments for preprocessor and encoder methods.
  },
  ...
]
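
A concrete config might look like the example below. The preprocessor and encoder names ("bag_of_tokens", "one_hot", "identity", "numeric") and the "min_frac" keyword argument are hypothetical placeholders; consult preprocess.PREPROCESSORS and encode.ENCODERS for the names that actually exist.

[
  {
    "feature": "foo",
    "preprocessor": "bag_of_tokens",
    "encoder": "one_hot",
    "min_frac": 0.01
  },
  {
    "feature": "label",
    "preprocessor": "identity",
    "encoder": "numeric"
  }
]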

vocabulary schema

When working with a feature (text or structured data) to which a bag-of-words encoding will be applied, it is important to extract the vocabulary for that feature, which fixes an assignment of each token to its dimension in the vector representation. Because the same vector representation used to train an estimator must also be applied to new examples during inference, the vocabulary is treated as an artifact of preprocessing that becomes an input to any encoding step.

The vocabulary artifact emitted by the preprocessing script is a JSON file with a simple nested format:

{
  <feature>: {
    <token_0>: <frac_0>,
    <token_1>: <frac_1>,
    ...
  }
  ...
}

The top-level keys are features from the input data, but only those targeting a bag-of-words encoding; numeric features are absent. The nested maps associate each token in that "feature space" with the fraction of records in which that token appears.
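
For the "foo" feature from the earlier example, an emitted vocabulary might look like the following (the fractions are made-up values for illustration):

{
  "foo": {
    "affected.false": 0.42,
    "affected.true": 0.58,
    "type.debian": 0.51,
    "type.legacy": 0.07,
    "version.1.2": 0.03
  }
}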

When this object is consumed by the encoding script, the only thing that matters for the vector representation of a feature is its "key space" of tokens, as the token-to-dimension mapping is established by sorting. This allows for different dimension-reduction strategies that prune or otherwise transform these nested objects in the input vocabulary; the numeric fractions are provided only as an aid to such steps.
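
For example, a simple pruning pass might drop rare tokens before the vocabulary is handed to the encoding script. The sketch below is one way to do this; the file names and the 1% threshold are arbitrary choices.

import json

MIN_FRAC = 0.01  # arbitrary threshold: keep tokens appearing in >= 1% of records

with open("vocabulary.json") as fp:
    vocabulary = json.load(fp)

# Drop rare tokens from every feature's nested token-to-fraction map.
pruned = {
    feature: {token: frac for token, frac in tokens.items() if frac >= MIN_FRAC}
    for feature, tokens in vocabulary.items()
}

with open("vocabulary-pruned.json", "w") as fp:
    json.dump(pruned, fp, indent=2)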