Research Artifact for our USENIX Security 2023 paper: "Decompiling x86 Deep Neural Network Executables"
BTD is the first decompiler for deep neural network (DNN) executables. BTD takes DNN executables (running on x86 CPUs) compiled by DNN compilers (e.g., TVM, Glow, and NNFusion) and outputs full model specifications, including types of DNN operators, network topology, dimensions, and parameters that are (nearly) identical to those of the input models. Our evaluation shows that BTD is robust against complex compiler optimizations, such as operator fusion and memory layout optimization. More details are reported in our paper published at USENIX Security 2023.
Paper: coming soon
Extended version (25 pages): https://arxiv.org/abs/2210.01075
Artifact Appendix in USENIX format: artifact-appendix.pdf
This repo contains all code and data used in the evaluation of BTD. We also provide a Docker image to ease the AE process.
This artifact has been evaluated and awarded the following badges: Available, Functional, Reproduced.
ubuntu 18.04
git
gcc/g++ (7.5.0)
make (4.1)
python3 (3.6.9 or higher)
- numpy-1.19.5
- torch (1.9.0 or higher)
- torchvision (0.11.2)
- fastBPE (0.1.0)
- tqdm (4.64.1)
Intel pin (3.14)
IDA Pro (optional)
You can download pin 3.14 from here, or use the docker image with all prerequisites installed.
BTD relies on IDA Pro (version 7.5) for disassembly. Because IDA is commercial software, we do not provide it in this repo; instead, to reduce the workload of AE reviewers, we provide the disassembly results directly as input for BTD. The scripts used to disassemble DNN executables into assembly functions with IDA are provided in ida/. IDA Pro is not indispensable; any other full-fledged disassembly tool can replace it, but we do not provide the relevant code here.
We ran our evaluation experiments on a server equipped with an Intel Xeon CPU E5-2683, 256GB RAM, and an Nvidia GeForce RTX 2080 GPU. Logging and filtering all traces for all DNN executables in the evaluation takes more than a week (sorry, we currently only provide a single-thread version) and consumes nearly 1TB of disk storage. To ease the review for the AE committee, we omit the trace logging process and provide the filtered traces in the docker image and evaluation data. The trace logger and filter are provided in MyPinTool/ and the trace_filter.py script. Without logging and filtering, the whole evaluation takes roughly one day and requires less than 120GB of disk space. In addition, the symbolic execution may consume a lot of memory, so please make sure that the machine running the experiments has sufficient memory.
Our evaluation covers the 7 models above compiled with 9 different compiler options, including Glow-2020, Glow-2021, Glow-2022, TVM-v0.7 (O0 and O3), TVM-v0.8 (O0 and O3), and TVM-v0.9.dev (O0 and O3), for a total of 63 DNN executables. NNFusion-emitted executables are easier to decompile since they contain wrapper functions to invoke target operator implementations in kernel libraries (see our paper for a more detailed discussion). Thus, in this evaluation we only focus on decompiling executables compiled by TVM and Glow.
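The total of 63 follows directly from the option list: three Glow versions plus three TVM versions at two optimization levels each gives 9 options, applied to 7 models. A minimal sketch of that arithmetic is shown below; the option strings are only labels chosen here for illustration, not identifiers used by the scripts in this repo.

```python
# Labels for the 9 compiler options named in the text (illustrative strings only).
compiler_options = (
    ["Glow-2020", "Glow-2021", "Glow-2022"]
    + [f"TVM-{version} ({opt})"
       for version in ("v0.7", "v0.8", "v0.9.dev")
       for opt in ("O0", "O3")]
)

NUM_MODELS = 7  # number of models stated in the text
total_executables = len(compiler_options) * NUM_MODELS

print(len(compiler_options), "options x", NUM_MODELS, "models =", total_executables)
```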
Download the packed docker image, then run the command below to unpack the .tar file into a docker image. This may take a while. (You can replace btd-artifact with any image name that does not conflict with existing names.)
cat BTD-artifact.tar | docker import - btd-artifact
Create a container named BTD-AE with the docker image:
docker run -dit --name BTD-AE btd-artifact /bin/bash
Open a bash in the container:
docker exec -it BTD-AE /bin/bash
cd /home
You can then run the evaluation commands (listed in Operator Inference and Decompilation & Rebuild below) within this bash. We strongly recommend reviewers use the provided Docker image for artifact evaluation to avoid errors that may be caused by environments.
If you are using the provided docker image, you can skip this Prepare section and move to Operator Inference.
Download and unzip Intel pin 3.14, then update the pin home directory (pin_home) in config.py.
git clone https://github.com/monkbai/DNN-decompiler.git
mkdir <path_to_pin_home>/source/tools/MyPinTool/obj-intel64
cd DNN-decompiler
git pull
python3 pin_tools.py
pin_tools.py will copy and compile all pin tools listed in MyPinTool/.
Download and unzip the data (BTD-data) used for artifact evaluation, then update the data directory DATA_DIR in decompile_eval.sh.
Download data.zip and output.zip and unzip them into the operator_inference/data and operator_inference/output directories, respectively.
The code structure and docs of operator inference are provided in operator_inference/README.
cd DNN-decompiler
git pull
./op_infer_eval.sh
The ./op_infer_eval.sh script will run the operator inference experiments. Inference results are written to operator/output/<compiler_option>/text/test_000.txt.
The output has the format: <Compiler Option>-<Model>-<Operator Name/Type> Pred: output. For example, the output below indicates that a libjit_fc_f (Fully-Connected, FC) operator in the vgg16 model compiled with GLOW_2021 is correctly inferred as matmul (Matrix Multiplication).
GLOW_2021-vgg16-libjit_fc_f Pred: matmul
GLOW_2021-vgg16-libjit_fc_f Label: matmul
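As a rough illustration of how these Pred/Label pairs can be tallied, the sketch below parses lines in the format shown above and counts matching pairs. It is not the repo's operator/run_accuracy.py; the regular expression and the function name inference_accuracy are assumptions made for this example, and it assumes each Pred line is followed by the Label line of the same operator.

```python
import re

# Assumed line shape, taken from the example above:
#   <Compiler Option>-<Model>-<Operator Name> Pred: <type>
#   <Compiler Option>-<Model>-<Operator Name> Label: <type>
LINE_RE = re.compile(r"^(?P<key>\S+)\s+(?P<kind>Pred|Label):\s*(?P<value>\S+)\s*$")

def inference_accuracy(lines):
    """Return (correct, total) over consecutive Pred/Label pairs (hypothetical helper)."""
    correct = total = 0
    pending = None  # (key, predicted type) waiting for its Label line
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that do not follow the assumed format
        if m.group("kind") == "Pred":
            pending = (m.group("key"), m.group("value"))
        elif pending is not None and pending[0] == m.group("key"):
            total += 1
            correct += int(pending[1] == m.group("value"))
            pending = None
    return correct, total

sample = [
    "GLOW_2021-vgg16-libjit_fc_f Pred: matmul",
    "GLOW_2021-vgg16-libjit_fc_f Label: matmul",
]
correct, total = inference_accuracy(sample)
print(f"{correct}/{total} operators inferred correctly")
```

Pairing consecutive lines rather than keying on the operator name keeps repeated operators (for example, several FC layers in one model) as separate samples.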
cd DNN-decompiler
git pull
./decompile_eval.sh
The ./decompile_eval.sh script will decompile and rebuild all 63 DNN executables. It takes roughly 24 hours to finish all experiments. The outputs of the rebuilt models and the original DNN executables will be printed on screen (see the example in Decompilation Correctness below). The corresponding decompilation outputs will be stored in evaluation/<model>_<compiler>_<version>_<opt level>.
BTD will decompile a DNN executable into ❶ DNN operators and their topological connectivity, ❷ dimensions of each DNN operator, and ❸ parameters of each DNN operator, such as weights and biases.
After executing decompile_eval.sh, the following files will be generated for each directory in evaluation/: a topo_list.json containing the network topology (❶), a new_meta_data.json containing dimension information (❷), and a series of <func_id>.<weights/biases>_<id>.json files containing all parameters of the decompiled DNN model (❸).
Each item in topo_list.json: ['node id', '<func_id>.txt', 'operator type', [input addresses], 'output address', [input node ids], occurrence index].
Example (vgg16 TVM v0.8 O0):
[
1, // node id
"0031.txt", // func id (func name)
"bias_add", // operator type
[ // input addresses
"0x50a5e0", // output address of previous node
"0x22e2b1e0" // biases address
],
"0x114b1e0", // output address
[
0 // input node id
],
0 // occurrence index of the func
],
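To make the item layout concrete, the sketch below loads one entry in that layout and lists the connections it encodes. It is an illustration only: iter_edges is a hypothetical helper, not part of BTD, and the // annotations from the example above are dropped because they are not valid JSON.

```python
import json

# One entry in the documented layout; values copied from the example above.
topo_json = """
[
  [1, "0031.txt", "bias_add", ["0x50a5e0", "0x22e2b1e0"], "0x114b1e0", [0], 0]
]
"""

def iter_edges(topo_list):
    """Yield (source node id, destination node id, operator type) per connection."""
    for node_id, func_file, op_type, in_addrs, out_addr, in_nodes, occurrence in topo_list:
        for src in in_nodes:
            yield src, node_id, op_type

topo = json.loads(topo_json)
for src, dst, op_type in iter_edges(topo):
    print(f"node {src} -> node {dst} ({op_type})")
```

For a real topo_list.json, the string passed to json.loads would be replaced by the file contents.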
Each item in new_meta_data.json: ['<func_id>.txt', [operator dimensions], 'operator entry address (in executable)', 'operator type', with_parameter, stride (if exists), padding (if exists)].
Example (vgg16 TVM v0.8 O0):
[
"0049.txt", // func_id (or func name)
[ // dimensions
[ // filter/weights dimensions
64.0,
3.0,
3,
3
],
[ // input dimensions
1,
3.0,
226.0,
226.0
],
[ // output dimensions
1,
64.0,
224,
224
],
[ // weights layout
2.0,
1,
3,
3,
3.0,
32.0
]
],
"0x405040", // operator entry
"conv2d", // operator type
1, // has parameters
1, // stride = 1
1 // padding = 1
],
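The entry above can be unpacked as in the following sketch, which normalizes the mixed int/float dimensions to integers and treats stride and padding as optional trailing fields. OperatorMeta and parse_meta are hypothetical names for this example, and the sketch assumes the dimensions are flat lists of numbers as in the conv2d entry; other operator types may need different handling.

```python
from typing import List, NamedTuple, Optional

class OperatorMeta(NamedTuple):
    func_file: str
    dims: List[List[int]]
    entry_addr: int
    op_type: str
    has_params: bool
    stride: Optional[int]
    padding: Optional[int]

def parse_meta(entry):
    """Unpack one new_meta_data.json item (hypothetical helper)."""
    func_file, dims, entry_addr, op_type, with_param, *rest = entry
    stride = rest[0] if len(rest) > 0 else None   # present only for some operators
    padding = rest[1] if len(rest) > 1 else None
    int_dims = [[int(d) for d in shape] for shape in dims]  # 64.0 -> 64
    return OperatorMeta(func_file, int_dims, int(entry_addr, 16),
                        op_type, bool(with_param), stride, padding)

# Values copied from the conv2d example above.
entry = ["0049.txt",
         [[64.0, 3.0, 3, 3], [1, 3.0, 226.0, 226.0], [1, 64.0, 224, 224],
          [2.0, 1, 3, 3, 3.0, 32.0]],
         "0x405040", "conv2d", 1, 1, 1]

meta = parse_meta(entry)
print(meta.op_type, meta.dims[0], hex(meta.entry_addr))
```

In this example the 226x226 input appears to already include the padding of 1 on each side, since (226 - 3) / 1 + 1 = 224 matches the recorded output size.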
After decompilation, the DNN model is rebuilt with the decompiled model structure and extracted parameters (stored in .json format). decompile_eval.sh will run each rebuilt model (implemented in pytorch) and the original DNN executable with the above example image in binary format as input. The output looks like this:
- vgg16_tvm_v09_O3
- Rebuilt model output:
Result: 282
Confidence: 9.341153
- DNN Executable output:
The maximum position in output vector is: 282, with max-value 9.341150.
timing: 566.89 ms (create), 0.54 ms (set_input), 4034.66 ms (run), 0.00 ms (get_output), 0.61 ms (destroy)
In the above example, both the rebuilt model and the DNN executable output 282 as the result (see 1000 classes of ImageNet), and the confidence scores are 9.341153 and 9.341150, respectively. The confidence scores (or max values) differ slightly; we attribute this to floating-point precision loss between the pytorch model and the DNN executable, i.e., the decompilation is still correct.
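The comparison described above can be sketched as follows: the predicted class must match exactly, while the scores only need to agree within a small tolerance. This is not the check used by decompile_eval.sh; the parsing patterns, the helper names, and the relative tolerance of 1e-5 are all assumptions chosen for this example.

```python
import math
import re

FLOAT = r"[-+]?\d+(?:\.\d+)?"

# Output text copied from the example above.
rebuilt_log = "Result: 282\nConfidence: 9.341153"
executable_log = "The maximum position in output vector is: 282, with max-value 9.341150."

def parse_rebuilt(text):
    """Extract (class index, score) from the rebuilt model's output."""
    label = int(re.search(r"Result:\s*(\d+)", text).group(1))
    score = float(re.search(r"Confidence:\s*(" + FLOAT + ")", text).group(1))
    return label, score

def parse_executable(text):
    """Extract (class index, score) from the DNN executable's output."""
    m = re.search(r"is:\s*(\d+), with max-value\s*(" + FLOAT + ")", text)
    return int(m.group(1)), float(m.group(2))

def outputs_match(a, b, rel_tol=1e-5):
    """Same class, and scores equal up to an assumed relative tolerance."""
    return a[0] == b[0] and math.isclose(a[1], b[1], rel_tol=rel_tol)

rebuilt = parse_rebuilt(rebuilt_log)
executable = parse_executable(executable_log)
print(rebuilt, executable, outputs_match(rebuilt, executable))
```

Here the scores differ by about 3e-6 on a value near 9.34, which is well inside the assumed tolerance.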
Update: We uploaded scripts to summarize the results of the above experiments.
git pull
./summarization.sh
summarization.sh will invoke scripts including:
- statistic.py, which collects statistics of DNN executables evaluated in our study (Table 2). Note that the statistics may slightly deviate from the numbers in Table 2 depending on the building environment, but this should not affect our claims in the paper.
- operator/run_accuracy.py, which calculates the average accuracy of operator inference (Table 3). Note that since we have manually fixed the "Add vs. BiasAdd" issue discussed in Operators with Similar Assembly Code of Section 7.1.1, in some cases, the accuracy may be higher (i.e., better results) than the results reported in Table 3.
- parameter_accuracy.py, which calculates the dimension inference accuracy/parameter inference accuracy of TVM Resnet18 (Table 4). Note that it is difficult to compare the recovered dimensions/parameters with the reference due to compiler optimizations (e.g., operator fusion), i.e., the ground truth of optimized models is not available, as discussed in Sec 7.1.3. Hence, #failures in Table 4 equals #dimensions or #parameters that need to be fixed before the recovered models can be compiled into executables showing identical behavior with the references. This script only reproduces results for Resnet18 (accuracies for all other models are 100% and therefore do not need to be included in this script; see results in Sec 7.1.4).
- recompile_correctness.py, which evaluates the correctness of recompilation (Table 5). Pass means the model is 100% correctly rebuilt. Note that we manually fix errors in TVM Resnet18 as discussed in Sec 7.1.4 to confirm our claim that "all remaining operators in ResNet18 are correctly decompiled". Therefore, we expect to get 63/63 passes by running this script.
When the summarization.sh script finishes running, all results reported in Tables 2-5 should be printed to the screen.
├── MyPinTool/ // Pin tools' source code
├── compiler_opt/ // identify the compilation provenance
├── evaluation/ // scripts for main evaluation including 63 executables
├── ida/ // ida scripts
├── nlp_models/ // nlp models evaluation
├── nnfusion/ // nnfusion evaluation
├── operator_inference/ // infer the type of a DNN operator
├── recompile/ // recompile decompiled models
├── validation/ // to validate the correctness of rebuilt models
├── white-box-attack/ // info about white-box attacks we used
├── config.py
├── decompile_eval.sh // script for artifact evaluation
├── explain.py // heuristics used in BTD
├── fused_trace.py
├── mem_slices.py
├── pin_tools.py // pin tools compilation and usage
├── se_engine.py // a trace based symbolic execution engine
├── split_funcs.py // split disassembly output into functions
├── trace_filter.py // taint analysis to filter logged trace
└── utils.py
If you are interested in the interfaces of BTD, you can take a look at the decompilation scripts in evaluation/, e.g., vgg16_tvm_O0_decompile.py.
Our dataset is available at https://doi.org/10.5281/zenodo.7219867.
We also provide all datasets via Dropbox for better download speed.
- Docker Image for Artifact Evaluation: https://www.dropbox.com/s/o43uoxrxisozdq5/BTD-artifact.tar?dl=0
- Labeled Dataset for Operator Inference (containing compiled DNN executables and disassembly output): https://www.dropbox.com/s/a1mxqwqn4tytmgz/labeled_dataset_2022.zip?dl=0
- Data for artifact evaluation (including filtered traces, disassembled functions, and original DNN executables): https://www.dropbox.com/s/ifzc4d7z4czbpvv/BTD-data.zip?dl=0
- ONNX Models: https://www.dropbox.com/s/x8gwqyej7fla3rz/DNN_models.zip?dl=0 (Also available at ONNX Model Zoo)
- Compiled DNN Executables: https://www.dropbox.com/s/lf297rjgx7e39id/DNN_executables.zip?dl=0
- Model Inputs: https://www.dropbox.com/s/nook5hs9srjveev/images.zip?dl=0 and https://www.dropbox.com/s/9y0k71dbowixs8w/embedding_input.zip?dl=0
- White-box Attack Results: https://www.dropbox.com/s/9fuxmfuaroqtvjm/whitebox_steal.zip?dl=0