This is the repository for the code (TensorFlow version) and datasets used in the paper BERT4ETH: A Pre-trained Transformer for Ethereum Fraud Detection, accepted by the ACM Web Conference (WWW) 2023. Here you can find our slides.
If you find this repository useful, please give us a star : ) Thank you!
Update: I've recently added a section (Section 5.5) discussing the multi-hop modeling capability of BERT4ETH to the arXiv version of the paper. (10/30)
BERT4ETH-PyTorch: Here you can find the PyTorch implementation: https://github.com/Bayi-Hu/BERT4ETH_PyTorch
Note 1: The master branch hosts the basic BERT4ETH. If you wish to run the basic model, there is no need to download the ERC-20 log dataset. Advanced features such as in/out separation and ERC-20 logs can be found in the old branch, but they are not recommended due to their computation and memory inefficiency.
Note 2: Although BERT4ETH is a sequential model, it is able to capture three-hop relationships from a graph perspective. (For more details, please refer to our slides.)
Note 3: The results reported in our paper are the best among five pre-training runs. Outcomes may vary slightly across pre-training runs, checkpoint steps, and runs of the cascaded MLP classifier training. Below are our recent results on the phishing detection task with fixed training:
- Python >= 3.6
- TensorFlow >= 2
I use Python 3.9 and TensorFlow 2.9.2 with CUDA 11.2, and NumPy 1.19.5.
Transaction Dataset:
cd BERT4ETH/Data; # Labels are already included
unzip ...;
cd Model;
python gen_seq.py --bizdate=bert4eth_exp
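For intuition, here is a minimal sketch of the kind of preprocessing gen_seq.py performs: grouping raw transactions into one chronologically ordered in/out sequence per address. The function and field names below are hypothetical and only illustrate the idea; the real logic lives in gen_seq.py.

```python
from collections import defaultdict

def build_address_sequences(transactions):
    """Group raw transactions into one chronological sequence per address.

    `transactions` is an iterable of dicts with hypothetical keys
    (from_address, to_address, value, timestamp).
    """
    sequences = defaultdict(list)
    for tx in transactions:
        # Each transaction appears in both the sender's and the receiver's sequence.
        sequences[tx["from_address"]].append((tx["timestamp"], tx["to_address"], tx["value"], "out"))
        sequences[tx["to_address"]].append((tx["timestamp"], tx["from_address"], tx["value"], "in"))
    # Order each address's transactions by time, as a sequence model expects.
    return {addr: sorted(seq) for addr, seq in sequences.items()}
```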
python gen_pretrain_data.py --bizdate=bert4eth_exp \
--max_seq_length=100 \
--dupe_factor=10 \
--masked_lm_prob=0.8
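Conceptually, --dupe_factor and --masked_lm_prob play the same roles as in the original BERT data pipeline: each sequence is duplicated dupe_factor times with different random masks, and addresses are masked with probability masked_lm_prob. A hedged sketch of that idea (not the script's exact logic):

```python
import random

MASK = "[MASK]"

def create_masked_instances(sequence, dupe_factor=10, masked_lm_prob=0.8, max_seq_length=100):
    """Duplicate a sequence `dupe_factor` times, masking each position with
    probability `masked_lm_prob` (illustrative only)."""
    sequence = sequence[:max_seq_length]
    instances = []
    for _ in range(dupe_factor):
        tokens, labels = [], []
        for address in sequence:
            if random.random() < masked_lm_prob:
                tokens.append(MASK)
                labels.append(address)   # the model must recover this address
            else:
                tokens.append(address)
                labels.append(None)      # not a prediction target
        instances.append((tokens, labels))
    return instances
```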
python run_pretrain.py --bizdate=bert4eth_exp \
--max_seq_length=100 \
--epoch=5 \
--batch_size=256 \
--learning_rate=1e-4 \
--num_train_steps=1000000 \
--save_checkpoints_steps=8000 \
--neg_strategy=zip \
--neg_sample_num=5000 \
--neg_share=True \
--checkpointDir=bert4eth_exp
| Parameter | Description |
|---|---|
| `bizdate` | The signature of this experiment run. |
| `max_seq_length` | The maximum input sequence length of BERT4ETH. |
| `masked_lm_prob` | The probability of masking an address. |
| `epoch` | Number of training epochs, default = 5. |
| `batch_size` | Batch size, default = 256. |
| `learning_rate` | Learning rate for the optimizer (Adam), default = 1e-4. |
| `num_train_steps` | The maximum number of training steps, default = 1000000. |
| `save_checkpoints_steps` | How often (in steps) a checkpoint is saved, default = 8000. |
| `neg_strategy` | Strategy for negative sampling, default = zip; options: uniform, zip, freq. |
| `neg_share` | Whether to enable the in-batch sharing strategy, default = True. |
| `neg_sample_num` | The number of negative samples per batch, default = 5000. |
| `checkpointDir` | The directory in which to save checkpoints. |
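My reading of the three `neg_strategy` options: uniform samples all addresses equally, freq samples proportionally to raw address frequency, and zip draws from a Zipfian distribution over frequency ranks; with `neg_share=True`, one pool of `neg_sample_num` negatives is shared across the whole batch. A rough sketch of such a sampler (my interpretation, not the repo's exact code):

```python
import numpy as np

def sample_negatives(address_counts, neg_sample_num=5000, strategy="zip"):
    """Sample one shared pool of negative addresses for a batch.

    address_counts: dict mapping address -> occurrence count in the corpus.
    The three strategies mirror the --neg_strategy options (my interpretation).
    """
    addresses = np.array(list(address_counts.keys()))
    counts = np.array(list(address_counts.values()), dtype=np.float64)
    if strategy == "uniform":
        probs = np.ones_like(counts)
    elif strategy == "freq":
        probs = counts                       # proportional to raw frequency
    elif strategy == "zip":
        ranks = counts.argsort()[::-1].argsort() + 1
        probs = 1.0 / ranks                  # Zipfian over frequency ranks
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    probs /= probs.sum()
    return np.random.choice(addresses, size=neg_sample_num, replace=True, p=probs)
```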
python output_embed.py --bizdate=bert4eth_exp \
--init_checkpoint=bert4eth_exp/model_104000 \
--max_seq_length=100 \
--neg_sample_num=5000 \
--neg_strategy=zip \
--neg_share=True
I have generated a version of the embedding file; you can unzip it under the "Model/inter_data/" directory and test the results.
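Assuming the export stores the addresses and their embedding matrix as pickled objects (the exact format and file names are defined by output_embed.py; the names below are hypothetical), loading them might look like:

```python
import pickle
import numpy as np

# Hypothetical file names; check output_embed.py for the actual format.
with open("inter_data/embedding_bert4eth_exp.pkl", "rb") as f:
    embeddings = pickle.load(f)    # e.g. a [num_addresses, hidden] array
with open("inter_data/address_bert4eth_exp.pkl", "rb") as f:
    addresses = pickle.load(f)     # parallel list of address strings

addr2vec = dict(zip(addresses, np.asarray(embeddings)))
print(len(addr2vec), "addresses loaded")
```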
python run_phishing_detection.py --init_checkpoint=bert4eth_exp/model_104000 # Random Forest (RF)
python run_phishing_detection_dnn.py --init_checkpoint=bert4eth_exp/model_104000 # DNN, better than RF
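Both scripts cascade a simple classifier on top of the frozen embeddings. A minimal sketch of the Random Forest variant, assuming an `addr2vec` mapping (as above) and a set of labeled phishing addresses (this is the general setup, not the scripts' exact splits or hyperparameters):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

def evaluate_rf(addr2vec, phisher_addresses):
    """Train and evaluate an RF on address embeddings (illustrative sketch)."""
    X = np.stack(list(addr2vec.values()))
    y = np.array([addr in phisher_addresses for addr in addr2vec], dtype=int)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)
    clf = RandomForestClassifier(n_estimators=300, n_jobs=-1).fit(X_tr, y_tr)
    print(classification_report(y_te, clf.predict(X_te), digits=4))
```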
python run_dean_ENS.py --metric=euclidean \
--init_checkpoint=bert4eth_exp/model_104000
python run_dean_Tornado.py --metric=euclidean \
--init_checkpoint=bert4eth_exp/model_104000
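The two de-anonymization scripts rank candidate addresses by embedding distance: with --metric=euclidean, addresses controlled by the same user should end up close together in embedding space. A stripped-down version of that matching step (sketch only; the scripts define their own candidate sets and evaluation):

```python
import numpy as np

def rank_candidates(query_vec, candidate_vecs, metric="euclidean"):
    """Return candidate indices sorted from most to least similar (sketch)."""
    if metric == "euclidean":
        dists = np.linalg.norm(candidate_vecs - query_vec, axis=1)
        return np.argsort(dists)    # smaller distance = better match
    if metric == "cosine":
        sims = candidate_vecs @ query_vec / (
            np.linalg.norm(candidate_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-12)
        return np.argsort(-sims)    # larger similarity = better match
    raise ValueError(f"unknown metric: {metric}")
```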
python gen_finetune_phisher_data.py --bizdate=bert4eth_exp \
--max_seq_length=100
python run_finetune_phisher.py --init_checkpoint=bert4eth_exp/model_104000 \
--bizdate=bert4eth_exp \
--max_seq_length=100 \
--checkpointDir=tmp
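Conceptually, fine-tuning replaces the masked-address head with a small binary classifier on top of the pre-trained sequence representation and trains end to end on the phishing labels. A Keras-style sketch of such a head (illustrative; the actual graph, pooling, and loss are built inside run_finetune_phisher.py):

```python
import tensorflow as tf

def add_classification_head(sequence_output, dropout_rate=0.1):
    """Pool the pre-trained sequence output and predict a phishing probability.

    sequence_output: [batch, max_seq_length, hidden] tensor from BERT4ETH.
    """
    pooled = tf.reduce_mean(sequence_output, axis=1)   # simple mean pooling
    pooled = tf.keras.layers.Dropout(dropout_rate)(pooled)
    logits = tf.keras.layers.Dense(1)(pooled)          # binary logit
    return tf.nn.sigmoid(logits)                       # phishing probability
```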
@inproceedings{hu2023bert4eth,
title={BERT4ETH: A Pre-trained Transformer for Ethereum Fraud Detection},
author={Hu, Sihao and Zhang, Zhen and Luo, Bingqiao and Lu, Shengliang and He, Bingsheng and Liu, Ling},
booktitle={Proceedings of the ACM Web Conference 2023},
pages={2189--2197},
year={2023}
}
If you have any questions, you can either open an issue or contact me (sihaohu@gatech.edu), and I will reply as soon as I see the issue or email.