Re-implementation of Bi-Directional Block Self-Attention for Fast and Memory-Efficient Sequence Modeling (T. Shen et al., ICLR 2018) in PyTorch.
Dataset: SNLI
| Model | ACC (%) |
|---|---|
| Re-implementation (600D Bi-BloSAN) | 84.1 |
| Baseline from the paper (480D Bi-BloSAN) | 85.7 |
- OS: Ubuntu 16.04 LTS (64bit)
- Language: Python 3.6.2
- PyTorch: 0.3.0
Please install the library requirements listed in requirements.txt first.
nltk==3.2.4
tensorboardX==1.0
torch==0.3.0
torchtext==0.2.1
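These can be installed with:

pip install -r requirements.txt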
python train.py --help
usage: train.py [-h] [--batch-size BATCH_SIZE] [--block-size BLOCK_SIZE]
[--data-type DATA_TYPE] [--dropout DROPOUT] [--epoch EPOCH]
[--gpu GPU] [--learning-rate LEARNING_RATE]
[--mSA-scalar MSA_SCALAR] [--print-freq PRINT_FREQ]
[--weight-decay WEIGHT_DECAY] [--word-dim WORD_DIM]
optional arguments:
-h, --help show this help message and exit
--batch-size BATCH_SIZE
--block-size BLOCK_SIZE
--data-type DATA_TYPE
--dropout DROPOUT
--epoch EPOCH
--gpu GPU
--learning-rate LEARNING_RATE
--mSA-scalar MSA_SCALAR
--print-freq PRINT_FREQ
--weight-decay WEIGHT_DECAY
--word-dim WORD_DIM
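For example, the following command trains the model on GPU 0 (the flag values here are only illustrative, not tuned recommendations):

python train.py --gpu 0 --batch-size 64 --dropout 0.25 --epoch 20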
Note:
- The two most important hyperparameters are block-size (r in the paper) and mSA-scalar (c in the paper). The paper suggests a heuristic for choosing r (in the Appendix) but does not mention how to set c. In this implementation, r is computed by the suggested heuristic and c is set to 5, following the authors' settings, but you can also assign values to them manually. A minimal sketch of how r splits a sequence into blocks is given after these notes.
- Dropout is also used in this model, but the paper does not specify how it is applied. Therefore, as a simple choice, dropout is applied only to the SNLI-specific layers (the NN4SNLI class).
- Furthermore, there are no details about the 480D Bi-BloSAN whose result is reported in the paper. Hence, the result reported here is based on a 600D (300D forward + 300D backward) Bi-BloSAN. Note that hyperparameter tuning has not been done thoroughly; the result can likely be improved with further tuning.
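For reference, the role of block-size is that each input sequence is padded and split into blocks of length r, and self-attention is first applied inside each block. The snippet below is only a minimal sketch of that splitting step, written against a recent PyTorch API rather than the 0.3.0 version pinned above; the function name and shapes are illustrative and this is not the code used in this repository.

import math
import torch

def split_into_blocks(x, r):
    # x: (batch, seq_len, dim) -> (batch, num_blocks, r, dim)
    batch, seq_len, dim = x.size()
    num_blocks = math.ceil(seq_len / r)
    pad_len = num_blocks * r - seq_len
    if pad_len > 0:
        # zero-pad the sequence so its length is a multiple of r
        pad = torch.zeros(batch, pad_len, dim, dtype=x.dtype, device=x.device)
        x = torch.cat([x, pad], dim=1)
    return x.view(batch, num_blocks, r, dim)

# e.g. 2 sequences of length 10 with block size r = 4 -> 3 blocks of length 4
blocks = split_into_blocks(torch.randn(2, 10, 300), r=4)
print(blocks.shape)  # torch.Size([2, 3, 4, 300])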
python test.py --help
usage: test.py [-h] [--batch-size BATCH_SIZE] [--block-size BLOCK_SIZE]
[--data-type DATA_TYPE] [--dropout DROPOUT] [--epoch EPOCH]
[--gpu GPU] [--mSA-scalar MSA_SCALAR] [--print-freq PRINT_FREQ]
[--word-dim WORD_DIM] --model-path MODEL_PATH
optional arguments:
-h, --help show this help message and exit
--batch-size BATCH_SIZE
--block-size BLOCK_SIZE
--data-type DATA_TYPE
--dropout DROPOUT
--epoch EPOCH
--gpu GPU
--mSA-scalar MSA_SCALAR
--print-freq PRINT_FREQ
--word-dim WORD_DIM
--model-path MODEL_PATH
Note: You should execute test.py with the same hyperparameters that were used to train the model you want to evaluate.
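For example, if a model was trained with the illustrative flags shown above, it could be evaluated with the following command (the model path is just a placeholder for wherever the checkpoint was saved):

python test.py --gpu 0 --batch-size 64 --dropout 0.25 --model-path <path-to-saved-model>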
The original implementation by the authors (in TensorFlow) can be found here.