use hdf5 dataset
gglin001 committed Mar 23, 2022
1 parent 36b9436 commit d674e7e
Showing 10 changed files with 232 additions and 239 deletions.
41 changes: 20 additions & 21 deletions examples/language_model/bert/static_ipu/README.md
@@ -13,11 +13,11 @@ This project enabled BERT-Base pre-training and SQuAD fine-tuning task using [Pa
| `run_squad.py` | The algorithm script to run SQuAD finetune and validation task. |
| `modeling.py` | The algorithm script to build the Bert-Base model. |
| `dataset_ipu.py` | The algorithm script to load input data in pretraining. |
| `custom_ops/` | The folder contains custom ops that will be used. |
| `run_pretrain.sh` | Test script to run pretrain phase 1. |
| `run_pretrain_phase2.sh` | Test script to run pretrain phase 2. |
| `run_squad.sh` | Test script to run SQuAD finetune. |
| `run_squad_infer.sh` | Test script to run SQuAD validation. |
| `LICENSE` | The Apache license. |

## Dataset

@@ -27,37 +27,36 @@ This project enabled BERT-Base pre-training and SQuAD fine-tuning task using [Pa

The sequence lengths used in pretraining phase 1 and phase 2 are 128 and 384, respectively. The following steps generate the dataset.

```
# Code base:https://github.com/NVIDIA/DeepLearningExamples/tree/88eb3cff2f03dad85035621d041e23a14345999e/TensorFlow/LanguageModeling/BERT
```bash
# Here we use a specific commit; the latest commit should also be fine.
git clone https://github.com/NVIDIA/DeepLearningExamples.git
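# Run the checkout inside the cloned repository (the shell is still in the parent directory here).
git -C DeepLearningExamples checkout 88eb3cff2f03dad85035621d041e23a14345999e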

cd DeepLearningExamples/TensorFlow/LanguageModeling/BERT
bash scripts/docker/build.sh
cd DeepLearningExamples/PyTorch/LanguageModeling/BERT

cd data/
# Change `--max_seq_length 512` to `--max_seq_length 384` at line 50 and
# `--max_predictions_per_seq 80` to `--max_predictions_per_seq 56` at line 51.
vim data/create_datasets_from_start.sh

# Change `--max_seq_length 512` to `--max_seq_length 384` at line 68 and `--max_predictions_per_seq 80` to `--max_predictions_per_seq 56` at line 69.
vim create_datasets_from_start.sh
# Build docker image
bash scripts/docker/build.sh

cd ../
# Use NVIDIA's docker image to download the data and generate the hdf5 files. This may require an available GPU.
# You can remove `--gpus $NV_VISIBLE_DEVICES` to avoid the GPU requirement.
bash scripts/docker/launch.sh

# Use NVIDIA's docker image to download the data and generate the tfrecord files. This may require an available GPU. Remove `--gpus $NV_VISIBLE_DEVICES` in data_download.sh to avoid the GPU requirement.
bash scripts/data_download.sh wiki_only
# generate dataset with wiki_only
bash data/create_datasets_from_start.sh wiki_only
```

2. SQuAD 1.1 dataset
2. SQuAD v1.1 dataset

```
curl --create-dirs -L https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json -o data/squad/train-v1.1.json
PaddleNLP downloads the SQuAD v1.1 dataset automatically, so no manual download is needed.

curl --create-dirs -L https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json -o data/squad/dev-v1.1.json
```
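
The manual `vim` edit in the Wikipedia dataset steps above can also be scripted. The following is a minimal sketch, not part of the original README: `patch_phase2_params` is a hypothetical helper name, it assumes GNU `sed`, and it relies on the line numbers 50 and 51 quoted in the comment above, which are only meaningful for the pinned commit. The function returns a non-zero status if either replacement did not take effect, so a changed upstream file is noticed rather than silently ignored.

```bash
# Hypothetical helper (not in the original README): apply the two phase-2
# parameter changes to the script given as the first argument.
# Assumes GNU sed (-i without a suffix) and that the parameters sit on
# lines 50 and 51, as stated in the README comment for the pinned commit.
patch_phase2_params() {
    script="$1"
    sed -i \
        -e '50s/--max_seq_length 512/--max_seq_length 384/' \
        -e '51s/--max_predictions_per_seq 80/--max_predictions_per_seq 56/' \
        "$script"
    # Verify that both new values are present; fail otherwise.
    grep -q -e '--max_seq_length 384' "$script" &&
        grep -q -e '--max_predictions_per_seq 56' "$script"
}
```

Called as `patch_phase2_params data/create_datasets_from_start.sh` from the BERT example directory, it would replace the interactive edit. Inspect the file first if a commit other than the pinned one is used.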

## Quick Start Guide

### 1)Prepare Project Environment
### Prepare Project Environment

This application requires PaddlePaddle with the IPU implementation provided by Graphcore. Users can either download the released package or build it from source.

@@ -75,15 +74,15 @@ git clone -b bert_base_sdk_2.3.0 https://github.com/graphcore/Paddle.git
cd Paddle
# build docker image
docker build -t paddlepaddle/paddle:dev-ipu-2.3.0 -f tools/dockerfile/Dockerfile.ipu .
docker build -t paddlepaddle/paddle:ipu-dev-2.3.0 -f tools/dockerfile/Dockerfile.ipu .
# create container
# The ipuof.conf is required here.
docker run --ulimit memlock=-1:-1 --net=host --cap-add=IPC_LOCK \
--device=/dev/infiniband/ --ipc=host --name paddle-ipu-dev \
-v ${HOST_IPUOF_PATH}:/ipuof \
-e IPUOF_CONFIG_PATH=/ipuof/ipu.conf \
-it paddlepaddle/paddle:dev-ipu-2.3.0 bash
-it paddlepaddle/paddle:ipu-dev-2.3.0 bash
```
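
Because the container mounts `${HOST_IPUOF_PATH}` and points `IPUOF_CONFIG_PATH` at `/ipuof/ipu.conf`, it can help to confirm that the host directory actually holds that file before running `docker run`. The sketch below is an illustration only: `check_ipuof_config` is a hypothetical helper, and the file name `ipu.conf` is taken from the `-e IPUOF_CONFIG_PATH=/ipuof/ipu.conf` line above.

```bash
# Hypothetical pre-flight check (not in the original README): verify that
# the directory to be mounted at /ipuof contains the expected config file.
check_ipuof_config() {
    dir="$1"
    if [ -z "$dir" ]; then
        echo "HOST_IPUOF_PATH is not set" >&2
        return 1
    fi
    if [ ! -f "$dir/ipu.conf" ]; then
        echo "no ipu.conf found in $dir" >&2
        return 1
    fi
    echo "found $dir/ipu.conf"
}
```

A possible use is `check_ipuof_config "${HOST_IPUOF_PATH}"` on the host, followed by the `docker run` command only if the check succeeds.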

All subsequent steps must be executed inside the container.
@@ -104,7 +103,7 @@ cmake --build `pwd`/build --config Release --target paddle_python -j$(nproc)
pip3.7 install -U build/python/dist/paddlepaddle-0.0.0-cp37-cp37m-linux_x86_64.whl
```

### 2) Execution
### Execution

- Run pretraining phase1 (sequence_length = 128)
