A step-by-step tutorial for training COCO-FUNIT on the animal faces dataset. Forked from the imaginaire project. The original README can be found here.
This tutorial was tested on Pop!_OS 22.04.
Imaginaire is released under the NVIDIA Software License. For commercial use, please consult NVIDIA Research Inquiries.
-
Install Docker
Install Docker Engine using the official guide. The link for each platform is listed under the Server heading.
After installation, add your user to the docker group:
sudo groupadd docker
sudo usermod -aG docker $USER
Then restart so that your group membership is re-evaluated.
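If you would rather not restart immediately, a minimal alternative is to activate the membership in the current session and verify it took effect. This sketch assumes a standard Linux setup; the id command below simply lists your current groups:

```shell
# Optional: activate the new group in the current shell instead of restarting
# (newgrp opens a subshell with the updated group list):
#   newgrp docker
# After restarting or logging back in, confirm "docker" appears in your group list:
id -nG
```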
You can verify that Docker has been set up correctly by running the
hello-world
image:
docker run hello-world
-
Install NVIDIA Container Toolkit
At the time of writing, I could not get the official installation method to work. The following commands install NVIDIA Container Toolkit version 1.10.0-1:
-
Ubuntu:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
  && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
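The sed expression above rewrites each deb https:// entry so that apt only trusts packages signed with the dearmored NVIDIA keyring. A minimal illustration of the substitution on a sample source line (example.com is a placeholder, not the real list file):

```shell
# Show what the signed-by rewrite does to a sample apt source line
echo 'deb https://example.com/repo stable main' \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g'
# -> deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://example.com/repo stable main
```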
sudo apt-get update
sudo apt-get install nvidia-container-toolkit=1.10.0-1
sudo apt-get install libnvidia-container1=1.10.0-1
sudo apt-get install libnvidia-container-tools=1.10.0-1
-
Pop!_OS:
Because nvidia-container-toolkit is only officially supported on a handful of distributions, a few extra steps are needed to install it on Pop!_OS. Point the package source at the Ubuntu 22.04 repository:
distribution="ubuntu22.04" \
  && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
  && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install nvidia-container-toolkit=1.10.0-1
sudo apt-get install libnvidia-container1=1.10.0-1
sudo apt-get install libnvidia-container-tools=1.10.0-1
-
Testing:
A working setup can be tested by running a base CUDA container:
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
This should produce console output similar to the following:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   34C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
-
Clone the repo
git clone https://github.com/Jesse27/imaginaire-coco-funit.git
-
Build docker image
After cloning, navigate to the
path/to/imaginaire-coco-funit/
directory and run the build script. Note: all scripts should be run from this directory.
bash scripts/build_docker.sh 21.06
-
Start docker image
bash scripts/start_local_docker.sh 21.06
This should open a shell in the container, as shown below, where 0f388ec0d8b2 is the docker CONTAINER ID:
root@0f388ec0d8b2:/workspace/coco-funit#
If you have run the code previously, you may receive the error below:
docker: Error response from daemon: Conflict. The container name "/coco-funit" is already in use by container ...
This can be solved by running the following commands to stop and remove the existing container.
docker stop coco-funit
docker rm coco-funit
If you wish to keep the existing container, you can open its terminal shell with:
docker exec -it coco-funit /bin/bash
-
Downloading the data
The example animal-faces dataset can be downloaded using the
download_dataset.py
script. This should be run in the docker container from the
/workspace/coco-funit/
directory:
python scripts/download_dataset.py --dataset animal_faces
-
Build the lmdbs
for f in train train_all val; do
  python scripts/build_lmdb.py \
    --config configs/projects/coco_funit/animal_faces/base64_bs8_class119.yaml \
    --data_root dataset/animal_faces/${f} \
    --output_root projects/coco_funit/data/lmdb/training/animal_faces/${f} \
    --overwrite
done
-
Start Training
--nproc_per_node=1
configures the number of GPUs used in training; it is set to 1 by default. Other configuration parameters are found in
configs/projects/coco_funit/animal_faces/base64_bs8_class119.yaml
python -m torch.distributed.launch --nproc_per_node=1 train.py \
  --config configs/projects/coco_funit/animal_faces/base64_bs8_class119.yaml \
  --logdir logs/projects/coco_funit/animal_faces/base64_bs8_class119.yaml
Note that you may encounter a git config issue. This can be solved by running the command displayed in the terminal and then re-running the train command:
git config --global --add safe.directory path/to/imaginaire-coco-funit/
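On recent PyTorch releases, torch.distributed.launch is deprecated in favor of torchrun. If the launch command above prints a deprecation warning, the sketch below is the equivalent torchrun invocation, assuming the same config and logdir paths:

```shell
# Equivalent launch with torchrun (available in PyTorch >= 1.10);
# set --nproc_per_node to the number of GPUs you want to use.
torchrun --nproc_per_node=1 train.py \
  --config configs/projects/coco_funit/animal_faces/base64_bs8_class119.yaml \
  --logdir logs/projects/coco_funit/animal_faces/base64_bs8_class119.yaml
```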
-
Output
The output contains images, TensorBoard logs and model checkpoints.
The training output is found in
path/to/imaginaire-coco-funit/logs/projects/coco_funit/animal_faces/base64_bs8_class119.yaml/
The number of rows shown in the output image is equal to the batch size per GPU.
-
TensorBoard
TensorBoard logs should be viewed from within the docker container. To access the container while the model is training, open another terminal and run:
docker exec -it coco-funit /bin/bash
Note that coco-funit is the default name of the docker container; it can be confirmed by running
docker container ls
This should open a shell in the container, as shown below, where 0f388ec0d8b2 is the docker CONTAINER ID:
root@0f388ec0d8b2:/workspace/coco-funit#
To start TensorBoard run the following command in the docker container:
tensorboard --logdir logs/projects/coco_funit/animal_faces/base64_bs8_class119.yaml/tensorboard
TensorBoard can then be opened at
0.0.0.0:8083
on the local machine.
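If the page does not load, it is assumed here that the start script publishes port 8083 from the container; you can also tell TensorBoard explicitly which interface and port to bind using its standard flags:

```shell
# Bind TensorBoard to all interfaces on the port published by the container.
# --host and --port are standard TensorBoard flags; 8083 matches the port
# mapping assumed by the start script.
tensorboard \
  --logdir logs/projects/coco_funit/animal_faces/base64_bs8_class119.yaml/tensorboard \
  --host 0.0.0.0 --port 8083
```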