Sockeye is training much faster than Marian #396

tomsbergmanis · 2022-09-14T15:44:49Z

Bug description

Sockeye is training much faster than Marian.
I ran a one-epoch training on a small data set of 4.7M training examples with each framework. To the best of my knowledge, I used comparable training parameters for both frameworks. But the results were 21 min vs 36 min, favoring Sockeye.
What I do not know is whether this is a problem with my old setup (Ubuntu 18.04.6 and everything that follows from it, e.g. an old compiler) or whether it has something to do with Marian.

How to reproduce

A typical way of training Sockeye systems is to run a data prep step before training:
sockeye-prepare-data --source train.bpe.en --target train.bpe.lv --output . --max-seq-len 128 --shared-vocab --num-words 25000
Data prep time was not included in training time.
To measure Sockeye's training time I used the timestamps of marker files created at the start and end of training, which worked out to 21 min.
touch sockeye.start && torchrun --no_python --nproc_per_node 2 sockeye-train --prepared-data . --output models --validation-source dev.bpe.en --validation-target dev.bpe.lv --max-num-epochs 1 --shared-vocab --dist --amp --update-interval 12 --batch-size 18000 --max-seq-len 128 > training.log 2>&1; touch sockeye.end
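
The elapsed time between the two marker files can be read off their modification times. The snippet below is a minimal sketch of that calculation, not part of either toolkit; the helper name `elapsed_minutes` is made up for illustration, and it assumes `sockeye.start` and `sockeye.end` were touched immediately before and after the training process ran.

```python
import os


def elapsed_minutes(start_marker: str, end_marker: str) -> float:
    """Return the minutes between the modification times of two marker files."""
    start = os.path.getmtime(start_marker)
    end = os.path.getmtime(end_marker)
    if end < start:
        # Guards against markers that were touched in the wrong order.
        raise ValueError("end marker is older than start marker")
    return (end - start) / 60.0


if __name__ == "__main__":
    # File names follow the touch commands used in this report.
    if os.path.exists("sockeye.start") and os.path.exists("sockeye.end"):
        minutes = elapsed_minutes("sockeye.start", "sockeye.end")
        print(f"Sockeye training took {minutes:.1f} min")
```

A result close to zero would suggest that the end marker was written before training actually finished, for example because the training command was sent to the background.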

For Marian I used /marian-vocab --max-size 25000
marian --devices 0 1 --type transformer --model /tmp/toms/sockeye-test/model.npz --train-sets /tmp/toms/sockeye-test/train.bpe.en /tmp/toms/sockeye-test/train.bpe.lv --vocabs en-lv-shared-vocab.yml en-lv-shared-vocab.yml --max-length 128 --max-length-factor 1.5 --mini-batch-fit --workspace 18000 --maxi-batch 2000 --early-stopping 10 --valid-freq 1000000 --save-freq 2000000 --disp-freq 100 --keep-best --overwrite --valid-metrics cross-entropy translation --valid-sets /tmp/toms/sockeye-test/dev.bpe.en /tmp/toms/sockeye-test/dev.bpe.lv --valid-script-path /tmp/toms/sockeye-test/validate.sh --log /tmp/toms/sockeye-test/train.log --valid-log /tmp/toms/sockeye-test/valid.log --seed 347155 --exponential-smoothing --normalize 0.6 --beam-size 6 --quiet-translation --valid-translation-output /tmp/toms/sockeye-test/valid.output.txt --valid-mini-batch 16 --enc-depth 6 --dec-depth 6 --transformer-heads 8 --transformer-preprocess d --transformer-postprocess-emb d --transformer-postprocess dan --optimizer-delay 12 --learn-rate 0.0005 --lr-warmup 16000 --lr-decay-inv-sqrt 16000 --lr-report --clip-norm 5 --tied-embeddings-all --sync-sgd --transformer-dropout 0.1 --transformer-dropout-attention 0.1 --transformer-dropout-ffn 0.1 --optimizer adam --optimizer-params 0.9 0.98 1e-09 --sqlite /tmp/en-lv-W69bwc2f6meuT-combined.db -e 1 --fp16

To measure Marian's training time I used the timestamps of the log messages "Training started" and "Training finished", which worked out to around 36 min. This was with Marian version: v1.10.24; 4dd30b5 2021-09-08 14:02:21 +0100
I also tried Marian v1.11.0 f00d062 2022-02-08 08:39:24 -0800, but it was even slower: 43 min.
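
For reference, the same measurement can be scripted rather than read off the log by eye. This is a rough sketch under one assumption that should be checked against the attached logs: each Marian log line starts with a bracketed timestamp of the form `[YYYY-MM-DD HH:MM:SS]`. The function name `marian_training_minutes` and the sample lines are invented for illustration.

```python
import re
from datetime import datetime

# Assumed log prefix: "[2022-09-14 12:00:00] message"
TIMESTAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]")


def marian_training_minutes(log_lines):
    """Minutes between the first 'Training started' and last 'Training finished' line."""
    started = None
    finished = None
    for line in log_lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # skip lines without a leading timestamp
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if "Training started" in line and started is None:
            started = stamp
        elif "Training finished" in line:
            finished = stamp
    if started is None or finished is None:
        raise ValueError("could not find both start and finish messages")
    return (finished - started).total_seconds() / 60.0


if __name__ == "__main__":
    # Made-up sample lines, not taken from the attached logs.
    sample = [
        "[2022-09-14 12:00:00] Training started",
        "[2022-09-14 12:36:00] Training finished",
    ]
    print(f"{marian_training_minutes(sample):.1f} min")
```

To apply it to a real run, pass the lines of the file given to `--log`, for example `marian_training_minutes(open("/tmp/toms/sockeye-test/train.log"))`.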

I do realize that Marian's --workspace 18000 and Sockeye's --batch-size 18000 are not equivalent. However, running Sockeye with different --batch-size values did not affect the time it took to train for one epoch.

I also checked whether both frameworks saw the same number of sentences during their respective training runs. The numbers were about the same.
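
Since both runs processed roughly the same number of sentences, the wall-clock times translate directly into throughput. The short calculation below is only a back-of-the-envelope sketch using the figures from this report (about 4.7M examples; 21, 36 and 43 minutes); it assumes each run saw every training example exactly once, and the function names are made up.

```python
def sentences_per_second(num_sentences: int, minutes: float) -> float:
    """Average throughput over a run of the given wall-clock length."""
    return num_sentences / (minutes * 60.0)


def slowdown(slower_minutes: float, faster_minutes: float) -> float:
    """How many times longer the slower run took."""
    return slower_minutes / faster_minutes


if __name__ == "__main__":
    examples = 4_700_000  # approximate training set size from this report
    runs = {"Sockeye": 21, "Marian v1.10.24": 36, "Marian v1.11.0": 43}
    for name, minutes in runs.items():
        rate = sentences_per_second(examples, minutes)
        factor = slowdown(minutes, runs["Sockeye"])
        print(f"{name}: {rate:.0f} sent/s, {factor:.2f}x Sockeye's time")
```

By this measure Marian v1.10.24 took about 1.7 times as long as Sockeye, and v1.11.0 about 2 times as long.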

Context

Marian version: v1.10.24; 4dd30b5 2021-09-08 14:02:21 +0100
Marian version: v1.11.0 f00d062 2022-02-08 08:39:24 -0800
CMake command:
cmake ..
-- The CXX compiler identification is GNU 7.5.0
-- The C compiler identification is GNU 7.5.0
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Project name: marian
-- Project version: v1.11.0+f00d0621
Submodule 'examples' (https://github.com/marian-nmt/marian-examples) registered for path 'examples'
Submodule 'regression-tests' (https://github.com/marian-nmt/marian-regression-tests) registered for path 'regression-tests'
Submodule 'src/3rd_party/fbgemm' (https://github.com/marian-nmt/FBGEMM) registered for path 'src/3rd_party/fbgemm'
Submodule 'src/3rd_party/intgemm' (https://github.com/marian-nmt/intgemm/) registered for path 'src/3rd_party/intgemm'
Submodule 'src/3rd_party/nccl' (https://github.com/marian-nmt/nccl) registered for path 'src/3rd_party/nccl'
Submodule 'src/3rd_party/sentencepiece' (https://github.com/marian-nmt/sentencepiece) registered for path 'src/3rd_party/sentencepiece'
Submodule 'src/3rd_party/simple-websocket-server' (https://github.com/marian-nmt/Simple-WebSocket-Server) registered for path 'src/3rd_party/simple-websocket-server'
Cloning into '/tmp/toms/sockeye-test/marian/examples'...
Cloning into '/tmp/toms/sockeye-test/marian/regression-tests'...
Cloning into '/tmp/toms/sockeye-test/marian/src/3rd_party/fbgemm'...
Cloning into '/tmp/toms/sockeye-test/marian/src/3rd_party/intgemm'...
Cloning into '/tmp/toms/sockeye-test/marian/src/3rd_party/nccl'...
Cloning into '/tmp/toms/sockeye-test/marian/src/3rd_party/sentencepiece'...
Cloning into '/tmp/toms/sockeye-test/marian/src/3rd_party/simple-websocket-server'...
Submodule path 'examples': checked out '6d5921cc7de91f4e915b59e9c52c9a76c4e99b00'
Submodule path 'regression-tests': checked out '0716f4e012d1e3f7543bffa8aecc97ce9c903e17'
Submodule path 'src/3rd_party/fbgemm': checked out '6f45243cb8ab7d7ab921af18d313ae97144618b8'
Submodule 'third_party/asmjit' (https://github.com/asmjit/asmjit.git) registered for path 'src/3rd_party/fbgemm/third_party/asmjit'
Submodule 'third_party/cpuinfo' (https://github.com/pytorch/cpuinfo) registered for path 'src/3rd_party/fbgemm/third_party/cpuinfo'
Submodule 'third_party/googletest' (https://github.com/google/googletest) registered for path 'src/3rd_party/fbgemm/third_party/googletest'
Cloning into '/tmp/toms/sockeye-test/marian/src/3rd_party/fbgemm/third_party/asmjit'...
Cloning into '/tmp/toms/sockeye-test/marian/src/3rd_party/fbgemm/third_party/cpuinfo'...
Cloning into '/tmp/toms/sockeye-test/marian/src/3rd_party/fbgemm/third_party/googletest'...
Submodule path 'src/3rd_party/fbgemm/third_party/asmjit': checked out '4da474ac9aa2689e88d5e40a2f37628f302d7e3c'
Submodule path 'src/3rd_party/fbgemm/third_party/cpuinfo': checked out 'd5e37adf1406cf899d7d9ec1d317c47506ccb970'
Submodule path 'src/3rd_party/fbgemm/third_party/googletest': checked out '0fc5466dbb9e623029b1ada539717d10bd45e99e'
Submodule path 'src/3rd_party/intgemm': checked out '8abde25b13c3ab210c0dec8e23f4944e3953812d'
Submodule path 'src/3rd_party/nccl': checked out '5dcf7751494f9d04057bfc6b4a2b64611bc12253'
Submodule path 'src/3rd_party/sentencepiece': checked out 'c307b874deb5ea896db8f93506e173353e66d4d3'
Submodule path 'src/3rd_party/simple-websocket-server': checked out '1d7e84aeb3f1ebdc78f6965d79ad3ca3003789fe'
CMake Warning at CMakeLists.txt:79 (message):
CMAKE_BUILD_TYPE not set; setting to Release

-- Building with -march=native and intrinsics will be chosen automatically by the compiler to match the current machine.
-- Checking support for CPU intrinsics
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found CUDA: software/anaconda3/envs/sockeye3 (found suitable version "10.0", minimum required is "9.0")
-- Compiling code for Pascal GPUs
-- Compiling code for Volta GPUs
-- Compiling code for Turing GPUs
-- Found CUDA libraries: software/anaconda3/envs/sockeye3/lib64/libcurand.so; software/anaconda3/envs/sockeye3/lib64/libcusparse.so; software/anaconda3/envs/sockeye3/lib64/libcublas.so
-- Found Tcmalloc: /usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so
-- Found MKL: -Wl,--start-group;/opt/intel/mkl/lib/intel64/libmkl_intel_ilp64.a;/opt/intel/mkl/lib/intel64/libmkl_sequential.a;/opt/intel/mkl/lib/intel64/libmkl_core.a;-Wl,--end-group
CMake Warning at src/3rd_party/intgemm/CMakeLists.txt:33 (message):
Not building AVX512VNNI-based multiplication because your compiler is
too old.

For details rerun cmake with --debug-trycompile then try to build in
compile_tests/CMakeFiles/CMakeTmp.

-- VERSION: 0.1.94
-- Found TCMalloc: /usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so
-- Found Doxygen: /usr/bin/doxygen (found version "1.8.13") found components: doxygen dot
-- Configuring done
-- Generating done
-- Build files have been written to: /tmp/toms/sockeye-test/marian/build

Both frameworks use CUDA version 10, although there could be minor differences, as Sockeye 3 is installed via Conda and uses the CUDA installation from its Conda environment.
I ran both on two NVIDIA TITAN RTX GPUs.

Ubuntu 18.04.6

marian-v-1.10.train.log
marian-v-1.11.train.log
sockye_training.log
sockeye.args.yaml.txt
sockeye.data.config.txt

tomsbergmanis added the bug label Sep 14, 2022
