This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

MXNet 2.x significantly slower than 1.x in Sockeye #20636

Closed
fhieber opened this issue Oct 5, 2021 · 2 comments

Comments

@fhieber
Contributor

fhieber commented Oct 5, 2021

Description

We observe a significant reduction in Sockeye inference speed with a recent build of MXNet 2.x (master branch). Compared to 1.x versions of MXNet, GPU translation with MXNet 2.x is ~2x slower (between roughly 1.5x and 2.8x in the runs below, depending on batch size and instance type).

For MXNet 2.x, we migrated Sockeye to the Gluon 2.0 interface and adopted the new NumPy namespaces. Otherwise, the code is equivalent to master, with the same level of hybridization (static_alloc=True) in both branches. The pull request/branch can be found here: awslabs/sockeye#953.

The runs below use half-precision and run on a p3.2xlarge. Outputs are equal.

p3.2xlarge instance

batch size 64

mxnet-cu112 2.0.0b20211001:

[INFO:__main__] Processed 3003 lines. Total time: 37.2888, sec/sent: 0.0124, sent/sec: 80.5336

mxnet-cu112 1.7:

[INFO:__main__] Processed 3003 lines. Total time: 20.2805, sec/sent: 0.0068, sent/sec: 148.0735

batch size 1

mxnet-cu112 2.0.0b20211001:

[INFO:__main__] Processed 3003 lines. Total time: 858.3818, sec/sent: 0.2858, sent/sec: 3.4984

mxnet-cu112 1.7:

[INFO:__main__] Processed 3003 lines. Total time: 302.0189, sec/sent: 0.1006, sent/sec: 9.9431

g4 instance

mx18/out.1.bpe.log:[2021-10-04:20:02:32:INFO:__main__:read_and_translate] Processed 3003 lines. Total time: 316.4692, sec/sent: 0.1054, sent/sec: 9.4891
mx18/out.64.bpe.log:[2021-10-04:20:03:10:INFO:__main__:read_and_translate] Processed 3003 lines. Total time: 31.8175, sec/sent: 0.0106, sent/sec: 94.3819
mx20/out.1.bpe.log:[2021-10-04:20:17:32:INFO:__main__:read_and_translate] Processed 3003 lines. Total time: 714.5509, sec/sent: 0.2379, sent/sec: 4.2026
mx20/out.64.bpe.log:[2021-10-04:20:18:26:INFO:__main__:read_and_translate] Processed 3003 lines. Total time: 46.4607, sec/sent: 0.0155, sent/sec: 64.6352
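
To make the size of the regression easier to read, the following minimal sketch extracts `Total time` from the log lines above and prints the 2.x/1.x ratio for each configuration. The helper names (`total_time`, `slowdown`) are hypothetical, and the sketch assumes that the `mx18` and `mx20` log prefixes in the g4 runs correspond to MXNet 1.8 and MXNet 2.x respectively.

```python
import re

# Matches the "Total time: <seconds>" field of a Sockeye translate log line.
TOTAL_TIME = re.compile(r"Total time: ([0-9.]+)")


def total_time(log_line: str) -> float:
    """Return the total translation time in seconds from one log line."""
    match = TOTAL_TIME.search(log_line)
    if match is None:
        raise ValueError(f"no 'Total time' field in: {log_line!r}")
    return float(match.group(1))


def slowdown(new_line: str, old_line: str) -> float:
    """Ratio of MXNet 2.x time to MXNet 1.x time (values > 1 mean 2.x is slower)."""
    return total_time(new_line) / total_time(old_line)


# (2.x line, 1.x line) pairs, shortened copies of the logs reported in this issue.
# Assumption: on the g4 instance, "mx20" is MXNet 2.x and "mx18" is MXNet 1.8.
RUNS = {
    "p3.2xlarge, batch 64": ("Total time: 37.2888, sec/sent: 0.0124",
                             "Total time: 20.2805, sec/sent: 0.0068"),
    "p3.2xlarge, batch 1": ("Total time: 858.3818, sec/sent: 0.2858",
                            "Total time: 302.0189, sec/sent: 0.1006"),
    "g4, batch 64": ("Total time: 46.4607, sec/sent: 0.0155",
                     "Total time: 31.8175, sec/sent: 0.0106"),
    "g4, batch 1": ("Total time: 714.5509, sec/sent: 0.2379",
                    "Total time: 316.4692, sec/sent: 0.1054"),
}

for name, (new_line, old_line) in RUNS.items():
    print(f"{name}: {slowdown(new_line, old_line):.2f}x slower with 2.x")
```

In these runs the gap is larger at batch size 1 than at batch size 64 on both instance types.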

To Reproduce

  • Download the Sockeye sample model
  • Run translate.sh with the master branch of Sockeye
  • Run translate.sh with the mx2 branch of Sockeye

Steps to reproduce


  1. wget https://github.com/awslabs/sockeye/releases/download/2.3.22/wmt14_en_de.tgz
  2. tar -xvf wmt14_en_de.tgz
  3. git clone https://github.com/awslabs/sockeye.git
  4. pip install -r sockeye/requirements/requirements.gpu-cu112.txt
  5. mv sockeye/sockeye wmt_14_en_de
  6. cd wmt_14_en_de
  7. bash translate.sh [translate with master branch]
  8. git checkout mx2
  9. (Install nightly build of mx2: pip uninstall mxnet-cu112 ; pip install --pre -f https://dist.mxnet.io/python 'mxnet-cu112')
  10. bash translate.sh [translate with mx2 branch]

What have you tried to solve it?

Environment

  • Cuda 11.2 (conda install -c conda-forge nccl cudnn cudatoolkit==11.2)
  • MXNet 1.8.post0 or MXNet 1.7 vs MXNet 2.x (2.0.0b20211001)
@TristonC
Contributor

TristonC commented Nov 5, 2021

@blchu has been working on this together with @barry-jin. We found significantly higher CPU overhead in 2.x than in 1.x. One op in particular, unravel, runs on the CPU instead of the GPU in 2.x because of the interface change. A fix is in progress. @szha FYI.
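
For readers unfamiliar with the op, here is a minimal sketch of what an unravel-index step computes in a beam-search top-k selection. It uses plain NumPy on the CPU purely to illustrate the semantics; it is not the MXNet operator or Sockeye's code, and the scores, `beam_size`, and variable names are hypothetical. It also assumes that "unravel" above refers to this unravel-index operation.

```python
import numpy as np

# Hypothetical scores for 2 hypotheses over a 5-word vocabulary (higher is better).
scores = np.array([
    [0.10, 0.70, 0.05, 0.10, 0.05],
    [0.20, 0.10, 0.60, 0.05, 0.05],
])
beam_size = 2

# Top-k is taken over the flattened (hypothesis * vocabulary) axis.
flat_scores = scores.ravel()
best_flat = np.argsort(-flat_scores)[:beam_size]

# Unravel maps each flat index back to (hypothesis index, word index).
hyp_idx, word_idx = np.unravel_index(best_flat, scores.shape)

print(hyp_idx.tolist(), word_idx.tolist())
```

Because this mapping typically runs once per decoding step, executing it on the CPU while the rest of the step runs on the GPU adds a host round trip at every step, which is consistent with the CPU overhead described above.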

@blchu
Contributor

blchu commented Jan 11, 2022

I've done some additional profiling and noticed that parts of the code are slowed down by functions that currently call asnumpy() and then use the equivalent NumPy function. Implementing these functions directly should improve performance considerably. In addition, the __getitem__ function is slower than the NumPy version; moving that code to the backend would improve array indexing performance.

I've attached an image of the profile visualization of the related part of the code (getting the best translations at the end of decoding).
[Image: mx2_profile_visual]

@fhieber closed this as not planned on Dec 27, 2022