
OpenBLAS threading affinity #7197

Closed
luotao1 opened this issue Jan 4, 2018 · 1 comment · Fixed by #7397
luotao1 commented Jan 4, 2018

The OpenBLAS inference benchmark in IntelOptimizedPaddle.md looks wrong: in all three networks, images/second decreases as BatchSize increases.

- VGG-19

  | BatchSize | 1 | 2 | 4 | 8 | 16 |
  |---|---|---|---|---|---|
  | OpenBLAS | 1.07 | 1.08 | 1.06 | 0.88 | 0.65 |

- ResNet-50

  | BatchSize | 1 | 2 | 4 | 8 | 16 |
  |---|---|---|---|---|---|
  | OpenBLAS | 3.35 | 3.19 | 3.09 | 2.55 | 1.96 |

- GoogLeNet

  | BatchSize | 1 | 2 | 4 | 8 | 16 |
  |---|---|---|---|---|---|
  | OpenBLAS | 12.04 | 11.31 | 10.00 | 9.07 | 4.34 |

Possible Reason

Why does images/second increase with BatchSize in the training benchmark, but not here?

`OPENBLAS_NUM_THREADS * trainer_count = core count`

The minimum BatchSize used in training is 64, which is larger than the core count (40 in the experiment). Thus we export OPENBLAS_NUM_THREADS=1 and trainer_count=40.

However, in inference the BatchSize is smaller than the core count. For example, when BatchSize=2, we export OPENBLAS_NUM_THREADS=20 and trainer_count=2, which may cause a thread-affinity conflict.
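As a sketch, the arithmetic behind these exports looks like this (CORES=40 as in the experiment; TRAINER_COUNT stands in for the number of parallel trainers, which in inference is bounded by BatchSize):

```shell
# With BatchSize=2 only 2 trainers run, so each trainer is given
# 40 / 2 = 20 OpenBLAS threads to fill the machine.
CORES=40
TRAINER_COUNT=2
export OPENBLAS_NUM_THREADS=$((CORES / TRAINER_COUNT))
echo "OPENBLAS_NUM_THREADS=$OPENBLAS_NUM_THREADS"   # prints 20
```

With 20 OpenBLAS threads per trainer and 2 trainers pinning threads independently, the affinity masks can collide, which is the suspected cause of the slowdown.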

How can I disable OpenBLAS threading affinity at runtime?

You can define the OPENBLAS_MAIN_FREE or GOTOBLAS_MAIN_FREE environment variable to disable threading affinity at runtime. For example, before running, export OPENBLAS_MAIN_FREE=1.
Alternatively, you can disable the affinity feature by enabling NO_AFFINITY=1 in Makefile.rule. https://github.com/xianyi/OpenBLAS/wiki/Faq#no_affinity
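Concretely, the runtime workaround amounts to the following (the benchmark command below is a placeholder, not the actual script name):

```shell
# Disable OpenBLAS thread affinity at runtime, before launching inference.
export OPENBLAS_MAIN_FREE=1
echo "OPENBLAS_MAIN_FREE=$OPENBLAS_MAIN_FREE"
# ./run_vgg_inference.sh --batch_size=4   # placeholder for the real benchmark
```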

Thus, I exported OPENBLAS_MAIN_FREE=1 and re-tested VGG inference; the results speed up:

| BatchSize | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| OpenBLAS | 1.07->1.08 | 1.08->1.99 | 1.06->3.64 | 0.88->3.57 | 0.65->2.27 |

@tensor-tang Can you help double check this result?

Solution

If OpenBLAS threading affinity affects the elapsed time, should we set it automatically in the program, as MKL does?

@tensor-tang Can you give some suggestion about this?
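One possible shape for such auto-setting, as a rough sketch only (a launcher script rather than in-process code; the variable names and defaulting scheme are illustrative, not an existing PaddlePaddle mechanism):

```shell
# Hypothetical launcher: derive the OpenBLAS settings automatically
# instead of asking users to export them by hand.
CORES=$(getconf _NPROCESSORS_ONLN)       # online core count
TRAINER_COUNT=${TRAINER_COUNT:-1}        # data-parallel degree, default 1
export OPENBLAS_NUM_THREADS=$((CORES / TRAINER_COUNT))
export OPENBLAS_MAIN_FREE=1              # avoid affinity conflicts between trainers
echo "threads=$OPENBLAS_NUM_THREADS main_free=$OPENBLAS_MAIN_FREE"
```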

tensor-tang commented Jan 5, 2018

Just FYI.

I tried VGG-19 after adding export OPENBLAS_MAIN_FREE=1:

| BatchSize | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| OpenBLAS | 1.36 | 2.72 | 3.67 | - | 2.64 |

It hung at BatchSize=8.
