
[Discussion] Unified performance tests and dashboard #15757

Open
sandeep-krishnamurthy opened this issue Aug 5, 2019 · 22 comments

Comments

@sandeep-krishnamurthy
Contributor

sandeep-krishnamurthy commented Aug 5, 2019

Problem Statement

  1. Performance tests are not integrated with CI. We do not run any performance tests during PR validation or nightly tests, so we cannot catch performance regressions early; degradations end up being discovered during or after a release.
  2. Without performance tests in CI, we are unable to track performance improvements/degradations or to focus the community's attention on performance-improvement projects.
  3. With new projects such as NumPy, Large Tensor Support, MKLDNN 1.0 integration, MShadow deprecation, etc., tracking performance changes is critical. Having the tooling and CI integration in place will let us move faster and handle regressions swiftly.
  4. Current performance/benchmark tests are scattered, distributed, and maintained across different teams and repos:
    1. We have a few performance tests under benchmark/python.
    2. Recently added operator performance tests under opperf.
    3. MXNet contributors at AWS maintain a suite of performance tests in awslabs/deeplearning-benchmarks.
    4. MXNet contributors at Intel maintain a suite of performance tests. (repo - ??)
    5. MXNet contributors at NVIDIA maintain a suite of performance tests. (repo - ??)
  5. MXNet currently does not have a common dashboard for viewing performance benchmarks.

Proposal

  1. At a high level, we can divide all performance tests into 3 categories:
    1. Kernel-level tests - Ex: Conv MKLDNN/CuDNN kernels.
    2. Operator-level tests - Ex: the OpPerf utility we have in MXNet. These test the MXNet engine and the other critical paths involved in executing an operator (see the sketch after this list).
    3. End-to-end topology/model tests - Ex: ResNet50-v1 on ImageNet
      1. Training
      2. Inference
  2. We will unify all performance tests distributed across the MXNet repo and the repos maintained by contributors at AWS, NVIDIA, Intel, and others under one single umbrella of MXNet performance tests and benchmarks.
  3. We will integrate these performance tests with the MXNet CI system. We need to divide the tests between PR validation and nightly/weekly runs.
  4. We will have a unified dashboard with results from nightly builds so the community can see the status of MXNet at any given point in time.
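
To make the operator-level category concrete, here is a minimal sketch of driving a single operator benchmark from the existing opperf utility. It assumes the run_performance_test helper in benchmark/opperf/utils/benchmark_utils.py keeps roughly this keyword interface (the helper shows up in a traceback later in this thread); treat it as illustrative rather than a fixed recipe.

```python
# Minimal sketch of an operator-level benchmark via the opperf utility.
# Run from the MXNet source root so the `benchmark` package is importable.
import mxnet as mx
from benchmark.opperf.utils.benchmark_utils import run_performance_test

result = run_performance_test(
    mx.nd.add,                                   # operator under test
    run_backward=True,                           # include the backward pass
    dtype='float32',
    ctx=mx.cpu(),
    inputs=[{"lhs": (1024, 1024), "rhs": (1024, 1024)}],
    warmup=10,                                   # discarded warm-up runs
    runs=50)                                     # measured runs
print(result)
```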

This is a topic open for discussion. Please do comment with your suggestions/feedbacks.

CC: @apeforest @ChaiBapchya @access2rohit @samskalicky @PatricZhao @TaoLv @ptrendx @marcoabreu

@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Test

@marcoabreu
Contributor

We can't use the CI system for performance measurements since it does not provide a consistent environment, for various reasons (efficiency, maintainability, etc.). Thus, we need a separate system whose sole purpose is to be entirely consistent.

Also, I'm afraid that using tests to also measure performance could be misleading since tests might get extended or altered. I'd propose to have dedicated benchmarks instead.

@pengzhao-intel
Contributor

+1

It's a nice proposal that can save a lot of maintenance effort across the different organizations with a single, unified dashboard, and it also makes it very easy to track performance regressions.
Meanwhile, everyone can check and cite the latest performance numbers from the official repo.

Actually, there are lots of tasks ahead before achieving this goal. @juliusshufan can share some of our local experience first, and then we can go into the details of this proposal, including SW, HW, database, metrics, etc.

@sandeep-krishnamurthy
Contributor Author


Thanks @PatricZhao - This requires both hardware and software setup. Let us start small with whatever is available and expand incrementally. Looking forward to learning more from your experience.

@sandeep-krishnamurthy
Contributor Author

@ptrendx - Any inputs on the performance related tests / benchmarks / CI you maintain that can be upstreamed here?

@ptrendx
Member

ptrendx commented Aug 8, 2019

We can certainly push some of our benchmarks to that common repo, although I'm not sure how to handle the differences between our container version of MXNet and upstream.

As for the performance testing insights - having a dedicated machine is important (so probably a p3.16xlarge instance), as other tenants may skew the results, especially for cases that are more CPU- or IO-intensive.

@juliusshufan
Contributor

juliusshufan commented Aug 9, 2019

An update on the benchmarks and accuracy tests from the Intel side.

Currently, we track the performance, accuracy, and convergence of the MXNet GitHub repo nightly, covering different models and MXNet operators. Kernel-level performance is also measured with each MKL-DNN upgrade. Performance is measured on Xeon platforms, covering the "top-bin" and "mainstream" SKUs. The scripts involve internal code and also leverage the public MXNet examples.

The performance report is normally compared and presented as follows (a minimal sketch of the day-to-day check follows this list):

  • Day-to-day comparison: if the performance fluctuation exceeds a preset threshold (normally 10% at the model level; zero gap allowed for accuracy), a suspected regression is raised;
  • Long-term trend tracking: the most recent 30 days of performance are presented as a curve;
  • The most recent nightly performance data serves as the default baseline for the internal CI tests and as the comparison target.
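
As an illustration of the day-to-day comparison described above, the check can be expressed in a few lines. The 10% threshold and the zero accuracy gap are taken from the description; the function, dictionary layout, and numbers are hypothetical.

```python
# Hypothetical sketch of the day-to-day regression check described above:
# flag a suspected regression when throughput drops by more than a preset
# threshold (10% at model level) or accuracy changes at all.
def check_regression(today, yesterday, perf_threshold=0.10):
    """today/yesterday: dicts mapping model name -> {'throughput', 'accuracy'}."""
    suspects = []
    for model in today:
        perf_drop = (yesterday[model]['throughput'] - today[model]['throughput']) \
                    / yesterday[model]['throughput']
        acc_gap = abs(yesterday[model]['accuracy'] - today[model]['accuracy'])
        if perf_drop > perf_threshold or acc_gap > 0:
            suspects.append((model, perf_drop, acc_gap))
    return suspects

# Example usage with made-up numbers:
today = {'resnet50_v1': {'throughput': 88.0, 'accuracy': 0.761}}
yesterday = {'resnet50_v1': {'throughput': 100.0, 'accuracy': 0.761}}
print(check_regression(today, yesterday))  # -> [('resnet50_v1', 0.12, 0.0)]
```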

The detailed HW specs we use for performance tracking are in the table below; we run CentOS 7.5 on bare-metal machines with these specs.

| SKU | Socket | Physical Cores | HT | Turbo | RAM | RAM Slots | Memory Bandwidth |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SKX-8180 | 2 | 28 | On | On | DDR4 2666 | 2*6 | 255 GB/s |
| SKX-6148 | 2 | 20 | On | On | DDR4 2666 | 2*6 | 255 GB/s |
| CLX-8280 | 2 | 28 | On | On | DDR4 2933 | 2*6 | 281 GB/s |
| CLX-8260 | 2 | 24 | On | On | DDR4 2933 | 2*6 | 281 GB/s |
| CLX-6248 | 2 | 20 | On | On | DDR4 2666 | 2*6 | 255 GB/s |

To reflect real production scenarios in the SW configurations we use for performance tracking, the benchmark measurements are executed with different socket/core/instance configurations; a rough illustration of pinning such a configuration follows.
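
The sketch below is not Intel's actual scripts, just one common way to fix the core count for a CPU run so results stay comparable: set the OpenMP environment variables (OMP_NUM_THREADS is honored by MXNet; KMP_AFFINITY applies to the Intel OpenMP runtime used by MKL-DNN builds) before importing mxnet. The value 28 is only an example matching one CLX-8280 socket.

```python
# Rough sketch (not Intel's actual scripts): pin the benchmark to a fixed
# number of cores so results are comparable across runs. The environment
# variables must be set before mxnet is imported.
import os

os.environ['OMP_NUM_THREADS'] = '28'          # e.g. physical cores of one socket
os.environ['KMP_AFFINITY'] = 'granularity=fine,compact,1,0'

import mxnet as mx

a = mx.nd.random.uniform(shape=(1024, 1024))
b = mx.nd.random.uniform(shape=(1024, 1024))
c = mx.nd.dot(a, b)
c.wait_to_read()                              # force execution of the async op
```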

Other details we may discuss offline. Thanks.

@apeforest
Contributor

apeforest commented Aug 14, 2019

@juliusshufan Thanks for providing the benchmark setup. Recently we have been running operator-level runtime comparisons between int32 and int64 data types for tensor indexing, using the MXNet OpPerf profiler contributed by @sandeep-krishnamurthy et al. However, we noticed large variations when we calibrate the runtime with the built-in MXNet profiler, and those results do not correlate well with the runtimes we measured directly with Python's time module (a minimal sketch of the two measurement paths is below). @ChaiBapchya can provide more detailed performance results. We need a universal way to calibrate runtime in order to track the performance results. Any advice will be appreciated.
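
For reference, a minimal sketch of the two measurement paths being compared (Python's time module vs. the built-in MXNet profiler). This is not the OpPerf code itself, just the general idea; broadcast_add and the shapes are arbitrary examples.

```python
# Sketch: wall-clock timing around an imperative call vs. the built-in profiler.
import time
import mxnet as mx

x = mx.nd.random.uniform(shape=(10000, 100))
y = mx.nd.random.uniform(shape=(10000, 100))

# 1) Python time module: wait for async execution before stopping the clock.
mx.nd.waitall()
start = time.perf_counter()
z = mx.nd.broadcast_add(x, y)
z.wait_to_read()
elapsed_ms = (time.perf_counter() - start) * 1000.0
print("python time: %.3f ms" % elapsed_ms)

# 2) Built-in profiler: records per-operator timings and aggregate statistics.
mx.profiler.set_config(profile_all=True, aggregate_stats=True,
                       filename='profile.json')
mx.profiler.set_state('run')
z = mx.nd.broadcast_add(x, y)
mx.nd.waitall()
mx.profiler.set_state('stop')
print(mx.profiler.dumps())  # aggregate operator statistics
```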

@ChaiBapchya
Contributor

Here are the links for Large Tensor Operator benchmarks I ran.

Python's Time module -
https://docs.google.com/spreadsheets/d/1GpdNquQb71Is5B-li99JDuiLeEZd-eSjHIIowzGrwxc/edit?usp=sharing

MXNet Profiler (built-in CPP profiler) - https://docs.google.com/spreadsheets/d/1VkZoBFacZo8NGNcdFU5P9gFs3dm7D_ykOkPUzUD-Yu4/edit?usp=sharing

Tested on - p3.16xl instance

@pengzhao-intel
Contributor

Thanks, @apeforest @ChaiBapchya - we are testing the large tensor operators now and will come back with the results soon.

@apeforest
Contributor

@pengzhao-intel There was a mistake in the earlier results due to CPU sharing. Chai has re-run the profiling and collected the updated results here:

https://docs.google.com/spreadsheets/d/1GpdNquQb71Is5B-li99JDuiLeEZd-eSjHIIowzGrwxc/edit?usp=sharing

Please check the three sheets: Shape (1024, 1024), Shape (10000, 1), and Shape (10000, 100), corresponding to the three different input shapes. The runtime numbers are the 50th percentile (p50) out of 100 runs. There are comparisons between int64/int32 and int64+MKL/int32+MKL. Please feel free to ping @ChaiBapchya or me should you have any questions.

Thanks!

@marcoabreu
Contributor

marcoabreu commented Aug 22, 2019 via email

@apeforest
Contributor

apeforest commented Aug 22, 2019

@marcoabreu You are right. We should be more frugal :) @ChaiBapchya c5.18xl might be sufficient.

@marcoabreu
Contributor

marcoabreu commented Aug 22, 2019 via email

@ChaiBapchya
Contributor

It didn't occur to me that the instance choice mattered. Apologies for that. @marcoabreu Thanks for bringing it to our notice.

Having said that, I wanted some clarification:
"So the results don't really reflect the reality" - Why does running a CPU-only benchmark on p3.16xl not reflect reality? All 4 configs (int32, int32+mkl, int64, int64+mkl) were run on the same instance. Moreover, I was planning to run GPU benchmarks as well. In that sense, wouldn't it make sense to run all of this on an instance that provides both CPU and GPU support?

"apples stay apples and pears be pears :)" - meaning CPU benchmarks on c5.18xl and GPU benchmarks on p3.16xl?

Thanks

@wuxun-zhang
Contributor

We have just collected the performance numbers of some operators (like FullyConnected, softmax, etc.) with the MKL-DNN implementation. We also compared the results between MKL-DNN v0.20 and v1.0. Currently, one local CLX-8280 with 28 physical cores is used to run the benchmarks; later we may switch to an AWS EC2 C5 instance.
Because I don't have edit access to Chai's Google doc, I listed the results in another doc below (please check the sheet Large Tensor Test (MKL-DNN)):

https://docs.google.com/spreadsheets/d/10rhQEzDqnCjSKq27QlT04qNHegmAZjOoVqT_q287_ZU/edit?usp=sharing

@marcoabreu
Contributor

It doesn't reflect reality insofar as users would not run a CPU-only build on a p3.16xlarge but on a c5 instead.

Right, they were run on the same instance, but I'm not sure (Intel, please confirm) whether the CPUs in a c5 perform differently. In general I would doubt it and say that the relative results are still relevant, just not accurate.

I don't think it would make sense, to be honest. A user looks at throughput/$ (or latency, or whatever metric they optimize for). CPU instances are way cheaper but might underperform in a direct comparison. But if you normalize these results by cost, you get a picture that is much closer to how a real user will use MXNet. In the end, we're optimizing for real use cases, so we should make the benchmarks and the environment as close to reality as possible.

Correct, that's what I meant :)

I didn't check in detail, and sorry if my proposal introduces too much complexity, but what do you think about not just measuring the performance of a single sequential execution (think of a service), but instead measuring the performance a fully utilized system is capable of handling? Like high batch size with one process (throughput-optimized) vs. batch size one with many processes (latency-optimized). A rough sketch of the two modes follows.
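
A hypothetical sketch of the two measurement modes suggested here (the model, batch sizes, and iteration counts are made up for illustration): throughput-optimized with one process and a large batch vs. latency-optimized with batch size one. A fully utilized latency setup would additionally run the batch-1 loop in several parallel worker processes rather than a single one.

```python
# Hypothetical sketch: throughput-optimized (large batch, one process) vs.
# latency-optimized (batch size one) measurement on the same model.
import time
import mxnet as mx
from mxnet.gluon.model_zoo import vision

net = vision.resnet50_v1(pretrained=False)   # stand-in model for illustration
net.initialize(ctx=mx.cpu())
net.hybridize()

def run(batch_size, iterations=10):
    data = mx.nd.random.uniform(shape=(batch_size, 3, 224, 224))
    net(data).wait_to_read()                 # warm-up
    start = time.perf_counter()
    for _ in range(iterations):
        net(data).wait_to_read()
    seconds = time.perf_counter() - start
    return batch_size * iterations / seconds, seconds / iterations * 1000.0

tput, _ = run(batch_size=64)                 # throughput-optimized
_, latency_ms = run(batch_size=1)            # latency-optimized
print("throughput: %.1f img/s, batch-1 latency: %.2f ms" % (tput, latency_ms))
```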

@apeforest
Contributor

apeforest commented Aug 22, 2019

Hi @wuxun-zhang, thanks for running the tests and sharing the data. Are the performance numbers generated from your in-house profiling tool at Intel? We also noticed that using the average can sometimes be misleading because of glitches (one very large number), so we present the p50 number in the table instead; see the small illustration below.
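
A tiny illustration (made-up timing samples) of why a single glitch skews the average while the p50 stays representative:

```python
# Made-up runtimes with one glitch: the mean is dominated by it, the p50 is not.
import numpy as np

runtimes_ms = np.array([1.02, 0.98, 1.04, 0.96, 1.00, 25.0])  # one glitch
print("mean: %.2f ms" % runtimes_ms.mean())              # ~5.00 ms, misleading
print("p50:  %.2f ms" % np.percentile(runtimes_ms, 50))  # ~1.01 ms, representative
```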

@wuxun-zhang
Contributor

@apeforest I used Chai's large tensor benchmark scripts with the latest MXNet master, so I think the data is an average rather than a p50 number. I will update the data later using the p50 metric to ensure consistency with your data.

@ChaiBapchya
Contributor

@wuxun-zhang For p50, p90, and p99 numbers, I have this PR #15953.

Once that's merged, you will be able to get those numbers using Python's time module.

With the profiler flag, you can choose between Python and native.

@wuxun-zhang
Contributor

Hi @ChaiBapchya, are there any updates for this large tensor benchmark script? I tried running the script at this commit and got the error below. It looks like the error is caused by incomplete input arguments (missing num_hidden for FC). BTW, the script works well for all operators except FC on my side. Thanks for your help in advance.

(mxnet_p36) ubuntu@ip-172-31-18-141:~/github/incubator-mxnet/benchmark/opperf$ python opperf_large_tensor.py --ctx=cpu -p python
Large tensor support : OFF
INFO:root:Running Large tensor benchmarks with the following options: Namespace(ctx='cpu', dtype='float32', mkldnn_option='mkldnn', output_file='./mxnet_operator_benchmarks.json', output_format='json', profiler='python')
[{'data': (1024, 1024), 'weight': (1024, 1024)}, {'data': (10000, 1), 'weight': (10000, 1)}, {'data': (10000, 100), 'weight': (10000, 100)}]
Traceback (most recent call last):
  File "opperf_large_tensor.py", line 114, in <module>
    sys.exit(main())
  File "opperf_large_tensor.py", line 103, in main
    final_benchmark_results = run_large_test_benchmarks(args.profiler, ctx=ctx, dtype=dtype)
  File "opperf_large_tensor.py", line 46, in run_large_test_benchmarks
    mx_large_tensor_results = run_op_benchmarks(mx_large_tensor_ops, dtype, ctx, profiler, warmup=10, runs=100)
  File "/home/ubuntu/github/incubator-mxnet/benchmark/opperf/utils/benchmark_utils.py", line 157, in run_op_benchmarks
    warmup=warmup, runs=runs)
  File "/home/ubuntu/github/incubator-mxnet/benchmark/opperf/utils/benchmark_utils.py", line 137, in run_performance_test
    benchmark_result = _run_nd_operator_performance_test(op, inputs, run_backward, warmup, runs, args_list, kwargs_list, profiler)
  File "/home/ubuntu/github/incubator-mxnet/benchmark/opperf/utils/benchmark_utils.py", line 69, in _run_nd_operator_performance_test
    _, _ = benchmark_helper_func(op, warmup, [], **kwargs_list[0])
  File "/home/ubuntu/github/incubator-mxnet/benchmark/opperf/utils/profiler_utils.py", line 241, in python_profile_it
    res = func(*modified_args, **kwargs)
  File "/home/ubuntu/github/incubator-mxnet/benchmark/opperf/utils/ndarray_utils.py", line 48, in nd_forward_backward_and_profile
    res = op(**kwargs)
  File "<string>", line 86, in FullyConnected
  File "/home/ubuntu/github/incubator-mxnet/python/mxnet/_ctypes/ndarray.py", line 100, in _imperative_invoke
    ctypes.byref(out_stypes)))
  File "/home/ubuntu/github/incubator-mxnet/python/mxnet/base.py", line 254, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: Required parameter num_hidden of int is not presented, in operator FullyConnected(name="")

@ChaiBapchya
Contributor

ChaiBapchya commented Aug 28, 2019

Yes. (This error is probably caused by using the wrong file; opperf_large_tensor.py was previously used for testing on my branch. With the latest master, the opperf.py file is the one to use.)

A few pointers:

  1. Don't use the separate file opperf_large_tensor.py for testing large tensors; its functionality has been merged into the original opperf.py file.
  2. All the operators benchmarked so far in the opperf utility (on the master branch) can be profiled with native/python.
  3. The Python time module is included via a flag.
  4. More operators are being added to improve coverage.

For the current master branch, all you have to do to use the opperf utility is run python opperf.py with your desired flags, e.g. --ctx=cpu -p python. It will run all the supported ops without error.

Let me know if that helps.
