This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-91] Added unittest for benchmarking metric performance #9705

Merged
merged 1 commit into apache:master on Mar 16, 2018

Conversation

safrooze
Contributor

safrooze commented Feb 6, 2018

Output of the benchmark is sent to stderr

Description

The benchmark loops over two batch sizes (100,000 and 1,000,000) and two output dimensions (100 and 500), generates random data on CPU and GPU, and calls metric.update() on a list of metrics with the generated data.
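In rough form, the benchmark does something like the following (a minimal sketch; the function name `benchmark_metrics`, the metric list, and the print format are illustrative assumptions, not the PR's actual code):

```python
import sys
import time
import mxnet as mx

def benchmark_metrics(metric_names=('acc', 'mae'), batch_sizes=(100000, 1000000),
                      output_dims=(100, 500)):
    """Sketch: time metric.update() on random data for each (metric, context, size) combination."""
    for batch_size in batch_sizes:
        for output_dim in output_dims:
            for ctx in (mx.cpu(), mx.gpu()):
                # Random predictions and labels generated directly on the target context
                preds = mx.nd.random.uniform(shape=(batch_size, output_dim), ctx=ctx)
                labels = mx.nd.random.uniform(0, output_dim, shape=(batch_size,), ctx=ctx).floor()
                mx.nd.waitall()  # data generation should not be part of the measurement
                for name in metric_names:
                    metric = mx.metric.create(name)
                    start = time.time()
                    metric.update([labels], [preds])
                    mx.nd.waitall()  # flush MXNet's async engine before reading the clock
                    print('%s\t%s\t%d\t%d\t%.5g' % (name, ctx, batch_size, output_dim,
                                                    time.time() - start), file=sys.stderr)
```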

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Code is well-documented:
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Added unit-test for benchmarking metric performance.

Comments

  • The unit test passes without a GPU, but fails if GPU memory allocation fails
  • The output looks like this on a p2.x instance:
mx.metric benchmarks
Metric         Ctx       Batch Size     Output Dim     Elapsed Time
----------------------------------------------------------------------
acc            cpu(0)    100000         100            0.069804
acc            gpu(0)    100000         100            0.0055592
----------------------------------------------------------------------
acc            cpu(0)    100000         500            0.29323
acc            gpu(0)    100000         500            0.034261
----------------------------------------------------------------------
acc            cpu(0)    1000000        100            0.66856
acc            gpu(0)    1000000        100            0.057442
----------------------------------------------------------------------
acc            cpu(0)    1000000        500            2.9239
acc            gpu(0)    1000000        500            0.27827
----------------------------------------------------------------------
top_k_acc      cpu(0)    100000         100            0.39707
top_k_acc      gpu(0)    100000         100            0.39684
----------------------------------------------------------------------
top_k_acc      cpu(0)    100000         500            2.6537
top_k_acc      gpu(0)    100000         500            2.6574
----------------------------------------------------------------------
top_k_acc      cpu(0)    1000000        100            4.0662
top_k_acc      gpu(0)    1000000        100            4.0537
----------------------------------------------------------------------
top_k_acc      cpu(0)    1000000        500            26.581
top_k_acc      gpu(0)    1000000        500            26.594
----------------------------------------------------------------------
F1             cpu(0)    100000         2              0.2515
F1             gpu(0)    100000         2              0.25105
----------------------------------------------------------------------
F1             cpu(0)    100000         2              0.25086
F1             gpu(0)    100000         2              0.24956
----------------------------------------------------------------------
F1             cpu(0)    1000000        2              2.509
F1             gpu(0)    1000000        2              2.5127
----------------------------------------------------------------------
F1             cpu(0)    1000000        2              2.5107
F1             gpu(0)    1000000        2              2.5094
----------------------------------------------------------------------
Perplexity     cpu(0)    100000         100            0.0058115
Perplexity     gpu(0)    100000         100            0.0030518
----------------------------------------------------------------------
Perplexity     cpu(0)    100000         500            0.0054376
Perplexity     gpu(0)    100000         500            0.0070541
----------------------------------------------------------------------
Perplexity     cpu(0)    1000000        100            0.042403
Perplexity     gpu(0)    1000000        100            0.003443
----------------------------------------------------------------------
Perplexity     cpu(0)    1000000        500            0.041232
Perplexity     gpu(0)    1000000        500            0.051778
----------------------------------------------------------------------
MAE            cpu(0)    100000         100            0.058175
MAE            gpu(0)    100000         100            0.056117
----------------------------------------------------------------------
MAE            cpu(0)    100000         500            0.26928
MAE            gpu(0)    100000         500            0.26553
----------------------------------------------------------------------
MAE            cpu(0)    1000000        100            0.53227
MAE            gpu(0)    1000000        100            0.52565
----------------------------------------------------------------------
MAE            cpu(0)    1000000        500            2.6206
MAE            gpu(0)    1000000        500            2.607
----------------------------------------------------------------------
MSE            cpu(0)    100000         100            0.041658
MSE            gpu(0)    100000         100            0.041626
----------------------------------------------------------------------
MSE            cpu(0)    100000         500            0.215
MSE            gpu(0)    100000         500            0.21492
----------------------------------------------------------------------
MSE            cpu(0)    1000000        100            0.43541
MSE            gpu(0)    1000000        100            0.42094
----------------------------------------------------------------------
MSE            cpu(0)    1000000        500            2.1183
MSE            gpu(0)    1000000        500            2.1229
----------------------------------------------------------------------
RMSE           cpu(0)    100000         100            0.042453
RMSE           gpu(0)    100000         100            0.041688
----------------------------------------------------------------------
RMSE           cpu(0)    100000         500            0.21422
RMSE           gpu(0)    100000         500            0.21395
----------------------------------------------------------------------
RMSE           cpu(0)    1000000        100            0.43216
RMSE           gpu(0)    1000000        100            0.42024
----------------------------------------------------------------------
RMSE           cpu(0)    1000000        500            2.1158
RMSE           gpu(0)    1000000        500            2.1298
----------------------------------------------------------------------
ce             cpu(0)    100000         100            0.017465
ce             gpu(0)    100000         100            0.016886
----------------------------------------------------------------------
ce             cpu(0)    100000         500            0.084103
ce             gpu(0)    100000         500            0.080693
----------------------------------------------------------------------
ce             cpu(0)    1000000        100            0.19837
ce             gpu(0)    1000000        100            0.1848
----------------------------------------------------------------------
ce             cpu(0)    1000000        500            0.81667
ce             gpu(0)    1000000        500            0.8098
----------------------------------------------------------------------
nll_loss       cpu(0)    100000         100            0.018017
nll_loss       gpu(0)    100000         100            0.016982
----------------------------------------------------------------------
nll_loss       cpu(0)    100000         500            0.083593
nll_loss       gpu(0)    100000         500            0.080484
----------------------------------------------------------------------
nll_loss       cpu(0)    1000000        100            0.19791
nll_loss       gpu(0)    1000000        100            0.1856
----------------------------------------------------------------------
nll_loss       cpu(0)    1000000        500            0.81095
nll_loss       gpu(0)    1000000        500            0.81938
----------------------------------------------------------------------
pearsonr       cpu(0)    100000         100            0.57283
pearsonr       gpu(0)    100000         100            0.22794
----------------------------------------------------------------------
pearsonr       cpu(0)    100000         500            2.2202
pearsonr       gpu(0)    100000         500            1.1238
----------------------------------------------------------------------
pearsonr       cpu(0)    1000000        100            4.4207
pearsonr       gpu(0)    1000000        100            2.2353
----------------------------------------------------------------------
pearsonr       cpu(0)    1000000        500            21.999
pearsonr       gpu(0)    1000000        500            11.147
----------------------------------------------------------------------

@safrooze
Contributor Author

safrooze commented Feb 6, 2018

@szha Please review.

@eric-haibin-lin
Member

Why not include a small batch size like 64? 100k is huge.

@safrooze
Contributor Author

safrooze commented Feb 6, 2018

The intention is to observe a measurable elapsed time (hence the large data size) and to amplify the difference between CPU and GPU processing (hence processing all the data in one batch). A valid alternative is to use a small batch size and iterate over multiple batches.

@ptrendx
Member

ptrendx commented Feb 6, 2018

@safrooze This is the wrong reasoning: the fact that the GPU is faster than the CPU when processing a million elements does not mean you should use the GPU when adding 2 numbers together. You should only test on batch sizes that are realistic (and do multiple runs to get a measurable time difference).

@marcoabreu
Contributor

Please make sure to use a fixed seed in order to provide reproducibility between different runs.
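In MXNet that typically means seeding both the NumPy and MXNet random generators, along these lines (a sketch; the seed value is arbitrary):

```python
import numpy as np
import mxnet as mx

# Seed both generators so repeated runs produce identical random inputs
np.random.seed(1234)
mx.random.seed(1234)
```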

@safrooze
Contributor Author

safrooze commented Feb 8, 2018

OK, I think I addressed all the feedback:

  • random is seeded
  • nd.waitall() is called before the timer starts and again before it stops
  • Added batch-size values of 16, 64, 256, and 1024
  • Data size varies with the number of output channels to keep the total runtime down to a few minutes (a rough sketch of the revised loop is below)
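A rough sketch of the revised timing loop (names such as `time_metric` and the example values are illustrative assumptions, not the PR's actual code):

```python
import time
import mxnet as mx
import numpy as np

np.random.seed(1234)   # fixed seeds so repeated runs see identical random inputs
mx.random.seed(1234)

def time_metric(metric_name, data_ctx, label_ctx, data_size, batch_size, output_dim):
    """Sketch: time metric.update() over data_size samples fed in batches of batch_size."""
    metric = mx.metric.create(metric_name)
    num_batches = data_size // batch_size
    preds = [mx.nd.random.uniform(shape=(batch_size, output_dim), ctx=data_ctx)
             for _ in range(num_batches)]
    labels = [mx.nd.random.uniform(0, output_dim, shape=(batch_size,), ctx=label_ctx).floor()
              for _ in range(num_batches)]
    mx.nd.waitall()                      # exclude data generation from the measurement
    start = time.time()
    for label, pred in zip(labels, preds):
        metric.update([label], [pred])
    mx.nd.waitall()                      # flush the async engine before stopping the clock
    return time.time() - start

# Example: the 'acc' rows from the table below; data_size would shrink for larger
# output dims so the whole benchmark finishes within a few minutes.
for batch_size in (16, 64, 256, 1024):
    elapsed = time_metric('acc', mx.cpu(), mx.cpu(),
                          data_size=131072, batch_size=batch_size, output_dim=128)
    print('acc cpu/cpu batch=%d: %.4g s' % (batch_size, elapsed))
```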

The modified code output looks like this:

Metric         Data-Ctx  Label-Ctx   Data Size   Batch Size     Output Dim     Elapsed Time
------------------------------------------------------------------------------------------
acc            cpu(0)    cpu(0)      131072      16             128            1.0015
acc            cpu(0)    gpu(0)      131072      16             128            1.682
acc            gpu(0)    cpu(0)      131072      16             128            2.6263
acc            gpu(0)    gpu(0)      131072      16             128            3.3028
------------------------------------------------------------------------------------------
acc            cpu(0)    cpu(0)      131072      64             128            0.42843
acc            cpu(0)    gpu(0)      131072      64             128            0.568
acc            gpu(0)    cpu(0)      131072      64             128            0.78586
acc            gpu(0)    gpu(0)      131072      64             128            0.94317
------------------------------------------------------------------------------------------
acc            cpu(0)    cpu(0)      131072      256            128            0.19074
acc            cpu(0)    gpu(0)      131072      256            128            0.24228
acc            gpu(0)    cpu(0)      131072      256            128            0.21548
acc            gpu(0)    gpu(0)      131072      256            128            0.25075
------------------------------------------------------------------------------------------
acc            cpu(0)    cpu(0)      131072      1024           128            0.1303
acc            cpu(0)    gpu(0)      131072      1024           128            0.14127
acc            gpu(0)    cpu(0)      131072      1024           128            0.055079
acc            gpu(0)    gpu(0)      131072      1024           128            0.065515
------------------------------------------------------------------------------------------

@eric-haibin-lin
Member

Wouldn't a nightly test be a better place for performance tests like this? This unit test doesn't verify correctness at all.

@safrooze
Contributor Author

safrooze commented Feb 9, 2018

@eric-haibin-lin You're correct that nightly would be a more suitable place. One concern with nightly was that the community wouldn't be able to see the results of the benchmark.

@marcoabreu
Contributor

marcoabreu commented Feb 9, 2018 via email

@szha
Member

szha commented Feb 20, 2018

When will nightly tests be moved to public CI?

@marcoabreu
Contributor

marcoabreu commented Feb 20, 2018 via email

@safrooze
Contributor Author

I'll move this to nightly tests then.

@CodingCat
Contributor

Hi, the community has passed the vote on associating code changes with JIRA (https://lists.apache.org/thread.html/ab22cf0e35f1bce2c3bf3bec2bc5b85a9583a3fe7fd56ba1bbade55f@%3Cdev.mxnet.apache.org%3E).

We have updated the guidelines for contributors at https://cwiki.apache.org/confluence/display/MXNET/Development+Process. Please ensure that you have created a JIRA issue at https://issues.apache.org/jira/projects/MXNET/issues/ describing your work in this pull request, and include the JIRA title in your PR as [MXNET-xxxx] Your title, where MXNET-xxxx is the JIRA id.

Thanks!

@szha
Member

szha commented Mar 12, 2018

Checking in on the public nightly build results, is it still on track?

@marcoabreu
Contributor

I don't think so, at least not from my side. We have been resource constrained, and managing the nightly CI does not fit into my schedule, especially since all jobs have to be refactored. I will ask Bhavin's team to do it and I will do the reviews, but I am not able to refactor that part myself.

On the other hand, we've got additional headcount approved for CI, but it will take some time until everybody is ramped up. We will have to see how and when we can continue.

@szha
Member

szha commented Mar 12, 2018

In that case, let's put the test in unittest for now. @safrooze could you resolve the conflicts?

@szha
Member

szha commented Mar 13, 2018

One last request: would you put the performance tests in a separate test file, such as test_metric_perf.py, so that it's easier to move to nightly later?

- Output of the benchmark is sent to stderr
- random is seeded
- nd.wait_all() used before starting timing and before ending timing
- Added batch-size values of 16, 64, 256, and 1024
- Datasize varies by number of output channels to keep total runtime down to a few minutes
@safrooze changed the title from "Added unittest for benchmarking metric performance" to "[MXNET-91] Added unittest for benchmarking metric performance" on Mar 13, 2018
@szha merged commit 4ad37d8 into apache:master on Mar 16, 2018
jinhuang415 pushed a commit to jinhuang415/incubator-mxnet that referenced this pull request Mar 30, 2018
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
@leezu
Contributor

leezu commented May 15, 2020

Why does this test only print numbers but not actually enforce anything?

@ChaiBapchya mentioned this pull request on May 15, 2020