The profiling results for ResNet. #6179

qingqing01 · 2017-12-01T15:52:44Z

Config and Env

ResNet 50: Add benchmark configuration and scripts for convnets. #6056
Input: 3 * 224 * 224
batch_size = 32, 80 iterations
Env: nvidia-docker, TITAN X (Pascal), single machine.

The profiling results

By code: https://github.com/PaddlePaddle/Paddle/compare/develop...qingqing01:add_timer?expand=1
Compling mode: Release
CMake config: -DWITH_GPU=ON -DWITH_TIMER=ON -DWITH_MKL=ON -DCMAKE_BUILD_TYPE=Release

Total examples: 2560, total time: 47.83653 sec
53.51559 examples/sec, 1.49489 sec/batch

unit: ms
Stat=ExecutorRunTimer       total=37099.9    avg=227.606    max=526.269    min=0.092      count=163

Stat=conv2d_grad             total=18104.7    avg=4.269      max=15.852     min=2.222      count=4240
Stat=conv2d                  total=9482.79    avg=2.236      max=10.438     min=0.601      count=4240
    ~~Stat=Im2ColTimer         total=461.405    avg=0.007      max=2.732      min=0.004      count=58880~~
    ~~Stat=GemmTimer           total=2343.45    avg=0.017      max=5.965      min=0.007      count=135680~~
Stat=batch_norm_grad         total=2008.38    avg=0.473      max=5.74       min=0.086      count=4240
Stat=batch_norm              total=1679.04    avg=0.396      max=4.169      min=0.09       count=4240
Stat=sum                     total=1347.13    avg=0.935      max=3.724      min=0.013      count=1440
Stat=relu_grad               total=988.555    avg=0.252      max=5.071      min=0.044      count=3920
Stat=elementwise_add_grad    total=706.461    avg=0.519      max=6.632      min=0.044      count=1360
Stat=relu                    total=694.561    avg=0.177      max=2.699      min=0.031      count=3920
Stat=elementwise_add         total=560.761    avg=0.412      max=3.953      min=0.02       count=1360
Stat=momentum                total=516.091    avg=0.04       max=9.432      min=0.02       count=12880
Stat=pool2d_grad             total=326.743    avg=2.042      max=6.607      min=0.241      count=160
Stat=CreateOpTimer           total=237.364    avg=0.005      max=7.503      min=0.001      count=44433
Stat=DeleteLocalScopeTimer   total=108.412    avg=0.665      max=2.648      min=0.001      count=163
Stat=CreateLocalScopeTimer   total=76.612     avg=0.47       max=1.804      min=0.004      count=163
Stat=pool2d                  total=73.987     avg=0.462      max=0.675      min=0.315      count=160
Stat=fill_constant           total=19.168     avg=0.03       max=12.984     min=0.006      count=619
Stat=cast                    total=12.543     avg=0.039      max=0.219      min=0.016      count=320
Stat=gaussian_random         total=11.399     avg=0.215      max=1.26       min=0.015      count=53
Stat=mul                     total=9.389      avg=0.117      max=0.212      min=0.092      count=80
Stat=accuracy                total=8.361      avg=0.104      max=0.477      min=0.068      count=80
Stat=mul_grad                total=7.426      avg=0.092      max=0.179      min=0.075      count=80
Stat=feed                    total=6.262      avg=0.039      max=0.185      min=0.015      count=160
Stat=softmax                 total=5.667      avg=0.07       max=0.677      min=0.042      count=80
Stat=fetch                   total=5.084      avg=0.021      max=0.254      min=0.01       count=240
Stat=top_k                   total=3.373      avg=0.042      max=0.126      min=0.026      count=80
Stat=softmax_grad            total=3.074      avg=0.038      max=0.1        min=0.029      count=80
Stat=cross_entropy_grad      total=2.604      avg=0.032      max=0.091      min=0.023      count=80
Stat=cross_entropy           total=2.429      avg=0.03       max=0.075      min=0.021      count=80
Stat=elementwise_div         total=2.412      avg=0.03       max=0.096      min=0.02       count=80
Stat=mean                    total=1.997      avg=0.024      max=0.077      min=0.018      count=80
Stat=mean_grad               total=1.838      avg=0.022      max=0.058      min=0.016      count=80
Stat=uniform_random          total=0.044      avg=0.044      max=0.044      min=0.044      count=1
--------------------------------------------------

The operators needing to optimize

Conv2d/Conv2d_grad
- The total time of conv2d is 9482.79ms, ~~But the mainly computing time of im2col and gemm is 461.405 + 2343.45 =2804.855ms~~. (There is no stream synchronization between im2col and gemm, so the time for im2col and gemm is not accurate.)
relu/relu_grad
elementwise_add/elementwise_add_grad
momentum
sum

The time of Python accounts about 22% of total time.

(47.83653-37.0999)/47.83653 = 22.44 %

The text was updated successfully, but these errors were encountered:

dzhwinter · 2017-12-01T16:09:16Z

Definitely, there is a lot optimization job need to do with Python code.....
Maybe we need to add the test commit id. Not sure Whether this test result has included the improved which has been done.

qingqing01 · 2017-12-01T16:15:57Z

@dzhwinter Use the latest code in develop branch, the commit id : fb91938 , which includes the improvement of enforce.

qingqing01 · 2017-12-05T08:03:30Z

Some python profiling:

run_benchmark()

Timer unit: 1e-06 s

Total time: 58.5132 s
File: benchmark_convnet_model.py
Function: run_benchmark at line 118

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   118                                           @profile
   119                                           def run_benchmark(model, args):
   120         1            2      2.0      0.0      if args.use_cprof:
   121                                                   pr = cProfile.Profile()
   122                                                   pr.enable()
   123         1            4      4.0      0.0      start_time = time.time()
   124
   125         1            1      1.0      0.0      class_dim = 102
   126         1           18     18.0      0.0      dshape = [3, 224, 224] if args.order == 'NCHW' else [224, 224, 3]
   127         1          151    151.0      0.0      input = fluid.layers.data(name='data', shape=dshape, dtype='float32')
   128         1           68     68.0      0.0      label = fluid.layers.data(name='label', shape=[1], dtype='int64')
   129         1        80563  80563.0      0.1      predict = model(input, class_dim)
   130         1          123    123.0      0.0      cost = fluid.layers.cross_entropy(input=predict, label=label)
   131         1          170    170.0      0.0      avg_cost = fluid.layers.mean(x=cost)
   132         1           12     12.0      0.0      optimizer = fluid.optimizer.Momentum(learning_rate=0.01, momentum=0.9)
   133         1        84666  84666.0      0.1      opts = optimizer.minimize(avg_cost)
   134         1          902    902.0      0.0      accuracy = fluid.evaluator.Accuracy(input=predict, label=label)
   135
   136         1            2      2.0      0.0      train_reader = paddle.batch(
   137         1            1      1.0      0.0          paddle.reader.shuffle(
   138         1       753353 753353.0      1.3              paddle.dataset.flowers.train(), buf_size=5120),
   139         1            6      6.0      0.0          batch_size=args.batch_size)
   140
   141         1            1      1.0      0.0      if args.device == 'CPU':
   142                                                   place = fluid.CPUPlace()
   143                                               else:
   144         1           18     18.0      0.0          place = fluid.GPUPlace(0)
   145         1      1337688 1337688.0      2.3      exe = fluid.Executor(place)
   146         1      4506556 4506556.0      7.7      exe.run(fluid.default_startup_program())
   147
   148         1           21     21.0      0.0      iter = 0
   149         1            2      2.0      0.0      im_num = 0
   150         1            3      3.0      0.0      batch_num = 0
   151         2           65     32.5      0.0      for pass_id in range(args.pass_num):
   152         1         4443   4443.0      0.0          accuracy.reset(exe)
   153         1            6      6.0      0.0          if iter == args.iterations:
   154                                                       break
   155        81      2553690  31527.0      4.4          for data in train_reader():
   156        81          299      3.7      0.0              if iter == args.iterations:
   157         1        10266  10266.0      0.0                  break
   158        80      1607123  20089.0      2.7              image = np.array(map(lambda x: x[0].reshape(dshape), data)).astype(
   159        80       298943   3736.8      0.5                  'float32')
   160        80         4482     56.0      0.0              label = np.array(map(lambda x: x[1], data)).astype('int64')
   161        80          514      6.4      0.0              label = label.reshape([-1, 1])
   162        80          236      3.0      0.0              im_num += label.shape[0]
   163        80          127      1.6      0.0              batch_num += 1
   164        80          537      6.7      0.0              loss, acc = exe.run(fluid.default_main_program(),
   165        80          175      2.2      0.0                                  feed={'data': image,
   166        80          117      1.5      0.0                                        'label': label},
   167        80     47063610 588295.1     80.4                                  fetch_list=[avg_cost] + accuracy.metrics)
   168        80       128846   1610.6      0.2              pass_acc = accuracy.eval(exe)
   169        80          246      3.1      0.0              print("Iter: %d, loss: %s, acc: %s, pass_acc: %s" %
   170        80        74813    935.2      0.1                    (iter, str(loss), str(acc), str(pass_acc)))
   171        80          288      3.6      0.0              iter += 1
   172
   173         1           15     15.0      0.0      duration = time.time() - start_time
   174         1            2      2.0      0.0      examples_per_sec = im_num / duration
   175         1            1      1.0      0.0      sec_per_batch = duration / batch_num
   176
   177         1            1      1.0      0.0      print('\nTotal examples: %d, total time: %.5f' %
   178         1           16     16.0      0.0            (im_num, duration))
   179         1            2      2.0      0.0      print('%.5f examples/sec, %.5f sec/batch \n' %
   180         1            8      8.0      0.0            (examples_per_sec, sec_per_batch))
   181
   182         1            2      2.0      0.0      if args.use_cprof:
   183                                                   pr.disable()
   184                                                   s = StringIO.StringIO()
   185                                                   sortby = 'cumulative'
   186                                                   ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
   187                                                   ps.print_stats()
   188                                                   print(s.getvalue())

executor.run()

The program = program.clone() accounts 7.7% of executor.run().

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    93                                               @profile
    94                                               def run(self,
    95                                                       program=None,
    96                                                       feed=None,
    97                                                       fetch_list=None,
    98                                                       feed_var_name='feed',
    99                                                       fetch_var_name='fetch',
   100                                                       scope=None,
   101                                                       return_numpy=True):
   102       162          499      3.1      0.0          if feed is None:
   103        82          106      1.3      0.0              feed = {}
   104       162          183      1.1      0.0          if fetch_list is None:
   105         2            2      1.0      0.0              fetch_list = []
   106
   107       162          159      1.0      0.0          if program is None:
   108                                                       program = default_main_program()
   109
   110       162          331      2.0      0.0          if not isinstance(program, Program):
   111                                                       raise TypeError()
   112
   113       162          161      1.0      0.0          if scope is None:
   114       162          200      1.2      0.0              scope = g_scope
   115
   116       162      3887583  23997.4      7.7          program = program.clone()
   117       162          505      3.1      0.0          global_block = program.global_block()
   118       162          173      1.1      0.0          feed_var = global_block.create_var(
   119       162          146      0.9      0.0              name=feed_var_name,
   120       162          324      2.0      0.0              type=core.VarDesc.VarType.FEED_MINIBATCH,
   121       162         3513     21.7      0.0              persistable=True)
   122
   123       322          953      3.0      0.0          for i, name in enumerate(feed):
   124       160          935      5.8      0.0              out = global_block.var(name)
   125       160          176      1.1      0.0              global_block.prepend_op(
   126       160          136      0.8      0.0                  'feed',
   127       160          298      1.9      0.0                  inputs={'X': [feed_var]},
   128       160          191      1.2      0.0                  outputs={'Out': [out]},
   129       160        23937    149.6      0.0                  attrs={'col': i})
   130       160          343      2.1      0.0              cur_feed = feed[name]
   131       160          417      2.6      0.0              if not isinstance(cur_feed, core.LoDTensor):
   132       160       232442   1452.8      0.5                  cur_feed = self.aslodtensor(cur_feed)
   133       160         3658     22.9      0.0              core.set_feed_variable(scope, cur_feed, feed_var.name, i)
   134
   135       162          170      1.0      0.0          fetch_var = global_block.create_var(
   136       162          137      0.8      0.0              name=fetch_var_name,
   137       162          269      1.7      0.0              type=core.VarDesc.VarType.FETCH_LIST,
   138       162         3284     20.3      0.0              persistable=True)
   139       402          761      1.9      0.0          for i, var in enumerate(fetch_list):
   140       240          258      1.1      0.0              global_block.append_op(
   141       240          227      0.9      0.0                  type='fetch',
   142       240          299      1.2      0.0                  inputs={'X': [var]},
   143       240          253      1.1      0.0                  outputs={'Out': [fetch_var]},
   144       240        25780    107.4      0.1                  attrs={'col': i})
   145
   146       162     46221575 285318.4     91.7          self.executor.run(program.desc, scope, 0, True)
   147                                                   outs = [
   148       162          656      4.0      0.0              core.get_fetch_variable(scope, fetch_var_name, i)
   149       402         4128     10.3      0.0              for i in xrange(len(fetch_list))
   150                                                   ]
   151
   152       162          156      1.0      0.0          if return_numpy:
   153       162         8869     54.7      0.0              outs = as_numpy(outs)
   154       162          160      1.0      0.0          return outs

qingqing01 added the performance tuning label Dec 1, 2017

qingqing01 mentioned this issue Dec 5, 2017

Should we split Executor::Run into Executor::Prepare and Executor::exe #6285

Closed

tonyyang-svail mentioned this issue Dec 6, 2017

Performance Improvements suggestions on ConNets #6321

Closed

qingqing01 closed this as completed Apr 7, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The profiling results for ResNet. #6179

The profiling results for ResNet. #6179

qingqing01 commented Dec 1, 2017 •

edited

Loading

dzhwinter commented Dec 1, 2017 •

edited

Loading

qingqing01 commented Dec 1, 2017 •

edited

Loading

qingqing01 commented Dec 5, 2017

The profiling results for ResNet. #6179

The profiling results for ResNet. #6179

Comments

qingqing01 commented Dec 1, 2017 • edited Loading

Config and Env