fix record event for operator type in new dygraph #44582

Merged

Conversation

@rainyfly rainyfly (Contributor) commented Jul 25, 2022

PR types

Others

PR changes

Others

Describe

  1. In the new dygraph, the record events that capture Operator performance data are wrapped by many surrounding record events, and these outer events were all tagged as the Operator type. This makes the printed operator summary severely redundant, and the interference makes it impossible to identify the op with the real maximum cost. This is now fixed by tagging the outer record events shown below as the UserDefined type; a minimal sketch of the resulting classification follows the screenshot.

(screenshot: the outer record events that are now tagged as UserDefined)
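
To make the effect concrete, here is a minimal, illustrative Python sketch of the bucketing rule involved — a toy stand-in, not Paddle's internal C++ profiler code, with made-up event names and costs. Only events tagged as Operator should land in the operator summary, so outer wrapper events re-tagged as UserDefined no longer inflate it:

```python
# Illustrative sketch only -- a toy re-implementation of the summary's
# bucketing rule, not Paddle's actual profiler internals.
from collections import defaultdict
from enum import Enum

class TracerEventType(Enum):
    Operator = 0      # real op execution, e.g. "conv2d compute"
    UserDefined = 1   # outer wrapper records around the op

def operator_summary(events):
    """Aggregate cost per name, counting only Operator-typed events."""
    table = defaultdict(float)
    for name, etype, cost_ms in events:
        if etype is TracerEventType.Operator:
            table[name] += cost_ms
    return table

events = [
    ("conv2d compute", TracerEventType.Operator, 0.39),
    # Before this PR, outer records like the one below were (wrongly)
    # Operator-typed, so they duplicated the cost in the operator table.
    ("conv2d outer wrapper", TracerEventType.UserDefined, 0.41),
]
print(operator_summary(events))  # {'conv2d compute': 0.39}
```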

Running PaddleDetection's yolov3_mobilenet_v1_roadsign.yml task as a test produces the following operator summary:
----------------------------------------------------------------Operator Summary----------------------------------------------------------------
Time unit: ms
----------------------------------------------------  ------  ----------------------------------------  ----------------------------------------  
Name                                                  Calls   CPU Total / Avg / Max / Min / Ratio(%)    GPU Total / Avg / Max / Min / Ratio(%)    
----------------------------------------------------  ------  ----------------------------------------  ----------------------------------------  
-----------------------------------------------------------Thread: All threads merged-----------------------------------------------------------
Conv2dGradNodeFinal                                   296     195.39 / 0.66 / 1.17 / 0.18 / 13.89       622.99 / 2.10 / 4.79 / 0.24 / 23.94       
  MEMSET                                              344     - / - / - / - / -                         1.12 / 0.00 / 0.02 / 0.00 / 0.18          
  void wgrad_alg0_engine<float, 128, 5, 5, 3, 3, ...  32      - / - / - / - / -                         22.94 / 0.72 / 1.61 / 0.14 / 3.68         
  void cask_cudnn::computeOffsetsKernel<true, fal...  200     - / - / - / - / -                         0.74 / 0.00 / 0.01 / 0.00 / 0.12          
  cask_cudnn::computeBOffsetsKernel(cask_cudnn::C...  200     - / - / - / - / -                         0.73 / 0.00 / 0.00 / 0.00 / 0.12          
  maxwell_scudnn_128x64_stridedB_small_nn_v0          120     - / - / - / - / -                         79.47 / 0.66 / 1.32 / 0.09 / 12.76        
  void wgrad_alg0_engine<float, 128, 6, 7, 3, 3, ...  48      - / - / - / - / -                         56.80 / 1.18 / 3.46 / 0.17 / 9.12         
  void wgrad_alg0_engine<float, 128, 6, 8, 3, 3, ...  24      - / - / - / - / -                         32.42 / 1.35 / 2.46 / 0.50 / 5.20         
  cask_cudnn::computeWgradSplitKOffsetsKernel(cas...  120     - / - / - / - / -                         0.46 / 0.00 / 0.00 / 0.00 / 0.07          
  cask_cudnn::computeWgradBOffsetsKernel(cask_cud...  120     - / - / - / - / -                         0.46 / 0.00 / 0.00 / 0.00 / 0.07          
  maxwell_scudnn_128x128_stridedB_splitK_medium_n...  120     - / - / - / - / -                         102.69 / 0.86 / 1.27 / 0.29 / 16.48       
  void cudnn::ops::scalePackedTensor_kernel<float...  16      - / - / - / - / -                         1.08 / 0.07 / 0.07 / 0.07 / 0.17          
  void cudnn::detail::dgrad_engine<float, 512, 6,...  16      - / - / - / - / -                         6.51 / 0.41 / 0.55 / 0.26 / 1.05          
  maxwell_scudnn_128x128_stridedB_small_nn_v0         80      - / - / - / - / -                         49.89 / 0.62 / 0.79 / 0.40 / 8.01         
  void cudnn::winograd::generateWinogradTilesKern...  48      - / - / - / - / -                         6.80 / 0.14 / 0.23 / 0.06 / 1.09          
  maxwell_scudnn_winograd_128x128_ldg1_ldg4_relu_...  48      - / - / - / - / -                         87.96 / 1.83 / 1.97 / 1.72 / 14.12        
  void cudnn::winograd_nonfused::winogradWgradDat...  72      - / - / - / - / -                         15.53 / 0.22 / 0.36 / 0.09 / 2.49         
  void cudnn::winograd_nonfused::winogradWgradDel...  72      - / - / - / - / -                         31.56 / 0.44 / 0.75 / 0.19 / 5.07         
  maxwell_sgemm_32x128_nt                             48      - / - / - / - / -                         48.79 / 1.02 / 1.06 / 0.93 / 7.83         
  void cudnn::winograd_nonfused::winogradWgradOut...  72      - / - / - / - / -                         14.22 / 0.20 / 0.43 / 0.04 / 2.28         
  void axpy_kernel_val<float, float>(cublasAxpyPa...  16      - / - / - / - / -                         1.64 / 0.10 / 0.14 / 0.07 / 0.26          
  maxwell_sgemm_64x64_nt                              24      - / - / - / - / -                         19.12 / 0.80 / 0.81 / 0.79 / 3.07         
  void cudnn::winograd::generateWinogradTilesKern...  24      - / - / - / - / -                         0.41 / 0.02 / 0.02 / 0.02 / 0.07          
  maxwell_scudnn_winograd_128x128_ldg1_ldg4_relu_...  24      - / - / - / - / -                         41.65 / 1.74 / 1.76 / 1.72 / 6.69         
sync_batch_norm dygraph                               376     32.75 / 0.09 / 0.49 / 0.07 / 2.33         521.43 / 1.39 / 7.58 / 0.13 / 20.04       
  sync_batch_norm compute                             376     21.44 / 0.06 / 0.09 / 0.05 / 65.47        521.43 / 1.39 / 7.58 / 0.13 / 100.00      
    void phi::KeLocalStats<float, 256, (paddle::e...  376     - / - / - / - / -                         62.65 / 0.17 / 0.86 / 0.01 / 12.02        
    void phi::KeSyncAndMovingStats<float>(paddle:...  376     - / - / - / - / -                         2.15 / 0.01 / 0.01 / 0.00 / 0.41          
    void phi::KeNormAffine<float, (paddle::experi...  376     - / - / - / - / -                         456.63 / 1.21 / 6.71 / 0.11 / 87.57       
  sync_batch_norm node_creation                       376     4.58 / 0.01 / 0.02 / 0.01 / 13.98         0.00 / 0.00 / 0.00 / 0.00 / 0.00          
SyncBatchNormGradNodeFinal                            376     28.69 / 0.08 / 0.13 / 0.06 / 2.04         421.23 / 1.12 / 6.17 / 0.12 / 16.18       
  sync_batch_norm_grad compute                        376     15.70 / 0.04 / 0.09 / 0.03 / 54.73        421.23 / 1.12 / 6.17 / 0.12 / 100.00      
    void phi::KeBackwardLocalStats<float, 256, (p...  376     - / - / - / - / -                         128.51 / 0.34 / 1.83 / 0.04 / 30.51       
    void phi::KeBNBackwardScaleBias<float, 256, (...  376     - / - / - / - / -                         125.86 / 0.33 / 1.82 / 0.03 / 29.88       
    void phi::KeBNBackwardData<float, (paddle::ex...  376     - / - / - / - / -                         166.86 / 0.44 / 2.53 / 0.04 / 39.61       
conv2d dygraph                                        296     115.38 / 0.39 / 0.65 / 0.24 / 8.20        341.94 / 1.16 / 5.38 / 0.09 / 13.14       
  conv2d node_creation                                296     2.14 / 0.01 / 0.02 / 0.01 / 1.85          0.00 / 0.00 / 0.00 / 0.00 / 0.00          
  void cask_cudnn::computeOffsetsKernel<false, fa...  176     - / - / - / - / -                         0.63 / 0.00 / 0.01 / 0.00 / 0.18          
  maxwell_scudnn_128x32_relu_medium_nn_v1             8       - / - / - / - / -                         1.99 / 0.25 / 0.25 / 0.25 / 0.58          
  maxwell_sgemm_64x64_nn                              40      - / - / - / - / -                         19.94 / 0.50 / 1.20 / 0.13 / 5.83         
  maxwell_sgemm_128x32_nn                             8       - / - / - / - / -                         0.79 / 0.10 / 0.10 / 0.09 / 0.23          
  void cudnn::winograd::generateWinogradTilesKern...  48      - / - / - / - / -                         6.27 / 0.13 / 0.25 / 0.02 / 1.83          
  maxwell_scudnn_winograd_128x128_ldg1_ldg4_mobil...  48      - / - / - / - / -                         153.01 / 3.19 / 5.13 / 1.96 / 44.75       
  maxwell_scudnn_128x64_relu_interior_nn_v1           104     - / - / - / - / -                         61.09 / 0.59 / 1.25 / 0.12 / 17.86        
  void cudnn::winograd::generateWinogradTilesKern...  24      - / - / - / - / -                         1.55 / 0.06 / 0.07 / 0.06 / 0.45          
  maxwell_scudnn_winograd_128x128_ldg1_ldg4_relu_...  24      - / - / - / - / -                         48.56 / 2.02 / 2.07 / 1.99 / 14.20        
  maxwell_scudnn_128x64_relu_small_nn_v1              48      - / - / - / - / -                         38.06 / 0.79 / 0.89 / 0.48 / 11.13        
  maxwell_scudnn_128x128_relu_medium_nn_v1            8       - / - / - / - / -                         5.28 / 0.66 / 0.67 / 0.65 / 1.54          
  maxwell_scudnn_128x32_relu_small_nn_v1              8       - / - / - / - / -                         4.78 / 0.60 / 0.61 / 0.59 / 1.40          
DepthwiseConv2dGradNodeFinal                          104     6.58 / 0.06 / 0.09 / 0.06 / 0.47          239.45 / 2.30 / 4.10 / 1.16 / 9.20        
  depthwise_conv2d_grad compute                       104     4.80 / 0.05 / 0.05 / 0.04 / 72.87         232.81 / 2.24 / 4.10 / 1.16 / 97.23       
    void Eigen::internal::EigenMetaKernel<Eigen::...  208     - / - / - / - / -                         23.15 / 0.11 / 0.75 / 0.00 / 9.94         
    void paddle::operators::math::KernelDepthwise...  72      - / - / - / - / -                         38.00 / 0.53 / 1.16 / 0.23 / 16.32        
    void paddle::operators::math::KernelDepthwise...  72      - / - / - / - / -                         113.23 / 1.57 / 2.02 / 1.35 / 48.64       
    void paddle::operators::math::KernelDepthwise...  32      - / - / - / - / -                         27.12 / 0.85 / 1.85 / 0.28 / 11.65        
    void paddle::operators::math::KernelDepthwise...  32      - / - / - / - / -                         31.30 / 0.98 / 1.50 / 0.78 / 13.45        
  void axpy_kernel_val<float, float>(cublasAxpyPa...  16      - / - / - / - / -                         6.64 / 0.42 / 0.56 / 0.27 / 2.77          
ReluGradNodeFinal                                     216     6.60 / 0.03 / 0.05 / 0.02 / 0.47          115.71 / 0.54 / 2.31 / 0.07 / 4.45        
  relu_grad compute                                   216     3.21 / 0.01 / 0.03 / 0.01 / 48.61         115.71 / 0.54 / 2.31 / 0.07 / 100.00      
    void phi::funcs::VectorizedElementwiseKernel<...  216     - / - / - / - / -                         115.71 / 0.54 / 2.31 / 0.07 / 100.00      
relu dygraph                                          216     6.23 / 0.03 / 0.06 / 0.02 / 0.44          77.51 / 0.36 / 1.54 / 0.05 / 2.98         
  relu compute                                        216     3.98 / 0.02 / 0.04 / 0.02 / 63.86         77.51 / 0.36 / 1.54 / 0.05 / 100.00       
    void phi::funcs::VectorizedElementwiseKernel<...  216     - / - / - / - / -                         77.51 / 0.36 / 1.54 / 0.05 / 100.00       
  relu node_creation                                  216     0.68 / 0.00 / 0.01 / 0.00 / 10.88         0.00 / 0.00 / 0.00 / 0.00 / 0.00          
depthwise_conv2d dygraph                              104     4.17 / 0.04 / 0.06 / 0.04 / 0.30          55.37 / 0.53 / 1.16 / 0.18 / 2.13         
  depthwise_conv2d compute                            104     2.41 / 0.02 / 0.04 / 0.02 / 57.67         55.37 / 0.53 / 1.16 / 0.18 / 100.00       
    void paddle::operators::math::KernelDepthwise...  72      - / - / - / - / -                         37.94 / 0.53 / 1.16 / 0.23 / 68.52        
    void paddle::operators::math::KernelDepthwise...  32      - / - / - / - / -                         17.43 / 0.54 / 1.13 / 0.18 / 31.48        
  depthwise_conv2d node_creation                      104     0.63 / 0.01 / 0.01 / 0.00 / 15.03         0.00 / 0.00 / 0.00 / 0.00 / 0.00          
LeakyReluGradNodeFinal                                160     5.35 / 0.03 / 0.04 / 0.02 / 0.38          37.42 / 0.23 / 0.58 / 0.03 / 1.44         
  leaky_relu_grad compute                             160     2.60 / 0.02 / 0.03 / 0.01 / 48.53         37.42 / 0.23 / 0.58 / 0.03 / 100.00       
    void phi::funcs::VectorizedElementwiseKernel<...  160     - / - / - / - / -                         37.42 / 0.23 / 0.58 / 0.03 / 100.00       
slice dygraph                                         608     42.74 / 0.07 / 3.37 / 0.02 / 3.04         29.04 / 0.05 / 3.06 / 0.00 / 1.12         
  slice compute                                       600     10.80 / 0.02 / 0.03 / 0.01 / 25.26        4.60 / 0.01 / 0.04 / 0.00 / 15.84         
    void Eigen::internal::EigenMetaKernel<Eigen::...  96      - / - / - / - / -                         0.47 / 0.00 / 0.01 / 0.00 / 10.22         
    void Eigen::internal::EigenMetaKernel<Eigen::...  96      - / - / - / - / -                         0.26 / 0.00 / 0.00 / 0.00 / 5.69          
    void Eigen::internal::EigenMetaKernel<Eigen::...  408     - / - / - / - / -                         3.87 / 0.01 / 0.04 / 0.00 / 84.09         
  slice node_creation                                 200     1.04 / 0.01 / 0.02 / 0.00 / 2.44          0.00 / 0.00 / 0.00 / 0.00 / 0.00          
  GpuMemcpySync:CUDAPinned->GPU                       8       0.26 / 0.03 / 0.03 / 0.03 / 0.61          0.01 / 0.00 / 0.00 / 0.00 / 0.04          
    MEMCPY_HtoD                                       8       - / - / - / - / -                         0.01 / 0.00 / 0.00 / 0.00 / 100.00        
leaky_relu dygraph                                    160     4.57 / 0.03 / 0.04 / 0.03 / 0.32          24.93 / 0.16 / 0.39 / 0.02 / 0.96         
  leaky_relu compute                                  160     3.02 / 0.02 / 0.03 / 0.02 / 66.09         24.93 / 0.16 / 0.39 / 0.02 / 100.00       
    void phi::funcs::VectorizedElementwiseKernel<...  160     - / - / - / - / -                         24.93 / 0.16 / 0.39 / 0.02 / 100.00       
  leaky_relu node_creation                            160     0.50 / 0.00 / 0.00 / 0.00 / 10.90         0.00 / 0.00 / 0.00 / 0.00 / 0.00          
slice                                                 8       26.23 / 3.28 / 3.33 / 3.20 / 1.86         24.42 / 3.05 / 3.06 / 3.05 / 0.94         
  GpuMemcpySync:CUDAPinned->GPU                       8       24.78 / 3.10 / 3.12 / 3.08 / 94.46        24.40 / 3.05 / 3.06 / 3.04 / 99.88        
    MEMCPY_HtoD                                       8       - / - / - / - / -                         24.40 / 3.05 / 3.06 / 3.04 / 100.00       
  infer_shape                                         8       0.08 / 0.01 / 0.01 / 0.01 / 0.30          0.00 / 0.00 / 0.00 / 0.00 / 0.00          
  compute                                             8       0.57 / 0.07 / 0.10 / 0.05 / 2.17          0.03 / 0.00 / 0.00 / 0.00 / 0.12          
    void Eigen::internal::EigenMetaKernel<Eigen::...  8       - / - / - / - / -                         0.03 / 0.00 / 0.00 / 0.00 / 100.00        
  grad_node_creation                                  8       0.00 / 0.00 / 0.00 / 0.00 / 0.01          0.00 / 0.00 / 0.00 / 0.00 / 0.00          
subtract dygraph                                      216     7.00 / 0.03 / 0.06 / 0.02 / 0.50          11.58 / 0.05 / 0.68 / 0.00 / 0.44         
  subtract compute                                    216     4.80 / 0.02 / 0.04 / 0.02 / 68.46         11.58 / 0.05 / 0.68 / 0.00 / 100.00       
    void phi::funcs::VectorizedBroadcastKernel<fl...  216     - / - / - / - / -                         11.58 / 0.05 / 0.68 / 0.00 / 100.00       
  subtract node_creation                              168     0.97 / 0.01 / 0.01 / 0.00 / 13.80         0.00 / 0.00 / 0.00 / 0.00 / 0.00          
concat dygraph                                        64      3.34 / 0.05 / 0.11 / 0.03 / 0.24          8.86 / 0.14 / 0.65 / 0.01 / 0.34          
  concat compute                                      64      2.29 / 0.04 / 0.09 / 0.02 / 68.71         8.86 / 0.14 / 0.65 / 0.01 / 100.00        
    void phi::funcs::ConcatKernel_<float>(float c...  24      - / - / - / - / -                         0.20 / 0.01 / 0.01 / 0.01 / 2.20          
    void phi::funcs::ConcatKernel_<float>(float c...  24      - / - / - / - / -                         0.92 / 0.04 / 0.07 / 0.02 / 10.34         
    void phi::funcs::ConcatKernel_<float>(float c...  16      - / - / - / - / -                         7.71 / 0.48 / 0.65 / 0.32 / 87.07         
  concat node_creation                                40      0.28 / 0.01 / 0.01 / 0.01 / 8.33          0.00 / 0.00 / 0.00 / 0.00 / 0.00          
ConcatGradNodeFinal                                   16      1.43 / 0.09 / 0.10 / 0.08 / 0.10          7.65 / 0.48 / 0.64 / 0.31 / 0.29          
  concat_grad compute                                 16      0.99 / 0.06 / 0.07 / 0.06 / 69.06         7.65 / 0.48 / 0.64 / 0.31 / 100.00        
    void phi::funcs::SplitKernel_<float>(float co...  16      - / - / - / - / -                         7.62 / 0.48 / 0.64 / 0.31 / 99.56         
transpose dygraph                                     48      814.82 / 16.98 / 103.17 / 0.03 / 57.90    6.29 / 0.13 / 0.53 / 0.01 / 0.24          
  GpuMemcpySync:CUDAPinned->GPU                       24      812.45 / 33.85 / 103.12 / 0.15 / 99.71    5.04 / 0.21 / 0.48 / 0.03 / 80.10         
    MEMCPY_HtoD                                       24      - / - / - / - / -                         5.04 / 0.21 / 0.48 / 0.03 / 100.00        
  transpose compute                                   48      1.41 / 0.03 / 0.07 / 0.02 / 0.17          1.25 / 0.03 / 0.06 / 0.01 / 19.90         
    void paddle::operators::TilingSwapDim1And2<un...  16      - / - / - / - / -                         0.88 / 0.06 / 0.06 / 0.05 / 70.55         
    void paddle::operators::TilingSwapDim1And2<un...  16      - / - / - / - / -                         0.17 / 0.01 / 0.01 / 0.01 / 13.43         
    void paddle::operators::TilingSwapDim1And2<un...  16      - / - / - / - / -                         0.20 / 0.01 / 0.02 / 0.01 / 16.02         
  transpose node_creation                             24      0.07 / 0.00 / 0.01 / 0.00 / 0.01          0.00 / 0.00 / 0.00 / 0.00 / 0.00          
reduce_prod dygraph                                   72      2.84 / 0.04 / 0.06 / 0.03 / 0.20          5.04 / 0.07 / 0.45 / 0.00 / 0.19          
  prod_raw compute                                    72      2.09 / 0.03 / 0.05 / 0.02 / 73.66         5.04 / 0.07 / 0.45 / 0.00 / 100.00        
    void phi::funcs::ReduceAnyKernel<float, float...  72      - / - / - / - / -                         5.04 / 0.07 / 0.45 / 0.00 / 100.00        
  reduce_prod node_creation                           48      0.22 / 0.00 / 0.01 / 0.00 / 7.90          0.00 / 0.00 / 0.00 / 0.00 / 0.00          
SliceGradNodeFinal                                    144     4.73 / 0.03 / 0.05 / 0.02 / 0.34          4.96 / 0.03 / 0.09 / 0.00 / 0.19          
  slice_grad compute                                  144     1.95 / 0.01 / 0.02 / 0.01 / 41.14         1.92 / 0.01 / 0.04 / 0.00 / 38.73         
    void Eigen::internal::EigenMetaKernel<Eigen::...  144     - / - / - / - / -                         1.92 / 0.01 / 0.04 / 0.00 / 100.00        
  void axpy_kernel_val<float, float>(cublasAxpyPa...  120     - / - / - / - / -                         3.04 / 0.03 / 0.06 / 0.00 / 61.27         
clip dygraph                                          72      2.04 / 0.03 / 0.05 / 0.02 / 0.15          4.91 / 0.07 / 0.45 / 0.00 / 0.19          
  clip compute                                        72      1.46 / 0.02 / 0.04 / 0.02 / 71.51         4.91 / 0.07 / 0.45 / 0.00 / 100.00        
    void phi::funcs::VectorizedElementwiseKernel<...  72      - / - / - / - / -                         4.91 / 0.07 / 0.45 / 0.00 / 100.00        
  clip node_creation                                  48      0.13 / 0.00 / 0.00 / 0.00 / 6.57          0.00 / 0.00 / 0.00 / 0.00 / 0.00          
maximum dygraph                                       24      1.03 / 0.04 / 0.07 / 0.04 / 0.07          4.13 / 0.17 / 0.38 / 0.04 / 0.16          
  maximum compute                                     24      0.69 / 0.03 / 0.05 / 0.02 / 66.89         4.13 / 0.17 / 0.38 / 0.04 / 100.00        
    void phi::funcs::VectorizedBroadcastKernel<fl...  24      - / - / - / - / -                         4.13 / 0.17 / 0.38 / 0.04 / 100.00        
  maximum node_creation                               24      0.14 / 0.01 / 0.01 / 0.00 / 13.62         0.00 / 0.00 / 0.00 / 0.00 / 0.00          
minimum dygraph                                       24      0.91 / 0.04 / 0.04 / 0.03 / 0.06          4.11 / 0.17 / 0.37 / 0.04 / 0.16          
  minimum compute                                     24      0.62 / 0.03 / 0.03 / 0.02 / 68.28         4.11 / 0.17 / 0.37 / 0.04 / 100.00        
    void phi::funcs::VectorizedBroadcastKernel<fl...  24      - / - / - / - / -                         4.11 / 0.17 / 0.37 / 0.04 / 100.00        
  minimum node_creation                               24      0.10 / 0.00 / 0.00 / 0.00 / 10.66         0.00 / 0.00 / 0.00 / 0.00 / 0.00          
add dygraph                                           352     10.91 / 0.03 / 0.05 / 0.02 / 0.78         3.74 / 0.01 / 0.16 / 0.00 / 0.14          
  add compute                                         352     7.22 / 0.02 / 0.04 / 0.02 / 66.16         3.74 / 0.01 / 0.16 / 0.00 / 100.00        
    void phi::funcs::VectorizedBroadcastKernel<fl...  352     - / - / - / - / -                         3.74 / 0.01 / 0.16 / 0.00 / 100.00        
  add node_creation                                   304     1.71 / 0.01 / 0.02 / 0.00 / 15.72         0.00 / 0.00 / 0.00 / 0.00 / 0.00          
scale dygraph                                         440     10.26 / 0.02 / 0.06 / 0.02 / 0.73         3.73 / 0.01 / 0.23 / 0.00 / 0.14          
  scale compute                                       440     7.25 / 0.02 / 0.05 / 0.01 / 70.66         3.73 / 0.01 / 0.23 / 0.00 / 100.00        
    void phi::funcs::VectorizedElementwiseKernel<...  440     - / - / - / - / -                         3.73 / 0.01 / 0.23 / 0.00 / 100.00        
  scale node_creation                                 320     0.71 / 0.00 / 0.02 / 0.00 / 6.91          0.00 / 0.00 / 0.00 / 0.00 / 0.00          
divide dygraph                                        24      0.81 / 0.03 / 0.04 / 0.03 / 0.06          3.68 / 0.15 / 0.35 / 0.02 / 0.14          
  divide compute                                      24      0.52 / 0.02 / 0.03 / 0.02 / 64.04         3.68 / 0.15 / 0.35 / 0.02 / 100.00        
    void phi::funcs::VectorizedBroadcastKernel<fl...  24      - / - / - / - / -                         3.68 / 0.15 / 0.35 / 0.02 / 100.00        
  divide node_creation                                24      0.15 / 0.01 / 0.02 / 0.00 / 18.50         0.00 / 0.00 / 0.00 / 0.00 / 0.00          
nearest_interp_v2GradNodeCompat                       16      1.79 / 0.11 / 0.16 / 0.09 / 0.13          3.29 / 0.21 / 0.28 / 0.14 / 0.13          
nearest_interp_v2_grad                                16      1.35 / 0.08 / 0.12 / 0.06 / 0.10          3.29 / 0.21 / 0.28 / 0.14 / 0.13          
  infer_shape                                         16      0.04 / 0.00 / 0.00 / 0.00 / 3.31          0.00 / 0.00 / 0.00 / 0.00 / 0.00          
  compute                                             16      0.77 / 0.05 / 0.08 / 0.04 / 56.86         3.29 / 0.21 / 0.28 / 0.14 / 100.00        
    void Eigen::internal::EigenMetaKernel<Eigen::...  16      - / - / - / - / -                         0.31 / 0.02 / 0.03 / 0.01 / 9.45          
    void phi::KeNearestNeighborInterpNCHWBw<float...  16      - / - / - / - / -                         2.98 / 0.19 / 0.25 / 0.12 / 90.55         
  grad_node_creation                                  16      0.00 / 0.00 / 0.00 / 0.00 / 0.31          0.00 / 0.00 / 0.00 / 0.00 / 0.00          
max dygraph                                           24      0.90 / 0.04 / 0.05 / 0.03 / 0.06          2.02 / 0.08 / 0.19 / 0.02 / 0.08          
  max compute                                         24      0.73 / 0.03 / 0.04 / 0.03 / 81.28         2.02 / 0.08 / 0.19 / 0.02 / 100.00        
    void phi::funcs::ReduceAnyKernel<float, float...  24      - / - / - / - / -                         2.02 / 0.08 / 0.19 / 0.02 / 100.00        
nearest_interp_v2 dygraph                             16      1.90 / 0.12 / 0.16 / 0.09 / 0.13          1.57 / 0.10 / 0.13 / 0.07 / 0.06          
  nearest_interp_v2 node_creation                     16      0.09 / 0.01 / 0.01 / 0.01 / 4.98          0.00 / 0.00 / 0.00 / 0.00 / 0.00          
nearest_interp_v2                                     16      1.48 / 0.09 / 0.13 / 0.07 / 0.11          1.57 / 0.10 / 0.13 / 0.07 / 0.06          
  infer_shape                                         16      0.25 / 0.02 / 0.03 / 0.01 / 17.17         0.00 / 0.00 / 0.00 / 0.00 / 0.00          
  compute                                             16      0.59 / 0.04 / 0.06 / 0.03 / 39.82         1.57 / 0.10 / 0.13 / 0.07 / 100.00        
    void phi::KeNearestNeighborInterpNCHWFw<float...  16      - / - / - / - / -                         1.57 / 0.10 / 0.13 / 0.07 / 100.00        
  grad_node_creation                                  16      0.00 / 0.00 / 0.00 / 0.00 / 0.27          0.00 / 0.00 / 0.00 / 0.00 / 0.00          
multiply dygraph                                      216     6.71 / 0.03 / 0.06 / 0.02 / 0.48          1.30 / 0.01 / 0.03 / 0.00 / 0.05          
  multiply compute                                    216     4.59 / 0.02 / 0.05 / 0.02 / 68.41         1.30 / 0.01 / 0.03 / 0.00 / 100.00        
    void phi::funcs::VectorizedBroadcastKernel<fl...  216     - / - / - / - / -                         1.30 / 0.01 / 0.03 / 0.00 / 100.00        
  multiply node_creation                              192     0.82 / 0.00 / 0.01 / 0.00 / 12.15         0.00 / 0.00 / 0.00 / 0.00 / 0.00          
MultiplyGradNodeFinal                                 144     5.16 / 0.04 / 0.45 / 0.02 / 0.37          1.12 / 0.01 / 0.03 / 0.00 / 0.04          
  multiply_grad compute                               144     2.40 / 0.02 / 0.04 / 0.01 / 46.61         1.01 / 0.01 / 0.03 / 0.00 / 90.33         
    void phi::funcs::VectorizedBroadcastKernel<fl...  144     - / - / - / - / -                         1.01 / 0.01 / 0.03 / 0.00 / 100.00        
  void axpy_kernel_val<float, float>(cublasAxpyPa...  24      - / - / - / - / -                         0.11 / 0.00 / 0.01 / 0.00 / 9.67          
AddGradNodeFinal                                      184     6.49 / 0.04 / 0.07 / 0.02 / 0.46          1.03 / 0.01 / 0.04 / 0.00 / 0.04          
  add_grad compute                                    184     4.40 / 0.02 / 0.05 / 0.02 / 67.84         1.03 / 0.01 / 0.04 / 0.00 / 100.00        
    void phi::funcs::ReduceAnyKernel<float, float...  24      - / - / - / - / -                         0.42 / 0.02 / 0.04 / 0.01 / 40.56         
    void phi::funcs::ReduceHigherDimKernel<float,...  24      - / - / - / - / -                         0.11 / 0.00 / 0.01 / 0.00 / 10.89         
SigmoidCrossEntropyWithLogitsGradNodeFinal            48      1.36 / 0.03 / 0.04 / 0.02 / 0.10          0.85 / 0.02 / 0.05 / 0.00 / 0.03          
  sigmoid_cross_entropy_with_logits_grad compute      48      0.83 / 0.02 / 0.03 / 0.01 / 61.00         0.85 / 0.02 / 0.05 / 0.00 / 100.00        
    void phi::funcs::VectorizedElementwiseKernel<...  48      - / - / - / - / -                         0.85 / 0.02 / 0.05 / 0.00 / 100.00        
sum dygraph                                           96      5.20 / 0.05 / 0.07 / 0.04 / 0.37          0.84 / 0.01 / 0.02 / 0.00 / 0.03          
  sum compute                                         96      4.04 / 0.04 / 0.05 / 0.03 / 77.72         0.84 / 0.01 / 0.02 / 0.00 / 100.00        
    void phi::funcs::ReduceAnyKernel<float, float...  96      - / - / - / - / -                         0.48 / 0.00 / 0.01 / 0.00 / 56.94         
    void phi::funcs::ReduceHigherDimKernel<float,...  72      - / - / - / - / -                         0.36 / 0.01 / 0.01 / 0.00 / 43.06         
  sum node_creation                                   96      0.35 / 0.00 / 0.01 / 0.00 / 6.82          0.00 / 0.00 / 0.00 / 0.00 / 0.00          
sigmoid_cross_entropy_with_logits dygraph             48      1.62 / 0.03 / 0.06 / 0.03 / 0.12          0.70 / 0.01 / 0.04 / 0.01 / 0.03          
  sigmoid_cross_entropy_with_logits compute           48      1.07 / 0.02 / 0.05 / 0.02 / 66.24         0.70 / 0.01 / 0.04 / 0.01 / 100.00        
    void phi::funcs::VectorizedElementwiseKernel<...  48      - / - / - / - / -                         0.70 / 0.01 / 0.04 / 0.01 / 100.00        
  sigmoid_cross_entropy_with_logits node_creation     48      0.25 / 0.01 / 0.01 / 0.00 / 15.38         0.00 / 0.00 / 0.00 / 0.00 / 0.00          
TransposeGradNodeFinal                                24      0.61 / 0.03 / 0.04 / 0.02 / 0.04          0.54 / 0.02 / 0.05 / 0.01 / 0.02          
  transpose_grad compute                              24      0.39 / 0.02 / 0.02 / 0.01 / 64.36         0.54 / 0.02 / 0.05 / 0.01 / 100.00        
    void paddle::operators::TilingSwapDim1And2<un...  16      - / - / - / - / -                         0.18 / 0.01 / 0.01 / 0.01 / 33.21         
    void paddle::operators::TilingSwapDim1And2<un...  8       - / - / - / - / -                         0.36 / 0.05 / 0.05 / 0.04 / 66.79         
cast dygraph                                          144     3.35 / 0.02 / 0.04 / 0.02 / 0.24          0.50 / 0.00 / 0.01 / 0.00 / 0.02          
  cast compute                                        144     2.60 / 0.02 / 0.03 / 0.01 / 77.64         0.50 / 0.00 / 0.01 / 0.00 / 100.00        
    void phi::funcs::VectorizedElementwiseKernel<...  96      - / - / - / - / -                         0.37 / 0.00 / 0.01 / 0.00 / 73.94         
    void phi::funcs::VectorizedElementwiseKernel<...  48      - / - / - / - / -                         0.13 / 0.00 / 0.00 / 0.00 / 26.06         
SumGradNodeFinal                                      96      2.57 / 0.03 / 0.07 / 0.02 / 0.18          0.45 / 0.00 / 0.02 / 0.00 / 0.02          
  sum_grad compute                                    96      1.67 / 0.02 / 0.06 / 0.01 / 65.21         0.45 / 0.00 / 0.02 / 0.00 / 100.00        
    void phi::funcs::VectorizedBroadcastKernel<fl...  96      - / - / - / - / -                         0.45 / 0.00 / 0.02 / 0.00 / 100.00        
ScaleGradNodeFinal                                    104     2.14 / 0.02 / 0.03 / 0.02 / 0.15          0.42 / 0.00 / 0.01 / 0.00 / 0.02          
  scale compute                                       104     1.23 / 0.01 / 0.02 / 0.01 / 57.42         0.42 / 0.00 / 0.01 / 0.00 / 100.00        
    void phi::funcs::VectorizedElementwiseKernel<...  104     - / - / - / - / -                         0.42 / 0.00 / 0.01 / 0.00 / 100.00        
BceLossGradNodeFinal                                  48      1.21 / 0.03 / 0.05 / 0.02 / 0.09          0.39 / 0.01 / 0.02 / 0.00 / 0.01          
  bce_loss_grad compute                               48      0.58 / 0.01 / 0.04 / 0.01 / 48.33         0.39 / 0.01 / 0.02 / 0.00 / 100.00        
    void phi::funcs::VectorizedElementwiseKernel<...  48      - / - / - / - / -                         0.39 / 0.01 / 0.02 / 0.00 / 100.00        
meshgrid dygraph                                      24      2.65 / 0.11 / 0.15 / 0.10 / 0.19          0.34 / 0.01 / 0.01 / 0.01 / 0.01          
  meshgrid compute                                    24      2.29 / 0.10 / 0.14 / 0.08 / 86.57         0.34 / 0.01 / 0.01 / 0.01 / 100.00        
    void Eigen::internal::EigenMetaKernel<Eigen::...  48      - / - / - / - / -                         0.22 / 0.00 / 0.01 / 0.00 / 65.70         
mean dygraph                                          96      5.32 / 0.06 / 0.60 / 0.04 / 0.38          0.33 / 0.00 / 0.00 / 0.00 / 0.01          
  mean compute                                        96      4.27 / 0.04 / 0.59 / 0.04 / 80.21         0.33 / 0.00 / 0.00 / 0.00 / 100.00        
    void cub::DeviceReduceSingleTileKernel<cub::D...  96      - / - / - / - / -                         0.33 / 0.00 / 0.00 / 0.00 / 100.00        
  mean node_creation                                  96      0.42 / 0.00 / 0.01 / 0.00 / 7.98          0.00 / 0.00 / 0.00 / 0.00 / 0.00          
bce_loss dygraph                                      48      1.16 / 0.02 / 0.05 / 0.02 / 0.08          0.31 / 0.01 / 0.01 / 0.00 / 0.01          
  bce_loss compute                                    48      0.77 / 0.02 / 0.04 / 0.01 / 65.98         0.31 / 0.01 / 0.01 / 0.00 / 100.00        
    void phi::funcs::VectorizedElementwiseKernel<...  48      - / - / - / - / -                         0.31 / 0.01 / 0.01 / 0.00 / 100.00        
  bce_loss node_creation                              48      0.14 / 0.00 / 0.01 / 0.00 / 12.24         0.00 / 0.00 / 0.00 / 0.00 / 0.00          
AbsGradNodeFinal                                      48      1.21 / 0.03 / 0.05 / 0.02 / 0.09          0.30 / 0.01 / 0.01 / 0.00 / 0.01          
  abs_grad compute                                    48      0.66 / 0.01 / 0.04 / 0.01 / 55.01         0.30 / 0.01 / 0.01 / 0.00 / 100.00        
    void phi::funcs::VectorizedElementwiseKernel<...  48      - / - / - / - / -                         0.30 / 0.01 / 0.01 / 0.00 / 100.00        
MeanGradNodeFinal                                     96      2.58 / 0.03 / 0.06 / 0.02 / 0.18          0.27 / 0.00 / 0.00 / 0.00 / 0.01          
  mean_grad compute                                   96      1.67 / 0.02 / 0.05 / 0.01 / 64.80         0.27 / 0.00 / 0.00 / 0.00 / 100.00        
    void phi::funcs::VectorizedBroadcastKernel<fl...  96      - / - / - / - / -                         0.27 / 0.00 / 0.00 / 0.00 / 100.00        
SigmoidGradNodeFinal                                  48      1.03 / 0.02 / 0.04 / 0.02 / 0.07          0.26 / 0.01 / 0.01 / 0.00 / 0.01          
  sigmoid_grad compute                                48      0.55 / 0.01 / 0.03 / 0.01 / 53.57         0.26 / 0.01 / 0.01 / 0.00 / 100.00        
    void phi::funcs::VectorizedElementwiseKernel<...  48      - / - / - / - / -                         0.26 / 0.01 / 0.01 / 0.00 / 100.00        
sigmoid dygraph                                       48      1.19 / 0.02 / 0.04 / 0.02 / 0.08          0.24 / 0.01 / 0.01 / 0.00 / 0.01          
  sigmoid compute                                     48      0.77 / 0.02 / 0.03 / 0.01 / 65.16         0.24 / 0.01 / 0.01 / 0.00 / 100.00        
    void phi::funcs::VectorizedElementwiseKernel<...  48      - / - / - / - / -                         0.24 / 0.01 / 0.01 / 0.00 / 100.00        
  sigmoid node_creation                               48      0.17 / 0.00 / 0.01 / 0.00 / 14.09         0.00 / 0.00 / 0.00 / 0.00 / 0.00          
exp dygraph                                           48      1.24 / 0.03 / 0.04 / 0.02 / 0.09          0.21 / 0.00 / 0.01 / 0.00 / 0.01          
  exp compute                                         48      0.85 / 0.02 / 0.03 / 0.01 / 68.41         0.21 / 0.00 / 0.01 / 0.00 / 100.00        
    void phi::funcs::VectorizedElementwiseKernel<...  48      - / - / - / - / -                         0.21 / 0.00 / 0.01 / 0.00 / 100.00        
  exp node_creation                                   48      0.15 / 0.00 / 0.00 / 0.00 / 12.12         0.00 / 0.00 / 0.00 / 0.00 / 0.00          
abs dygraph                                           48      1.19 / 0.02 / 0.03 / 0.02 / 0.08          0.15 / 0.00 / 0.00 / 0.00 / 0.01          
  abs compute                                         48      0.79 / 0.02 / 0.02 / 0.01 / 66.53         0.15 / 0.00 / 0.00 / 0.00 / 100.00        
    void phi::funcs::VectorizedElementwiseKernel<...  48      - / - / - / - / -                         0.15 / 0.00 / 0.00 / 0.00 / 100.00        
  abs node_creation                                   48      0.13 / 0.00 / 0.00 / 0.00 / 11.04         0.00 / 0.00 / 0.00 / 0.00 / 0.00          
stack dygraph                                         24      1.22 / 0.05 / 0.09 / 0.04 / 0.09          0.11 / 0.00 / 0.00 / 0.00 / 0.00          
  stack compute                                       24      0.96 / 0.04 / 0.07 / 0.03 / 78.73         0.11 / 0.00 / 0.00 / 0.00 / 100.00        
    void phi::StackCUDAKernel<long, int>(long**, ...  24      - / - / - / - / -                         0.08 / 0.00 / 0.00 / 0.00 / 75.61         
fill_constant dygraph                                 8       0.74 / 0.09 / 0.10 / 0.09 / 0.05          0.02 / 0.00 / 0.00 / 0.00 / 0.00          
fill_constant                                         8       0.63 / 0.08 / 0.08 / 0.07 / 0.04          0.02 / 0.00 / 0.00 / 0.00 / 0.00          
  infer_shape                                         8       0.02 / 0.00 / 0.00 / 0.00 / 3.54          0.00 / 0.00 / 0.00 / 0.00 / 0.00          
  compute                                             8       0.27 / 0.03 / 0.04 / 0.03 / 43.38         0.02 / 0.00 / 0.00 / 0.00 / 100.00        
    void phi::funcs::VectorizedElementwiseKernel<...  8       - / - / - / - / -                         0.02 / 0.00 / 0.00 / 0.00 / 100.00        
  grad_node_creation                                  8       0.00 / 0.00 / 0.00 / 0.00 / 0.32          0.00 / 0.00 / 0.00 / 0.00 / 0.00          
GradNodeAccumulation                                  1176    4.78 / 0.00 / 0.01 / 0.00 / 0.34          0.00 / 0.00 / 0.00 / 0.00 / 0.00          
ReshapeGradNodeFinal                                  72      0.78 / 0.01 / 0.02 / 0.00 / 0.06          0.00 / 0.00 / 0.00 / 0.00 / 0.00          
  reshape_grad compute                                72      0.11 / 0.00 / 0.00 / 0.00 / 14.40         0.00 / 0.00 / 0.00 / 0.00 / 0.00          
SubtractGradNodeFinal                                 48      0.36 / 0.01 / 0.01 / 0.01 / 0.03          0.00 / 0.00 / 0.00 / 0.00 / 0.00          
  subtract_grad compute                               48      0.03 / 0.00 / 0.00 / 0.00 / 8.48          0.00 / 0.00 / 0.00 / 0.00 / 0.00          
reshape dygraph                                       168     1.99 / 0.01 / 0.02 / 0.01 / 0.14          0.00 / 0.00 / 0.00 / 0.00 / 0.00          
  reshape_with_xshape compute                         168     0.25 / 0.00 / 0.01 / 0.00 / 12.36         0.00 / 0.00 / 0.00 / 0.00 / 0.00          
  reshape node_creation                               96      0.31 / 0.00 / 0.00 / 0.00 / 15.59         0.00 / 0.00 / 0.00 / 0.00 / 0.00          
unsqueeze dygraph                                     48      0.54 / 0.01 / 0.02 / 0.01 / 0.04          0.00 / 0.00 / 0.00 / 0.00 / 0.00          
  unsqueeze_with_xshape compute                       48      0.09 / 0.00 / 0.00 / 0.00 / 17.58         0.00 / 0.00 / 0.00 / 0.00 / 0.00          
  unsqueeze node_creation                             24      0.10 / 0.00 / 0.00 / 0.00 / 18.80         0.00 / 0.00 / 0.00 / 0.00 / 0.00          
----------------------------------------------------  ------  ----------------------------------------  ----------------------------------------  
  2. Fix a bug where the format string for GPU memory data in the exported chrome tracing was set incorrectly.
  3. The user-defined summary now counts only record events added by users at the Python layer (see the sketch below).
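
For reference, a minimal sketch of how such a run can be profiled from Python — the network, data, and step count are placeholders, not the PaddleDetection yolov3_mobilenet_v1_roadsign.yml task. `paddle.profiler.RecordEvent` creates the Python-layer user-defined records that item 3 refers to, `export_chrome_tracing` produces the chrome tracing file touched by item 2, and `summary` prints an operator table like the one above:

```python
# Minimal sketch, assuming a GPU build of Paddle >= 2.3; model and data
# are placeholders standing in for the PaddleDetection task.
import paddle
import paddle.profiler as profiler

model = paddle.nn.Linear(8, 8)
data = paddle.randn([4, 8])

prof = profiler.Profiler(
    targets=[profiler.ProfilerTarget.CPU, profiler.ProfilerTarget.GPU],
    on_trace_ready=profiler.export_chrome_tracing('./profiler_log'))
prof.start()
for step in range(10):
    # Python-layer user-defined record event (what item 3 counts).
    with profiler.RecordEvent(name="my_preprocess"):
        x = data * 1.0
    loss = model(x).mean()
    loss.backward()
    prof.step()
prof.stop()
# Prints an Operator Summary like the table above.
prof.summary(sorted_by=profiler.SortedKeys.GPUTotal, op_detail=True,
             time_unit='ms')
```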

@paddle-bot bot commented Jul 25, 2022

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@zhiqiu zhiqiu (Contributor) left a comment

LGTM

@From00 From00 merged commit 963163e into PaddlePaddle:develop Jul 26, 2022
rainyfly added a commit to rainyfly/Paddle that referenced this pull request Jul 27, 2022
* fix new dygraph record event for op

* update unit test
XiaoguangHu01 pushed a commit that referenced this pull request Aug 2, 2022
* fix record event for operator type in new dygraph (#44582)

* fix new dygraph record event for op

* update unit test

* fix file mode
xuewujiao added a commit to xuewujiao/Paddle that referenced this pull request Aug 4, 2022
* fix python3.10 compile bug on window (PaddlePaddle#44330)

* Fix random seed for several unit tests (PaddlePaddle#44135)

* Fix test_functional_conv2d_transpose random seed

* Fix random seed and use np.testing

* Fix random seed for test_lu_unpack_op

* Fix test_autograd_functional_dynamic random seed

* Remove boost library (PaddlePaddle#44092)

* add fused token prune op and plugin (PaddlePaddle#44281)

* add fused token prune op and plugin

* Fix run inference bug for standalone executor (PaddlePaddle#44340)

* xpu-paddlepaddle-33 [任务] matmul单测 timeout (PaddlePaddle#44333)

test=kunlun

* [IPU] add custom-op UTs 0/N (PaddlePaddle#44328)

* add custom-op UTs 0

* add authors

Co-authored-by: Allen Guo <alleng@graphcore.ai>
Co-authored-by: Zhixin Yao <zhixiny@graphcore.ai>
Co-authored-by: Zhaorui Chen <zhaoruic@graphcore.ai>

Co-authored-by: Zhixin Yao <zhixiny@graphcore.ai>
Co-authored-by: Zhaorui Chen <zhaoruic@graphcore.ai>

* [IPU] add custom-op UTs 1/N (PaddlePaddle#44329)

* add custom-op UTs 1

* add authors

Co-authored-by: Allen Guo <alleng@graphcore.ai>
Co-authored-by: Zhixin Yao <zhixiny@graphcore.ai>
Co-authored-by: Zhaorui Chen <zhaoruic@graphcore.ai>

* update url

Co-authored-by: Zhixin Yao <zhixiny@graphcore.ai>
Co-authored-by: Zhaorui Chen <zhaoruic@graphcore.ai>

* support KL2 multi-card training, *test=kunlun (PaddlePaddle#43889)

* update xccl lib
    * use separate streams for compute/comm on XPU
    * add broadcast op to xpu2_op_list

* Remove auto to_pascal_case for args in op generator (PaddlePaddle#44350)

* remove auto to_pascal_case for args in op generator

* fix yaml config

* Standard sparse conv name (PaddlePaddle#44353)

* [Eager] eager variable back sync (PaddlePaddle#44343)

* eager variable back sync

* [ Phi Kernel ] Transfer as_real to phi. (PaddlePaddle#44263)

* transfer as_real to phi

* fix erros

* blocking: True -> False

* [Eager]Fix assert statement (PaddlePaddle#43492)

* Not rename pb file to avoid re-compile (PaddlePaddle#44370)

* [Phi] Migrate solve kernel to phi (PaddlePaddle#44363)

* draft version

* draft version

* draft version

* migrate solve kernel to phi

* polish

* polish

* re useless header file, fix a bug in grad_kernel_impl

* add header file in need

* [auto parallel] remove comm init control (PaddlePaddle#44385)

* [CustomDevice] remove unused file (PaddlePaddle#44358)

* [Paddle-TRT] reshape fill_constant (PaddlePaddle#44314)

* reshape fill_constant

* commit

* commit

* set seed for uts (PaddlePaddle#44372)

* [Paddle-TRT] remove useless code in fc (PaddlePaddle#44382)

* remove useless code in fc

* [Paddle-TRT] Fix cast (PaddlePaddle#44312)

* fix_cast

* fix_cast

* commit

* Polish jit layer cmakelists to hide some message (PaddlePaddle#44351)

* Enable inference multi stream ci test (PaddlePaddle#44275)

* test

* update

* fix bug of old pp (PaddlePaddle#44361)

* add xpu resnet_unit (PaddlePaddle#44297)

* add xpu resnet_unit
*test=kunlun

* tmp
*test=kunlun

* add blacklist in prim2orig interface (PaddlePaddle#44383)

* [Plugin] Fix Custom device in eager mode, test=develop (PaddlePaddle#43952)

* [Plugin] Fix Custom device in eager mode, test=develop

* update test case, test=develop

* update ut for coverage, test=develop

* add ipu support for standalone executor.  (PaddlePaddle#44342)

* fix typos in template for codegen of operators (PaddlePaddle#44364)

* fix duplicate slice logic in _grad (PaddlePaddle#44396)

* [MLU] fix mlu ctest final. (PaddlePaddle#44404)

* fix data transform bug of interpolate op (PaddlePaddle#44401)

* [Sparse] Add sparse matmul kernel(coo*dense->dense) (PaddlePaddle#44346)

* fix new autodiff api docs (PaddlePaddle#44341)

* fix build error in low arch (PaddlePaddle#44391)

* [new api] add new api paddle.vision.ops.distribute_fpn_proposals (PaddlePaddle#43736)

* add distribute_fpn_proposals

* change to new dygraph

* fix doc and example code

* change fluid impl to current version

* update (PaddlePaddle#44418)

* [Paddle-TRT] Shape sum fix scale (PaddlePaddle#44394)

* shape sum

* add shape, sum trt layer

* [Phi] Migrate infermeta and add yaml for solve op (PaddlePaddle#44379)

* migrate solve kernel to phi

* re useless header file, fix a bug in grad_kernel_impl

* add header file in need

* add yaml for solve op

* fix solve_sig.cc ArgumentMapping and update tests case

* disable legacy dygraph check in op_test

* rm solve_op.cc / solve_sig.cc and migrate yaml config

* Update op_test.py

disable legacy dygraph check when check_eager is True

* add labels for infer ut (PaddlePaddle#44279)

* add labels for infer ut

* add RUN_TYPE=INFER for cpp ut

* fix formaterror

* update

* Add mfence for XPU2 KP (PaddlePaddle#44258)

* remove include of all.h in resnet_basic_block_op_xpu.cc, test=kunlun (PaddlePaddle#44423)

* Rename BOOST_GET macros (PaddlePaddle#44368)

* Rename BOOST_GET macros

* Fix conflicts

* [new API] add paddle.vision.ops.generate_proposals (PaddlePaddle#43611)

* add generate_proposals into paddle.vision

* remove class api

* im_info -> img_size

* change fluid impl to current version

* Accelerate inference period in op Cache method (PaddlePaddle#43857)

* Added pad3d and pad2d FP32 FWD oneDNN kernels (PaddlePaddle#43990)

* Piotrek's changes for pad3d

* my changes

* first version of pad3d, single copy, unnecessary reads

* optimized pad3d kernel

* test upadte

* removed magic numbers

* added support for pad2d

* reverted two files

* reverted one old change

* added support for Paddings tensor

* CI fix

* CI fix

* fixed timeout of tests

* fixed typo

* changes to GetKernelTypeForVar

* Revert "changes to GetKernelTypeForVar"

This reverts commit 4691061.

* added AsExtra() to pad2d

Co-authored-by: Piotr Paturej <piotr.paturej@intel.com>

* add save_cache/patch (PaddlePaddle#44420)

* add save_cache/patch

* add pybind

* remove pybind

* remove const_cast

* add fleet

* Standard name of sparse pool (PaddlePaddle#44344)

* move eig operator from fluid to phi (PaddlePaddle#44398)

* move eig operator from fluid to phi

* add eig_grad unitest, upgrade IsComplexType() from fluid to phi

* [Phi]Move angle op to phi (PaddlePaddle#44393)

* Move angle op to phi

* Replace mutable_data using Alloc

* Remove some include

* Try to fix windows ci error

* include math.h to fix windows ci error

* Fix kernel name

* Move angle_grad infershape

* [Eager]release gil when run backward (PaddlePaddle#44433)

* release gil when run backward

* compile phi/backends into one static library (PaddlePaddle#44373)

* compile into one static library

* fix xpu compile

* fix xpu compile

* fix inference compile

* fix inference compile

* add custom test

* revert one file

* [IPU] Add more Ops (PaddlePaddle#44414)

* [IPU] Add more Ops

* update boost API

* Clean CI_SKIP_CPP_TEST (PaddlePaddle#44412)

* Add dependency for read op in standalone executor (PaddlePaddle#44362)

* Add dependency for read op in standalone executor

* Fix CI errors

* Add UT

* add_dependency -> dependency_utils

* Fix CI errors

* Add distro in ci docker (PaddlePaddle#44332)

* add distro zstd

* test

* test

* add pip3.8

* [Phi] migrate as_complex kernel to phi (PaddlePaddle#44438)

* migrate as_complex kernel to phi

* support as_complex and as_real in phi

* rm GetExpectedKernelType for AsRealOp

* [GPUPS]FleetWrapper initialize (PaddlePaddle#44441)

* fix FleetWrapper initialize

* [XPU][NPU] (1) add device_guard. (2) add support for LoDTensorArray of sum op. (PaddlePaddle#44367)

* device_guard support xpu. test=kunlun

* sum op of xpu support LoDTensorArray. add test for while op of xpu. test=kunlun.

* [IPU] add Op uts (PaddlePaddle#44415)

* transfer block_id to CreateVarNode in multi_devices_graph_pass (PaddlePaddle#44366)

* fix CreateVarNode in multi_devices_graph_pass

* Revert "Fix var duplication bug for graph_to_program_pass (PaddlePaddle#44278)"

This reverts commit a2c4c86.

* 【GPUPS】Adam accessor (PaddlePaddle#43919)

* add adam/sharedadam optimzier for gpups;edit optimizer struct;test=develop

* [Phi] migrate sync_batch_norm to phi (PaddlePaddle#44369)

* [GPUPS]Fix psgpuwrapper initialization (PaddlePaddle#44468)

* Update ps_gpu_wrapper.h

* Update ps_gpu_wrapper.h

* Update ps_gpu_wrapper.cc

* [Phi] migrate exponential kernel to phi (PaddlePaddle#44376)

* [Phi] migrate exponential kernel to phi

* fix comment

* fix CI

* [PHI] move diag_embed op to phi. (PaddlePaddle#44408)

* move diag_embed to phi.

* [MLU] set_value performance optimizing (PaddlePaddle#44390)

* Update api changing approve members (PaddlePaddle#44463)

* update api approve members, test=document_fix

* add qingqnig into list, test=document_fix

* fix bug,test=document_fix (PaddlePaddle#44478)

* [Phi] migrate clip_by_norm to phi (PaddlePaddle#44458)

* add eigen3 dependency for phi_backends (PaddlePaddle#44479)

* remove fleet_13 ut in parallel_UT_rule.py; test=develop (PaddlePaddle#44477)

* [PHI]Seperate xshape kernel from normal kernel (PaddlePaddle#44315)

* seperate xshape kernel from normal kernel

* fix bugs in infermeta

* fix compile bugs

* fix compile bugs

* [AutoParallel] fix unittest with paddle.distributed.launch (PaddlePaddle#44439)

* fix unittest

* fix log_dir

* _enable_legacy_dygraph

* [Phi] add temporal_shift yaml (PaddlePaddle#44409)

* add temporal_shift yaml and unittest

* [Paddle inference] Add conv_fusion_fp16 (PaddlePaddle#44435)

* convfusionfp16

* convfusionfp16

* convfusionfp16

* fix some convert error found in tipc. (PaddlePaddle#44457)

* fix some error found in tipc.

* update

* [BugFix]Fix randint_like bugs when save program that don't need use tensor's value (PaddlePaddle#44446)

* fix bugs of random

* fix unittest error

* fix unittest bugs

* add adaptive pool and softmax with cross entropy supports different axis, * test = kunlun  (PaddlePaddle#44428)

* add xpu pnorm op and fix pool op, *test=kunlun

* add adaptive pool, and softmax with cross entropy supports different axis, *test=kunlun

* add slot attr for push sparse op (PaddlePaddle#44422)

* add slot attr for push sparse op

* add pybind

* remove fleet

* add unittest

* fix

* [Dy2Sta]Fix Segment Fault while training multi-card if params have no grad (PaddlePaddle#44485)

* [Dy2Sta]Fix Segment Fault while training multi-card if params have no grad

* fix unittest

* fix tensor stream error in custom op (PaddlePaddle#44500)

* Replace with dygraph op calling method. (PaddlePaddle#44331)

* Replace with dygraph op calling method.

* [JitLayer]Pybind PEFunction and call phi api in layer_test (PaddlePaddle#44465)

* Support predictor function in JitLayer

* Pybind PEFunction

* Pybind PEFunction and call phi api in layer_test

* Call sqrt phi API

* Polish flags

* Fix comments

* [Sparse] Add sparse addmm kernel (dense+coo*dense->dense,dense+csr*dense->dense) (PaddlePaddle#44451)

* [Eager] bilinear_tensor_product yaml (PaddlePaddle#44459)

* bilinear_tensor_product yaml

* [ Phi ] svd transfer (PaddlePaddle#44392)

* svd cpu forward

* svd gpu forward

* transfer the backward of svd

* remove cusolver in svd_grad

* svd kernel bug fix

* fix bugs

* fix bugs.

* fix bug

* [Paddle-TRT] fix_fill_constant (PaddlePaddle#44481)

* fix_fill_constant

* fix_fill_constant

* fix_ernie

* [MLU] transpose avg_pool2d to NHWC for better performance. (PaddlePaddle#44475)

* [jit] jit support property.proto (PaddlePaddle#44337)

* add property.proto, can compiled

* property get and deserilize

* support get float

* format code

* format code

* add unittest

* add more set method

* fix grammar error

* Update paddle/fluid/jit/property.h

Co-authored-by: Aurelius84 <zhangliujie@baidu.com>

* Update paddle/fluid/jit/property.cc

Co-authored-by: Aurelius84 <zhangliujie@baidu.com>

* Update paddle/fluid/jit/property.cc

Co-authored-by: Aurelius84 <zhangliujie@baidu.com>

* Update paddle/fluid/jit/property.cc

Co-authored-by: Aurelius84 <zhangliujie@baidu.com>

* fix comment

* fix error throw

* fix property save unit test

* fix error info

* fix copyright and header import

* reorder jit property tensor datatype

Co-authored-by: Aurelius84 <zhangliujie@baidu.com>

* [ Dy2static ] infer_program may be incorrect in amp mode. (PaddlePaddle#44487)

* fix the outputs of net is x,x

* add unittest for duplicate output

* fix

* fix _infer_program use the original program not the amp program.

* get _***program_id back and avoid duplicate cache
ing

* fix

* Fc fp16 (PaddlePaddle#44505)

* fc support fp16

* add a ‘,’ on paddle_pass_builder.cc

* fc support fp16 on non-cuda.

* add batch stream (PaddlePaddle#44524)

* shufflechannelfix (PaddlePaddle#44516)

* fix arg_max to select first index (PaddlePaddle#44521)

* [MLU] add floor kernel and grid_sampler kernel (PaddlePaddle#44498)

* commit (PaddlePaddle#44534)

* [CustomDevice] register Copy for custom device (PaddlePaddle#44200)

* [CustomDevice] register Copy for custom device

* [CustomDevice] register Copy for custom device

* [CustomDevice] register Copy for custom device

* merge and add uts

* merge and add uts

* fix for blocking and unittests coverage

* (modified) fc support fp16 (PaddlePaddle#44540)

* Add code of occupancy computing on DCU and avoid threadID bug for DCU profiler (PaddlePaddle#44520)

* add xpu lars_momentum/pow2_decay (PaddlePaddle#44448)

*test=kunlun

* [phi] move inverse op from fluid to phi (PaddlePaddle#44471)

* move inverse from fluid to phi with unitest bug

* fix bug, add eager op yaml

* support send_partial, recv_partial and allgather_partial in ProcessGroupNCCL (PaddlePaddle#44444)

* [Sparse]add sparse unary api(expm1/deg2rad/rad2deg/relu6/leaky_relu) (PaddlePaddle#44432)

* Fc fp16 (PaddlePaddle#44558)

* (modified) fc support fp16

* __CUDA_ARCH__ version

* delete half

* delete half

* Fix bug of amp code-gen (PaddlePaddle#44570)

* fix bug of amp code_gen

* fix bug

* [JitLayer]Fix jit.save error when save params combined (PaddlePaddle#44504)

* Fix jit.save error when save params combined

* Change dict_value to list

* [Phi] Migrate squared_l2_norm_op to phi (PaddlePaddle#44492)

* add swish  using TensorRT layer (PaddlePaddle#44561)

* update

* empty commit

* update

* update

* update

* Phi gird sampler migration (PaddlePaddle#44562)

* add_ymal_utest for phi grid_sampler op

* skip dist test cases if mlu card number only one, test=develop (PaddlePaddle#44549)

* [dy2st]Add ProgramHelper to polish build program logic in autoparallel.Engine (PaddlePaddle#44513)

* [dy2st]Add ProgramHelper to polish build program logic in autoparallel.Engine

* refine code

* 【Hackathon No.21】为 Paddle 新增 SoftMarginLoss (PaddlePaddle#42364)

* 2022-04-28

* 2022-04-28_V2

* 2022-04-30

* 2022-04-30_V2

* 2022-05-01

* 2022-05-02

* 2022-05-02_V2

* 2022-05-05_V1

* 2022-05-06_V1

* 2022-05-07_V1

* Update loss.py

* 2022-05-07_V2

* 2022-05-13_V1

* Update test_soft_margin_loss.py

* Update loss.py

* Update loss.py

* 2022-05-16_V1

* 2022-05-19_V1

* 2022-05-20_V1

* Update test_soft_margin_loss.py

* 2022-06-01_V1

* 2022-06-05

* 2022-06-07

* 2022-06-07

* 2022-06-08

* 2022-06-08_V2

* 2022-06-17-code_style

* Modify python

* 2022-06-20

* for

* for CI;test=document_fix

Co-authored-by: Ligoml <39876205+Ligoml@users.noreply.github.com>

* [MLU]transpose convbpf output to HWCN for better performance (PaddlePaddle#44552)

* Fc fp16 (PaddlePaddle#44578)

* (modified) fc support fp16

* __CUDA_ARCH__ version

* delete half

* delete half

* add half support

* add half support

* add half support

* [Auto Parallel] Add dist op cost (PaddlePaddle#44146)

* update comp cost

* add dist default op cost

* add dist fill constant batch size like op cost

* add elewise op cost

* add fill_constant_batch_size_like op cost unittest

* add unittest and remove fill_constant_batch_size_like grad op cost

* add to cmakelist

* fix unittest bug

* Improve CI unittest parallel execution strategy (PaddlePaddle#44334)

* paralle_test_daily

* test=paralle_test_daily

* test=paralle_test_daily

* test=parallel_test_daily

* test=paralle_test_daily

* test=paralle_test_daily

* test=paralle_test_daily

* test=paralle_test_daily

* test=paralle_test_daily

* test=paralle_test_daily

* test=paralle_test_daily

* test=paralle_test_daily

* test pre_test_bak

* test cfs

* test_cfs,test=paralle_test_daily

* test_cfs,test=paralle_test_daily

* fix nightly test name,test=paralle_test_daily

* fix nightly test name,test=paralle_test_daily

* test ci parallel speed

* refine parallel rule,test=paralle_test_daily

* Move bmm OP from fluid to phi (PaddlePaddle#44496)

* [PHI]Move slogdeterminant op to phi (PaddlePaddle#44547)

* Move slogdeterminant op to phi

* Add yaml and unit test for slogdeterminant

* Rename pybind_boost_header.h (PaddlePaddle#44592)

* unify data type and property enum value (PaddlePaddle#44585)

* inference multi stream support handle lazy init. (PaddlePaddle#44563)

* multi stream support handle lazy init.

* support eigen lazy init

* update

* fix ci problem

* Remove ControlDepVar in GraphToBlock (PaddlePaddle#44591)

* transfer the svd infer into phi infermeta (PaddlePaddle#44528)

* transfer the svd infer into phi infermeta

* remove the svd.h

* modify svd api

* fix svd error by insert optional

* Einsum grad complex (PaddlePaddle#44598)

* add complex for einsum grad kernel

* pass the ci

* add reverse yaml (PaddlePaddle#44518)

* add reverse yaml

* Set more attrs in ReplaceScaleLossGradOp (PaddlePaddle#44576)

* Set more attrs in ReplaceScaleLossGradOp

* Fix typos

* Fix CI errors

* Add UT

* [Phi] Migrate box coder to phi. (PaddlePaddle#44550)

* fix behavior of device_id=None in Tensor.cuda (PaddlePaddle#44515)

* fix behavior of device_id=None in Tensor.cuda

* fix CI
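
A minimal sketch of the behavior the fix above targets: Tensor.cuda() called with the default device_id=None should place the tensor on the current CUDA device; the exact placement semantics are inferred from the commit title:

    import paddle

    x = paddle.to_tensor([1.0, 2.0])
    if paddle.is_compiled_with_cuda():
        y = x.cuda()  # device_id defaults to None -> current CUDA device
        print(y.place)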

* fix windows cuda11.7 bug (PaddlePaddle#44601)

* add horizontal federated learning ps feature (PaddlePaddle#44327)

* back fl

* delete ssl cert

* .

* make warning

* .

* unittest paral degree

* solve unittest

* heter & multi cloud commm ready

* .

* .

* fl-ps v1.0

* .

* support N + N mode

* .

* .

* .

* .

* delete print

* .

* .

* .

* .

* fix bug

* .

* .

* fl-ps with coordinator ready

* merge dev

* update message parse only

* update fl client scheduler

* fix bug

* update multithreads sync

* fix ci errors

* update role_maker.py

* update role_maker.py

* fix ci error: windows py import error

* fix ci error: windows py import error

* fix windows ci pylib import error

* add dump fields & params

* try to fix windows import fleet error

* fix ps FLAGS error

* [MLU] rollback cntoolkit version to 2.8.5 (PaddlePaddle#44595)

* [CustomDevice] add blas_axpby api for gradient_accumulator (PaddlePaddle#44584)

* add sin,cos,exp primitive operators (PaddlePaddle#44345)

* Optimize sparse convolution (PaddlePaddle#43576)

* Merge kProgramDescs in GraphToProgram (PaddlePaddle#44526)

* [Eager] Add warpctc yaml (PaddlePaddle#44617)

* Add a feed op before each input parameter var. (PaddlePaddle#44499)

* Add a feed op before each input parameter var.

* Fix some issues about the unit test build_cinn_pass_test.

* fix record event for operator type in new dygraph (PaddlePaddle#44582)

* fix new dygraph record event for op

* update unit test

* fix bug of elementwise_add_grad, *test=kunlun (PaddlePaddle#44545)

* fix bug of elementwise_add_grad, *test=kunlun

* fix bug, *test=kunlun

* rm pooling_t, *test=kunlun

* fix bug of ew_add_grad when inplace, *test=kunlun

* [IPU] small bug fix (PaddlePaddle#44473)

* sync misc changes

* add authors

Co-authored-by: Zhaorui Chen <zhaoruic@graphcore.ai>

* up x

* Revert "up x"

This reverts commit f3fde45.

* add guarg for ipu

Co-authored-by: Zhaorui Chen <zhaoruic@graphcore.ai>

* support auto fallback to cpu kernel for custom device (PaddlePaddle#44639)

* fix dygraph bugs in broadcast_to api. (PaddlePaddle#44612)
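
A minimal dygraph sketch of the paddle.broadcast_to call the fix above concerns; shapes are illustrative:

    import paddle

    x = paddle.to_tensor([1, 2, 3])
    y = paddle.broadcast_to(x, shape=[2, 3])  # tile along a new leading dim
    print(y)  # [[1, 2, 3], [1, 2, 3]]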

* add set_dtype for inverse_op (PaddlePaddle#44618)

* refine overalls.cmake (PaddlePaddle#44623)

* [PHI]Add yaml and unittest for bmm op (PaddlePaddle#44625)

Add yaml and unittest for bmm op

* Phi average accumulates migration (PaddlePaddle#44554)

* move average_accumulates op to phi kernel

* new exe not support pg (PaddlePaddle#44628)

* [CustomDevice]fix phi kernel header (PaddlePaddle#44637)

* [CustomDevice] add process_group_xccl ut (PaddlePaddle#44632)

* [CustomDevice] add process_group_xccl ut

* update

* Fix conv api name (PaddlePaddle#44636)

* [DCU] Fix NaN problem when training BERT on DCU platform (PaddlePaddle#44643)

* [JitLayer]Remove include fluid head files in JitLayer (PaddlePaddle#44597)

* Remove include fluid head files in JitLayer

* Format code

* Remove const to fix ci error

* Fix param error

* Polish jit layer include and cp some headers to python/include

* Fix comment

* [jit]  jit.save support property serialization (PaddlePaddle#44581)

* jit.save support property serialization

* extract set property function

* fix property test file name

* fix typing error

* fix typing error

* fix test coverage

* Replaced add_custom_command with add_custom_target in xpu_kp_cmake (PaddlePaddle#44619)

* Replaced add_custom_command with add_custom_target in xpu_kp_cmake

* add adagrad and rmsprop yaml (PaddlePaddle#44631)

* [phi] move crop_tensor kernel from fluid to phi (PaddlePaddle#44574)

* move crop_tensor from fluid to phi

* delete fluid header files

* fix crop_tensor_op dygraph_mode bug

* modify header files, add out tensor check

* fix RemoveIntermediateOut in fuse_elewise_add_act_pass while converting graph to program (PaddlePaddle#44593)

* fix RemoveNode in fuse_elewise_add_act_pass

* fix

* change pointer to shared_ptr

* fix

* fix

* fix format

* fix

* fix graph_safe_remove_nodes

* fix UTs on physical ipu (PaddlePaddle#44647)

* [IPU] add more loss ops  (PaddlePaddle#44646)

* add more loss ops

* add authors

Co-authored-by: Zhaorui Chen <zhaoruic@graphcore.ai>

Co-authored-by: Zhaorui Chen <zhaoruic@graphcore.ai>

* add g_ipuplace_pytype (PaddlePaddle#44648)

* Strided slice fp16 (PaddlePaddle#44653)

* [MLU]fix sync_batch_norm and concat_grad op (PaddlePaddle#44586)

* retain dist op returns (PaddlePaddle#44634)

* xpu unittest grad compute supports more types, *test=kunlun (PaddlePaddle#44606)

* [Eager] Add hierarchical_sigmoid yaml (PaddlePaddle#44638)

* add matrix_nms in python/paddle/vision/ops.py (PaddlePaddle#44357)

* [auto parallel] bug fix for op has sub_block attr created with copy_from (PaddlePaddle#44664)

* Change the way to set attributes for grad op maker (PaddlePaddle#44514)

* fix typos in template for codegen of operators
* change the way to set attributes for grad op maker

* [XPU] add top_k op (PaddlePaddle#44656)

* [XPU] add top_k op. test=kunlun

* [XPU] add top_k op. test=kunlun

* use PADDLE_ENFORCE_XDNN_NOT_NULL to check pointer. test=kunlun

* Support broadcast tensor in phi system (PaddlePaddle#44590)

* [PHI] Move spectral_norm to phi (PaddlePaddle#44577)

* Add kernel declarations

* Copy kernel implementation code

* Transfer implementation code

* Fix: Move out_grad to first

* Register new kernels

* Remove old kernels

* Move out_grad to last

* Fix bugs

* Transfer infermeta

* Add yaml files

* Add blank line

* Fix code style

* Optimize directory structure

Co-authored-by: Bobholamovic <linmanhui@baidu.com>

* Complete the dtypes for all_gather, add all_gather_object api (PaddlePaddle#44417)
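
A hedged sketch of the new all_gather_object API named above, assuming the usual collective pattern (picklable Python objects, default process group); run it under paddle.distributed.launch with multiple ranks:

    import paddle.distributed as dist

    dist.init_parallel_env()
    obj_list = []
    # gathers one picklable object per rank into obj_list
    dist.all_gather_object(obj_list, {"rank": dist.get_rank()})
    print(obj_list)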

* [Eager] refactor general_grad and fix some bugs (PaddlePaddle#44611)

* refactor general_grad and fix some bugs

* add TODO: support prune logic deeper

* support log_grad op, *test=kunlun (PaddlePaddle#44662)

* [LAUNCH] add distributed launch check tools (PaddlePaddle#44495)

* add launch test

* launch test for cpu

* bs 1

* Move api(lgamma) from legacy_api.yaml to api.yaml (PaddlePaddle#44355)

* Move api(lgamma) from legacy_api.yaml to api.yaml

* Move api(lgamma) from legacy_api.yaml to api.yaml

* Move api(lgamma) from legacy_api.yaml to api.yaml

* modify code style

* add x to X mapping

* add definition of lgamma

* delete redundant lgamma definitions

* Modify code comments

* Modify ops.py code format

* add lgamma single test and lgamma api in fluid

* Optimized lgamma unittest
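
The entries above only move lgamma's definition between YAML files; the public API is unchanged. A minimal usage sketch:

    import paddle

    x = paddle.to_tensor([0.5, 1.0, 2.0])
    print(paddle.lgamma(x))  # log|Gamma(x)|: approx [0.5724, 0.0, 0.0]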

* Move frame kernel to phi (PaddlePaddle#44615)

* Move frame OP to phi, add frame OP yaml config and supplement single test

* add Header file of in_dygraph_mode

* Modify variable name and FrameGradInferMeta multiplex UnchangedInferMeta

* move seq2col to phi

* delete elementwise pow in xpu_kp_list (PaddlePaddle#44661)

* [MLU] fix log_softmax mode selection. (PaddlePaddle#44669)

* adapt for resnet (PaddlePaddle#44685)

* Fix some problem of kernel fallback in C++ API (PaddlePaddle#44681)

* support auto fallback to cpu kernel for custom device

* fix some problem of kernel fallback

* fix bugs of lstsq (PaddlePaddle#44689)

* migrate dirichlet kernel to phi (PaddlePaddle#44434)

* migrate dirichlet op kernel to phi

* fix dirichlet sample memory leak

* [phi]move softsign from fluid to phi (PaddlePaddle#44616)

* test_activation_op unittest error, yaml & activation.py in_dygraph_mode incomplete

* fix test_activation_op unittest error, add yaml and dygraph test

* fix code style with pre-commit

* try to fix namespace error of abs in activation_functor.h

* fix namespace error of abs

* [Paddle Inference] Support depthwise_conv2d fp16. (PaddlePaddle#44642)

* depthwise_fp16

* depthwise_fp16

* depthwise_fp16

* depthwise_fp16

* fix logging debug level (PaddlePaddle#44684)

* back fl

* delete ssl cert

* .

* make warning

* .

* unittest paral degree

* solve unittest

* heter & multi cloud commm ready

* .

* .

* fl-ps v1.0

* .

* support N + N mode

* .

* .

* .

* .

* delete print

* .

* .

* .

* .

* fix bug

* .

* .

* fl-ps with coordinator ready

* merge dev

* update message parse only

* update fl client scheduler

* fix bug

* update multithreads sync

* fix ci errors

* update role_maker.py

* update role_maker.py

* fix ci error: windows py import error

* fix ci error: windows py import error

* fix windows ci pylib import error

* add dump fields & params

* try to fix windows import fleet error

* fix ps FLAGS error

* fix logging risk

* fix logging possible risk

* Skip CUDA Graph case for standalone executor (PaddlePaddle#44693)

* [Eager] fix lerp grad kernel logic (PaddlePaddle#44705)

* clone ort_predictor reuse session (PaddlePaddle#44703)

* [XPU] add sampling_id op, add top_k op, update xdnn api. test=kunlun (PaddlePaddle#44704)

* fused_fc_elementwise_layernorm_op support fp16 (PaddlePaddle#44710)

* fused_fc_elementwise_layernorm support fp16

* fused_fc_elementwise_layernorm support double

* [Phi] Add yaml for assign_value (PaddlePaddle#44596)

* [Phi] Add yaml for assign_value

* [Phi] Fix the bug of the assign api and modify the unittest

* [Phi] Fix the bug when the tensor does not have the backend info

* [Phi] Replace the functional-style cast init by the brace-init

* [Phi] Cast the data explicitly

* [PHI] Move lu to phi  (PaddlePaddle#44605)

* Add kernel declarations

* Copy kernel implementation code

* Transfer implementation code

* Register new kernels

* Remove old kernels

* Fix code style

* Fix bugs

* mutable_data->HostAlloc

* Transfer infermeta

* Add yaml and update python api

* Add PADDLE_WITH_HIP check

* Update unittests

* Fix bugs

* Fix bugs

* Optimize directory structure

* Add output checks

* lu_impl.h->lu_kernel_impl.h

Co-authored-by: Bobholamovic <linmanhui@baidu.com>

* [MLU] add pytest for mlu strided_slice kernel (PaddlePaddle#44523)

* Support backward final hook (PaddlePaddle#44686)

* update to sdk2.6.0 (PaddlePaddle#44673)

* move CUDAStream to phi (PaddlePaddle#44529)

* init

* move CUDAStream to phi

* fix compilation

* merge develop

* add stream_owned_ member

* split cuda_stream.h

* fix cpu compile

* fix constructor

* fix bug

* fix windows compile

* fix inference test_levit

* fix windows tests

* [Auto parallel] Optimization Tuning (PaddlePaddle#43782)

* fixed bug for pass & engine

* fixed bug for benchmark GPT-3

* add tuner & profiler

* add algorithms & config

* skip cast trt convert when input dtype is bool (PaddlePaddle#44716)

* skip cast trt convert when input dtype is bool

* [LAUNCH] fix set args bug (PaddlePaddle#44717)

* Phi softplus migration (PaddlePaddle#44542)

* add yaml and utests of phi softplus

add yaml of softplus

fix softplus bug in phi

* update utests

* bug fix

* bug fix for test_layers

* layer api match

* match def and doc in ops.py

* doc polish

* fix unwanted modified of thresholded_relu

* style improve

* 【PaddlePaddle Hackathon 3 No.15】Add count_nonzero to Paddle (PaddlePaddle#44169)

* add count_nonzero api

* remove grad test
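
A minimal sketch of the count_nonzero API added above, assuming the signature paddle.count_nonzero(x, axis=None, keepdim=False):

    import paddle

    x = paddle.to_tensor([[0., 1., 2.], [0., 0., 3.]])
    print(paddle.count_nonzero(x))          # 3, over the whole tensor
    print(paddle.count_nonzero(x, axis=1))  # [2, 1], per row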

* [WIP] Matmul v1 & v2 unification -- part 1 (PaddlePaddle#44640)

* - Unit tests to be debugged

- fix

- refactor

- diagnostic

- more diagnostic

- fix

- Fix number two

- fix

- fix

- fix

- alpha added

- more fixes

- compilation fix

- removed diagnostic code

- cosmetic fixes

* lint

* add FLAGS_enable_api_kernel_fallback (PaddlePaddle#44706)

* add FLAGS_enable_api_kernel_fallback

* deal with more cases

* add ut for coverage

* phi_multiclass_nms3 (PaddlePaddle#44613)

* add some fp16 op for kunlun resnet50 model (PaddlePaddle#44672)

* add some fp16 op for kunlun resnet50 model
*test=kunlun

* tmp
*test=kunlun

* add dist op costs (PaddlePaddle#44701)

* [API/OP] Migrate Lstsq op into phi (PaddlePaddle#44318)

* migrate lstsq op

* update

* fix bugs for CIs

* update

* fix bugs

* add uts

* update

* update

* update

* fix bugs of jip

* fix bugs of hip

* update

* update according to review

* update

* update

* update

* update

* Add sparse SyncBatchNorm (PaddlePaddle#43520)

* add sparse SyncBatchNorm

* unify fluid::CUDADeviceContext and phi::GpuContext (PaddlePaddle#44723)

* remove cudaDeviceContext

* remove more template

* fix rocm compile

* 【PaddlePaddle Hackathon 3 No.12】Add pairwise_distance to Paddle (PaddlePaddle#44161)

* add paddle.nn.functional.pairwise_distance (cattidea#273)
* remove the test case for undefined behavior

Co-authored-by: SigureMo <sigure.qaq@gmail.com>
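
A minimal sketch of the new functional API, assuming row-wise p-norm distance semantics:

    import paddle
    import paddle.nn.functional as F

    x = paddle.to_tensor([[1.0, 3.0], [3.0, 5.0]])
    y = paddle.to_tensor([[5.0, 6.0], [6.0, 7.0]])
    # row-wise Euclidean distance: approx [5.0, 3.6056]
    print(F.pairwise_distance(x, y, p=2.0))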

* Phi prior box (PaddlePaddle#44431)

* phi_prior_box

* add float[] support

* phi_prior_box_optest

* update

* ort backend support output mutable data (PaddlePaddle#44724)

* [PHI] Move lu_unpack to phi (PaddlePaddle#44674)

* Add kernel declarations

* Copy kernel implementation code

* Transfer implementation code

* Register new kernels

* Remove old kernels

* Fix code style

* Fix bugs

* mutable_data->HostAlloc

* Transfer infermeta

* Add yaml and update python api

* Add PADDLE_WITH_HIP check

* Update unittests

* Add kernel declarations

* Copy kernel implementation code

* Transfer kernel implementation code

* Register new kernels

* Remove old kernels

* Add lu_unpack_sig

* Fix bugs

* Fix bugs

* Fix bugs

* Optimize directory structure

* Add output checks

* Update include files

* lu_impl.h->lu_kernel_impl.h

* Transfer infermeta

* Add yaml and update python api

* Add check_eager

Co-authored-by: Bobholamovic <linmanhui@baidu.com>

* update document of quantile and nanquantile; test=document_fix (PaddlePaddle#42413)
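
The doc-only change above covers these two APIs; a minimal sketch of their difference (nanquantile skips NaNs, quantile propagates them):

    import paddle

    x = paddle.to_tensor([1.0, 2.0, 3.0, float('nan')])
    print(paddle.quantile(x, q=0.5))     # nan: NaN propagates
    print(paddle.nanquantile(x, q=0.5))  # 2.0: median of [1, 2, 3]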

* migrate reduce_amin,reduce_amax kernel to phi (PaddlePaddle#44698)

* [Paddle Inference] add varlen_token_prune plugin, pass, convert (PaddlePaddle#44733)

* add varlen_token_prune plugin, pass, convert

* support build with Ninja on Linux (PaddlePaddle#44210)

* support ninja

* fix mkldnn on windows

* fix mkldnn on windows up1

* up2

* up3

* fix gflags

* BUILD_BYPRODUCTS_OPTION -> BUILD_BYPRODUCTS_ARGS

* use CMAKE_COMMAND

* up x

* migrate overlap_add and overlap_add_grad op (PaddlePaddle#44739)

* update code format

* add yaml and test

* update for comments

* Fix to CI (PaddlePaddle#44744)

* - fix

* - another fix

* lint

* infer context fix place error. (PaddlePaddle#44726)

* infer context fix place error.

* update

* update

* [operator migration] Migrate unstack_op and nms_op (PaddlePaddle#44424)

* update unstack_op

* update unstack_op

* update unstack_op

* fix unstack test

* update unstack

* update with remote

* fix unstack_test.py

* temp_save_change_nms_op

* add nms test

* update nms fix

* update unstack_op

* temp save change

* finish fix nms_op

* pass nms test

* fix CI

* fix ops test

* save change

* fix code style

* fix code style

* fix ci and codestyle

* fix ci

Co-authored-by: ShiningZhang <zhang_liang1991@126.com>

* Update linalg.py (PaddlePaddle#44347)

* Fix test and doc (PaddlePaddle#44735)

* fix test and doc

* fix all_gather_object with various length, test=allcases (PaddlePaddle#44718)

* update manipulation.py paddle.moveaxis (PaddlePaddle#44191)
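
A minimal sketch of the paddle.moveaxis API touched above; shapes are illustrative:

    import paddle

    x = paddle.ones([3, 4, 5])
    y = paddle.moveaxis(x, source=0, destination=-1)  # [3, 4, 5] -> [4, 5, 3]
    print(y.shape)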

* [CI] CI for Distributed (PaddlePaddle#44085)

* generate_unify_header supports excludes (PaddlePaddle#44761)

* [JitLayer]Polish PEFunction to speed up JitLayer and fix memory leak (PaddlePaddle#44738)

* Polish PEFunction to speed up JitLayer

* Polish PEFunction code

* Fix comments

* paddle2onnx update version to 1.0.0rc2 (PaddlePaddle#44759)

* set parallel_job according to CUDA memory in Windows CI unittest (PaddlePaddle#44695)

* set parallel_job according to CUDA memory

* fix bug: add whitespace between content and [] or condition won't work

* [Sparse] optimize sparse attention (PaddlePaddle#44743)

* GPUGraph merge to develop (PaddlePaddle#44594)

Co-authored-by: seemingwang <zsasuke@qq.com>
Co-authored-by: DesmonDay <908660116@qq.com>
Co-authored-by: seemingwang <seemingwang@users.noreply.github.com>
Co-authored-by: Thunderbrook <a754913769@163.com>
Co-authored-by: xuewujiao <105861147+xuewujiao@users.noreply.github.com>
Co-authored-by: root <root@yq01-sys-hic-k8s-v100-box-a225-0693.yq01.baidu.com>
Co-authored-by: Thunderbrook <52529258+Thunderbrook@users.noreply.github.com>
Co-authored-by: root <root@yq01-inf-hic-k8s-a100-ab2-0009.yq01.baidu.com>
Co-authored-by: huwei02 <53012141+huwei02@users.noreply.github.com>
Co-authored-by: yaoxuefeng <yaoxuefeng@baidu.com>
Co-authored-by: lxsbupt <luoxsbupt@163.com>
Co-authored-by: miaoli06 <106585574+miaoli06@users.noreply.github.com>
Co-authored-by: root <root@yq01-inf-hic-k8s-a100-ab2-0008.yq01.baidu.com>
Co-authored-by: chao9527 <33347532+chao9527@users.noreply.github.com>
Co-authored-by: qingshui <qshuihu@gmail.com>
Co-authored-by: yangjunchao <yangjunchao@baidu.com>

* Revert for cmake static library errors on XPU KP PaddlePaddle#44762

* unify gpu context (PaddlePaddle#44740)

* remove cudaDeviceContext

* remove more template

* fix rocm compile

* remove alias name CUDADeviceContext

* fix compile

* fix tests

* revert changes

* API doc(en) bug fixes in the Phase 4 experience evaluation (PaddlePaddle#44749)

* fix docs(en) bugs;test=document_fix

* update paddle.add docs;test=document_fix

* update paddle.where docs;test=document_fix

* for ci;test=document_fix

* Update manipulation.py

* update paddle.where;test=document_fix

Co-authored-by: Ligoml <39876205+Ligoml@users.noreply.github.com>

* Modify the output result annotation under the lerp function (PaddlePaddle#44035)
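
A minimal sketch of the lerp output the annotation fix above documents, computed as x + weight * (y - x):

    import paddle

    x = paddle.to_tensor([1.0, 2.0])
    y = paddle.to_tensor([3.0, 6.0])
    print(paddle.lerp(x, y, 0.5))  # [2.0, 4.0]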

* Refactor build_op_downstream_map for standalone executor (PaddlePaddle#44729)

* Refactor build_op_downstream_map for standalone executor

* Add some comments

* update xpu.cmake to 20220731, test=kunlun (PaddlePaddle#44767)

* fix ut new_group_api (PaddlePaddle#44764)

* support beam_search operator on xpu. test=kunlun (PaddlePaddle#44720)

* support beam_search operator on xpu. test=kunlun

* support beam_search operator on xpu. test=kunlun

* support beam_search operator on xpu. test=kunlun

* support beam_search operator on xpu. test=kunlun

* support beam_search operator on xpu. test=kunlun

* [phi] add yolov3_loss yaml and unittest (PaddlePaddle#44476)

* add yaml and unittest

* update yaml

* update backward yaml and unittest

* update yaml

* add Yolov3LossGradInferMeta

* update yolov3_loss_op.cc

* fix bug

* code format

* Update manipulation.py for rot90() (PaddlePaddle#44038)

* fix compile error;test=develop

* fix compile error;test=develop

* fix compile;test=develop
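
For the rot90() update a few entries above, a minimal sketch of the call, assuming one 90-degree rotation in the (0, 1) plane:

    import paddle

    x = paddle.to_tensor([[1, 2], [3, 4]])
    print(paddle.rot90(x, k=1, axes=[0, 1]))  # [[2, 4], [1, 3]]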

Co-authored-by: Sing_chan <51314274+betterpig@users.noreply.github.com>
Co-authored-by: zlsh80826 <rewang@nvidia.com>
Co-authored-by: Ruibiao Chen <chenruibiao@baidu.com>
Co-authored-by: RichardWooSJTU <37864677+RichardWooSJTU@users.noreply.github.com>
Co-authored-by: taixiurong <taixiurong@126.com>
Co-authored-by: Allen Guo <alleng@graphcore.ai>
Co-authored-by: Zhixin Yao <zhixiny@graphcore.ai>
Co-authored-by: Zhaorui Chen <zhaoruic@graphcore.ai>
Co-authored-by: zhangxiaoci <zhangxiaoci@baidu.com>
Co-authored-by: zyfncg <zhangyunfei07@baidu.com>
Co-authored-by: zhangkaihuo <zhangkaihuo@baidu.com>
Co-authored-by: wanghuancoder <wanghuan29@baidu.com>
Co-authored-by: xiongkun <xiongkun03@baidu.com>
Co-authored-by: Aurelius84 <zhangliujie@baidu.com>
Co-authored-by: Leo Chen <chenqiuliang@baidu.com>
Co-authored-by: Weilong Wu <veyron_wu@163.com>
Co-authored-by: caozhou <48191911+Caozhou1995@users.noreply.github.com>
Co-authored-by: ronnywang <ronny1996@163.com>
Co-authored-by: zhoutianzi666 <39978853+zhoutianzi666@users.noreply.github.com>
Co-authored-by: Haohongxiang <86215757+haohongxiang@users.noreply.github.com>
Co-authored-by: WangZhen <23097963+0x45f@users.noreply.github.com>
Co-authored-by: Wilber <jiweibo@baidu.com>
Co-authored-by: ShenLiang <1422485404@qq.com>
Co-authored-by: QingshuChen <chenqingshu@baidu.com>
Co-authored-by: levi131 <83750468+levi131@users.noreply.github.com>
Co-authored-by: Qi Li <qili93@qq.com>
Co-authored-by: 王明冬 <78149749+winter-wang@users.noreply.github.com>
Co-authored-by: Feiyu Chan <chenfeiyu@baidu.com>
Co-authored-by: Xiaoxu Chen <chenxx_id@163.com>
Co-authored-by: Chenxiao Niu <ncxinhanzhong@gmail.com>
Co-authored-by: Zhou Wei <1183042833@qq.com>
Co-authored-by: JYChen <zoooo0820@qq.com>
Co-authored-by: YUNSHEN XIE <1084314248@qq.com>
Co-authored-by: niuliling123 <51102941+niuliling123@users.noreply.github.com>
Co-authored-by: zhangyikun02 <48021248+zhangyk0314@users.noreply.github.com>
Co-authored-by: huzhiqiang <912790387@qq.com>
Co-authored-by: jakpiase <jakpia21@gmail.com>
Co-authored-by: Piotr Paturej <piotr.paturej@intel.com>
Co-authored-by: zhaocaibei123 <48509226+zhaocaibei123@users.noreply.github.com>
Co-authored-by: freeliuzc <lzc842650834@gmail.com>
Co-authored-by: tianshuo78520a <707759223@qq.com>
Co-authored-by: zmxdream <zhangminxu01@baidu.com>
Co-authored-by: houj04 <35131887+houj04@users.noreply.github.com>
Co-authored-by: pangyoki <pangyoki@126.com>
Co-authored-by: lyq <30404405+affectionlu@users.noreply.github.com>
Co-authored-by: Zhong Hui <zhonghui.net@gmail.com>
Co-authored-by: fuyou765 <64373205+fuyou765@users.noreply.github.com>
Co-authored-by: Chen Weihang <chenweihang@baidu.com>
Co-authored-by: YuanRisheng <yuanrisheng@baidu.com>
Co-authored-by: zhaoyingli <86812880+zhaoyinglia@users.noreply.github.com>
Co-authored-by: ccrrong <101700995+ccrrong@users.noreply.github.com>
Co-authored-by: xiaoxiaohehe001 <49090790+xiaoxiaohehe001@users.noreply.github.com>
Co-authored-by: ykkk2333 <77383312+ykkk2333@users.noreply.github.com>
Co-authored-by: Li Min <11663212+limin2021@users.noreply.github.com>
Co-authored-by: Hui Zhang <zhtclz@foxmail.com>
Co-authored-by: ming1753 <61511741+ming1753@users.noreply.github.com>
Co-authored-by: cifar10 <41565156+cifar10@users.noreply.github.com>
Co-authored-by: fwenguang <95677191+fwenguang@users.noreply.github.com>
Co-authored-by: Aganlengzi <aganlengzi@gmail.com>
Co-authored-by: yuguo <948529990@qq.com>
Co-authored-by: Zhang Jun <ewalker@live.cn>
Co-authored-by: Wang Bojun <105858416+wwbitejotunn@users.noreply.github.com>
Co-authored-by: yangguohao <70266361+yangguohao@users.noreply.github.com>
Co-authored-by: Ligoml <39876205+Ligoml@users.noreply.github.com>
Co-authored-by: Lux et Veritas <1004239791@qq.com>
Co-authored-by: zhangbo9674 <82555433+zhangbo9674@users.noreply.github.com>
Co-authored-by: BiynXu <62832681+BiynXu@users.noreply.github.com>
Co-authored-by: ziyoujiyi <73728031+ziyoujiyi@users.noreply.github.com>
Co-authored-by: Zhen Wang <wangzhen31@baidu.com>
Co-authored-by: chenjian <chenjian26@baidu.com>
Co-authored-by: helen88 <z8hanghuan@126.com>
Co-authored-by: Yuang Liu <liuyuang@baidu.com>
Co-authored-by: qipengh <huangqipeng@cambricon.com>
Co-authored-by: shangliang Xu <ghostxsl@users.noreply.github.com>
Co-authored-by: Jiabin Yang <360788950@qq.com>
Co-authored-by: Lin Manhui <mhlin425@whu.edu.cn>
Co-authored-by: Bobholamovic <linmanhui@baidu.com>
Co-authored-by: LiYuRio <63526175+LiYuRio@users.noreply.github.com>
Co-authored-by: kuizhiqing <kuizhiqing@baidu.com>
Co-authored-by: Charles-hit <56987902+Charles-hit@users.noreply.github.com>
Co-authored-by: HongyuJia <jiahongyu@baidu.com>
Co-authored-by: heliqi <1101791222@qq.com>
Co-authored-by: Yulong Ao <aoyulong@baidu.com>
Co-authored-by: JZ-LIANG <jianzhongliang10@gmail.com>
Co-authored-by: thunder95 <290844930@qq.com>
Co-authored-by: Jacek Czaja <jacek.czaja@intel.com>
Co-authored-by: zhiboniu <31800336+zhiboniu@users.noreply.github.com>
Co-authored-by: Ainavo <57820731+Ainavo@users.noreply.github.com>
Co-authored-by: SigureMo <sigure.qaq@gmail.com>
Co-authored-by: Asthestarsfalll <72954905+Asthestarsfalll@users.noreply.github.com>
Co-authored-by: Wangzheee <634486483@qq.com>
Co-authored-by: Thomas Young <35565423+HexToString@users.noreply.github.com>
Co-authored-by: ShiningZhang <zhang_liang1991@126.com>
Co-authored-by: OccupyMars2025 <31559413+OccupyMars2025@users.noreply.github.com>
Co-authored-by: mrcangye <mrcangye@email.cn>
Co-authored-by: Roc <30228238+sljlp@users.noreply.github.com>
Co-authored-by: seemingwang <zsasuke@qq.com>
Co-authored-by: DesmonDay <908660116@qq.com>
Co-authored-by: seemingwang <seemingwang@users.noreply.github.com>
Co-authored-by: Thunderbrook <a754913769@163.com>
Co-authored-by: xuewujiao <105861147+xuewujiao@users.noreply.github.com>
Co-authored-by: root <root@yq01-sys-hic-k8s-v100-box-a225-0693.yq01.baidu.com>
Co-authored-by: Thunderbrook <52529258+Thunderbrook@users.noreply.github.com>
Co-authored-by: root <root@yq01-inf-hic-k8s-a100-ab2-0009.yq01.baidu.com>
Co-authored-by: huwei02 <53012141+huwei02@users.noreply.github.com>
Co-authored-by: yaoxuefeng <yaoxuefeng@baidu.com>
Co-authored-by: lxsbupt <luoxsbupt@163.com>
Co-authored-by: miaoli06 <106585574+miaoli06@users.noreply.github.com>
Co-authored-by: root <root@yq01-inf-hic-k8s-a100-ab2-0008.yq01.baidu.com>
Co-authored-by: chao9527 <33347532+chao9527@users.noreply.github.com>
Co-authored-by: qingshui <qshuihu@gmail.com>
Co-authored-by: yangjunchao <yangjunchao@baidu.com>
Co-authored-by: yang131313 <lisy928472889@163.com>
Co-authored-by: mengqingchun02 <103740521+mengqingchun02@users.noreply.github.com>
Co-authored-by: 熊峻峰 <xiongjunfeng@sina.com>
lxsbupt added a commit to lxsbupt/Paddle that referenced this pull request Dec 17, 2022
* fix python3.10 compile bug on Windows (PaddlePaddle#44330)

* Fix random seed for several unit tests (PaddlePaddle#44135)

* Fix test_functional_conv2d_transpose random seed

* Fix random seed and use np.testing

* Fix random seed for test_lu_unpack_op

* Fix test_autograd_functional_dynamic random seed

* Remove boost library (PaddlePaddle#44092)

* add fused token prune op and plugin (PaddlePaddle#44281)

* add fused token prune op and plugin

* Fix run inference bug for standalone executor (PaddlePaddle#44340)

* xpu-paddlepaddle-33 [Task] matmul unit test timeout (PaddlePaddle#44333)

test=kunlun

* [IPU] add custom-op UTs 0/N (PaddlePaddle#44328)

* add custom-op UTs 0

* add authors

Co-authored-by: Allen Guo <alleng@graphcore.ai>
Co-authored-by: Zhixin Yao <zhixiny@graphcore.ai>
Co-authored-by: Zhaorui Chen <zhaoruic@graphcore.ai>

Co-authored-by: Zhixin Yao <zhixiny@graphcore.ai>
Co-authored-by: Zhaorui Chen <zhaoruic@graphcore.ai>

* [IPU] add custom-op UTs 1/N (PaddlePaddle#44329)

* add custom-op UTs 1

* add authors

Co-authored-by: Allen Guo <alleng@graphcore.ai>
Co-authored-by: Zhixin Yao <zhixiny@graphcore.ai>
Co-authored-by: Zhaorui Chen <zhaoruic@graphcore.ai>

* update url

Co-authored-by: Zhixin Yao <zhixiny@graphcore.ai>
Co-authored-by: Zhaorui Chen <zhaoruic@graphcore.ai>

* support KL2 multi-card training, *test=kunlun (PaddlePaddle#43889)

* update xccl lib

* use separate streams for compute/comm on XPU

* add broadcast op to xpu2_op_list

* Remove auto to_pascal_case for args in op generator (PaddlePaddle#44350)

* remove auto to_pascal_case for args in op generator

* fix yaml config

* Standard sparse conv name (PaddlePaddle#44353)

* [Eager] eager variable back sync (PaddlePaddle#44343)

* eager variable back sync

* [ Phi Kernel ] Transfer as_real to phi. (PaddlePaddle#44263)

* transfer as_real to phi

* fix erros

* blocking: True -> False

* [Eager]Fix assert statement (PaddlePaddle#43492)

* Not rename pb file to avoid re-compile (PaddlePaddle#44370)

* [Phi] Migrate solve kernel to phi (PaddlePaddle#44363)

* draft version

* draft version

* draft version

* migrate solve kernel to phi

* polish

* polish

* re useless header file, fix a bug in grad_kernel_impl

* add header file in need
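
The kernel migration above does not change the Python-facing API; a minimal sketch of paddle.linalg.solve for a @ x = b:

    import paddle

    a = paddle.to_tensor([[3.0, 1.0], [1.0, 2.0]])
    b = paddle.to_tensor([9.0, 8.0])
    print(paddle.linalg.solve(a, b))  # [2.0, 3.0]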

* [auto parallel] remove comm init control (PaddlePaddle#44385)

* [CustomDevice] remove unused file (PaddlePaddle#44358)

* [Paddle-TRT] reshape fill_constant (PaddlePaddle#44314)

* reshape fill_constant

* commit

* commit

* set seed for uts (PaddlePaddle#44372)

* [Paddle-TRT] remove useless code in fc (PaddlePaddle#44382)

* remove useless code in fc

* [Paddle-TRT] Fix cast (PaddlePaddle#44312)

* fix_cast

* fix_cast

* commit

* Polish jit layer cmakelists to hide some message (PaddlePaddle#44351)

* Enable inference multi stream ci test (PaddlePaddle#44275)

* test

* update

* fix bug of old pp (PaddlePaddle#44361)

* add xpu resnet_unit (PaddlePaddle#44297)

* add xpu resnet_unit
*test=kunlun

* tmp
*test=kunlun

* add blacklist in prim2orig interface (PaddlePaddle#44383)

* [Plugin] Fix Custom device in eager mode, test=develop (PaddlePaddle#43952)

* [Plugin] Fix Custom device in eager mode, test=develop

* update test case, test=develop

* update ut for coverage, test=develop

* add ipu support for standalone executor.  (PaddlePaddle#44342)

* fix typos in template for codegen of operators (PaddlePaddle#44364)

* fix duplicate slice logic in _grad (PaddlePaddle#44396)

* [MLU] fix mlu ctest final. (PaddlePaddle#44404)

* fix data transform bug of interpolate op (PaddlePaddle#44401)

* [Sparse] Add sparse matmul kernel(coo*dense->dense) (PaddlePaddle#44346)

* fix new autodiff api docs (PaddlePaddle#44341)

* fix build error in low arch (PaddlePaddle#44391)

* [new api] add new api paddle.vision.ops.distribute_fpn_proposals (PaddlePaddle#43736)

* add distribute_fpn_proposals

* change to new dygraph

* fix doc and example code

* change fluid impl to current version

* update (PaddlePaddle#44418)

* [Paddle-TRT] Shape sum fix scale (PaddlePaddle#44394)

* shape sum

* add shape, sum trt layer

* [Phi] Migrate infermeta and add yaml for solve op (PaddlePaddle#44379)

* migrate solve kernel to phi

* re useless header file, fix a bug in grad_kernel_impl

* add header file in need

* add yaml for solve op

* fix solve_sig.cc ArgumentMapping and update tests case

* disable legacy dygraph check in op_test

* rm solve_op.cc / solve_sig.cc and migrate yaml config

* Update op_test.py

disable legacy dygraph check when check_eager is True

* add labels for infer ut (PaddlePaddle#44279)

* add labels for infer ut

* add RUN_TYPE=INFER for cpp ut

* fix format error

* update

* Add mfence for XPU2 KP (PaddlePaddle#44258)

* remove include of all.h in resnet_basic_block_op_xpu.cc, test=kunlun (PaddlePaddle#44423)

* Rename BOOST_GET macros (PaddlePaddle#44368)

* Rename BOOST_GET macros

* Fix conflicts

* [new API] add paddle.vision.ops.generate_proposals (PaddlePaddle#43611)

* add generate_proposals into paddle.vision

* remove class api

* im_info -> img_size

* change fluid impl to current version

* Accelerate inference period in op Cache method (PaddlePaddle#43857)

* Added pad3d and pad2d FP32 FWD oneDNN kernels (PaddlePaddle#43990)

* Piotrek's changes for pad3d

* my changes

* first version of pad3d, single copy, unnecessary reads

* optimized pad3d kernel

* test update

* removed magic numbers

* added support for pad2d

* reverted two files

* reverted one old change

* added support for Paddings tensor

* CI fix

* CI fix

* fixed timeout of tests

* fixed typo

* changes to GetKernelTypeForVar

* Revert "changes to GetKernelTypeForVar"

This reverts commit 4691061.

* added AsExtra() to pad2d

Co-authored-by: Piotr Paturej <piotr.paturej@intel.com>

* add save_cache/patch (PaddlePaddle#44420)

* add save_cache/patch

* add pybind

* remove pybind

* remove const_cast

* add fleet

* Standard name of sparse pool (PaddlePaddle#44344)

* move eig operator from fluid to phi (PaddlePaddle#44398)

* move eig operator from fluid to phi

* add eig_grad unittest, upgrade IsComplexType() from fluid to phi

* [Phi]Move angle op to phi (PaddlePaddle#44393)

* Move angle op to phi

* Replace mutable_data using Alloc

* Remove some include

* Try to fix windows ci error

* include math.h to fix windows ci error

* Fix kernel name

* Move angle_grad infershape

* [Eager]release gil when run backward (PaddlePaddle#44433)

* release gil when run backward

* compile phi/backends into one static library (PaddlePaddle#44373)

* compile into one static library

* fix xpu compile

* fix xpu compile

* fix inference compile

* fix inference compile

* add custom test

* revert one file

* [IPU] Add more Ops (PaddlePaddle#44414)

* [IPU] Add more Ops

* update boost API

* Clean CI_SKIP_CPP_TEST (PaddlePaddle#44412)

* Add dependency for read op in standalone executor (PaddlePaddle#44362)

* Add dependency for read op in standalone executor

* Fix CI errors

* Add UT

* add_dependency -> dependency_utils

* Fix CI errors

* Add distro in ci docker (PaddlePaddle#44332)

* add distro zstd

* test

* test

* add pip3.8

* [Phi] migrate as_complex kernel to phi (PaddlePaddle#44438)

* migrate as_complex kernel to phi

* support as_complex and as_real in phi

* rm GetExpectedKernelType for AsRealOp
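
A minimal sketch of the two reinterpret-style APIs migrated above: as_complex views a float tensor whose last dim is 2 as complex, and as_real views it back:

    import paddle

    x = paddle.to_tensor([[1.0, 2.0], [3.0, 4.0]])  # last dim: (real, imag)
    z = paddle.as_complex(x)  # [(1+2j), (3+4j)]
    r = paddle.as_real(z)     # back to a [2, 2] float tensor
    print(z, r.shape)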

* [GPUPS]FleetWrapper initialize (PaddlePaddle#44441)

* fix FleetWrapper initialize

* [XPU][NPU] (1) add device_guard. (2) add support for LoDTensorArray of sum op. (PaddlePaddle#44367)

* device_guard support xpu. test=kunlun

* sum op of xpu support LoDTensorArray. add test for while op of xpu. test=kunlun.

* [IPU] add Op uts (PaddlePaddle#44415)

* transfer block_id to CreateVarNode in multi_devices_graph_pass (PaddlePaddle#44366)

* fix CreateVarNode in multi_devices_graph_pass

* Revert "Fix var duplication bug for graph_to_program_pass (PaddlePaddle#44278)"

This reverts commit a2c4c86.

* 【GPUPS】Adam accessor (PaddlePaddle#43919)

* add adam/sharedadam optimizer for gpups; edit optimizer struct; test=develop

* [Phi] migrate sync_batch_norm to phi (PaddlePaddle#44369)

* [GPUPS]Fix psgpuwrapper initialization (PaddlePaddle#44468)

* Update ps_gpu_wrapper.h

* Update ps_gpu_wrapper.h

* Update ps_gpu_wrapper.cc

* [Phi] migrate exponential kernel to phi (PaddlePaddle#44376)

* [Phi] migrate exponential kernel to phi

* fix comment

* fix CI

* [PHI] move diag_embed op to phi. (PaddlePaddle#44408)

* move diag_embed to phi.
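
A minimal sketch of the diag_embed API moved above; it embeds a vector on the main diagonal of a square matrix:

    import paddle
    import paddle.nn.functional as F

    v = paddle.to_tensor([1.0, 2.0, 3.0])
    print(F.diag_embed(v))  # 3x3 matrix with v on the diagonal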

* [MLU] set_value performance optimizing (PaddlePaddle#44390)

* Update api changing approve members (PaddlePaddle#44463)

* update api approve members, test=document_fix

* add qingqnig into list, test=document_fix

* fix bug,test=document_fix (PaddlePaddle#44478)

* [Phi] migrate clip_by_norm to phi (PaddlePaddle#44458)

* add eigen3 dependency for phi_backends (PaddlePaddle#44479)

* remove fleet_13 ut in parallel_UT_rule.py; test=develop (PaddlePaddle#44477)

* [PHI]Separate xshape kernel from normal kernel (PaddlePaddle#44315)

* separate xshape kernel from normal kernel

* fix bugs in infermeta

* fix compile bugs

* fix compile bugs

* [AutoParallel] fix unittest with paddle.distributed.launch (PaddlePaddle#44439)

* fix unittest

* fix log_dir

* _enable_legacy_dygraph

* [Phi] add temporal_shift yaml (PaddlePaddle#44409)

* add temporal_shift yaml and unittest

* [Paddle inference] Add conv_fusion_fp16 (PaddlePaddle#44435)

* convfusionfp16

* convfusionfp16

* convfusionfp16

* fix some convert error found in tipc. (PaddlePaddle#44457)

* fix some error found in tipc.

* update

* [BugFix]Fix randint_like bugs when saving a program that doesn't need to use the tensor's value (PaddlePaddle#44446)

* fix bugs of random

* fix unittest error

* fix unittest bugs

* add adaptive pool and softmax with cross entropy supports different axis, *test=kunlun (PaddlePaddle#44428)

* add xpu pnorm op and fix pool op, *test=kunlun

* add adaptive pool, and softmax with cross entropy supports different axis, *test=kunlun

* add slot attr for push sparse op (PaddlePaddle#44422)

* add slot attr for push sparse op

* add pybind

* remove fleet

* add unittest

* fix

* [Dy2Sta]Fix Segment Fault while training multi-card if params have no grad (PaddlePaddle#44485)

* [Dy2Sta]Fix Segment Fault while training multi-card if params have no grad

* fix unittest

* fix tensor stream error in custom op (PaddlePaddle#44500)

* Replace with dygraph op calling method. (PaddlePaddle#44331)

* Replace with dygraph op calling method.

* [JitLayer]Pybind PEFunction and call phi api in layer_test (PaddlePaddle#44465)

* Support predictor function in JitLayer

* Pybind PEFunction

* Pybind PEFunction and call phi api in layer_test

* Call sqrt phi API

* Polish flags

* Fix comments

* [Sparse] Add sparse addmm kernel (dense+coo*dense->dense,dense+csr*dense->dense) (PaddlePaddle#44451)

* [Eager] bilinear_tensor_product yaml (PaddlePaddle#44459)

* bilinear_tensor_product yaml

* [ Phi ] svd transfer (PaddlePaddle#44392)

* svd cpu forward

* svd gpu forward

* transfer the backward of svd

* remove cusolver in svd_grad

* svd kernel bug fix

* fix bugs

* fix bugs.

* fix bug

* [Paddle-TRT] fix_fill_constant (PaddlePaddle#44481)

* fix_fill_constant

* fix_fill_constant

* fix_ernie

* [MLU] transpose avg_pool2d to NHWC for better performance. (PaddlePaddle#44475)

* [jit] jit support property.proto (PaddlePaddle#44337)

* add property.proto, can be compiled

* property get and deserialize

* support get float

* format code

* format code

* add unittest

* add more set method

* fix grammar error

* Update paddle/fluid/jit/property.h

Co-authored-by: Aurelius84 <zhangliujie@baidu.com>

* Update paddle/fluid/jit/property.cc

Co-authored-by: Aurelius84 <zhangliujie@baidu.com>

* Update paddle/fluid/jit/property.cc

Co-authored-by: Aurelius84 <zhangliujie@baidu.com>

* Update paddle/fluid/jit/property.cc

Co-authored-by: Aurelius84 <zhangliujie@baidu.com>

* fix comment

* fix error throw

* fix property save unit test

* fix error info

* fix copyright and header import

* reorder jit property tensor datatype

Co-authored-by: Aurelius84 <zhangliujie@baidu.com>

* [ Dy2static ] infer_program may be incorrect in amp mode. (PaddlePaddle#44487)

* fix the case where the outputs of the net are x, x

* add unittest for duplicate output

* fix

* fix _infer_program to use the original program, not the amp program.

* get _***program_id back and avoid duplicate caching

* fix

* Fc fp16 (PaddlePaddle#44505)

* fc support fp16

* add a ‘,’ on paddle_pass_builder.cc

* fc support fp16 on non-cuda.

* add batch stream (PaddlePaddle#44524)

* shufflechannelfix (PaddlePaddle#44516)

* fix arg_max to select first index (PaddlePaddle#44521)
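
A sketch of the tie-breaking behavior the fix above implies: with duplicated maxima, paddle.argmax should return the first matching index:

    import paddle

    x = paddle.to_tensor([1.0, 3.0, 3.0])
    print(paddle.argmax(x))  # 1: the first of the tied maxima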

* [MLU] add floor kernel and grid_sampler kernel (PaddlePaddle#44498)

* commit (PaddlePaddle#44534)

* [CustomDevice] register Copy for custom device (PaddlePaddle#44200)

* [CustomDevice] register Copy for custom device

* [CustomDevice] register Copy for custom device

* [CustomDevice] register Copy for custom device

* merge and add uts

* merge and add uts

* fix for blocking and unittests coverage

* (modified) fc support fp16 (PaddlePaddle#44540)

* Add code of occupancy computing on DCU and avoid threadID bug for DCU profiler (PaddlePaddle#44520)

* add xpu lars_momentum/pow2_decay (PaddlePaddle#44448)

*test=kunlun

* [phi] move inverse op from fluid to phi (PaddlePaddle#44471)

* move inverse from fluid to phi with unitest bug

* fix bug, add eager op yaml

* support send_partial, recv_partial and allgather_partial in ProcessGroupNCCL (PaddlePaddle#44444)

* [Sparse]add sparse unary api(expm1/deg2rad/rad2deg/relu6/leaky_relu) (PaddlePaddle#44432)

* Fc fp16 (PaddlePaddle#44558)

* (modified) fc support fp16

* __CUDA_ARCH__ version

* delete half

* delete half

* Fix bug of amp code-gen (PaddlePaddle#44570)

* fix bug of amp code_gen

* fix bug

* [JitLayer]Fix jit.save error when save params combined (PaddlePaddle#44504)

* Fix jit.save error when save params combined

* Change dict_value to list

* [Phi] Migrate squared_l2_norm_op to phi (PaddlePaddle#44492)

* add swish  using TensorRT layer (PaddlePaddle#44561)

* update

* empty commit

* update

* update

* update

* Phi gird sampler migration (PaddlePaddle#44562)

* add_ymal_utest for phi grid_sampler op

* skip dist test cases if mlu card number only one, test=develop (PaddlePaddle#44549)

* [dy2st]Add ProgramHelper to polish build program logic in autoparallel.Engine (PaddlePaddle#44513)

* [dy2st]Add ProgramHelper to polish build program logic in autoparallel.Engine

* refine code

* 【Hackathon No.21】为 Paddle 新增 SoftMarginLoss (PaddlePaddle#42364)

* 2022-04-28

* 2022-04-28_V2

* 2022-04-30

* 2022-04-30_V2

* 2022-05-01

* 2022-05-02

* 2022-05-02_V2

* 2022-05-05_V1

* 2022-05-06_V1

* 2022-05-07_V1

* Update loss.py

* 2022-05-07_V2

* 2022-05-13_V1

* Update test_soft_margin_loss.py

* Update loss.py

* Update loss.py

* 2022-05-16_V1

* 2022-05-19_V1

* 2022-05-20_V1

* Update test_soft_margin_loss.py

* 2022-06-01_V1

* 2022-06-05

* 2022-06-07

* 2022-06-07

* 2022-06-08

* 2022-06-08_V2

* 2022-06-17-code_style

* Modify python

* 2022-06-20

* for

* for CI;test=document_fix

Co-authored-by: Ligoml <39876205+Ligoml@users.noreply.github.com>

* [MLU]transpose convbpf output to HWCN for better performance (PaddlePaddle#44552)

* Fc fp16 (PaddlePaddle#44578)

* (modified) fc support fp16

* __CUDA_ARCH__ version

* delete half

* delete half

* add half support

* add half support

* add half support

* [Auto Parallel] Add dist op cost (PaddlePaddle#44146)

* update comp cost

* add dist default op cost

* add dist fill constant batch size like op cost

* add elewise op cost

* add fill_constant_batch_size_like op cost unittest

* add unittest and remove fill_constant_batch_size_like grad op cost

* add to cmakelist

* fix unittest bug

* Improve CI unittest parallel execution strategy (PaddlePaddle#44334)

* paralle_test_daily

* test=paralle_test_daily

* test=paralle_test_daily

* test=parallel_test_daily

* test=paralle_test_daily

* test=paralle_test_daily

* test=paralle_test_daily

* test=paralle_test_daily

* test=paralle_test_daily

* test=paralle_test_daily

* test=paralle_test_daily

* test=paralle_test_daily

* test pre_test_bak

* test cfs

* test_cfs,test=paralle_test_daily

* test_cfs,test=paralle_test_daily

* fix nightly test name,test=paralle_test_daily

* fix nightly test name,test=paralle_test_daily

* test ci parallel speed

* refine parallel rule,test=paralle_test_daily

* Move bmm OP from fluid to phi (PaddlePaddle#44496)

* [PHI]Move slogdeterminant op to phi (PaddlePaddle#44547)

* Move slogdeterminant op to phi

* Add yaml and unit test for slogdeterminant

* Rename pybind_boost_header.h (PaddlePaddle#44592)

* unify data type and property enum value (PaddlePaddle#44585)

* inference multi stream support handle lazy init. (PaddlePaddle#44563)

* multi stream support handle lazy init.

* support eigen lazy init

* update

* fix ci problem

* Remove ControlDepVar in GraphToBlock (PaddlePaddle#44591)

* transfer the svd infer into phi infermeta (PaddlePaddle#44528)

* transfer the svd infer into phi infermeta

* remove the svd.h

* modify svd api

* fix svd error by insert optional

* Einsum grad complex (PaddlePaddle#44598)

* add complex for einsum grad kernel

* pass the ci

* add reverse yaml (PaddlePaddle#44518)

* add reverse yaml

* Set more attrs in ReplaceScaleLossGradOp (PaddlePaddle#44576)

* Set more attrs in ReplaceScaleLossGradOp

* Fix typos

* Fix CI errors

* Add UT

* [Phi] Migrate box coder to phi. (PaddlePaddle#44550)

* fix behavior of device_id=None in Tensor.cuda (PaddlePaddle#44515)

* fix behavior of device_id=None in Tensor.cuda

* fix CI

* fix windows cuda11.7 bug (PaddlePaddle#44601)

* add  horizontal federation learning ps feature (PaddlePaddle#44327)

* back fl

* delete ssl cert

* .

* make warning

* .

* unittest paral degree

* solve unittest

* heter & multi cloud commm ready

* .

* .

* fl-ps v1.0

* .

* support N + N mode

* .

* .

* .

* .

* delete print

* .

* .

* .

* .

* fix bug

* .

* .

* fl-ps with coordinator ready

* merge dev

* update message parse only

* update fl client scheduler

* fix bug

* update multithreads sync

* fix ci errors

* update role_maker.py

* update role_maker.py

* fix ci error: windows py import error

* fix ci error: windows py import error

* fix windows ci pylib import error

* add dump fields & params

* try to fix windows import fleet error

* fix ps FLAGS error

* [MLU] rollback cntoolkit vetsion to 2.8.5 (PaddlePaddle#44595)

* [CustomDevice] add blas_axpby api for gradient_accumulator (PaddlePaddle#44584)

* add sin,cos,exp primitive operators (PaddlePaddle#44345)

* Optimize sparse convolution (PaddlePaddle#43576)

* Merge kProgramDescs in GraphToProgram (PaddlePaddle#44526)

* [Eager] Add warpctc yaml (PaddlePaddle#44617)

* Add a feed op before each input parameter var. (PaddlePaddle#44499)

* Add a feed op before each input parameter var.

* Fix some issues about the unit test build_cinn_pass_test.

* fix record event for operator type in new dygraph (PaddlePaddle#44582)

* fix new dygraph record event for op

* update unit test

* fix bug of elementwise_add_grad, *test=kunlun (PaddlePaddle#44545)

* fix bug of elementwise_add_grad, *test=kunlun

* fix bug, *test=kunlun

* rm pooling_t, *test=kunlun

* fix bug of ew_add_grad when inplace, *test=kunlun

* [IPU] small bug fix (PaddlePaddle#44473)

* sync misc changes

* add authors

Co-authored-by: Zhaorui Chen <zhaoruic@graphcore.ai>

* up x

* Revert "up x"

This reverts commit f3fde45.

* add guarg for ipu

Co-authored-by: Zhaorui Chen <zhaoruic@graphcore.ai>

* support auto fallback to  cpu kernel for cusom device (PaddlePaddle#44639)

* fix dygraph bugs in broadcast_to api. (PaddlePaddle#44612)

* add set_dtype for inverse_op (PaddlePaddle#44618)

* refine overalls.cmake (PaddlePaddle#44623)

* [PHI]Add yaml and unittest for bmm op (PaddlePaddle#44625)

Add yaml and unittest for bmm op

* Phi average accumulates migration (PaddlePaddle#44554)

* move average_accumulates op to phi kernel

* new exe not support pg (PaddlePaddle#44628)

* [CustomDevice]fix phi kernel header (PaddlePaddle#44637)

* [CustomDevice] add process_group_xccl ut (PaddlePaddle#44632)

* [CustomDevice] add process_group_xccl ut

* update

* Fix conv api name (PaddlePaddle#44636)

* [DCU] Fix NAN problem when training BERT on DUC platform (PaddlePaddle#44643)

* [JitLayer]Remove include fluid head files in JitLayer (PaddlePaddle#44597)

* Remove include fluid head files in JitLayer

* Format code

* Remove const to fix ci error

* Fix param error

* Polish jit layer include and cp some headers to python/include

* Fix comment

* [jit]  jit.save support property serialization (PaddlePaddle#44581)

* jit.save support peropty serilization

* extract set property function

* fix property test file name

* fix typing error

* fix typing error

* fix test coverage

* Replaced add_custom_command with add_custom_target in xpu_kp_cmake (PaddlePaddle#44619)

* Replaced add_custom_command with add_custom_target in xpu_kp_cmake

* add adagrad and rmsprop yaml (PaddlePaddle#44631)

* [phi] move crop_tensor kernel from fluid to phi (PaddlePaddle#44574)

* move crop_tensor from fluid to phi

* delete fluid header files

* fix crop_tensor_op dygraph_mode bug

* modify header files, add out tensor check

* fix RemoveIntermediateOut in fuse_elewise_add_act_pass while converting graph to program (PaddlePaddle#44593)

* fix RemoveNode in fuse_elewise_add_act_pass

* fix

* change pointer to share_ptr

* fix

* fix

* fix format

* fix

* fix graph_safe_remove_nodes

* fix UTs on physical ipu (PaddlePaddle#44647)

* [IPU] add more loss ops  (PaddlePaddle#44646)

* add more loss ops

* add authors

Co-authored-by: Zhaorui Chen <zhaoruic@graphcore.ai>

Co-authored-by: Zhaorui Chen <zhaoruic@graphcore.ai>

* add g_ipuplace_pytype (PaddlePaddle#44648)

* Strided slice fp16 (PaddlePaddle#44653)

* [MLU]fix sync_batch_norm and concat_grad op (PaddlePaddle#44586)

* retain dist op returns (PaddlePaddle#44634)

* xpu unittest grad compute supports more types, *test=kunlun (PaddlePaddle#44606)

* [Eager] Add hierarchical_sigmoid yaml (PaddlePaddle#44638)

* add matrix_nms in python/paddle/vision/ops.py (PaddlePaddle#44357)

* [auto parallel] bug fix for op has sub_block attr created with copy_from (PaddlePaddle#44664)

* Change the way to set attributes for grad op maker (PaddlePaddle#44514)

* fix typos in template for codegen of operators
* change the way to set attributes for grad op maker

* [XPU] add top_k op (PaddlePaddle#44656)

* [XPU] add top_k op. test=kunlun

* [XPU] add top_k op. test=kunlun

* use PADDLE_ENFORCE_XDNN_NOT_NULL to check pointer. test=kunlun

* Support broadcast tensor in phi system (PaddlePaddle#44590)

* [PHI] Move spectral_norm to phi (PaddlePaddle#44577)

* Add kernel declarations

* Copy kernel implementation code

* Transfer implementation code

* Fix: Move out_grad to first

* Register new kernels

* Remove old kernels

* Move out_grad to last

* Fix bugs

* Transfer infermeta

* Add yaml files

* Add blank line

* Fix code style

* Optimize directory structure

Co-authored-by: Bobholamovic <linmanhui@baidu.com>

* Complete the dtypes for all_gather, add all_gather_object api (PaddlePaddle#44417)

* [Eager] refactor general_grad and fix some bugs (PaddlePaddle#44611)

* refactor general_grad and fix some bugs

* add TODO: support prune logic deeper

* support log_grad op, *test=kunlun (PaddlePaddle#44662)

* [LAUNCH] add distributed launch check tools (PaddlePaddle#44495)

* add launch test

* launch test for cpu

* bs 1

* Move api(lgamma) from legacy_api.yaml to api.yaml (PaddlePaddle#44355)

* Move api(lgamma) from legacy_api.yaml to api.yaml

* Move api(lgamma) from legacy_api.yaml to api.yaml

* Move api(lgamma) from legacy_api.yaml to api.yaml

* modify code style

* add x to X mapping

* add definition of lgamma

* delete redundant lgamma definitions

* Modify code comments

* Modify ops.py code format

* add lgamma  single test and lgamma api in fluid

* Optimized lgamma unittest

* Move frame kernel to phi (PaddlePaddle#44615)

* Move frame OP to phi、add frame OP yaml config and supplement single test

* add Header file of in_dygraph_mode

* Modify variable name and FrameGradInferMeta multiplex UnchangedInferMeta

* move seq2col to phi

* delete elementwise pow in xpu_kp_list (PaddlePaddle#44661)

* [MLU] fix log_softmax mode selection. (PaddlePaddle#44669)

* adapt for resnet (PaddlePaddle#44685)

* Fix some problem of kernel fallback in C++ API (PaddlePaddle#44681)

* support auto fallback to  cpu kernel for cusom device

* fix some problem of kernel fallback

* fix bugs of lstsq (PaddlePaddle#44689)

* migrate dirichlet kernel to phi (PaddlePaddle#44434)

* migrate dirichlet op kernel to phi

* fix dirichlet sample memory leak

* [phi]move softsign from fluid to phi (PaddlePaddle#44616)

* test_activation_op unitest error, yaml & activation.py in_dygraph_mode incomplete

* fix test_activation_op unitest error, add yaml and dygraph test

* fix code style with pre-commit

* try to fix namespace error of abs in activation_functor.h

* fix namespace error of abs

* [Paddle Inference] Support depthwise_conv2d fp16. (PaddlePaddle#44642)

* depthwise_fp16

* depthwise_fp16

* depthwise_fp16

* depthwise_fp16

* fix logging debug level (PaddlePaddle#44684)

* back fl

* delete ssl cert

* .

* make warning

* .

* unittest paral degree

* solve unittest

* heter & multi cloud commm ready

* .

* .

* fl-ps v1.0

* .

* support N + N mode

* .

* .

* .

* .

* delete print

* .

* .

* .

* .

* fix bug

* .

* .

* fl-ps with coordinator ready

* merge dev

* update message parse only

* update fl client scheduler

* fix bug

* update multithreads sync

* fix ci errors

* update role_maker.py

* update role_maker.py

* fix ci error: windows py import error

* fix ci error: windows py import error

* fix windows ci pylib import error

* add dump fields & params

* try to fix windows import fleet error

* fix ps FLAGS error

* fix logging risk

* fix logging possible risk

* Skip CUDA Graph case for standalone executor (PaddlePaddle#44693)

* [Eager] fix lerp grad kernel logic (PaddlePaddle#44705)

* clone ort_predictor reuse session (PaddlePaddle#44703)

* [XPU] add sampling_id op, add top_k op, update xdnn api. test=kunlun (PaddlePaddle#44704)

* fused_fc_elementwise_layernorm_op support fp16 (PaddlePaddle#44710)

* fused_fc_elementwise_layernorm support fp16

* fused_fc_elementwise_layernorm support double

* [Phi] Add yaml for assign_value (PaddlePaddle#44596)

* [Phi] Add yaml for assign_value

* [Phi] Fix the bug of the assign api and modify the unittest

* [Phi] Fix the bug when the tensor does not have the backend info

* [Phi] Replace the functional-style cast init by the brace-init

* [Phi] Cast the data explicitly

* [PHI] Move lu to phi  (PaddlePaddle#44605)

* Add kernel declarations

* Copy kernel implementation code

* Transfer implementation code

* Register new kernels

* Remove old kernels

* Fix code style

* Fix bugs

* mutable_data->HostAlloc

* Transfer infermeta

* Add yaml and update python api

* Add PADDLE_WITH_HIP check

* Update unittests

* Fix bugs

* Fix bugs

* Optimize directory structure

* Add output checks

* lu_impl.h->lu_kernel_impl.h

Co-authored-by: Bobholamovic <linmanhui@baidu.com>

* [MLU] add pytest for mlu strided_slice kernel (PaddlePaddle#44523)

* Support backward final hook (PaddlePaddle#44686)

* update to sdk2.6.0 (PaddlePaddle#44673)

* move CUDAStream to phi (PaddlePaddle#44529)

* init

* move CUDAStream to phi

* fix compilation

* merge develop

* add stream_owned_ member

* split cuda_stream.h

* fix cpu compile

* fix constructor

* fix bug

* fix windows compile

* fix inference test_levit

* fix windows tests

* [Auto parallel] Optimization Tuning (PaddlePaddle#43782)

* fixed bug for pass & engine

* fixed bug for benchmark GPT-3

* add tuner & profiler

* add algorithms & config

* skip cast trt convert when input dtype is bool (PaddlePaddle#44716)

* skip cast trt convert when input dtype is bool

* [LAUNCH] fix set args bug (PaddlePaddle#44717)

* Phi softplus migration (PaddlePaddle#44542)

* add yaml and utests of phi softplus

add yaml of softplus

fix softplus bug in phi

* update utests

* bug fix

* bug fix for test_layers

* layer api match

* match def and doc in ops.py

* doc polish

* fix unwanted modified of thresholded_relu

* style imporve

* 【PaddlePaddle Hackathon 3 No.15】为 Paddle 新增 count_nonzero (PaddlePaddle#44169)

* add count_nonzero api

* remove grad test

* [WIP] Matmul v1 & v2 unification -- part 1 (PaddlePaddle#44640)

* - Unit tests to be debugged

- fix

- refactor

- diagnostic

- more diagnostic

- fix

- Fix number two

- fix

- fix

- fix

- alpha added

- more fixes

- compilation fix

- removed diagnostic code

- cosmetic fixes

* lint

* add FLAGS_enable_api_kernel_fallback (PaddlePaddle#44706)

* add FLAGS_enable_api_kernel_fallback

* deal with more cases

* add ut for coverage

* phi_multiclass_nms3 (PaddlePaddle#44613)

* add some fp16 op for kunlun resnet50 model (PaddlePaddle#44672)

* add some fp16 op for kunlun resnet50 model
*test=kunlun

* tmp
*test=kunlun

* add dist op costs (PaddlePaddle#44701)

* [API/OP] Migrate Lstsq op into phi (PaddlePaddle#44318)

* migrate lstsq op

* update

* fix bugs for CIs

* update

* fix bugs

* add uts

* update

* update

* update

* fix bugs of jip

* fix bugs of hip

* update

* update according to review

* update

* update

* update

* update

* Add sparse SyncBatchNorm (PaddlePaddle#43520)

* add sparse SyncBatchNorm

* unify fluid::CUDADeviceContext and phi::GpuContext (PaddlePaddle#44723)

* remove cudaDeviceContext

* remove more template

* fix rocm compile

* 【PaddlePaddle Hackathon 3 No.12】为 Paddle 新增 pairwise_distance (PaddlePaddle#44161)

* add paddle.nn.functional.pairwise_distance (cattidea#273)
* remove the test case for undefined behavior

Co-authored-by: SigureMo <sigure.qaq@gmail.com>

* Phi prior box (PaddlePaddle#44431)

* phi_prior_box

* add float[] support

* phi_prior_box_optest

* update

* ort backend support output mutable data (PaddlePaddle#44724)

* [PHI] Move lu_unpack to phi (PaddlePaddle#44674)

* Add kernel declarations

* Copy kernel implementation code

* Transfer implementation code

* Register new kernels

* Remove old kernels

* Fix code style

* Fix bugs

* mutable_data->HostAlloc

* Transfer infermeta

* Add yaml and update python api

* Add PADDLE_WITH_HIP check

* Update unittests

* Add kernel declarations

* Copy kernel implementation code

* Transfer kernel implementation code

* Register new kernels

* Remove old kernels

* Add lu_unpack_sig

* Fix bugs

* Fix bugs

* Fix bugs

* Optimize directory structure

* Add output checks

* Update include files

* lu_impl.h->lu_kernel_impl.h

* Transfer infermeta

* Add yaml and update python api

* Add check_eager

Co-authored-by: Bobholamovic <linmanhui@baidu.com>

* update document of quantile and nanquantile; test=document_fix (PaddlePaddle#42413)
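
For reference, a short sketch of the two documented APIs (linear interpolation between data points is the documented default; treat other options as assumptions):

```python
import paddle

x = paddle.to_tensor([0., 1., 2., 3.])
x_nan = paddle.to_tensor([0., 1., 2., 3., float('nan')])
print(paddle.quantile(x, q=0.5))         # median -> 1.5
print(paddle.nanquantile(x_nan, q=0.5))  # ignores the NaN -> 1.5
```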

* migrate reduce_amin,reduce_amax kernel to phi (PaddlePaddle#44698)

* [Paddle Inference] add varlen_token_prune plugin, pass, convert (PaddlePaddle#44733)

* add varlen_token_prune plugin, pass, convert

* support build with Ninja on Linux (PaddlePaddle#44210)

* support ninja

* fix mkldnn on windows

* fix mkldnn on windows up1

* up2

* up3

* fix gflags

* BUILD_BYPRODUCTS_OPTION -> BUILD_BYPRODUCTS_ARGS

* use CMAKE_COMMAND

* up x

* migrate overlap_add and overlap_add_grad op (PaddlePaddle#44739)

* update code format

* add yaml and test

* update for comments

* Fix to CI (PaddlePaddle#44744)

* - fix

* - another fix

* lint

* infer context fix place error. (PaddlePaddle#44726)

* infer context fix place error.

* update

* update

* [operator migration] Migrate unstack_op and nms_op (PaddlePaddle#44424)

* update unstack_op

* update unstack_op

* update unstack_op

* fix unstack test

* update unstack

* update with remote

* fix unstack_test.py

* temp_save_change_nms_op

* add nms test

* update nms fix

* update unstack_op

* temp save change

* finish fix nms_op

* pass nms test

* fix CI

* fix ops test

* save change

* fix code style

* fix code style

* fix ci and codestyle

* fix ci

Co-authored-by: ShiningZhang <zhang_liang1991@126.com>

* Update linalg.py (PaddlePaddle#44347)

* Fix test and doc (PaddlePaddle#44735)

* fix test and doc

* fix all_gather_object with various length, test=allcases (PaddlePaddle#44718)

* update manipulation.py paddle.moveaxis (PaddlePaddle#44191)
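
A minimal sketch of the documented API (signature per the public docs):

```python
import paddle

x = paddle.ones([2, 3, 4])
# move axis 0 to the last position; the remaining axes keep their order
y = paddle.moveaxis(x, source=0, destination=-1)
print(y.shape)  # [3, 4, 2]
```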

* [CI] CI for Distributed (PaddlePaddle#44085)

* generate_unify_header supports excludes (PaddlePaddle#44761)

* [JitLayer] Polish PEFunction to speed up JitLayer and fix memory leak (PaddlePaddle#44738)

* Polish PEFunction to speed up JitLayer

* Polish PEFunction code

* Fix comments

* paddle2onnx update version to 1.0.0rc2 (PaddlePaddle#44759)

* set parallel_job according to CUDA memory in Windows CI unittest (PaddlePaddle#44695)

* set parallel_job according to CUDA memory

* fix bug: add whitespace between content and [] or the condition won't work

* [Sparse] optimize sparse attention (PaddlePaddle#44743)

* GPUGraph merge to develop (PaddlePaddle#44594)

Co-authored-by: seemingwang <zsasuke@qq.com>
Co-authored-by: DesmonDay <908660116@qq.com>
Co-authored-by: seemingwang <seemingwang@users.noreply.github.com>
Co-authored-by: Thunderbrook <a754913769@163.com>
Co-authored-by: xuewujiao <105861147+xuewujiao@users.noreply.github.com>
Co-authored-by: root <root@yq01-sys-hic-k8s-v100-box-a225-0693.yq01.baidu.com>
Co-authored-by: Thunderbrook <52529258+Thunderbrook@users.noreply.github.com>
Co-authored-by: root <root@yq01-inf-hic-k8s-a100-ab2-0009.yq01.baidu.com>
Co-authored-by: huwei02 <53012141+huwei02@users.noreply.github.com>
Co-authored-by: yaoxuefeng <yaoxuefeng@baidu.com>
Co-authored-by: lxsbupt <luoxsbupt@163.com>
Co-authored-by: miaoli06 <106585574+miaoli06@users.noreply.github.com>
Co-authored-by: root <root@yq01-inf-hic-k8s-a100-ab2-0008.yq01.baidu.com>
Co-authored-by: chao9527 <33347532+chao9527@users.noreply.github.com>
Co-authored-by: qingshui <qshuihu@gmail.com>
Co-authored-by: yangjunchao <yangjunchao@baidu.com>

* Revert for cmake static library errors on XPU KP (PaddlePaddle#44762)

* unify gpu context (PaddlePaddle#44740)

* remove cudaDeviceContext

* remove more template

* fix rocm compile

* remove alias name CUDADeviceContext

* fix compile

* fix tests

* revert changes

* API doc(en) bug fixes from the 4th-round experience evaluation (PaddlePaddle#44749)

* fix docs(en) bugs;test=document_fix

* update paddle.add docs;test=document_fix

* update paddle.where docs;test=document_fix

* for ci;test=document_fix

* Update manipulation.py

* update paddle.where;test=document_fix

Co-authored-by: Ligoml <39876205+Ligoml@users.noreply.github.com>
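
Since the block above touches the `paddle.where` docs, a minimal sketch of that API for context (three-argument form as documented):

```python
import paddle

x = paddle.to_tensor([0.9, 0.2, 0.8])
y = paddle.to_tensor([1.0, 1.0, 1.0])
# take elements from x where the condition holds, otherwise from y
print(paddle.where(x > 0.5, x, y))  # -> [0.9, 1.0, 0.8]
```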

* Correct the example output annotation in the lerp function docs (PaddlePaddle#44035)
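
For reference, a minimal sketch of `paddle.lerp`, whose documented example output the commit above corrects:

```python
import paddle

x = paddle.to_tensor([1., 2., 3.])
y = paddle.to_tensor([10., 10., 10.])
# elementwise linear interpolation: x + weight * (y - x)
print(paddle.lerp(x, y, 0.5))  # -> [5.5, 6.0, 6.5]
```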

* Refactor build_op_downstream_map for standalone executor (PaddlePaddle#44729)

* Refactor build_op_downstream_map for standalone executor

* Add some comments

* update xpu.cmake to 20220731, test=kunlun (PaddlePaddle#44767)

* fix ut new_group_api (PaddlePaddle#44764)

* support beam_search operator on xpu. test=kunlun (PaddlePaddle#44720)

* support beam_search operator on xpu. test=kunlun

* support beam_search operator on xpu. test=kunlun

* support beam_search operator on xpu. test=kunlun

* support beam_search operator on xpu. test=kunlun

* support beam_search operator on xpu. test=kunlun

* [phi] add yolov3_loss yaml and unittest (PaddlePaddle#44476)

* add yaml and unittest

* update yaml

* update backward yaml and unittest

* update yaml

* add Yolov3LossGradInferMeta

* update yolov3_loss_op.cc

* fix bug

* code format

* Update manipulation.py for rot90() (PaddlePaddle#44038)
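
A minimal sketch of `paddle.rot90`, whose docs the commit above updates (NumPy-compatible semantics assumed):

```python
import paddle

x = paddle.to_tensor([[1, 2], [3, 4]])
# one 90-degree counter-clockwise rotation in the plane of axes [0, 1]
print(paddle.rot90(x, k=1, axes=[0, 1]))  # -> [[2, 4], [1, 3]]
```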

* fix compile error;test=develop

* fix compile error;test=develop

* fix compile;test=develop

Co-authored-by: Sing_chan <51314274+betterpig@users.noreply.github.com>
Co-authored-by: zlsh80826 <rewang@nvidia.com>
Co-authored-by: Ruibiao Chen <chenruibiao@baidu.com>
Co-authored-by: RichardWooSJTU <37864677+RichardWooSJTU@users.noreply.github.com>
Co-authored-by: taixiurong <taixiurong@126.com>
Co-authored-by: Allen Guo <alleng@graphcore.ai>
Co-authored-by: Zhixin Yao <zhixiny@graphcore.ai>
Co-authored-by: Zhaorui Chen <zhaoruic@graphcore.ai>
Co-authored-by: zhangxiaoci <zhangxiaoci@baidu.com>
Co-authored-by: zyfncg <zhangyunfei07@baidu.com>
Co-authored-by: zhangkaihuo <zhangkaihuo@baidu.com>
Co-authored-by: wanghuancoder <wanghuan29@baidu.com>
Co-authored-by: xiongkun <xiongkun03@baidu.com>
Co-authored-by: Aurelius84 <zhangliujie@baidu.com>
Co-authored-by: Leo Chen <chenqiuliang@baidu.com>
Co-authored-by: Weilong Wu <veyron_wu@163.com>
Co-authored-by: caozhou <48191911+Caozhou1995@users.noreply.github.com>
Co-authored-by: ronnywang <ronny1996@163.com>
Co-authored-by: zhoutianzi666 <39978853+zhoutianzi666@users.noreply.github.com>
Co-authored-by: Haohongxiang <86215757+haohongxiang@users.noreply.github.com>
Co-authored-by: WangZhen <23097963+0x45f@users.noreply.github.com>
Co-authored-by: Wilber <jiweibo@baidu.com>
Co-authored-by: ShenLiang <1422485404@qq.com>
Co-authored-by: QingshuChen <chenqingshu@baidu.com>
Co-authored-by: levi131 <83750468+levi131@users.noreply.github.com>
Co-authored-by: Qi Li <qili93@qq.com>
Co-authored-by: 王明冬 <78149749+winter-wang@users.noreply.github.com>
Co-authored-by: Feiyu Chan <chenfeiyu@baidu.com>
Co-authored-by: Xiaoxu Chen <chenxx_id@163.com>
Co-authored-by: Chenxiao Niu <ncxinhanzhong@gmail.com>
Co-authored-by: Zhou Wei <1183042833@qq.com>
Co-authored-by: JYChen <zoooo0820@qq.com>
Co-authored-by: YUNSHEN XIE <1084314248@qq.com>
Co-authored-by: niuliling123 <51102941+niuliling123@users.noreply.github.com>
Co-authored-by: zhangyikun02 <48021248+zhangyk0314@users.noreply.github.com>
Co-authored-by: huzhiqiang <912790387@qq.com>
Co-authored-by: jakpiase <jakpia21@gmail.com>
Co-authored-by: Piotr Paturej <piotr.paturej@intel.com>
Co-authored-by: zhaocaibei123 <48509226+zhaocaibei123@users.noreply.github.com>
Co-authored-by: freeliuzc <lzc842650834@gmail.com>
Co-authored-by: tianshuo78520a <707759223@qq.com>
Co-authored-by: zmxdream <zhangminxu01@baidu.com>
Co-authored-by: houj04 <35131887+houj04@users.noreply.github.com>
Co-authored-by: pangyoki <pangyoki@126.com>
Co-authored-by: lyq <30404405+affectionlu@users.noreply.github.com>
Co-authored-by: Zhong Hui <zhonghui.net@gmail.com>
Co-authored-by: fuyou765 <64373205+fuyou765@users.noreply.github.com>
Co-authored-by: Chen Weihang <chenweihang@baidu.com>
Co-authored-by: YuanRisheng <yuanrisheng@baidu.com>
Co-authored-by: zhaoyingli <86812880+zhaoyinglia@users.noreply.github.com>
Co-authored-by: ccrrong <101700995+ccrrong@users.noreply.github.com>
Co-authored-by: xiaoxiaohehe001 <49090790+xiaoxiaohehe001@users.noreply.github.com>
Co-authored-by: ykkk2333 <77383312+ykkk2333@users.noreply.github.com>
Co-authored-by: Li Min <11663212+limin2021@users.noreply.github.com>
Co-authored-by: Hui Zhang <zhtclz@foxmail.com>
Co-authored-by: ming1753 <61511741+ming1753@users.noreply.github.com>
Co-authored-by: cifar10 <41565156+cifar10@users.noreply.github.com>
Co-authored-by: fwenguang <95677191+fwenguang@users.noreply.github.com>
Co-authored-by: Aganlengzi <aganlengzi@gmail.com>
Co-authored-by: yuguo <948529990@qq.com>
Co-authored-by: Zhang Jun <ewalker@live.cn>
Co-authored-by: Wang Bojun <105858416+wwbitejotunn@users.noreply.github.com>
Co-authored-by: yangguohao <70266361+yangguohao@users.noreply.github.com>
Co-authored-by: Ligoml <39876205+Ligoml@users.noreply.github.com>
Co-authored-by: Lux et Veritas <1004239791@qq.com>
Co-authored-by: zhangbo9674 <82555433+zhangbo9674@users.noreply.github.com>
Co-authored-by: BiynXu <62832681+BiynXu@users.noreply.github.com>
Co-authored-by: ziyoujiyi <73728031+ziyoujiyi@users.noreply.github.com>
Co-authored-by: Zhen Wang <wangzhen31@baidu.com>
Co-authored-by: chenjian <chenjian26@baidu.com>
Co-authored-by: helen88 <z8hanghuan@126.com>
Co-authored-by: Yuang Liu <liuyuang@baidu.com>
Co-authored-by: qipengh <huangqipeng@cambricon.com>
Co-authored-by: shangliang Xu <ghostxsl@users.noreply.github.com>
Co-authored-by: Jiabin Yang <360788950@qq.com>
Co-authored-by: Lin Manhui <mhlin425@whu.edu.cn>
Co-authored-by: Bobholamovic <linmanhui@baidu.com>
Co-authored-by: LiYuRio <63526175+LiYuRio@users.noreply.github.com>
Co-authored-by: kuizhiqing <kuizhiqing@baidu.com>
Co-authored-by: Charles-hit <56987902+Charles-hit@users.noreply.github.com>
Co-authored-by: HongyuJia <jiahongyu@baidu.com>
Co-authored-by: heliqi <1101791222@qq.com>
Co-authored-by: Yulong Ao <aoyulong@baidu.com>
Co-authored-by: JZ-LIANG <jianzhongliang10@gmail.com>
Co-authored-by: thunder95 <290844930@qq.com>
Co-authored-by: Jacek Czaja <jacek.czaja@intel.com>
Co-authored-by: zhiboniu <31800336+zhiboniu@users.noreply.github.com>
Co-authored-by: Ainavo <57820731+Ainavo@users.noreply.github.com>
Co-authored-by: SigureMo <sigure.qaq@gmail.com>
Co-authored-by: Asthestarsfalll <72954905+Asthestarsfalll@users.noreply.github.com>
Co-authored-by: Wangzheee <634486483@qq.com>
Co-authored-by: Thomas Young <35565423+HexToString@users.noreply.github.com>
Co-authored-by: ShiningZhang <zhang_liang1991@126.com>
Co-authored-by: OccupyMars2025 <31559413+OccupyMars2025@users.noreply.github.com>
Co-authored-by: mrcangye <mrcangye@email.cn>
Co-authored-by: Roc <30228238+sljlp@users.noreply.github.com>
Co-authored-by: seemingwang <zsasuke@qq.com>
Co-authored-by: DesmonDay <908660116@qq.com>
Co-authored-by: seemingwang <seemingwang@users.noreply.github.com>
Co-authored-by: Thunderbrook <a754913769@163.com>
Co-authored-by: xuewujiao <105861147+xuewujiao@users.noreply.github.com>
Co-authored-by: root <root@yq01-sys-hic-k8s-v100-box-a225-0693.yq01.baidu.com>
Co-authored-by: Thunderbrook <52529258+Thunderbrook@users.noreply.github.com>
Co-authored-by: root <root@yq01-inf-hic-k8s-a100-ab2-0009.yq01.baidu.com>
Co-authored-by: huwei02 <53012141+huwei02@users.noreply.github.com>
Co-authored-by: yaoxuefeng <yaoxuefeng@baidu.com>
Co-authored-by: lxsbupt <luoxsbupt@163.com>
Co-authored-by: miaoli06 <106585574+miaoli06@users.noreply.github.com>
Co-authored-by: root <root@yq01-inf-hic-k8s-a100-ab2-0008.yq01.baidu.com>
Co-authored-by: chao9527 <33347532+chao9527@users.noreply.github.com>
Co-authored-by: qingshui <qshuihu@gmail.com>
Co-authored-by: yangjunchao <yangjunchao@baidu.com>
Co-authored-by: yang131313 <lisy928472889@163.com>
Co-authored-by: mengqingchun02 <103740521+mengqingchun02@users.noreply.github.com>
Co-authored-by: 熊峻峰 <xiongjunfeng@sina.com>