Enable BF16 on Paddle Parameter Server Distributed Training #30560

Closed
seiriosPlus opened this issue Jan 19, 2021 · 20 comments

@seiriosPlus
Collaborator

seiriosPlus commented Jan 19, 2021

Paddle's distributed parameter-server training currently targets recommendation scenarios dominated by very large data volumes and shallow models; training typically runs on dozens to hundreds of high-performance CPU servers.

Recommendation scenarios generally use embedding layers to represent user features, with scales ranging from tens of millions to hundreds of billions of entries. This causes drastic memory consumption on the PServer side (possibly tens of TB of memory), and the fetch/update speed of sparse parameters on the PServer side is also one of the bottlenecks of the whole training.

We expect to use BF16 to reduce memory usage on the PServer side, and also to use BF16 to speed up the fetch/update of sparse parameters on the PServer side.
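As a rough illustration of the memory motivation above (the feature count and embedding width below are assumed for the sake of the example, not taken from a real model), storing the sparse embedding table in BF16 instead of FP32 halves the bytes per value:

sparse_feature_number = 10 ** 10   # 10 billion sparse features (assumed)
embedding_dim = 300                # embedding width (assumed)

bytes_fp32 = sparse_feature_number * embedding_dim * 4   # 4 bytes per fp32 value
bytes_bf16 = sparse_feature_number * embedding_dim * 2   # 2 bytes per bf16 value

print("fp32 table: %.1f TiB" % (bytes_fp32 / 2 ** 40))   # ~10.9 TiB
print("bf16 table: %.1f TiB" % (bytes_bf16 / 2 ** 40))   # ~5.5 TiB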

@paddle-bot-old

Hi! We've received your issue; please be patient while we arrange technicians to answer your questions as soon as possible. Please check again that you have provided a clear problem description, reproduction code, environment & version, and error messages. You may also look for answers in the official API docs, FAQ, historical issues, and the AI community. Have a nice day!

@luotao1
Contributor

luotao1 commented Jan 26, 2021

We don't have a BF16 optimizer or training interface yet. But you can refer to the static fp16 training interface: https://github.com/PaddlePaddle/models/blob/release/2.0-beta/PaddleNLP/benchmark/bert/run_pretrain_single.py#L241

# Ops in the custom white list are forced to run in fp16; everything else falls back to fp32.
amp_list = paddle.fluid.contrib.mixed_precision.AutoMixedPrecisionLists(
            custom_white_list=['layer_norm', 'softmax', 'gelu'])
# Wrap the optimizer so that fp16 casts and dynamic loss scaling are inserted automatically.
optimizer = paddle.fluid.contrib.mixed_precision.decorate(
            optimizer,
            amp_list,
            init_loss_scaling=args.scale_loss,
            use_dynamic_loss_scaling=True)

Besides, maybe you should provide an enable_mkldnn interface instead of the global FLAGS_use_mkldnn variable (see #27935)
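For reference, a minimal sketch of what such an explicit switch looks like on the inference side, using the paddle.inference.Config API (the model paths below are placeholders); the point is that oneDNN is enabled per config object rather than through a process-wide flag:

import paddle.inference as paddle_infer

# Per-config switch instead of a process-wide FLAGS_use_mkldnn flag.
config = paddle_infer.Config("./inference_model/__model__",
                             "./inference_model/__params__")  # placeholder paths
config.enable_mkldnn()                       # turn on oneDNN (MKL-DNN) kernels for this predictor only
config.set_cpu_math_library_num_threads(4)   # CPU math threads for this predictor

predictor = paddle_infer.create_predictor(config)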

@lidanqing-intel
Contributor

Luotao means that we had better not use global variables anymore.

@lidanqing-intel
Contributor

lidanqing-intel commented Jan 27, 2021

Hi, @luotao1 What compile options are needed to run the recommender models? I get this error when I run train.py:

  File "/home/li/repo/Paddle/build/python/paddle/distributed/fleet/base/fleet_base.py", line 1192, in minimize
    self._runtime_handle = RuntimeFactory()._create_runtime(context)
  File "/home/li/repo/Paddle/build/python/paddle/distributed/fleet/base/runtime_factory.py", line 32, in _create_runtime
    ps_runtime = TheOnePSRuntime()
  File "/home/li/repo/Paddle/build/python/paddle/distributed/fleet/runtime/the_one_ps.py", line 383, in __init__
    self._worker = fluid.core.DistFleetWrapper()
AttributeError: module 'paddle.fluid.core_avx' has no attribute 'DistFleetWrapper'

My cmake options are

cmake ..  -DCMAKE_BUILD_TYPE=Release -DWITH_PYTHON=ON -DPY_VERSION=3.6 -DWITH_AVX=ON

Update

cmake ..  -DCMAKE_BUILD_TYPE=Release -DWITH_PYTHON=ON -DPY_VERSION=3.6 -DWITH_AVX=ON -DWITH_DISTRIBUTE=ON

@luotao1
Contributor

luotao1 commented Jan 27, 2021

@lidanqing-intel You should use -DWITH_DISTRIBUTE=ON.
Besides, as discussed with @MrChengmo, the above models are published in https://github.com/PaddlePaddle/PaddleRec/tree/master/models

  • rank/dnn, rank/wide_deep, recall/word2vec: these three models are already tested by QA
  • rank/deepfm: still in development.

For how to run rank/dnn, please see https://github.com/PaddlePaddle/Perf/tree/master/CtrDnn

@jczaja
Contributor

jczaja commented Feb 8, 2021

Regarding our strategy for enabling bf16 training: we focus on word2vec with the goal of reducing memory consumption. We want to do that by enabling bf16 training for the most memory-consuming ops, such as lookup_table. Apart from the ops, we want the optimizer to work purely in bf16 as well. So ideally, to reduce memory usage, we will have pure bf16 training without the need to keep master parameters in fp32.

@lidanqing-intel
Contributor

Hi, @luotao1
With the newest develop branch, I cannot save models anymore. Could you please give some suggestions?

Epoch 0 Var LOSS        mean_0.tmp_0      - place: CPUPlace
  - shape: [1]
  - layout: NCHW
  - dtype: float
  - data: [3.401]
2021-02-19 06:07:15,918 - INFO - Epoch: 0, using time 29.075167655944824 second, ips 35434.53342011419 word/sec.
Traceback (most recent call last):
  File "../train.py", line 245, in <module>
    benchmark_main.run()
  File "../train.py", line 65, in run
    self.run_worker()
  File "../train.py", line 125, in run_worker
    self.infer_target_var)
  File "/home/li/miniconda3/envs/myenv_python3.6/lib/python3.6/site-packages/paddle/distributed/fleet/base/fleet_base.py", line 544, in save_inference_model
    self._runtime_handle._save_inference_model(
AttributeError: 'NoneType' object has no attribute '_save_inference_model'

Reproduce steps

  • Build Paddle
cmake ..  -DCMAKE_BUILD_TYPE=Release -DWITH_PYTHON=ON -DPY_VERSION=3.6 -DWITH_AVX=ON -DWITH_DISTRIBUTE=ON
make -j 12
  • Run word2vec
cd 2.0benchmark/ps/static/word2vec
python -u ../train.py -c benchmark.yaml

@MrChengmo
Contributor

> With the newest develop branch, I cannot save models anymore. Could you please give some suggestions?
> [the full error trace and reproduce steps quoted from @lidanqing-intel's comment above]

There is a compatibility problem between single-machine training and PS distributed training. train.py is designed for distributed training, so you can use the following command to run word2vec:

cd 2.0benchmark/ps/static/word2vec
fleetrun --worker_num=1 --server_num=1 ../train.py -c benchmark.yaml
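A minimal sketch of the parameter-server role split that a script like train.py typically follows (this is illustrative only: it uses a stand-in model instead of the real word2vec graph, and assumes the standard fleet API); fleetrun exports the role/endpoint environment variables that fleet.init() reads, which is why launching with plain python fails:

import paddle
import paddle.distributed.fleet as fleet

paddle.enable_static()
fleet.init()  # reads the role/endpoint env vars exported by fleetrun

# Stand-in model (the real script builds the word2vec graph here).
x = paddle.static.data(name="x", shape=[None, 1], dtype="float32")
y = paddle.static.data(name="y", shape=[None, 1], dtype="float32")
loss = paddle.mean(paddle.nn.functional.square_error_cost(paddle.static.nn.fc(x, size=1), y))

strategy = fleet.DistributedStrategy()
strategy.a_sync = True  # async parameter-server mode
optimizer = fleet.distributed_optimizer(paddle.optimizer.SGD(learning_rate=1.0), strategy)
optimizer.minimize(loss)

if fleet.is_server():
    fleet.init_server()
    fleet.run_server()      # blocks; this process acts as a PServer
elif fleet.is_worker():
    fleet.init_worker()
    # ... feed data and run the training loop on a paddle.static.Executor ...
    fleet.stop_worker()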

We recommend using PaddleRec to run the model. You can refer to the following links:

@lidanqing-intel
Contributor

lidanqing-intel commented Feb 25, 2021

Hi @luotao1 Could you please provide a log of a fully trained word2vec model for reference? If there is no CPU log, a GPU log is fine. Please attach it under this issue.

@lidanqing-intel
Contributor

lidanqing-intel commented Feb 25, 2021

BF16 Strategy

  • Intel is now targeting bf16 training of word2vec.

  • We are focusing on enabling bf16, the Python API, and reducing memory consumption.

  • Currently we are enabling bf16 grad ops.

    • Investigating the changes needed in the bf16 training Python API (still in progress).
    • Reducing memory consumption by using the plain (non-mkldnn blocked) format in general.

BF16 updates will be posted each week (by @lidanqing-intel), starting Feb 26th.

@luotao1
Contributor

luotao1 commented Feb 26, 2021

> Could you please provide a log of a fully trained word2vec model for reference? If there is no CPU log, a GPU log is fine. Please attach it under this issue.

@lidanqing-intel Please see the attached 日志数据 (log data).

The log of fleetrun --worker_num=4 --server_num=4 ../train.py -c benchmark.yaml is the Word2Vec DataLoader 4-machine log. We don't have a single-machine log.

Why did we choose this log? You can see the following parameters in benchmark.yaml:

  • reader_type: "DataLoader"
  • sync_mode: "async" # sync / async / geo / heter (not geo)

@wozna
Contributor

wozna commented Mar 4, 2021

@luotao1 I have a question related to the fleetrun command.
I checked that when I install paddlepaddle via pip, it works fine.

Unfortunately, when I build Paddle from source with the mentioned command
cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_PYTHON=ON -DPY_VERSION=3.6 -DWITH_AVX=ON -DWITH_DISTRIBUTE=ON,
it shows fleetrun: command not found.

Should anything be done to make the fleetrun command available?

@luotao1
Contributor

luotao1 commented Mar 5, 2021

@wozna Please check your Python install path; fleetrun is installed as a console script alongside Python.

Paddle/python/setup.py.in

Lines 542 to 544 in ffbf713

entry_points={
    'console_scripts': [
        'fleetrun = paddle.distributed.fleet.launch:launch'
Or you can specify your own Python, like /usr/local/python -m fleetrun ...

@jczaja
Contributor

jczaja commented Mar 8, 2021

@luotao1 Some update. So we are about to enable bf16 training with word2vec. The first milestone is to have bf16 training enabled with master weights, using AMP (automatic mixed precision training) for the lookup_table, elementwise_add and reshape ops. This will not reduce memory consumption, but we will check whether our bf16 functionality works fine. After this, we will go for bf16 training of word2vec without the use of fp32 master weights, e.g. data will be initialized as bf16. This requires creating initializers for bf16 data and some other changes.
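A sketch of what that first milestone could look like, assuming the paddle.static.amp.bf16 API that later landed in release 2.1 (the module path, list names, and the stand-in model are assumptions, not taken from this comment):

import paddle
import paddle.static.amp.bf16 as bf16_amp

paddle.enable_static()

# Stand-in graph with an embedding lookup (the real target is word2vec).
x = paddle.static.data(name="x", shape=[None, 1], dtype="int64")
emb = paddle.static.nn.embedding(input=x, size=[354051, 300])
loss = paddle.mean(emb)

# Run only the listed ops in bf16; everything else, including the master
# weights, stays in fp32 as described in the first milestone.
amp_lists = bf16_amp.AutoMixedPrecisionListsBF16(
    custom_bf16_list={'lookup_table', 'elementwise_add', 'reshape2'})

sgd = paddle.optimizer.SGD(learning_rate=1.0)
sgd = bf16_amp.decorate_bf16(sgd, amp_lists=amp_lists, use_pure_bf16=False)
sgd.minimize(loss)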

@jczaja
Contributor

jczaja commented Mar 18, 2021

@luotao1 Please note PR #31093; it adds initial support of BF16 to AMP (automatic mixed precision).

@jczaja
Contributor

jczaja commented Mar 18, 2021

@luotao1, @MrChengmo We are able to run word2vec training via fleetrun, but after training is finished we can see in server_log.0 that a SIGTERM signal was sent to the process. My question is: is this behaviour expected? The log is below:

    +=======================================================================================+
    |                PaddleRec Benchmark Envs                      Value                    |
    +---------------------------------------------------------------------------------------+
    |                hyper_parameters.neg_num                        5                      |
    |   hyper_parameters.optimizer.decay_rate                      0.999                    |
    |  hyper_parameters.optimizer.decay_steps                     100000                    |
    |hyper_parameters.optimizer.learning_rate                       1.0                     |
    |     hyper_parameters.sparse_feature_dim                       300                     |
    |  hyper_parameters.sparse_feature_number                     354051                    |
    |            hyper_parameters.window_size                        5                      |
    |     hyper_parameters.with_shuffle_batch                      False                    |
    |             static_benchmark.batch_size                       100                     |
    |          static_benchmark.dataset_debug                      False                    |
    |                 static_benchmark.epochs                        2                      |
    |   static_benchmark.example_count_method                      word                     |
    |               static_benchmark.geo_step                       400                     |
    |             static_benchmark.model_path               .//static_model.py              |
    |           static_benchmark.pipe_command           python .//static_reader.py          |
    |           static_benchmark.print_period                      1000                     |
    |            static_benchmark.reader_path               .//static_reader.py             |
    |            static_benchmark.reader_type                  QueueDataset                 |
    |        static_benchmark.save_model_path                    .//model                   |
    |        static_benchmark.split_file_list                      False                    |
    |              static_benchmark.sync_mode                      async                    |
    |         static_benchmark.test_data_path                  .//test_data                 |
    |             static_benchmark.thread_num                        1                      |
    |        static_benchmark.train_data_path                  .//train_data                |
    |               static_benchmark.use_cuda                        0                      |
    |   static_benchmark.word_count_dict_path           .//dict/word_count_dict.txt         |
    |      static_benchmark.word_id_dict_path            .//dict/word_id_dict.txt           |
    |                               workspace                       ./                      |
    |                               yaml_path                 benchmark.yaml                |
    +=======================================================================================+

2021-03-18 09:27:14,627 - INFO - cpu_num: 1
2021-03-18 09:27:14,627 - INFO - -- Role: PSERVER --
sync_mode: async
decay_steps: 100000
Epoch 0: ExponentialDecay set learning rate to 1.0.
/home/jczaja/Paddle/build-relwithdebinfo/python/paddle/distributed/fleet/base/fleet_base.py:632: UserWarning: It is recommended to use DistributedStrategy in fleet.init(). The strategy here is only for compatibility. If the strategy in fleet.distributed_optimizer() is not None, then it will overwrite the DistributedStrategy in fleet.init(), which will take effect in distributed training.
  "It is recommended to use DistributedStrategy "
/home/jczaja/Paddle/build-relwithdebinfo/python/paddle/fluid/incubate/fleet/parameter_server/ir/public.py:1201: DeprecationWarning: The 'warn' function is deprecated, use 'warning' instead
  % lr_decay_steps)
2021-03-18 09:27:14,659 - WARNING - ExponentialDecay is set, staircase = True, global learning rate decay step is [ 100000 ], Change decay steps as follow: 
	 strategy = paddle.distributed.fleet.DistributedStrategy() 
 	 strategy.a_sync = True 
	 strategy.a_sync_configs= { 'lr_decay_steps' : YOUR_DECAY_STEP } 

2021-03-18 09:27:14,659 - INFO - Run Server Begin
server: 
server_param {
    downpour_server_param {
    service_param {server_class: "BrpcPsServer" client_class: "BrpcPsClient" service_class: "BrpcPsService" start_server_port: 0 server_thread_num: 12 
    }
      downpour_table_param {table_id: 0 table_class: "CommonSparseTable" shard_num: 256 type: PS_SPARSE_TABLE
      accessor {accessor_class: "CommMergeAccessor" fea_dim: 354051 embedx_dim: 300 

      }
      common {name: "sgd" table_name: "emb" entry: "none" trainer_num: 1 sync: false params: "Param" params: "LearningRate" dims: 300 dims: 1 initializers: "uniform_random&0&-0.0016666667070239782&0.0016666667070239782" initializers: "fill_constant&1.0" 

      }

      }
      downpour_table_param {table_id: 1 table_class: "CommonSparseTable" shard_num: 256 type: PS_SPARSE_TABLE
      accessor {accessor_class: "CommMergeAccessor" fea_dim: 354051 embedx_dim: 1 

      }
      common {name: "sgd" table_name: "emb_b" entry: "none" trainer_num: 1 sync: false params: "Param" params: "LearningRate" dims: 1 dims: 1 initializers: "fill_constant&0.0" initializers: "fill_constant&1.0" 

      }

      }
      downpour_table_param {table_id: 2 table_class: "CommonSparseTable" shard_num: 256 type: PS_SPARSE_TABLE
      accessor {accessor_class: "CommMergeAccessor" fea_dim: 354051 embedx_dim: 300 

      }
      common {name: "sgd" table_name: "emb_w" entry: "none" trainer_num: 1 sync: false params: "Param" params: "LearningRate" dims: 300 dims: 1 initializers: "fill_constant&0.0" initializers: "fill_constant&1.0" 

      }

      }
      downpour_table_param {table_id: 3 table_class: "GlobalStepTable" shard_num: 256 type: PS_OTHER_TABLE
      accessor {accessor_class: "CommMergeAccessor" fea_dim: 0 embedx_dim: 0 

      }
      tensor {feed_var_name: "@LR_DECAY_COUNTER@" fetch_var_name: "tmp_3" startup_program_id: 0 main_program_id: 1 tensor_table_class: "GlobalStepTable" 

      }
      common {name: "" table_name: "@LR_DECAY_COUNTER@" trainer_num: 1 sync: false 

      }

      }
      downpour_table_param {table_id: 4 table_class: "BarrierTable" shard_num: 256 type: PS_OTHER_TABLE
      accessor {accessor_class: "CommMergeAccessor" fea_dim: 0 embedx_dim: 0 

      }
      common {name: "" table_name: "barrier_table" trainer_num: 1 sync: false 

      }

      }
    }
}
I0318 09:27:14.673756 167466 service.cc:50] Init With Gflags:
I0318 09:27:17.066890 167466 server.cpp:1037] Server[paddle::distributed::BrpcPsService] is serving on port=52681.
I0318 09:27:17.067734 167466 server.cpp:1040] Check out http://broncos-clx01.jf.intel.com:52681 in web browser.
W0318 09:27:17.070726 167466 env.h:179] ps-host :127.0.0.1:52681, rank:0 already register, ignore register
W0318 09:32:21.994891 167586 socket.cpp:1739] Fail to keep-write into fd=12 SocketId=565@127.0.0.1:55290@52681: Broken pipe [32]
W0318 09:32:21.994884 167595 input_messenger.cpp:222] Fail to read from fd=12 SocketId=565@127.0.0.1:55290@52681: Connection reset by peer [104]
W0318 09:33:30.230756 167549 input_messenger.cpp:222] Fail to read from fd=10 SocketId=454@127.0.0.1:55288@52681: Connection reset by peer [104]
W0318 09:33:30.230809 167622 socket.cpp:1739] Fail to keep-write into fd=10 SocketId=454@127.0.0.1:55288@52681: Broken pipe [32]
W0318 09:39:26.518344 167587 input_messenger.cpp:222] Fail to read from fd=11 SocketId=1017@127.0.0.1:56008@52681: Connection reset by peer [104]
W0318 09:39:26.518379 167573 socket.cpp:1739] Fail to keep-write into fd=11 SocketId=1017@127.0.0.1:56008@52681: Broken pipe [32]
W0318 09:40:34.684377 167586 socket.cpp:1739] Fail to keep-write into fd=9 SocketId=904@127.0.0.1:56004@52681: Connection reset by peer [104]


--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::distributed::FleetWrapper::RunServer(std::string const&, unsigned int)
1   paddle::distributed::BrpcPsServer::start(std::string const&, unsigned int)
2   paddle::framework::SignalHandle(char const*, int)
3   paddle::platform::GetCurrentTraceBackString()

----------------------
Error Message Summary:
----------------------
FatalError: `Termination signal` is detected by the operating system.
  [TimeInfo: *** Aborted at 1616085691 (unix time) try "date -d @1616085691" if you are using GNU date ***]
  [SignalInfo: *** SIGTERM (@0xada00900028dd4) received by PID 167466 (TID 0x7f9cb47da740) from PID 167380 ***]

@wozna
Contributor

wozna commented Mar 22, 2021

I have a question related to initializers. Is it possible to initialize data in the FP16 data type in AMP FP16, or is data always created in FP32?

@arlesniak
Contributor

With recent changes to the initializers, SGD, and some operations in the forward and backward passes (already in the develop/release 2.1 branches), you can use pure BF16 mode. It allows converting a model's operations, tensors and parameters to BF16.

Pure mode is part of the AMP concept, used in the paddle.static.amp.bf16 module for mixed precision training. We followed the concept with changes kept as close to the AMP API as possible while enabling BF16 usage. Pure mode by default enables all registered BF16 ops in Paddle. For operations not implemented in BF16, it uses the float op version with casting where needed.

We focused on enabling the word2vec model with a local run without fleet. We are able to run BF16 word2vec training for a number of iterations, observe the loss decreasing during training, and see less memory being used.

Essentially there are two places where the model code changes. The first is decorating the optimizer, and the second is calling amp_init after the tensors are initialized. Example model changes needed to use BF16 pure mode in training (see also the sketch after the diff):

word2vec_diff
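A minimal sketch of those two changes, assuming the paddle.static.amp.bf16 module from release 2.1 (the stand-in graph below replaces the real word2vec program and is purely illustrative):

import paddle
import paddle.static.amp.bf16 as bf16_amp

paddle.enable_static()
place = paddle.CPUPlace()
exe = paddle.static.Executor(place)

# Stand-in graph; in practice this is the word2vec program.
x = paddle.static.data(name="x", shape=[None, 8], dtype="float32")
loss = paddle.mean(paddle.static.nn.fc(x, size=1))

# 1) Decorate the optimizer for pure BF16 training.
sgd = paddle.optimizer.SGD(learning_rate=1.0)
sgd = bf16_amp.decorate_bf16(sgd, use_pure_bf16=True)
sgd.minimize(loss)

# 2) After the startup program has initialized the fp32 parameters,
#    cast them to BF16 with amp_init.
exe.run(paddle.static.default_startup_program())
sgd.amp_init(place)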

Next steps:

  • We are currently working on a performance issue we noticed when running the word2vec training.
  • We weren't able to use the model save function because of the previously mentioned issues, so it is an area to be addressed. For now we turn off saving and let the training continue.
  • More operations in BF16.

@lidanqing-intel
Contributor

Please note, ON_INFER and WITH_DISTRIBUTE should not be turned on at the same time when compiling.

@lidanqing-intel
Contributor

Development continues; this will not go into the current release.
