Enable BF16 on Paddle Parameter Server Distributed Training #30560

Closed
seiriosPlus opened this issue Jan 19, 2021 · 20 comments

@seiriosPlus
Collaborator

seiriosPlus commented Jan 19, 2021

Paddle's distributed parameter-server training currently targets recommendation scenarios dominated by very large data volumes and shallow models; training typically runs on dozens to hundreds of high-performance CPU servers.

Recommendation scenarios generally use embedding layers to represent user features, with scales ranging from tens of millions to hundreds of billions of entries. This causes drastic memory consumption on the PServer side (possibly tens of TB of memory), and the fetch/update speed of sparse parameters on the PServer side is also one of the bottlenecks of the whole training.

We expect to use BF16 to reduce memory usage on the PServer side, and also to use BF16 to speed up the fetch/update of sparse parameters on the PServer side.
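As a rough illustration of the memory motivation above (the feature count and embedding width below are assumed for the sake of the example, not taken from a real model), storing the sparse embedding table in BF16 instead of FP32 halves the bytes per value:

sparse_feature_number = 10 ** 10   # 10 billion sparse features (assumed)
embedding_dim = 300                # embedding width (assumed)

bytes_fp32 = sparse_feature_number * embedding_dim * 4   # 4 bytes per fp32 value
bytes_bf16 = sparse_feature_number * embedding_dim * 2   # 2 bytes per bf16 value

print("fp32 table: %.1f TiB" % (bytes_fp32 / 2 ** 40))   # ~10.9 TiB
print("bf16 table: %.1f TiB" % (bytes_bf16 / 2 ** 40))   # ~5.5 TiB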

@paddle-bot-old

Hi! We've received your issue; please be patient while we arrange technicians to answer your questions as soon as possible. Please check again that you have provided a clear problem description, reproduction code, environment & version, and error messages. You may also look for answers in the official API docs, FAQ, historical issues, and the AI community. Have a nice day!

@luotao1
Contributor

luotao1 commented Jan 26, 2021

We don't have a BF16 optimizer or training interface yet. But you can refer to the static fp16 training interface: https://github.com/PaddlePaddle/models/blob/release/2.0-beta/PaddleNLP/benchmark/bert/run_pretrain_single.py#L241

# Ops in the custom white list are forced to run in fp16; everything else falls back to fp32.
amp_list = paddle.fluid.contrib.mixed_precision.AutoMixedPrecisionLists(
            custom_white_list=['layer_norm', 'softmax', 'gelu'])
# Wrap the optimizer so that fp16 casts and dynamic loss scaling are inserted automatically.
optimizer = paddle.fluid.contrib.mixed_precision.decorate(
            optimizer,
            amp_list,
            init_loss_scaling=args.scale_loss,
            use_dynamic_loss_scaling=True)

Besides, maybe you should provide an enable_mkldnn interface instead of the global FLAGS_use_mkldnn variable (see #27935)
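For reference, a minimal sketch of what such an explicit switch looks like on the inference side, using the paddle.inference.Config API (the model paths below are placeholders); the point is that oneDNN is enabled per config object rather than through a process-wide flag:

import paddle.inference as paddle_infer

# Per-config switch instead of a process-wide FLAGS_use_mkldnn flag.
config = paddle_infer.Config("./inference_model/__model__",
                             "./inference_model/__params__")  # placeholder paths
config.enable_mkldnn()                       # turn on oneDNN (MKL-DNN) kernels for this predictor only
config.set_cpu_math_library_num_threads(4)   # CPU math threads for this predictor

predictor = paddle_infer.create_predictor(config)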

@lidanqing-intel
Contributor

Luotao means that we had better not use global variables anymore.

@lidanqing-intel
Contributor

lidanqing-intel commented Jan 27, 2021

Hi, @luotao1 What compile options are needed to run the recommender models? I get this error when I run train.py:

  File "/home/li/repo/Paddle/build/python/paddle/distributed/fleet/base/fleet_base.py", line 1192, in minimize
    self._runtime_handle = RuntimeFactory()._create_runtime(context)
  File "/home/li/repo/Paddle/build/python/paddle/distributed/fleet/base/runtime_factory.py", line 32, in _create_runtime
    ps_runtime = TheOnePSRuntime()
  File "/home/li/repo/Paddle/build/python/paddle/distributed/fleet/runtime/the_one_ps.py", line 383, in __init__
    self._worker = fluid.core.DistFleetWrapper()
AttributeError: module 'paddle.fluid.core_avx' has no attribute 'DistFleetWrapper'

My cmake options are

cmake ..  -DCMAKE_BUILD_TYPE=Release -DWITH_PYTHON=ON -DPY_VERSION=3.6 -DWITH_AVX=ON

Update

cmake ..  -DCMAKE_BUILD_TYPE=Release -DWITH_PYTHON=ON -DPY_VERSION=3.6 -DWITH_AVX=ON -DWITH_DISTRIBUTE=ON

@luotao1
Contributor

luotao1 commented Jan 27, 2021

@lidanqing-intel You should use -DWITH_DISTRIBUTE=ON.
Besides, as discussed with @MrChengmo, the above models are published in https://github.com/PaddlePaddle/PaddleRec/tree/master/models

  • rank/dnn, rank/wide_deep, recall/word2vec: these three models are already tested by QA
  • rank/deepfm: still in development.

For how to run rank/dnn, please see https://github.com/PaddlePaddle/Perf/tree/master/CtrDnn

@jczaja
Contributor

jczaja commented Feb 8, 2021

Regarding our strategy for enabling bf16 training: we focus on word2vec with the goal of reducing memory consumption. We want to do that by enabling bf16 training for the most memory-consuming ops, such as lookup_table. Apart from the ops, we want the optimizer to work purely in bf16 as well. So ideally, to reduce memory usage, we will have pure bf16 training without the need to keep master parameters in fp32.

@lidanqing-intel
Contributor

Hi, @luotao1
With the newest develop branch, I cannot save models anymore. Could you please give some suggestions?

Epoch 0 Var LOSS        mean_0.tmp_0      - place: CPUPlace
  - shape: [1]
  - layout: NCHW
  - dtype: float
  - data: [3.401]
2021-02-19 06:07:15,918 - INFO - Epoch: 0, using time 29.075167655944824 second, ips 35434.53342011419 word/sec.
Traceback (most recent call last):
  File "../train.py", line 245, in <module>
    benchmark_main.run()
  File "../train.py", line 65, in run
    self.run_worker()
  File "../train.py", line 125, in run_worker
    self.infer_target_var)
  File "/home/li/miniconda3/envs/myenv_python3.6/lib/python3.6/site-packages/paddle/distributed/fleet/base/fleet_base.py", line 544, in save_inference_model
    self._runtime_handle._save_inference_model(
AttributeError: 'NoneType' object has no attribute '_save_inference_model'

Reproduce steps

  • Build Paddle
cmake ..  -DCMAKE_BUILD_TYPE=Release -DWITH_PYTHON=ON -DPY_VERSION=3.6 -DWITH_AVX=ON -DWITH_DISTRIBUTE=ON
make -j 12
  • Run word2vec
cd 2.0benchmark/ps/static/word2vec
python -u ../train.py -c benchmark.yaml

@MrChengmo
Contributor

> With the newest develop branch, I cannot save models anymore. Could you please give some suggestions?
> [the full error trace and reproduce steps quoted from @lidanqing-intel's comment above]

There is a compatibility problem between single-machine training and PS distributed training. train.py is designed for distributed training, so you can use the following command to run word2vec:

cd 2.0benchmark/ps/static/word2vec
fleetrun --worker_num=1 --server_num=1 ../train.py -c benchmark.yaml
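A minimal sketch of the parameter-server role split that a script like train.py typically follows (this is illustrative only: it uses a stand-in model instead of the real word2vec graph, and assumes the standard fleet API); fleetrun exports the role/endpoint environment variables that fleet.init() reads, which is why launching with plain python fails:

import paddle
import paddle.distributed.fleet as fleet

paddle.enable_static()
fleet.init()  # reads the role/endpoint env vars exported by fleetrun

# Stand-in model (the real script builds the word2vec graph here).
x = paddle.static.data(name="x", shape=[None, 1], dtype="float32")
y = paddle.static.data(name="y", shape=[None, 1], dtype="float32")
loss = paddle.mean(paddle.nn.functional.square_error_cost(paddle.static.nn.fc(x, size=1), y))

strategy = fleet.DistributedStrategy()
strategy.a_sync = True  # async parameter-server mode
optimizer = fleet.distributed_optimizer(paddle.optimizer.SGD(learning_rate=1.0), strategy)
optimizer.minimize(loss)

if fleet.is_server():
    fleet.init_server()
    fleet.run_server()      # blocks; this process acts as a PServer
elif fleet.is_worker():
    fleet.init_worker()
    # ... feed data and run the training loop on a paddle.static.Executor ...
    fleet.stop_worker()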

We recommend using PaddleRec to run the model. You can refer to the following links:

@lidanqing-intel
Contributor

lidanqing-intel commented Feb 25, 2021

Hi @luotao1 Could you please provide a log of a fully trained word2vec model for reference? If there is no CPU log, a GPU log is fine. Please attach it under this issue.

@lidanqing-intel
Contributor

lidanqing-intel commented Feb 25, 2021

BF16 Strategy

  • Intel is now targeting bf16 training of word2vec.

  • We are focusing on enabling bf16, the Python API, and reducing memory consumption.

  • Currently we are enabling bf16 grad ops.

    • Investigating the changes needed in the bf16 training Python API (still in progress).
    • Reducing memory consumption by using the plain (non-mkldnn blocked) format in general.

BF16 updates will be posted each week (by @lidanqing-intel), starting Feb 26th.

@luotao1
Contributor

luotao1 commented Feb 26, 2021

> Could you please provide a log of a fully trained word2vec model for reference? If there is no CPU log, a GPU log is fine. Please attach it under this issue.

@lidanqing-intel Please see the attached 日志数据 (log data).

The log of fleetrun --worker_num=4 --server_num=4 ../train.py -c benchmark.yaml is the Word2Vec DataLoader 4-machine log. We don't have a single-machine log.

Why did we choose this log? You can see the following parameters in benchmark.yaml:

  • reader_type: "DataLoader"
  • sync_mode: "async" # sync / async / geo / heter (not geo)

@wozna
Contributor

wozna commented Mar 4, 2021

@luotao1 I have a question related to the fleetrun command.
I checked that when I install paddlepaddle via pip, it works fine.

Unfortunately, when I build Paddle from source with the mentioned command
cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_PYTHON=ON -DPY_VERSION=3.6 -DWITH_AVX=ON -DWITH_DISTRIBUTE=ON,
it shows fleetrun: command not found.

Should anything be done to make the fleetrun command available?

@luotao1
Contributor

luotao1 commented Mar 5, 2021

@wozna Please check your Python install path; fleetrun is installed as a console script alongside Python.

Paddle/python/setup.py.in

Lines 542 to 544 in ffbf713

entry_points={
    'console_scripts': [
        'fleetrun = paddle.distributed.fleet.launch:launch'
Or you can specify your own Python, like /usr/local/python -m fleetrun ...

@jczaja
Contributor

jczaja commented Mar 8, 2021

@luotao1 Some update. So we are about to enable bf16 training with word2vec. The first milestone is to have bf16 training enabled with master weights, using AMP (automatic mixed precision training) for the lookup_table, elementwise_add and reshape ops. This will not reduce memory consumption, but we will check whether our bf16 functionality works fine. After this, we will go for bf16 training of word2vec without the use of fp32 master weights, e.g. data will be initialized as bf16. This requires creating initializers for bf16 data and some other changes.
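A sketch of what that first milestone could look like, assuming the paddle.static.amp.bf16 API that later landed in release 2.1 (the module path, list names, and the stand-in model are assumptions, not taken from this comment):

import paddle
import paddle.static.amp.bf16 as bf16_amp

paddle.enable_static()

# Stand-in graph with an embedding lookup (the real target is word2vec).
x = paddle.static.data(name="x", shape=[None, 1], dtype="int64")
emb = paddle.static.nn.embedding(input=x, size=[354051, 300])
loss = paddle.mean(emb)

# Run only the listed ops in bf16; everything else, including the master
# weights, stays in fp32 as described in the first milestone.
amp_lists = bf16_amp.AutoMixedPrecisionListsBF16(
    custom_bf16_list={'lookup_table', 'elementwise_add', 'reshape2'})

sgd = paddle.optimizer.SGD(learning_rate=1.0)
sgd = bf16_amp.decorate_bf16(sgd, amp_lists=amp_lists, use_pure_bf16=False)
sgd.minimize(loss)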

@jczaja
Contributor

jczaja commented Mar 18, 2021

@luotao1 Please note PR #31093; it adds initial support of BF16 to AMP (automatic mixed precision).

@jczaja
Contributor

jczaja commented Mar 18, 2021

@luotao1, @MrChengmo We are able to run word2vec training via fleetrun, but after training is finished we can see in server_log.0 that a SIGTERM signal was sent to the process. My question is: is this behaviour expected? The log is below:

    +=======================================================================================+
    |                PaddleRec Benchmark Envs                      Value                    |
    +---------------------------------------------------------------------------------------+
    |                hyper_parameters.neg_num                        5                      |
    |   hyper_parameters.optimizer.decay_rate                      0.999                    |
    |  hyper_parameters.optimizer.decay_steps                     100000                    |
    |hyper_parameters.optimizer.learning_rate                       1.0                     |
    |     hyper_parameters.sparse_feature_dim                       300                     |
    |  hyper_parameters.sparse_feature_number                     354051                    |
    |            hyper_parameters.window_size                        5                      |
    |     hyper_parameters.with_shuffle_batch                      False                    |
    |             static_benchmark.batch_size                       100                     |
    |          static_benchmark.dataset_debug                      False                    |
    |                 static_benchmark.epochs                        2                      |
    |   static_benchmark.example_count_method                      word                     |
    |               static_benchmark.geo_step                       400                     |
    |             static_benchmark.model_path               .//static_model.py              |
    |           static_benchmark.pipe_command           python .//static_reader.py          |
    |           static_benchmark.print_period                      1000                     |
    |            static_benchmark.reader_path               .//static_reader.py             |
    |            static_benchmark.reader_type                  QueueDataset                 |
    |        static_benchmark.save_model_path                    .//model                   |
    |        static_benchmark.split_file_list                      False                    |
    |              static_benchmark.sync_mode                      async                    |
    |         static_benchmark.test_data_path                  .//test_data                 |
    |             static_benchmark.thread_num                        1                      |
    |        static_benchmark.train_data_path                  .//train_data                |
    |               static_benchmark.use_cuda                        0                      |
    |   static_benchmark.word_count_dict_path           .//dict/word_count_dict.txt         |
    |      static_benchmark.word_id_dict_path            .//dict/word_id_dict.txt           |
    |                               workspace                       ./                      |
    |                               yaml_path                 benchmark.yaml                |
    +=======================================================================================+

2021-03-18 09:27:14,627 - INFO - cpu_num: 1
2021-03-18 09:27:14,627 - INFO - -- Role: PSERVER --
sync_mode: async
decay_steps: 100000
Epoch 0: ExponentialDecay set learning rate to 1.0.
/home/jczaja/Paddle/build-relwithdebinfo/python/paddle/distributed/fleet/base/fleet_base.py:632: UserWarning: It is recommended to use DistributedStrategy in fleet.init(). The strategy here is only for compatibility. If the strategy in fleet.distributed_optimizer() is not None, then it will overwrite the DistributedStrategy in fleet.init(), which will take effect in distributed training.
  "It is recommended to use DistributedStrategy "
/home/jczaja/Paddle/build-relwithdebinfo/python/paddle/fluid/incubate/fleet/parameter_server/ir/public.py:1201: DeprecationWarning: The 'warn' function is deprecated, use 'warning' instead
  % lr_decay_steps)
2021-03-18 09:27:14,659 - WARNING - ExponentialDecay is set, staircase = True, global learning rate decay step is [ 100000 ], Change decay steps as follow: 
	 strategy = paddle.distributed.fleet.DistributedStrategy() 
 	 strategy.a_sync = True 
	 strategy.a_sync_configs= { 'lr_decay_steps' : YOUR_DECAY_STEP } 

2021-03-18 09:27:14,659 - INFO - Run Server Begin
server: 
server_param {
    downpour_server_param {
    service_param {server_class: "BrpcPsServer" client_class: "BrpcPsClient" service_class: "BrpcPsService" start_server_port: 0 server_thread_num: 12 
    }
      downpour_table_param {table_id: 0 table_class: "CommonSparseTable" shard_num: 256 type: PS_SPARSE_TABLE
      accessor {accessor_class: "CommMergeAccessor" fea_dim: 354051 embedx_dim: 300 

      }
      common {name: "sgd" table_name: "emb" entry: "none" trainer_num: 1 sync: false params: "Param" params: "LearningRate" dims: 300 dims: 1 initializers: "uniform_random&0&-0.0016666667070239782&0.0016666667070239782" initializers: "fill_constant&1.0" 

      }

      }
      downpour_table_param {table_id: 1 table_class: "CommonSparseTable" shard_num: 256 type: PS_SPARSE_TABLE
      accessor {accessor_class: "CommMergeAccessor" fea_dim: 354051 embedx_dim: 1 

      }
      common {name: "sgd" table_name: "emb_b" entry: "none" trainer_num: 1 sync: false params: "Param" params: "LearningRate" dims: 1 dims: 1 initializers: "fill_constant&0.0" initializers: "fill_constant&1.0" 

      }

      }
      downpour_table_param {table_id: 2 table_class: "CommonSparseTable" shard_num: 256 type: PS_SPARSE_TABLE
      accessor {accessor_class: "CommMergeAccessor" fea_dim: 354051 embedx_dim: 300 

      }
      common {name: "sgd" table_name: "emb_w" entry: "none" trainer_num: 1 sync: false params: "Param" params: "LearningRate" dims: 300 dims: 1 initializers: "fill_constant&0.0" initializers: "fill_constant&1.0" 

      }

      }
      downpour_table_param {table_id: 3 table_class: "GlobalStepTable" shard_num: 256 type: PS_OTHER_TABLE
      accessor {accessor_class: "CommMergeAccessor" fea_dim: 0 embedx_dim: 0 

      }
      tensor {feed_var_name: "@LR_DECAY_COUNTER@" fetch_var_name: "tmp_3" startup_program_id: 0 main_program_id: 1 tensor_table_class: "GlobalStepTable" 

      }
      common {name: "" table_name: "@LR_DECAY_COUNTER@" trainer_num: 1 sync: false 

      }

      }
      downpour_table_param {table_id: 4 table_class: "BarrierTable" shard_num: 256 type: PS_OTHER_TABLE
      accessor {accessor_class: "CommMergeAccessor" fea_dim: 0 embedx_dim: 0 

      }
      common {name: "" table_name: "barrier_table" trainer_num: 1 sync: false 

      }

      }
    }
}
I0318 09:27:14.673756 167466 service.cc:50] Init With Gflags:
I0318 09:27:17.066890 167466 server.cpp:1037] Server[paddle::distributed::BrpcPsService] is serving on port=52681.
I0318 09:27:17.067734 167466 server.cpp:1040] Check out http://broncos-clx01.jf.intel.com:52681 in web browser.
W0318 09:27:17.070726 167466 env.h:179] ps-host :127.0.0.1:52681, rank:0 already register, ignore register
W0318 09:32:21.994891 167586 socket.cpp:1739] Fail to keep-write into fd=12 SocketId=565@127.0.0.1:55290@52681: Broken pipe [32]
W0318 09:32:21.994884 167595 input_messenger.cpp:222] Fail to read from fd=12 SocketId=565@127.0.0.1:55290@52681: Connection reset by peer [104]
W0318 09:33:30.230756 167549 input_messenger.cpp:222] Fail to read from fd=10 SocketId=454@127.0.0.1:55288@52681: Connection reset by peer [104]
W0318 09:33:30.230809 167622 socket.cpp:1739] Fail to keep-write into fd=10 SocketId=454@127.0.0.1:55288@52681: Broken pipe [32]
W0318 09:39:26.518344 167587 input_messenger.cpp:222] Fail to read from fd=11 SocketId=1017@127.0.0.1:56008@52681: Connection reset by peer [104]
W0318 09:39:26.518379 167573 socket.cpp:1739] Fail to keep-write into fd=11 SocketId=1017@127.0.0.1:56008@52681: Broken pipe [32]
W0318 09:40:34.684377 167586 socket.cpp:1739] Fail to keep-write into fd=9 SocketId=904@127.0.0.1:56004@52681: Connection reset by peer [104]


--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::distributed::FleetWrapper::RunServer(std::string const&, unsigned int)
1   paddle::distributed::BrpcPsServer::start(std::string const&, unsigned int)
2   paddle::framework::SignalHandle(char const*, int)
3   paddle::platform::GetCurrentTraceBackString()

----------------------
Error Message Summary:
----------------------
FatalError: `Termination signal` is detected by the operating system.
  [TimeInfo: *** Aborted at 1616085691 (unix time) try "date -d @1616085691" if you are using GNU date ***]
  [SignalInfo: *** SIGTERM (@0xada00900028dd4) received by PID 167466 (TID 0x7f9cb47da740) from PID 167380 ***]

@wozna
Contributor

wozna commented Mar 22, 2021

I have a question related to initializers. Is it possible to initialize data in the FP16 data type in AMP FP16, or is data always created in FP32?

@arlesniak
Contributor

With recent changes to the initializers, SGD, and some operations in the forward and backward passes (already in the develop/release 2.1 branches), you can use pure BF16 mode. It allows converting a model's operations, tensors and parameters to BF16.

Pure mode is part of the AMP concept, used in the paddle.static.amp.bf16 module for mixed precision training. We followed the concept with changes kept as close to the AMP API as possible while enabling BF16 usage. Pure mode by default enables all registered BF16 ops in Paddle. For operations not implemented in BF16, it uses the float op version with casting where needed.

We focused on enabling the word2vec model with a local run without fleet. We are able to run BF16 word2vec training for a number of iterations, observe the loss decreasing during training, and see less memory being used.

Essentially there are two places where the model code changes. The first is decorating the optimizer, and the second is calling amp_init after the tensors are initialized. Example model changes needed to use BF16 pure mode in training (see also the sketch after the diff):

word2vec_diff
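A minimal sketch of those two changes, assuming the paddle.static.amp.bf16 module from release 2.1 (the stand-in graph below replaces the real word2vec program and is purely illustrative):

import paddle
import paddle.static.amp.bf16 as bf16_amp

paddle.enable_static()
place = paddle.CPUPlace()
exe = paddle.static.Executor(place)

# Stand-in graph; in practice this is the word2vec program.
x = paddle.static.data(name="x", shape=[None, 8], dtype="float32")
loss = paddle.mean(paddle.static.nn.fc(x, size=1))

# 1) Decorate the optimizer for pure BF16 training.
sgd = paddle.optimizer.SGD(learning_rate=1.0)
sgd = bf16_amp.decorate_bf16(sgd, use_pure_bf16=True)
sgd.minimize(loss)

# 2) After the startup program has initialized the fp32 parameters,
#    cast them to BF16 with amp_init.
exe.run(paddle.static.default_startup_program())
sgd.amp_init(place)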

Next steps:

  • We are currently working on a performance issue we noticed when running the word2vec training.
  • We weren't able to use the model save function because of the previously mentioned issues, so it is an area to be addressed. For now we turn off saving and let the training continue.
  • More operations in BF16.

@lidanqing-intel
Contributor

Please note, ON_INFER and WITH_DISTRIBUTE should not be turned on at the same time when compiling.

@lidanqing-intel
Contributor

Development continues; this will not go into the current release.
