reduce time of test_TrainerOnePass #3296

Merged
merged 1 commit into from
Aug 7, 2017

@luotao1 luotao1 commented Aug 7, 2017

partly fix #3259
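The unit tests in test_TrainerOnePass fall into two groups:

  1. TEST(checkRemoteUpdater, XXX): there are 10 of these, and each takes about 4s (measured locally). They call the checkRemoteParameterUpdaterTest function, and most of the time goes to starting the pserver. I did not change this part. The timing of one such test is shown below.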
64: [ RUN      ] checkRemoteUpdater.gpu2TrainerOldUpdater
64: I0807 12:06:49.846887  2287 test_TrainerOnePass.cpp:235]  useGpu=1 trainerCount=2 configFile=trainer/tests/sample_trainer_config.conf
64: I0807 12:06:49.846958  7112 LightNetwork.cpp:269] tcp server start 
64: [INFO 2017-08-07 12:06:49,858 networks.py:1491] The input order is [input, label]
64: [INFO 2017-08-07 12:06:49,858 networks.py:1497] The output order is [__cost_0__]
64: I0807 12:06:49.862642  2287 Trainer.cpp:165] trainer mode: Normal
64: I0807 12:06:49.862903  2287 MultiGradientMachine.cpp:99] numLogicalDevices=1 numThreads=2 numDevices=4
64: I0807 12:06:49.865258  2287 DataProvider.cpp:388] load data file trainer/tests/sample_data.txt
64: I0807 12:06:49.865303  2287 DataProvider.cpp:391] read done, num of instance=10 data size=30
64: I0807 12:06:49.865449  2287 DataProvider.cpp:388] load data file trainer/tests/sample_data.txt
64: I0807 12:06:49.865481  2287 DataProvider.cpp:391] read done, num of instance=10 data size=30
64: I0807 12:06:49.865561  2287 GradientMachine.cpp:85] Initing parameters..
64: I0807 12:06:49.867841  2287 GradientMachine.cpp:92] Init parameters done.
64: I0807 12:06:49.871497  2287 ParameterClient2.cpp:114] pserver 0 127.0.0.1:38110
64: I0807 12:06:49.871615  7121 LightNetwork.cpp:322] worker started, peer = 127.0.0.1
64: I0807 12:06:51.872315  7121 ParameterServer2.cpp:256] pserver: setParameter
64: I0807 12:06:51.872351  7121 ParameterServer2.cpp:302] pserver: new cpuvector: size=16384
64: I0807 12:06:51.872582  8344 ParameterClient2.cpp:114] pserver 0 127.0.0.1:38110
64: I0807 12:06:51.872742  8345 LightNetwork.cpp:322] worker started, peer = 127.0.0.1
64: I0807 12:06:53.879446  2287 test_TrainerOnePass.cpp:214] ___fc_layer_0__.w0  diff=0              
64: I0807 12:06:53.879509  2287 test_TrainerOnePass.cpp:214] ___fc_layer_1__.w0  diff=0              
64: I0807 12:06:53.879547  2287 test_TrainerOnePass.cpp:214] ___fc_layer_2__.w0  diff=0              
64: I0807 12:06:53.879582  2287 test_TrainerOnePass.cpp:214] sharew              diff=0              
64: I0807 12:06:53.879611  2287 test_TrainerOnePass.cpp:214] ___fc_layer_4__.w0  diff=0              
64: I0807 12:06:53.879647  2287 test_TrainerOnePass.cpp:214] ___fc_layer_5__.w0  diff=0              
64: I0807 12:06:53.879675  2287 test_TrainerOnePass.cpp:214] ___fc_layer_6__.w0  diff=0              
64: I0807 12:06:53.879709  2287 test_TrainerOnePass.cpp:214] ___fc_layer_7__.w0  diff=0              
64: I0807 12:06:53.879737  2287 test_TrainerOnePass.cpp:214] ___fc_layer_7__.wbiasdiff=0              
64: I0807 12:06:53.879771  2287 test_TrainerOnePass.cpp:214] ___mixed_0__.w0     diff=0              
64: I0807 12:06:53.879806  2287 test_TrainerOnePass.cpp:214] ___mixed_0__.w1     diff=0              
64: I0807 12:06:53.879842  2287 test_TrainerOnePass.cpp:214] ___mixed_0__.w2     diff=0              
64: I0807 12:06:53.879873  2287 test_TrainerOnePass.cpp:214] ___mixed_0__.w4     diff=0              
64: I0807 12:06:53.879906  2287 test_TrainerOnePass.cpp:214] ___mixed_0__.w5     diff=0              
64: I0807 12:06:53.879935  2287 test_TrainerOnePass.cpp:214] ___mixed_0__.w6     diff=0              
64: I0807 12:06:53.879969  2287 test_TrainerOnePass.cpp:214] ___mixed_0__.w7     diff=0              
64: I0807 12:06:53.880100  8344 SocketChannel.cpp:42] destory connection in socket channel, peer = 127.0.0.1
64: I0807 12:06:53.880110  8345 LightNetwork.cpp:339] worker begin to finish, peer = 127.0.0.1
64: I0807 12:06:53.880129  7121 ParameterServer2.cpp:564] pserver: getParameter
64: I0807 12:06:53.880147  8345 SocketChannel.cpp:42] destory connection in socket channel, peer = 127.0.0.1
64: I0807 12:06:53.880996  7121 LightNetwork.cpp:339] worker begin to finish, peer = 127.0.0.1
64: I0807 12:06:53.880998  2287 SocketChannel.cpp:42] destory connection in socket channel, peer = 127.0.0.1
64: I0807 12:06:53.881031  7121 SocketChannel.cpp:42] destory connection in socket channel, peer = 127.0.0.1
64: I0807 12:06:53.881778  7112 LightNetwork.cpp:215] pserver accept thread finish, addr= port=38110
64: I0807 12:06:53.881824  2287 SocketChannel.cpp:42] destory connection in socket channel, peer = 127.0.0.1
64: [       OK ] checkRemoteUpdater.gpu2TrainerOldUpdater (4035 ms)
  2. The remaining tests call trainerOnePassTest. With somewhat smaller data_size and num_pass settings, their time drops from tens of seconds to a few seconds.
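
To see where the time goes across many test cases, the per-case durations that googletest prints can be summed per suite. The following is a minimal sketch, assuming result lines shaped like the `[       OK ] ... (4035 ms)` line in the log above. The helper name `suite_times_ms` and the embedded sample are illustrative only and are not part of the Paddle test code.

```python
import re

# Matches googletest result lines such as
# "64: [       OK ] checkRemoteUpdater.gpu2TrainerOldUpdater (4035 ms)".
# Any prefix before the bracket (here "64: ") is ignored because search() is used.
RESULT_LINE = re.compile(
    r"\[\s+(?P<status>OK|FAILED)\s+\]\s+"
    r"(?P<suite>\w+)\.(?P<case>\w+)\s+\((?P<ms>\d+) ms\)"
)


def suite_times_ms(log_text):
    """Sum the reported milliseconds per test suite name (hypothetical helper)."""
    totals = {}
    for line in log_text.splitlines():
        match = RESULT_LINE.search(line)
        if match:
            suite = match.group("suite")
            totals[suite] = totals.get(suite, 0) + int(match.group("ms"))
    return totals


# Two lines copied from the log above; the RUN line carries no duration and is skipped.
sample_log = """\
64: [ RUN      ] checkRemoteUpdater.gpu2TrainerOldUpdater
64: [       OK ] checkRemoteUpdater.gpu2TrainerOldUpdater (4035 ms)
"""

totals = suite_times_ms(sample_log)
print(totals)
```

Running this over a full log would give one total per suite, which makes it easier to compare the checkRemoteUpdater group against the trainerOnePassTest group.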

@@ -12,7 +12,7 @@

embedding = embedding_layer(
input=data_layer(
-name="word_ids", size=65536),
+name="word_ids", size=8192),
Contributor

I'm a bit curious: the number 8192 shows up throughout the demos in our book. Does it have any special meaning?

Contributor Author

No special meaning. Here I simply divided the original value by 8.
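
The arithmetic behind that reply can be checked directly. This is only a sketch: the embedding width below is a hypothetical value picked for illustration and is not taken from the config. It also assumes the embedding table has one row per input id, so its parameter count scales linearly with `size`.

```python
old_size = 65536           # word_ids size before this change
new_size = old_size // 8   # "divided by 8"
assert new_size == 8192 == 2 ** 13

# Hypothetical embedding width, for illustration only; not from the PR.
embedding_dim = 32

# Assumed table shape: one row of embedding_dim weights per input id.
old_params = old_size * embedding_dim
new_params = new_size * embedding_dim
shrink_factor = old_params // new_params
print(shrink_factor)
```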

Collaborator

The prime closest to 8192 is 8191.

luotao1 commented Aug 7, 2017

Time on TeamCity: down to 28s.

[13:45:56]	67/133 Test #68: test_TrainerOnePass ....................... Passed 28.45 sec

@luotao1 luotao1 requested a review from wangkuiyi August 7, 2017 05:50
@wangkuiyi wangkuiyi left a comment

Nice speedup!

@@ -1,6 +1,6 @@
from paddle.trainer_config_helpers import *

-settings(batch_size=128, learning_method=AdaGradOptimizer(), learning_rate=1e-4)
+settings(batch_size=16, learning_method=AdaGradOptimizer(), learning_rate=1e-4)
Collaborator

I usually pick prime numbers as parameters, because powers of 2 are often less likely than primes to trigger errors. The prime closest to 16 is 17.
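
The two "closest prime" figures quoted in this review (17 for 16, 8191 for 8192) can be reproduced with a small search outward from the starting value. This is a sketch using plain trial division. `nearest_prime` is a hypothetical helper written for this note, not something from the Paddle code base, and it prefers the smaller prime on a tie.

```python
def is_prime(n):
    """Trial-division primality check; fine for small test parameters."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def nearest_prime(n):
    """Return the prime closest to n, preferring the smaller one on a tie."""
    distance = 0
    while True:
        if is_prime(n - distance):
            return n - distance
        if is_prime(n + distance):
            return n + distance
        distance += 1


print(nearest_prime(16), nearest_prime(8192))
```

Under this suggestion, a test config would use values such as `batch_size=17` or `size=8191` so that code which silently assumes power-of-two sizes is more likely to fail visibly.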

@luotao1 luotao1 closed this Aug 7, 2017
@luotao1 luotao1 reopened this Aug 7, 2017
@luotao1 luotao1 merged commit dda4217 into PaddlePaddle:develop Aug 7, 2017
@luotao1 luotao1 deleted the test_TrainerOnePass branch August 7, 2017 10:52
Successfully merging this pull request may close these issues.

test_TrainerOnePass runs too slow