Design doc: submit a distributed job #1770

Yancey1989 · 2017-04-11T17:27:16Z

here maybe better to view.

This PR has too many comments to hard to view, move to #2111 Please.
Thanks @helinwang @jacquesqiao @gongweibao @typhoonzero @wangkuiyi

jacquesqiao · 2017-04-12T02:05:21Z

doc/design/dist/submit-job.md

@@ -0,0 +1,70 @@
+
+# PaddlePaddle Client
+PaddlePaddle clinet is a command line tool, before startting a cluster train, you need to install it on your latop, you can sumit a cluster train job looks like:


clinet => client

jacquesqiao · 2017-04-12T02:06:30Z

doc/design/dist/submit-job.md

+    --input=${INPUT_DIR}
+```
+
+- job-name: you can specify a name for every job, and the name should be uniq.


job-name可以自己制定，框架应该自动生成一个app-id，这个是uniq的

有道理。kubectl应该会为每个job返回一个id。这个不难。

只是job-name最好还是可以自动生成一个unique的。比如说 full_job_name = k8s_user_name + "/" + job_name ?

Maybe full_job_name=<k8s-user-name>/<job-name>/<job-id> ? because user will submit the same job name twice.

感觉可以考虑不允许用户指定job name。跟大家提到的一样：咱们自己给用户生成一个id就好了。（类似于docker run执行的时候会在stdout打印Docker生成的container ID）

通过手动指定一个唯一的job name看起来是一个比较常见的方法（Kubernetes和gcloud里都有手动指定唯一job name的习惯），用户在使用时也比较统一，比如以下常用的一系列操作：

paddle job submit <job-name> paddle job status <job-name> paddle job stop <job-name>

如果使用生成的job-id的话会需要在第一次提交时候注意返回的ID，如果忽略掉的话用户很难再找到刚才提交的是哪个job了。

jacquesqiao · 2017-04-12T02:08:45Z

doc/design/dist/submit-job.md

+
+- Data source
+
+  Input data should be saved on the distributed and you have the access.


access => access right?

jacquesqiao · 2017-04-12T02:10:05Z

doc/design/dist/submit-job.md

+      --pcakage-path=./sharding-data
+      --module-name=sharding-data.main
+      --input=${INPUT_DIR}
+      --output=${OUTPUT_DIR


是不是应该有个参数叫 sharding-count，分多少份

@jacquesqiao 确实，根据https://github.com/PaddlePaddle/Paddle/pull/1770#discussion_r111053711，通过设置环境变量来控制，看起来更通用一些。

感觉sharding不应该交给用户来做。trainer是10个的时候，一个数据集可能需要shard成500份，如果trainer是100个，可能就要5000份。既然都随着trainer数目变化了，用户改变trainer数目之前很可能需要再做一次sharding，会不会太麻烦？

我理解这里需要sharding的原因是：我们让用户上传了自己的数据，集群不知道这个格式是什么，所以就不知道如何sharding。
另一个选择是，用户在使用自己的数据集之前必须转换成集群支持的格式（据我了解google就是这样做的，对应的格式是sstable）。这样，集群自己去管用什么格式存储可以优化顺序读取（training是顺序读取），以及自己可以负责sharding。一切对用户是不可见的。sharding以及用什么格式教给用户做，他们不见得有兴趣做，也很难做好。

@helinwang 让用户来做文件sharding确实会带来各种问题，类似对于HDFS来说，文件会被切成多个block，trainer在启动获取一个未被读取过的block进行训练即可。更进一步的优化，如果存储部署在Kubernetes集群的计算节点，可以使trainer优先读取本地的block数据。

wangkuiyi · 2017-04-12T02:15:45Z

doc/design/dist/submit-job.md

@@ -0,0 +1,70 @@
+
+# PaddlePaddle Client
+PaddlePaddle clinet is a command line tool, before startting a cluster train, you need to install it on your latop, you can sumit a cluster train job looks like:


这里有不少英语语法错误。建议用 Grammerly.com 做检查。我花了钱。以下是我检查的结果（我是付费用户，所以Grammerly.com为我做的检查比为免费用户做的多）：

wangkuiyi · 2017-04-12T02:16:35Z

doc/design/dist/submit-job.md

+PaddlePaddle clinet is a command line tool, before startting a cluster train, you need to install it on your latop, you can sumit a cluster train job looks like:
+
+```bash
+paddle k8s job <job-name>


这里应该是

paddle k8s start <job-name>

吧？还需要有其他的一些命令，比如 stop, log, ...？

wangkuiyi · 2017-04-12T02:17:16Z

doc/design/dist/submit-job.md

+paddle k8s job <job-name>
+    --package-path= ./demo
+    --module=demo.train
+    --pserver-count=2


--pservers=2, --trainers=4

wangkuiyi · 2017-04-12T02:20:07Z

doc/design/dist/submit-job.md

+    --pserver-count=2
+    --trainer-count=4
+    --output=${OUTPUT_DIR}
+    --input=${INPUT_DIR}


我估计不需要input这个“标准”选项，而是可以通过 -e 参数指定一些环境变量或者 demo/train.py 接受的命令行参数吧？用法如：

paddle k8s start deepspeech2 \ -e "INPUT_DIR=/glusterfs/voice-team/voice-data" \ -e "MODEL_DIR=/glusterfs/yanxu/ds_2ps_4tn"

这样， demo/train.py 里可以有：

paddle.v2.train( reader=dataset.voice.train(os.getenv("INPUT_DIR")), ...)

wangkuiyi · 2017-04-12T02:23:42Z

doc/design/dist/submit-job.md

+    --input=${INPUT_DIR}
+```
+
+- job-name: you can specify a name for every job, and the name should be uniq.


有道理。kubectl应该会为每个job返回一个id。这个不难。

只是job-name最好还是可以自动生成一个unique的。比如说 full_job_name = k8s_user_name + "/" + job_name ?

wangkuiyi · 2017-04-12T02:24:52Z

doc/design/dist/submit-job.md

+# Master process
+- Setup master process
+
+  While user submit a distributed train job through PaddlePaddle client, it will setup a master process on cluster, the master process provids a http service for receiving package files and some cluster parameter.


master 程序用什么语言开发？如何和trainer通信？

wangkuiyi · 2017-04-12T02:26:43Z

doc/design/dist/submit-job.md

+
+- Saving package
+
+  Master process will receive the package files and save on the distributed storage, for every job, master will generate a random id called *job-id*, the package file path looks like:


package file 是什么？我以为 paddle k8s start 会把 package path 里的内容打包成一个Docker image，然后push到Kubernetes可以访问的一个Docker registry里呢？

package file 是用户基于PaddlePaddle API开发的程序

我以为 paddle k8s start 会把 package path 里的内容打包成一个Docker image，然后push到Kubernetes可以访问的一个Docker registry里呢？

这确实是一个方法，但感觉在客户端做的事情有些多，既然只是需要这样一个package，那么上传到服务器并存储在分布式存储上，在job启动时volume这个目录看起来会简单一些。

如果我的 .py 文件里 import 了某些机群上没有Python packages怎么办？

在客户端上做可以是咱们的程序在下面做好了，根据指定的目录自动打包，不需要用户做。用户只用装docker。
这样的好处是，深度定制的用户可以直接传一个配好环境的docker image。

原来和 @emailweixu 商量过一个想法：通过访问Python的__all__ 这样的built-in variables来列举被import的Python packages。假设这些packages都是可以通过 pip install 安装的，我们可以自动生成一个Dockerfile，来安装这些packages。只是一个想法，上午具体的技术验证。

如果我的 .py 文件里 import 了某些机群上没有Python packages怎么办？

有一个办法是在package目录下创建一个文件requirements.txt，这个文件列出来依赖的python package，在系统准备环境时预先执行pip install -r requirements.txt来安装这些package。

wangkuiyi · 2017-04-12T02:29:30Z

doc/design/dist/submit-job.md

+
+- Data source
+
+  Input data should be saved on the distributed and you have the access.


数据源有几种

数据在Docker image里（因为之前放在了 package directory 里，所以被装进Docker image了）

数据在分布式文件系统里

数据是reader在线获得的。比如reader是一个HTTP server，等着data generator 发来数据）

数据是reader合成的，就像 fit_a_line 那个demo一样。

这几种方式都需仔细考虑。比如如果放在分布式文件系统上，文件格式应该是任何格式都行（因为用户可以自己写reader），还是必须是少数几种预先定义好的格式（SSTable或者String sequence file），用户使用预先定义好的reader来读取？

根据日前的讨论，这一版只考虑第2种，分布式文件系统的数据。提供默认的reader读取分布式文件系统的数据（GlusterFS），但在设计上要支持其他reader以插件的形式嵌入到系统中。

wangkuiyi · 2017-04-12T02:30:30Z

doc/design/dist/submit-job.md

+
+# Different cluster management
+
+PaddlePaddle client support plugins for different cluster management client, such as kubernetes looks like `paddle k8s ...`, MPI looks like `paddl mpi...`,but runinig process are the same:


paddle k8s start 需要生成 Docker image，但是 paddle mpi start 是不是就不需要 Docker image （也不需要package file）而是假设程序和数据都已经部署好了？

package file 包括了用户开发的训练程序，这个是必要的。 paddle mpi start确实不需要docker image，环境在master(MPI-master)进程运行的环境里，向MPI集群提交任务时会将用户上传的训练程序和环境文件一起提交到集群进行训练。

typhoonzero · 2017-04-12T02:12:23Z

doc/design/dist/submit-job.md

@@ -0,0 +1,70 @@
+
+# PaddlePaddle Client


Since you are describing the PaddlePaddle client, do you need to describe the full features of it, like local training, show version etc.

typhoonzero · 2017-04-12T02:14:35Z

doc/design/dist/submit-job.md

+PaddlePaddle clinet is a command line tool, before startting a cluster train, you need to install it on your latop, you can sumit a cluster train job looks like:
+
+```bash
+paddle k8s job <job-name>


Should be paddle k8s_train ..., use a single subcommand or use the kubectl style: paddle submit k8s_job -f [job.yaml]

Except paddle k8s train, also paddle k8s [stop|status...] command, and paddle client also support submit MPI cluster train job looks like: paddle mpi train, so maybe subcommand looks better?

kubectl style is good point, shall we support the two style, simple job configuration for command line parameters and complex job configuration use yml file.

typhoonzero · 2017-04-12T02:34:46Z

doc/design/dist/submit-job.md

+# Master process
+- Setup master process
+
+  While user submit a distributed train job through PaddlePaddle client, it will setup a master process on cluster, the master process provids a http service for receiving package files and some cluster parameter.


Why we need to setup master processes for each job?

For the import consideration, for each job, they are independent. Master is responsible for setup, stop, report status for the job, so maybe one master process for each job is a simple plan, it does not care about sth. such as user access etc..

On the other hand, pre start multiple master process maybe a better plan, user does need to setup a new master process when submit a job. If you have some idea, let's discuss this one.

helinwang · 2017-04-12T21:48:39Z

doc/design/dist/submit-job.md

+    --module=demo.train
+    --pserver-count=2
+    --trainer-count=4
+    --output=${OUTPUT_DIR}


这个客户端是装在用户电脑上的。我觉得不需要让用户可以设置输出路径，不然用户可能给错误的值。感觉可以job提交之后直接给用户指定一个路径。paddle k8s describe的时候打印出来。
paddle k8s 之后接的sub-command（submit, describe, ...）可以参考：https://cloud.google.com/sdk/gcloud/reference/ml-engine/jobs/

helinwang · 2017-04-12T21:52:01Z

doc/design/dist/submit-job.md

+    --pserver-count=2
+    --trainer-count=4
+    --output=${OUTPUT_DIR}
+    --input=${INPUT_DIR}


这里的input指的是数据集吧。我觉得在.py文件里指定就行了。比如说：

reader = paddle.dist.dataset("mnist") # mnist digit recognition dataset

用户上传的数据集也可以起一个名字，在.py文件中引用。
在.py文件中引用的另一个好处是同时用到多个数据集比较方便。
另外，数据集和网络拓扑的输入格式是强相关的（比如：图像识别的网络拓扑的输入是图片，需要图片的数据集），既然是强相关的，感觉放在同一个地方（都在.py里面）比较好。

@helinwang 这里input指的是输入数据的路径，感觉dataset的名字不需要修改，在定义dataset时，可以使用input这个环境变量来指定输入路径？这样做的好处是，当在本地训练时，input是一个本地路径，云端是一个分布式存储的路径，而dataset的名字是同一个，只是reader的代码可能不一样？

@Yancey1989 明白了，你是指让用户上传数据集到一个地方，input这个环境变量代表输入路径（那个数据集被上传的地方）。好处是在本地和云端训练不用改代码（只用改环境变量）。
但是这个方法有个问题我没有想清楚：本地和云端执行的区别只有input，本地对应本地的文件夹，云端对应云端的文件夹。这样的话，云端执行的时候，数据还是同一个数据，那我们并不知道怎么去sharding（不知道用户用的什么格式。），也可能是我没想到，要是你有方法欢迎讨论哈。

我想象的是另外一种类似的方法，稍有不同：让用户上传数据集，但是必须让用户把数据集通过我们的程序转换成咱们的格式，储存在某个地方。转换的时候可以起名字，然后引用的是候只需要给出那个名字就好了，咱们的程序能自动查到具体的转换后的文件存在了哪里。
比如：

reader = paddle.dist.dataset("mnist") # mnist是公用数据集，不需要用户上传，只用指定名字就好。

或者

reader = paddle.dist.dataset("my-custom-dataset")

坏处是在本地和云端训练需要改代码（不过也只是需要改一行）。
好处是因为是我们的自定义格式，我们可以做sharding，可以把格式设计成顺序读取很快的格式。

helinwang · 2017-04-12T21:56:24Z

doc/design/dist/submit-job.md

+    --input=${INPUT_DIR}
+```
+
+- job-name: you can specify a name for every job, and the name should be uniq.


感觉可以考虑不允许用户指定job name。跟大家提到的一样：咱们自己给用户生成一个id就好了。（类似于docker run执行的时候会在stdout打印Docker生成的container ID）

helinwang · 2017-04-12T21:57:47Z

doc/design/dist/submit-job.md

+# Master process
+- Setup master process
+
+  While user submit a distributed train job through PaddlePaddle client, it will setup a master process on cluster, the master process provids a http service for receiving package files and some cluster parameter.


it will setup -> PaddlePaddle will setup
是paddlepaddle创造的，不是用户直接创造的。改一下可以避免混淆。

helinwang · 2017-04-12T22:01:01Z

doc/design/dist/submit-job.md

+
+- Saving package
+
+  Master process will receive the package files and save on the distributed storage, for every job, master will generate a random id called *job-id*, the package file path looks like:


在客户端上做可以是咱们的程序在下面做好了，根据指定的目录自动打包，不需要用户做。用户只用装docker。
这样的好处是，深度定制的用户可以直接传一个配好环境的docker image。

helinwang · 2017-04-12T22:04:48Z

doc/design/dist/submit-job.md

+
+- Setup parameter and trainer
+
+  Master process will set up parameter server process and trainer process, on kubernetes, master will deploy two deployment for parameter server and trainer.


我觉得训练的master process只用训练（给trainer发task）就好了，不要管trainer，parameter server的启动。master, trainer, parameter server可以直接交给k8s来调度。

这里讲的应该是用户告诉master，经过master中转一下再交给k8s调度。不知道有啥好处？
如果有好处的话，这个master感觉可以直接做成一个服务，而不是叫训练里面的master process来担当这个职责。

这里讲的应该是用户告诉master，经过master中转一下再交给k8s调度。不知道有啥好处？

我理解master还需要一个协调pserver和trainer的启动流程问题，例如每个job的启动过程如下：

提交一个k8s的job来启动pserver

等待pserver的job为RUNINIG状态，获取每个pod的IP地址（这一步必须等到RUNINIG才可以拿到IP）

启动trainer的job，并且将上一步得到的pserver的IP地址传给train的job

步骤1和2之间是有时间间隔的，如果放在客户端来做的话，用户每提交一个任务都需要很长的等待时间才可以，不是很友好。并且将提交的过程放在服务端来做的话也比较便于简化客户端的逻辑。

关于这里master和训练里的master是否为一个进程，我觉得可以讨论一下，如果是同一个进程的话大概是这样子：

启动pserver，trainer的job，

和trainer通信，发送task，如果有trainer重启的情况需要更新连接池。

待trainer的job全部为completed状态后结束训练并退出。

为了提高可用性，master进程可以同时启两个，通过leader elect机制保证有有一个提供服务。

Service discovery可以通过etcd来完成。请参考https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/dist ，按照里面的描述：

master通过etcd看到trainer才开始分发task给他们。

trainer通过etcd看到parameter server个数足够之后再把自己的联系方式公告在etcd。

parameter server启动之后就把自己的联系方式公告在etcd。

这样貌似master不需要协调parameter server以及trainer的启动。

我最新的想法，在每个parameter server进程(Pod)启动后，Kubernetes会自动将IP地址等信息写在etcd中，我们是不是不需要直接读写etcd，而通过Kubernetes的API直接获取到每个Pod的联系方式了呢？

@Yancey1989 这个思路很有意思，我不知道Kubernetes也能把IP信息写到etcd中，通过Kubernetes API暴露出来，请问如何实现的？

我想了一下，有可能是我理解的不够透彻，有点担心的是这样会不会对kubernetes的依赖太高：以前是只依赖etcd（任意一个集群管理系统k8s，mesos，nomad都可以启动）。现在强依赖于k8s了。
另一个担心是我们需要多么定制化k8s才能完成直接通过k8s API获取每个pod的联系方式。

使用kubernetes的client-go，代码大概长这样子：

kubeconfig := flag.String("kubeconfig", "/Users/yanxu05/.kube/config", "absolute path to the kubeconfig file") flag.Parse() // uses the current context in kubeconfig config, err := clientcmd.BuildConfigFromFlags("", *kubeconfig) if err != nil { panic(err.Error()) } // creates the clientset clientset, err := kubernetes.NewForConfig(config) if err != nil { panic(err.Error()) } pod, err := clientset.CoreV1().Pods("default").Get("nginx-2351037158-ph4ts", metav1.GetOptions{}) fmt.Println(pod.Status.PodIP)

看到issue #1807 了，感觉有道理的，使用Kubernetes的client确实会有安全性和可移植性的问题，多谢 @helinwang !!

helinwang · 2017-04-12T22:09:29Z

doc/design/dist/submit-job.md

+
+# Different cluster management
+
+PaddlePaddle client support plugins for different cluster management client, such as kubernetes looks like `paddle k8s ...`, MPI looks like `paddl mpi...`,but runinig process are the same:


mpi比较于k8s有好处／特长吗？如果没有的话，加上因为这里说mpi也是基于k8s的，既然反正集群都装了k8s，要不要mpi就不要了。

图画的有点问题，可能。
MPI receiver应该不是基于k8s集群的，和MPI cluter是一块的。目的是兼容公司内部的MPI集群环境。

@gongweibao receiver可以运行在公司内网的机器上，部署了mpi client即可，这里运行在Kubernetes上为了简化之前的部署流程。

helinwang · 2017-04-12T22:16:03Z

doc/design/dist/submit-job.md

+      --pcakage-path=./sharding-data
+      --module-name=sharding-data.main
+      --input=${INPUT_DIR}
+      --output=${OUTPUT_DIR


感觉sharding不应该交给用户来做。trainer是10个的时候，一个数据集可能需要shard成500份，如果trainer是100个，可能就要5000份。既然都随着trainer数目变化了，用户改变trainer数目之前很可能需要再做一次sharding，会不会太麻烦？

我理解这里需要sharding的原因是：我们让用户上传了自己的数据，集群不知道这个格式是什么，所以就不知道如何sharding。
另一个选择是，用户在使用自己的数据集之前必须转换成集群支持的格式（据我了解google就是这样做的，对应的格式是sstable）。这样，集群自己去管用什么格式存储可以优化顺序读取（training是顺序读取），以及自己可以负责sharding。一切对用户是不可见的。sharding以及用什么格式教给用户做，他们不见得有兴趣做，也很难做好。

wangkuiyi · 2017-04-17T22:35:39Z

doc/design/dist/submit-job.md

+
+The relation of PaddlePaddle, kubernetes and docker:
+
+<img src="./submit-job.png" width="500">


Questions with this figure:

I am not sure that pservers and trainers should be in two jobs. In our current configuration, each trainer has a pserver running on the same physical node to optimally overlay networking and computing.

paddle must communicate with Kubernetes' API server to start the job. It might communicate to the master process of the job too.

I am not sure that pservers and trainers should be in two jobs

现在的配置会有两个问题：

不利于trainer的大规模训练，由于trainer count==pserver count，当trainer数很大时，我们并不需要同样数量的pserver数，因为这会增加pserver失败的概率以及增加网络负载。

由于启在同一个container里，pserver和trainer有一个会以后台方式启动，不符合container的设计原则，不利于故障检测和恢复。

each trainer has a pserver running on the same physical node to optimally overlay networking and computing

看起来控制pserver的数量更可以达到优化网络的效果，而且trainer需要和所有的pserver通信，所以只有一个pserver 节点启在本地看起来效果也不是很大？

paddle must communicate with Kubernetes' API server to start the job. It might communicate to the master process of the job too

我觉得 @helinwang 的这个comment是有道理的，我paddle可以只和master进行通信，master作为一个service存在，并且这个master和 https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/dist#master-process 这里的master 并不是同一个master, 我会在下一版的更新中修改这部分的描述。

wangkuiyi · 2017-04-17T22:36:18Z

doc/design/dist/submit-job.md

+# Running your training locally
+Execute `paddle local train` to run your local train.
+```bash
+paddle local train


The first argument after paddle should be a command. local isn't even a verb. It seems that it could simply be paddle train --locally or just paddle train without Kubernetes related arguments.

wangkuiyi · 2017-04-17T22:39:45Z

doc/design/dist/submit-job.md

+- image: paddlepaddle production image
+- e: environment varible
+
+When you start the local train, the client starts a docker container like:


==> When users start a local training job ...

wangkuiyi · 2017-04-17T22:44:29Z

doc/design/dist/submit-job.md

+  --input=<input_dir>
+  --output=<output_dir>
+  --image=<paddle_image>
+  --e=NUM_PASS=4


--env == -e

wangkuiyi · 2017-04-17T22:46:15Z

doc/design/dist/submit-job.md

+Execute `paddle local train` to run your local train.
+```bash
+paddle local train
+  --pcakage-path=./demo


pcakage ==> package

wangkuiyi · 2017-04-17T22:47:04Z

doc/design/dist/submit-job.md

+  -v <input_dir>:/train/input
+  -v <output_dir>:/train/output
+  -v <package-path>:/train/package
+  -e NUM_PASS=4 <paddle_image>


I guess we also need -e "PYTHONPATH=/train/package so to enable Python finding imported packages there?

wangkuiyi · 2017-04-17T22:47:55Z

doc/design/dist/submit-job.md

+When you start the local train, the client starts a docker container like:
+```bash
+  docker run --rm
+  -v <input_dir>:/train/input


How about change /train/{input,output,package} into /{input,output,package}`?

wangkuiyi · 2017-04-17T22:50:31Z

doc/design/dist/submit-job.md

+You can use `paddle submit job <job-name>` to submit a distributed training job.
+
+```bash
+paddle job submit train <job-name>


paddle job submit train ==>
paddle train

wangkuiyi · 2017-04-17T22:56:21Z

doc/design/dist/submit-job.md

+- e: environment variable
+
+## Build your docker image
+Before submitting a distributed training, you should build your docker image, here


What I expected is that paddle train pack everything into a Docker image, other than users pack them?

helinwang · 2017-04-19T23:16:28Z

doc/design/dist/submit-job.md

+
+The relation of PaddlePaddle, kubernetes and docker:
+
+<img src="./submit-job.png" width="500">


图片里的"start up a local training job"需要PaddlePaddle Client来做吗？感觉直接本地python train.py就行了。

现在本地训练需要docker run ... python train.py才可以，设计PaddlePaddle Client支持本地训练也是为了让用户不必学习Docker相关的操作。

明白了，那这样用户还需要在下载docker image之外，另外下载一个运行脚本吗？

我们线下讨论下这个问题吧，还有Queue的实现是否使用etcd也需要一起讨论下：）

helinwang · 2017-04-19T23:17:15Z

doc/design/dist/submit-job.md

+<img src="./submit-job.png" width="500">
+
+
+# Running local training job


local training job感觉直接python train.py就行了，不需要命令行支持。

helinwang · 2017-04-19T23:21:09Z

doc/design/dist/submit-job.md

+```bash
+paddle train \
+  --job-name=cluster-quickstart \
+  --package-path=$PWD/quick_start \


这里是说要上传一个文件夹上去，作为执行时候的根目录，跟我所理解的pacakge貌似不是一个东西。感觉叫--env-path更容易理解。

env-path看起来更像是设置一个环境变量？这个package-path是包含了network configuration 的一个本地的目录， @helinwang 理解的package是指什么呢？

我把package理解成了编程语言里的一个包，在我的脑海里跟文件夹不一样（比如C++的包是头文件＋动态库／静态库，不是一个文件夹）。

helinwang · 2017-04-19T23:22:07Z

doc/design/dist/submit-job.md

+paddle train \
+  --job-name=cluster-quickstart \
+  --package-path=$PWD/quick_start \
+  --module=quick_start.train \


"--module=quick_start.train"建议改成："--entry-point=python train.py" #也可以是一个用户写的脚本，没必要现制成只能从python的模块执行。

Agree with --entry-point instead of --module

helinwang · 2017-04-19T23:27:46Z

doc/design/dist/submit-job.md

+  --job-name=cluster-quickstart \
+  --package-path=$PWD/quick_start \
+  --module=quick_start.train \
+  --input=<input-dir> \


按照昨晚关于数据集和reader的讨论，--input貌似不需要了。请参考：#1696 (comment)

我觉得--input参数还是需要的，在分布式的训练中，--input指向一个GlusterFS的路径，我们提交数据，提交任务的过程如下

paddle upload <host path> <clusterfs path>

paddle train --input=<clusterfs path>

在描述启动Trainer的Job时，需要将GlusterFS的Volume mount到Pod中，例如https://github.com/k8sp/tutorials/blob/develop/quickstart/paddle_dist_train/quickstart.yaml#L57

在reader中:

trainerid = fetch_trainerid_from_todo_queue() fp = open(os.path.join("/mnt/glusterfs", os.getenv("INPUT"), trainerid, ".list")) def reader(): for l in fp: yield ... return reader

@Yancey1989 武毅更改了一下他的上传数据及reader使用的PR，请看这里：https://github.com/typhoonzero/Paddle/blob/clusterdesign/doc/design/cluster_train/data_dispatch.md

helinwang · 2017-04-19T23:48:40Z

doc/design/dist/submit-job.md

+  PaddlePaddle support a default reader for reading data from distributed file system.
+- HTTP server
+
+  TODO


I think it's already mentioned in #1696 , maybe we can give a general introduction (as you already did) and reference there after that PR is merged.

Delete this section :)

helinwang · 2017-04-19T23:49:04Z

doc/design/dist/submit-job.md

+
+  TODO
+
+## PaddlePaddle client commands:


I think running in client just let user do python train.py

The same as https://github.com/PaddlePaddle/Paddle/pull/1770/files#r112356575

helinwang · 2017-04-19T23:50:19Z

doc/design/dist/submit-job.md

+## Work feature
+- V1
+  - Paddle server is a local version, build `runtime docker image`, deploy trainer and pserver job on user's host.
+  - implement `paddle train`, `paddle list`, `paddle cancel`


If we let user do python train.py for running on local, we can change paddle train to something like paddle submit.

It seems that we still need paddle train for a local train, as https://github.com/PaddlePaddle/Paddle/pull/1770/files#r112356575

helinwang · 2017-04-19T23:51:54Z

doc/design/dist/submit-job.md

+  - Paddle server is a local version, build `runtime docker image`, deploy trainer and pserver job on user's host.
+  - implement `paddle train`, `paddle list`, `paddle cancel`
+- V2
+  - Paddle server is running on kubernetes, users will only upload the package files and some setup parameters and building `runtime docker image` on kubernetes.


Please list the benefit of building docker image on the cloud. Otherwise it's not convincing that we should do this step.

helinwang · 2017-04-19T23:53:30Z

doc/design/dist/submit-job.md

+  - implement `paddle train`, `paddle list`, `paddle cancel`
+- V2
+  - Paddle server is running on kubernetes, users will only upload the package files and some setup parameters and building `runtime docker image` on kubernetes.
+  - implement `paddle prediction` and other feature.


I think this is too far away, and too much uncertainty: do we need to do load balancing for paddle prediction, etc... Maybe we can just remove it and add it later.

Delete paddle prediction, done.

helinwang · 2017-04-23T23:19:54Z

这个comment咱们开会的时候可能需要讨论下：#1770 (comment)

我觉得要封装docker到一个脚本里，让用户对docker无感知有点麻烦，因为还有单独下载一个脚本。而且通过命令行执行paddle，集群上的的jupyter貌似不是很方便支持。不知道其他人是怎么想的。

Yancey1989 · 2017-04-25T17:23:12Z

按讨论的结果做了更新。

typhoonzero · 2017-04-26T05:12:09Z

doc/design/dist/submit-job.md

+If a user wants to start up a local train, he will start up a PaddlePaddle product Docker container firstly, and then
+execute `python train.py` in the Docker container.The details about PaddlePaddle Docker image is [here](../../../paddle/scripts/docker/README.md)
+
+If a user wants to start up a distributed training job, he will submit the distributed training job in python code, or use a command line tool.


把提交分布式任务放在前面，把提交本地的放在后面？这个文档的重点在描述如何提交分布式任务

typhoonzero · 2017-04-26T05:13:26Z

doc/design/dist/submit-job.md

+
+If a user wants to start up a distributed training job, he will submit the distributed training job in python code, or use a command line tool.
+
+The relation of PaddlePaddle, kubernetes and docker:


relation -> relationship

下面一行就直接是一级标题了？

done, 改成二级标题了，多谢指正。

typhoonzero · 2017-04-26T05:15:20Z

doc/design/dist/submit-job.md

+
+- Python Dependencies
+
+  Users will provide a `requirments.txt` file in trainer packages, to list python dependencies packages, such as:


Users will provide a requirments.txt file in trainer packages, to list python dependencies packages, such as:

You need to provide requirments.txt file in your "trainer" package. Example:

typhoonzero · 2017-04-26T05:15:39Z

doc/design/dist/submit-job.md

+The relation of PaddlePaddle, kubernetes and docker:
+
+
+# Runtime Environment On kubernetes


这里应该是二级标题？

typhoonzero · 2017-04-26T05:17:06Z

doc/design/dist/submit-job.md

+  pillow
+  protobuf==3.1.0
+  ```
+  some other details about `requirements` is [here](https://pip.readthedocs.io/en/1.1/requirements.html).


some other details about requirements is here.

More details about requirements.

typhoonzero · 2017-04-26T05:17:59Z

doc/design/dist/submit-job.md

+  ```
+  some other details about `requirements` is [here](https://pip.readthedocs.io/en/1.1/requirements.html).
+
+  Here is an example project:


An example project looks like:

typhoonzero · 2017-04-26T05:19:16Z

doc/design/dist/submit-job.md

+  ```
+  Execute the command: `paddle train --trainer-package=./paddle_eample/quick_start ...`, PaddlePaddle client will upload the trainer package(quick_start)and setup parameters to [Job Server](#job-server)
+
+## Submit a Distributed Training Job In Python Code


Submit a Distributed Training Job In Python Code -> Submit Distributed Training Job With Python Code

typhoonzero · 2017-04-26T05:19:32Z

doc/design/dist/submit-job.md

+## Submit a Distributed Training Job In Python Code
+<img src="./submit-job-python.png" width="800">
+
+Users will call `paddle.dist_train` and provide distributed training configuration as the parameters.


Users will -> You can

typhoonzero · 2017-04-26T05:25:42Z

doc/design/dist/submit-job.md

+
+Users will call `paddle.dist_train` and provide distributed training configuration as the parameters.
+```python
+paddle.dist_train(


python提交和命令行提交的参数，增加下参数的默认值，可选，必选的说明

typhoonzero · 2017-04-26T05:25:49Z

doc/design/dist/submit-job.md

+  Execute the command: `paddle train --trainer-package=./paddle_eample/quick_start ...`, PaddlePaddle client will upload the trainer package(quick_start)and setup parameters to [Job Server](#job-server)
+
+## Submit a Distributed Training Job In Python Code
+<img src="./submit-job-python.png" width="800">


job server是不是就和paddle cloud dashboard网站是同一个实例？

helinwang · 2017-04-28T21:04:16Z

doc/design/cluster_train/submit-job.md

+If a user wants to start up a distributed training job, he will submit the distributed training job with python code.
+
+If a user wants to start up a local train, he will start up a PaddlePaddle production Docker container firstly, and then
+execute `python train.py` in the Docker container.The details about PaddlePaddle Docker image is [here](../../../paddle/scripts/docker/README.md)


.后面少了个空格。

helinwang · 2017-04-28T21:05:10Z

doc/design/cluster_train/submit-job.md

@@ -0,0 +1,111 @@
+# Submit a Distributed Training Job
+
+If a user wants to start up a distributed training job, he will submit the distributed training job with python code.


python需要大写首字母。

helinwang · 2017-04-28T21:05:36Z

doc/design/cluster_train/submit-job.md

+If a user wants to start up a distributed training job, he will submit the distributed training job with python code.
+
+If a user wants to start up a local train, he will start up a PaddlePaddle production Docker container firstly, and then
+execute `python train.py` in the Docker container.The details about PaddlePaddle Docker image is [here](../../../paddle/scripts/docker/README.md)


句子结尾要有句号"."。

helinwang · 2017-04-28T21:08:02Z

doc/design/cluster_train/submit-job.md

+
+## Runtime Environment On Kubernetes
+
+For a distributed training job, there is two Docker image called **runtime Docker image** and **base Docker image**. The runtime Docker image is the Docker image that gets scheduled by Kubernetes to run during training. The base Docker image is for building the runtime Docker image.


对不起，之前说错了，我评论里写的是斜体，给的markdown其实是粗体。新的名词应该用斜体：

*runtime Docker image* *base Docker image*

helinwang · 2017-04-28T21:09:27Z

doc/design/cluster_train/submit-job.md

+
+- Base Docker Image
+
+  Usually, the base Docker image is PaddlePaddle product Docker image including paddle binary files and trainer startup script file. And of course, users can specify any image name hosted on any docker registry which users have the right access.


have the right access -> have the access right.

helinwang · 2017-04-28T21:17:45Z

doc/design/cluster_train/submit-job.md

+gpu_num|NO|1| if `use_gpu=true`, this parameter is required
+
+- Startup Parameter Server and Trainer Jobs
+  - Deploy parameter server job, it's a Kubernetes StatefulSet.


请问Parameter server用ReplicaSet能行吗，为什么需要StatefulSet？感觉要是简单的方法能用的话，就不要用复杂的方法吧。

用StatefulSet的好处是，当有一个Parameter Server的Pod挂掉，Kubernetes新启动的Pod会和之Pod的hostname保持一致，这样trainer就不需要再去获取新的Parameter Server Pod的地址了。

现在design doc里面写了如何进行service discovery，是不要求hostname或者ip一致的。

Done, 因为service discover使用了IP地址，所以保持hostname不变也没有必要了，使用简单的ReplicaSet替换StatefulSet。

helinwang · 2017-04-28T21:18:10Z

doc/design/cluster_train/submit-job.md

+- RESTful API
+
+  Job server provides a RESTful HTTP server receives the trainer packages, list PaddlePaddle job etc...
+  - `POST   /v1/package` receive the trainer package and save them on GlustereFS


GlustereFS -> CephFS

helinwang · 2017-04-28T21:18:54Z

doc/design/cluster_train/submit-job.md

+  Job server provides a RESTful HTTP server receives the trainer packages, list PaddlePaddle job etc...
+  - `POST   /v1/package` receive the trainer package and save them on GlustereFS
+  - `POST   /v1/trainer/job` submit a trainer job
+  - `GET    /v1/jobs/` list all job


list all job -> list all jobs

helinwang · 2017-04-28T21:20:53Z

doc/design/cluster_train/submit-job.md

+  - `POST   /v1/trainer/job` submit a trainer job
+  - `GET    /v1/jobs/` list all job
+  - `GET    /v1/jobs/<job-name>` the status of a job
+  - `DELETE /v1/jobs/<job-name>` cancel a job


Cancel的job对应的log需不需要仍然能让用户看到。如果需要的话，貌似Cancel就跟Delete不一样了，Delete是删掉，Cancel是把状态改到取消。不知道这种情况用HTTP DELTE还合不合适。

感觉前期可以只考虑Delete Job了，可以将日志存储在GlusterFS或者通过Logstash写在Elasticsearch中保存一段时间。

helinwang · 2017-04-28T21:22:02Z

doc/design/cluster_train/submit-job.md

+  `paddle.dist_train` will upload the trainer package to Job Server and then save them on the distributed filesystem, and then start up a job for building the runtime Docker image, Parameter Server and Trainer will use this runtime Docker image.
+
+  There are some benefits for building runtime Docker image on JobServer:
+  - **Docker in Docker** should mount `docker.sock` in the container and set `--privileged`, if the code running in a kubernetes pod, it's not safety.


it's not safety -> it's not safe

helinwang · 2017-05-05T01:30:46Z

doc/design/cluster_train/submit-job.md

+
+- Runtime Docker Image
+
+  The trainer package which user upload and some Python dependencies are packaged into a runtime Docker image based on base Docker image


helinwang · 2017-05-05T01:32:43Z

doc/design/cluster_train/submit-job.md

+  ```
+
+## Submit Distributed Training Job With Python Code
+<img src="./src/submit-job-python.png" width="800">


Need a paragraph to describe the flow in a big picture. The paragraph below goes directly to the detail of paddle.dist_train, the reader needs a big picture to follow the concept.

helinwang · 2017-05-05T01:34:39Z

doc/design/cluster_train/submit-job.md

+  ```
+
+## Submit Distributed Training Job With Python Code
+<img src="./src/submit-job-python.png" width="800">


Docker build / Docker push应该是箭头那根线上的东西吧，这些都是动作，感觉应该放在线上，而不是图上（这里的图基本都是名词／一个概念）

helinwang · 2017-05-05T01:35:51Z

doc/design/cluster_train/submit-job.md

+    reader=reader,
+    paddle_job=PaddleJob(
+      job_name="quickstart",
+      pservers=4,


pservers=4但cpu=1好像不是很合理。。。cpu或是gpu＝1的情况是不用pserver的。

不使用pserver的情况应该属于在集群进行的单机训练？感觉这是不是需要一个单独的API来做这个事情呢？因为即使CPU=10也有可能用户只想做CPU=10的单机训练而已。。另外我建了一个issue: #2019 讨论下如何指定资源，感觉pservers也可以不要了。

Done，增加了对资源使用的描述。

helinwang · 2017-05-05T01:37:19Z

doc/design/cluster_train/submit-job.md

+
+- RESTful API
+
+  Job server provides a RESTful HTTP server receives the trainer packages, list PaddlePaddle job etc...


“list PaddlePaddle job etc...”要不要改成一个概括性的比如说："display job related informations"。写"etc..."有点含糊，感觉不适合在design doc里出现。

helinwang · 2017-05-05T01:38:04Z

doc/design/cluster_train/submit-job.md

+
+  `paddle.dist_train` will upload the trainer package to Job Server and then save them on the distributed filesystem, and then start up a job for building the runtime Docker image, Parameter Server and Trainer will use this runtime Docker image.
+
+  There are some benefits for building runtime Docker image on JobServer:


我们最后觉得是在JobServer上build image还是在本地build image？

这一期因为所有代码都是在云端存储上，所以就不build image了直接copy trainer_package比较简单。根据之前的讨论后续还是在JobServer上来build runtime Docker image感觉比较合适。

helinwang · 2017-05-05T18:37:45Z

doc/design/cluster_train/submit-job.md

+
+- Python Dependencies
+
+  You need to provide requirments.txt file in your "trainer" package. Example:


-> requirements.txt

helinwang · 2017-05-05T18:37:53Z

doc/design/cluster_train/submit-job.md

+      |-quick_start
+        |-trainer.py
+        |-dataset.py
+        |-requirments.txt


-> requirements.txt

helinwang · 2017-05-05T18:41:04Z

doc/design/cluster_train/submit-job.md

+
+## Submit Distributed Training Job With Python Code
+<img src="./src/submit-job-python.png" width="800">
+- `paddle.job.dist_train()` will call the Job Server API `/v1/packages` to upload the trainer package and save them on CephFS, and then call `/v1/trainer/job` to submit the PaddlePaddle distributed job.


能不能查一下为什么paddle.job.dist_train()这些都没被正常的语法高亮，显示出来的是这样的：

Miss a blank line...Done.

helinwang · 2017-05-05T18:41:47Z

doc/design/cluster_train/submit-job.md

+
+## Submit Distributed Training Job With Python Code
+<img src="./src/submit-job-python.png" width="800">
+- `paddle.job.dist_train()` will call the Job Server API `/v1/packages` to upload the trainer package and save them on CephFS, and then call `/v1/trainer/job` to submit the PaddlePaddle distributed job.


为了把我们第一个版本说得具体，能不能这样改：change "The first version, we only implement submit the PaddlePaddle job in paddle.job.dist_train()" to "For the first version, we will not prepare the runtime docker image, instead, we will mount the trainer package in a temporary folder into the base docker image. We will not support custom Python dependencies in the first version as well."

helinwang · 2017-05-05T18:44:02Z

doc/design/cluster_train/submit-job.md

+      cpu_num=4,
+      gpu_num=2,
+      memory="1G"
+      input=/quickstart/input,


我理解的是咱们会自动把用户根目录mount进去在指定的位置（比如/home/），所以这里input和output都不需要了吧？

helinwang · 2017-05-05T18:47:56Z

doc/design/cluster_train/submit-job.md

+parameter | required | default | explain
+  --- | --- | --- | ---
+job_name|YES||you should special a uniq job name which in a namespace
+trainer_package|YES|| entry point for startup trainer process


trainer_package是指的上传哪个文件夹吗？这个也不是entrypoint吧。我理解entrypoint是一个指令。
我想象中的是这样的：

(..., trainer_package="/path/to/folder", entrypoint="python train.py")

其实我不确定需不需要，是不是可以trainer_package用当前Python文件的目录，entrypoint用python trainer Python文件就行。

不好意思，这里少了一行。因为Kubernetes的Pod要能访问到trainer_package所指向的目录，所以trainer_package应该是CephFS或者Docker image里的一个目录，用当前Python文件的目录应该是不行的。

entrypoint可以直接是"python trainer %s" % __file__

helinwang · 2017-05-05T20:20:04Z

doc/design/cluster_train/submit-job.md

@@ -0,0 +1,115 @@
+# Submit a Distributed Training Job
+
+If a user wants to start up a distributed training job, he will submit the distributed training job with Python code.


文章题目是"Submit a Distributed Training Job"，是不是本地训练就不用说了（或者不用详细说，这里本地训练的介绍字数比远程训练的介绍字数还多）。
是不是可以改成：
The user can submit a distributed training job with Python code, rather than with a command-line interface.

helinwang · 2017-05-05T20:28:23Z

doc/design/cluster_train/submit-job.md

+
+  The trainer package which user upload and some Python dependencies are packaged into a runtime Docker image based on base Docker image.
+
+- Python Dependencies


这里Python Dependencies跟Base / Runtime Docker image不是并列关系，是不是可以改成一个小标题

### Handle Python Dependencies

Done, 是不是用-更合适一些？小标题的话要#### Handler Python Dependencies，层级太深了。

@Yancey1989 主要是旁边有两条-开头的是并列关系，这里再加个-开头的感觉不太合适。其他情况我觉得都可以哈。

我看到更新之后的了，可以的～

helinwang · 2017-05-05T20:35:27Z

doc/design/cluster_train/submit-job.md

+
+  There are some benefits for building runtime Docker image on JobServer:
+  - **Docker in Docker** should mount `docker.sock` in the container and set `--privileged`, if the code running in a kubernetes pod, it's not safe.
+  - Users only need to upload the training package files, does not dependency docker engine, docker registry.


does not dependency docker engine, docker registry. -> does not need to install docker engine, docker registry as dependencies.

helinwang · 2017-05-05T20:46:05Z

doc/design/cluster_train/submit-job.md

+  `paddle.job.dist_train` will upload the trainer package to Job Server and then save them on the distributed filesystem, and then start up a job for building the runtime Docker image, Parameter Server and Trainer will use this runtime Docker image.
+
+  There are some benefits for building runtime Docker image on JobServer:
+  - **Docker in Docker** should mount `docker.sock` in the container and set `--privileged`, if the code running in a kubernetes pod, it's not safe.


Docker in Docker加粗不是很合适：读者可能都不明白什么是Docker in Docker。
另外，我没有明白用Docker in Docker在k8s里不safe，为什么在JobServer就safe了。

如果代码在Paddle Cloud上运行，每个Jupyter Notebook 是一个Kubernetes的Pod，如果需要在这个Pod里去做Docker build,那么就需要将主机上的docker.sock mount到这个Pod里，那么用户可以在Notebook里写代码调用Docker的 REST API来访问本机的Docker Engine了。
而在JobServer里做Docker build比较安全的原因是在JobServer中会通过我们写好的一段bash，只做Docker build的事情，不会直接执行用户的代码。

helinwang · 2017-05-08T01:36:21Z

doc/design/cluster_train/submit-job.md

+                              paddle.updater.Adam(...)),
+    reader=reader,
+    paddle_job=PaddleJob(
+      pserver_bucket="stander",


stander貌似是拼写错误了。

helinwang · 2017-05-08T01:43:08Z

doc/design/cluster_train/submit-job.md

+      trainer.train(reader, num_passes, event_handler, feeding)
+```
+
+parameter | required | default | explain


explain -> explanation

helinwang · 2017-05-08T01:43:48Z

doc/design/cluster_train/submit-job.md

+
+parameter | required | default | explain
+  --- | --- | --- | ---
+job_name|YES||you should special a unique job name which in a namespace


special估计是想写specify。

我觉得参数说明应该是说清楚这个参数是做什么的，是陈述性的，这里翻译成中文是“你应该起一个在namespace唯一的job名字。”，这个语气不是陈述性的，内容用户看了可能也比较蒙：他们会不知道namespace是啥（貌似不需要知道，从他们的角度，没有namespace，只有他自己的任务）。

例子，docker run --name的说明：

If you do not assign a container name with the --name option, then the daemon generates a random string name for you. Defining a name can be a handy way to add meaning to a container. If you specify a name, you can use it when referencing the container within a Docker network.

写在这里可能太长了，可以考虑写成：
the unique name for the training job.

helinwang · 2017-05-08T01:45:27Z

doc/design/cluster_train/submit-job.md

+  --- | --- | --- | ---
+job_name|YES||you should special a unique job name which in a namespace
+entry_point|YES|| entry point for startup trainer process
+trainer_package|YES|| trainer package file path, you can special a cloud path with `pfs://home/paddle` or a normal path with `/home/paddle`


出现了多处special，意思好像错了，special是“特殊”的意思，貌似是想说“specify”。

helinwang · 2017-05-08T01:46:44Z

doc/design/cluster_train/submit-job.md

+  --- | --- | --- | ---
+job_name|YES||you should special a unique job name which in a namespace
+entry_point|YES|| entry point for startup trainer process
+trainer_package|YES|| trainer package file path, you can special a cloud path with `pfs://home/paddle` or a normal path with `/home/paddle`


不是很懂写cloud path的话，我们需要怎么执行云端训练。如果当前运行的train.py与cloud path中的train.py内容不一致怎么办？

PaddleCloud中用户是直接在云端存储上写代码的，所以这里是一个云端存储的路径。
如果是从本地提交，这应该是Docker Image中的一个路径，也就是一个本地路径。

改成了trainer package file path which user have the access right.
Done.

helinwang · 2017-05-08T02:05:12Z

doc/design/cluster_train/submit-job.md

+- PServer Resource
+  - Special `pserver_bucket`
+    - `pserver_bucket=mini`, a single PServer process, it's suitable for learning how to Paddle Cloud.
+    - `pserver_bueckt=stander`, many PServer processes.


并没有stander这个单词，是说standard吗？
我看到google cloud ML API用的是standard和premium。他们用premium我理解因为那个参数（premium）同时设定了trainer和parameter server的。这里，我觉得用single, medium, large就可以了。（更直白）

helinwang · 2017-05-08T02:05:38Z

doc/design/cluster_train/submit-job.md

+### Special Resource for a Distributed Training Job
+- PServer Resource
+  - Special `pserver_bucket`
+    - `pserver_bucket=mini`, a single PServer process, it's suitable for learning how to Paddle Cloud.


如果是single的话，bucket名字就写single吧。（更简单，我们尽量追求简单）

去掉了pserver_bucket, done.

helinwang · 2017-05-08T02:09:18Z

doc/design/cluster_train/submit-job.md

+    - `pserver_bucket=mini`, a single PServer process, it's suitable for learning how to Paddle Cloud.
+    - `pserver_bueckt=stander`, many PServer processes.
+    - `pserver_bucket=premium`, large PServer processes
+  - Custom PServer Resource


先不考虑了吧，太复杂了：）

Done, the same as #1770 (comment)

helinwang · 2017-05-08T02:10:52Z

doc/design/cluster_train/submit-job.md

+- Trainer Resource
+  - you *may* special `gpu_num` for the trainers totally used. By default, trainer count equal GPU count.
+  - you *must* special `cpu_num` for the trainers totally used. if `gpu_num=0`, trainer count equal CPU count.  
+  - you *must* special `memory` for the trainers totally used, you can express memory as a plain integer using one of these suffixes: E, P, T, G, M, K.


用户应该只关心每个trainer的memory情况，所以memory还是按照每个trainer来吧，如果是total，还需要根据有没有GPU的不同规则算有多少个trainer。。。

helinwang · 2017-05-08T02:12:41Z

doc/design/cluster_train/submit-job.md

+  `POST /v1/trainer/job` receives the distributed training parameters, and deploy the job as follows:
+  - Deploy PaddlePaddle Parameter Server processes, it's a Kubernetes ReplicaSet.
+  - Deploy PaddlePaddle Trainer processes, it's a Kubernetes Job.
+  - Deploy PaddlePaddle Master processes, it's a Kubernetes ReplicaSet.


只有一个master，需要ReplicaSet吗？（或者什么别的更合适一点），我不是特清楚部署的时候用哪个比较好，请教一下。

ReplicaSe比较简单了，主要是为了当Master节点挂掉时，Kubernetes会重新将Master调度起来，用Pod将会失去这个特性。

好的，明白了！

helinwang

Two more comments, Almost there!

helinwang · 2017-05-10T21:23:30Z

doc/design/cluster_train/submit-job.md

+<img src="./src/submit-job-python.png" width="800">
+
+- `paddle.job.dist_train()` will call the Job Server API `/v1/packages` to upload the trainer package and save them on CephFS, and then call `/v1/trainer/job` to submit the PaddlePaddle distributed job.
+- `/v1/trainer/job` will start a building job for preparing the runtime Docker image. When the building job is finished, Job Server will submit the PaddlePaddle distributed job to Kubernetes.


第一版本是通过JobServer来submit job，还是本地Python直接来？

第一版就本地直接提交了，本地提交需要build runtime Docker image，这里补充了一句。
Done.

helinwang · 2017-05-10T21:35:25Z

doc/design/cluster_train/submit-job.md

+  - Deploy PaddlePaddle Trainer processes, it's a Kubernetes Job.
+  - Deploy PaddlePaddle Master processes, it's a Kubernetes ReplicaSet.
+
+## Job Server


第一版本需要实现job server吗？https://github.com/PaddlePaddle/cloud/wiki/2017-05 中提到了“查看训练状态”是指网站通过job server的API取得信息，显示给用户看任务列表和任务log吗？

这一版是想在网站后台直接实现JobServer的部分接口，例如查看训练状态和任务列表。

jacquesqiao · 2017-05-11T03:48:28Z

doc/design/cluster_train/submit-job.md

+### specify Resource for a Distributed Training Job
+- PServer Resource
+  - specify `pserver_bucket`
+    - `pserver_bucket=single`, a single PServer process, it's suitable for learning how to Paddle Cloud.


how to Paddle Cloud

how to use Paddle Cloud?

去掉了这个参数，这个表述不太清楚，意思是说Paddle Cloud根据集群架构以及资源情况来自动计算出PServer的数量。

helinwang · 2017-05-11T18:26:04Z

doc/design/cluster_train/submit-job.md

+
+- `paddle.job.dist_train()` will call the Job Server API `/v1/packages` to upload the trainer package and save them on CephFS, and then call `/v1/trainer/job` to submit the PaddlePaddle distributed job.
+- `/v1/trainer/job` will start a building job for preparing the runtime Docker image. When the building job is finished, Job Server will submit the PaddlePaddle distributed job to Kubernetes.
+- *NOTE*: For the first version, we will not prepare the runtime Docker image on JobServer, instead, we will build the runtime Docker image on our host. If the code is running on PaddleCloud, we will mount the trainer package in a temporary folder into the base Docker image. We will not support custom Python dependencies in the first version as well.


"instead, we will build the runtime Docker image on our host"和后面的“we will mount the trainer package in a temporary folder into the base Docker image”有冲突，前面是说要build runtime image，后面说是直接用base image就好。
建议改成

NOTE: For the first version, we will not prepare the runtime Docker image, instead, the package is uploaded to Paddle cloud, and Paddle cloud will mount the package in a temporary folder into the base Docker image. We will not support custom Python dependencies in the first version as well.

Paddle cloud => Paddle Cloud
Done.

另外貌似用户不需要指定runtime Docker image, 所以在参数中也只保留了image，去掉了base_image和runtime_image

helinwang · 2017-05-11T18:27:51Z

doc/design/cluster_train/submit-job.md

+    paddle_job=PaddleJob(
+      runtime_image = "yancey1989/paddle-job",
+      job_name="paddle-job",
+      namespace="paddle-cloud",


不理解为什么需要namespace，我以为job_name就可以成为namespace了？

现在Kubernetes集群上的做法是为每个用户创建一个namespace，一个namespace下会跑多个Job。不过本地提交可以在~/.kube/config中获取namespace，PaddleCloud也可以通过环境变量来获取，所以这个参数可以去掉了：）
Done.

另外我在实现PaddleJob的时候发现dist_train可能只需要trainer, paddle_job两个参数即可，其中tariner是一个的function, PR中做了说明。
Done.

helinwang · 2017-05-11T18:34:39Z

doc/design/cluster_train/submit-job.md

+trainer_package | str | trainer package file path which user have the access right
+base_image|str|the [base image](#base-docker-image) for building the [runtime image](#runtime-docker-image)
+runtime_image|str| [runtime image](#runtime-docker-image)
+memory|str| memory allocated for the job, a plain integer using one of these suffixes: E, P, T, G, M, K


已经有trainer_mem，memroy是不是可以删了。

Done.
去掉了第一种配置资源的方式。

helinwang · 2017-05-11T18:35:58Z

doc/design/cluster_train/submit-job.md

+trainer_mem|str| memory allocated for each Trainer process, a plain integer using one of these suffixes: E, P, T, G, M, K
+
+### Specify Resource for a Distributed Training Job
+- Specify Job Resource


这里讲了两种方法，我感觉我们只想清楚了第二种，也就准备支持第二种。那么第一种我们先删掉吧：）

Yancey1989 added 2 commits April 12, 2017 01:22

submit job

3331bda

small png

7faa331

Yancey1989 mentioned this pull request Apr 11, 2017

paddle dist architecture (draft) #1620

Merged

jacquesqiao reviewed Apr 12, 2017

View reviewed changes

wangkuiyi reviewed Apr 12, 2017

View reviewed changes

typhoonzero reviewed Apr 12, 2017

View reviewed changes

helinwang requested changes Apr 12, 2017

View reviewed changes

Yancey1989 added 5 commits April 15, 2017 13:45

update

718f901

add paddlepaddle commands

fd9e1c2

update png

c095a93

udpate png

d74d9ba

adjust sytle

f4c7bd2

wangkuiyi requested changes Apr 17, 2017

View reviewed changes

Yancey1989 added 2 commits April 18, 2017 19:39

update submit-job

bb7263f

update

005c3e1

helinwang requested changes Apr 19, 2017

View reviewed changes

helinwang mentioned this pull request Apr 20, 2017

Distributed training progress and todo #1820

Closed

Yancey1989 added 2 commits April 21, 2017 00:12

update

0f113d3

update paddle server

b6969f9

typhoonzero mentioned this pull request Apr 24, 2017

paddle cloud 计划内容 #1860

Closed

Yancey1989 added 4 commits April 25, 2017 17:42

update

1771707

update

a21743a

resize image

d643295

update

5827dc1

typhoonzero reviewed Apr 26, 2017

View reviewed changes

rename direcotry

1987b45

helinwang reviewed Apr 28, 2017

View reviewed changes

update

6f097a3

Yancey1989 changed the title ~~Submit a distributed job (draft)~~ Submit a distributed job May 3, 2017

Yancey1989 changed the title ~~Submit a distributed job~~ Design doc: submit a distributed job May 3, 2017

Yancey1989 added 2 commits May 4, 2017 19:59

trainer use replicaset instead of statefulset

bfdd1a3

update

b56e7e7

helinwang reviewed May 5, 2017

View reviewed changes

Yancey1989 added 2 commits May 5, 2017 13:20

update

b8e63d9

update

6cbf80d

helinwang reviewed May 5, 2017

View reviewed changes

Yancey1989 added 6 commits May 6, 2017 11:51

update

cb39a81

update

063805d

update

e2e6875

update

5ec1deb

update

53b5afa

update

198d0d1

helinwang reviewed May 8, 2017

View reviewed changes

This was referenced May 9, 2017

How to specify the distributed training job resource #2019

Closed

Simpler cluster train job submit code #2047

Closed

update

838509b

helinwang reviewed May 10, 2017

View reviewed changes

jacquesqiao reviewed May 11, 2017

View reviewed changes

update

259731a

helinwang reviewed May 11, 2017

View reviewed changes

Yancey1989 added 4 commits May 12, 2017 10:26

update

a5a0aeb

trainer function

080e633

delete specify resource paragraph

05d6e00

paramter image instead of base_image and runtime_image

8486227

Yancey1989 closed this May 12, 2017

		@@ -0,0 +1,70 @@

		# PaddlePaddle Client
		PaddlePaddle clinet is a command line tool, before startting a cluster train, you need to install it on your latop, you can sumit a cluster train job looks like:


		- Data source

		Input data should be saved on the distributed and you have the access.


		- Saving package

		Master process will receive the package files and save on the distributed storage, for every job, master will generate a random id called job-id, the package file path looks like:


		# Different cluster management

		PaddlePaddle client support plugins for different cluster management client, such as kubernetes looks like `paddle k8s ...`, MPI looks like `paddl mpi...`,but runinig process are the same:


		- Setup parameter and trainer

		Master process will set up parameter server process and trainer process, on kubernetes, master will deploy two deployment for parameter server and trainer.


		The relation of PaddlePaddle, kubernetes and docker:

		<img src="./submit-job.png" width="500">

		<img src="./submit-job.png" width="500">


		# Running local training job


		If a user wants to start up a distributed training job, he will submit the distributed training job in python code, or use a command line tool.

		The relation of PaddlePaddle, kubernetes and docker:


		- Python Dependencies

		Users will provide a `requirments.txt` file in trainer packages, to list python dependencies packages, such as:

		The relation of PaddlePaddle, kubernetes and docker:


		# Runtime Environment On kubernetes

		@@ -0,0 +1,111 @@
		# Submit a Distributed Training Job

		If a user wants to start up a distributed training job, he will submit the distributed training job with python code.


		## Runtime Environment On Kubernetes

		For a distributed training job, there is two Docker image called runtime Docker image and base Docker image. The runtime Docker image is the Docker image that gets scheduled by Kubernetes to run during training. The base Docker image is for building the runtime Docker image.


		- Base Docker Image

		Usually, the base Docker image is PaddlePaddle product Docker image including paddle binary files and trainer startup script file. And of course, users can specify any image name hosted on any docker registry which users have the right access.


		- Runtime Docker Image

		The trainer package which user upload and some Python dependencies are packaged into a runtime Docker image based on base Docker image


		- RESTful API

		Job server provides a RESTful HTTP server receives the trainer packages, list PaddlePaddle job etc...


		`paddle.dist_train` will upload the trainer package to Job Server and then save them on the distributed filesystem, and then start up a job for building the runtime Docker image, Parameter Server and Trainer will use this runtime Docker image.

		There are some benefits for building runtime Docker image on JobServer:


		- Python Dependencies

		You need to provide requirments.txt file in your "trainer" package. Example:

		@@ -0,0 +1,115 @@
		# Submit a Distributed Training Job

		If a user wants to start up a distributed training job, he will submit the distributed training job with Python code.

Design doc: submit a distributed job #1770

Design doc: submit a distributed job #1770

Conversation

Yancey1989 commented Apr 11, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Yancey1989 Apr 12, 2017 • edited Loading

Choose a reason for hiding this comment

helinwang Apr 12, 2017 • edited Loading

Choose a reason for hiding this comment

Yancey1989 Apr 13, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wangkuiyi Apr 13, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

helinwang Apr 12, 2017 • edited Loading

Choose a reason for hiding this comment

helinwang Apr 12, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

helinwang Apr 15, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

helinwang Apr 12, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

helinwang Apr 16, 2017 • edited Loading

Choose a reason for hiding this comment

Yancey1989 Apr 16, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

helinwang Apr 12, 2017 • edited Loading

Choose a reason for hiding this comment

gongweibao Apr 13, 2017 • edited Loading

Choose a reason for hiding this comment

Yancey1989 Apr 13, 2017 • edited Loading

Choose a reason for hiding this comment

helinwang Apr 12, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Yancey1989 commented Apr 11, 2017 •

edited

Loading

Yancey1989 Apr 12, 2017 •

edited

Loading

helinwang Apr 12, 2017 •

edited

Loading

Yancey1989 Apr 13, 2017 •

edited

Loading

wangkuiyi Apr 13, 2017 •

edited

Loading

helinwang Apr 12, 2017 •

edited

Loading

helinwang Apr 12, 2017 •

edited

Loading

helinwang Apr 15, 2017 •

edited

Loading

helinwang Apr 12, 2017 •

edited

Loading

helinwang Apr 16, 2017 •

edited

Loading

Yancey1989 Apr 16, 2017 •

edited

Loading

helinwang Apr 12, 2017 •

edited

Loading

gongweibao Apr 13, 2017 •

edited

Loading

Yancey1989 Apr 13, 2017 •

edited

Loading

helinwang Apr 12, 2017 •

edited

Loading