Merge pull request microsoft#143 from Microsoft/master
merge master
SparkSnail authored Mar 18, 2019
2 parents 5e777d2 + c1e6098 commit 6ff24a5
Showing 27 changed files with 132 additions and 37 deletions.
6 changes: 3 additions & 3 deletions README_zh_CN.md
@@ -82,18 +82,18 @@ NNI (Neural Network Intelligence) is a toolkit for automated machine learning (AutoML)

## **Use Cases**

* Trial different automated machine learning algorithms locally to train models
* Try different automated machine learning (AutoML) algorithms on your local machine to train models
* Accelerate automated machine learning in distributed environments (e.g., remote GPU workstations and cloud servers).
* Customize automated machine learning algorithms, or compare different automated machine learning algorithms.
* Support automated machine learning in your own machine learning platform
* Support automated machine learning in machine learning platforms

## Related Projects

Targeting research and advanced technology, [Microsoft Research (MSR)](https://www.microsoft.com/en-us/research/group/systems-research-group-asia/) has released several open-source projects.

* [OpenPAI](https://github.com/Microsoft/pai): an open-source platform that provides complete AI model training and resource management capabilities; it is easy to extend and supports on-premise, cloud, and hybrid environments of all scales.
* [FrameworkController](https://github.com/Microsoft/frameworkcontroller): an open-source general-purpose Kubernetes Pod controller that orchestrates all kinds of applications on Kubernetes with a single controller.
* [MMdnn](https://github.com/Microsoft/MMdnn): a finished, cross-framework solution to convert, visualize, and diagnose deep neural network models. The "MM" in MMdnn stands for model management, and "dnn" is an abbreviation of deep neural network. We encourage researchers and students to leverage these projects to accelerate AI development and research.
* [MMdnn](https://github.com/Microsoft/MMdnn): a complete, cross-framework solution to convert, visualize, and diagnose deep neural network models. The "MM" in MMdnn stands for model management, and "dnn" is an abbreviation of deep neural network. We encourage researchers and students to leverage these projects to accelerate AI development and research.

## **Installation & Verification**

5 changes: 5 additions & 0 deletions docs/en_US/ExperimentConfig.md
@@ -149,6 +149,11 @@ machineList:
Note: The maxExecDuration spec sets the duration of an experiment, not of a trial job. If the experiment reaches the max duration, it does not stop, but it can no longer submit new trial jobs.
* __debug__
* Description
NNI checks that the version of the nniManager process matches the version of the trialKeeper on the remote, pai and kubernetes platforms. If you want to disable the version check, you can set debug mode to true.
* __maxTrialNum__
* Description
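To make the option above concrete, here is a minimal, hypothetical `config.yml` fragment that enables debug mode (all values are placeholders; only `debug` relates to the version check described above):

```
authorName: default
experimentName: example_mnist
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
trainingServicePlatform: remote
# Assumption based on the note above: debug mode disables the TrialKeeper version check.
debug: true
```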
10 changes: 5 additions & 5 deletions docs/en_US/HowToUseDocker.md
@@ -35,19 +35,19 @@ Note:

If you start a docker container using NNI's official image `msranni/nni`, you can directly start NNI experiments with the `nnictl` command. The official image contains NNI's runtime environment along with basic Python and deep learning framework environments.

If you start your own docker image, you may need to install the NNI package first; please [refer](https://github.com/Microsoft/nni/blob/master/docs/en_US/Installation.md).
If you start your own docker image, you may need to install the NNI package first; please [refer](Installation.md).

If you want to run NNI's official examples, you may need to clone the NNI repo on GitHub using
```
git clone https://github.com/Microsoft/nni.git
```
then you can enter `nni/examples/trials` to start an experiment.

After you have prepared NNI's environment, you can start a new experiment using the `nnictl` command, [refer](https://github.com/Microsoft/nni/blob/master/docs/en_US/QuickStart.md)
After you have prepared NNI's environment, you can start a new experiment using the `nnictl` command, [refer](QuickStart.md)

## Using docker in remote platform

NNI supports starting experiments in [remoteTrainingService](https://github.com/Microsoft/nni/blob/master/docs/en_US/RemoteMachineMode.md) and running trial jobs on remote machines. As docker can start an independent Ubuntu system with an SSH server, a docker container can be used as the remote machine in NNI's remote mode.
NNI supports starting experiments in [remoteTrainingService](RemoteMachineMode.md) and running trial jobs on remote machines. As docker can start an independent Ubuntu system with an SSH server, a docker container can be used as the remote machine in NNI's remote mode.

### Step 1: Setting docker environment

@@ -78,7 +78,7 @@ If you use your own docker image as remote server, please make sure that this im

### Step3: Run NNI experiments

You can set the platform in your config file to remote and set the `machineList` configuration to connect to your docker SSH server, [refer](https://github.com/Microsoft/nni/blob/master/docs/en_US/RemoteMachineMode.md). Note that you should set the correct `port`, `username`, and `passwd` or `sshKeyPath` of your host machine.
You can set the platform in your config file to remote and set the `machineList` configuration to connect to your docker SSH server, [refer](RemoteMachineMode.md). Note that you should set the correct `port`, `username`, and `passwd` or `sshKeyPath` of your host machine.

`port:` The host machine's port, mapping to docker's SSH port.

@@ -88,4 +88,4 @@

`sshKeyPath:` The path of the private key of the docker container.

After configuring the config file, you can start an experiment, [refer](https://github.com/Microsoft/nni/blob/master/docs/en_US/QuickStart.md)
After configuring the config file, you can start an experiment, [refer](QuickStart.md)
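As a sketch of the docker side of this setup, the command below builds the `docker run` invocation for an SSH-reachable container. The host port `2222` is an arbitrary assumption (pick any free port and put it in `machineList.port`), and it assumes the image runs an SSH server on port 22:

```shell
# Build the docker run command for an SSH-reachable container (sketch).
# 2222 is a hypothetical host port mapped to the container's sshd on 22.
IMAGE="msranni/nni"
HOST_SSH_PORT=2222
DOCKER_CMD="docker run -d -p ${HOST_SSH_PORT}:22 ${IMAGE}"
echo "${DOCKER_CMD}"
```

The printed command is what you would run on the host; NNI then connects to `localhost:2222` as if it were a remote machine.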
7 changes: 7 additions & 0 deletions docs/en_US/NNICTLDOC.md
@@ -45,6 +45,12 @@ nnictl support commands:
|------|------|------|------|
|--config, -c| True| |YAML configure file of the experiment|
|--port, -p|False| |the port of restful server|
|--debug, -d|False||set debug mode|

Note:
```
Debug mode disables the version check function in TrialKeeper.
```
<a name="resume"></a>
* __nnictl resume__
@@ -65,6 +71,7 @@ nnictl support commands:
|------|------|------ |------|
|id| False| |The id of the experiment you want to resume|
|--port, -p| False| |Rest port of the experiment you want to resume|
|--debug, -d|False||set debug mode|
<a name="stop"></a>
* __nnictl stop__
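Putting the flags above together, a typical invocation that starts an experiment in debug mode (and therefore skips the TrialKeeper version check) would look like the following; `config.yml` and port `8080` are placeholders:

```
nnictl create --config config.yml --port 8080 --debug
```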
2 changes: 1 addition & 1 deletion docs/en_US/Trials.md
@@ -92,7 +92,7 @@ with tf.Session() as sess:
batch_size = 128
for i in range(10000):
batch = mnist.train.next_batch(batch_size)
+ """@nni.variable(nni.choice(1, 5), name=dropout_rate)"""
+ """@nni.variable(nni.choice(0.1, 0.5), name=dropout_rate)"""
dropout_rate = 0.5
mnist_network.train_step.run(feed_dict={mnist_network.images: batch[0],
mnist_network.labels: batch[1],
1 change: 1 addition & 0 deletions src/nni_manager/common/manager.ts
@@ -36,6 +36,7 @@ interface ExperimentParams {
trainingServicePlatform: string;
multiPhase?: boolean;
multiThread?: boolean;
versionCheck?: boolean;
tuner?: {
className: string;
builtinTunerName?: string;
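A minimal sketch of how the new optional field could be consumed. The interface is reduced to the fields relevant here, and the default of "check enabled when the field is omitted" is an assumption mirroring the training service's default:

```typescript
// Reduced sketch of ExperimentParams: only the fields relevant to version checking.
interface ExperimentParamsSketch {
    trainingServicePlatform: string;
    versionCheck?: boolean;
}

function isVersionCheckEnabled(params: ExperimentParamsSketch): boolean {
    // Undefined means "not configured": keep the default behavior (check enabled).
    return params.versionCheck === undefined ? true : params.versionCheck;
}

const withFlag: ExperimentParamsSketch = { trainingServicePlatform: 'remote', versionCheck: false };
const withoutFlag: ExperimentParamsSketch = { trainingServicePlatform: 'local' };
console.log(isVersionCheckEnabled(withFlag));    // false
console.log(isVersionCheckEnabled(withoutFlag)); // true
```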
15 changes: 14 additions & 1 deletion src/nni_manager/common/utils.ts
@@ -345,6 +345,19 @@ function countFilesRecursively(directory: string, timeoutMilliSeconds?: number):
});
}

/**
* get the version of current package
*/
async function getVersion(): Promise<string> {
const deferred : Deferred<string> = new Deferred<string>();
import(path.join(__dirname, '..', 'package.json')).then((pkg)=>{
deferred.resolve(pkg.version);
}).catch((error)=>{
deferred.reject(error);
});
return deferred.promise;
}

export {countFilesRecursively, getRemoteTmpDir, generateParamFileName, getMsgDispatcherCommand, getCheckpointDir,
getLogDir, getExperimentRootDir, getJobCancelStatus, getDefaultDatabaseDir, getIPV4Address,
mkDirP, delay, prepareUnitTest, parseArg, cleanupUnitTest, uniqueString, randomSelect };
mkDirP, delay, prepareUnitTest, parseArg, cleanupUnitTest, uniqueString, randomSelect, getVersion };
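The `Deferred` used by `getVersion` wraps a dynamic import into a promise. Below is a self-contained sketch of that pattern; the `Deferred` class is a stand-in assumed to have the same shape as the one utils.ts uses, and the in-memory `package.json` object replaces the real dynamic import:

```typescript
// Minimal stand-in for the Deferred class used in utils.ts (assumption: same shape).
class Deferred<T> {
    public promise: Promise<T>;
    public resolve!: (value: T) => void;
    public reject!: (reason?: Error) => void;
    constructor() {
        this.promise = new Promise<T>((res, rej) => { this.resolve = res; this.reject = rej; });
    }
}

// Stand-in for the dynamic import of package.json (the real code reads the file on disk).
function loadPackageJson(): Promise<{ version: string }> {
    return Promise.resolve({ version: '0.6.0' });
}

async function getVersionSketch(): Promise<string> {
    const deferred: Deferred<string> = new Deferred<string>();
    loadPackageJson()
        .then((pkg) => { deferred.resolve(pkg.version); })
        .catch((error) => { deferred.reject(error); });
    return deferred.promise;
}
```

Callers simply `await getVersionSketch()`; resolution and rejection both flow through the single `Deferred` instance.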
11 changes: 10 additions & 1 deletion src/nni_manager/core/nnimanager.ts
@@ -127,7 +127,11 @@ class NNIManager implements Manager {
if (expParams.multiPhase && this.trainingService.isMultiPhaseJobSupported) {
this.trainingService.setClusterMetadata('multiPhase', expParams.multiPhase.toString());
}

// Set up versionCheck config
if (expParams.versionCheck !== undefined) {
this.trainingService.setClusterMetadata('version_check', expParams.versionCheck.toString());
}

const dispatcherCommand: string = getMsgDispatcherCommand(expParams.tuner, expParams.assessor, expParams.advisor,
expParams.multiPhase, expParams.multiThread);
this.log.debug(`dispatcher command: ${dispatcherCommand}`);
@@ -162,6 +166,11 @@
this.trainingService.setClusterMetadata('multiPhase', expParams.multiPhase.toString());
}

// Set up versionCheck config
if (expParams.versionCheck !== undefined) {
this.trainingService.setClusterMetadata('version_check', expParams.versionCheck.toString());
}

const dispatcherCommand: string = getMsgDispatcherCommand(expParams.tuner, expParams.assessor, expParams.advisor,
expParams.multiPhase, expParams.multiThread);
this.log.debug(`dispatcher command: ${dispatcherCommand}`);
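Cluster metadata values travel as strings, so the boolean is stringified before `setClusterMetadata`. The sketch below shows that round trip; the `MetadataStore` stand-in is an assumption (the real training service forwards the value to TrialKeeper):

```typescript
// Stand-in for a training service that records cluster metadata as strings.
class MetadataStore {
    private readonly data: Map<string, string> = new Map();
    public setClusterMetadata(key: string, value: string): void {
        this.data.set(key, value);
    }
    public get(key: string): string | undefined {
        return this.data.get(key);
    }
}

const store = new MetadataStore();
const versionCheck: boolean | undefined = false;  // as it would come from ExperimentParams
if (versionCheck !== undefined) {
    // Same snake_case key as TrialConfigMetadataKey.VERSION_CHECK.
    store.setClusterMetadata('version_check', versionCheck.toString());
}
console.log(store.get('version_check')); // the stored value is the string "false"
```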
5 changes: 3 additions & 2 deletions src/nni_manager/rest_server/restHandler.ts
@@ -30,6 +30,7 @@ import { getLogger, Logger } from '../common/log';
import { ExperimentProfile, Manager, TrialJobStatistics} from '../common/manager';
import { ValidationSchemas } from './restValidationSchemas';
import { NNIRestServer } from './nniRestServer';
import { getVersion } from '../common/utils';

const expressJoi = require('express-joi-validator');

@@ -104,8 +105,8 @@

private version(router: Router): void {
router.get('/version', async (req: Request, res: Response) => {
const pkg = await import(path.join(__dirname, '..', 'package.json'));
res.send(pkg.version);
const version = await getVersion();
res.send(version);
});
}

1 change: 1 addition & 0 deletions src/nni_manager/rest_server/restValidationSchemas.ts
@@ -139,6 +139,7 @@ export namespace ValidationSchemas {
maxExecDuration: joi.number().min(0).required(),
multiPhase: joi.boolean(),
multiThread: joi.boolean(),
versionCheck: joi.boolean(),
advisor: joi.object({
builtinAdvisorName: joi.string().valid('Hyperband'),
codeDir: joi.string(),
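The schema addition just declares an optional boolean. A dependency-free sketch of the same check is shown below; the hand-rolled validator is an assumption for illustration (the real code uses `joi`):

```typescript
// Hand-rolled check mirroring an optional boolean field (sketch, not the real joi API).
function validateVersionCheck(body: { versionCheck?: unknown }): boolean {
    return body.versionCheck === undefined || typeof body.versionCheck === 'boolean';
}

console.log(validateVersionCheck({ versionCheck: true }));   // true
console.log(validateVersionCheck({}));                       // true (field is optional)
console.log(validateVersionCheck({ versionCheck: 'yes' }));  // false (wrong type)
```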
@@ -31,5 +31,6 @@ export enum TrialConfigMetadataKey {
PAI_CLUSTER_CONFIG = 'pai_config',
KUBEFLOW_CLUSTER_CONFIG = 'kubeflow_config',
NNI_MANAGER_IP = 'nni_manager_ip',
FRAMEWORKCONTROLLER_CLUSTER_CONFIG = 'frameworkcontroller_config'
FRAMEWORKCONTROLLER_CLUSTER_CONFIG = 'frameworkcontroller_config',
VERSION_CHECK = 'version_check'
}
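The enum maps config-facing names to the snake_case metadata strings that `setClusterMetadata` matches on. A runnable excerpt, reduced to the keys touched by this commit:

```typescript
// Excerpt of TrialConfigMetadataKey limited to the keys touched by this commit.
enum TrialConfigMetadataKeySketch {
    NNI_MANAGER_IP = 'nni_manager_ip',
    FRAMEWORKCONTROLLER_CLUSTER_CONFIG = 'frameworkcontroller_config',
    VERSION_CHECK = 'version_check'
}

// The switch statements in the training services match on these string values.
console.log(TrialConfigMetadataKeySketch.VERSION_CHECK); // prints version_check
```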
@@ -191,7 +191,7 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
await cpp.exec(`mkdir -p ${trialLocalTempFolder}`);

for(let taskRole of this.fcTrialConfig.taskRoles) {
const runScriptContent: string = this.generateRunScript('frameworkcontroller', trialJobId, trialWorkingFolder,
const runScriptContent: string = await this.generateRunScript('frameworkcontroller', trialJobId, trialWorkingFolder,
this.generateCommandScript(taskRole.command), curTrialSequenceId.toString(), taskRole.name, taskRole.gpuNum);
await fs.promises.writeFile(path.join(trialLocalTempFolder, `run_${taskRole.name}.sh`), runScriptContent, { encoding: 'utf8' });
}
@@ -267,6 +267,9 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
return Promise.reject(new Error(error));
}
break;
case TrialConfigMetadataKey.VERSION_CHECK:
this.versionCheck = (value === 'true' || value === 'True');
break;
default:
break;
}
@@ -188,7 +188,7 @@ class KubeflowTrainingService extends KubernetesTrainingService implements Kuber

// Write worker file content run_worker.sh to local tmp folders
if(kubeflowTrialConfig.worker) {
const workerRunScriptContent: string = this.generateRunScript('kubeflow', trialJobId, trialWorkingFolder,
const workerRunScriptContent: string = await this.generateRunScript('kubeflow', trialJobId, trialWorkingFolder,
kubeflowTrialConfig.worker.command, curTrialSequenceId.toString(), 'worker', kubeflowTrialConfig.worker.gpuNum);

await fs.promises.writeFile(path.join(trialLocalTempFolder, 'run_worker.sh'), workerRunScriptContent, { encoding: 'utf8' });
@@ -197,15 +197,15 @@
if(this.kubeflowClusterConfig.operator === 'tf-operator') {
let tensorflowTrialConfig: KubeflowTrialConfigTensorflow = <KubeflowTrialConfigTensorflow>this.kubeflowTrialConfig;
if(tensorflowTrialConfig.ps){
const psRunScriptContent: string = this.generateRunScript('kubeflow', trialJobId, trialWorkingFolder,
const psRunScriptContent: string = await this.generateRunScript('kubeflow', trialJobId, trialWorkingFolder,
tensorflowTrialConfig.ps.command, curTrialSequenceId.toString(), 'ps', tensorflowTrialConfig.ps.gpuNum);
await fs.promises.writeFile(path.join(trialLocalTempFolder, 'run_ps.sh'), psRunScriptContent, { encoding: 'utf8' });
}
}
else if(this.kubeflowClusterConfig.operator === 'pytorch-operator') {
let pytorchTrialConfig: KubeflowTrialConfigPytorch = <KubeflowTrialConfigPytorch>this.kubeflowTrialConfig;
if(pytorchTrialConfig.master){
const masterRunScriptContent: string = this.generateRunScript('kubeflow', trialJobId, trialWorkingFolder,
const masterRunScriptContent: string = await this.generateRunScript('kubeflow', trialJobId, trialWorkingFolder,
pytorchTrialConfig.master.command, curTrialSequenceId.toString(), 'master', pytorchTrialConfig.master.gpuNum);
await fs.promises.writeFile(path.join(trialLocalTempFolder, 'run_master.sh'), masterRunScriptContent, { encoding: 'utf8' });
}
@@ -317,6 +317,9 @@ class KubeflowTrainingService extends KubernetesTrainingService implements Kuber
return Promise.reject(new Error(error));
}
break;
case TrialConfigMetadataKey.VERSION_CHECK:
this.versionCheck = (value === 'true' || value === 'True');
break;
default:
break;
}
@@ -71,5 +71,5 @@ mkdir -p $NNI_OUTPUT_DIR
cp -rT $NNI_CODE_DIR $NNI_SYS_DIR
cd $NNI_SYS_DIR
sh install_nni.sh
python3 -m nni_trial_tool.trial_keeper --trial_command '{8}' --nnimanager_ip {9} --nnimanager_port {10} `
python3 -m nni_trial_tool.trial_keeper --trial_command '{8}' --nnimanager_ip {9} --nnimanager_port {10} --version '{11}'`
+ `1>$NNI_OUTPUT_DIR/trialkeeper_stdout 2>$NNI_OUTPUT_DIR/trialkeeper_stderr`
@@ -25,7 +25,7 @@ import * as path from 'path';
import { EventEmitter } from 'events';
import { getExperimentId, getInitTrialSequenceId } from '../../common/experimentStartupInfo';
import { getLogger, Logger } from '../../common/log';
import { getExperimentRootDir, uniqueString, getJobCancelStatus, getIPV4Address } from '../../common/utils';
import { getExperimentRootDir, uniqueString, getJobCancelStatus, getIPV4Address, getVersion } from '../../common/utils';
import {
TrialJobDetail, TrialJobMetric, NNIManagerIpConfig
} from '../../common/trainingService';
@@ -61,6 +61,7 @@ abstract class KubernetesTrainingService {
protected kubernetesCRDClient?: KubernetesCRDClient;
protected kubernetesJobRestServer?: KubernetesJobRestServer;
protected kubernetesClusterConfig?: KubernetesClusterConfig;
protected versionCheck?: boolean = true;

constructor() {
this.log = getLogger();
@@ -179,8 +180,8 @@
* @param command
* @param trialSequenceId sequence id
*/
protected generateRunScript(platform: string, trialJobId: string, trialWorkingFolder: string,
command: string, trialSequenceId: string, roleName: string, gpuNum: number): string {
protected async generateRunScript(platform: string, trialJobId: string, trialWorkingFolder: string,
command: string, trialSequenceId: string, roleName: string, gpuNum: number): Promise<string> {
let nvidia_script: string = '';
// Nvidia device plugin for K8S has a known issue that requesting zero GPUs allocates all GPUs
// Refer https://github.com/NVIDIA/k8s-device-plugin/issues/61
@@ -189,6 +190,7 @@
nvidia_script = `export CUDA_VISIBLE_DEVICES='0'`;
}
const nniManagerIp = this.nniManagerIpConfig?this.nniManagerIpConfig.nniManagerIp:getIPV4Address();
const version = this.versionCheck? await getVersion(): '';
const runScript: string = String.Format(
KubernetesScriptFormat,
platform,
@@ -201,9 +203,10 @@
nvidia_script,
command,
nniManagerIp,
this.kubernetesRestServerPort
this.kubernetesRestServerPort,
version
);
return runScript;
return Promise.resolve(runScript);
}
protected async createNFSStorage(nfsServer: string, nfsPath: string): Promise<void> {
await cpp.exec(`mkdir -p ${this.trialLocalNFSTempFolder}`);
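`generateRunScript` fills numbered placeholders (now including one for the version string) via a `String.Format`-style helper. Below is a dependency-free sketch of that substitution; the `format` helper is hand-rolled for illustration, and the template is a trimmed, hypothetical version of the trial_keeper line from the Kubernetes script format:

```typescript
// Hand-rolled numbered-placeholder substitution mimicking String.Format (assumption:
// the real code uses a library helper with the same {N} syntax).
function format(template: string, ...args: string[]): string {
    return template.replace(/\{(\d+)\}/g, (_match: string, index: string) => args[Number(index)] ?? '');
}

// Trimmed sketch of the trial_keeper line from the Kubernetes script format.
const template = "python3 -m nni_trial_tool.trial_keeper --trial_command '{0}' " +
    "--nnimanager_ip {1} --nnimanager_port {2} --version '{3}'";

// An empty version string would disable the check on the TrialKeeper side;
// '0.6.0' here is a placeholder value.
const version = '0.6.0';
console.log(format(template, 'python3 mnist.py', '10.0.0.1', '8081', version));
```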
2 changes: 1 addition & 1 deletion src/nni_manager/training_service/pai/paiData.ts
@@ -64,7 +64,7 @@ export const PAI_TRIAL_COMMAND_FORMAT: string =
`export NNI_PLATFORM=pai NNI_SYS_DIR={0} NNI_OUTPUT_DIR={1} NNI_TRIAL_JOB_ID={2} NNI_EXP_ID={3} NNI_TRIAL_SEQ_ID={4}
&& cd $NNI_SYS_DIR && sh install_nni.sh
&& python3 -m nni_trial_tool.trial_keeper --trial_command '{5}' --nnimanager_ip '{6}' --nnimanager_port '{7}'
--pai_hdfs_output_dir '{8}' --pai_hdfs_host '{9}' --pai_user_name {10} --nni_hdfs_exp_dir '{11}' --webhdfs_path '/webhdfs/api/v1'`;
--pai_hdfs_output_dir '{8}' --pai_hdfs_host '{9}' --pai_user_name {10} --nni_hdfs_exp_dir '{11}' --webhdfs_path '/webhdfs/api/v1' --version '{12}'`;

export const PAI_OUTPUT_DIR_FORMAT: string =
`hdfs://{0}:9000/`;