How to debug "no machine can be scheduled"? #1852
Typically TMP_NO_AVAILABLE_GPU is caused by the number of required GPUs being larger than the number of available GPUs. Could you please paste the config yaml file? Let's check the configuration first. For a debugging suggestion:
[GPU Stats]
Please also paste the full nnimanager.log if you can confirm the configuration is correct.
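If it helps, a quick sanity check on the remote machine (a minimal sketch; config.yml stands in for your experiment YAML file, and gpuNum assumes a v1-style NNI trial config):
# List the GPUs the driver actually exposes on the remote machine.
nvidia-smi -L
nvidia-smi -L | wc -l
# Compare that count with the number of GPUs each trial requests
# (the gpuNum field in the experiment YAML). If gpuNum is larger
# than the count above, no machine can ever be scheduled and the
# scheduler keeps returning TMP_NO_AVAILABLE_GPU.
grep -n "gpuNum" config.yml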
After I execute
This is a bug when the local system is used as the remote training service.
But the bug still exists after I switch to a remote machine.
Can you paste nnimanager.log again for the real remote machine? And the output of the command
I print out the content of
Is the
It seems the cmdresult.stdout is correct. You can also try to clean up the remote machine environment and try again, and ensure the same version of nni is installed on the remote machine.
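A minimal cleanup sketch (the /tmp/root/nni path matches the logs in this issue; adjust it for your remote user):
# On the remote machine: stop any stale metrics collector and wipe
# NNI's scratch directory so the next experiment starts clean.
pkill -f nni_gpu_tool.gpu_metrics_collector
rm -rf /tmp/root/nni
# Verify the installed nni version on BOTH machines; they must match.
python3 -m pip show nni | grep -i version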
@diggerdu Hope what @chicm-ms suggested works for you. In addition to GitHub issues, you might also consider IM in the NNI Gitter channel: https://gitter.im/Microsoft/nni
Short summary about the issue/question:
My experiment works well in local mode, but it gets stuck on 'Scheduler: trialJob id xxxx, no machine can be scheduled, return TMP_NO_AVAILABLE_GPU' when I change the mode to remotemachine.
Brief what process you are following:
Executing
tail -n 1 /tmp/root/nni/scripts/gpu_metrics
manually fails with the error:
tail: cannot open ‘/tmp/root/nni/scripts/gpu_metrics’ for reading: No such file or directory
Then I execute:
bash -c 'echo $$ > /tmp/root/nni/scripts/pid ; METRIC_OUTPUT_DIR=/tmp/root/nni/scripts python3 -m nni_gpu_tool.gpu_metrics_collector'
Outputs:
And afterwards the previous tail command works flawlessly.
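One way to confirm the collector stays alive and keeps writing while the experiment runs (a sketch; paths as in the logs above):
# The pid file holds the PID of the shell that launched the collector;
# check whether that process is still running.
ps -p "$(cat /tmp/root/nni/scripts/pid)"
# Watch the metrics file for fresh lines while nnimanager is scheduling.
watch -n 5 'tail -n 1 /tmp/root/nni/scripts/gpu_metrics'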
The nnimanager log keeps repeating the 'no machine can be scheduled' message quoted above, but the trials are still WAITING.
Any suggestions for debugging this error?