Skip to content
This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

How to debug "no machine can be scheduled"? #1852

Closed
diggerdu opened this issue Dec 15, 2019 · 10 comments
Closed

How to debug "no machine can be scheduled"? #1852

diggerdu opened this issue Dec 15, 2019 · 10 comments

Comments

@diggerdu
Copy link

diggerdu commented Dec 15, 2019

Short summary about the issue/question:
My experiment works well on local mode, but it stucks on "'Scheduler: trialJob id xxxx, no machine can be scheduled, return TMP_NO_AVAILABLE_GPU '" while I change the mode to remotemachine

Brief what process you are following:

  1. add "debug: true" to the yaml file
  2. "DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]'" shows repeatedly in the log
  3. I execute tail -n 1 /tmp/root/nni/scripts/gpu_metrics manually
    failing with error tail: cannot open ‘/tmp/root/nni/scripts/gpu_metrics’ for reading: No such file or directory
  4. Then I manually execute bash -c echo $$ > /tmp/root/nni/scripts/pid ; METRIC_OUTPUT_DIR=/tmp/root/nni/scripts python3 -m nni_gpu_tool.gpu_metrics_collector,
    Outputs:
{'Timestamp': 'Sun Dec 15 15:09:21 2019', 'gpuCount': 2, 'gpuInfos': [{'index': 0, 'gpuUtil': '59', 'gpuMemUtil': '29', 'activeProcessNum': 0}, {'index': 1, 'gpuUtil': '0', 'gpuMemUtil': '0', 'activeProcessNum': 0}]}
{'Timestamp': 'Sun Dec 15 15:09:26 2019', 'gpuCount': 2, 'gpuInfos': [{'index': 0, 'gpuUtil': '14', 'gpuMemUtil': '11', 'activeProcessNum': 0}, {'index': 1, 'gpuUtil': '0', 'gpuMemUtil': '0', 'activeProcessNum': 0}]}
{'Timestamp': 'Sun Dec 15 15:09:32 2019', 'gpuCount': 2, 'gpuInfos': [{'index': 0, 'gpuUtil': '15', 'gpuMemUtil': '11', 'activeProcessNum': 0}, {'index': 1, 'gpuUtil': '0', 'gpuMemUtil': '0', 'activeProcessNum': 0}]}

And previous command works flawlessly,

➜  ume_dev_debug tail -n 1 /tmp/root/nni/scripts/gpu_metrics
{"Timestamp": "Sun Dec 15 15:17:59 2019", "gpuCount": 2, "gpuInfos": [{"activeProcessNum": 0, "gpuMemUtil": "13", "gpuUtil": "17", "index": 0}, {"activeProcessNum": 0, "gpuMemUtil": "0", "gpuUtil": "0", "index": 1}]}

The NNImanager repeats

[12/15/2019, 3:14:34 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/15/2019, 3:14:34 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/15/2019, 3:14:34 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/15/2019, 3:14:34 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/15/2019, 3:14:34 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/15/2019, 3:14:34 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/15/2019, 3:14:35 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 3:14:36 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 3:14:36 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/15/2019, 3:14:37 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 3:14:38 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 3:14:39 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 3:14:40 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 3:14:41 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 3:14:41 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]

But the trials are still WAITING.
Any suggestions for debuging this error?

@chicm-ms
Copy link
Contributor

Typically TMP_NO_AVAILABLE_GPU is caused by that number of required GPUs > number of available GPUs. Could you please paste the config yaml file? let's check the configuration firstly.
Please also paste full nnimanager.log if you can ensure the configuration is right.

For debug suggestion,
While you search source code for TMP_NO_AVAILABLE_GPU , you can see that TMP_NO_AVAILABLE_GPU is only used in https://github.com/microsoft/nni/blob/master/src/nni_manager/training_service/remote_machine/gpuScheduler.ts
You can check/debug the scheduleMachine method to find the root cause.

@diggerdu
Copy link
Author

diggerdu commented Dec 16, 2019

Typically TMP_NO_AVAILABLE_GPU is caused by that number of required GPUs > number of available GPUs. Could you please paste the config yaml file? let's check the configuration firstly.

authorName: sinodragon
experimentName: leaf_s3
trialConcurrency: 1
maxExecDuration: 1000h
maxTrialNum: 1000
trainingServicePlatform: remote
debug: true
# The path to Search Space
searchSpacePath: search_space.json
useAnnotation: false
tuner:
  builtinTunerName: TPE
  classArgs:
    optimize_mode: minimize
# The path and the running command of trial
trial:
  command: zsh dist_train.sh 1 experiments/leaf_s3 enable_nni
  codeDir: ../../
  gpuNum: 1

machineList:
  - ip: 127.0.0.1
    username: root
    passwd: xxxx
    port: xxxx

GPU Stats

➜  leaf_s3 git:(master) ✗ nvidia-smi
Mon Dec 16 13:16:07 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:24:00.0 Off |                    0 |
| N/A   56C    P0    46W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:26:00.0 Off |                    0 |
| N/A   33C    P0    25W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

@diggerdu
Copy link
Author

Please also paste full nnimanager.log if you can ensure the configuration is right.

[12/15/2019, 2:58:11 PM] INFO [ 'Starting experiment: OS9lH9lq' ]
[12/15/2019, 2:58:11 PM] DEBUG [ 'Setup tuner...' ]
[12/15/2019, 2:58:11 PM] DEBUG [ 'dispatcher command: python3 -m nni --tuner_class_name TPE --tuner_args "{\\"optimize_mode\\":\\"minimize\\"}"' ]
[12/15/2019, 2:58:11 PM] INFO [ 'Change NNIManager status from: INITIALIZED to: RUNNING' ]
[12/15/2019, 2:58:11 PM] INFO [ 'Add event listeners' ]
[12/15/2019, 2:58:11 PM] DEBUG [ 'Send tuner command: INITIALIZE: {"optimizer": {"_type": "choice", "_value": ["torchadam", "adam_lrd?dropout=0.5"]}, "lr": {"_type": "choice", "_value": [0.001, 0.0001]}, "weight_decay": {"_type": "choice", "_value": [1e-05, 0]}, "max_bin": {"_type": "choice", "_value": [1487]}, "nb_layers": {"_type": "choice", "_value": [4]}, "hidden_size": {"_type": "choice", "_value": [1024, 2048]}, "enable_std": {"_type": "choice", "_value": ["False"]}, "enable_dfsmn_ln": {"_type": "choice", "_value": ["True", "False"]}, "hop_length": {"_type": "choice", "_value": [1024, 2048]}, "act_layer": {"_type": "choice", "_value": ["nn.LeakyReLU(0.2)", "MishCuda()"]}, "norm_layer": {"_type": "choice", "_value": ["none", "layer_norm"]}, "dropout_rate": {"_type": "uniform", "_value": [0, 0.5]}, "dilation": {"_type": "choice", "_value": [1]}, "lookup_len": {"_type": "choice", "_value": [20, 40]}}' ]
[12/15/2019, 2:58:11 PM] DEBUG [ 'ipcInterface command type: [IN], content:[{"optimizer": {"_type": "choice", "_value": ["torchadam", "adam_lrd?dropout=0.5"]}, "lr": {"_type": "choice", "_value": [0.001, 0.0001]}, "weight_decay": {"_type": "choice", "_value": [1e-05, 0]}, "max_bin": {"_type": "choice", "_value": [1487]}, "nb_layers": {"_type": "choice", "_value": [4]}, "hidden_size": {"_type": "choice", "_value": [1024, 2048]}, "enable_std": {"_type": "choice", "_value": ["False"]}, "enable_dfsmn_ln": {"_type": "choice", "_value": ["True", "False"]}, "hop_length": {"_type": "choice", "_value": [1024, 2048]}, "act_layer": {"_type": "choice", "_value": ["nn.LeakyReLU(0.2)", "MishCuda()"]}, "norm_layer": {"_type": "choice", "_value": ["none", "layer_norm"]}, "dropout_rate": {"_type": "uniform", "_value": [0, 0.5]}, "dilation": {"_type": "choice", "_value": [1]}, "lookup_len": {"_type": "choice", "_value": [20, 40]}}]' ]
[12/15/2019, 2:58:11 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/15/2019, 2:58:11 PM] INFO [ 'RestServer start' ]
[12/15/2019, 2:58:11 PM] INFO [ 'RestServer base port is 10162' ]
[12/15/2019, 2:58:11 PM] INFO [ 'Run remote machine training service.' ]
[12/15/2019, 2:58:11 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 2:58:12 PM] INFO [ 'NNIManager received command from dispatcher: ID, ' ]
[12/15/2019, 2:58:12 PM] DEBUG [ 'ipcInterface command type: [GE], content:[1]' ]
[12/15/2019, 2:58:12 PM] INFO [ 'NNIManager received command from dispatcher: TR, {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"optimizer": "torchadam", "lr": 0.0001, "weight_decay": 1e-05, "max_bin": 1487, "nb_layers": 4, "hidden_size": 2048, "enable_std": "False", "enable_dfsmn_ln": "True", "hop_length": 2048, "act_layer": "nn.LeakyReLU(0.2)", "norm_layer": "layer_norm", "dropout_rate": 0.3695983058355199, "dilation": 1, "lookup_len": 20}, "parameter_index": 0}' ]
[12/15/2019, 2:58:12 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 2:58:13 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 2:58:14 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 2:58:15 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 2:58:16 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/15/2019, 2:58:16 PM] INFO [ 'submitTrialJob: form: {"sequenceId":0,"hyperParameters":{"value":"{\\"parameter_id\\": 0, \\"parameter_source\\": \\"algorithm\\", \\"parameters\\": {\\"optimizer\\": \\"torchadam\\", \\"lr\\": 0.0001, \\"weight_decay\\": 1e-05, \\"max_bin\\": 1487, \\"nb_layers\\": 4, \\"hidden_size\\": 2048, \\"enable_std\\": \\"False\\", \\"enable_dfsmn_ln\\": \\"True\\", \\"hop_length\\": 2048, \\"act_layer\\": \\"nn.LeakyReLU(0.2)\\", \\"norm_layer\\": \\"layer_norm\\", \\"dropout_rate\\": 0.3695983058355199, \\"dilation\\": 1, \\"lookup_len\\": 20}, \\"parameter_index\\": 0}","index":0}}' ]
[12/15/2019, 2:58:16 PM] DEBUG [ 'storeTrialJobEvent: event: WAITING, data: {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"optimizer": "torchadam", "lr": 0.0001, "weight_decay": 1e-05, "max_bin": 1487, "nb_layers": 4, "hidden_size": 2048, "enable_std": "False", "enable_dfsmn_ln": "True", "hop_length": 2048, "act_layer": "nn.LeakyReLU(0.2)", "norm_layer": "layer_norm", "dropout_rate": 0.3695983058355199, "dilation": 1, "lookup_len": 20}, "parameter_index": 0}, jobDetail: {"id":"aHsb8","status":"WAITING","submitTime":1576393096141,"workingDirectory":"/tmp/nni/experiments/OS9lH9lq/trials/aHsb8","form":{"sequenceId":0,"hyperParameters":{"value":"{\\"parameter_id\\": 0, \\"parameter_source\\": \\"algorithm\\", \\"parameters\\": {\\"optimizer\\": \\"torchadam\\", \\"lr\\": 0.0001, \\"weight_decay\\": 1e-05, \\"max_bin\\": 1487, \\"nb_layers\\": 4, \\"hidden_size\\": 2048, \\"enable_std\\": \\"False\\", \\"enable_dfsmn_ln\\": \\"True\\", \\"hop_length\\": 2048, \\"act_layer\\": \\"nn.LeakyReLU(0.2)\\", \\"norm_layer\\": \\"layer_norm\\", \\"dropout_rate\\": 0.3695983058355199, \\"dilation\\": 1, \\"lookup_len\\": 20}, \\"parameter_index\\": 0}","index":0}},"tags":[],"gpuIndices":[]}' ]
[12/15/2019, 2:58:16 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 2:58:17 PM] WARNING [ 'Scheduler: trialJob id aHsb8, no machine can be scheduled, return TMP_NO_AVAILABLE_GPU ' ]
[12/15/2019, 2:58:17 PM] INFO [ 'Right now no available GPU can be allocated for trial aHsb8, will try to schedule later' ]
[12/15/2019, 2:58:17 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 2:58:18 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 2:58:19 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 2:58:20 PM] WARNING [ 'Scheduler: trialJob id aHsb8, no machine can be scheduled, return TMP_NO_AVAILABLE_GPU ' ]
[12/15/2019, 2:58:20 PM] INFO [ 'Right now no available GPU can be allocated for trial aHsb8, will try to schedule later' ]

After I execute bash -c echo $$ > /tmp/root/nni/scripts/pid ; METRIC_OUTPUT_DIR=/tmp/root/nni/scripts python3 -m nni_gpu_tool.gpu_metrics_collector manually

[12/15/2019, 3:14:15 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 3:14:16 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 3:14:16 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/15/2019, 3:14:17 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 3:14:18 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 3:14:19 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 3:14:20 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 3:14:21 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 3:14:21 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/15/2019, 3:14:22 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 3:14:23 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 3:14:24 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 3:14:24 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/15/2019, 3:14:24 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/15/2019, 3:14:24 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/15/2019, 3:14:24 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/15/2019, 3:14:24 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/15/2019, 3:14:24 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/15/2019, 3:14:25 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 3:14:26 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 3:14:26 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/15/2019, 3:14:27 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 3:14:28 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 3:14:29 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 3:14:30 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 3:14:31 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 3:14:31 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/15/2019, 3:14:32 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 3:14:33 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 3:14:34 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 3:14:34 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/15/2019, 3:14:34 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/15/2019, 3:14:34 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/15/2019, 3:14:34 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/15/2019, 3:14:34 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/15/2019, 3:14:34 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/15/2019, 3:14:35 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 3:14:36 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 3:14:36 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/15/2019, 3:14:37 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 3:14:38 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 3:14:39 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 3:14:40 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 3:14:41 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/15/2019, 3:14:41 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]

chicm-ms added a commit to chicm-ms/nni that referenced this issue Dec 16, 2019
@chicm-ms
Copy link
Contributor

This is a bug when local system is used as remote training service.
PR #1856 is submitted to fix this. You can use local training service as a temporary workaround.
If you would like to use local system as remote machine, you can install nni from master branch once the PR is merged.

@diggerdu
Copy link
Author

This is a bug when local system is used as remote training service.
PR #1856 is submitted to fix this. You can use local training service as a temporary workaround.
If you would like to use local system as remote machine, you can install nni from master branch once the PR is merged.

But the bug is still exits after I using a remote machine.
The /tmp/root/nni/scripts/gpu_metrics does not exits in the remote machine

n108-114-238# pwd
/tmp/root/nni/scripts
n108-114-238# ls
pid

@chicm-ms
Copy link
Contributor

Can you paste nnimanager.log again for the real remote machine? And the output of command nnictl log stderr

@diggerdu
Copy link
Author

diggerdu commented Dec 16, 2019

Can you paste nnimanager.log again for the real remote machine? And the output of command nnictl log stderr

[12/16/2019, 6:50:25 PM] DEBUG [ 'Database directory: /root/nni/experiments/ACQgoetz/db' ]
[12/16/2019, 6:50:27 PM] INFO [ 'Datastore initialization done' ]
[12/16/2019, 6:50:27 PM] INFO [ 'RestServer start' ]
[12/16/2019, 6:50:27 PM] INFO [ 'Construct remote machine training service.' ]
[12/16/2019, 6:50:27 PM] INFO [ 'RestServer base port is 10161' ]
[12/16/2019, 6:50:27 PM] INFO [ 'Rest server listening on: http://0.0.0.0:10161' ]
[12/16/2019, 6:50:30 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:50:30 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:50:31 PM] DEBUG [ 'PUT: /experiment/cluster-metadata: body:\n{\n    "machine_list": [\n        {\n            "ip": "10.108.238.238",\n            "username": "root",\n            "passwd": "qwe",\n            "port": 123\n        }\n    ]\n}' ]
[12/16/2019, 6:50:31 PM] INFO [ 'NNIManager setClusterMetadata, key: machine_list, value: [{"ip":"10.108.238.238","username":"root","passwd":"qwe","port":123}]' ]
[12/16/2019, 6:50:31 PM] DEBUG [ 'Connecting to remote machines: [{"ip":"10.108.238.238","username":"root","passwd":"qwe","port":123}]' ]
[12/16/2019, 6:50:31 PM] DEBUG [ 'remoteExeCommand: command: [mkdir -p /tmp/nni/experiments/ACQgoetz]' ]
[12/16/2019, 6:50:31 PM] DEBUG [ 'remoteExeCommand: command: [(umask 0 ; mkdir -p /tmp/root/nni/scripts)]' ]
[12/16/2019, 6:50:31 PM] DEBUG [ 'remoteExeCommand: command: [chmod 777 /tmp/nni /tmp/nni/* /tmp/nni/scripts/*]' ]
[12/16/2019, 6:50:31 PM] DEBUG [ 'remoteExeCommand: command: [bash -c \'echo $$ > /tmp/root/nni/scripts/pid ; METRIC_OUTPUT_DIR=/tmp/root/nni/scripts python3 -m nni_gpu_tool.gpu_metrics_collector\']' ]
[12/16/2019, 6:50:31 PM] DEBUG [ 'PUT: /experiment/cluster-metadata: body:\n{\n    "trial_config": {\n        "command": "zsh dist_train.sh 1 experiments/debug enable_nni",\n        "codeDir": "/mnt/cephfs_new_wj/lab_speech/home/duxingjian/ume_dev_debug/experiments/debug/../../",\n        "gpuNum": 1\n    }\n}' ]
[12/16/2019, 6:50:31 PM] INFO [ 'NNIManager setClusterMetadata, key: trial_config, value: {"command":"zsh dist_train.sh 1 experiments/debug enable_nni","codeDir":"/mnt/cephfs_new_wj/lab_speech/home/duxingjian/ume_dev_debug/experiments/debug/../../","gpuNum":1}' ]
[12/16/2019, 6:50:31 PM] DEBUG [ 'POST: /experiment: body:\n{\n    "authorName": "duxingjian",\n    "experimentName": "debug",\n    "trialConcurrency": 1,\n    "maxExecDuration": 3600000,\n    "maxTrialNum": 1000,\n    "searchSpace": "{\\"optimizer\\": {\\"_type\\": \\"choice\\", \\"_value\\": [\\"torchadam\\", \\"adam_lrd?dropout=0.5\\"]}, \\"lr\\": {\\"_type\\": \\"choice\\", \\"_value\\": [0.001, 0.0001]}, \\"weight_decay\\": {\\"_type\\": \\"choice\\", \\"_value\\": [1e-05, 0]}, \\"max_bin\\": {\\"_type\\": \\"choice\\", \\"_value\\": [1487]}, \\"nb_layers\\": {\\"_type\\": \\"choice\\", \\"_value\\": [4]}, \\"hidden_size\\": {\\"_type\\": \\"choice\\", \\"_value\\": [1024, 2048]}, \\"enable_std\\": {\\"_type\\": \\"choice\\", \\"_value\\": [\\"False\\"]}, \\"enable_dfsmn_ln\\": {\\"_type\\": \\"choice\\", \\"_value\\": [\\"True\\", \\"False\\"]}, \\"hop_length\\": {\\"_type\\": \\"choice\\", \\"_value\\": [1024, 2048]}, \\"act_layer\\": {\\"_type\\": \\"choice\\", \\"_value\\": [\\"nn.LeakyReLU(0.2)\\", \\"MishCuda()\\"]}, \\"norm_layer\\": {\\"_type\\": \\"choice\\", \\"_value\\": [\\"none\\", \\"layer_norm\\"]}, \\"dropout_rate\\": {\\"_type\\": \\"uniform\\", \\"_value\\": [0, 0.5]}, \\"dilation\\": {\\"_type\\": \\"choice\\", \\"_value\\": [1]}, \\"lookup_len\\": {\\"_type\\": \\"choice\\", \\"_value\\": [20, 40]}}",\n    "trainingServicePlatform": "remote",\n    "tuner": {\n        "builtinTunerName": "TPE",\n        "classArgs": {\n            "optimize_mode": "minimize"\n        },\n        "className": "TPE"\n    },\n    "versionCheck": false,\n    "clusterMetaData": [\n        {\n            "key": "machine_list",\n            "value": [\n                {\n                    "ip": "10.108.238.238",\n                    "username": "root",\n                    "passwd": "qwe",\n                    "port": 123\n                }\n            ]\n        },\n        {\n            "key": "trial_config",\n            "value": {\n                "command": "zsh dist_train.sh 1 experiments/debug enable_nni",\n                "codeDir": "/mnt/cephfs_new_wj/lab_speech/home/duxingjian/ume_dev_debug/experiments/debug/../../",\n                "gpuNum": 1\n            }\n        }\n    ]\n}' ]
[12/16/2019, 6:50:31 PM] INFO [ 'Starting experiment: ACQgoetz' ]
[12/16/2019, 6:50:31 PM] DEBUG [ 'Setup tuner...' ]
[12/16/2019, 6:50:31 PM] DEBUG [ 'dispatcher command: python3 -m nni --tuner_class_name TPE --tuner_args "{\\"optimize_mode\\":\\"minimize\\"}"' ]
[12/16/2019, 6:50:31 PM] INFO [ 'Change NNIManager status from: INITIALIZED to: RUNNING' ]
[12/16/2019, 6:50:31 PM] INFO [ 'Add event listeners' ]
[12/16/2019, 6:50:31 PM] DEBUG [ 'Send tuner command: INITIALIZE: {"optimizer": {"_type": "choice", "_value": ["torchadam", "adam_lrd?dropout=0.5"]}, "lr": {"_type": "choice", "_value": [0.001, 0.0001]}, "weight_decay": {"_type": "choice", "_value": [1e-05, 0]}, "max_bin": {"_type": "choice", "_value": [1487]}, "nb_layers": {"_type": "choice", "_value": [4]}, "hidden_size": {"_type": "choice", "_value": [1024, 2048]}, "enable_std": {"_type": "choice", "_value": ["False"]}, "enable_dfsmn_ln": {"_type": "choice", "_value": ["True", "False"]}, "hop_length": {"_type": "choice", "_value": [1024, 2048]}, "act_layer": {"_type": "choice", "_value": ["nn.LeakyReLU(0.2)", "MishCuda()"]}, "norm_layer": {"_type": "choice", "_value": ["none", "layer_norm"]}, "dropout_rate": {"_type": "uniform", "_value": [0, 0.5]}, "dilation": {"_type": "choice", "_value": [1]}, "lookup_len": {"_type": "choice", "_value": [20, 40]}}' ]
[12/16/2019, 6:50:31 PM] DEBUG [ 'ipcInterface command type: [IN], content:[{"optimizer": {"_type": "choice", "_value": ["torchadam", "adam_lrd?dropout=0.5"]}, "lr": {"_type": "choice", "_value": [0.001, 0.0001]}, "weight_decay": {"_type": "choice", "_value": [1e-05, 0]}, "max_bin": {"_type": "choice", "_value": [1487]}, "nb_layers": {"_type": "choice", "_value": [4]}, "hidden_size": {"_type": "choice", "_value": [1024, 2048]}, "enable_std": {"_type": "choice", "_value": ["False"]}, "enable_dfsmn_ln": {"_type": "choice", "_value": ["True", "False"]}, "hop_length": {"_type": "choice", "_value": [1024, 2048]}, "act_layer": {"_type": "choice", "_value": ["nn.LeakyReLU(0.2)", "MishCuda()"]}, "norm_layer": {"_type": "choice", "_value": ["none", "layer_norm"]}, "dropout_rate": {"_type": "uniform", "_value": [0, 0.5]}, "dilation": {"_type": "choice", "_value": [1]}, "lookup_len": {"_type": "choice", "_value": [20, 40]}}]' ]
[12/16/2019, 6:50:31 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:50:31 PM] INFO [ 'RestServer start' ]
[12/16/2019, 6:50:31 PM] INFO [ 'RestServer base port is 10162' ]
[12/16/2019, 6:50:31 PM] INFO [ 'Run remote machine training service.' ]
[12/16/2019, 6:50:31 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:32 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:32 PM] INFO [ 'NNIManager received command from dispatcher: ID, ' ]
[12/16/2019, 6:50:32 PM] DEBUG [ 'ipcInterface command type: [GE], content:[1]' ]
[12/16/2019, 6:50:32 PM] INFO [ 'NNIManager received command from dispatcher: TR, {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"optimizer": "adam_lrd?dropout=0.5", "lr": 0.0001, "weight_decay": 1e-05, "max_bin": 1487, "nb_layers": 4, "hidden_size": 1024, "enable_std": "False", "enable_dfsmn_ln": "False", "hop_length": 2048, "act_layer": "MishCuda()", "norm_layer": "layer_norm", "dropout_rate": 0.11768302922687945, "dilation": 1, "lookup_len": 40}, "parameter_index": 0}' ]
[12/16/2019, 6:50:33 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:34 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:35 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:36 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:50:36 PM] INFO [ 'submitTrialJob: form: {"sequenceId":0,"hyperParameters":{"value":"{\\"parameter_id\\": 0, \\"parameter_source\\": \\"algorithm\\", \\"parameters\\": {\\"optimizer\\": \\"adam_lrd?dropout=0.5\\", \\"lr\\": 0.0001, \\"weight_decay\\": 1e-05, \\"max_bin\\": 1487, \\"nb_layers\\": 4, \\"hidden_size\\": 1024, \\"enable_std\\": \\"False\\", \\"enable_dfsmn_ln\\": \\"False\\", \\"hop_length\\": 2048, \\"act_layer\\": \\"MishCuda()\\", \\"norm_layer\\": \\"layer_norm\\", \\"dropout_rate\\": 0.11768302922687945, \\"dilation\\": 1, \\"lookup_len\\": 40}, \\"parameter_index\\": 0}","index":0}}' ]
[12/16/2019, 6:50:36 PM] DEBUG [ 'storeTrialJobEvent: event: WAITING, data: {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"optimizer": "adam_lrd?dropout=0.5", "lr": 0.0001, "weight_decay": 1e-05, "max_bin": 1487, "nb_layers": 4, "hidden_size": 1024, "enable_std": "False", "enable_dfsmn_ln": "False", "hop_length": 2048, "act_layer": "MishCuda()", "norm_layer": "layer_norm", "dropout_rate": 0.11768302922687945, "dilation": 1, "lookup_len": 40}, "parameter_index": 0}, jobDetail: {"id":"Myh3Y","status":"WAITING","submitTime":1576493436156,"workingDirectory":"/tmp/nni/experiments/ACQgoetz/trials/Myh3Y","form":{"sequenceId":0,"hyperParameters":{"value":"{\\"parameter_id\\": 0, \\"parameter_source\\": \\"algorithm\\", \\"parameters\\": {\\"optimizer\\": \\"adam_lrd?dropout=0.5\\", \\"lr\\": 0.0001, \\"weight_decay\\": 1e-05, \\"max_bin\\": 1487, \\"nb_layers\\": 4, \\"hidden_size\\": 1024, \\"enable_std\\": \\"False\\", \\"enable_dfsmn_ln\\": \\"False\\", \\"hop_length\\": 2048, \\"act_layer\\": \\"MishCuda()\\", \\"norm_layer\\": \\"layer_norm\\", \\"dropout_rate\\": 0.11768302922687945, \\"dilation\\": 1, \\"lookup_len\\": 40}, \\"parameter_index\\": 0}","index":0}},"tags":[],"gpuIndices":[]}' ]
[12/16/2019, 6:50:36 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:37 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:50:37 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:50:37 PM] DEBUG [ 'designated gpu indices: undefined' ]
[12/16/2019, 6:50:37 PM] DEBUG [ 'remoteExeCommand: command: [mkdir -p /tmp/nni/experiments/ACQgoetz/trials/Myh3Y]' ]
[12/16/2019, 6:50:37 PM] DEBUG [ 'remoteExeCommand: command: [mkdir -p /tmp/nni/experiments/ACQgoetz/trials/Myh3Y/.nni]' ]
[12/16/2019, 6:50:37 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:37 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:50:37 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:50:37 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:50:37 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:50:38 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:50:38 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:50:38 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:50:38 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:50:38 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:50:38 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:50:38 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:39 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:40 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:41 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:50:41 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:42 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:42 PM] DEBUG [ 'copyFileToRemote: localFilePath: /root/nni/experiments/ACQgoetz/trials-local/Myh3Y/parameter.cfg, remoteFilePath: /tmp/nni/experiments/ACQgoetz/trials/Myh3Y/parameter.cfg' ]
[12/16/2019, 6:50:43 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:44 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:45 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:46 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:50:46 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:47 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:48 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:50:48 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:50:48 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:50:48 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:50:48 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:50:48 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:50:48 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:49 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:50 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:51 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:50:51 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:52 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:53 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:54 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:55 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:50:55 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:50:55 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:50:55 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:50:55 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:50:55 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:50:55 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:55 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:50:55 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:50:55 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:50:55 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:50:55 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:50:55 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:50:56 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:50:56 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:57 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:58 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:59 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:00 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:01 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:51:01 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:02 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:03 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:04 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:05 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:05 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:05 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:05 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:05 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:05 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:05 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:06 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:51:06 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:07 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:08 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:09 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:10 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:11 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:51:11 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:12 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:13 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:14 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:15 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:15 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:15 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:15 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:15 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:15 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:15 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:16 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:51:16 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:16 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:16 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:16 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:16 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:16 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:16 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:17 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:18 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:18 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:18 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:18 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:18 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:18 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:18 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:19 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:19 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:19 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:19 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:19 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:19 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:19 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:20 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:20 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:20 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:20 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:20 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:20 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:20 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:20 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:20 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:20 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:20 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:20 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:20 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:21 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:21 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:21 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:21 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:21 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:21 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:21 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:51:21 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:21 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:21 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:21 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:21 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:21 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:21 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:23 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:23 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:23 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:23 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:23 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:23 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:23 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:23 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:23 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:23 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:23 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:23 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:23 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:23 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:23 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:24 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:24 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:24 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:24 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:24 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:24 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:24 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:24 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:24 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:24 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:24 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:24 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:24 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:24 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:25 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:25 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:25 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:25 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:25 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:25 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:25 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:25 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:25 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:25 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:25 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:25 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:27 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:27 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:27 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:27 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:27 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:27 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:27 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:27 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:27 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:27 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:27 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:27 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:27 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:29 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:29 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:29 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:29 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:29 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:29 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:29 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:29 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:29 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:29 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:29 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:29 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:29 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:30 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:31 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:51:31 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:31 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:31 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:31 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:31 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:31 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:31 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:31 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:31 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:31 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:31 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:31 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:31 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:32 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:32 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:32 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:32 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:32 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:32 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:32 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:32 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:32 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:32 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:32 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:32 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:32 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:33 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:33 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:33 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:33 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:33 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:33 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:33 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:33 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:33 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:33 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:33 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:33 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:33 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:34 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:34 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:34 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:34 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:34 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:34 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:34 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:34 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:34 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:34 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:34 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:34 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:34 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:34 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:34 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:34 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:34 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:34 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:36 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:51:36 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:36 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:36 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:36 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:36 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:36 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:36 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:36 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:36 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:36 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:36 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:36 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:36 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:37 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:37 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:37 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:37 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:37 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:37 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:37 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:38 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:39 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:40 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:41 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:51:41 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:42 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:43 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:44 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:45 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:46 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:51:46 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:47 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:47 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:47 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:47 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:47 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:47 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:47 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:48 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:49 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:50 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:51 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:51:51 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:52 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:53 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:54 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:55 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:56 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:51:56 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:57 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:57 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:57 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:57 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:57 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:57 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:57 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:58 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:59 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:00 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:01 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:52:01 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:02 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:03 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:04 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:05 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:06 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:52:06 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:07 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:07 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:52:07 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:52:07 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:52:07 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:52:07 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:52:07 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:52:08 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:09 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:10 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:11 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:52:11 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:12 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:13 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:52:13 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:52:13 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:52:13 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:52:13 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:52:13 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:52:13 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:13 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:52:13 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:52:13 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:52:13 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:52:13 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:52:13 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:52:14 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:15 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:16 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:52:16 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:17 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:18 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:19 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:20 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
➜  ume_dev_debug nnictl log stderr

(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
    at JSON.parse (<anonymous>)
    at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab_some/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
    at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:898745) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
    at JSON.parse (<anonymous>)
    at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab_some/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
    at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 2)
INFO:hyperopt.tpe:tpe_transform took 0.004019 seconds
INFO:hyperopt.tpe:TPE using 0 trials
(node:898745) ExperimentalWarning: The fs.promises API is experimental
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
    at JSON.parse (<anonymous>)
    at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab_some/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
    at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 3)
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
    at JSON.parse (<anonymous>)
    at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab_some/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
    at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 4)
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
    at JSON.parse (<anonymous>)
    at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab_some/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
    at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 5)
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
    at JSON.parse (<anonymous>)
    at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab_some/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
    at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 6)
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
    at JSON.parse (<anonymous>)
    at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab_some/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
    at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 7)
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
    at JSON.parse (<anonymous>)
    at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab_some/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
    at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 8)
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
    at JSON.parse (<anonymous>)
    at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab_some/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
    at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 9)
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
    at JSON.parse (<anonymous>)
    at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab_some/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
    at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 10)
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
    at JSON.parse (<anonymous>)
    at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab_some/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
    at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 11)
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
    at JSON.parse (<anonymous>)
    at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab_some/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
    at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 12)
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
    at JSON.parse (<anonymous>)
    at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab_some/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
    at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 13)
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
    at JSON.parse (<anonymous>)
    at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab_some/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
    at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 14)
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
    at JSON.parse (<anonymous>)
    at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab_some/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
    at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 15)
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
    at JSON.parse (<anonymous>)
    at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab_some/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
    at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 16)
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
    at JSON.parse (<anonymous>)
    at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab_some/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
    at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 17)
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
    at JSON.parse (<anonymous>)
    at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab_some/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
    at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 18)
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
    at JSON.parse (<anonymous>)
    at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab/home/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
    at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 19)

@diggerdu
Copy link
Author

diggerdu commented Dec 16, 2019

Can you paste nnimanager.log again for the real remote machine? And the output of command nnictl log stderr

I print out the content of cmdresult.stdout

299             if (cmdresult !== undefined && cmdresult.stdout !== undefined) {
>>>
300                 this.log.warning(cmdresult.stdout);
<<<
301                 rmMeta.gpuSummary = JSON.parse(cmdresult.stdout);
302                 if (rmMeta.gpuSummary.gpuCount === 0) {
303                     this.log.warning(`No GPU found on remote machine ${rmMeta.ip}`);
304                     this.timer.unsubscribe(disposable);
305                 }
[12/16/2019, 7:08:55 PM] WARNING [ '{"Timestamp": "Mon Dec 16 19:08:52 2019", "gpuCount": 8, "gpuInfos": [{"activeProcessNum": 0, "gpuMemUtil": "0", "gpuUtil": "0", "index": 0}, {"activeProcessNum": 0, "gpuMemUtil": "0", "gpuUtil": "0", "index": 1}, {"activeProcessNum": 0, "gpuMemUtil": "0", "gpuUtil": "0", "index": 2}, {"activeProcessNum": 0, "gpuMemUtil": "0", "gpuUtil": "0", "index": 3}, {"activeProcessNum": 0, "gpuMemUtil": "0", "gpuUtil": "0", "index": 4}, {"activeProcessNum": 0, "gpuMemUtil": "0", "gpuUtil": "0", "index": 5}, {"activeProcessNum": 0, "gpuMemUtil": "0", "gpuUtil": "0", "index": 6}, {"activeProcessNum": 0, "gpuMemUtil": "0", "gpuUtil": "0", "index": 7}]}\n' ]
[12/16/2019, 7:08:56 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 7:08:56 PM] WARNING [ '' ]
[12/16/2019, 7:08:57 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]

Is the \n in this json string causes the SyntaxError?

@chicm-ms
Copy link
Contributor

chicm-ms commented Dec 16, 2019

It seems the cmdresult.stdout is correct, the \n won't cause SyntaxError. I dont have this problem with a real remote machine on my side.
You can add some log info in RemoteMachineTrainingService.prepareTrialJob method to check whether prepareTrialJob is finished correctly.

You can also try to clean up the remote machine environment and try again:
on remote machine:

  1. remove /tmp/root, /tmp/nni
  2. kill nni_gpu_tool.gpu_metrics_collector related process

And ensure same version of nni is installed on remote machine.

chicm-ms added a commit that referenced this issue Dec 18, 2019
* Fix local system as remote machine issue #1852
@scarlett2018
Copy link
Member

It seems the cmdresult.stdout is correct, the \n won't cause SyntaxError. I dont have this problem with a real remote machine on my side.
You can add some log info in RemoteMachineTrainingService.prepareTrialJob method to check whether prepareTrialJob is finished correctly.

You can also try to clean up the remote machine environment and try again:
on remote machine:

  1. remove /tmp/root, /tmp/nni
  2. kill nni_gpu_tool.gpu_metrics_collector related process

And ensure same version of nni is installed on remote machine.

@diggerdu hope what @chicm-ms suggested works for you. in addition to github issue, you guys might also consider IM in NNI Gitter channel: https://gitter.im/Microsoft/nni

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants