How to debug "no machine can be scheduled"? #1852
Typically TMP_NO_AVAILABLE_GPU is caused by the number of required GPUs being larger than the number of available GPUs. Could you please paste the config yaml file? Let's check the configuration first. For a debugging suggestion:
[GPU Stats]
Please also paste the full nnimanager.log if you can confirm the configuration is correct.
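If it helps, a quick sanity check on the remote machine (a minimal sketch; config.yml stands in for your experiment YAML file, and gpuNum assumes a v1-style NNI trial config):
# List the GPUs the driver actually exposes on the remote machine.
nvidia-smi -L
nvidia-smi -L | wc -l
# Compare that count with the number of GPUs each trial requests
# (the gpuNum field in the experiment YAML). If gpuNum is larger
# than the count above, no machine can ever be scheduled and the
# scheduler keeps returning TMP_NO_AVAILABLE_GPU.
grep -n "gpuNum" config.yml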
After I execute
This is a bug when the local system is used as the remote training service.
But the bug still exists after I switch to a remote machine.
Can you paste nnimanager.log again for the real remote machine? And the output of the command
I print out the content of
Is the
It seems the cmdresult.stdout is correct. You can also try to clean up the remote machine environment and try again, and ensure the same version of nni is installed on the remote machine.
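A minimal cleanup sketch (the /tmp/root/nni path matches the logs in this issue; adjust it for your remote user):
# On the remote machine: stop any stale metrics collector and wipe
# NNI's scratch directory so the next experiment starts clean.
pkill -f nni_gpu_tool.gpu_metrics_collector
rm -rf /tmp/root/nni
# Verify the installed nni version on BOTH machines; they must match.
python3 -m pip show nni | grep -i version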
@diggerdu Hope what @chicm-ms suggested works for you. In addition to GitHub issues, you might also consider IM in the NNI Gitter channel: https://gitter.im/Microsoft/nni
Short summary about the issue/question:
My experiment works well in local mode, but it gets stuck on 'Scheduler: trialJob id xxxx, no machine can be scheduled, return TMP_NO_AVAILABLE_GPU' when I change the mode to remotemachine.
Brief what process you are following:
Executing
tail -n 1 /tmp/root/nni/scripts/gpu_metrics
manually fails with the error:
tail: cannot open ‘/tmp/root/nni/scripts/gpu_metrics’ for reading: No such file or directory
Then I execute:
bash -c 'echo $$ > /tmp/root/nni/scripts/pid ; METRIC_OUTPUT_DIR=/tmp/root/nni/scripts python3 -m nni_gpu_tool.gpu_metrics_collector'
Outputs:
And afterwards the previous tail command works flawlessly.
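One way to confirm the collector stays alive and keeps writing while the experiment runs (a sketch; paths as in the logs above):
# The pid file holds the PID of the shell that launched the collector;
# check whether that process is still running.
ps -p "$(cat /tmp/root/nni/scripts/pid)"
# Watch the metrics file for fresh lines while nnimanager is scheduling.
watch -n 5 'tail -n 1 /tmp/root/nni/scripts/gpu_metrics'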
The nnimanager log keeps repeating the 'no machine can be scheduled' message quoted above, but the trials are still WAITING.
Any suggestions for debugging this error?