quick-start-kubespray.sh failed #5306
fatal: [uniubi-alg057]: FAILED! => {"changed": false, "msg": "The following requirements are not met: NVIDIA GPU card number is not matched: 9 specified but only found"} |
Thanks for reporting. This may be due to the different outputs of |
Could you please give the result of command |
This doesn't really solve the problem; in the Dashboard, the GPU metrics still show 0 GPUs @hzy46 @HaoLiuHust |
any progress? |
Have you solved the problem? |
I'm still on 1.3, because there are lots of problems since 1.4 |
thanks, maybe I should try 1.3... could you give me your email? |
@HaoLiuHust The environmental check works on my machines. Will have a further investigation. @lbin For the GPU metrics problem, would you please submit a new issue for it? |
@HaoLiuHust I'm debugging this issue. Would you please help to execute the following steps on your dev box machine?
Please replace
Here's my log:
|
|
@HaoLiuHust Looks like that command is OK. Could you please try the following ansible playbook using the same command? Just make sure to run it the same way.
---
- hosts: all
gather_facts: false
tasks:
- name: "check full"
raw: "nvidia-smi --query-gpu=gpu_name --format=csv | tail --lines=+2 | wc -l"
register: nvidia_gpu_count
failed_when: false
changed_when: false
check_mode: false
environment: {}
- name: debug
debug:
var: nvidia_gpu_count.stdout_lines
- name: debug
debug:
var: nvidia_gpu_count
- name: debug
debug:
var: nvidia_gpu_count.stdout_lines[0]|int
- name: set_fact
set_fact:
debug_string: "found {{ nvidia_gpu_count.stdout_lines[0] }} gpus"
- name: debug
debug:
var: debug_string |
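[Editor's note: the raw task in the playbook above counts GPUs with a shell pipeline. As an illustration only (not part of the deployment), the same logic in Python over a hypothetical nvidia-smi output looks like this:]

```python
# Illustration of the pipeline in the playbook above:
#   nvidia-smi --query-gpu=gpu_name --format=csv  -> a CSV header plus one row per GPU
#   tail --lines=+2                               -> drop the header line
#   wc -l                                         -> count the remaining rows
sample = "name\nGeForce GTX 1080 Ti\nGeForce GTX 1080 Ti\n"  # hypothetical nvidia-smi output

rows = sample.splitlines()[1:]  # skip the header, like tail --lines=+2
gpu_count = len(rows)           # like wc -l
print(gpu_count)                # -> 2
```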
@hzy46 TASK [display unmet requirements] *******************************************************************************************************************************
fatal: [gpu-cluster-node001]: FAILED! => {"changed": false, "msg": "The following requirements are not met: NVIDIA GPU card number is not matched: 2 specified but only found"}
skipping: [gpu-cluster-node002]
fatal: [gpu-cluster-node004]: FAILED! => {"changed": false, "msg": "The following requirements are not met: NVIDIA GPU card number is not matched: 2 specified but only found"}
skipping: [localhost]
|
Could you please save the following content to /tmp/test.yml, and run it with ansible-playbook?
---
- hosts: all
gather_facts: false
tasks:
- name: "check full"
raw: "nvidia-smi --query-gpu=gpu_name --format=csv | tail --lines=+2 | wc -l"
register: nvidia_gpu_count
failed_when: false
changed_when: false
check_mode: false
environment: {}
- name: debug
debug:
var: nvidia_gpu_count.stdout_lines
- name: debug
debug:
var: nvidia_gpu_count
- name: debug
debug:
var: nvidia_gpu_count.stdout_lines[0]|int
- name: set_fact
set_fact:
debug_string: "found {{ nvidia_gpu_count.stdout_lines[0] }} gpus"
- name: debug
debug:
var: debug_string |
Sorry, I am a little busy these days, I will try it later |
pai@dev-box:~/pai/contrib/kubespray$ ansible-playbook -i ~/pai-deploy/kubespray/inventory/pai/hosts.yml /tmp/test.yml --limit=gpu-cluster-node001
/usr/local/lib/python3.5/dist-packages/ansible/parsing/vault/__init__.py:44: CryptographyDeprecationWarning: Python 3.5 support will be dropped in the next release of cryptography. Please upgrade your Python.
from cryptography.exceptions import InvalidSignature
[WARNING]: Unable to parse /home/pai/pai-deploy/kubespray/inventory/pai/hosts.yml as an inventory source
[WARNING]: No inventory was parsed, only implicit localhost is available
[WARNING]: provided hosts list is empty, only localhost is available. Note that the implicit localhost does not match 'all'
[WARNING]: Could not match supplied host pattern, ignoring: gpu-cluster-node001
PLAY [all] ******************************************************************************************************************************************************
skipping: no hosts matched
PLAY RECAP ****************************************************************************************************************************************************** |
I don't find ~/pai-deploy/kubespray/inventory/pai/hosts.yml.
pai@dev-box:~/pai/contrib/kubespray$ ls /home/pai/pai-deploy/kubespray/inventory/pai/
group_vars inventory.ini |
@MengS1024 It should be on the dev box machine. Generated by pai deployment script. |
Yes, it's on the dev box. |
@hzy46 any update? |
@MengS1024 Can you use that one? Or you can create an inventory file on your own. |
/usr/local/lib/python3.5/dist-packages/ansible/parsing/vault/__init__.py:44: CryptographyDeprecationWarning: Python 3.5 support will be dropped in the next release of cryptography. Please upgrade your Python.
from cryptography.exceptions import InvalidSignature
[DEPRECATION WARNING]: The TRANSFORM_INVALID_GROUP_CHARS settings is set to allow bad characters in group names by default, this will change, but still be user
configurable on deprecation. This feature will be removed in version 2.10. Deprecation warnings can be disabled by setting deprecation_warnings=False in
ansible.cfg.
[WARNING]: Invalid characters were found in group names but not replaced, use -vvvv to see details
PLAY [all] ******************************************************************************************************************************************************
TASK [check full] ***********************************************************************************************************************************************
ok: [gpu-cluster-node001]
TASK [debug] ****************************************************************************************************************************************************
ok: [gpu-cluster-node001] => {
"nvidia_gpu_count.stdout_lines": [
"2"
]
}
TASK [debug] ****************************************************************************************************************************************************
ok: [gpu-cluster-node001] => {
"nvidia_gpu_count": {
"changed": false,
"failed": false,
"failed_when_result": false,
"rc": 0,
"stderr": "Shared connection to 10.10.30.13 closed.\r\n",
"stderr_lines": [
"Shared connection to 10.10.30.13 closed."
],
"stdout": "2\r\n",
"stdout_lines": [
"2"
]
}
}
TASK [debug] ****************************************************************************************************************************************************
ok: [gpu-cluster-node001] => {
"nvidia_gpu_count.stdout_lines[0]|int": "2"
}
TASK [set_fact] *************************************************************************************************************************************************
ok: [gpu-cluster-node001]
TASK [debug] ****************************************************************************************************************************************************
ok: [gpu-cluster-node001] => {
"debug_string": "found 2 gpus"
}
PLAY RECAP ******************************************************************************************************************************************************
gpu-cluster-node001 : ok=6 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
|
@MengS1024 The script works as expected. Can you run |
@hzy46 I have already run this command:
TASK [display unmet requirements] *******************************************************************************************************************************
fatal: [gpu-cluster-node004]: FAILED! => {"changed": false, "msg": "The following requirements are not met: NVIDIA GPU card number is not matched: 2 specified but only found"}
fatal: [gpu-cluster-node001]: FAILED! => {"changed": false, "msg": "The following requirements are not met: NVIDIA GPU card number is not matched: 2 specified but only found"}
|
@hzy46 |
It is OK. |
It's a really strange problem. You can see the source code: https://github.com/microsoft/pai/blob/master/contrib/kubespray/roles/requirement/computing-devices/nvidia.com_gpu/tasks/main.yml#L36-L57 The code is the same but gives different results. Could you:
|
@hzy46 |
Thanks @MengS1024
The stdout_lines becomes ["", "2"] when the script is run. I don't know the root cause yet. I will submit a PR to fix this issue. |
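[Editor's note: the symptom above, stdout_lines of ["", "2"], matches command output that starts with a stray blank line, so stdout_lines[0] is empty and the count comparison fails. A defensive parse could take the last non-empty line instead; this is a hypothetical sketch, not the actual fix in the PR:]

```python
def parse_gpu_count(stdout: str) -> int:
    """Return the GPU count from command output, ignoring blank lines."""
    lines = [line.strip() for line in stdout.splitlines() if line.strip()]
    # The count is the last non-empty line; a stray leading "" no longer matters.
    return int(lines[-1]) if lines else 0

print(parse_gpu_count("2\r\n"))      # the clean case from the manual run -> 2
print(parse_gpu_count("\r\n2\r\n"))  # the buggy case with a leading blank line -> 2
```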
Thanks, please let me know when it's fixed. |
Please see #5353. You can try it by switching to branch |
@hzy46 TASK [etcd : Configure | Check if etcd cluster is healthy] ******************************************************************************************************
fatal: [gpu-cluster-node002]: FAILED! => {"changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://10.10.30.14:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:00.008623", "end": "2021-03-05 23:22:36.548779", "msg": "non-zero return code", "rc": 1, "start": "2021-03-05 23:22:36.540156", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 10.10.30.14:2379: connect: connection refused\n\nerror #0: dial tcp 10.10.30.14:2379: connect: connection refused", "stderr_lines": ["Error: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 10.10.30.14:2379: connect: connection refused", "", "error #0: dial tcp 10.10.30.14:2379: connect: connection refused"], "stdout": "", "stdout_lines": []}
...ignoring
included: /home/autox-it/pai-deploy/kubespray/roles/etcd/tasks/refresh_config.yml for gpu-cluster-node002
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (4 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (3 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (2 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (1 retries left).
|
pai@gpu-cluster-node002:~$ sudo docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
9a79e7f6df59 quay.io/coreos/etcd:v3.3.10 "/usr/local/bin/etcd" 6 seconds ago Created etcd1
|
@MengS1024 You can check whether the master port 2379 is blocked or not. Or check the log of the etcd container. |
I don't see port 2379 listening on the master node. Where is the log of etcd? |
It's a Docker container. If it has started, use docker logs to see its log. |
The status is Created; it is not running. |
Please find out why it is not running. One possible reason could be a network issue: you cannot download the image. |
pai@gpu-cluster-node002:~$ sudo docker images | grep etcd
quay.io/coreos/etcd v3.3.10 643c21638c1c 2 years ago 39.5MB
pai@gpu-cluster-node002:~$ sudo docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
2ba82adf4330 quay.io/coreos/etcd:v3.3.10 "/usr/local/bin/etcd" 3 seconds ago Created etcd1
|
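[Editor's note: the earlier suggestion to check whether port 2379 is blocked can be sketched as a quick TCP probe. This is a hypothetical diagnostic helper; the address below is the etcd endpoint from the failing health check in this thread:]

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False

# 10.10.30.14:2379 is the etcd client endpoint from the error log above.
print(port_open("10.10.30.14", 2379))
```

If this prints False while the etcd container is not running, the refusal is expected; the container has to start before anything listens on 2379.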
I'm not sure about the root cause. Could you refer to kubespray for more help? e.g. https://github.com/kubernetes-sigs/kubespray/search?q=Check+if+etcd+cluster+is+healthy&type=issues It should be an issue with kubespray. |
I have set it up with version 1.4.0 |
In the layout.yaml file, there is a key 'model' for the GPU type. I wonder how to set it: I have a machine with a 1080 Ti, and when I set model to 1080 or 1080Ti, kube cannot find it.