
quick-start-kubespray.sh failed #5306

Closed
HaoLiuHust opened this issue Feb 20, 2021 · 46 comments

@HaoLiuHust

In the layout.yaml file there is a 'model' key for the GPU type, and I wonder how to set it. I have a machine with a 1080 Ti; when I set model to 1080 or 1080Ti, Kubernetes cannot find it.

@HaoLiuHust (Author)

fatal: [uniubi-alg057]: FAILED! => {"changed": false, "msg": "The following requirements are not met: NVIDIA GPU card number is not matched: 9 specified but only found"}
fatal: [uface-gpu-server]: FAILED! => {"changed": false, "msg": "The following requirements are not met: NVIDIA GPU card number is not matched: 7 specified but only found"}

@HaoLiuHust (Author)

[screenshot]

HaoLiuHust changed the title from "how to set 'model' in layout.yaml" to "quick-start-kubespray.sh failed" on Feb 20, 2021
@HaoLiuHust (Author) commented Feb 20, 2021

Changing this line in roles/requirements/nvidia.com_gpu/tasks/main.yml from

  • "nvidia_gpu_count.stdout_lines[0]|int != computing_device_count"

to

  • "nvidia_gpu_count.stdout_lines[-1]|int != computing_device_count"

solved this problem.
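
For reference, a minimal sketch of the check with the changed condition (the task names and the fail wrapper here are illustrative, not the exact contents of the role; computing_device_count is the GPU count declared in layout.yaml):

- name: "check nvidia gpu count"
  raw: "nvidia-smi --query-gpu=gpu_name --format=csv | tail --lines=+2 | wc -l"
  register: nvidia_gpu_count
  failed_when: false
  changed_when: false

- name: "verify the detected count matches layout.yaml"
  fail:
    msg: "NVIDIA GPU card number is not matched"
  # was: nvidia_gpu_count.stdout_lines[0]|int != computing_device_count
  when: nvidia_gpu_count.stdout_lines[-1]|int != computing_device_count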

@hzy46 (Contributor) commented Feb 22, 2021

Thanks for reporting. This may be due to different output formats of nvidia-smi.

@hzy46 (Contributor) commented Feb 22, 2021

Could you please give the results of the commands nvidia-smi --query-gpu=gpu_name --format=csv and nvidia-smi --query-gpu=gpu_name --format=csv | tail --lines=+2 | wc -l ?
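
For example, on a machine with two 1080 Ti cards the two commands would print roughly the following (sample output for illustration only):

$ nvidia-smi --query-gpu=gpu_name --format=csv
name
GeForce GTX 1080 Ti
GeForce GTX 1080 Ti

$ nvidia-smi --query-gpu=gpu_name --format=csv | tail --lines=+2 | wc -l
2

tail --lines=+2 drops the csv header line ("name"), so wc -l counts one line per GPU.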

@lbin commented Feb 22, 2021

This does not really solve the problem: in the Dashboard, the GPU metrics still show 0 GPUs. @hzy46 @HaoLiuHust

@HaoLiuHust (Author) commented Feb 22, 2021

[screenshot]

[screenshot]

@hzy46

@HaoLiuHust (Author)

Any progress?

@HaoLiuHust (Author)

This does not really solve the problem: in the Dashboard, the GPU metrics still show 0 GPUs. @hzy46 @HaoLiuHust

Have you solved the problem?

@lbin commented Feb 24, 2021

This does not really solve the problem: in the Dashboard, the GPU metrics still show 0 GPUs. @hzy46 @HaoLiuHust

Have you solved the problem?

I am still on 1.3, because there have been a lot of problems since 1.4.

@HaoLiuHust (Author) commented Feb 24, 2021

This does not really solve the problem: in the Dashboard, the GPU metrics still show 0 GPUs. @hzy46 @HaoLiuHust

Have you solved the problem?

I am still on 1.3, because there have been a lot of problems since 1.4.

Thanks, maybe I should try 1.3... Could you give me your email?

@hzy46 (Contributor) commented Feb 24, 2021

@HaoLiuHust The environment check works on my machines. I will investigate further.

@lbin For the GPU metrics problem, would you please submit a new issue for it?

@lbin commented Feb 25, 2021

[screenshot]

Because of this issue, I don't have PAI 1.5 running.

@hzy46 (Contributor) commented Feb 25, 2021

@HaoLiuHust I'm debugging this issue. Would you please help to execute the following steps on your dev box machine?

  1. Save the following content to /tmp/test.yml:
---
- hosts: all
  gather_facts: false
  tasks:

  - name: "check full"
    raw: "nvidia-smi --query-gpu=gpu_name --format=csv | tail --lines=+2 | wc -l"
    register: nvidia_gpu_count
    failed_when: false
    changed_when: false
    check_mode: false
    environment: {}

  - name: debug
    debug:
      var: nvidia_gpu_count.stdout_lines

  - name: debug
    debug:
      var: nvidia_gpu_count
  2. Run ansible-playbook -i ~/pai-deploy/kubespray/inventory/pai/hosts.yml /tmp/test.yml --limit=<node-name>

Please replace <node-name> with your worker name.

  3. Provide the outputs.
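
For example, with a worker named node4 (the node name is only an illustration; use a name from your own hosts.yml):

ansible-playbook -i ~/pai-deploy/kubespray/inventory/pai/hosts.yml /tmp/test.yml --limit=node4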

Here's my log:

PLAY [all] **********************************************************************************************************************************************************************

TASK [check full] ***************************************************************************************************************************************************************
ok: [node4]

TASK [debug] ********************************************************************************************************************************************************************
ok: [node4] => {
    "nvidia_gpu_count.stdout_lines": [
        "4"
    ]
}

TASK [debug] ********************************************************************************************************************************************************************
ok: [node4] => {
    "nvidia_gpu_count": {
        "changed": false,
        "failed": false,
        "failed_when_result": false,
        "rc": 0,
        "stderr": "Shared connection to 10.151.40.224 closed.\r\n",
        "stderr_lines": [
            "Shared connection to 10.151.40.224 closed."
        ],
        "stdout": "4\r\n",
        "stdout_lines": [
            "4"
        ]
    }
}

PLAY RECAP **********************************************************************************************************************************************************************
node4                      : ok=3    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

@HaoLiuHust (Author) commented Feb 25, 2021

[DEPRECATION WARNING]: The TRANSFORM_INVALID_GROUP_CHARS settings is set to allow bad characters in group names by default, 
this will change, but still be user configurable on deprecation. This feature will be removed in version 2.10. Deprecation 
warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.
[WARNING]: Invalid characters were found in group names but not replaced, use -vvvv to see details

PLAY [all] *******************************************************************************************************************

TASK [check full] ************************************************************************************************************
ok: [uface-gpu-server]

TASK [debug] *****************************************************************************************************************
ok: [uface-gpu-server] => {
    "nvidia_gpu_count.stdout_lines": [
        "8"
    ]
}

TASK [debug] *****************************************************************************************************************
ok: [uface-gpu-server] => {
    "nvidia_gpu_count": {
        "changed": false,
        "failed": false,
        "failed_when_result": false,
        "rc": 0,
        "stderr": "Shared connection to 10.1.9.55 closed.\r\n",
        "stderr_lines": [
            "Shared connection to 10.1.9.55 closed."
        ],
        "stdout": "8\r\n",
        "stdout_lines": [
            "8"
        ]
    }
}

PLAY RECAP *******************************************************************************************************************
uface-gpu-server           : ok=3    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

@hzy46

@hzy46 (Contributor) commented Mar 1, 2021

@HaoLiuHust Looks like nvidia_gpu_count.stdout_lines is OK, so it's strange that it didn't work the first time.

Could you please try the following Ansible playbook with the same command? It just makes sure |int and nvidia_gpu_count.stdout_lines[0] work as expected.

---
- hosts: all
  gather_facts: false
  tasks:

  - name: "check full"
    raw: "nvidia-smi --query-gpu=gpu_name --format=csv | tail --lines=+2 | wc -l"
    register: nvidia_gpu_count
    failed_when: false
    changed_when: false
    check_mode: false
    environment: {}

  - name: debug
    debug:
      var: nvidia_gpu_count.stdout_lines

  - name: debug
    debug:
      var: nvidia_gpu_count

  - name: debug
    debug:
      var: nvidia_gpu_count.stdout_lines[0]|int

  - name: set_fact
    set_fact:
      debug_string: "found {{ nvidia_gpu_count.stdout_lines[0] }} gpus"

  - name: debug
    debug:
      var: debug_string

@MengS1024

@hzy46
I also hit the same error when I ran quick-start-kubespray.sh. The version of OpenPAI is v1.5.0, and my card is a 2080 Ti.

TASK [display unmet requirements] *******************************************************************************************************************************
fatal: [gpu-cluster-node001]: FAILED! => {"changed": false, "msg": "The following requirements are not met: NVIDIA GPU card number is not matched: 2 specified but only  found"}
skipping: [gpu-cluster-node002]
fatal: [gpu-cluster-node004]: FAILED! => {"changed": false, "msg": "The following requirements are not met: NVIDIA GPU card number is not matched: 2 specified but only  found"}
skipping: [localhost]

@hzy46 (Contributor) commented Mar 4, 2021

@MengS1024

Could you please save the following content to /tmp/test.yml, and run ansible-playbook -i ~/pai-deploy/kubespray/inventory/pai/hosts.yml /tmp/test.yml --limit=gpu-cluster-node001?

---
- hosts: all
  gather_facts: false
  tasks:

  - name: "check full"
    raw: "nvidia-smi --query-gpu=gpu_name --format=csv | tail --lines=+2 | wc -l"
    register: nvidia_gpu_count
    failed_when: false
    changed_when: false
    check_mode: false
    environment: {}

  - name: debug
    debug:
      var: nvidia_gpu_count.stdout_lines

  - name: debug
    debug:
      var: nvidia_gpu_count

  - name: debug
    debug:
      var: nvidia_gpu_count.stdout_lines[0]|int

  - name: set_fact
    set_fact:
      debug_string: "found {{ nvidia_gpu_count.stdout_lines[0] }} gpus"

  - name: debug
    debug:
      var: debug_string

@HaoLiuHust (Author)

@HaoLiuHust Looks like nvidia_gpu_count.stdout_lines is OK, so it's strange that it didn't work the first time. Could you please try the following Ansible playbook with the same command? [...]

Sorry, I am a little busy these days; I will try it later.

@MengS1024 commented Mar 4, 2021

@hzy46
#5306 (comment)

pai@dev-box:~/pai/contrib/kubespray$ ansible-playbook -i ~/pai-deploy/kubespray/inventory/pai/hosts.yml /tmp/test.yml --limit=gpu-cluster-node001
/usr/local/lib/python3.5/dist-packages/ansible/parsing/vault/__init__.py:44: CryptographyDeprecationWarning: Python 3.5 support will be dropped in the next release of cryptography. Please upgrade your Python.
  from cryptography.exceptions import InvalidSignature
[WARNING]: Unable to parse /home/pai/pai-deploy/kubespray/inventory/pai/hosts.yml as an inventory source
[WARNING]: No inventory was parsed, only implicit localhost is available
[WARNING]: provided hosts list is empty, only localhost is available. Note that the implicit localhost does not match 'all'
[WARNING]: Could not match supplied host pattern, ignoring: gpu-cluster-node001

PLAY [all] ******************************************************************************************************************************************************
skipping: no hosts matched

PLAY RECAP ******************************************************************************************************************************************************

@MengS1024

@MengS1024 Could you please save the following content to /tmp/test.yml, and run ansible-playbook -i ~/pai-deploy/kubespray/inventory/pai/hosts.yml /tmp/test.yml --limit=gpu-cluster-node001? [...]

I can't find ~/pai-deploy/kubespray/inventory/pai/hosts.yml.

pai@dev-box:~/pai/contrib/kubespray$ ls /home/pai/pai-deploy/kubespray/inventory/pai/
group_vars  inventory.ini

@hzy46 (Contributor) commented Mar 5, 2021

@MengS1024 It should be on the dev box machine; it is generated by the PAI deployment script.

@MengS1024

@MengS1024 It should be on the dev box machine; it is generated by the PAI deployment script.

Yes, it's on the dev box.

@MengS1024

@hzy46 any update?

@hzy46 (Contributor) commented Mar 5, 2021

@MengS1024 Can you use ~/pai-deploy/cluster-cfg/hosts.yml instead of /home/pai/pai-deploy/kubespray/inventory/pai/hosts.yml?

Or you can create an inventory file on your own, roughly as sketched below.
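
If you write an inventory by hand, a minimal Ansible YAML inventory looks roughly like this (the host name, address, user, and key path below are placeholders, not values taken from your cluster):

all:
  hosts:
    gpu-cluster-node001:
      ansible_host: 10.10.30.13                   # node IP
      ansible_user: pai                           # SSH user
      ansible_ssh_private_key_file: ~/.ssh/id_rsa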

@MengS1024

@MengS1024 Can you use ~/pai-deploy/cluster-cfg/hosts.yml instead of /home/pai/pai-deploy/kubespray/inventory/pai/hosts.yml? [...]

/usr/local/lib/python3.5/dist-packages/ansible/parsing/vault/__init__.py:44: CryptographyDeprecationWarning: Python 3.5 support will be dropped in the next release of cryptography. Please upgrade your Python.
  from cryptography.exceptions import InvalidSignature
[DEPRECATION WARNING]: The TRANSFORM_INVALID_GROUP_CHARS settings is set to allow bad characters in group names by default, this will change, but still be user 
configurable on deprecation. This feature will be removed in version 2.10. Deprecation warnings can be disabled by setting deprecation_warnings=False in 
ansible.cfg.
[WARNING]: Invalid characters were found in group names but not replaced, use -vvvv to see details

PLAY [all] ******************************************************************************************************************************************************

TASK [check full] ***********************************************************************************************************************************************
ok: [gpu-cluster-node001]

TASK [debug] ****************************************************************************************************************************************************
ok: [gpu-cluster-node001] => {
    "nvidia_gpu_count.stdout_lines": [
        "2"
    ]
}

TASK [debug] ****************************************************************************************************************************************************
ok: [gpu-cluster-node001] => {
    "nvidia_gpu_count": {
        "changed": false,
        "failed": false,
        "failed_when_result": false,
        "rc": 0,
        "stderr": "Shared connection to 10.10.30.13 closed.\r\n",
        "stderr_lines": [
            "Shared connection to 10.10.30.13 closed."
        ],
        "stdout": "2\r\n",
        "stdout_lines": [
            "2"
        ]
    }
}

TASK [debug] ****************************************************************************************************************************************************
ok: [gpu-cluster-node001] => {
    "nvidia_gpu_count.stdout_lines[0]|int": "2"
}

TASK [set_fact] *************************************************************************************************************************************************
ok: [gpu-cluster-node001]

TASK [debug] ****************************************************************************************************************************************************
ok: [gpu-cluster-node001] => {
    "debug_string": "found 2 gpus"
}

PLAY RECAP ******************************************************************************************************************************************************
gpu-cluster-node001        : ok=6    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

@hzy46 (Contributor) commented Mar 5, 2021

@MengS1024 The script works as expected. Can you run bash requirement.sh -l <path-to-layout.yaml> -c <path-to-config.yaml> under <pai-source-code>/contrib/kubespray? I think it should work.

@MengS1024

@hzy46 I have already run the command ./requirement.sh -l config/layout.yaml -c config/config.yaml, but I got the same error.

TASK [display unmet requirements] *******************************************************************************************************************************
fatal: [gpu-cluster-node004]: FAILED! => {"changed": false, "msg": "The following requirements are not met: NVIDIA GPU card number is not matched: 2 specified but only  found"}
fatal: [gpu-cluster-node001]: FAILED! => {"changed": false, "msg": "The following requirements are not met: NVIDIA GPU card number is not matched: 2 specified but only  found"}

@MengS1024

[screenshot]

@MengS1024

@hzy46
Is it right here?
[screenshot]

@hzy46 (Contributor) commented Mar 5, 2021

@hzy46
Is it right here?
[screenshot]

It is OK.

@hzy46 (Contributor) commented Mar 5, 2021

It's a really strange problem. You can see the source code: https://github.com/microsoft/pai/blob/master/contrib/kubespray/roles/requirement/computing-devices/nvidia.com_gpu/tasks/main.yml#L36-L57 The code is the same but gives different results.

Could you:

  1. modify requirement.sh: add -vvv after ansible-playbook
  2. run ./requirement.sh -l config/layout.yaml -c config/config.yaml
  3. provide the full log here

@MengS1024

@hzy46
Here is the log.
requirement.log

@hzy46 (Contributor) commented Mar 5, 2021

Thanks @MengS1024

[gpu-cluster-node004] => {
    "changed": false,
    "failed_when_result": false,
    "rc": 0,
    "stderr": "Shared connection to 10.10.30.16 closed.\r\n",
    "stderr_lines": [
        "Shared connection to 10.10.30.16 closed."
    ],
    "stdout": "\r\n2\r\n",
    "stdout_lines": [
        "",
        "2"
    ]

The stdout_lines becomes ["", "2"] when the script is run. I don't know the root cause yet.

I will submit a PR to fix this issue.
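
One way to make the check tolerant of that leading empty line (only a sketch; it may differ from the actual PR) is to drop empty lines before comparing:

- name: "verify the detected gpu count"
  fail:
    msg: "NVIDIA GPU card number is not matched"
  # select() without a test drops empty strings, so ["", "2"] and ["2"] both yield 2
  when: (nvidia_gpu_count.stdout_lines | select() | list | last | int) != computing_device_count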

@MengS1024

Thanks, please let me know when it's fixed.

@hzy46 (Contributor) commented Mar 5, 2021

@MengS1024

Please see #5353.

You can try it by switching to the branch zhiyuhe/fix_nvidia_gpu_count.

@MengS1024

@hzy46
Thanks, but I ran into a new problem.

TASK [etcd : Configure | Check if etcd cluster is healthy] ******************************************************************************************************
fatal: [gpu-cluster-node002]: FAILED! => {"changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://10.10.30.14:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:00.008623", "end": "2021-03-05 23:22:36.548779", "msg": "non-zero return code", "rc": 1, "start": "2021-03-05 23:22:36.540156", "stderr": "Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 10.10.30.14:2379: connect: connection refused\n\nerror #0: dial tcp 10.10.30.14:2379: connect: connection refused", "stderr_lines": ["Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 10.10.30.14:2379: connect: connection refused", "", "error #0: dial tcp 10.10.30.14:2379: connect: connection refused"], "stdout": "", "stdout_lines": []}
...ignoring
included: /home/autox-it/pai-deploy/kubespray/roles/etcd/tasks/refresh_config.yml for gpu-cluster-node002
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (4 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (3 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (2 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (1 retries left).

@MengS1024

pai@gpu-cluster-node002:~$ sudo docker ps -a
CONTAINER ID        IMAGE                         COMMAND                 CREATED             STATUS              PORTS               NAMES
9a79e7f6df59        quay.io/coreos/etcd:v3.3.10   "/usr/local/bin/etcd"   6 seconds ago       Created                                 etcd1

@hzy46 (Contributor) commented Mar 8, 2021

@MengS1024 You can check whether the master port 2379 is blocked or not. Or check the log of the etcd container.

@MengS1024

@MengS1024 You can check whether the master port 2379 is blocked or not. Or check the log of the etcd container.

I don't see port 2379 listening on the master node. Where is the etcd log?

@hzy46 (Contributor) commented Mar 8, 2021

It's a Docker container. If it has started, use docker logs to see its log.
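
For example, on the node where the etcd1 container was created (the container name is taken from the docker ps output above):

sudo docker logs etcd1                                                     # etcd's own log, if it ever started
sudo docker inspect etcd1 --format '{{.State.Status}} {{.State.Error}}'   # why it is stuck in Created
sudo ss -tlnp | grep 2379                                                  # is anything listening on the etcd client port?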

@MengS1024

The status is Created; it is not running.

@hzy46 (Contributor) commented Mar 8, 2021

The status is Created; it is not running.

Please find out why it is not running. One possible reason could be a network issue: you cannot download the image.

@MengS1024

The status is Created; it is not running.

Please find out why it is not running. One possible reason could be a network issue: you cannot download the image.

pai@gpu-cluster-node002:~$ sudo docker images | grep etcd
quay.io/coreos/etcd                                v3.3.10                           643c21638c1c        2 years ago         39.5MB
pai@gpu-cluster-node002:~$ sudo docker ps -a
CONTAINER ID        IMAGE                         COMMAND                 CREATED             STATUS              PORTS               NAMES
2ba82adf4330        quay.io/coreos/etcd:v3.3.10   "/usr/local/bin/etcd"   3 seconds ago       Created                                 etcd1

@hzy46 (Contributor) commented Mar 8, 2021

@MengS1024

I'm not sure about the root cause. Could you refer to kubespray for more help? e.g. https://github.com/kubernetes-sigs/kubespray/search?q=Check+if+etcd+cluster+is+healthy&type=issues It should be an issue with kubespray.

@HaoLiuHust (Author)

I have set it up with version 1.4.0.
