
quick-start-kubespray.sh failed #5306

Closed
HaoLiuHust opened this issue Feb 20, 2021 · 46 comments

@HaoLiuHust

In the layout.yaml file there is a 'model' key for the GPU type, and I wonder how to set it. I have a machine with a 1080 Ti; when I set model to 1080 or 1080Ti, Kubernetes cannot find it.

@HaoLiuHust (Author)

fatal: [uniubi-alg057]: FAILED! => {"changed": false, "msg": "The following requirements are not met: NVIDIA GPU card number is not matched: 9 specified but only found"}
fatal: [uface-gpu-server]: FAILED! => {"changed": false, "msg": "The following requirements are not met: NVIDIA GPU card number is not matched: 7 specified but only found"}

@HaoLiuHust (Author)

[screenshot]

HaoLiuHust changed the title from "how to set 'model' in layout.yaml" to "quick-start-kubespray.sh failed" on Feb 20, 2021
@HaoLiuHust (Author) commented Feb 20, 2021

Changing this line in roles/requirements/nvidia.com_gpu/tasks/main.yml from

  • "nvidia_gpu_count.stdout_lines[0]|int != computing_device_count"

to

  • "nvidia_gpu_count.stdout_lines[-1]|int != computing_device_count"

solved this problem.
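
For reference, a minimal sketch of the check with the changed condition (the task names and the fail wrapper here are illustrative, not the exact contents of the role; computing_device_count is the GPU count declared in layout.yaml):

- name: "check nvidia gpu count"
  raw: "nvidia-smi --query-gpu=gpu_name --format=csv | tail --lines=+2 | wc -l"
  register: nvidia_gpu_count
  failed_when: false
  changed_when: false

- name: "verify the detected count matches layout.yaml"
  fail:
    msg: "NVIDIA GPU card number is not matched"
  # was: nvidia_gpu_count.stdout_lines[0]|int != computing_device_count
  when: nvidia_gpu_count.stdout_lines[-1]|int != computing_device_count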

@hzy46 (Contributor) commented Feb 22, 2021

Thanks for reporting. This may be due to different output formats of nvidia-smi.

@hzy46 (Contributor) commented Feb 22, 2021

Could you please give the results of the commands nvidia-smi --query-gpu=gpu_name --format=csv and nvidia-smi --query-gpu=gpu_name --format=csv | tail --lines=+2 | wc -l ?
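
For example, on a machine with two 1080 Ti cards the two commands would print roughly the following (sample output for illustration only):

$ nvidia-smi --query-gpu=gpu_name --format=csv
name
GeForce GTX 1080 Ti
GeForce GTX 1080 Ti

$ nvidia-smi --query-gpu=gpu_name --format=csv | tail --lines=+2 | wc -l
2

tail --lines=+2 drops the csv header line ("name"), so wc -l counts one line per GPU.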

@lbin commented Feb 22, 2021

This does not really solve the problem: in the Dashboard, the GPU metrics still show 0 GPUs. @hzy46 @HaoLiuHust

@HaoLiuHust (Author) commented Feb 22, 2021

[screenshot]

[screenshot]

@hzy46

@HaoLiuHust (Author)

Any progress?

@HaoLiuHust (Author)

This does not really solve the problem: in the Dashboard, the GPU metrics still show 0 GPUs. @hzy46 @HaoLiuHust

Have you solved the problem?

@lbin commented Feb 24, 2021

This does not really solve the problem: in the Dashboard, the GPU metrics still show 0 GPUs. @hzy46 @HaoLiuHust

Have you solved the problem?

I am still on 1.3, because there have been a lot of problems since 1.4.

@HaoLiuHust (Author) commented Feb 24, 2021

This does not really solve the problem: in the Dashboard, the GPU metrics still show 0 GPUs. @hzy46 @HaoLiuHust

Have you solved the problem?

I am still on 1.3, because there have been a lot of problems since 1.4.

Thanks, maybe I should try 1.3... Could you give me your email?

@hzy46 (Contributor) commented Feb 24, 2021

@HaoLiuHust The environment check works on my machines. I will investigate further.

@lbin For the GPU metrics problem, would you please submit a new issue for it?

@lbin commented Feb 25, 2021

[screenshot]

Because of this issue, I don't have PAI 1.5 running.

@hzy46 (Contributor) commented Feb 25, 2021

@HaoLiuHust I'm debugging this issue. Would you please help to execute the following steps on your dev box machine?

  1. Save the following content to /tmp/test.yml:
---
- hosts: all
  gather_facts: false
  tasks:

  - name: "check full"
    raw: "nvidia-smi --query-gpu=gpu_name --format=csv | tail --lines=+2 | wc -l"
    register: nvidia_gpu_count
    failed_when: false
    changed_when: false
    check_mode: false
    environment: {}

  - name: debug
    debug:
      var: nvidia_gpu_count.stdout_lines

  - name: debug
    debug:
      var: nvidia_gpu_count
  2. Run ansible-playbook -i ~/pai-deploy/kubespray/inventory/pai/hosts.yml /tmp/test.yml --limit=<node-name>

Please replace <node-name> with your worker name.

  3. Provide the outputs.
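
For example, with a worker named node4 (the node name is only an illustration; use a name from your own hosts.yml):

ansible-playbook -i ~/pai-deploy/kubespray/inventory/pai/hosts.yml /tmp/test.yml --limit=node4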

Here's my log:

PLAY [all] **********************************************************************************************************************************************************************

TASK [check full] ***************************************************************************************************************************************************************
ok: [node4]

TASK [debug] ********************************************************************************************************************************************************************
ok: [node4] => {
    "nvidia_gpu_count.stdout_lines": [
        "4"
    ]
}

TASK [debug] ********************************************************************************************************************************************************************
ok: [node4] => {
    "nvidia_gpu_count": {
        "changed": false,
        "failed": false,
        "failed_when_result": false,
        "rc": 0,
        "stderr": "Shared connection to 10.151.40.224 closed.\r\n",
        "stderr_lines": [
            "Shared connection to 10.151.40.224 closed."
        ],
        "stdout": "4\r\n",
        "stdout_lines": [
            "4"
        ]
    }
}

PLAY RECAP **********************************************************************************************************************************************************************
node4                      : ok=3    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

@HaoLiuHust (Author) commented Feb 25, 2021

[DEPRECATION WARNING]: The TRANSFORM_INVALID_GROUP_CHARS settings is set to allow bad characters in group names by default, 
this will change, but still be user configurable on deprecation. This feature will be removed in version 2.10. Deprecation 
warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.
[WARNING]: Invalid characters were found in group names but not replaced, use -vvvv to see details

PLAY [all] *******************************************************************************************************************

TASK [check full] ************************************************************************************************************
ok: [uface-gpu-server]

TASK [debug] *****************************************************************************************************************
ok: [uface-gpu-server] => {
    "nvidia_gpu_count.stdout_lines": [
        "8"
    ]
}

TASK [debug] *****************************************************************************************************************
ok: [uface-gpu-server] => {
    "nvidia_gpu_count": {
        "changed": false,
        "failed": false,
        "failed_when_result": false,
        "rc": 0,
        "stderr": "Shared connection to 10.1.9.55 closed.\r\n",
        "stderr_lines": [
            "Shared connection to 10.1.9.55 closed."
        ],
        "stdout": "8\r\n",
        "stdout_lines": [
            "8"
        ]
    }
}

PLAY RECAP *******************************************************************************************************************
uface-gpu-server           : ok=3    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

@hzy46

@hzy46 (Contributor) commented Mar 1, 2021

@HaoLiuHust Looks like nvidia_gpu_count.stdout_lines is OK, so it's strange that it didn't work the first time.

Could you please try the following Ansible playbook with the same command? It just makes sure |int and nvidia_gpu_count.stdout_lines[0] work as expected.

---
- hosts: all
  gather_facts: false
  tasks:

  - name: "check full"
    raw: "nvidia-smi --query-gpu=gpu_name --format=csv | tail --lines=+2 | wc -l"
    register: nvidia_gpu_count
    failed_when: false
    changed_when: false
    check_mode: false
    environment: {}

  - name: debug
    debug:
      var: nvidia_gpu_count.stdout_lines

  - name: debug
    debug:
      var: nvidia_gpu_count

  - name: debug
    debug:
      var: nvidia_gpu_count.stdout_lines[0]|int

  - name: set_fact
    set_fact:
      debug_string: "found {{ nvidia_gpu_count.stdout_lines[0] }} gpus"

  - name: debug
    debug:
      var: debug_string

@MengS1024

@hzy46
I also hit the same error when I ran quick-start-kubespray.sh. The version of OpenPAI is v1.5.0, and my card is a 2080 Ti.

TASK [display unmet requirements] *******************************************************************************************************************************
fatal: [gpu-cluster-node001]: FAILED! => {"changed": false, "msg": "The following requirements are not met: NVIDIA GPU card number is not matched: 2 specified but only  found"}
skipping: [gpu-cluster-node002]
fatal: [gpu-cluster-node004]: FAILED! => {"changed": false, "msg": "The following requirements are not met: NVIDIA GPU card number is not matched: 2 specified but only  found"}
skipping: [localhost]

@hzy46 (Contributor) commented Mar 4, 2021

@MengS1024

Could you please save the following content to /tmp/test.yml, and run ansible-playbook -i ~/pai-deploy/kubespray/inventory/pai/hosts.yml /tmp/test.yml --limit=gpu-cluster-node001?

---
- hosts: all
  gather_facts: false
  tasks:

  - name: "check full"
    raw: "nvidia-smi --query-gpu=gpu_name --format=csv | tail --lines=+2 | wc -l"
    register: nvidia_gpu_count
    failed_when: false
    changed_when: false
    check_mode: false
    environment: {}

  - name: debug
    debug:
      var: nvidia_gpu_count.stdout_lines

  - name: debug
    debug:
      var: nvidia_gpu_count

  - name: debug
    debug:
      var: nvidia_gpu_count.stdout_lines[0]|int

  - name: set_fact
    set_fact:
      debug_string: "found {{ nvidia_gpu_count.stdout_lines[0] }} gpus"

  - name: debug
    debug:
      var: debug_string

@HaoLiuHust (Author)

@HaoLiuHust Looks like nvidia_gpu_count.stdout_lines is OK, so it's strange that it didn't work the first time. Could you please try the following Ansible playbook with the same command? [...]

Sorry, I am a little busy these days; I will try it later.

@MengS1024 commented Mar 4, 2021

@hzy46
#5306 (comment)

pai@dev-box:~/pai/contrib/kubespray$ ansible-playbook -i ~/pai-deploy/kubespray/inventory/pai/hosts.yml /tmp/test.yml --limit=gpu-cluster-node001
/usr/local/lib/python3.5/dist-packages/ansible/parsing/vault/__init__.py:44: CryptographyDeprecationWarning: Python 3.5 support will be dropped in the next release of cryptography. Please upgrade your Python.
  from cryptography.exceptions import InvalidSignature
[WARNING]: Unable to parse /home/pai/pai-deploy/kubespray/inventory/pai/hosts.yml as an inventory source
[WARNING]: No inventory was parsed, only implicit localhost is available
[WARNING]: provided hosts list is empty, only localhost is available. Note that the implicit localhost does not match 'all'
[WARNING]: Could not match supplied host pattern, ignoring: gpu-cluster-node001

PLAY [all] ******************************************************************************************************************************************************
skipping: no hosts matched

PLAY RECAP ******************************************************************************************************************************************************

@MengS1024

@MengS1024 Could you please save the following content to /tmp/test.yml, and run ansible-playbook -i ~/pai-deploy/kubespray/inventory/pai/hosts.yml /tmp/test.yml --limit=gpu-cluster-node001? [...]

I can't find ~/pai-deploy/kubespray/inventory/pai/hosts.yml.

pai@dev-box:~/pai/contrib/kubespray$ ls /home/pai/pai-deploy/kubespray/inventory/pai/
group_vars  inventory.ini

@hzy46 (Contributor) commented Mar 5, 2021

@MengS1024 It should be on the dev box machine; it is generated by the PAI deployment script.

@MengS1024

@MengS1024 It should be on the dev box machine; it is generated by the PAI deployment script.

Yes, it's on the dev box.

@MengS1024

@hzy46 any update?

@hzy46 (Contributor) commented Mar 5, 2021

@MengS1024 Can you use ~/pai-deploy/cluster-cfg/hosts.yml instead of /home/pai/pai-deploy/kubespray/inventory/pai/hosts.yml?

Or you can create an inventory file on your own, roughly as sketched below.
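
If you write an inventory by hand, a minimal Ansible YAML inventory looks roughly like this (the host name, address, user, and key path below are placeholders, not values taken from your cluster):

all:
  hosts:
    gpu-cluster-node001:
      ansible_host: 10.10.30.13                   # node IP
      ansible_user: pai                           # SSH user
      ansible_ssh_private_key_file: ~/.ssh/id_rsa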

@MengS1024

@MengS1024 Can you use ~/pai-deploy/cluster-cfg/hosts.yml instead of /home/pai/pai-deploy/kubespray/inventory/pai/hosts.yml? [...]

/usr/local/lib/python3.5/dist-packages/ansible/parsing/vault/__init__.py:44: CryptographyDeprecationWarning: Python 3.5 support will be dropped in the next release of cryptography. Please upgrade your Python.
  from cryptography.exceptions import InvalidSignature
[DEPRECATION WARNING]: The TRANSFORM_INVALID_GROUP_CHARS settings is set to allow bad characters in group names by default, this will change, but still be user 
configurable on deprecation. This feature will be removed in version 2.10. Deprecation warnings can be disabled by setting deprecation_warnings=False in 
ansible.cfg.
[WARNING]: Invalid characters were found in group names but not replaced, use -vvvv to see details

PLAY [all] ******************************************************************************************************************************************************

TASK [check full] ***********************************************************************************************************************************************
ok: [gpu-cluster-node001]

TASK [debug] ****************************************************************************************************************************************************
ok: [gpu-cluster-node001] => {
    "nvidia_gpu_count.stdout_lines": [
        "2"
    ]
}

TASK [debug] ****************************************************************************************************************************************************
ok: [gpu-cluster-node001] => {
    "nvidia_gpu_count": {
        "changed": false,
        "failed": false,
        "failed_when_result": false,
        "rc": 0,
        "stderr": "Shared connection to 10.10.30.13 closed.\r\n",
        "stderr_lines": [
            "Shared connection to 10.10.30.13 closed."
        ],
        "stdout": "2\r\n",
        "stdout_lines": [
            "2"
        ]
    }
}

TASK [debug] ****************************************************************************************************************************************************
ok: [gpu-cluster-node001] => {
    "nvidia_gpu_count.stdout_lines[0]|int": "2"
}

TASK [set_fact] *************************************************************************************************************************************************
ok: [gpu-cluster-node001]

TASK [debug] ****************************************************************************************************************************************************
ok: [gpu-cluster-node001] => {
    "debug_string": "found 2 gpus"
}

PLAY RECAP ******************************************************************************************************************************************************
gpu-cluster-node001        : ok=6    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

@hzy46 (Contributor) commented Mar 5, 2021

@MengS1024 The script works as expected. Can you run bash requirement.sh -l <path-to-layout.yaml> -c <path-to-config.yaml> under <pai-source-code>/contrib/kubespray? I think it should work.

@MengS1024

@hzy46 I have already run the command ./requirement.sh -l config/layout.yaml -c config/config.yaml, but I got the same error.

TASK [display unmet requirements] *******************************************************************************************************************************
fatal: [gpu-cluster-node004]: FAILED! => {"changed": false, "msg": "The following requirements are not met: NVIDIA GPU card number is not matched: 2 specified but only  found"}
fatal: [gpu-cluster-node001]: FAILED! => {"changed": false, "msg": "The following requirements are not met: NVIDIA GPU card number is not matched: 2 specified but only  found"}

@MengS1024

[screenshot]

@MengS1024

@hzy46
Is it right here?
[screenshot]

@hzy46 (Contributor) commented Mar 5, 2021

@hzy46
Is it right here?
[screenshot]

It is OK.

@hzy46 (Contributor) commented Mar 5, 2021

It's a really strange problem. You can see the source code: https://github.com/microsoft/pai/blob/master/contrib/kubespray/roles/requirement/computing-devices/nvidia.com_gpu/tasks/main.yml#L36-L57 The code is the same but gives different results.

Could you:

  1. modify requirement.sh: add -vvv after ansible-playbook
  2. run ./requirement.sh -l config/layout.yaml -c config/config.yaml
  3. provide the full log here

@MengS1024

@hzy46
Here is the log.
requirement.log

@hzy46 (Contributor) commented Mar 5, 2021

Thanks @MengS1024

[gpu-cluster-node004] => {
    "changed": false,
    "failed_when_result": false,
    "rc": 0,
    "stderr": "Shared connection to 10.10.30.16 closed.\r\n",
    "stderr_lines": [
        "Shared connection to 10.10.30.16 closed."
    ],
    "stdout": "\r\n2\r\n",
    "stdout_lines": [
        "",
        "2"
    ]

The stdout_lines becomes ["", "2"] when the script is run. I don't know the root cause yet.

I will submit a PR to fix this issue.
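
One way to make the check tolerant of that leading empty line (only a sketch; it may differ from the actual PR) is to drop empty lines before comparing:

- name: "verify the detected gpu count"
  fail:
    msg: "NVIDIA GPU card number is not matched"
  # select() without a test drops empty strings, so ["", "2"] and ["2"] both yield 2
  when: (nvidia_gpu_count.stdout_lines | select() | list | last | int) != computing_device_count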

@MengS1024

Thanks, please let me know when it's fixed.

@hzy46 (Contributor) commented Mar 5, 2021

@MengS1024

Please see #5353.

You can try it by switching to the branch zhiyuhe/fix_nvidia_gpu_count.

@MengS1024

@hzy46
Thanks, but I ran into a new problem.

TASK [etcd : Configure | Check if etcd cluster is healthy] ******************************************************************************************************
fatal: [gpu-cluster-node002]: FAILED! => {"changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://10.10.30.14:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:00.008623", "end": "2021-03-05 23:22:36.548779", "msg": "non-zero return code", "rc": 1, "start": "2021-03-05 23:22:36.540156", "stderr": "Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 10.10.30.14:2379: connect: connection refused\n\nerror #0: dial tcp 10.10.30.14:2379: connect: connection refused", "stderr_lines": ["Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 10.10.30.14:2379: connect: connection refused", "", "error #0: dial tcp 10.10.30.14:2379: connect: connection refused"], "stdout": "", "stdout_lines": []}
...ignoring
included: /home/autox-it/pai-deploy/kubespray/roles/etcd/tasks/refresh_config.yml for gpu-cluster-node002
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (4 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (3 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (2 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (1 retries left).

@MengS1024

pai@gpu-cluster-node002:~$ sudo docker ps -a
CONTAINER ID        IMAGE                         COMMAND                 CREATED             STATUS              PORTS               NAMES
9a79e7f6df59        quay.io/coreos/etcd:v3.3.10   "/usr/local/bin/etcd"   6 seconds ago       Created                                 etcd1

@hzy46 (Contributor) commented Mar 8, 2021

@MengS1024 You can check whether the master port 2379 is blocked or not. Or check the log of the etcd container.

@MengS1024

@MengS1024 You can check whether the master port 2379 is blocked or not. Or check the log of the etcd container.

I don't see port 2379 listening on the master node. Where is the etcd log?

@hzy46 (Contributor) commented Mar 8, 2021

It's a Docker container. If it has started, use docker logs to see its log.
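
For example, on the node where the etcd1 container was created (the container name is taken from the docker ps output above):

sudo docker logs etcd1                                                     # etcd's own log, if it ever started
sudo docker inspect etcd1 --format '{{.State.Status}} {{.State.Error}}'   # why it is stuck in Created
sudo ss -tlnp | grep 2379                                                  # is anything listening on the etcd client port?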

@MengS1024

The status is Created; it is not running.

@hzy46 (Contributor) commented Mar 8, 2021

The status is Created; it is not running.

Please find out why it is not running. One possible reason could be a network issue: you cannot download the image.

@MengS1024

The status is Created; it is not running.

Please find out why it is not running. One possible reason could be a network issue: you cannot download the image.

pai@gpu-cluster-node002:~$ sudo docker images | grep etcd
quay.io/coreos/etcd                                v3.3.10                           643c21638c1c        2 years ago         39.5MB
pai@gpu-cluster-node002:~$ sudo docker ps -a
CONTAINER ID        IMAGE                         COMMAND                 CREATED             STATUS              PORTS               NAMES
2ba82adf4330        quay.io/coreos/etcd:v3.3.10   "/usr/local/bin/etcd"   3 seconds ago       Created                                 etcd1

@hzy46 (Contributor) commented Mar 8, 2021

@MengS1024

I'm not sure about the root cause. Could you refer to kubespray for more help? e.g. https://github.com/kubernetes-sigs/kubespray/search?q=Check+if+etcd+cluster+is+healthy&type=issues It should be an issue with kubespray.

@HaoLiuHust (Author)

I have set it up with version 1.4.0.
