[ci] CUDA CI jobs failing: "Certificate verification failed" #4646

Closed
jameslamb opened this issue Oct 3, 2021 · 5 comments

jameslamb commented Oct 3, 2021

Description

CUDA CI jobs in this project have been failing for the last few days with errors like the following.

CMake Error at CMakeLists.txt:27 (cmake_minimum_required):
CMake 3.16 or higher is required. You are running version 3.10.2

-- Configuring incomplete, errors occurred!
make: *** No rule to make target '_lightgbm'. Stop.

These errors occur because the installation of an up-to-date cmake from Kitware's apt repository is failing, so the jobs end up with an older cmake.
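For anyone who wants to confirm this inside a container, here is a minimal sketch of a version check. The `version_ge` helper is hypothetical (it is not part of LightGBM's CI), and the 3.16 minimum is taken from the CMake error message above. It assumes a `sort` that supports `-V`, as GNU coreutils does.

```shell
#!/bin/sh
# Hypothetical helper (not part of LightGBM's CI): succeeds when version $1 >= version $2.
# Relies on `sort -V` (version sort), as provided by GNU coreutils.
version_ge() {
    # The smaller of the two versions sorts first; $1 >= $2 exactly when that is $2.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Assumed minimum, taken from the CMake error message in this issue.
required="3.16"

# Only query cmake when it is actually on the PATH.
if command -v cmake >/dev/null 2>&1; then
    # `cmake --version` prints "cmake version X.Y.Z" on its first line.
    found="$(cmake --version | head -n 1 | awk '{print $3}')"
    if version_ge "$found" "$required"; then
        echo "cmake $found satisfies >= $required"
    else
        echo "cmake $found is older than $required" >&2
    fi
fi
```

Running this right after the `apt-get install cmake` step would show early whether the Kitware package was actually picked up, instead of failing later at configure time.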

Err:7 https://apt.kitware.com/ubuntu bionic Release
Certificate verification failed: The certificate is NOT trusted. The certificate chain uses expired certificate. Could not handshake: Error in the certificate verification. [IP: 66.194.253.25 443]
Hit:8 http://archive.ubuntu.com/ubuntu bionic-updates InRelease
Hit:9 http://archive.ubuntu.com/ubuntu bionic-backports InRelease
Reading package lists...
E: The repository 'https://apt.kitware.com/ubuntu bionic Release' does not have a Release file.

I agree with #4636 (comment): based on the timing, this seems related to the recent expiration of the root certificate used by Let's Encrypt (https://scotthelme.co.uk/lets-encrypt-old-root-expiration/).

Reproducible example

This has been happening on CUDA CI jobs for the last few days.

For example, I saw it on jobs for #4636, such as https://github.com/microsoft/LightGBM/pull/4636/checks?check_run_id=3759447547.

Environment info

LightGBM CUDA CI jobs.

Additional Comments

The CUDA CI jobs run in Docker containers created from the following image:

docker_img="nvcr.io/nvidia/cuda:${cuda_version}-devel"

This issue should be resolved by those images being updated upstream. I think it could also be worked around by forcing an update of openssl at runtime.
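A rough sketch of what such a runtime workaround might look like follows. Everything here is an assumption rather than a verified fix: the `run` and `refresh_tls_packages` helpers are hypothetical, and which packages actually need upgrading in these images (for example `ca-certificates` and `openssl`) has not been confirmed. It would have to run before the Kitware repository is added, since `apt-get update` fails once that repository is configured.

```shell
#!/bin/sh
# Sketch of a runtime workaround; the package list is an assumption based on the
# suggestion in this issue, not a verified fix.
# Set DRY_RUN=1 to print the commands instead of executing them.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Upgrade the given, already-installed packages to the newest available versions.
refresh_tls_packages() {
    # Refresh package indexes from the repositories configured so far.
    run apt-get update
    # --only-upgrade upgrades installed packages without installing new ones.
    run apt-get install --only-upgrade --no-install-recommends -y "$@"
}

# Example (inside the container, as root):
#   refresh_tls_packages ca-certificates openssl
```

The dry-run switch is only there so the commands can be reviewed before they touch the container.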

Some relevant links:

  • official recommendations from the CMake maintainers for how to install with apt: https://apt.kitware.com/
  • relevant code in LightGBM's CI:

    LightGBM/.ci/setup.sh

    Lines 85 to 106 in a77260f

    if [[ $TASK == "cuda" ]]; then
        echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections
        apt-get update
        apt-get install --no-install-recommends -y \
            curl \
            graphviz \
            libxau6 \
            libxext6 \
            libxrender1 \
            lsb-release \
            software-properties-common
        if [[ $COMPILER == "clang" ]]; then
            apt-get install --no-install-recommends -y \
                clang \
                libomp-dev
        fi
        curl -sL https://apt.kitware.com/keys/kitware-archive-latest.asc | apt-key add -
        apt-add-repository "deb https://apt.kitware.com/ubuntu/ $(lsb_release -cs) main" -y
        apt-get update
        apt-get install --no-install-recommends -y \
            cmake
    else
  • issues for nvidia/cuda images: https://gitlab.com/nvidia/container-images/cuda/-/issues
  • issues for cmake: https://gitlab.kitware.com/cmake/cmake/-/issues
@jameslamb
Collaborator Author

It's hard to tell if this is still an issue by looking at more recent CI jobs, since they're now failing before trying to install cmake, due to #4645.

But I can see at https://ngc.nvidia.com/catalog/containers/nvidia:cuda/tags that none of the nvidia/cuda images have been updated since September 2021.

@jameslamb
Collaborator Author

I was able to reproduce this in docker locally.

docker run -it nvcr.io/nvidia/cuda:9.0-devel /bin/bash

echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections
apt-get update
apt-get install --no-install-recommends -y \
    curl \
    lsb-release \
    software-properties-common

curl \
    -s \
    -L \
    --insecure \
    https://apt.kitware.com/keys/kitware-archive-latest.asc \
| apt-key add -

#curl -sL https://apt.kitware.com/keys/kitware-archive-latest.asc | apt-key add -

apt-add-repository "deb https://apt.kitware.com/ubuntu/ $(lsb_release -cs) main" -y
apt-get update
apt-get install --no-install-recommends -y \
    cmake

The second apt-get update produces the following errors:

Err:13 https://apt.kitware.com/ubuntu xenial/main amd64 Packages
server certificate verification failed. CAfile: /etc/ssl/certs/ca-certificates.crt CRLfile: none
Ign:14 https://apt.kitware.com/ubuntu xenial/main all Packages
Reading package lists... Done
W: The repository 'https://apt.kitware.com/ubuntu xenial Release' does not have a Release file.
N: Data from such a repository can't be authenticated and is therefore potentially dangerous to use.
N: See apt-secure(8) manpage for repository creation and user configuration details.
E: Failed to fetch https://apt.kitware.com/ubuntu/dists/xenial/main/binary-amd64/Packages server certificate verification failed. CAfile: /etc/ssl/certs/ca-certificates.crt CRLfile: none
E: Some index files failed to download. They have been ignored, or old ones used instead.

cmake still gets installed, but evidently from a different repository than Kitware's, because the installed version is very old.

cmake --version

# cmake version 3.5.1

v3.5.1 was released in March 2016.
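Note that `--insecure` in the reproduction above only affects curl. apt performs its own TLS verification, which is why `apt-get update` still fails. As a hedged sketch, apt's HTTPS transport supports per-host `Verify-Peer` and `Verify-Host` options (see apt-transport-https(1)), so verification could be switched off for a single host. The helper name and the file name in the example are hypothetical, and this is not necessarily how any workaround in this project was implemented.

```shell
#!/bin/sh
# Print an apt configuration fragment that disables TLS certificate checks for ONE host.
# This weakens transport security for that host, so treat it as a temporary measure.
apt_conf_skip_tls_verify() {
    host="$1"
    # Do not verify the server certificate against the trusted CA store.
    printf 'Acquire::https::%s::Verify-Peer "false";\n' "$host"
    # Do not verify that the certificate matches the host name.
    printf 'Acquire::https::%s::Verify-Host "false";\n' "$host"
}

# Example (as root); the file name below is hypothetical:
#   apt_conf_skip_tls_verify apt.kitware.com > /etc/apt/apt.conf.d/99kitware-skip-tls-verify
#   apt-get update
```

Scoping the override to one host keeps certificate checks in place for every other repository.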

@jameslamb
Collaborator Author

I don't think the issue is with kitware's apt package channel, and now I'm more convinced that it is about outdated certificates in the NVIDIA images.

Following the instructions at https://apt.kitware.com/, I'm able to successfully install cmake 3.20.5 in an ubuntu:16.04 image.

docker run -it ubuntu:16.04 /bin/bash

apt-get update
apt-get install apt-transport-https wget

wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null \
| gpg --dearmor - \
| tee /usr/share/keyrings/kitware-archive-keyring.gpg >/dev/null

echo 'deb [signed-by=/usr/share/keyrings/kitware-archive-keyring.gpg] https://apt.kitware.com/ubuntu/ xenial main' \
| tee /etc/apt/sources.list.d/kitware.list >/dev/null

apt-get update
apt-get install -y --no-install-recommends \
    cmake

cmake --version

# cmake version 3.20.5
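To compare the trust stores of the two images, something like the following sketch could be run in each container. It assumes the `openssl` command-line tool is installed and that individual CA certificates are available as `*.pem` files in a directory such as `/etc/ssl/certs` (the usual layout on Ubuntu). The `list_expired_certs` helper is hypothetical. Finding an expired root this way would only be a hint, not proof of the cause.

```shell
#!/bin/sh
# Print the path of every *.pem certificate in directory $1 that is no longer valid.
# Files that openssl cannot parse are reported as well.
list_expired_certs() {
    dir="$1"
    for cert in "$dir"/*.pem; do
        # Skip the unmatched glob pattern and anything that is not a file.
        [ -f "$cert" ] || continue
        # `-checkend 0` exits non-zero when the certificate has already expired.
        if ! openssl x509 -checkend 0 -noout -in "$cert" >/dev/null 2>&1; then
            echo "$cert"
        fi
    done
}

# Example:
#   list_expired_certs /etc/ssl/certs
```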

@jameslamb
Collaborator Author

@StrikerRUS I've opened an issue with NVIDIA documenting the challenges we faced.

https://gitlab.com/nvidia/container-images/cuda/-/issues/140

StrikerRUS added a commit that referenced this issue Jan 22, 2022
StrikerRUS added a commit that referenced this issue Jan 23, 2022
* Revert "[ci] ignore certificates for kitware apt channel in CUDA jobs (fixes #4646) (#4648)"

This reverts commit 10e0edc.

* update cuda at CI
@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot removed the blocking label Aug 23, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023