Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Build RL8+OFED image in CI #427

Merged
merged 18 commits into from
Sep 6, 2024
Merged

Build RL8+OFED image in CI #427

merged 18 commits into from
Sep 6, 2024

Conversation

MoteHue
Copy link
Contributor

@MoteHue MoteHue commented Aug 14, 2024

Uses OFED for both RockyLinux 8 and 9 image builds.

  • Fixes fact gathering in the ofed role; facts were only gathered once due to smart gathering which happens before updating host packages. As such, the ofed_distro_version was e.g. 8.9 due to the base image despite the dnf update step having upgraded the distro to something later.
  • Changes both RL8 and RL9 to LTS version of OFED (now this is supported for RL9).
  • Fixes bug where package update wouldn't correctly trigger a reboot.
  • Image build workflow now always runs R8+OFED, R9+OFED, RL9+OFED+CUDA versions.
  • Test workflow always runs on RL8+OFED and RL9+OFED images.
  • Changes skeleton, caas and stackhpc ansible.cfg to fail if any inventory can't be parsed.

@MoteHue MoteHue requested a review from sjpb August 14, 2024 14:33
@MoteHue MoteHue changed the title Rl8 ofed fixes RL8 OFED fixes Aug 14, 2024
@MoteHue
Copy link
Contributor Author

MoteHue commented Aug 14, 2024

Might want to switch to forcing a reboot. /var/run/reboot-required isn't populated, despite a slightly newer kernel getting installed:

[rocky@openhpc-ofed-rl8-240814-1513-eeb1f9fc ~]$ dnf list --installed kernel
Installed Packages
kernel.x86_64                                                                                                                                         4.18.0-553.el8_10                                                                                                                                              @anaconda
kernel.x86_64                                                                                                                                         4.18.0-553.16.1.el8_10                                                                                                                                         @baseos  

@jovial
Copy link
Collaborator

jovial commented Aug 15, 2024

Might want to switch to forcing a reboot. /var/run/reboot-required isn't populated, despite a slightly newer kernel getting installed:

[rocky@openhpc-ofed-rl8-240814-1513-eeb1f9fc ~]$ dnf list --installed kernel
Installed Packages
kernel.x86_64                                                                                                                                         4.18.0-553.el8_10                                                                                                                                              @anaconda
kernel.x86_64                                                                                                                                         4.18.0-553.16.1.el8_10                                                                                                                                         @baseos  

Might want to switch to forcing a reboot. /var/run/reboot-required isn't populated, despite a slightly newer kernel getting installed:

[rocky@openhpc-ofed-rl8-240814-1513-eeb1f9fc ~]$ dnf list --installed kernel
Installed Packages
kernel.x86_64                                                                                                                                         4.18.0-553.el8_10                                                                                                                                              @anaconda
kernel.x86_64                                                                                                                                         4.18.0-553.16.1.el8_10                                                                                                                                         @baseos  

If I understood your point correctly, you were suggesting that why might want to reboot prior to building ofed? I don't think this is necessary as we can compile the kernel modules for a kernel we are not running (as long as we have the kernel headers). The new kernel and modules will be loaded when we boot the new image.

Or is this during the ansible run of site.yml afterwards?

@sjpb
Copy link
Collaborator

sjpb commented Aug 16, 2024

necessary as we can compile the kernel modules for a kernel we are not running (as long as we have the kernel headers). The new kernel and modules will be loaded when we boot the new image.

Hmm, possibly, but to me that seems too risky/confusing. Especially given we may be e.g. installing CUDA afterwards which has kernel-* and ofed dependencies. We have to reboot on selinux changes anyway, so we are expecting a reboot somewhere during the build process. We just don't want to do it twice, if possible.

I think doing this is appropriate:

  1. rebooting if the update step has changed, rather than just if the reboot sentinel file exists (i.e. don't trust latter as it's been wrong at least once, probably a packaging bug/oversight I guess)
  2. always dumping facts (b/c meta: can't be conditional)
  3. always rerunning setup (b/c 2.)

@sjpb
Copy link
Collaborator

sjpb commented Aug 16, 2024

Actually there's another reason: boostrap.yml runs both from site.yml and from fatimage.yml, which is good (as it reduces code duplication). So really, if we need a reboot we should do it, rather than having to determine we're running in a build and therefore can ignore that as we will get a "reboot" when creating an instance with the image ...

Make sense?

@jovial
Copy link
Collaborator

jovial commented Aug 16, 2024

Actually there's another reason: boostrap.yml runs both from site.yml and from fatimage.yml, which is good (as it reduces code duplication). So really, if we need a reboot we should do it, rather than having to determine we're running in a build and therefore can ignore that as we will get a "reboot" when creating an instance with the image ...

Make sense?

Yep, I missed that you run this against live nodes too. That makes more sense. Isn't /var/run/reboot-required a Debian/Ubuntuism? Redhat seem to suggest using: https://access.redhat.com/solutions/27943

@MoteHue
Copy link
Contributor Author

MoteHue commented Aug 20, 2024

Isn't /var/run/reboot-required a Debian/Ubuntuism?

Looks like you're right, explains why it wasn't working then :P
I'll get this to always reboot, as per Steve's reasoning.

4.18.0-553.16.1.el8_9.x86_64
4.18.0-553.el8_9.x86_64
These would fail with the error:

'<' not supported between instances of 'str' and 'int'.

as the community.general.version_sort was trying to compare the `el8_9` of the latter with the `16` of the former.

Strip the last two chunks so we just compare numbers.
@MoteHue MoteHue marked this pull request as ready for review August 21, 2024 12:43
@MoteHue MoteHue requested a review from a team as a code owner August 21, 2024 12:43
jovial
jovial previously approved these changes Aug 21, 2024
Copy link
Collaborator

@jovial jovial left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to match what steve specified to me 👍

@sjpb
Copy link
Collaborator

sjpb commented Sep 4, 2024

@MoteHue @jovial this has been approved but the first comment in the PR makes it look like this is draft, can someone clarify/set status/comments appropriately please?

@MoteHue
Copy link
Contributor Author

MoteHue commented Sep 4, 2024

@MoteHue @jovial this has been approved but the first comment in the PR makes it look like this is draft, can someone clarify/set status/comments appropriately please?

It's no longer a draft

@sjpb sjpb force-pushed the rl8-ofed-fixes branch 2 times, most recently from 370deda to 634434b Compare September 6, 2024 09:38
@sjpb
Copy link
Collaborator

sjpb commented Sep 6, 2024

Checked at 53baab2 that a reboot didn't happen after the CI reimage - i.e the conditional on the reboot appears to be working OK again.

@sjpb
Copy link
Collaborator

sjpb commented Sep 6, 2024

@sjpb
Copy link
Collaborator

sjpb commented Sep 6, 2024

Checked reboot did happen during image build above

@github-advanced-security
Copy link

This pull request sets up GitHub code scanning for this repository. Once the scans have completed and the checks have passed, the analysis results for this pull request branch will appear on this overview. Once you merge this pull request, the 'Security' tab will show more code scanning analysis results (for example, for the default branch). Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results. For more information about GitHub code scanning, check out the documentation.

@sjpb sjpb changed the title RL8 OFED fixes Build RL8+OFED image in CI Sep 6, 2024
@sjpb sjpb requested a review from jovial September 6, 2024 14:15
ansible/bootstrap.yml Outdated Show resolved Hide resolved
Copy link
Collaborator

@jovial jovial left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic looks good to me

@jovial jovial merged commit 9dc9992 into main Sep 6, 2024
2 checks passed
@jovial jovial deleted the rl8-ofed-fixes branch September 6, 2024 16:45
MaxBed4d pushed a commit that referenced this pull request Oct 15, 2024
* Check major version for RL8 package installs

* Gather facts on ofed role

* Support kernel checks with mismatching version length

4.18.0-553.16.1.el8_9.x86_64
4.18.0-553.el8_9.x86_64
These would fail with the error:

'<' not supported between instances of 'str' and 'int'.

as the community.general.version_sort was trying to compare the `el8_9` of the latter with the `16` of the former.

Strip the last two chunks so we just compare numbers.

* Move to LTS version now RL9.4 is supported

* Fail when any inventory source cannot be parsed

* Always reboot after selinux and package updates

* Cleat facts before OFED so install will match newest kernel

* Clear facts after reboot so OFED install will match newest kernel

* fail caas and stackhpc if any inventory can't be read

* make reboot conditional on package or SELinux changes again

* include OFED in both RL8 and RL9 builds

* always run CI tests on RL8 and RL9

* allow concurrent RL8/RL9 CI tests

* mark pending reboot check as not a change

* fix workflow matrix definitions

* bump CI images - now both OFED

* use reboot hint for checking reboot required

---------

Co-authored-by: Steve Brasier <steveb@stackhpc.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants