
Rocky 9.4 Base OS confluent/Slurm Edition for Linux #2002

Merged
1 commit merged into openhpc:3.x on Oct 17, 2024

Conversation

@tkucherera-lenovo
Contributor

This is a recipe that uses confluent for cluster provisioning.

Assumptions

  1. DNS is set up
  2. There is at least one SSH key on the SMS (the key is used for passwordless login on the nodes)

Note

  1. The makerepo.sh script did not check for Rocky Linux, so I modified it to check whether the OS is Rocky (a sketch of such a check is shown below).
  2. The OpenHPC and EPEL repository directories live in /var/lib/confluent/public on the SMS; on the compute nodes they are reachable via the web root confluent-public.
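
For illustration, the kind of OS check described in note 1 might look something like this (a hedged sketch only; the actual makerepo.sh logic may differ):

# hedged sketch of a Rocky Linux check for makerepo.sh
. /etc/os-release
if [ "${ID}" != "rocky" ]; then
    echo "This script currently expects Rocky Linux, found: ${ID}" >&2
    exit 1
fi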

@adrianreber
Member

Thanks, this is great. I will try it out on our CI systems.

@adrianreber
Member

1. DNS is set up

We have /etc/hosts. I hope that is enough.

2. There is at least one SSH key on the SMS (the key is used for passwordless login on the nodes)

That is also needed for all other recipes. So, no problem.

@adrianreber
Member

The resulting RPMs can be found in the GitHub Actions for the next 24 hours.

github-actions bot commented Aug 7, 2024

Test Results

18 files (-6)    18 suites (-6)    27s ⏱️ (-1s)
53 tests (-10)   49 ✅ (-3)   4 💤 (-7)    0 ❌ (±0)
66 runs (-20)    62 ✅ (-6)   4 💤 (-14)   0 ❌ (±0)

Results for commit 3abbea1. ± Comparison against base commit 611b01f.

This pull request removes 10 tests.
conman ‑ [ConMan] Verify conman binary available
conman ‑ [ConMan] Verify man page availability
conman ‑ [ConMan] Verify rpm version matches binary
ipmitool ‑ [OOB] ipmitool exists
ipmitool ‑ [OOB] ipmitool local bmc ping
ipmitool ‑ [OOB] ipmitool power status
ipmitool ‑ [OOB] ipmitool read CPU1 sensor data
ipmitool ‑ [OOB] ipmitool read sel log
ipmitool ‑ [OOB] istat exists
warewulf-ipmi ‑ [warewulf-ipmi] ipmitool lanplus protocol

♻️ This comment has been updated with latest results.

@tkucherera-lenovo
Contributor Author

Yes, /etc/hosts should be enough.

\input{common/install_ohpc_components_intro}

\subsection{Enable \OHPC{} repository for local use} \label{sec:enable_repo}
\input{common/enable_local_ohpc_repo_confluent}
Member

I am not aware of the history behind this line from the xCAT recipe. In all other recipes we enable the OpenHPC repository by installing the OpenHPC release RPM, which enables a dnf repository pointing at the OpenHPC repository server. Hardcoding the download of the repository tar files feels unnecessary, especially as we do not do it at all in any of our current testing. Please try to work with the online repository if that works for you.

If you need it for your testing we should put it behind some variable, so that it can be disabled.

Is this strictly necessary for you or can you work with the online repository?

Contributor Author

Sure, noted. I will look into working with the online repo.

\subsubsection{Build initial BOS image} \label{sec:assemble_bos}
The following steps illustrate the process to build a minimal, default image for use with \Confluent{}. To begin, you will
first need to have a local copy of the ISO image available for the underlying OS. In this recipe, the relevant ISO image
is \texttt{Rocky-9.4-x86\_64-dvd1.iso} (available from the Rocky
Member

The image I downloaded does not have a "1" in the file name. The filename should be a variable so that it can be easily updated.

@adrianreber
Member

The main point that is currently not clear to me is whether Confluent comes with a DHCP server. I ran the script a couple of times and the two compute nodes were always waiting for DHCP answers in the PXE boot step of the firmware.

@adrianreber
Member

@tkucherera-lenovo Should I try again? Is there now a DHCP server configured, somehow?

For the final merge you can squash the commits. For the main repository it makes no sense to keep your development history with fixups. If you want, you can do separate commits for the docs/ part and the components/ part; that would make sense to me.

Please also add a Signed-off-by to your commit messages as described in https://github.com/openhpc/ohpc/blob/3.x/CONTRIBUTING.md. git commit -s usually does that automatically.
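
(For example, as standard git usage, the sign-off can also be added to an existing commit; a sketch, not specific to this repository:)

# add a Signed-off-by trailer to the most recent commit without changing its message
git commit --amend -s --no-edit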

@tkucherera-lenovo
Contributor Author

Yes, you can try again. Confluent does have its own DHCP server, and by default it will respond to DHCP requests. If an environment has its own DHCP server, it is possible to configure Confluent not to respond to DHCP requests. In this case, though, I believe there was a bug where the setting for allowing deployment via PXE was not being applied because the required variable was missing from the input.local file; I have added a fix for that now.

Going forward I will squash all commits and also add the Signed-off-by to commits.

@adrianreber
Member

@tkucherera-lenovo Is there an easy way to reset the host machine without reinstalling? Where does confluent store its state? Is there a directory I can delete to start from scratch?

@tkucherera-lenovo
Contributor Author

The state is stored in /etc/confluent/*, so stop confluent and run rm -rf /etc/confluent/*. I would also recommend removing the OS profile under the /var/lib/confluent/public/os directory.
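
For reference, a minimal sketch of that reset sequence (the confluent service name and the profile directory below are assumptions; adjust to your setup):

systemctl stop confluent
rm -rf /etc/confluent/*
rm -rf /var/lib/confluent/public/os/rocky-9.4-x86_64-default   # assumed profile name
systemctl start confluent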

@adrianreber
Member

Now I see that the compute nodes are trying to boot:

==> audit <==
Aug 10 10:43:07 {"operation": "update", "target": "/noderange/compute/boot/nextdevice", "allowed": true}
Aug 10 10:43:12 {"operation": "update", "target": "/noderange/compute/power/state", "allowed": true}

==> events <==
Aug 10 10:46:16 {"info": "Offering PXE boot with static address 10.241.58.133 to c2"}
Aug 10 10:46:18 {"info": "Offering PXE boot with static address 10.241.58.132 to c1"}
Aug 10 10:46:25 {"info": "Offering PXE boot with static address 10.241.58.133 to c2"}
Aug 10 10:46:28 {"info": "Offering PXE boot with static address 10.241.58.132 to c1"}

==> /var/log/httpd/access_log <==
10.241.58.133 - - [10/Aug/2024:10:46:32 +0000] "GET /confluent-public/os/rocky-9.4-x86_64-default/boot.ipxe HTTP/1.1" 200 227 "-" "iPXE/1.21.1 (g988d2)"
10.241.58.133 - - [10/Aug/2024:10:46:32 +0000] "GET /confluent-public/os/rocky-9.4-x86_64-default/boot/kernel HTTP/1.1" 200 13605704 "-" "iPXE/1.21.1 (g988d2)"
10.241.58.133 - - [10/Aug/2024:10:46:32 +0000] "GET /confluent-public/os/rocky-9.4-x86_64-default/boot/initramfs/addons.cpio HTTP/1.1" 200 97792 "-" "iPXE/1.21.1 (g988d2)"
10.241.58.133 - - [10/Aug/2024:10:46:32 +0000] "GET /confluent-public/os/rocky-9.4-x86_64-default/boot/initramfs/site.cpio HTTP/1.1" 200 3072 "-" "iPXE/1.21.1 (g988d2)"
10.241.58.133 - - [10/Aug/2024:10:46:32 +0000] "GET /confluent-public/os/rocky-9.4-x86_64-default/boot/initramfs/distribution HTTP/1.1" 200 106800744 "-" "iPXE/1.21.1 (g988d2)"
10.241.58.132 - - [10/Aug/2024:10:46:34 +0000] "GET /confluent-public/os/rocky-9.4-x86_64-default/boot.ipxe HTTP/1.1" 200 227 "-" "iPXE/1.21.1 (g988d2)"
10.241.58.132 - - [10/Aug/2024:10:46:34 +0000] "GET /confluent-public/os/rocky-9.4-x86_64-default/boot/kernel HTTP/1.1" 200 13605704 "-" "iPXE/1.21.1 (g988d2)"
10.241.58.132 - - [10/Aug/2024:10:46:35 +0000] "GET /confluent-public/os/rocky-9.4-x86_64-default/boot/initramfs/addons.cpio HTTP/1.1" 200 97792 "-" "iPXE/1.21.1 (g988d2)"
10.241.58.132 - - [10/Aug/2024:10:46:35 +0000] "GET /confluent-public/os/rocky-9.4-x86_64-default/boot/initramfs/site.cpio HTTP/1.1" 200 3072 "-" "iPXE/1.21.1 (g988d2)"
10.241.58.132 - - [10/Aug/2024:10:46:35 +0000] "GET /confluent-public/os/rocky-9.4-x86_64-default/boot/initramfs/distribution HTTP/1.1" 200 106800744 "-" "iPXE/1.21.1 (g988d2)"

But after that nothing seems to happen. On the console I see:

[screenshot of the compute node console]

Any recommendations how to continue?

Also, there seems to be no point during the installation where the script waits for the compute nodes to be ready, so most commands run while the compute nodes are not available. All the customization fails with:

+ nodeshell compute echo '"10.241.58.134:/home' /home nfs nfsvers=3,nodev,nosuid 0 '0"' '>>' /etc/fstab
c1: ssh: connect to host c1 port 22: No route to host
c2: ssh: connect to host c2 port 22: No route to host

@adrianreber
Member

Now the installation is working, but it fails in the post-installation scripts. I see the following error on the server:

Aug 11 09:04:06 Traceback (most recent call last):
  File "/opt/confluent/lib/python/confluent/syncfiles.py", line 193, in sync_list_to_node
    sshutil.prep_ssh_key('/etc/confluent/ssh/automation')
  File "/opt/confluent/lib/python/confluent/sshutil.py", line 139, in prep_ssh_key
    subprocess.check_output(['ssh-add', keyname], stdin=devnull, stderr=devnull)
  File "/usr/lib64/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib64/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ssh-add', '/etc/confluent/ssh/automation']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/confluent/lib/python/confluent/httpapi.py", line 612, in resourcehandler
    for rsp in resourcehandler_backend(env, start_response):
  File "/opt/confluent/lib/python/confluent/httpapi.py", line 635, in resourcehandler_backend
    for res in selfservice.handle_request(env, start_response):
  File "/opt/confluent/lib/python/confluent/selfservice.py", line 520, in handle_request
    result = syncfiles.start_syncfiles(
  File "/opt/confluent/lib/python/confluent/syncfiles.py", line 321, in start_syncfiles
    syncrunners[nodename].wait()
  File "/usr/lib/python3.9/site-packages/eventlet/greenthread.py", line 181, in wait
    return self._exit_event.wait()
  File "/usr/lib/python3.9/site-packages/eventlet/event.py", line 132, in wait
    current.throw(*self._exc)
  File "/usr/lib/python3.9/site-packages/eventlet/greenthread.py", line 221, in main
    result = function(*args, **kwargs)
  File "/opt/confluent/lib/python/confluent/syncfiles.py", line 215, in sync_list_to_node
    raise Exception("Syncing failed due to unreadable files: " + ','.join(unreadablefiles))
Exception: Syncing failed due to unreadable files: /tmp/tmpf19vlfb1/etc/shadow
Aug 11 09:04:07 Traceback (most recent call last):
  File "/opt/confluent/lib/python/confluent/syncfiles.py", line 193, in sync_list_to_node
    sshutil.prep_ssh_key('/etc/confluent/ssh/automation')
  File "/opt/confluent/lib/python/confluent/sshutil.py", line 139, in prep_ssh_key
    subprocess.check_output(['ssh-add', keyname], stdin=devnull, stderr=devnull)
  File "/usr/lib64/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib64/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ssh-add', '/etc/confluent/ssh/automation']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/confluent/lib/python/confluent/httpapi.py", line 612, in resourcehandler
    for rsp in resourcehandler_backend(env, start_response):
  File "/opt/confluent/lib/python/confluent/httpapi.py", line 635, in resourcehandler_backend
    for res in selfservice.handle_request(env, start_response):
  File "/opt/confluent/lib/python/confluent/selfservice.py", line 520, in handle_request
    result = syncfiles.start_syncfiles(
  File "/opt/confluent/lib/python/confluent/syncfiles.py", line 321, in start_syncfiles
    syncrunners[nodename].wait()
  File "/usr/lib/python3.9/site-packages/eventlet/greenthread.py", line 181, in wait
    return self._exit_event.wait()
  File "/usr/lib/python3.9/site-packages/eventlet/event.py", line 132, in wait
    current.throw(*self._exc)
  File "/usr/lib/python3.9/site-packages/eventlet/greenthread.py", line 221, in main
    result = function(*args, **kwargs)
  File "/opt/confluent/lib/python/confluent/syncfiles.py", line 215, in sync_list_to_node
    raise Exception("Syncing failed due to unreadable files: " + ','.join(unreadablefiles))
Exception: Syncing failed due to unreadable files: /tmp/tmp2qhhryvt/etc/shadow

@tkucherera-lenovo
Contributor Author

Hi Adrian, I don't know what state the management server and cluster are in, but usually the error I am seeing happens when the automation SSH key is missing from the /etc/confluent/ssh directory. This key should have been created during the osdeploy initialize step. The input.local file should have an initialize_options variable with the value usklpta, where the a option creates the key in question.

Additionally, just to help me with debugging, if you run the command:

  1. confluent_selfcheck -n <nodename>

the output is sometimes helpful. Thanks.
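
For context, a hedged sketch of the input.local setting and the initialize step described above (exactly how the recipe invokes osdeploy is an assumption):

# in input.local; the 'a' option is the one that creates /etc/confluent/ssh/automation
initialize_options="${initialize_options:-usklpta}"
# during SMS setup, something along these lines is run:
osdeploy initialize -${initialize_options}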

[sms](*\#*) mkdir -p $epel_repo_dir_confluent
[sms](*\#*) (*\install*) dnf-plugins-core createrepo
# Download required EPEL packages
[sms](*\#*) dnf download --destdir $epel_repo_dir_confluent fping libconfuse libunwind
Member

This seems strange; why don't we just enable EPEL on the compute nodes?

@adrianreber
Member

Hi Adrian, I don't know what state the management server and cluster are in, but usually the error I am seeing happens when the automation SSH key is missing from the /etc/confluent/ssh directory. This key should have been created during the osdeploy initialize step. The input.local file should have an initialize_options variable with the value usklpta, where the a option creates the key in question.

I just copied usklpt without the a. Retrying with the additional a now.

@adrianreber
Member

Now the compute nodes are provisioned, but I cannot log in:

# confluent_selfcheck -n c1
OS Deployment: Initialized
Confluent UUID: Consistent
Web Server: Running
Web Certificate: OK
Checking web download: Failed to download /confluent-public/site/confluent_uuid
Checking web API access: Failed access, if selinux is enabled, `setsebool -P httpd_can_network_connect=1`, otherwise check web proxy configuration
TFTP Status: OK
SSH root user public key: OK
Checking SSH Certificate authority: OK
Checking confluent SSH automation key: OK
Checking for blocked insecure boot: OK
Checking IPv6 enablement: OK
Performing node checks for 'c1'
Checking node attributes in confluent...
Checking network configuration for c1
c1 appears to have network configuration suitable for IPv4 deployment via: ens2f0
No issues detected with attributes of c1
Checking name resolution: OK

With Warewulf 3 provisioning, the SSH keys from /root/.ssh automatically end up on the compute nodes and SSH works. Can Confluent also use one of those existing keys and add it to the compute nodes?

Also, the current recipe does not wait until the compute nodes are provisioned. It immediately continues, and all commands like nodeshell fail because the provisioning is not finished.

@adrianreber
Member

Ah, so the problem is that I have SSH keys in different formats and the last one in the list uses an unsupported algorithm.

In /opt/confluent/lib/python/confluent/sshutil.py all SSH keys are copied to the provisioning image, but instead of overwriting the previous key it would probably make more sense to append all keys.

Following code change seems to work for me:

--- /opt/confluent/lib/python/confluent/sshutil.py	2023-11-15 16:30:46.000000000 +0000
+++ /opt/confluent/lib/python/confluent/sshutil.py.new	2024-08-12 09:10:48.601474767 +0000
@@ -214,10 +214,14 @@
     else:
         suffix = 'rootpubkey'
     for auth in authorized:
-        shutil.copy(
-            auth,
+        local_key = open(auth, 'r')
+        dest = open(
             '/var/lib/confluent/public/site/ssh/{0}.{1}'.format(
-                    myname, suffix))
+                    myname, suffix), 'a')
+        dest.write(local_key.read())
+    if os.path.exists(
+            '/var/lib/confluent/public/site/ssh/{0}.{1}'.format(
+                myname, suffix)):
         os.chmod('/var/lib/confluent/public/site/ssh/{0}.{1}'.format(
                 myname, suffix), 0o644)
         os.chown('/var/lib/confluent/public/site/ssh/{0}.{1}'.format(

Instead of copying all the files and overwriting everything with the last file, this appends all public keys.

@adrianreber
Member

Now SSH works, but provisioning fails again:

Traceback (most recent call last):
  File "/opt/confluent/lib/python/confluent/httpapi.py", line 612, in resourcehandler
    for rsp in resourcehandler_backend(env, start_response):
  File "/opt/confluent/lib/python/confluent/httpapi.py", line 635, in resourcehandler_backend
    for res in selfservice.handle_request(env, start_response):
  File "/opt/confluent/lib/python/confluent/selfservice.py", line 526, in handle_request
    status, output = syncfiles.get_syncresult(nodename)
  File "/opt/confluent/lib/python/confluent/syncfiles.py", line 356, in get_syncresult
    result = syncrunners[nodename].wait()
  File "/usr/lib/python3.9/site-packages/eventlet/greenthread.py", line 181, in wait
    return self._exit_event.wait()
  File "/usr/lib/python3.9/site-packages/eventlet/event.py", line 132, in wait
    current.throw(*self._exc)
  File "/usr/lib/python3.9/site-packages/eventlet/greenthread.py", line 221, in main
    result = function(*args, **kwargs)
  File "/opt/confluent/lib/python/confluent/syncfiles.py", line 215, in sync_list_to_node
    raise Exception("Syncing failed due to unreadable files: " + ','.join(unreadablefiles))
Exception: Syncing failed due to unreadable files: /tmp/tmp9t4o6x20/etc/shadow
Aug 12 10:48:09 Traceback (most recent call last):
  File "/opt/confluent/lib/python/confluent/syncfiles.py", line 197, in sync_list_to_node
    output, stderr = util.run(
  File "/opt/confluent/lib/python/confluent/util.py", line 48, in run
    raise subprocess.CalledProcessError(retcode, process.args, output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['rsync', '-rvLD', '/tmp/tmpszubn5dq.synctoc2/', 'root@[10.241.58.133]:/']' returned non-zero exit status 23.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/confluent/lib/python/confluent/httpapi.py", line 612, in resourcehandler
    for rsp in resourcehandler_backend(env, start_response):
  File "/opt/confluent/lib/python/confluent/httpapi.py", line 635, in resourcehandler_backend
    for res in selfservice.handle_request(env, start_response):
  File "/opt/confluent/lib/python/confluent/selfservice.py", line 526, in handle_request
    status, output = syncfiles.get_syncresult(nodename)
  File "/opt/confluent/lib/python/confluent/syncfiles.py", line 356, in get_syncresult
    result = syncrunners[nodename].wait()
  File "/usr/lib/python3.9/site-packages/eventlet/greenthread.py", line 181, in wait
    return self._exit_event.wait()
  File "/usr/lib/python3.9/site-packages/eventlet/event.py", line 132, in wait
    current.throw(*self._exc)
  File "/usr/lib/python3.9/site-packages/eventlet/greenthread.py", line 221, in main
    result = function(*args, **kwargs)
  File "/opt/confluent/lib/python/confluent/syncfiles.py", line 215, in sync_list_to_node
    raise Exception("Syncing failed due to unreadable files: " + ','.join(unreadablefiles))
Exception: Syncing failed due to unreadable files: /tmp/tmpw077yd0x/etc/shadow

It kind of makes sense, because /tmp/tmpw077yd0x/etc/shadow is indeed mode 000, but I am not sure what is going on; running the same rsync command as root works without errors.

Currently I am again stuck in provisioning:

# nodedeploy compute
c1: pending: rocky-9.4-x86_64-default
c2: pending: rocky-9.4-x86_64-default
# confluent_selfcheck -n c1
OS Deployment: Initialized
Confluent UUID: Consistent
Web Server: Running
Web Certificate: OK
Checking web download: Failed to download /confluent-public/site/confluent_uuid
Checking web API access: Failed access, if selinux is enabled, `setsebool -P httpd_can_network_connect=1`, otherwise check web proxy configuration
TFTP Status: OK
SSH root user public key: OK
Checking SSH Certificate authority: OK
Checking confluent SSH automation key: OK
Checking for blocked insecure boot: OK
Checking IPv6 enablement: OK
Performing node checks for 'c1'
Checking node attributes in confluent...
Checking network configuration for c1
c1 appears to have network configuration suitable for IPv4 deployment via: ens2f0
No issues detected with attributes of c1
Checking name resolution: OK

@jjohnson42

Following code change seems to work for me:

Pull request is welcome for that one. It has come up but we didn't quite get around to appending keys when dealing with multiple /root/.ssh/*.pub keys. https://github.com/xcat2/confluent/pulls

@jjohnson42

On the /etc/shadow issue: this is a consequence of confluent not being allowed to run as root, so for files like /etc/shadow, if syncing them is desired, you would need a copy readable by the confluent user. As an option, we frequently support syncing /etc/passwd and 'stubbing out' shadow so that those accounts are password-disabled.

@adrianreber
Member

On the /etc/shadow issue: this is a consequence of confluent not being allowed to run as root, so for files like /etc/shadow, if syncing them is desired, you would need a copy readable by the confluent user. As an option, we frequently support syncing /etc/passwd and 'stubbing out' shadow so that those accounts are password-disabled.

How could this be best automated in a recipe like we are trying to build here? Any recommendations?

@jjohnson42

I'd probably offer some example choices:
- Use 'Merge' support of /etc/passwd, do not include shadow. This will produce 'password disabled' instances of the users from passwd, for SSH-key-based access only.
- Give confluent read access to /etc/shadow.
- Make a blessed /etc/shadow copy for confluent to distribute.
- Use a separate mechanism or invocation to push out /etc/shadow (e.g. nodersync manually run as the root user can do it).

I think we were imagining the first option, that sync targets aren't interested in the passwords.

Note that root password is a node attribute and can be set in the confluent db. The default is to disable root password unless specified. If set during deploy, it will get that root password into shadow (though before syncfiles run).
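
For illustration, the first option could look roughly like this in the profile's syncfiles list (a hedged sketch; the syncfiles location and the MERGE: section syntax should be verified against the confluent documentation):

cat >> /var/lib/confluent/public/os/rocky-9.4-x86_64-default/syncfiles <<'EOF'
/etc/hosts
MERGE:
/etc/passwd
/etc/group
EOF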

@adrianreber
Member

Following code change seems to work for me:

Pull request is welcome for that one. It has come up but we didn't quite get around to appending keys when dealing with multiple /root/.ssh/*.pub keys. https://github.com/xcat2/confluent/pulls

xcat2/confluent#159

@adrianreber
Member

I'd probably offer some example choices:

  • Use 'Merge' support of /etc/passwd, do not include shadow. This will produce 'password disabled' instances of the users from passwd, for ssh key based access only
  • Give confluent read access to /etc/shadow
  • Make a blessed /etc/shadow copy for confluent to distribute
  • Use a separate mechanism or invocation to push out /etc/shadow (e.g. nodersync manually run as the root user can do it).

I think we were imagining the first option, that sync targets aren't interested in the passwords.

Note that root password is a node attribute and can be set in the confluent db. The default is to disable root password unless specified. If set during deploy, it will get that root password into shadow (though before syncfiles run).

As this recipe is contributed by you (upstream Confluent), I would let you decide how to design and implement it, with the proper warnings in the documentation; whatever makes the most sense for you. If the recipe results in a working cluster we are happy to include it. Maybe merge support makes sense, as we hardly use passwords anyway (if at all), or the blessed copy. I would defer to you and your experience on what makes the most sense.

@adrianreber
Member

With a chmod 644 /etc/shadow I have a workaround. We should still have a proper solution in the recipe to handle /etc/shadow.

The following things need to be fixed at this point:

  • the recipe needs to wait until the compute nodes are ready
  • epel-release needs to be installed on the compute nodes
  • ohpc-release needs to be installed on the compute nodes

For warewulf we do:

export CHROOT=/opt/ohpc/admin/images/rocky9.3
wwmkchroot -v rocky-9 $CHROOT
dnf -y --installroot $CHROOT install epel-release
cp -p /etc/yum.repos.d/OpenHPC*.repo $CHROOT/etc/yum.repos.d

As confluent first does the installation and then changes the running compute node, this approach will not work.
For Rocky and AlmaLinux something like this will work:

# nodeshell compute dnf -y  install epel-release
# nodeshell compute dnf -y  install http://repos.openhpc.community/OpenHPC/3/EL_9/x86_64/ohpc-release-3-1.el9.x86_64.rpm

The following commands are unnecessary or do not work:

# nodeshell compute dnf -y  install ntp
# nodeshell compute dnf -y  install  --enablerepo=powertools lmod-ohpc #powertools does not exist, it is called crb and already enabled earlier
# nodeshell compute systemctl restart nfs
c1: Failed to restart nfs.service: Unit nfs.service not found.
c2: Failed to restart nfs.service: Unit nfs.service not found.

This is needed: nodeshell compute dnf -y install nfs-utils

The existing /etc/hosts from the SMS is not synced to the compute nodes.

Besides the items mentioned here we seem to be able to get a cluster with two compute nodes running.

The nice thing for OpenHPC is that with this recipe we would finally have a stateful provisioning recipe again.

When we used to have an xCAT stateful recipe, it was explicitly marked as stateful; I am not sure how you want to handle this. Do you want one recipe that can do either stateful or stateless? Or two recipes?

@jjohnson42

So, if I'm understanding correctly, we need to wait for nodedeploy to show:

 # nodedeploy r3u23
r3u23: completed: alma-9.4-x86_64-default
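
(For illustration, a shell loop along these lines could implement that wait in the recipe; a sketch only, assuming the nodedeploy output format shown above:)

# block until every node in the 'compute' group reports 'completed'
while nodedeploy compute | grep -qv ': completed:'; do
    sleep 10
done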

Changes to syncfiles to include:
/etc/hosts
/etc/yum.repos.d/OpenHPC*.repo

And in post.d, to install epel-release

For nfs-utils, we could add it to the pkglist, or add a 'dnf -y install nfs-utils' as a 'post.d' script.

For diskless, maybe a different recipe. It will be more 'warewulf' like, with 'imgutil build' and 'imgutil exec'. There's also been a suggestion to make the 'installimage' script work for those instead of just clones.

@adrianreber
Member

/etc/yum.repos.d/OpenHPC*.repo $CHROOT/etc/yum.repos.d

Either install the repo file, which requires also copying the keys, or install the ohpc-release RPM via dnf.

@jjohnson42

@adrianreber To go back, did you want to do a pull request for the ssh key handling change, or did you want it done on your behalf? I kind of like the idea of the pull request to keep it clear who did what, but can just work it from your comment if preferred.

@adrianreber
Member

@adrianreber To go back, did you want to do a pull request for the ssh key handling change, or did you want it done on your behalf? I kind of like the idea of the pull request to keep it clear who did what, but can just work it from your comment if preferred.

I already did at xcat2/confluent#159

@jjohnson42

Thanks, sorry for not noticing sooner. I accepted it and amended it just a tad (to empty out the file before writing, and to use 'with' to manage opening/closing the files).

@jjohnson42

@adrianreber FYI, confluent 3.11.0 has been released including your change for ssh pubkey handling.

@tkucherera-lenovo
Contributor Author

@adrianreber Since the compute nodes are provisioned without internet access, running commands like nodeshell compute dnf -y install http://repos.openhpc.community/OpenHPC/3/EL_9/x86_64/ohpc-release-3-1.el9.x86_64.rpm would fail. Do you advise that we set up a NAT gateway on the master node to give the computes access to the internet, or should we follow what the xCAT recipe was doing, which is setting up a local copy of the OpenHPC repo and then configuring a repo that the computes can reach via the web roots xCAT would have set up? See here:

# Add OpenHPC repo mirror hosted on SMS
[sms](*\#*) psh compute dnf config-manager --add-repo=http://$sms_ip/$ohpc_repo_dir/OpenHPC.local.repo
# Replace local path with SMS URL
[sms](*\#*) psh compute "perl -pi -e 's/file:\/\/\@PATH\@/http:\/\/$sms_ip\/"${ohpc_repo_dir//\//"\/"}"/s' \
        /etc/yum.repos.d/OpenHPC.local.repo"

@adrianreber
Member

@adrianreber Since the compute nodes are provisioned without internet access, running commands like nodeshell compute dnf -y install http://repos.openhpc.community/OpenHPC/3/EL_9/x86_64/ohpc-release-3-1.el9.x86_64.rpm would fail. Do you advise that we set up a NAT gateway on the master node to give the computes access to the internet, or should we follow what the xCAT recipe was doing, which is setting up a local copy of the OpenHPC repo and then configuring a repo that the computes can reach via the web roots xCAT would have set up? See here:

# Add OpenHPC repo mirror hosted on SMS
[sms](*\#*) psh compute dnf config-manager --add-repo=http://$sms_ip/$ohpc_repo_dir/OpenHPC.local.repo
# Replace local path with SMS URL
[sms](*\#*) psh compute "perl -pi -e 's/file:\/\/\@PATH\@/http:\/\/$sms_ip\/"${ohpc_repo_dir//\//"\/"}"/s' \
        /etc/yum.repos.d/OpenHPC.local.repo"

Hmm, I see. In our test setup all nodes have internet access that is why I didn't really think about it.

I would say we mention somewhere in the documentation that the nodes need internet access for all the steps, and leave it to the user to configure NAT or a proxy or whatever. That would be the easiest solution and would be acceptable for me. As we do not talk about network setup or about securing the nodes or the head node, that sounds acceptable to me.

What do you think?

For our testing we actually set up a proxy server to reduce re-downloading of RPMs, so even with internet access we already change the network setup slightly.

@tkucherera-lenovo
Contributor Author

Having the nodes set up to access the internet also works for me.

tkucherera-lenovo force-pushed the confluent_slurm branch 3 times, most recently from b48b5a8 to 10fe611 on September 18, 2024
@tkucherera-lenovo
Contributor Author

@adrianreber I have made some changes to incorporate the discussions above:

  1. adding epel-release and the OpenHPC repo to the nodes
  2. installing nfs-utils on the computes
  3. syncing /etc/hosts
  4. fixing documentation bugs

Note: the error you were getting with nfs.service not being found could be because NFS is not installed on the master node. According to section 1.2 of the OpenHPC install guide, NFS is hosted on the master node, but I do not see in the guides, Warewulf or xCAT, where it is installed. Is it assumed that it is already installed? Please advise.

@adrianreber
Member

So with the latest changes I am able to run a full test suite with no errors. I still have to make some minor changes.

The following changes are currently still necessary:

  • dns_servers needs to be set, but it does not seem to be part of docs/recipes/install/rocky9/input.local.template. Can you add it to make sure users set it?
  • dns_domain also needs to be set. There is already a variable from previous xCAT recipes called domain_name. Can this be reused?
  • The assumption seems to be that all traffic goes through the SMS: net.ipv4_gateway=${sms_ip}. Can confluent automatically pick up the default gateway of the SMS during provisioning? Or how can we set the default gateway without depending on the SMS IP? Maybe introduce a new variable?
  • Remove --enablerepo=powertools. The name has changed to crb. What I am currently doing is: sed -e "s,epel-release,epel-release; /usr/bin/crb enable,g" -i "${recipeFile}". This way, each time the epel-release package is installed, the CRB repository is enabled. Please use /usr/bin/crb enable; that seems to be the recommended way of doing it.
  • We switched from gnu13 to gnu14. Please update the recipe to install the gnu14 variant of all packages. Maybe this will be solved by rebasing your PR.
  • Not a change for this PR, but the way /etc/profile.d/confluent_env.sh extends the MANPATH always adds an additional : at the end when $MANPATH is empty. This breaks one of our tests, but it is nothing you really need to change.
  • warewulf adds lines to /etc/hosts which our testing expects. I see there is confluent2hosts, but I was not able to get it running. Basically, an entry for each compute node in the format 10.1.5.132 c1 c1.local would be nice to have. What is the right way to call confluent2hosts? This is a step that could be included in the recipe (a manual fallback is sketched after this list).
  • Is there a way to automatically have the update repository active during compute node installation? That way I could remove one additional reboot from our test scripts. Currently what happens is that the compute node is installed, and during the recipe you run dnf -y update on all nodes, but the new packages are not active. If this could be part of the installation, that would be helpful.
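
A manual fallback for those /etc/hosts entries could look like this (this is not the confluent2hosts invocation being asked about, just a hedged sketch; the net.ipv4_address attribute name and the nodeattrib output format are assumptions to verify):

for node in c1 c2; do
    ip=$(nodeattrib "$node" net.ipv4_address | awk '{print $NF}')
    echo "$ip $node $node.local" >> /etc/hosts
done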

@adrianreber
Member

Oh, and please squash your commits. For a new feature like this it would make sense to have it all in one commit without fixup commits.

@@ -203,7 +203,7 @@ \subsubsection{Add \OHPC{} components} \label{sec:add_components}
[sms](*\#*) (*\chrootinstall*) kernel

# Include modules user environment
[sms](*\#*) (*\chrootinstall*) --enablerepo=powertools lmod-ohpc
[sms](*\#*) (*\chrootinstall*) /usr/bin/crb enable
Member

Without trying it, this is now missing the installation of lmod-ohpc.

@@ -156,7 +156,7 @@ \subsection{Enable \OHPC{} repository for local use} \label{sec:enable_repo}

% begin_ohpc_run
\begin{lstlisting}[language=bash,keywords={},basicstyle=\fontencoding{T1}\fontsize{8.0}{10}\ttfamily,literate={ARCH}{\arch{}}1 {-}{-}1]
[sms](*\#*) (*\install*) epel-release
[sms](*\#*) (*\install*) epel-release
Member

The recommendation when installing epel-release is to run /usr/bin/crb enable as a second command. My recommendation would be to install epel-release on the SMS and on the compute nodes, and to run /usr/bin/crb enable on the SMS and compute nodes as well.

@adrianreber
Member

Looks like something with the LaTeX content is broken. You probably have to escape underscores in names like ipv4_address and ipv6_address (i.e. ipv4\_address in the LaTeX source).

@adrianreber
Member

Can you do another squash and avoid the merge commit? Something like:

$ git pull --rebase

and then do the squashing? I will use this for one more test run, but it should be really close to ready, and smaller fixups can also be done later.
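
(For example, one common way to do the squash; a sketch only, assuming a remote named origin that points at openhpc/ohpc and its 3.x branch:)

git fetch origin
git rebase -i origin/3.x    # mark the fixup commits as 'squash' or 'fixup'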

@tkucherera-lenovo
Contributor Author

@adrianreber do you want me to squash those commits, including the merge commit, and have just one commit?

@adrianreber
Member

Yes, just a single commit and no merge commits.

@@ -31,12 +34,25 @@ bmc_password="${bmc_password:-unknown}"
# Additional time to wait for compute nodes to provision (seconds)
provision_wait="${provision_wait:-180}"

# Local domainname for cluster (xCAT recipe only)
# DNS Local domainname for cluster (xCAT and Confluent recipe only)
dns_servers="${dns_sersers:-172.30.0.254}"
Member

Here is a typo. It says "sersers". Please fix.

@@ -21,6 +21,9 @@ sms_eth_internal="${sms_eth_internal:-eth1}"
# Subnet netmask for internal cluster network
internal_netmask="${internal_netmask:-255.255.0.0}"

# ipv4 gateway
ipv4_gateway="${ipv4_gateway:-172.16.0.2}
Member

Closing " missing.

@adrianreber
Member

Sorry for being pedantic, but could you also rework the commit message? Currently it is the result of the squash. Just make it a single-commit message. The more information the better, but not what it is now; it has multiple "Signed-off-by" lines and some fixup information.

@adrianreber
Member

So, another test shows that besides the mentioned typo, the missing closing quote, and the commit message, this is ready.

Recipe to support using Confluent as a system manager and provisioner when setting up an OpenHPC cluster.

Signed-off-by: tkucherera <tkucherera@lenovo.com>
@tkucherera-lenovo
Contributor Author

@adrianreber I made the change and added a much more descriptive commit message. Thanks.

@adrianreber
Member

Thank you so much for working with us. I will wait for CI to do a last check, but then I will merge it.

adrianreber merged commit 7db68dd into openhpc:3.x on Oct 17, 2024. 20 checks passed.