New Machine requirement: Linux/x64 equinix dockerhost replacement #3352

Closed
Tracked by #3292
sxa opened this issue Jan 23, 2024 · 23 comments

Comments

@sxa
Member

sxa commented Jan 23, 2024

I need to request a new machine:

  • New machine operating system (e.g. linux/windows/macos/solaris/aix): Linux
  • New machine architecture (e.g. x64/aarch32/arm32/ppc64/ppc64le/sparc): x64
  • Provider (leave blank if it does not matter): Skytap
  • Desired usage: Replacement for the two dockerhost x64 systems currently hosted on Equinix
  • Any unusual specification/setup required: docker for running dockerhost containers and build pipelines
  • How many of them are required: 1 (for now)

Please explain what this machine is needed for: Replacement for Equinix systems which we have to decommission as per #3292

@sxa
Member Author

sxa commented Jan 24, 2024

System provisioned at Skytap with 24 cores, 64GB RAM, and a 256GB filesystem on /var/lib/docker
IP 20.61.136.254 and it calls itself dockerhost-skytap-ubuntu2204-x64-1
I'm not clear yet whether it will accept inbound connections on high numbered ports so if that's not fixable we'll have to make it call into the jenkins server over JNLP for any containers we have on there.

@sxa sxa moved this from Todo to In Progress in 2024 1Q Adoptium Plan Jan 25, 2024
@sxa
Member Author

sxa commented Jan 25, 2024

I'm not clear yet whether it will accept inbound connections on high numbered ports

Not a problem - they're not restricted by default.

I've connected a container for experimental purposes running Fedora 39 to jenkins and running an AQA run at https://ci.adoptium.net/job/AQA_Test_Pipeline/206 🤞🏻
This container is not intended to be retained after this test, so it does not have the ci.role.test label on it

@sxa sxa added the currency label Jan 25, 2024
@sxa
Member Author

sxa commented Jan 25, 2024

Host machine has been tested with docker builds at https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk/job/jdk-linux-x64-temurin/471/console on dockerhost-skytap-ubuntu2204-x64-1 so I'll aim to get this activated properly for the weekend runs or on Monday, subject to there being no risk to any outstanding items in the release cycle.
Ran in about 13 minutes vs around 8 on the Equinix systems.
Skytap: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz (24 core)
Equinix: AMD EPYC 7401P 24-Core Processor or Intel(R) Xeon(R) Gold 6314U CPU @ 2.30GHz

@sxa
Member Author

sxa commented Jan 26, 2024

I've connected a container for experimental purposes running Fedora 39 to jenkins and running an AQA run at https://ci.adoptium.net/job/AQA_Test_Pipeline/206 🤞🏻

EDIT: extended grinder re-run stopped after 10 hours - trying at Grinder#8675

Others were ok.

@sxa
Member Author

sxa commented Jan 29, 2024

I've installed temurin-8-jdk as a package so that JDK8 is the default on the machine. This appears to be required for the gradle version we use in the installer process. JDK21 is still available (installed via tarball) and is being used for the jenkins agent.

+ ./gradlew packageJdkAlpine checkJdkAlpine --parallel -PPRODUCT=temurin -PPRODUCT_VERSION=8 -PARCH=x86_64 -PGPG_KEY=****
Picked up _JAVA_OPTIONS: -Xmx4g
Starting a Gradle Daemon (subsequent builds will be faster)

FAILURE: Build failed with an exception.

* Where:
Settings file '/home/jenkins/workspace/adoptium-packages-linux-pipeline_new@2/settings.gradle'

* What went wrong:
Could not compile settings file '/home/jenkins/workspace/adoptium-packages-linux-pipeline_new@2/settings.gradle'.
> startup failed:
  General error during conversion: Unsupported class file major version 65
  
  java.lang.IllegalArgumentException: Unsupported class file major version 65
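
A note on reading the error above: class file major version 65 corresponds to Java 21, which is consistent with Gradle having been started on the JDK 21 install rather than JDK 8. The sketch below shows the usual mapping for Java 5 and later (release = major version minus 44). The function name is made up for illustration and is not part of any Adoptium script.

```shell
#!/bin/sh
# Map a class file major version to the Java release that produces it.
# Valid for Java 5 and later (major version 49 and up): release = major - 44.
# class_major_to_java is a hypothetical helper name used only in this sketch.
class_major_to_java() {
    major="$1"
    if [ "$major" -lt 49 ]; then
        echo "unsupported major version: $major" >&2
        return 1
    fi
    echo $((major - 44))
}

class_major_to_java 65   # prints 21
class_major_to_java 52   # prints 8
```

Running `java -version` in the same environment as the failing `./gradlew` call is a quick way to confirm which JDK Gradle is actually picking up.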

@sxa
Member Author

sxa commented Jan 30, 2024

The three executors are running build jobs that can each take quite a bit of space on the jenkins workspace since the build volumes are mapped from the host. Also, the installer generations can use quite a bit of space on the host workspace. See #3362

At present there are up to 6GB (I think a full build of the latest release might take close to 10GB) in various directories on the host file system.

256GB filesystem on /var/lib/docker

I'm going to redo this file system with about 100GB for /home/jenkins/workspace and the rest as /var/lib/docker. The current dockerhost-equinix-ubuntu2004-x64-1 machine has 62GB in the jenkins workspace (that may need to be looked at as it's quite high) so 100GB should be enough.

@sxa
Member Author

sxa commented Jan 30, 2024

Noting that the Fedora 39 container is working as well as most of the other systems as per adoptium/aqa-tests#5012 (comment)

@sxa
Member Author

sxa commented Feb 1, 2024

Noting that https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk17u/job/jdk17u-alpine-linux-x64-temurin/393/ and the equivalent on other versions appear to insist on running on one of the Equinix dockerhosts at the moment as they're looking for build&&alpine-linux&&x64&&dockerBuild - we'll need to think about that labelling convention ...

23:30:52  [NODE SHIFT] MOVING INTO DOCKER NODE MATCHING LABELNAME build&&alpine-linux&&x64&&dockerBuild...
[Pipeline] node
23:31:07  Still waiting to schedule task
23:31:07  ‘[dockerhost-equinix-ubuntu2204-x64-1](https://ci.adoptium.net/computer/dockerhost%2Dequinix%2Dubuntu2204%2Dx64%2D1/)’ is offline
23:59:27  Running on [dockerhost-equinix-ubuntu2204-x64-1](https://ci.adoptium.net/computer/dockerhost%2Dequinix%2Dubuntu2204%2Dx64%2D1/) in /home/jenkins/workspace/build-scripts/jobs/jdk17u/jdk17u-alpine-linux-x64-temurin
[Pipeline] {
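
For context on the label expression in the log above: in a Jenkins label expression, `&&` means a node must carry every listed label, so a host without `alpine-linux` is never selected for this job. The sketch below illustrates that rule for the simple `a&&b&&c` form only. It is not the Jenkins implementation, and the node label sets are hypothetical examples rather than the real configuration of any Adoptium machine.

```shell
#!/bin/sh
# Return 0 if every label in an "&&"-only expression is present on the node.
# Handles only the a&&b&&c form seen in the log, not the full Jenkins syntax.
# node_matches is a hypothetical helper name used only in this sketch.
node_matches() {
    expr_labels=$(printf '%s' "$1" | sed 's/&&/ /g')
    node_labels=" $2 "
    for wanted in $expr_labels; do
        case "$node_labels" in
            *" $wanted "*) ;;      # label present, keep checking
            *) return 1 ;;         # label missing, node does not match
        esac
    done
    return 0
}

EXPR='build&&alpine-linux&&x64&&dockerBuild'
# Hypothetical label sets, for illustration only.
node_matches "$EXPR" "build x64 dockerBuild" || echo "no match without alpine-linux"
node_matches "$EXPR" "build x64 dockerBuild alpine-linux" && echo "match with alpine-linux"
```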

@sxa
Member Author

sxa commented Feb 1, 2024

Inventory PR for this system: #3358
I've added it to Bastillion, @steelhead31 is managing Nagios installation prior to merging that PR

@steelhead31
Contributor

Nagios & Wazuh installed successfully.

@sxa
Member Author

sxa commented Feb 1, 2024

Note: I've added alpine-linux to the labels on the machine for now until we look at alternate solutions in the issue mentioned above.

@sxa
Member Author

sxa commented Feb 14, 2024

Initial machine is in place and working. While we may wish to add additional containers onto this machine that can be done at a later date so I shall close this. Noting that #3378 covers setting up a second machine for the same purpose.

@sxa
Member Author

sxa commented Apr 4, 2024

This machine was offline due to our monthly x64 credits at Skytap having expired. It has been changed from its original configuration to have 16GB RAM and six vCPUs and brought online again, but it still has a number of static docker containers defined.

The machine has been up for 2 days, 7h01 (My working assumption is that the rollover date for the credits is on the month boundary, but that may not be true) and it's currently showing this:
[image: screenshot attached to the original comment]

@sxa
Member Author

sxa commented Apr 4, 2024

@Haroon-Khel I'm struggling to bring the machines back online - has the port information in the jenkins agent definitions become de-synchronised from what is on the host?
e.g. https://ci.adoptium.net/computer/test%2Ddocker%2Dubi8%2Dx64%2D3/log
which seems to be on a different port - is this expected?

CONTAINER ID   IMAGE        COMMAND               CREATED       STATUS      PORTS                                             NAMES
b67b6d5f2601   aqa_ubi8     "/usr/sbin/sshd -D"   5 weeks ago   Up 2 days   0.0.0.0:32771->22/tcp, :::32771->22/tcp           UBI8.32790

I've changed that particular agent definition to be on 32771 and it has come up ok, but it would be good to understand some of the others. I'd quite like to get at least one other container live on there (any more may cause a problem with the restricted number of CPU cores).
Since I've fixed that one, https://ci.adoptium.net/computer/test%2Ddocker%2Dubuntu2204%2Dx64%2D4/log is an example of the failure.
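
A rough way to spot this kind of drift across all containers is sketched below. It assumes that the numeric suffix of the container name (e.g. `UBI8.32790`) records the port the container was originally expected to use, which the thread suggests but does not state outright. The function names are hypothetical.

```shell
#!/bin/sh
# Extract the IPv4 host port mapped to container port 22 from a PORTS string
# such as "0.0.0.0:32771->22/tcp, :::32771->22/tcp".
host_ssh_port() {
    printf '%s\n' "$1" | sed -n 's/.*0\.0\.0\.0:\([0-9][0-9]*\)->22\/tcp.*/\1/p'
}

# Compare the published port with the numeric suffix of the container name.
# ASSUMPTION: the suffix after the last "." is the originally intended port.
check_container() {
    name="$1"
    ports="$2"
    expected="${name##*.}"
    actual=$(host_ssh_port "$ports")
    if [ "$expected" = "$actual" ]; then
        echo "$name: ok ($actual)"
    else
        echo "$name: name says $expected, docker publishes $actual"
    fi
}

# Values taken from the docker ps output quoted above.
check_container "UBI8.32790" "0.0.0.0:32771->22/tcp, :::32771->22/tcp"
# prints: UBI8.32790: name says 32790, docker publishes 32771
```

On the host itself, the name and ports for each container could be fed in from `docker ps --format '{{.Names}} {{.Ports}}'`, and the port in the matching Jenkins agent definition compared by hand.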

@sxa sxa reopened this Apr 4, 2024
@Haroon-Khel
Contributor

Yeah, I'm seeing this in #3486 (comment) too. Not sure what caused docker to reassign ports. Looking into it

@Haroon-Khel
Contributor

It's caused because we now don't specify a host port (allowing docker to randomly assign one):

command: docker run --restart unless-stopped -p 22 --cpuset-cpus="0-3" --memory=6G --detach --name {{ docker_image | upper }}.PORT aqa_{{ docker_image }}

Then when the dockerhost machine is restarted, docker will randomly assign a port again instead of giving the containers their previous port. TL;DR: a port needs to be specified on container startup instead of relying on docker to give a random one.
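
As an illustration of the fix described here, the sketch below builds the same `docker run` command with an explicit `hostPort:containerPort` mapping (`-p 32771:22`) in place of `-p 22`. It only prints the command and does not run docker. The helper name and the choice of port are assumptions for the example, and this is not the actual change made to the playbook.

```shell
#!/bin/sh
# Build a docker run command that publishes container port 22 on a fixed
# host port, so the mapping is the same after a host or daemon restart.
# docker_run_cmd is a hypothetical helper; it prints the command only.
docker_run_cmd() {
    image="$1"
    port="$2"
    upper=$(printf '%s' "$image" | tr '[:lower:]' '[:upper:]')
    printf 'docker run --restart unless-stopped -p %s:22 --cpuset-cpus="0-3" --memory=6G --detach --name %s.%s aqa_%s\n' \
        "$port" "$upper" "$port" "$image"
}

docker_run_cmd ubi8 32771
# prints: docker run --restart unless-stopped -p 32771:22 --cpuset-cpus="0-3" --memory=6G --detach --name UBI8.32771 aqa_ubi8
```

With a fixed mapping, the port stored in the Jenkins agent definition stays valid for the lifetime of the container.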

@sxa
Member Author

sxa commented Apr 5, 2024

That's another thing that won't be a problem if we switch over to connecting the containers over JNLP ;-)

@Haroon-Khel
Contributor

Haroon-Khel commented Apr 5, 2024

The containers are back online (https://ci.adoptium.net/computer/test-docker-ubuntu2004-x64-4/log refuses to come back up for other reasons). The problem should not reoccur with the existing containers. I need to change

command: docker run --restart unless-stopped -p 22 --cpuset-cpus="0-3" --memory=6G --detach --name {{ docker_image | upper }}.PORT aqa_{{ docker_image }}
to specify a port number to prevent this from happening in the future

@sxa
Member Author

sxa commented Apr 5, 2024

Sounds good, thanks - Jenkins logs should be clearer now after today's cleanups. We need to wait for Ludovic to come back to fix the RISC-V ones, but that should be another load of warnings to disappear from Jenkins 👍

@sxa
Member Author

sxa commented Apr 8, 2024

The containers are back online (https://ci.adoptium.net/computer/test-docker-ubuntu2004-x64-4/log refuses to come back up for other reasons).

Do you know what the reason is? It's "curious" to note that the port number is 32768, exactly 2^15

@sxa
Member Author

sxa commented Apr 24, 2024

I'm going to close this now. Any future work can happen under other issues if required.

@Haroon-Khel
Contributor

https://ci.adoptium.net/computer/test-docker-ubuntu2004-x64-4/log is back online. I recreated its container and now the jenkins agent has no trouble connecting.
