
Add Platform linux/arm64 to Docker Build #10441

Merged · 10 commits · Aug 25, 2024
Conversation

@smashedr (Contributor) commented Aug 6, 2024

It would be beneficial to provide multi-architecture docker images to allow deployment to linux/arm64 servers.

Instead of setting up my own build process I figured it would be useful to incorporate these changes upstream. This was going to be a feature request, but submitting as a PR so I can be of assistance if necessary.

Personally, I don't use Docker Hub, but I do know GHCR is seamlessly compatible with multi-architecture builds and can be deployed to any architecture using a single latest tag.

My main swarm cluster uses linux/arm64 and after an initial build/deploy of this project, seems to run with no issues from GHCR.

Let me know your thoughts on this request, or any additional work you would like done. Thanks.
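
For context, the core of such a change in a GitHub Actions workflow is typically just a few lines. A minimal sketch, not the actual diff from this PR (action versions and tags are illustrative):

```yaml
# Illustrative sketch only, not the exact change in this PR.
- name: Set up QEMU (emulates arm64 on the amd64 runner)
  uses: docker/setup-qemu-action@v3

- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v3

- name: Build and push a multi-arch image
  uses: docker/build-push-action@v6
  with:
    # adding linux/arm64 alongside the existing linux/amd64 target
    platforms: linux/amd64,linux/arm64
    push: true
    tags: shieldsio/shields:next
```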

github-actions bot commented Aug 6, 2024

Messages
📖 ✨ Thanks for your contribution to Shields, @smashedr!
📖 Thanks for contributing to our documentation. We ❤️ our documentarians!

Generated by 🚫 dangerJS against 6d2f485

@smashedr marked this pull request as ready for review August 7, 2024 02:13
@chris48s (Member) commented Aug 7, 2024

I'm in two minds about this.

On the one hand, adding the extra architecture here is not a lot of code and is unlikely to create a lot of churn or maintenance relating to these lines of code added in this PR. Also, this is something that has been asked for previously (#9192).

My main reservation about this is that with the amd64 image, we use it ourselves. That's what our own servers are deployed from. As such we are pretty likely to spot any problems with it. An arm64 image would be something we don't use ourselves so we are unlikely to notice if we break it in some arch-specific way. We also do not run our test suite on arm64. If a user pops up and raises an issue saying "I'm using the official linux/arm64 shieldsio/shields image and I've spotted problem X", I basically have no easy way to reproduce that or help fix it.

I do acknowledge that shields is written in javascript not C. The majority of what we're doing is sufficiently high level that it is very unlikely to be architecture specific. That said, we do build on top of a bunch of packages, libraries and images that are all working at lower levels of the stack.

I guess my point is: Offering this also comes with some responsibility to have some confidence we are not breaking it as we change things, and some ability to reproduce problems. We don't really have that.

So here's another question for you: You've already built and pushed an image to https://github.com/smashedr/shields/pkgs/container/shields. Is there anything more we could do that would make it easier for you (or others) to easily maintain/update a 3rd party "shields, but built for [architecture I care about]" without actually taking on the support for those things ourselves?

@chris48s added the operations (Hosting, monitoring, and reliability for the production badge servers) label Aug 7, 2024
@smashedr (Contributor, Author) commented Aug 8, 2024

@chris48s

> My main reservation about this is that with the amd64 image, we use it ourselves. That's what our own servers are deployed from. As such we are pretty likely to spot any problems with it. An arm64 image would be something we don't use ourselves so we are unlikely to notice if we break it in some arch-specific way. We also do not run our test suite on arm64. If a user pops up and raises an issue saying "I'm using the official linux/arm64 shieldsio/shields image and I've spotted problem X", I basically have no easy way to reproduce that or help fix it.
>
> I do acknowledge that shields is written in javascript not C. The majority of what we're doing is sufficiently high level that it is very unlikely to be architecture specific. That said, we do build on top of a bunch of packages, libraries and images that are all working at lower levels of the stack.
>
> I guess my point is: Offering this also comes with some responsibility to have some confidence we are not breaking it as we change things, and some ability to reproduce problems. We don't really have that.

The chance that the code would break on one architecture and not another is slim to none. I have been using an arm64 swarm cluster for a while now and have never had any issues. The only issue I run into is projects not offering multi-architecture builds, like this one.

> So here's another question for you: You've already built and pushed an image to https://github.com/smashedr/shields/pkgs/container/shields. Is there anything more we could do that would make it easier for you (or others) to easily maintain/update a 3rd party "shields, but built for [architecture I care about]" without actually taking on the support for those things ourselves?

Maintaining your own build for an external project is not very easy and requires quite a few steps and workflows to work optimally, including but not limited to: updating repository permissions, adding actions credentials, linking the package and setting visibility, creating workflows with custom logic to determine when to re-build, linking it all to your deploy, and debugging the many steps/issues that can arise.

All compared to simply deploying badges/shields:next and clicking re-deploy, or using GitOps to automatically update the deployment.

I recently set up my own build for github-readme-stats after the public deployments paused issue, and while it works just fine, I still need to take the time to write a custom workflow on a cron that checks for upstream updates and re-builds when found. And the only reason I took the time to do this is because they don't offer a docker image to begin with. For reference: https://github.com/smashedr/github-readme-stats
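
Such a cron-driven rebuild might be sketched roughly like this (a hypothetical outline; the upstream repo URL, state file, and build step are illustrative, not smashedr's actual workflow):

```yaml
# Hypothetical sketch of a cron-driven rebuild; names are illustrative.
name: Rebuild on upstream changes
on:
  schedule:
    - cron: '0 6 * * *' # check once a day
  workflow_dispatch:

jobs:
  check-and-rebuild:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Compare upstream HEAD with the last commit we built
        id: check
        run: |
          upstream=$(git ls-remote https://github.com/anuraghazra/github-readme-stats.git HEAD | cut -f1)
          last=$(cat .last-built-sha 2>/dev/null || echo none)  # .last-built-sha is a hypothetical state file
          echo "changed=$([ "$upstream" != "$last" ] && echo true || echo false)" >> "$GITHUB_OUTPUT"
      - name: Rebuild and push the image
        if: steps.check.outputs.changed == 'true'
        run: echo "build/push steps would go here"
```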

At the end of the day, since docker images are already offered by this project, I feel at a minimum the additional architectures should be provided as is, as a convenience, for people using those architectures, with no support guarantee. The image I built is currently running on my linux/arm64 swarm cluster and everything I have tested seems to work just fine. If you want to poke around yourself, just note I have not set up any credentials/tokens yet: https://shields.cssnr.com/

Providing support for people trying to set up their own builds would require a detailed guide that would most likely need many revisions before it stops getting issues opened against it, as well as maintaining workflows for them to use.

Let me know what you think about just providing additional architectures without a support guarantee, as I feel this is the most convenient solution.

@calebcartwright (Member) commented
Chris raises a good point, though I also agree it's something we could resolve with some of the existing suggestions noted previously.

Not too long ago, Shields built and provided an "official" docker image but we didn't use the image itself, and trying to help people troubleshoot issues with it was a bit of a stretch sometimes. Even since we started using the image in the production shields.io environment, we do still occasionally get contacted by folks who are having issues running it in kubernetes environments, where our ability/willingness to help troubleshoot is comparatively limited.

One of the things we do in the Rust project is the notion of platform/target "tiers" that may be worth trying to adopt (at least in part) here, even if it's just something more simplistic and binary (e.g. x86 linux is tier 1 that the Shields project builds and tests for, everything else is no-guarantees & best-effort)

@chris48s (Member) commented Aug 9, 2024

> everything I have tested seems to work just fine. If you want to poke around yourself…

So just to be clear, my question isn't "does it work now?" it is more like "how do we ensure future changes do not accidentally break it?".

I guess my (imperfect) comparison point here is Windows compatibility. I think you'd also quite reasonably say "it's unlikely we're going to do anything that is OS dependent" in the same way as "it's unlikely we're going to do anything that is architecture dependent", but it has happened before (#8350, #8786). In that case, I was at least able to boot up a VM running Windows to fix the issues and we added a Windows CI build to prevent future problems.

However, another comparison point here is: We are a website. People occasionally report a bug in the frontend that only manifests in Safari. That is also a giant pain in the ass to reproduce because I don't have a mac either. Sometimes people report a problem that only exists on a platform we don't use 🤷

I did do a bit of reading and had a look around to see if there is any good pattern for:

  • build an image
  • boot a container on each of the target architectures it is built for
  • run some smoke tests against each container
  • push it to the registry if they all pass

but as far as I can see, most public projects I looked at just seem to build and chuck it over the wall. Maybe "it builds" is enough of a smoke test? It would seem at least comparable with what other projects are doing. Perhaps one of the reasons for this is that although GitHub recently launched linux/arm64 runners, they are currently only available to GitHub Team or GitHub Enterprise Cloud plans.
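
For illustration, that build/boot/smoke-test/push pattern might look roughly like this (a sketch only, assuming QEMU is set up on the runner; the local tag, port mapping, and badge URL are assumptions, not Shields' actual workflow):

```yaml
# Hypothetical smoke-test pattern; the tag, port, and URL are illustrative.
- name: Build the arm64 image locally without pushing
  uses: docker/build-push-action@v6
  with:
    platforms: linux/arm64
    load: true # load into the local docker daemon instead of pushing
    tags: shields:arm64-smoke

- name: Boot the container under QEMU emulation and smoke test it
  run: |
    docker run -d --rm --platform linux/arm64 -p 8080:80 --name smoke shields:arm64-smoke
    sleep 15 # give the server a moment to start
    curl --fail http://localhost:8080/badge/smoke-test-green
    docker stop smoke

- name: Push both architectures only if the smoke test passed
  uses: docker/build-push-action@v6
  with:
    platforms: linux/amd64,linux/arm64
    push: true
    tags: shieldsio/shields:next
```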

> everything else is no-guarantees & best-effort

Tbh, I wouldn't want to classify any of our operations as above "no-guarantees & best-effort".


I think on balance building/pushing both but documenting in https://github.com/badges/shields/blob/master/doc/self-hosting.md#docker that:

  • We use the linux/amd64 image ourselves
  • We push a linux/arm64 image, but it is basically untested

is probably a reasonable approach here.

@calebcartwright (Member) commented

> I think on balance building/pushing both but documenting in https://github.com/badges/shields/blob/master/doc/self-hosting.md#docker that:
>
> • We use the linux/amd64 image ourselves
> • We push a linux/arm64 image, but it is basically untested
>
> is probably a reasonable approach here.

I'd agree 👍

This allows there to be an "official" image, one published by the Shields project, which allows for easier and perhaps more comfortable consumption (i.e. there's some portion of the user base that'd likely feel more at ease pulling an image produced by the project as opposed to a 3rd-party one). At the same time, it doesn't overcommit the maintainer team.

@chris48s (Member) commented
I've pushed another commit to this branch adding a note to the docs. Does that seem reasonable?

@smashedr (Contributor, Author) commented

For testing, I know Amazon Web Services has an EC2 free tier that allows for 750 hours per month of a t2.micro (always free), which is an ARM instance. If anyone is not using theirs and is willing to donate an instance, I can configure a GitHub Actions runner on it that can easily be used by any current Actions for testing.
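
If such a runner were registered, pointing a job at it would just be a matter of labels; a hypothetical sketch, using GitHub's default labels for self-hosted runners:

```yaml
# Hypothetical job targeting a self-hosted arm64 runner, if one were registered.
jobs:
  test-arm64:
    runs-on: [self-hosted, Linux, ARM64]
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test # run the existing suite natively on arm64
```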

Additionally, if any issues ever do get opened against the arm64 build that are not replicated in the amd64 build, I would be more than happy to help resolve them in any way possible. Just make sure to assign or mention me on the issue.

@chris48s (Member) commented

OK, so I've realised there is another issue with this that I had not spotted before :(

I've just run a couple of builds and noticed that building for both architectures disproportionately increases the time it takes to build docker images.

If I look at the last few builds where we are just building an amd64 image, these take just under 5 minutes to complete:

[Screenshot at 2024-08-21 19-35-54]

If I look at the builds on this branch, they all consistently took around 35 minutes:

[Screenshot at 2024-08-21 19-51-16]

I'm fine with publishing more images, but we can't have a CI build that runs for 35 mins on every pull request. That is just too slow. Any idea why building for arm64 takes so much longer? I accept we're now building two images, so it will take about double the amount of time, but not 7x.

@chris48s (Member) commented

If there is no way to speed it up, maybe one thing we could consider is doing multi-arch builds only on the tagged releases (server-YYYY-MM-DD), but not on every pull request and push event?
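
One possible shape for that (a sketch, not the PR's actual change; the tag name is illustrative) is making the platform list conditional on the ref:

```yaml
# Build both platforms only on server-YYYY-MM-DD tags, amd64 everywhere else.
- name: Build and push
  uses: docker/build-push-action@v6
  with:
    platforms: ${{ startsWith(github.ref, 'refs/tags/server-') && 'linux/amd64,linux/arm64' || 'linux/amd64' }}
    push: true
    tags: shieldsio/shields:latest
```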

@calebcartwright (Member) commented

> If there is no way to speed it up, maybe one thing we could consider is doing multi-arch builds only on the tagged releases (server-YYYY-MM-DD), but not on every pull request and push event?

Agreed. I feel like our main goal here is to provide a convenience in the form of a project-produced image for arm, so there's no need to bog down CI with something we're explicitly not planning to use nor test.

@smashedr (Contributor, Author) commented

I agree too, that is way too big of an increase in build time. When I get some time today I will see if I can reduce the build times; otherwise, I can set it to only build these on tagged server-YYYY-MM-DD releases.

@chris48s (Member) commented

This is the workflow where we push the tagged releases:
https://github.com/badges/shields/blob/master/.github/workflows/create-release.yml

@smashedr (Contributor, Author) commented

I did some research and testing. The emulation layer GitHub runners use to build non-native architectures can be extremely slow on complex builds, in this case to the tune of 5x.

[Screenshot: firefox-20240822-140111229]

I assume we want to build the arm platform on release and next builds, and just remove it from the CI builds. I have updated the PR to reflect this.

@chris48s (Member) commented

Thanks for having a look at it. I am surprised to learn our build is considered "complex".

I've pushed a couple more commits to this branch.

My final proposal on this is that we push linux/amd64 and linux/arm64 only for the monthly snapshot builds, but push only linux/amd64 images to the next tag.
The reason I say this is because we run that next build on every push to master and it is what we use to deploy shields.io. If we have to wait ~35 mins from merging a PR to the image being available to deploy from, that is going to be too much of an inconvenience day-to-day to provide this feature.

That does mean that linux/arm64 users won't be able to track the next tag, but they will be able to use the monthly snapshot releases. This seems like a reasonable tradeoff.

I think in order to change my mind on this we'd need to be able to either:

  • Build the linux/arm64 image faster (and it seems like this is a dead end)
  • Find a way to split the builds up so that the linux/amd64 image can be pushed as soon as it is ready, without having to wait for the linux/arm64 image to finish building first

In any case, adding linux/arm64 to the monthly snapshots only seems like a reasonable first step to me.

@smashedr (Contributor, Author) commented Aug 24, 2024

So, if I do a matrix build, the amd64 build will be pushed within 5 minutes, then 35 minutes later the arm64 build will be pushed.

How do you feel about me adding a matrix? It's quite simple...

In hindsight, the matrix approach is probably better; that way one architecture does not affect the other. I pushed the matrix changes up for you to look at.
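
A minimal sketch of that matrix idea (illustrative, not the exact change pushed here): each platform builds in its own job, so the fast amd64 push is never held up by the slow emulated arm64 build.

```yaml
# Sketch of the matrix approach; tag and versions are illustrative.
jobs:
  build:
    strategy:
      fail-fast: false # an arm64 failure should not cancel the amd64 job
      matrix:
        platform: [linux/amd64, linux/arm64]
    runs-on: ubuntu-latest
    steps:
      - uses: docker/setup-qemu-action@v3
      - uses: docker/setup-buildx-action@v3
      - uses: docker/build-push-action@v6
        with:
          platforms: ${{ matrix.platform }}
          push: true
          tags: shieldsio/shields:next # caveat: two jobs pushing one tag can clobber each other
```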

@chris48s (Member) commented

Thanks. I tried this out on my fork and agree this seems like the right way to do it 👍

I've updated the docs and will merge this.

@chris48s added this pull request to the merge queue Aug 25, 2024
Merged via the queue into badges:master with commit 4a37203 Aug 25, 2024
23 checks passed
@smashedr (Contributor, Author) commented

@chris48s I don't think this is working correctly on Docker Hub. That was the one thing I was unable to test myself. It seems one tag is overwriting the other.

```console
0 shane@jammy [/home/shane]$ docker pull shieldsio/shields:next
next: Pulling from shieldsio/shields
no matching manifest for linux/amd64 in the manifest list entries
```

I am also unable to view the packages on GHCR. In my experience GHCR handles multiple architectures seamlessly, but with the matrix build it is worth verifying.
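
For what it's worth, the pattern Docker's own docs suggest for matrix builds is to have each job push by digest only, then assemble the manifest list in a single fan-in job, so the per-platform pushes cannot clobber the tag. A rough sketch (the digest plumbing between jobs is elided; names are illustrative):

```yaml
# Each matrix job pushes by digest, never by tag.
- name: Build and push by digest (runs once per matrix platform)
  id: build
  uses: docker/build-push-action@v6
  with:
    platforms: ${{ matrix.platform }}
    outputs: type=image,name=shieldsio/shields,push-by-digest=true,name-canonical=true,push=true

# ...then, in a follow-up job that depends on the whole matrix:
- name: Create the multi-arch manifest for the tag
  run: |
    # DIGEST_AMD64 / DIGEST_ARM64 are placeholders for digests passed between jobs
    docker buildx imagetools create -t shieldsio/shields:next \
      "shieldsio/shields@${DIGEST_AMD64}" \
      "shieldsio/shields@${DIGEST_ARM64}"
```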

chris48s added a commit to chris48s/shields that referenced this pull request Aug 26, 2024
chris48s added a commit that referenced this pull request Aug 26, 2024
@chris48s (Member) commented

OK. For the moment I have merged #10477, which reverts this whole thing.

I won't really have a lot of time to dig into this for a week or so now. I'll have to come back to this another day. Reverting will put us back where we were and push an amd64 image back to the tip of next for those following the next tag.

@smashedr (Contributor, Author) commented

I already created a new PR to address this going forward: #10476

When you get time, let's get that tested and merged.

@chris48s added the self-hosting (Discussion, problems, features, and documentation related to self-hosting Shields) label Sep 4, 2024