
[EKS] [request]: A reliable EKS AMI release process #319

Closed
max-rocket-internet opened this issue Jun 7, 2019 · 41 comments

@max-rocket-internet

My request: The AMI release process needs to be reliable

We have been EKS users since the first preview, and one significant pain point has been issues with the AMI. From the outside, the release process looks inconsistent and/or unreliable.

Here are some notable examples:

  1. Simple log rotation missing in AMI: Adding simple dockerd config file to rotate logs from containers awslabs/amazon-eks-ami#74
  2. Commits missing from a release: Missing commits in ami-098fb7e9b507904e7 awslabs/amazon-eks-ami#215
  3. Release names were not consistent in the beginning: https://github.com/awslabs/amazon-eks-ami/releases
  4. New AMI version breaking ulimit settings: Latest AMI version lowers ulimit and breaks Elasticsearch awslabs/amazon-eks-ami#193
  5. Typos in some init scripts: Fixed --enable-docker-bridge bootstrap arg awslabs/amazon-eks-ami#192
  6. Changelog incorrectly filled out: Update changelog awslabs/amazon-eks-ami#241
  7. AMI v20190329 released in the AWS console but not on GitHub: Latest AMIs (v20190220) missing commits in /etc/eks/bootstrap.sh awslabs/amazon-eks-ami#233 (comment)
  8. The ulimit issue STILL not fixed; read this comment to get the history, it's crazy: Raise docker default ulimit for nofile to 65535 awslabs/amazon-eks-ami#278 (comment)

Now, I know and expect some bugs, and I also recognise it is (or was) a new service, so of course there are a few kinks to work out, but the last 2 items are recent. This is not the quality I have come to expect from my favourite cloud provider 💔

@max-rocket-internet max-rocket-internet added the Proposed Community submitted issue label Jun 7, 2019
@max-rocket-internet
Author

@mhausenblas

@mhausenblas mhausenblas added the EKS Amazon Elastic Kubernetes Service label Jun 7, 2019
@whereisaaron

whereisaaron commented Jun 7, 2019

EKS AMI releases appear to have simply ceased since 29 March. I could not find any newer AMIs, so important fixes to self-inflicted wounds like the ulimit issue are not available.

The project home page and releases page list the latest AMIs as 27 March, and the AWS Marketplace says the latest AMI version is 20 February.

[Screenshots: amazon-eks-ami releases page and AWS Marketplace listing showing the stale AMI versions]

@wolverian

Additionally, if there is a supported method for subscribing to EKS AMI updates, I haven't found it in the documentation.

@thomasjungblut

@wolverian you could watch https://github.com/awslabs/amazon-eks-ami in releases-only mode; that has corresponded well with the releases so far.

@max-rocket-internet
Author

you could watch https://github.com/awslabs/amazon-eks-ami in releases-only mode

This is exactly what I do. But this only works when the release process is reliable, which is what this issue is about 🙏

@thomasjungblut

@max-rocket-internet I meant watching by release time, not necessarily by content ;-) Although I have to agree that the tagging in this case is slightly misleading.

@max-rocket-internet
Author

I meant watching by release time, not necessarily by content

I get you, but number 7 on my list shows that there have been releases in the AWS console but not on GitHub. That's why I mentioned it 😃

@M00nF1sh

M00nF1sh commented Jun 12, 2019

Hi, sorry for the issues and trouble caused.

We are working on a more standard AMI release process that can be used to release AMIs more frequently.
There will be SNS notifications for new AMI releases.
Also, there will be an SSM public parameter that references the latest available AMI.

BTW, the GitHub releases don't correlate one-to-one with our AMI releases; e.g. we may release new AMIs just for security patches.
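
Once the SSM parameter ships, looking up the latest recommended AMI might look something like this (a sketch; the exact parameter path, Kubernetes version and region are assumptions):

# Query the public SSM parameter for the recommended EKS-optimized AMI.
# The parameter path here is an assumption based on the planned design.
aws ssm get-parameter \
  --name /aws/service/eks/optimized-ami/1.13/amazon-linux-2/recommended/image_id \
  --region us-west-2 \
  --query 'Parameter.Value' \
  --output text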

@wolverian

Thank you, that sounds great! Sorry for nagging, but would it be possible to synchronize GitHub releases with AMI releases? If not, maybe disabling GitHub releases altogether would make sense, to reduce potential confusion.

@szymonpk

Synchronized releases would be ❤️. We pull information about recent updates for all our services over RSS and GH fits nicely into this workflow.
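
GitHub already exposes a per-repo Atom feed for releases, which is what we consume; e.g.:

# Standard GitHub feature: poll the releases Atom feed for the repo.
curl -s https://github.com/awslabs/amazon-eks-ami/releases.atom | head -n 20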

@max-rocket-internet
Author

Synchronized releases would be ❤️

Exactly 💯

An SNS topic is not very user-friendly when much of our current software and workflow already revolves around GitHub.

@M00nF1sh

M00nF1sh commented Jun 14, 2019

@max-rocket-internet @szymonpk
Hi, would you help describe how your current workflow builds AMIs? Do you rebuild the AMI from this repo, or use our published AMI as a base image?

There are actually three components that can be involved in the release process here:

  1. The AMI build binary artifacts (binaries in our amazon-eks S3 bucket). These are updated when we release new Kubernetes versions/binary patches.
  2. The AMI build source code (build scripts/files in this `amazon-eks-ami` repo).
  3. The actual AMI releases. These can happen when either of the above changes, or when a newer base Amazon Linux 2 AMI has been released (e.g. security patches).

For GitHub releases, I think they can be synced with 1 and 2 above.

@whereisaaron

@M00nF1sh that all sounds like a big improvement on the current situation. There is also an immediate need, as the latest available EKS AMIs are more than 3 months old and contain some pretty problematic bugs, like this ulimit issue. Would you be able to cut an updated AMI using the old process in the meantime?

@M00nF1sh

@whereisaaron We'll release new AMIs on Monday 😄

@tabern
Contributor

tabern commented Jun 17, 2019

@wolverian more context on what @M00nF1sh mentioned: we are working on launching an AMI SSM parameter that you can use along with an SNS topic that you can subscribe to. See #231
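
Once the topic launches, subscribing might look something like this (a sketch; the topic ARN below is a placeholder until the real one is announced in #231):

# Subscribe an email endpoint to the AMI update SNS topic.
# NOTE: the topic ARN is a placeholder, not the real one.
aws sns subscribe \
  --topic-arn arn:aws:sns:us-west-2:123456789012:eks-ami-updates \
  --protocol email \
  --notification-endpoint you@example.com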

@max-rocket-internet
Author

@M00nF1sh

Hi, would you help describe how your current workflow builds AMIs? Do you rebuild the AMI from this repo, or use our published AMI as a base image?

Sure: we don't build AMIs. We want to avoid that administrative overhead. We just want reliable, bug-free AMIs provided by AWS with:

  • A clear and consistent change log (6)
  • Releases published on Github (7)
  • Consistency between Github release, change log and AMI in AWS console (2, 3, 7)
  • No basic/obvious bugs (1, 5, 8)

(the numbers are references to my points in my first post).

we are working on launching an AMI SSM parameter that you can use along with an SNS topic that you can subscribe to

This problem is non-existent for Terraform users 😃

@szymonpk

@M00nF1sh
Last month, not every important security update (important from our PoV, not necessarily tagged with 'important' severity) got a new EKS AMI, or there was no clear notification about it. We are subscribed to ALAS2 and the AWS security bulletins. Our use case varies: if there is an official AMI, we use it; if there is an important update without a new AMI, we build our own images. The last time we had to build our own images was when the Intel security issues were disclosed.

The best solution for us would be synchronized 1, 2 and 3. I see 3 may not happen, so it may be a good idea to have something similar to the ALAS2 bulletin but for EKS. However, I see it is rather unlikely that AWS services would get separate security feeds.

@max-rocket-internet
Author

This problem is non-existent for Terraform users

Well, it appears I spoke too soon 🙁

In the Hashicorp EKS Terraform module, we filter AMI IDs by name. This means we can easily specify a release by its name and not worry about what region we are in. That's why item number 3 is important. But in AWS region us-west-2 there are 2 AMIs with the name amazon-eks-gpu-node-1.13-v20190614: https://us-west-2.console.aws.amazon.com/ec2/v2/home?region=us-west-2#Images:visibility=public-images;ownerAlias=602401143452;search=amazon-eks-gpu-node-1.13-v20190614;sort=desc:creationDate

[Screenshot: EC2 console showing two public AMIs named amazon-eks-gpu-node-1.13-v20190614]

Even if you aren't using Terraform, how would you know what AMI is the correct one?

Noted in awslabs/amazon-eks-ami#291

This is what I mean by "release process". I don't know what the process behind the scenes is, or how it works, but if it were properly automated, this should be impossible.
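
You can reproduce the ambiguity with the same name filter the Terraform data source effectively uses (a sketch):

# List public images from the EKS account matching the release name;
# two rows come back for the same name, differing only by creation date.
aws ec2 describe-images \
  --region us-west-2 \
  --owners 602401143452 \
  --filters 'Name=name,Values=amazon-eks-gpu-node-1.13-v20190614' \
  --query 'Images[].[ImageId,CreationDate]' \
  --output text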

@sayboras

we are working on launching an AMI SSM parameter that you can use along with an SNS topic that you can subscribe to.

@tabern I just want to highlight one more point related to SSM: I ran into an issue in which the SSM parameter got updated to amazon/amazon-eks-node-1.15-v20200312 around 3 weeks back; however, CloudFormation didn't support the new value for ReleaseVersion, which caused the stack update to fail.

Just checked and found out that the newest AMI release is v20200406 (https://github.com/awslabs/amazon-eks-ami/releases); not sure whether it's supported by CloudFormation or not.
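
For context, this is roughly the CLI equivalent of what the CloudFormation update attempts (a sketch; the cluster/nodegroup names and the exact release version string are placeholders):

# Try to move a managed nodegroup to a specific AMI release version.
# The release version follows the <k8s-patch>-<ami-date> convention.
aws eks update-nodegroup-version \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup \
  --release-version 1.15.11-20200406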

@sagikazarmark

Not sure if it is entirely related to this issue, but I find matching AMIs to Kubernetes versions quite uncomfortable. Preferably, I'd like to see the Kubernetes versions (including the patch version) and the corresponding AMIs in a single list. As far as I can tell, there is no easy way to tell which exact Kubernetes version is used in an AMI.
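
The closest thing I've found is the SSM "recommended" parameter, which at least pairs an image with its release version (a sketch; the parameter path is assumed from the docs):

# The parent 'recommended' parameter returns a JSON blob containing
# image_id, image_name and release_version for a Kubernetes minor version.
aws ssm get-parameter \
  --name /aws/service/eks/optimized-ami/1.14/amazon-linux-2/recommended \
  --query 'Parameter.Value' \
  --output text | jq .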

@booleanbetrayal

Bumping up against the ulimit hard limit issue in EKS Fargate (see also: #1013), which is rather painful because the whole point of Fargate is to abstract away node complexity. Now I'm wishing I were embracing that managed node complexity, because it would at least provide me an escape hatch.

@yanivpaz

Is it planned to support the AWS AMI builder (awslabs/amazon-eks-ami#548)?

@mikestef9
Contributor

It's been a while since any activity on this, and I feel our release process has become much more reliable. One outstanding item is #734, but otherwise I'll close this issue soon if there aren't any other outstanding requests.

@rajakshay

In the Hashicorp EKS Terraform module, we filter AMI IDs by name. [...] in AWS region us-west-2 there are 2 AMIs with the name amazon-eks-gpu-node-1.13-v20190614. [...] if it were properly automated, this should be impossible.

@max-rocket-internet Would it be an acceptable approach to tackle this specific problem by making the AMI names unique, adding hours-min-sec to the AMI name? awslabs/amazon-eks-ami#551 (comment)

@anish

anish commented Sep 9, 2021

I'm still seeing inconsistencies in the release process that feel very related:

awslabs/amazon-eks-ami#662
awslabs/amazon-eks-ami#707 (comment)
awslabs/amazon-eks-ami#755

@anish

anish commented Oct 11, 2021

More issues with the current release
awslabs/amazon-eks-ami#780

I understand bugs and mistakes happen, but at this point it really feels like, despite what was said in #319 (comment) by @mikestef9, the "release process" is more of a checklist item here that no one really pays attention to.

@stevehipwell

stevehipwell commented Oct 11, 2021

@anish I'm in agreement with you.

The CHANGELOG is still showing the wrong K8s versions for the last two releases, despite it being reported and an AWS engineer responding.

I think the CHANGELOG is manually generated after a release, but I might be wrong about this; the above linked PR suggests as much.

@stevehipwell

The current documentation at https://docs.aws.amazon.com/eks/latest/userguide/eks-linux-ami-versions.html is missing the last release, which happened 10 days ago and included version bumps to K8s (amongst other things). The quoted versions of add-on images such as kube-proxy are incorrect in the docs too. It's got to the point where the documentation has to be considered incorrect and everything needs looking up via the CLI.

Come on AWS, this isn't good enough; the docs should be created and published as part of the release process.
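
For anyone else stuck doing that lookup, something like this works for the add-on versions (a sketch; adjust the Kubernetes version):

# Ask EKS directly which kube-proxy add-on versions pair with a
# Kubernetes version, instead of trusting the docs page.
aws eks describe-addon-versions \
  --kubernetes-version 1.21 \
  --addon-name kube-proxy \
  --query 'addons[].addonVersions[].addonVersion' \
  --output text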

@akestner

Apologies for the confusion @stevehipwell. We're looking into how we can improve the consistency between our AMI releases and the documentation. In the meantime, the documentation changes with the latest EKS AMI info will be published shortly.

@stevehipwell

@akestner I can see you got the release notes out for v20211117, thank you. However, something has happened to make v20211109 show up as the "latest" version over v20211117, despite the dates on the releases being correct (I'm not even sure how you can do that in GH).

@anish

anish commented Nov 30, 2021

@akestner The changelog also seems to have been published wrong. The v20211117 release notes claim an update to dockerd: 20.10.7-5.amzn2, a change which happened 8 days after the v20211117 release.
This is not in the v20211109 release either, which as pointed out above happened in the wrong chronological order. The changelog for v20211117 also claims to use the 2021-11-10 binaries when really only v20211109 uses them.

@stevehipwell

@anish I suspect that the commits in this repo aren't actually tied to the release process and are mirrored here from a private internal repo.

@anish

anish commented Nov 30, 2021

@stevehipwell quite likely, since the changelogs seem to match commits, just not the release tags.

@booleanbetrayal

booleanbetrayal commented Dec 17, 2021

Wanted to report it here in addition to nagging AWS directly, but we are seeing Node failures on 4.14.252-195.483.amzn2.x86_64 while running EKS Fargate with Fluent-Bit log streaming enabled. Nodes eventually become NotReady, with the Kubelet failing to be available. This results in Running Pods that fail to respond until the Node is culled by K8s based on taints.

We did not author this AWS forum post, but it describes our issue precisely. Once we remove the aws-logging ConfigMap from our cluster, we no longer see these sporadic issues.

IMO, this points to an issue with the AMI shipping kernel 4.14.252-195.483.amzn2.x86_64 (possibly introduced in the 481 patch version), which implies a lack of real-world testing, or inadequate observation of and response to failures on Nodes deployed with this AMI.

@anish

anish commented Nov 9, 2022

I hadn't seen this issue reappear for a year, but it's back again with https://github.com/awslabs/amazon-eks-ami/releases/tag/v20221104

@whereisaaron

@anish the most recent issue with v20221104 can take many days or weeks to appear, in our experience; it's fairly random, and some nodes go a week or more without the kernel bug triggering. So I am less surprised AWS did not spot this one ahead of time.

On Nov 9 AWS sent Operational Notifications, with references, to anyone running the bad AMI. So I thought that was relatively well handled. What could have been better:

  • The Operational Notification should have referenced the GitHub issue and GitHub release notes
  • The problem was noted on 30 Oct, so it took 10 days to notify users what was causing their problems; could that have been faster?
  • The article did not say how to tell which AMI your nodes are running and which dated version it maps to
Subject: [Action Required] EKS Recalled Optimized AMI Release Version [AWS Account: XXXX] [AP-REGION-X]

Hello,

We have identified that you are using the recalled EKS optimized AMI release version [v20221027],
which may cause nodes to go into NotReady state. We strongly recommend that you upgrade to
the latest EKS optimized AMI release version [v20221101].

The use of dated versions is non-obvious unless you dig into AMI IDs, so I made a small script to make these checks easier; it lists the dated versions for the AMIs present in a cluster.

#!/bin/bash
set -e -u -o pipefail

#
# Get the version names for node AMIs as listed in AMI releases
# https://github.com/awslabs/amazon-eks-ami/releases
#

: ${CLUSTER:=${1?"Must specify cluster name"}}

# List the image IDs used by the cluster's nodegroups, de-duplicate them
# (sort -u, since uniq alone only collapses adjacent duplicates), then map
# each ID to its image location, which embeds the dated release name.
eksctl get nodegroup --cluster="$CLUSTER" --output=json \
  | jq -r '.[].ImageID' \
  | sort -u \
  | xargs aws ec2 describe-images --image-ids \
  | jq -r '.Images[] | .ImageId + "\t" + .ImageLocation'

# end
$ ./get-aws-eks-node-ami-versions.sh mycluster
ami-049650977d1fedf56   amazon/amazon-eks-node-1.21-v20221104
ami-0ef131285994f51a7   amazon/amazon-eks-node-1.21-v20221027

@anish

anish commented Nov 16, 2022

@whereisaaron I was more specifically talking about this: awslabs/amazon-eks-ami#1093 (comment),
which got fixed many days after the issue was closed: https://github.com/awslabs/amazon-eks-ami/pull/1095/files

The changelog, release and Makefile being constantly out of sync with each other is an ongoing issue that had not been seen for a while:
#319 (comment)
#319 (comment)
#319 (comment)

@anish

anish commented Jul 24, 2023

The last release was messed up again:
https://github.com/awslabs/amazon-eks-ami/releases/tag/v20230711 points to awslabs/amazon-eks-ami@a819875,

which is a few commits behind the actual release: awslabs/amazon-eks-ami#1357

@anish

anish commented Jul 24, 2023

Fixed. Thanks for the quick turnaround, @cartermckinnon.

@tabern
Contributor

tabern commented May 4, 2024

@max-rocket-internet it's been a while now, and the EKS team has made a lot of progress on this issue, including resolving/merging all of the original issues in your post.

The team has a well-oiled release process at this point, with weekly releases since December 2023 (see https://github.com/awslabs/amazon-eks-ami/releases).

Given this progress, I'm going to close this issue. If there are future suggestions to improve the EKS AMI, they can be opened as new issues on this roadmap.

- Nate
