Issues with "software version consistency" feature #2394
Comments
same |
FWIW today I've encountered a production incident after updating to
This was a surprising error to see, given that the only change on our side we can attribute it to is the agent version upgrade 🤷, and it feels similar enough to be worth a mention given the digest in the error message. This seemed to be isolated to a small fraction of our cluster instances (all running 1.83.0), and tasks from the same task revisions that yielded the error eventually phased in without intervention. I've also happened to notice that aws/amazon-ecs-agent#4181 intends to help augment these kinds of errors with some more useful context and made it into agent release EDIT: didn't touch the |
This has also caused production issues for my org. We use the since 1.83.0 |
I still see this error on ecs-agent 1.84.0. |
We have production issues with the change too, when the tag is re-used for a new image layer and the old image is deleted. |
I'm also seeing the issue where a newly pushed and tagged "latest" image is being ignored and the agent will only use the older untagged instance. This needs to be fixed ASAP or at least give us a workaround. I'm seeing this behavior on agent 1.83.0. This was not happening on 1.82.1. |
We are also seeing this issue in our environment. |
FWIW, this also impacts the ECS APIs, specifically https://www.reddit.com/r/aws/comments/1dtgc4b/mismatching_image_uris_tag_vs_sha256_in_listtasks/ It's unclear whether the source of truth (and the root cause) is the agent or the APIs themselves, but I thought it was worth noting. |
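One way to observe the mismatch described in that Reddit thread is to compare the image fields that DescribeTasks returns for a running task; a minimal boto3 sketch, with placeholder cluster and task identifiers:

```python
import boto3

ecs = boto3.client("ecs")

# Compare what the ECS API reports for a running task's containers:
# `image` carries the reference from the task definition (tag or digest),
# while `imageDigest` carries the digest the agent reported after the pull.
resp = ecs.describe_tasks(
    cluster="my-cluster",  # placeholder cluster name and task ARN
    tasks=["arn:aws:ecs:us-east-1:123456789012:task/my-cluster/0123456789abcdef"],
)
for task in resp["tasks"]:
    for container in task["containers"]:
        print(container["name"], container.get("image"), container.get("imageDigest"))
```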
Found this issue after an internal investigation of an incident that seems likely related to this. If it helps anyone else, here's my analysis of how this impacted a service that was referencing an ECR image by a persistent image tag that we were regularly rebuilding and overwriting, with automation in place for deleting the older untagged images. I have an open support case with AWS to confirm this behaviour and have included a link to this GitHub issue.
```mermaid
sequenceDiagram
participant jenkins as Jenkins
participant cloudformation as Cloudformation
participant ecs-service as ECS Service
participant ec2-instances as EC2 Instances
participant ecr-registry as ECR Registry
participant docker-base-images as Docker Base Images<br />firelens sidecar image
participant ecr-lifecycle-policy as ECR Lifecycle Policy
jenkins ->> cloudformation: regular deployment
cloudformation ->> ecs-service: creates a new "deployment" for the service
activate ecs-service
note right of ecs-service: ECS resolves the image hash<br />at time of "deployment" creation
ecs-service ->> ec2-instances: starts tasks with resolved image hashes
ec2-instances ->> ecr-registry: pulls latest image from ECR
docker-base-images ->> ecr-registry: rebuild and push image regularly
ecr-lifecycle-policy ->> ecr-registry: deletes older images periodically
note right of ecs-service: periodically, new tasks need to start
ecs-service ->> ec2-instances: starts tasks with previously resolved image hashes
ec2-instances ->> ecr-registry: attempts to run the same image hash from earlier<br />if the image already exists on the instance, it's fine<br />otherwise, it needs to pull from ECR again and may fail
ec2-instances ->> ecs-service: tasks fail to launch due to missing image
note right of ecs-service: at this point, the service is unstable<br />might have existing running tasks<br /> but it can't launch new ones
create actor incident as Incident responders
ecs-service ->> incident: begin investigation
note left of incident: "didn't this happen the other day<br />for another service?" *checks slack*
note left of incident: Yeah, it did happen, and the outcome<br />was that we disabled the ECR lifecycle<br />policy, but services were left with<br />the potential to fail when tasks cycle
incident ->> jenkins: trigger replay of latest production deployment early and hope that fixes the issue
jenkins ->> cloudformation: deploy
cloudformation ->> incident: "there are no changes in the template"
incident ->> jenkins: disable the sidecar to get the service up and running again quickly and buy more time for investigation
jenkins ->> cloudformation: deploy with sidecar disabled
deactivate ecs-service
cloudformation ->> ecs-service: create new deployment without sidecar
activate ecs-service
note right of ecs-service: no longer cares about firelens sidecar image
ecs-service ->> ec2-instances: starts new tasks
ec2-instances ->> ecs-service: success
ecs-service ->> incident: service is up and running again, everyone is happy
note left of incident: "but we're not done yet"
incident ->> jenkins: re-enable the sidecar
jenkins ->> cloudformation: deploy with sidecar enabled
deactivate ecs-service
cloudformation ->> ecs-service: create new deployment with sidecar
activate ecs-service
note right of ecs-service: ECS resolves the image hash<br />at time of "deployment" creation
ecs-service ->> ec2-instances: start new tasks
ec2-instances ->> ecr-registry: pulls new images with updated hash
ec2-instances ->> ecs-service: success
ecs-service ->> incident: service is stable again
note left of incident: This service looks good again now<br />but other services might still have a problem
deactivate ecs-service
incident ->> ecs-service: work through "Force New Deployment" for all services in all ecs clusters & accounts
note left of incident: all services are now expected to be<br />stable, as everything should be<br />referencing the latest firelens image<br />hash, and the lifecycle policy<br />to delete older ones is disabled
```
|
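For context, the ECR lifecycle-policy automation shown in the diagram is typically a rule along these lines; the repository name and retention window below are illustrative assumptions, not details from the incident:

```python
import json

import boto3

ecr = boto3.client("ecr")

# A common cleanup rule: expire untagged images a week after they are pushed.
# Combined with a mutable tag and ECS's new digest pinning, this is exactly the
# interaction that can leave a service revision pointing at a deleted image.
lifecycle_policy = {
    "rules": [
        {
            "rulePriority": 1,
            "description": "Expire untagged images after 7 days",
            "selection": {
                "tagStatus": "untagged",
                "countType": "sinceImagePushed",
                "countUnit": "days",
                "countNumber": 7,
            },
            "action": {"type": "expire"},
        }
    ]
}

ecr.put_lifecycle_policy(
    repositoryName="firelens-sidecar",  # hypothetical repository name
    lifecyclePolicyText=json.dumps(lifecycle_policy),
)
```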
This issue most probably comes from aws/amazon-ecs-agent#4177 merged in
|
Downgrading to 1.82.4 in our case does not make the issue go away, indicating that, even if it was related to the agent, the digest information is now somehow cached by ECS. We are currently using a DAEMON ECS service. According to a recent case opened with AWS support, "ECS now tracks the digest of each image for every service deployment of an ECS service revision. This allows ECS to ensure that for every task used in the service, either in the initial deployment, or later as part of a scale-up operation, the exact same set of container images are used." They added that this is part of a rollout that started in the last few days of June and is supposed to complete by Monday. Their suggested solution is to update the ECS service with "Force new deployment" to "invalidate" the cache. If you have AWS support, try to open a case including this information to see how they evaluate your issue. |
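For anyone reaching for that workaround, it boils down to a single API call (the CLI equivalent is `aws ecs update-service --force-new-deployment`); a minimal boto3 sketch with placeholder names:

```python
import boto3

ecs = boto3.client("ecs")

# Forcing a new deployment creates a fresh deployment (and TaskSet), which
# re-resolves any mutable tags to the digests currently in the registry.
ecs.update_service(
    cluster="my-cluster",    # placeholder cluster and service names
    service="my-service",
    forceNewDeployment=True,
)
```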
I got a similar response to @sjmisterm in my support case, confirming the new behaviour is expected and stating that we should no longer delete images from ECR until we're certain they are no longer in use by any deployment. This change effectively means that ECR lifecycle policies which delete untagged images can be expected to cause outages unless, every time an image is deleted, additional steps are taken immediately to redeploy every deployment that was referencing a mutable tag. This is particularly problematic for my specific use-case, where we were referencing a mutable tag for a sidecar container that we include in many services. I've asked whether there are any future roadmap plans to make this use-case easier to manage, and requested a comment from AWS on this github issue 😄 |
AWS has confirmed this is definitely caused by them, and they consider it a good feature, as the link (made available yesterday) shows: https://aws.amazon.com/about-aws/whats-new/2024/07/amazon-ecs-software-version-consistency-containerized-applications/ There's no way to turn off this new behaviour, which completely breaks the easiest workflow for blue-green deployments - I'm sure plenty of people have other cases that need or benefit from the old one. I suggest that everyone who has AWS support file a case and request an API to turn this off by service / cluster / account. |
Hello. I am from the AWS ECS Agent team. As shared by @sjmisterm above, the behavior change that customers are seeing is because of the recently released Software Version Consistency feature. The feature guarantees that the same images are used for a service deployment by recording the image manifest digests reported by the first launched task and then overriding tags with digests for all subsequent tasks of the service deployment. Currently there is no way to turn off this feature. ECS Agent v1.83.0 included a change to expedite the reporting of image manifest digests, but older Agent versions also report digests, and the ECS backend will override tags with digests in both cases. We are actively working on solutions to fix the regressions our customers are facing due to this feature. |
One of the patches we are considering is - instead of overriding |
@amogh09 , I can't see how this would address the blue-green scenario. Could you explain it, please? |
@sjmisterm Can you please share more details on how this change is breaking blue-green deployments for you? |
@amogh09 , sure. Our blue-green deployments work by deploying a new image to the ECR repo tagged with latest and then launching a new EC2 instance (from the ECS-optimized image, properly configured for the cluster) while we make sure the new version works as expected in production. Then we start to progressively drain the old tasks until only new tasks remain. |
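For readers unfamiliar with that pattern, the "progressively drain the old tasks" step is typically driven by marking the old container instances DRAINING; a rough boto3 sketch, with a hypothetical cluster name and the instance-selection logic left out:

```python
import boto3

ecs = boto3.client("ecs")
cluster = "my-cluster"  # hypothetical cluster name

# Identify the container instances running the old version (the filtering logic
# is omitted here) and mark them DRAINING so ECS stops placing new tasks on them
# and replaces their existing service tasks elsewhere.
old_instances = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]

# UpdateContainerInstancesState accepts at most 10 instances per call.
for i in range(0, len(old_instances), 10):
    ecs.update_container_instances_state(
        cluster=cluster,
        containerInstances=old_instances[i : i + 10],
        status="DRAINING",
    )
```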
@amogh09 in summary: the software version "inconsistency" is what makes blue green a breeze with ECS. Should we want consistency, we'd use a digest or a version tag. |
@sjmisterm The deployment unit for an ECS service is a TaskSet. The software version consistency feature guarantees image consistency at the TaskSet level. In your case, how do you get a new task placed on the new EC2 instance? The new task needs to be part of a new TaskSet to get the newer image version; if it belongs to the existing TaskSet, it will use the same image version as its TaskSet. ECS supports blue-green deployments natively at the service level if the service is behind an Application Load Balancer. You can also use the External deployment type for even greater control over the deployment process. The Software Version Consistency feature is compatible with both of these. |
@amogh09 I use a network load balancer and the LDAP container instances I'm running will not respond well to this new model. If I can't maintain the ability to pull the tagged latest image, I will have to stop using ECS and manage my own EC2s, which frankly would be painful. Looking at the ECS API, what would happen if I called DeregisterTaskDefinition and then RegisterTaskDefinition? Would that have the effect of forcing ECS to resolve the digest from the new latest image without killing the running tasks? |
@amogh09 , I think we're talking about different things. Until the ECS change, launching a new ECS instance properly configured for an ECS daemon service whose taskdef is tagged with :latest would launch the new task with, well, the image tagged latest. Now it launches it using the digest resolved by the first task unless you force a new deployment of your service. Our deployment scripts pre-date CodeDeploy and the other features, so all your suggestions require rewriting deployment code because of a feature we can't simply opt out of. |
I understand the frustrations you all are sharing regarding this change. I request you to contact AWS Support for your issues. Our support team will be able to assist you with workarounds relevant to your specific setups. |
@amogh09 , a simple API flag at the service / cluster / region / account level would solve the problem. That's what we're trying to get across, because this disturbs your customer base - not everyone pays for support, and the old behaviour, as you can see, is relied on by many of them. |
I'll chime in that we were negatively impacted by this change as well, and I don't think it helps anything for most scenarios. Before, customers effectively had a choice: they could either enforce software version consistency by using immutable tags (https://docs.aws.amazon.com/AmazonECR/latest/userguide/image-tag-mutability.html), or if they wanted to allow for a rolling release (most useful for daemon services as @sjmisterm alluded to) they could achieve that as well by using a mutable tag. Now, this option is gone with nothing to replace it, and very poor notification that it was going to happen to boot. |
I'm very disappointed with AWS on two counts:
|
I know that the circumstances around how we all got notified about this change aren't ideal, but is there anywhere we can be proactive and follow along for similar updates that may affect us in the future? Did folks get a mention from their AWS technical account managers or similar? I lurk around the containers roadmap fairly often, but don't see an issue/mention there, or in any other publicly-facing AWS github project, about this feature release. |
@scott-vh the problem is that this is an internal API change; the ECS backend behaves differently now. This has nothing to do with ecs-agent itself; regardless of version you will get the same behaviour. No one could see it coming |
@dg-nvm Yep I got that 👍 I was just curious if there was any breadcrumb anywhere else for which we could've seen this coming (sounds like no, but wanted to see if anyone who interfaces with TAMs or ECS engineers through higher tiers of support got some notice) |
@scott-vh our TAM was informed about the problem, but I don't know if there was any proposal. Given that I see ideas for workarounds accumulating, I would say no :D Luckily our CD was not impacted by this. I can think of scenarios where daemon deployments are easier using mutable tags, especially since ECS does not play nicely when replacing daemons. Sometimes they get stuck because they were removed from the host and something else was put in their place in the meantime :) |
Hi all, I have transferred this issue into the containers-roadmap repo. As far as I understand it, people are experiencing issues with this feature as a whole, rather than an issue with the ECS agent behavior specifically. For reference, see what's new post: https://aws.amazon.com/about-aws/whats-new/2024/07/amazon-ecs-software-version-consistency-containerized-applications/ Please feel free to continue adding your +1 and providing feedback :) |
This issue is affecting us as well. We utilize an initialization container that runs before the app container. This init container sets up monitoring integrations and settings that are not critical to the app itself, but with a limited team we rely on mutable tags to handle the "rolling" update as tasks are restarted. Forcing an application deployment of every application my team manages for these small config updates would be an impossible task. Is there any way at all to prevent this "consistency" feature for a single container, or to disable it entirely at the task level? It seems like this problem was already solved by tag immutability, which gave us the option to use mutable tags when we actually needed that behavior. |
This regression caused a minor production outage for us because AWS' monitoring tools like X-Ray recommend using mutable tags, which means that if any of those has a release outside of your deployment cycle, you are now set up to have all future tasks fail to deploy because you followed the AWS documentation:
I think this feature was a mistake and should be reverted – there are better ways to accomplish that goal which do not break previously-stable systems, and immutable tags are not suitable for every situation, as evidenced by the way the AWS teams above are using them – but if the goal is to get rid of mutable tags, it should follow a responsible deprecation cycle with customer notification, warnings when creating new task definitions, a period where new clusters cannot be created with the option to use mutable tags in tasks, etc., since this is a disruptive change which breaks systems that have been stable for years and there is no justification for breaking compatibility so rapidly. |
We are also having an issue with this. Our development environment is set up with all the services on a certain tag, which keeps us from having to redeploy: developers can simply stop the service and it comes back up with the most current image for that tag. Now they have to update the service, which is more steps than needed. This also seems to be a problem with our Lambdas that spin up Fargate tasks; those tasks are no longer the most current version of the tag. Updating the service is not an option for these, so we are still trying to work that out. |
The strangest thing is that this capability was already available for those who wanted it: you can specify the container image by digest, which pins the image explicitly, with no changes required to ECS. floating potentially inconsistent -> |
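The contrast being drawn there, with placeholder registry, repository, and digest values:

```python
# Floating reference: a mutable tag that may point to different images over time.
IMAGE_BY_TAG = "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest"

# Pinned reference: a digest identifies exactly one image manifest, so the task
# definition itself guarantees consistency with no ECS-side pinning required.
# ("ab" * 32 is just a placeholder 64-character hex digest.)
IMAGE_BY_DIGEST = (
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app@sha256:" + "ab" * 32
)
```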
Also had an issue with one of our sites which I believe is related to this: a container pulling from an ECR repository with a lifecycle policy, an EC2 instance restart - and ECS wants to pull the non-existent old image, as there hasn't been a fresh deploy of the container for weeks. Version consistency is a fantastic feature, but there are situations where I want the tag to be used rather than the image digest resolved at the last deploy. |
Sorry for the late response on this thread - we're aware of the impact this change has had and apologize for the churn this rollout has created. We've been actively working through the set of issues that have been highlighted on this thread and have 2 updates to share: 1/ For customers who've been impacted by the lack of ability to see image tag information, we're working on a change that will bring back image tag information in the describe-tasks response, in the same format as was available prior to the release of version consistency (i.e
I can concur that this "software version consistency" change to ECS renders the concept of services totally useless for us. We may have to fall back to manually deployed tasks (without services), but then we'll lose the watchdog aspects, which we would have to re-implement ourselves. In short, we need to guarantee a few properties for our services running background jobs:
These, combined with the new constraint that all of the tasks within a service need to have the same image digest, mean that we cannot roll out any update to our containers without breaking at least one property. Tbh, this feels like we may want to switch to a plain k8s solution where we can set up and manage our workloads with some degree of flexibility. Hopefully an opt-out will be available soon, as mentioned above, but we are stuck with our deployments atm and need a solution asap. |
The forced addition of this feature also caused a significant production incident for us. We deliberately used mutable tags as part of our deployments, and an ECR lifecycle policy to remove the old untagged images after a period. This should absolutely have been an opt-in feature, or opt-out but disabled for existing services. I'm glad to see that's now been identified and raised, but should this feature not be reverted until that option is available, to prevent everyone affected from having to redesign workflows or implement workarounds? As others have pointed out, those who want consistency by container digest can already achieve it through either tag immutability or referring to the digest explicitly in the task definition. |
A quick update on @vibhav-ag's post. We have now completed the first action in his comment. Amazon ECS no longer modifies the |
This almost knocked down our production environment; it did knock down stage, because we had been treating our ECR images as Turns out to be not hard for us to switch to only But wow, this breaking change hit us out of left field, and it should probably have been listed as We got hit during a |
This has impacted us as well. We use the equivalent of a mutable 'latest' tag and perform rolling service upgrades when we move the 'latest' tag. This lets us slowly do blue/green deployments (as our service can be told to recycle itself over time). Instead, we weren't actually progressing our blue/green deployment, as AWS kept deploying the old revision of the service rather than the one pointed to by our mutable tag. Even replacing the EC2 instance didn't fix it; only re-running the service deployment did the trick. This is a massive behavior change and should never have been released without opting into the change. |
Fargate normally does that health-based deployment, but that won't help you if the old containers can't continue running due to a failure in the container or host. That's one of the reasons this mistake was so dangerous: unless you monitor the ECS service events, you have a service which works normally until a previously-recoverable error occurs, and then you learn that the ECS team broke your deployment in July when something is completely down. What I ended up with is an EventBridge rule which listens to the ECR Image Action event for our source repositories and a Lambda listener which creates a new ECS service deployment, ensuring there's never a situation where our ECR tags are updated but ECS is still looking for the old version (we use environment-tracking branches & tags, so the latest version is something like “testing” or “staging”). That isn't enough to avoid problems with Amazon's own containers, however, so our deployment pipeline now does a lookup for CloudWatch and X-Ray to get the current versioned tag which |
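A minimal sketch of the Lambda side of that setup, assuming an EventBridge rule on ECR "ECR Image Action" events and a hypothetical mapping from repositories to the services that consume their mutable tags:

```python
import boto3

ecs = boto3.client("ecs")

# Hypothetical mapping from ECR repository name to the ECS services that
# consume its mutable (environment-tracking) tag.
REPO_TO_SERVICES = {
    "my-app": [("my-cluster", "my-app-service")],
}

def handler(event, context):
    """Handle an EventBridge 'ECR Image Action' event.

    On a successful push, force a new deployment so the service revision
    re-resolves the tag to the freshly pushed digest rather than the stale
    one it recorded at the previous deployment.
    """
    detail = event.get("detail", {})
    if detail.get("action-type") != "PUSH" or detail.get("result") != "SUCCESS":
        return
    for cluster, service in REPO_TO_SERVICES.get(detail.get("repository-name"), []):
        ecs.update_service(cluster=cluster, service=service, forceNewDeployment=True)
```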
Just to add another use-case... I have an app with very long-running background processes. This app is not deployed with Instead, all instances are marked DRAINING so that new instances are created with the updated container image. Because the service revision is never updated with the new sha, the new instances pull down an old container image. Oddly, there's no way to update the service revision with the new sha without triggering an actual deploy. This is the missing piece for me. I need a way to update the sha stored in the service revision without triggering a deployment. Something like
|
If all containers in the task are opted out, will this remove the latency impact associated with this feature as well? I have a service that's updated with great frequency and whose deployments are latency sensitive. Two of the three containers in this service are already deployed with digests, but the third uses a tag because its image is built/published by CDK, and it's not possible to get access to the digest of such images to use in task definitions. So I believe we are stuck with the latency impact of this feature for mostly no reason: the image tags produced by CDK already approximate the behavior of digests, in that the image should not change if the tag is not changing. I'm specifically referring to this line from the documentation:
|
Update 2: you now have the ability to disable consistency for specific containers in your task by configuring the new versionConsistency field for each container in the task definition. Any changes to this property are applied after a deployment. Once again, we regret the churn this change has caused you all. |
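A rough sketch of what that per-container opt-out might look like when registering a task definition with boto3; the family, container names, and images are placeholders, and the lowercase "disabled" value is an assumption about the field's accepted values, so check the current RegisterTaskDefinition docs:

```python
import boto3

ecs = boto3.client("ecs")

# Hypothetical task definition: the app container keeps the default behavior,
# while the sidecar opts out of digest pinning so its mutable tag is
# re-resolved whenever a new task starts.
ecs.register_task_definition(
    family="my-app",
    requiresCompatibilities=["EC2"],
    containerDefinitions=[
        {
            "name": "app",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:1.2.3",
            "memory": 512,
            "essential": True,
        },
        {
            "name": "firelens-sidecar",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/firelens:latest",
            "memory": 128,
            "essential": False,
            "versionConsistency": "disabled",
        },
    ],
)
```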
Yes, if you opt out every container, you will see no impact to deployment latency because of digest resolution. |
EDIT: this is related to the "software version consistency" feature launch, see What's New post: https://aws.amazon.com/about-aws/whats-new/2024/07/amazon-ecs-software-version-consistency-containerized-applications/
Summary
Since our EC2 instances upgraded to ecs-agent v1.83.0, the images used for containers are referenced by SHA digest and not by image tag.
Description
We started getting a different image value for the '{{.Config.Image}}' property when using docker inspect on our ECS EC2 instances.
We are getting the SHA digest as the .Config.Image instead of the image tag.
The task definition contains the correct image tag (not the digest).
We need the image tag, since we rely on that custom tag to understand what was deployed. What can be done?
Expected Behavior
We expect to see the image tag used for the container.
Observed Behavior
We see the image digest used for the container.
Environment Details
{"Cluster":"xxxxr","ContainerInstanceArn":"xxx","Version":"Amazon ECS Agent - v1.83.0 (*xxx"}