Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[deliver] increase chances of success when creating a new app version even when Apple servers are degraded #21742

Merged
merged 9 commits into from
Feb 9, 2024

Conversation

rogerluan
Copy link
Member

@rogerluan rogerluan commented Dec 20, 2023

Checklist

  • I've run bundle exec rspec from the root directory to see all new and existing tests pass
  • I've followed the fastlane code style and run bundle exec rubocop -a to ensure the code style is valid
  • I see several green ci/circleci builds in the "All checks have passed" section of my PR (connect CircleCI to GitHub if not)
  • I've updated the documentation if necessary.
  • I've added or updated relevant unit tests.

Motivation and Context

Resolves #21705
Closes #21741

Description

After submitting a new version, App Store Connect takes some time to recognize the new version and we must wait until it's available before attempting to upload metadata for it. There is a mechanism that will check if it's available and retry after a certain time. This time used to be 10 seconds, and then each retry would be after 10 seconds again, each time.

This PR changes this behavior by adding an exponential backoff to this retry mechanism, which means it will retry after 10 seconds on the first retry, and then after 20 seconds, 40, 80, 160, etc…

The default number of retries is 5. Prior to this PR you couldn't customize this value, but now you can via a new option in deliver called version_check_wait_retry_limit or via an env var DELIVER_VERSION_CHECK_WAIT_RETRY_LIMIT. This option takes an integer which represents the max number of retries before giving up.

The initial wait time remains at 10 seconds and it's not customizable. We decided to keep it this way instead of increasing the initial wait time so that the experience when Apple's servers are quick isn't jeopardized.

Click here to visualize scenario planning so you understand how this strategy works and how it compares to other strategies considered

Prior to this PR, the wait time + number of retries would be:

1st attempt: 10 seconds
2nd attempt: 10 seconds
3rd attempt: 10 seconds
4th attempt: 10 seconds
5th attempt: 10 seconds
Total: 50 seconds

If we simply increased the initial waiting time, to let's say 60s, like seen in the proposal #21741, then this would mean:

1st attempt: 60 seconds
2nd attempt: 60 seconds
3rd attempt: 60 seconds
4th attempt: 60 seconds
5th attempt: 60 seconds
Total: 300 seconds

In this PR, we adopt a strategy that provides faster wait times if Apple's servers aren't too slow:

1st attempt: 10 seconds
2nd attempt: 20 seconds
3rd attempt: 40 seconds
4th attempt: 80 seconds
5th attempt: 160 seconds
Total: 310 seconds

So if the Apple servers take, say, 150 seconds to process the new app version you just submitted, in current master it would timeout and your CI would fail/crash. In the 2nd proposal above, it would succeed in the 3rd attempt after waiting 180 seconds, and in this PR's proposal (last example above) it would succeed in the 4th attempt after waiting 150 seconds.

If it took 305 seconds, it would fail in the first 2 strategies, but it'd succeed in the 5th attempt using this strategy.

If it ends up taking longer than 310 seconds, you can customize the number of retries to e.g. 6, 7, 8… and it will eventually succeed (after waiting a long time 😅 dang it, Apple!)

I hope this is a decent hotfix for this annoying intermittent issue caused by Apple's servers.

Testing Steps

To test this branch, modify your Gemfile as:

gem 'fastlane', git: 'https://github.com/fastlane/fastlane.git', branch: 'rogerluan/add-exponential-backoff-to-retry'

And run bundle install to apply the changes.

Before, you should see something like this when attempting to create a new app version:

[17:42:54]: Cannot find edit app store version... Retrying after 10 seconds (remaining: 4)
[17:43:06]: Cannot find edit app store version... Retrying after 10 seconds (remaining: 3)
[17:43:17]: Cannot find edit app store version... Retrying after 10 seconds (remaining: 2)
[17:43:27]: Cannot find edit app store version... Retrying after 10 seconds (remaining: 1)
[17:43:38]: Cannot find edit app store version... Retrying after 10 seconds (remaining: 0)

Now it should wait longer before each retry, using a exponential backoff strategy, for instance:

[17:42:54]: Cannot find edit app store version... Retrying after 10 seconds (remaining: 4)
[17:43:06]: Cannot find edit app store version... Retrying after 20 seconds (remaining: 3)
[17:43:17]: Cannot find edit app store version... Retrying after 40 seconds (remaining: 2)
[17:43:27]: Cannot find edit app store version... Retrying after 80 seconds (remaining: 1)
[17:43:38]: Cannot find edit app store version... Retrying after 160 seconds (remaining: 0)

Except that it should now succeed before it runs out of retry attempts, since the wait is longer.

However, even if it doesn't, you can increase the number of retries by passing version_check_wait_retry_limit option or DELIVER_VERSION_CHECK_WAIT_RETRY_LIMIT env var.

@nwainwright
Copy link

@rogerluan I get ...

Unable to parse YAML
mapping values are not allowed here
in 'string', line 21, column 33:
- run: gem 'fastlane', git: 'https://github.com/fastlane/f ...

@rogerluan
Copy link
Member Author

@nwainwright are you trying to edit a yaml file? Like the GitHub Actions config file? That's not where you're supposed to put that line, it's in your Gemfile, as explained in the PR description 😅 if you don't have one, you can't test this PR without creating one (see bundler.io to learn how to configure)

I hope this helps 🙏

@nwainwright
Copy link

@rogerluan thanks and yes, I'm on it, sorry for the miss...I'm not that experienced with fastlane and I've only had one mug of caffeine so far today :)

@rogerluan
Copy link
Member Author

Haha no worries at all, we're all here to learn! Let me know if you need any further help 🤗

@nwainwright
Copy link

Hi @rogerluan I replaced the following line in my Gemfile from...

gem 'fastlane'

to...

gem 'fastlane', git: 'https://github.com/fastlane/fastlane.git', branch: 'rogerluan/add-exponential-backoff-to-retry'

and I get CircleCI failing with...

/Users/distiller/.rbenv/versions/3.1.4/lib/ruby/3.1.0/forwardable.rb:236:in get_profiles': \e[31m[!] undefined method get_profiles' for nil:NilClass (NoMethodError)

  unless defined?(methods.each)

@rogerluan
Copy link
Member Author

@nwainwright are you sure your setup is working if you point fastlane back to the latest version (in other words: revert to gem 'fastlane', run bundle install, and re-run)? Your description plus some code review makes it look like your Fastfile is misconfigured, missing an App Store Connect API auth (e.g. calling app_store_connect_api_key before calling your deliver command) 🤔 but can't tell for sure without reviewing your Fastfile

@nwainwright
Copy link

Hi @rogerluan yep, I reverted what you gave me and it ran just fine. I'm happy to do a Zoom with you if that'd help. neil@uphabit.com is my email and I have a Zoom account. Today is pretty busy but I have short times where I can meet. I have a release ready for the App Store and I'm holding it back to see if we can get this working without having to use Transporter.

@percula
Copy link

percula commented Dec 20, 2023

Hi, thanks for the PR! I tried it out, but unfortunately, it didn't resolve the issue for me. Not your fault though, it appears that the increasing timeouts happened as expected. Is there a way you could make these timeout values configurable in the Fastfile?

[10:03:46]: Cannot find edit app store version... Retrying after 10 seconds (remaining: 4)
[10:03:57]: Cannot find edit app store version... Retrying after 20 seconds (remaining: 3)
[10:04:17]: Cannot find edit app store version... Retrying after 40 seconds (remaining: 2)
[10:04:57]: Cannot find edit app store version... Retrying after 80 seconds (remaining: 1)
[10:06:18]: Cannot find edit app store version... Retrying after 160 seconds (remaining: 0)
[10:08:58]: Cannot update languages - could not find an editable version for 'IOS'

@nwainwright
Copy link

@percula what about increasing the retry count to 10?

@percula
Copy link

percula commented Dec 20, 2023

Sorry, just noticed that it is configurable. I'll give that a shot. 👍

@percula
Copy link

percula commented Dec 20, 2023

@nwainwright , it worked on try #8 this time, thanks! Perhaps we should bump up the default?

[11:53:36]: Cannot find edit app info... Retrying after 10 seconds (remaining: 9)
[11:53:46]: Cannot find edit app info... Retrying after 20 seconds (remaining: 8)
[11:54:07]: Cannot find edit app info... Retrying after 40 seconds (remaining: 7)
[11:54:47]: Cannot find edit app info... Retrying after 80 seconds (remaining: 6)
[11:56:08]: Cannot find edit app info... Retrying after 160 seconds (remaining: 5)
[11:58:49]: Cannot find edit app info... Retrying after 320 seconds (remaining: 4)
[12:04:09]: Cannot find edit app info... Retrying after 640 seconds (remaining: 3)
[12:14:50]: Cannot find edit app info... Retrying after 1280 seconds (remaining: 2)
[12:36:12]: No changes to localized App Info detected. Skipping upload.

@rogerluan
Copy link
Member Author

Thanks for confirming @percula 🙏 I'm glad it's working for you!

I'm leaning towards not changing the default max retry count because in your case, for instance, the total wait time was 2550 seconds, which's 42 minutes of idle work that are being billed by your CI provider. This might mean significant additional charges in CI cost for fastlane users without their consent. If users are facing problems, I think we should educate them in knowingly tweaking their config to allow for this increased waiting time to happen, knowing the consequences of their actions. Until then, a failure (caused by Apple 😞) sounds like the lesser evil 😬

This PR is already increasing the total wait time from 50s to 5mins, which already impacts CI cost for users, but I guess not by too much. Bumping to 42mins would also most likely hit the default max build time for many CI providers, which would result in more complaints 😅

I hope you understand 🙏

@nwainwright
Copy link

I haven't been able to get it to work yet, but as you know I'm not a fastlane or CircleCI expert

@rogerluan
Copy link
Member Author

@nwainwright can you include a full stacktrace of your crash (and perhaps the output of fastlane env) so I can investigate the issue you're seeing? Don't forget to redact any potential sensitive info before posting 🙏

@TheWirv
Copy link

TheWirv commented Dec 21, 2023

Hey @rogerluan, I am struggling with this Apple blunder as well, and have been watching this issue (#21705) with much interest, as I am your PR.

But I was wondering if increasing the timeouts exponentially is really the right way to go? As you already pointed out, for a 9th attempt, the script waits 2560 seconds, i.e. ~42 minutes. What if there's multiple time windows in those 42 minutes of pure idle waiting where the Apple servers would be able to provide a proper response?

We have set up our own build machines in-house, so we don't pay for CI time, but still, I don't to potentially wait a whole workday for a deployment to finish, when deploying a few apps with one pipeline.

Do you maybe think it might make more sense to increase the timeouts with a less steep curve?

@rogerluan
Copy link
Member Author

@TheWirv I see your point, maybe we could adjust the algorithm. Exponential backoff is pretty standard, so I just thought of the simplest solution 😃 We could, for instance, start at 10s, check every 10s and then after X (e.g. 1min) check every 60s for X amount of times (then we'd change the customizable parameter from "max retries" to "max waiting time", since I think that would make more sense. I don't wanna add 2 customizable parameters for this same thing, e.g. waiting time per attempt + number of times, there's no need for such granularity IMO and deliver is already a tool with far too many parameters 😅 )

Just notice that the 42min example isn't as bad as it sounds, that's just the max waiting time, but the last waiting time section was actually 1280s (~21mins), not 42mins 😅 and before that, the algo had waited for 1270s without any updates (again, ~21mins), so in this new proposed algo if the exact same case occurred again, we'd be doing at least 26+ checks without success, and then anywhere between 1 and 21 more checks (let's say avg ~10?) checks before it succeeds. Note that that's not necessarily a bad thing, just pointing out the difference in the algo 🙇

@lacostej @AliSoftware what do you guys think would be the best approach in this case? Looking forward to hearing more opinions 💪

@percula
Copy link

percula commented Dec 21, 2023

@rogerluan , perhaps we combine an exponential backoff with a max interval of 5 minutes.

wait_time = min(wait_time * 2, 5*60)

Then we have one configurable parameter "version_check_wait_time_limit" which is the number of minutes that the user is willing to wait. The script will retry at the exponential/max interval until it hits that time limit.

@rogerluan
Copy link
Member Author

Yeah that's a very similar solution to the one I suggested above 👍 At the end of the day we'll have to decide what's the best waiting time for the linear part of the equation (e.g. 1min, 5mins, 2, 3…)

@percula
Copy link

percula commented Dec 21, 2023

🤦 I need to avoid commenting first thing in the morning before the caffeine kicks in 😅

Personally I'd be okay with anything between 1 and 5 minutes, but I'm running locally so I don't have a CI bill to pay.

@nwainwright
Copy link

@rogerluan @percula ASC might be fixed? I just had a CircleCI push work for the first time in weeks

@rogerluan
Copy link
Member Author

rogerluan commented Dec 26, 2023

Maybe, yeah. It's been more stable for the past few days, according to my own monitoring system:

image

Copy link
Member

@crazymanish crazymanish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good! 🚀
Thank you, Roger!! 👏

@@ -89,7 +89,7 @@ def upload(options)
enabled_languages = detect_languages(options)

app_store_version_localizations = verify_available_version_languages!(options, app, enabled_languages) unless options[:edit_live]
app_info = fetch_edit_app_info(app)
app_info = fetch_edit_app_info(app, max_retries: options[:version_check_wait_retry_limit])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would consider creating an object called RetryConfig and move all fields related to the retry (max_retries, initial_wait_time) into it.

This would change the retry_if_nil method like this

    def retry_if_nil(message, retry_config:)
      wait_time = retry_config[:initial_wait_time]
      tries = retry_config[:tries]

and all other calls would pass a single parameter that could be initialized in one place.

Does that make sense?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it does, just not sure if it's worth unifying these into an object, since it's just 2 configs? I usually do these type of things once there are 3 or more parameters 👀 Do you think this would make a significant difference? In terms of e.g. UX/DX, testability, maintainability…? @lacostej

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We fetch the options[:version_check_wait_retry_limit] 5 times in the code, I suspect there should be a way to make this a bit better.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a proposal

  1. we do not add any retry related parameter to the fetch_* methods. We even remove the wait_time one
  2. we also remove the parameters from retry_if_nil, and instead, initialize the initial_wait_time from options within the implementation

this should remove most retry specific information in most of the code

  1. remains the question of the tests. I guess we pass a default value of 0.01 to ensure that they do not take too long. There we could instead stub the sleep call instead, and just ensure it gets called the right amount of times. It will even make the tests slightly faster. Stubbing sleep requires a bit more work to do right.

Copy link
Member Author

@rogerluan rogerluan Jan 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lacostej I think this would be a good approach, I'll try to come back to this later this week 🙏

If you have a clear understanding on how to implement this and wants to take a stab before I have time to get to it, that'd probably help us getting this out of the door faster as I don't think I'll have much availability in the next few days 🙇 if not, all good too! 🙏

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey team, I won't have time to look into this in the upcoming future 😢 @lacostej or anyone else, feel free to pick this up 🙏I'm also ok with merging this in as is and creating a separate task to refactor later (although I suspect it won't be picked up because it won't be a high priority 😞)

Copy link
Member

@mollyIV mollyIV Feb 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello 👋 I will take it over 😊

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The follow-up pull request that addresses requested changes: #21861 🙇

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mollyIV

Yes this is good. If you look at the overall combined PR, the code is I think cleaner.

thanks for spending the time to do this!

@AliSoftware
Copy link
Contributor

@rogerluan CircleCI responded with

[21:53:43]: The provided entity includes a relationship with an invalid value - You cannot create a new version of the App in the current state. - /data/relationships/app
The provided entity includes an attribute with a value that has already been used - The version number has been previously used. - /data/attributes/versionString

... when I sent it via CircleCI to the App Store

We've had a similar error as well (despite properly bumping the version) in our CI (Buildkite) too today, but also in some occasions in December. iirc retrying a couple minutes later made it pass.

@rogerluan
Copy link
Member Author

Interesting @AliSoftware thanks for the input! This makes me believe even more now that it's an issue on Apple's end 😢 not sure if there's something reasonable we could do on their end to fix their issues 😞

@nwainwright
Copy link

@rogerluan I just had CircleCI push all the way to the App Store without an issue. First time since mid-December. Maybe Apple is fixing App Store Connect. 😀

@nwainwright
Copy link

@rogerluan and it failed again yesterday, so the issues with ASC continue

@JulienDev
Copy link

I've been updating my app almost daily, and I've encountered an issue each time I try to publish it over the past month. The first time I run the lane, it fails, but on the second attempt, it works. This is quite similar to a previous issue we experienced, as detailed here: #16991. It would be fantastic to see this pull request merged. :)

@nwainwright
Copy link

@rogerluan success today after a number of tries...
Screenshot 2024-01-22 at 10 14 29 PM

@rogerluan
Copy link
Member Author

Left my latest update here @ everyone: #21742 (comment)

@ruairioliverwv
Copy link

@joshdholtz could we merge this one and refactor later if needed? The failed uploads are a serious problem. Thanks

@kpowelltech
Copy link

Thanks to everyone for their input and hard work on this issue! As my organization builds a massive number of apps, this new problem from Apple has become a significant and costly obstacle to our workflow.

@sepbehroozi
Copy link

Hey everyone.
Any update on this PR?
We also rely on this functionality to create new versions on the AppStoreConnect and it's been a while since this issue is causing our CI workflows to fail every time.

@LucasMonteiro1
Copy link

We have the same problem here reported by @kpowelltech and @sepbehroozi and this constant failure has cost much more than we expected. Take a look at this PR please.

@nwainwright
Copy link

@LucasMonteiro1 you can run this pre-release PR in production with the instructions provided. Sadly it still fails most times for us. I'm going to look at adding more retries with the parameters provided.

@JulianPedro
Copy link

Hey folks,

If you use Azure DevOps to build and deploy your application with fastlane, is possible split build and deliver in two different jobs, so set retryCountOnTaskFailure: 1 in deliver job, this will retry automatic deliver job when fail, is a temporary solution until this PR is merged!

This should be valid to another CI/CD tools!

* [deliver] Initialize UploadMetadata with options

* Fix linting
@lacostej lacostej merged commit 4c8f207 into master Feb 9, 2024
7 checks passed
@lacostej lacostej deleted the rogerluan/add-exponential-backoff-to-retry branch February 9, 2024 18:04
@kpowelltech
Copy link

Hey team - I noticed that we have merged the latest changes and I wanted to thank you for that. However, I have discovered a workaround that has proven to be quite effective.

Basically, I call a second Deliver step earlier in the Fastlane execution to "prime" ASC with the new version update. This workaround has significantly increased the success rate of my builds. When I don't use it, I see a spike in build failures with the error message "Cannot update languages - could not find an editable version for 'IOS'". I run all my builds through a CICD, so it's crucial that I have a high success rate.

To be more specific, I call this workaround before the primary Delivery function uploads my .ipa to ASC.

   retries = 5
    begin
      deliver(
        skip_binary_upload: true, 
        skip_screenshots: true, 
        submit_for_review: false, 
        force: true 
      )
    rescue => e
      if retries > 0
        puts "Error: #{e.message}. Retrying in 10 seconds..."
        sleep 10
        retries -= 1
        retry
      else
        raise "Failed to create version on App Store Connect after 5 attempts."
      end
    end

@rogerluan
Copy link
Member Author

@kpowelltech isn't your code essentially what this PR is doing? Did you get a chance to test this PR without that retry mechanism of yours? 😊

@alex-lechner
Copy link

Still not fixed. Left a comment in #21705:

This issue is still not fixed (at least for Github Workflow. I'm using Fastlane version 2.219.0 (latest as of now) and I still get Cannot update languages - could not find an editable version for 'IOS'

@ddramowicz
Copy link

ddramowicz commented Feb 21, 2024

Still not fixed. Left a comment in #21705:

This issue is still not fixed (at least for Github Workflow. I'm using Fastlane version 2.219.0 (latest as of now) and I still get Cannot update languages - could not find an editable version for 'IOS'

There hasn't been a new release of Fastlane since this PR was merged, so I guess we need to wait patiently for that. I have been using this branch in the meantime (or you can use master).

@TheWirv
Copy link

TheWirv commented Feb 22, 2024

@joshdholtz when could we expect a new release?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Deliver fails to create a new app version on App Store Connect