[deliver] increase chances of success when creating a new app version even when Apple servers are degraded #21742
Conversation
@rogerluan I get ... Unable to parse YAML
@nwainwright are you trying to edit a yaml file? Like the GitHub Actions config file? That's not where you're supposed to put that line, it's in your Gemfile, as explained in the PR description 😅 if you don't have one, you can't test this PR without creating one (see bundler.io to learn how to configure) I hope this helps 🙏
@rogerluan thanks and yes, I'm on it, sorry for the miss...I'm not that experienced with fastlane and I've only had one mug of caffeine so far today :)
Haha no worries at all, we're all here to learn! Let me know if you need any further help 🤗
Hi @rogerluan I replaced the following line in my Gemfile from...
to...
and I get CircleCI failing with... /Users/distiller/.rbenv/versions/3.1.4/lib/ruby/3.1.0/forwardable.rb:236:in
@nwainwright are you sure your setup is working if you point fastlane back to the latest version (in other words: revert to …)?
Hi @rogerluan yep, I reverted what you gave me and it ran just fine. I'm happy to do a Zoom with you if that'd help. neil@uphabit.com is my email and I have a Zoom account. Today is pretty busy but I have short times where I can meet. I have a release ready for the App Store and I'm holding it back to see if we can get this working without having to use Transporter.
Hi, thanks for the PR! I tried it out, but unfortunately, it didn't resolve the issue for me. Not your fault though, it appears that the increasing timeouts happened as expected. Is there a way you could make these timeout values configurable in the Fastfile?
@percula what about increasing the retry count to 10?
Sorry, just noticed that it is configurable. I'll give that a shot. 👍 |
@nwainwright, it worked on try #8 this time, thanks! Perhaps we should bump up the default?
Thanks for confirming @percula 🙏 I'm glad it's working for you! I'm leaning towards not changing the default max retry count because in your case, for instance, the total wait time was 2550 seconds, which is 42 minutes of idle work that is being billed by your CI provider. This might mean significant additional charges in CI cost for fastlane users without their consent. If users are facing problems, I think we should educate them to knowingly tweak their config to allow for this increased waiting time, understanding the consequences. Until then, a failure (caused by Apple 😞) sounds like the lesser evil 😬 This PR is already increasing the total wait time from 50s to 5mins, which already impacts CI cost for users, but I guess not by too much. Bumping to 42mins would also most likely hit the default max build time for many CI providers, which would result in more complaints 😅 I hope you understand 🙏
I haven't been able to get it to work yet, but as you know I'm not a fastlane or CircleCI expert
@nwainwright can you include a full stacktrace of your crash (and perhaps the output of …)?
Hey @rogerluan, I am struggling with this Apple blunder as well, and have been watching this issue (#21705) with much interest, as well as your PR. But I was wondering if increasing the timeouts exponentially is really the right way to go? As you already pointed out, for a 9th attempt, the script waits 2560 seconds, i.e. ~42 minutes. What if there are multiple time windows in those 42 minutes of pure idle waiting where the Apple servers would be able to provide a proper response? We have set up our own build machines in-house, so we don't pay for CI time, but still, I don't want to potentially wait a whole workday for a deployment to finish when deploying a few apps with one pipeline. Do you maybe think it might make more sense to increase the timeouts with a less steep curve?
@TheWirv I see your point, maybe we could adjust the algorithm. Exponential backoff is pretty standard, so I just went with the simplest solution 😃 We could, for instance, start at 10s, check every 10s, and then after X (e.g. 1min) check every 60s for X amount of times (then we'd change the customizable parameter from "max retries" to "max waiting time", since I think that would make more sense. I don't wanna add 2 customizable parameters for this same thing, e.g. waiting time per attempt + number of attempts; there's no need for such granularity IMO and deliver is already a tool with far too many parameters 😅 ) Just note that the 42min example isn't as bad as it sounds: that's just the max total waiting time, and the last waiting period was actually 1280s (~21mins), not 42mins 😅 and before that, the algo had already waited 1270s without any updates (again, ~21mins). So with this newly proposed algo, if the exact same case occurred again, we'd be doing at least 26+ checks without success, and then anywhere between 1 and 21 more checks (let's say avg ~10?) before it succeeds. Note that that's not necessarily a bad thing, just pointing out the difference in the algo 🙇 @lacostej @AliSoftware what do you guys think would be the best approach in this case? Looking forward to hearing more opinions 💪
@rogerluan, perhaps we could combine an exponential backoff with a max interval of 5 minutes: wait_time = min(wait_time * 2, 5*60). Then we'd have one configurable parameter, "version_check_wait_time_limit", which is the number of minutes that the user is willing to wait. The script would retry at the exponential/max interval until it hits that time limit.
Yeah that's a very similar solution to the one I suggested above 👍 At the end of the day we'll have to decide what's the best waiting time for the linear part of the equation (e.g. 1min, 5mins, 2, 3…)
🤦 I need to avoid commenting first thing in the morning before the caffeine kicks in 😅 Personally I'd be okay with anything between 1 and 5 minutes, but I'm running locally so I don't have a CI bill to pay. |
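For concreteness, here is a minimal sketch of the capped-backoff idea floated in the comments above. This is not what the PR implements; the method name is invented, and `version_check_wait_time_limit` is just the parameter name suggested in the discussion, expressed here in seconds:

```ruby
# Minimal sketch only — not the PR's implementation. The block is expected to
# return a truthy value once the new app version becomes visible on ASC.
def wait_for_edit_version(version_check_wait_time_limit: 30 * 60)
  wait_time = 10
  waited = 0
  while waited < version_check_wait_time_limit
    return true if yield

    sleep(wait_time)
    waited += wait_time
    wait_time = [wait_time * 2, 5 * 60].min # double the wait, capped at 5 minutes
  end
  false
end
```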
@rogerluan @percula ASC might be fixed? I just had a CircleCI push work for the first time in weeks |
This looks good! 🚀
Thank you, Roger!! 👏
```diff
@@ -89,7 +89,7 @@ def upload(options)
     enabled_languages = detect_languages(options)

     app_store_version_localizations = verify_available_version_languages!(options, app, enabled_languages) unless options[:edit_live]
-    app_info = fetch_edit_app_info(app)
+    app_info = fetch_edit_app_info(app, max_retries: options[:version_check_wait_retry_limit])
```
I would consider creating an object called `RetryConfig` and moving all fields related to the retry (`max_retries`, `initial_wait_time`) into it.
This would change the `retry_if_nil` method like this:

```ruby
def retry_if_nil(message, retry_config:)
  wait_time = retry_config[:initial_wait_time]
  tries = retry_config[:tries]
```

and all other calls would pass a single parameter that could be initialized in one place.
Does that make sense?
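For concreteness, one possible shape of that `RetryConfig` object — purely hypothetical, since the refactor was ultimately handled in the follow-up PR #21861 and may look different there; `options` and `fetch_edit_app_info` refer to the surrounding deliver code:

```ruby
# Hypothetical sketch of the RetryConfig idea, not the actual follow-up implementation.
RetryConfig = Struct.new(:tries, :initial_wait_time, keyword_init: true)

# Initialized in one place from the deliver options…
retry_config = RetryConfig.new(
  tries: options[:version_check_wait_retry_limit],
  initial_wait_time: 10
)

# …and every fetch_* call site passes the single object instead of separate params.
app_info = fetch_edit_app_info(app, retry_config: retry_config)
```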
Yeah it does, I'm just not sure if it's worth unifying these into an object, since it's just 2 configs? I usually do this type of thing once there are 3 or more parameters 👀 Do you think this would make a significant difference? In terms of e.g. UX/DX, testability, maintainability…? @lacostej
We fetch the `options[:version_check_wait_retry_limit]` 5 times in the code; I suspect there should be a way to make this a bit better.
Here's a proposal:

- We do not add any retry-related parameter to the `fetch_*` methods. We even remove the `wait_time` one.
- We also remove the parameters from `retry_if_nil`, and instead initialize the `initial_wait_time` from `options` within the implementation.

This should remove most retry-specific information from most of the code.

- That leaves the question of the tests. I guess we pass a default value of `0.01` to ensure that they do not take too long. Alternatively, we could stub the `sleep` call and just ensure it gets called the right number of times (see the sketch below). That would even make the tests slightly faster. Stubbing `sleep` requires a bit more work to do right.
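A minimal, self-contained illustration of the "stub `sleep`" idea (not the actual deliver specs — `Poller` is a stand-in for the real retry helper):

```ruby
require "rspec/autorun"

# Stand-in for the real retry helper: sleeps between attempts, doubling the wait.
class Poller
  def poll(tries:, wait_time:)
    tries.times do
      result = yield
      return result if result

      sleep(wait_time) # the spec below replaces this with a no-op
      wait_time *= 2   # exponential backoff, as in this PR
    end
    nil
  end
end

RSpec.describe Poller do
  it "backs off between attempts without actually waiting" do
    poller = described_class.new
    allow(poller).to receive(:sleep) # stubbed: the spec never really sleeps

    poller.poll(tries: 3, wait_time: 10) { nil } # block never succeeds

    expect(poller).to have_received(:sleep).exactly(3).times
    expect(poller).to have_received(:sleep).with(10)
    expect(poller).to have_received(:sleep).with(20)
    expect(poller).to have_received(:sleep).with(40)
  end
end
```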
@lacostej I think this would be a good approach, I'll try to come back to this later this week 🙏
If you have a clear understanding of how to implement this and want to take a stab at it before I have time to get to it, that'd probably help us get this out the door faster, as I don't think I'll have much availability in the next few days 🙇 if not, all good too! 🙏
Hey team, I won't have time to look into this in the near future 😢 @lacostej or anyone else, feel free to pick this up 🙏 I'm also ok with merging this in as is and creating a separate task to refactor later (although I suspect it won't be picked up because it won't be a high priority 😞)
Hello 👋 I will take it over 😊
The follow-up pull request that addresses requested changes: #21861 🙇
Yes, this is good. If you look at the overall combined PR, the code is, I think, cleaner.
Thanks for spending the time to do this!
We've had a similar error (despite properly bumping the version) in our CI (Buildkite) today too, but also on some occasions in December. IIRC retrying a couple of minutes later made it pass.
Interesting, @AliSoftware, thanks for the input! This makes me believe even more now that it's an issue on Apple's end 😢 not sure if there's something reasonable we could do on our end to work around their issues 😞
@rogerluan I just had CircleCI push all the way to the App Store without an issue. First time since mid-December. Maybe Apple is fixing App Store Connect. 😀
@rogerluan and it failed again yesterday, so the issues with ASC continue
I've been updating my app almost daily, and I've encountered an issue each time I try to publish it over the past month. The first time I run the lane, it fails, but on the second attempt, it works. This is quite similar to a previous issue we experienced, as detailed here: #16991. It would be fantastic to see this pull request merged. :)
@rogerluan success today after a number of tries...
Left my latest update here @ everyone: #21742 (comment)
@joshdholtz could we merge this one and refactor later if needed? The failed uploads are a serious problem. Thanks
Thanks to everyone for their input and hard work on this issue! As my organization builds a massive number of apps, this new problem from Apple has become a significant and costly obstacle to our workflow. |
Hey everyone. |
We have the same problem here reported by @kpowelltech and @sepbehroozi, and this constant failure has cost much more than we expected. Take a look at this PR please.
@LucasMonteiro1 you can run this pre-release PR in production with the instructions provided. Sadly it still fails most of the time for us. I'm going to look at adding more retries with the parameters provided.
Hey folks, if you use Azure DevOps to build and deploy your application with fastlane, it's possible to split build and deliver into two different jobs, then set `retryCountOnTaskFailure: 1` on the deliver job. This will automatically retry the deliver job when it fails. It's a temporary workaround until this PR is merged! The same idea should apply to other CI/CD tools!
* [deliver] Initialize UploadMetadata with options
* Fix linting
Hey team - I noticed that we have merged the latest changes and I wanted to thank you for that. However, I have discovered a workaround that has proven to be quite effective. Basically, I call a second deliver step earlier in the Fastlane execution to "prime" ASC with the new version update. This workaround has significantly increased the success rate of my builds. When I don't use it, I see a spike in build failures with the error message "Cannot update languages - could not find an editable version for 'IOS'". I run all my builds through a CICD, so it's crucial that I have a high success rate. To be more specific, I call this workaround before the primary deliver step uploads my .ipa to ASC.
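A rough Fastfile sketch of what such a "prime" step might look like — the lane name, version, and parameter choices below are illustrative assumptions, not @kpowelltech's actual code:

```ruby
lane :release do
  # "Prime" App Store Connect: create/refresh the new version early, skipping the
  # binary, metadata and screenshots, so ASC has time to register the version.
  deliver(
    app_version: "1.2.3",
    skip_binary_upload: true,
    skip_metadata: true,
    skip_screenshots: true,
    force: true # skip the HTML report confirmation prompt
  )

  build_app(scheme: "MyApp")

  # The real upload runs later in the lane, once ASC already knows about the version.
  deliver(force: true)
end
```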
@kpowelltech isn't your code essentially what this PR is doing? Did you get a chance to test this PR without that retry mechanism of yours? 😊
Still not fixed. Left a comment in #21705:
There hasn't been a new release of Fastlane since this PR was merged, so I guess we need to wait patiently for that. I have been using this branch in the meantime (or you can use …)
@joshdholtz when could we expect a new release?
Checklist
- Run `bundle exec rspec` from the root directory to see all new and existing tests pass
- Run `bundle exec rubocop -a` to ensure the code style is valid
- I see all `ci/circleci` builds in the "All checks have passed" section of my PR (connect CircleCI to GitHub if not)

Motivation and Context
Resolves #21705
Closes #21741
Description
After submitting a new version, App Store Connect takes some time to recognize the new version, and we must wait until it's available before attempting to upload metadata for it. There is a mechanism that checks whether it's available and retries after a certain time. This wait used to be 10 seconds, and every subsequent retry waited another 10 seconds.
This PR changes this behavior by adding an exponential backoff to this retry mechanism, which means it will retry after 10 seconds on the first retry, and then after 20 seconds, 40, 80, 160, etc…
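In pseudo-Ruby, the retry loop described above looks roughly like this (a simplified sketch; the real code lives in deliver's metadata upload logic and its exact signature may differ — `UI` here stands for fastlane's logger):

```ruby
# Simplified sketch only — not the exact fastlane implementation.
def retry_if_nil(message, tries: 5, wait_time: 10)
  loop do
    tries -= 1

    value = yield
    return value if value

    UI.message("#{message} Retrying after #{wait_time} seconds (remaining: #{tries})")
    Kernel.sleep(wait_time)
    return nil if tries.zero?

    wait_time *= 2 # the change in this PR: 10s, 20s, 40s, 80s, 160s, …
  end
end
```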
The default number of retries is 5. Prior to this PR you couldn't customize this value, but now you can via a new option in `deliver` called `version_check_wait_retry_limit` or via an env var `DELIVER_VERSION_CHECK_WAIT_RETRY_LIMIT`. This option takes an integer which represents the max number of retries before giving up.

The initial wait time remains at 10 seconds and is not customizable. We decided to keep it this way instead of increasing the initial wait time so that the experience when Apple's servers are quick isn't jeopardized.
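For example, in a Fastfile it could be used like this (the lane and the value 8 are made up for illustration; the option name is the one added by this PR):

```ruby
lane :upload_metadata do
  deliver(
    # With 8 retries the worst case is 10 + 20 + … + 1280 = 2550 seconds of waiting.
    version_check_wait_retry_limit: 8,
    force: true
  )
end
```

Or equivalently via the environment variable: `DELIVER_VERSION_CHECK_WAIT_RETRY_LIMIT=8 bundle exec fastlane deliver`.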
Click here to visualize scenario planning so you understand how this strategy works and how it compares to other strategies considered
Prior to this PR, the wait time + number of retries would be:
1st attempt: 10 seconds
2nd attempt: 10 seconds
3rd attempt: 10 seconds
4th attempt: 10 seconds
5th attempt: 10 seconds
Total: 50 seconds
If we simply increased the initial waiting time to, let's say, 60s, as proposed in #21741, then this would mean:
1st attempt: 60 seconds
2nd attempt: 60 seconds
3rd attempt: 60 seconds
4th attempt: 60 seconds
5th attempt: 60 seconds
Total: 300 seconds
In this PR, we adopt a strategy that provides faster wait times if Apple's servers aren't too slow:
1st attempt: 10 seconds
2nd attempt: 20 seconds
3rd attempt: 40 seconds
4th attempt: 80 seconds
5th attempt: 160 seconds
Total: 310 seconds
So if the Apple servers take, say, 150 seconds to process the new app version you just submitted, on current master it would time out and your CI would fail/crash. In the 2nd proposal above, it would succeed on the 3rd attempt after waiting 180 seconds, and in this PR's proposal (last example above) it would succeed on the 4th attempt after waiting 150 seconds.
If it took 305 seconds, it would fail in the first 2 strategies, but it'd succeed in the 5th attempt using this strategy.
If it ends up taking longer than 310 seconds, you can customize the number of retries to e.g. 6, 7, 8… and it will eventually succeed (after waiting a long time 😅 dang it, Apple!)
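If you want to sanity-check the totals for other retry limits, the worst-case wait is just a geometric sum:

```ruby
# Worst-case total wait (in seconds) for a given retry limit under this PR's
# strategy: 10s initial wait, doubled on every subsequent retry.
def total_wait(retries, initial: 10)
  (0...retries).sum { |i| initial * 2**i }
end

total_wait(5) # => 310
total_wait(8) # => 2550
```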
I hope this is a decent hotfix for this annoying intermittent issue caused by Apple's servers.
Testing Steps
To test this branch, modify your Gemfile as:
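A Gemfile entry for testing a PR branch generally looks like this — placeholders only, substitute the actual fork URL and branch name for this PR:

```ruby
# Placeholder values — not the real fork/branch of this PR.
gem "fastlane", git: "https://github.com/<fork-owner>/fastlane", branch: "<pr-branch>"
```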
And run `bundle install` to apply the changes.

Before, you should see something like this when attempting to create a new app version:
Now it should wait longer before each retry, using an exponential backoff strategy, for instance:
Except that it should now succeed before it runs out of retry attempts, since the wait is longer.
However, even if it doesn't, you can increase the number of retries by passing the `version_check_wait_retry_limit` option or the `DELIVER_VERSION_CHECK_WAIT_RETRY_LIMIT` env var.