Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate git extraction errors #6671

Closed
EnTeQuAk opened this issue May 15, 2019 · 6 comments
Closed

Investigate git extraction errors #6671

EnTeQuAk opened this issue May 15, 2019 · 6 comments
Labels
repository:addons-server Issue relating to addons-server

Comments

@EnTeQuAk
Copy link
Contributor

There will most certainly be filed separate issues but I'll open a meta issue to document my findings while looking at this.

I'm more specifically talking about the following sentry exceptions:

Potentially transaction related

Pygit/libgit related

Potentially EFS/NFS related

Other

@diox
Copy link
Member

diox commented Jun 18, 2019

For the extraction not finding the Version, we figured out what was going on:

Code path Release channel from_upload behaviour
devhub unlisted in-request (_submit_upload)
devhub listed in-request (_submit_upload)
API unlisted task (handle_upload(submit=True) creates a task that calls create_version_for_upload())
API listed task (handle_upload(submit=True) creates a task that calls create_version_for_upload())

Version created from the API are created in a task, by create_version_for_upload, which calls sign_file, which calls extract_version_to_git.delay(). The problem is, create_version_for_upload is decorated with a @transaction.atomic, so the transaction might not have been committed when we reach the call to extract_version_to_git.delay().

The solution is to make that call on_commit, and make sure sign_file() is always called inside a transaction.

@EnTeQuAk
Copy link
Contributor Author

EnTeQuAk commented Jul 12, 2019

So, I'd like to do the extreme here… there is a setting in our code called NFS_LAG_DELAY and I'd like to re-use that for the extraction. My current plan would be:

  • use flufl.lock to create a lock specific to a version that would be extracted
  • potentially sleep for a short amount of time (or NFS_LAG_DELAY) to ensure the lock is actually set - although flufl.lock ensures that as much as possible
  • create git repository (if not existing)
    • wait NFS_LAG_DELAY again to ensure all .git files exist properly
  • create branch (if not existing)
    • wait NFS_LAG_DELAY again to ensure all .git files get updated properly
  • create temporary worktree
    • wait NFS_LAG_DELAY again to ensure all .git files get updated properly
  • extract all files from XPI/ZIP
    • wait NFS_LAG_DELAY again to ensure all files got written properly
  • git add -A and commit

This is a very simplistic representation, ideally this means that we're going to lock and wait after the following methods: AddonGitRepository.{git_repository (if not existing), find_or_create_branch, _commit_through_worktree}

That ideally fixes all or most of the problems by cleanly writing all the data to EFS.

Another alternative to fix this would be to introduce some kind of proxy that accepts all the reads/writes and queues them properly, but that'll make things even more complex than they already are.

@tapaswenipathak
Copy link

Hello folks: Can I help or is the ticket open?

@EnTeQuAk
Copy link
Contributor Author

EnTeQuAk commented Aug 5, 2019

Hello folks: Can I help or is the ticket open?

No, this issue isn't really suitable for contributors as it requires quite a few experimentation - locally as well as on our servers.

Please take a look at other issues that have the contrib: welcome label.

@EnTeQuAk
Copy link
Contributor Author

EnTeQuAk commented Nov 29, 2019

Without actually implementing a extraction-level fix yet, the failure rate has been lowered significantly since November 21st because we enabled delayed unlisted signing.

That only means we're on the right track here with delaying the extraction and improving locking.

Specifically…

https://sentry.prod.mozaws.net/operations/olympia-prod/issues/5525655/

https://sentry.prod.mozaws.net/operations/olympia-prod/issues/5525343/

https://sentry.prod.mozaws.net/operations/olympia-prod/issues/5525326/

https://sentry.prod.mozaws.net/operations/olympia-prod/issues/5525342/

Have simply been "fixed" now. These 4 errors have been the origin of ~708k entries in sentry in the last ~8 months

@willdurand willdurand removed their assignment May 11, 2020
@willdurand
Copy link
Member

I'm going to close this issue and I'll file new ones, specifically one for each error. We fixed a bunch of errors already, but new ones appeared it seems.

@KevinMind KevinMind added migration:no-jira repository:addons-server Issue relating to addons-server labels May 4, 2024
@KevinMind KevinMind transferred this issue from mozilla/addons-server May 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
repository:addons-server Issue relating to addons-server
Projects
None yet
Development

No branches or pull requests

5 participants