Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Custom LFS integration has broken unexpectedly #8288

Closed
eloquence opened this issue Jun 25, 2021 · 22 comments · Fixed by #8340
Closed

Custom LFS integration has broken unexpectedly #8288

eloquence opened this issue Jun 25, 2021 · 22 comments · Fixed by #8340
Labels
Support Support question

Comments

@eloquence
Copy link

Details

The SecureDrop project uses https://github.com/freedomofpress/securedrop-docs as its documentation repo. It uses LFS because screenshots are updated with each release, which would otherwise balloon the repo size.

We are aware that LFS is not officially supported (#1846), but have so far relied on a commonly used method for installing git-lfs via conf.py (https://github.com/freedomofpress/securedrop-docs/blob/main/docs/conf.py#L21-L27). This method has stopped working unexpectedly approximately 2-3 days ago, causing all images in our documentation to break (they are served as LFS pointers).

We would appreciate any light you can shed on changes to the RTD build environment that may have caused this breakage. We have attempted to resolve it using various strategies, e.g., by installing git-lfs as a package, to no avail so far. It appears to us that even when the LFS checkout is working as expected, at some point during the build process, the actual images are not copied correctly.

Further notes on our technical investigation so far can be found at freedomofpress/securedrop-docs#234 . We can of course revert to a non-LFS repo, but if there is a straightforward fix or lead for us to pursue, we would appreciate any pointers.

@humitos
Copy link
Member

humitos commented Jun 28, 2021

@eloquence Hi! Can you paste the link for your RTD project?

This method has stopped working unexpectedly approximately 2-3 days ago, causing all images in our documentation to break (they are served as LFS pointers).

How did it break? Can you paste the link for a failing build?

@humitos
Copy link
Member

humitos commented Jun 28, 2021

Definitely, something broke for some reason. The example from test-builds is not working anymore: https://test-builds.readthedocs.io/en/git-lfs/ (note the image is not shown)

@humitos
Copy link
Member

humitos commented Jun 28, 2021

I updated git-lfs to its latest version and it didn't fix the issue.

@humitos humitos added the Support Support question label Jun 28, 2021
@humitos
Copy link
Member

humitos commented Jun 28, 2021

I don't have too much experience with git-lfs, but in any case, I don't think this is a bug/issue/problem of RTD --it seems the files are not properly checked out locally by git-lfs, since my test from that page shows the content as a text file:

>>> print(open('img/git-lfs.png', 'rb').read())
b'version https://git-lfs.github.com/spec/v1\noid sha256:daf4c09c20b1ed6d72f7fee66b68005abe21bfa51c692f717fd32a313fb300ab\nsize 329787\n'

when it should show the PNG content itself.

@rmol
Copy link

rmol commented Jun 28, 2021

@humitos If you add that test read of the file to conf.py after the LFS setup, I think it will show the image content (probably causing an exception), not the LFS pointer. I found that git lfs checkout was working, but somewhere in the build, the images are being reverted to pointers. I haven't yet been able to find where that's happening, but the Sphinx code that copies images around, at least, doesn't seem to have changed recently.

@dokempf
Copy link

dokempf commented Jul 2, 2021

I have the exact same problem and I can confirm what @rmol says. My log indicates that the LFS checkout was successful, still the files are pointers when the actual sphinx-build call happens. One thing I noticed is that since my last passing build and today, two additional stages were added to the build log (they show up as git rev-parse HEAD which would only be logging the HEAD SHA1, but may be more is happening at these stage). I can tell that the RTD-generated part of conf.py has not changed between the two build.

@RuRo
Copy link

RuRo commented Jul 2, 2021

I think, that I am having the same issue here.

Also, rebuilding an older version that had working LFS images breaks them, so this is almost certainly a regression.

@RuRo
Copy link

RuRo commented Jul 3, 2021

It looks like the extra git rev-parse HEAD steps were introduced by #8263.
Accessing the commit property on a git backend runs rev-parse.

This might all be a red herring, though, since I don't see any way for rev-parse to convert the lfs files back to pointers.


I think, that I might have figured out, what is wrong. I tried adding some os.system calls to my docs/conf.py and I was able to get the following info:

  1. git status thinks that all the files in the repository are deleted. This happens even before all the git-lfs "dark magic".

  2. The reason, why git status thinks that all the files were deleted is because for some reason git thinks that the repository worktree root is in the current directory (which is the directory with conf.py in it, docs in my case). So

    > realpath . # current working directory
    /home/docs/checkouts/readthedocs.org/user_builds/<projectname>/checkouts/latest/docs
    > git rev-parse --show-toplevel # where git thinks the root of the repository is
    /home/docs/checkouts/readthedocs.org/user_builds/<projectname>/checkouts/latest/docs
    > find ..
    ..
    ../.git
    <snip>
    ../docs
    <snip (there is no docs/.git dir btw)>

    It seems, that the repository was initially cloned into checkouts/latest/, then we cded into docs and now for some reason git thinks that checkouts/latest/docs is the root of the repo even though the .git dir is one level higher.

  3. In this state, running git-lfs checkout does checkout the lfs data correctly, but it is placed in the wrong directory (relative to docs instead of relative to the root of the repo). I was able to confirm this by adding a

    os.system("cp path/to/lfs_image.jpg ../path/to/lfs_image.jpg")

    after all the os.system("./git-lfs ...") commands. After manually moving the checked out image to the correct location, it appeared in the docs.

So the only questions that remain are:

  • Why does git think that the root of the repository is in checkouts/latest/docs instead of checkouts/latest?
  • What changed in readthedocs to cause this issue (and how can it be fixed)?

@dokempf
Copy link

dokempf commented Jul 5, 2021

Thanks for posting your analysis, I did a bit more digging.

* Why does `git` think that the root of the repository is in `checkouts/latest/docs` instead of `checkouts/latest`?

I found

env['GIT_DIR'] = os.path.join(self.working_dir, '.git')
which looks suspicious to me - although it is 11 years old.

* What changed in `readthedocs` to cause this issue (and how can it be fixed)?

The description of #8263 states "In some places we were executing git commands outside the container" so, maybe above line was not actually used, but is now?

Unfortunately, I am missing the code base knowledge to continue from here.

@dokempf
Copy link

dokempf commented Jul 5, 2021

So, I fixed this issue for me by playing around with the working directory of the git-lfs command. I have packaged my currently working solution into a Sphinx extension: https://github.com/ssciwr/sphinx_lfs_content Maybe having the code in a Sphinx extension will make it easier to report and address issues with the non-official LFS workflow on RTD.

@RuRo
Copy link

RuRo commented Jul 5, 2021

Unfortunately, it doesn't seem, that GIT_DIR is the culprit here. By inspecting the output of env

GIT_DIR=/home/docs/checkouts/readthedocs.org/user_builds/<projectname>/checkouts/latest/.git

and git rev-parse --git-dir we can see, that the .git directory is located in

/home/docs/checkouts/readthedocs.org/user_builds/<projectname>/checkouts/latest/.git

as expected. It's only the current git worktree root, that is for some reason moved to

/home/docs/checkouts/readthedocs.org/user_builds/<projectname>/checkouts/latest/docs

The weirdest part is that according to git worktree --list, the only existing worktree is in

/home/docs/checkouts/readthedocs.org/user_builds/<projectname>/checkouts/latest  <sha> (detached HEAD)

So at first glance it seems like the output of git rev-parse --show-toplevel contradicts the output of git worktree --list.


Okay. This is insane. I think, that I might have found the problem. And it might be a bug in git itself. Consider the following example

# Setup example repo
mkdir /example
cd /example
git init .
mkdir docs
cd docs

Then the following behaviour is clearly broken

git rev-parse --show-toplevel # prints /example as expected
export GIT_DIR=/example/.git
git rev-parse --show-toplevel # prints /example/docs for some reason

In fact, after exporting GIT_DIR, rev-parse --show-toplevel seems to always just return the current directory.

The only thing, that I don't understand is how/why did it work differently in the past.

@RuRo
Copy link

RuRo commented Jul 5, 2021

Ah, okay. I think, it makes sense now. The #8263 must be the cause after all. Although it is exporting the correct GIT_DIR environment variable, it results in git getting confused about the location of the current worktree.

@humitos can we get a temporary quick fix for this regression, that doesn't push the GIT_DIR to the user environment?

Edit: actually, the correct solution would probably be to just export GIT_WORK_TREE alongside GIT_DIR in readthedocs/vcs_support/backends/git.py::Backend.env

Here's a relevant extract from man git-config (core.worktree)

If --git-dir or GIT_DIR is specified but none of --work-tree, GIT_WORK_TREE
and core.worktree is specified, the current working directory is regarded as the
top level of your working tree.

Edit edit: I made a PR with the proposed fix.

@dokempf
Copy link

dokempf commented Jul 5, 2021

Nice detective work @RuRo I just build git master from source and your minimum example is still failing there.

@ericholscher
Copy link
Member

I still don't quite understand what broke here. It seems that #8263 wouldn't have changed the environment in the docker containers, and GIT_DIR has been passed for ~11 years into the containers. I'm worried about changing GIT_DIR, since it's been in the builders for so long. It seems like deleting it could cause lots of other issues with users code (eg. people calling git commands without setting the CWD properly).

Maybe @stsewd can chime in a bit more here -- I'd like to understand the issue a bit more so we can help fix it.

@RuRo
Copy link

RuRo commented Jul 13, 2021

Maybe I am wrong, but here is my reasoning, for why the environment wasn't passed into the build step previously.

Look carefully at how self.project.vcs_repo and self.version.project.vcs_repo were called before and after #8263. It used to be called without an environment argument. Which AFAIK causes the BaseVCS to construct a temporary LocalBuildEnvironment that doesn't share the environment variables with the other steps and doesn't record the step.

After #8263, the VCS is now constructed with an explicit environment, that is shared between the steps in the doc builder. This also explain, why there are 2 new git rev-parse HEAD commands in the build logs, since they were previously not recorded.

@ericholscher
Copy link
Member

Yea -- but the build is failing on the python -m sphinx call -- which shouldn't have been changed by this. At least the current build that I'm looking at -- and where conf.py would be evaluated. So I don't believe the env would have changed in that docker container or the code that's being executed. Is the git-lfs release pinned, for example? It would help to see your build log on RTD, so I can get a bit more context.

(Also, thanks so much for diving into this code @RuRo -- we really appreciate the help).

@stsewd
Copy link
Member

stsewd commented Jul 13, 2021

@ericholscher a wild guess, maybe the executing the git commands with GIT_DIR inside the container are changing something (they were being executed outside the container)

- commit = self.commit or self.project.vcs_repo(self.version.slug).commit
+ commit = self.commit or self.get_vcs_repo(environment).commit

@RuRo
Copy link

RuRo commented Jul 13, 2021

(also replying to your comment on the PR)

  • which shouldn't have been changed by this

    I have just described, why it wasn't affected previously, and how is it affected now. Can you clarify, why you don't think, that my explanation is correct?

    Also, I am not near my laptop right now, but I invite you to try building the git-lfs test repo (with an added os.system("env") in conf.py before and after applying Git: don't expand envvars in Gitpython #8263. I am claiming that you won't see GIT_DIR in the environment before Git: don't expand envvars in Gitpython #8263.

  • Is the git-lfs release pinned

    git-lfs is a binary, that is downloaded from their GitHub release page and launched with os.system("./git-lfs ..."). Yes, it's effectively pinned.

  • I didn't remove expand_vars, only GIT_DIR and with this change git-lfs works as expected.

  • I don't even see, how could expand_vars affect the git-lfs binary that is started inside the docker container from conf.py.

  • It would help to see your build log on RTD, so I can get a bit more context.

    I am unfortunately not near my laptop right now, but there are some partial logs earlier in this thread and you can also just build the git-lfs branch from the official test builds.

@stsewd
Copy link
Member

stsewd commented Jul 13, 2021

I have just described, why it wasn't affected previously, and how is it affected now. Can you clarify, why you don't think, that my explanation is correct?

The thing is that we execute the git-clone and sphinx-build commands inside a container (we always have). We also use gitpython to iterate over tags/branches or check if the repo exists, and we were executing the command 'git', 'rev-parse', 'HEAD' outside the container too. So, we were using a combination of both, executing git commands inside and outside the container, but the checkout step has always been executed inside the container (or at leas in the recent years).

@RuRo
Copy link

RuRo commented Jul 13, 2021

The thing is that we execute the git-clone and sphinx-build commands inside a container (we always have). We also use gitpython to iterate over tags/branches or check if the repo exists, and we were executing the command 'git', 'rev-parse', 'HEAD' outside the container too. So, we were using a combination of both, executing git commands inside and outside the container, but the checkout step has always been executed inside the container (or at leas in the recent years).

Hmmm. I see your point now, but are you sure that the environment is actually shared between these steps? And I don't mean if they are both ran inside docker. Is the actual environment object shared between them.

For example, it seems to me, that the checkout step is started from SyncRepositoryTaskStep.run and a new environment object is constructed for this step. (environment = env_cls(...))

I am not 100% sure, but it doesn't seem to be the same object as self.build_env in the doc builders. So the environment dict would not be shared between them.

@stsewd
Copy link
Member

stsewd commented Jul 13, 2021

@RuRo we actually just tracked this down to

self.environment.environment.update(self.env)
, the enviroment vars are being updated on the same instance :) we are discussing a fix, but your PR should be fine to merge anyway.

@stsewd
Copy link
Member

stsewd commented Jul 13, 2021

@RuRo we have deployed the fix, git-lfs should be working now. Thanks for helping us debugging this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Support Support question
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants