Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Build: standardize output directory for artifacts #9888

Merged
merged 20 commits into from
Jan 19, 2023

Conversation

humitos
Copy link
Member

@humitos humitos commented Jan 11, 2023

Always expose the final artifacts under _readthedocs/<format> directory, where <format> can be one of the following:

  • html
  • htmlzip
  • pdf
  • epub
  • json

This allows all users of the platform to have the same behavior. No matter if they are using the default Sphinx/MkDocs builders or a custom one by using build.commands.

Besides, this commit removes the "intermediary" directory used before uploading the files to S3. This allows users to modify the output files directly from _readthedocs/<format> using build.jobs.post_build, for example.

Some important notes

  • BaseBuilder.type is removed since it was not required anymore.
  • There is no old_artifact_path anymore since there is no intermediary folder.
  • The sphinx-build command is executed from path where the repository is cloned, instead from the the directory where the conf.py lives. This is to simplify the code and be able to use relative path for the build output directory (e.g. sphinx-build -b html ... docs/ _readthedocs/html).
  • A new BaseSphinx._post_build method was added to cleanup the output directory (leave only one .epub, .html and .pdf file before moving forward).
  • These changes should allow us to support all formats for build.commands. We would need to check if these standard directories exist, set these formats in the version and upload their content.
  • Upload step now validate there is one and only one PDF, ePUB and ZIP file when they are built. Currently, we only support one file per version. In the future we could remove this restriction. It fails silently for now, which is not great. We need to refactor this code a little to fail the build properly and communicate this to the user. I think it's the correct way of doing it.
  • This could also allow to support other ways to build PDF (SimplePDF or tectonic, see PDF builder: tectonic #9598) and also ZIP (e.g. zundler, see https://github.com/readthedocs/meta/issues/77#issuecomment-1301927685).

Use cases manually tested from test-builds

  • mkdocs-python-tags
  • build-commands
  • all-formats
  • latest
  • build-jobs-post-build
  • singlehtml
  • htmldir

Closes #9179

@humitos humitos requested a review from a team as a code owner January 11, 2023 14:43
@humitos humitos requested review from ericholscher, stsewd and agjohnson and removed request for benjaoming January 11, 2023 14:43
Always expose the final artifacts under `_readthedocs/<format>` directory, where
`<format>` can be one of the following:

- html
- htmlzip
- pdf
- epub
- json

This allows all users of the platform to have the same behavior. No matter if
they are using the default Sphinx/MkDocs builders or a custom one by using
`build.commands`.

Besides, this commit removes the "intermediary" directory used _before_
uploading the files to S3. This allows users to modify the output files directly
from `_readthedocs/<format>` using `build.jobs.post_build`, for example.

- `BaseBuilder.type` is removed since it was not required anymore.
- There is no `old_artifact_path` anymore since there is no intermediary folder.
- The `sphinx-build` command is executed from path where the repository is
  cloned, instead from the the directory where the `conf.py` lives. This is to
  simplify the code and be able to use relative path for the build output
  directory (e.g. `sphinx-build -b html ... docs/ _readthedocs/html`).
- A new `BaseSphinx._post_build` method was added to cleanup the output
  directory (leave only one .epub, .html and .pdf file before moving forward).
- These changes should allow us to support _all formats_ for `build.commands`.
  We would need to check if these standard directories exist, set these formats
  in the version and upload their content.
- Upload step now validate there is one and only one PDF, ePUB and ZIP file when
  they are built. Currently, we only support one file per version. In the future
  we could remove this restriction. It fails silently for now, which is not
  great. We need to refactor this code a little to fail the build properly and
  communicate this to the user. I think it's _the correct way_ of doing it.
- This could also allow to support other ways to build PDF (SimplePDF or
 `tectonic`, see #9598) and
  also ZIP (e.g. zundler, see
  readthedocs/meta#77 (comment)).

Closes #9179
@humitos humitos force-pushed the humitos/standard-output-directory branch from db78cdf to faec2a2 Compare January 11, 2023 14:59
@ericholscher
Copy link
Member

ericholscher commented Jan 16, 2023

  • The sphinx-build command is executed from path where the repository is cloned, instead from the the directory where the conf.py lives. This is to simplify the code and be able to use relative path for the build output directory (e.g. sphinx-build -b html ... docs/ _readthedocs/html).

This one scares me a little bit, as I mentioned in the call. Users are doing lots of weird things with the working directory in conf.py, mostly to get various things added to the PYTHONPATH or similar. I do think we probably want to avoid this if we can easily undo it.

Upload step now validate there is one and only one PDF, ePUB and ZIP file when they are built. Currently, we only support one file per version. In the future we could remove this restriction. It fails silently for now, which is not great. We need to refactor this code a little to fail the build properly and communicate this to the user. I think it's the correct way of doing it.

This is a nice improvement. Silent failure is definitely not a great UX in general, and I'm glad we're working to remove these as we go.

I think the next step here would probably be supporting a zip file, to allow people to upload multiple? Or support multiple, and zip them on our side. But not a problem for this PR :)

Copy link
Member

@ericholscher ericholscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very excited about this change, and think it's a great one. Other than my 2 comments above, I just had a few small nitpicks as I read through things. Feel free to disregard some of them, I am just a bit hesitent to change too much all at once, given the side of this already underlying change.

Really think this is a great step forward overall though. We should coordinate this change w/ the rclone changes @stsewd is making. I don't think they'll conflict too badly, but definitely changing a lot of similar code.

readthedocs/doc_builder/backends/sphinx.py Show resolved Hide resolved
readthedocs/doc_builder/backends/sphinx.py Show resolved Hide resolved
readthedocs/doc_builder/backends/sphinx.py Show resolved Hide resolved
readthedocs/doc_builder/backends/sphinx.py Show resolved Hide resolved
readthedocs/doc_builder/backends/sphinx.py Outdated Show resolved Hide resolved
readthedocs/doc_builder/backends/sphinx.py Outdated Show resolved Hide resolved
safe_rmtree(self.old_artifact_path)
log.info('Removing old artifact path.', path=self.old_artifact_path)
def _post_build(self):
"""Execute extra steps (e.g. create ZIP, rename PDF, etc) after building if required."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does Python support empty function bodies?!

I generally still think it's cleaner to have a pass or raise NotImplementedError here to be more explicit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

He, yes. It does. I usually add a pass as well. However, it seems that Black removed it. We cannot have a raise NotImplementedError because it's always called by the builder and some of the subclass don't implement it.

I'm not sure if there is a way to tell Black that I want to keep the pass in this line. That's probably what we need here.

log.info(
"Store build artifacts finished.",
time=(timezone.now() - time_before_store_build_artifacts).seconds,
)

# NOTE: we are updating the db version instance *only* when
# HTML are built successfully.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems we deleted half this comment. Is it not true anymore?

Ah. I see it got missed in a comment refactor. This line needs to be moved down below :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got confused by this comment. I'm not sure if there is something I have to do here or not 😅

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the comment was 2 lines, and only 1 line was moved a few lines down. The comment is basically saying we only hit the API when html passes, which is medium-weird, but we don't want to change with this refactor.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which is medium-weird, but we don't want to change with this refactor

Yes, this is wrong, in my opinion. I noticed this when I did the Celery refactor. This is something I want to change soon, if possible.

readthedocs/projects/tasks/builds.py Outdated Show resolved Hide resolved
readthedocs/projects/tests/test_build_tasks.py Outdated Show resolved Hide resolved
@stsewd
Copy link
Member

stsewd commented Jan 17, 2023

We should coordinate this change w/ the rclone changes @stsewd is making. I don't think they'll conflict too badly, but definitely changing a lot of similar code.

This doesn't conflict with the rclone PR.

@humitos
Copy link
Member Author

humitos commented Jan 17, 2023

@ericholscher

This one scares me a little bit, as I mentioned in the call. Users are doing lots of weird things with the working directory in conf.py, mostly to get various things added to the PYTHONPATH or similar. I do think we probably want to avoid this if we can easily undo it.

This is definitely a good point.

I can revert this change, but it will produce different relative output paths depending on the location of the conf.py for the project --which could be confusing:

  • a project with docs/conf.py will execute this command: sphinx-build -b html [...] . ../_readthedocs/html
  • another project with src/docs/user/conf.py will call: sphinx-build -b html [...] . ../../../_readthedocs/html

Note we will be keeping the source_directory as . but they will have different output_directory. I decided to go the way I wrote the code because it's more explicit, easier to read and the command it's consistent across all the projects (instead of changing the output_directory, this implementation changes the source_directory which makes it explicit, since the cwd is not exposed to users)

In theory, people should be modifying the PYTHONPATH based on the location of the conf.py file (since their code is running from inside it) instead of cwd; but I know that's not true in practice 😄

@humitos
Copy link
Member Author

humitos commented Jan 17, 2023

@ericholscher

I think the next step here would probably be supporting a zip file, to allow people to upload multiple? Or support multiple, and zip them on our side. But not a problem for this PR :)

I prefer supporting uploading and serving multiple .pdf as-is. Otherwise, people will have to download a big .zip just to read one small PDF. Being able to read just one PDF is a good feature. Also, giving people the ability to "download all the PDF as .zip" is another good feature.

In summary, I want both of your suggestions 😄

I proposed to changed this CWD to be the one where the repository was cloned,
but that could lead to some unexpected behavior. So, it's better to be
conservative here and keep the CWD as it was.

There are some small downsides for this decision, but it's better than breaking
people's builds :)
@humitos
Copy link
Member Author

humitos commented Jan 17, 2023

@ericholscher I addressed most of the feedback (thanks!). I reverted the CWD change and I added an exception to raise when there is no PDF found. I repeated some QA locally and it seems that things keep working fine. Note that I tried to make incremental commits to allow easier reviews.

Please, let me know if there are more changes to be made on this PR. If there are things we want to address later, we can create issues otherwise.

@humitos
Copy link
Member Author

humitos commented Jan 18, 2023

@ericholscher

I wonder about using a full path for the readthedocs/html directory? I agree that it's not a great UX, but probably nicer output to just have it be explicit. Even better if we can just do something like $HOME/rtd-builds/<project>/<version>/_readthedocs/html or even $BUILD_ROOT/_readthedocs/html??

I really like this idea of $BUILD_ROOT 💯 . I'd probably name it differently and also include the _readthedocs/ part on it. This is my proposal 1:

$READTHEDOCS_OUTPUT=<path where the repository was clonned>/_readthedocs/

Then, the Sphinx command executed would be:

python -m sphinx -T -E -j auto -b singlehtml -d _build/doctrees -D language=en . $READTHEDOCS_OUTPUT/html

I think it's pretty clear, short and easy to understand 👍🏼 . It also doesn't change the contract we have with build.commands since it keeps using the _readthedocs/html folder from inside where the repository was cloned. Ideally, we should use a path completely separated from the "user's repository" to avoid potential collisions, but that will break the contract we currently have.

Footnotes

  1. we could also have one variable per output format type we support: $READTHEDOCS_OUTPUT_HTML, $READTHEDOCS_OUTPUT_HTMLZIP, $READTHEDOCS_OUTPUT_PDF, etc...

humitos and others added 4 commits January 18, 2023 12:15
Co-authored-by: Eric Holscher <25510+ericholscher@users.noreply.github.com>
Co-authored-by: Eric Holscher <25510+ericholscher@users.noreply.github.com>
@humitos
Copy link
Member Author

humitos commented Jan 18, 2023

I opened #9913 to continue the work making the output path an environment variable because I found it will require some more extra work, and I'd like to not block this particular because of that. I'm happy to merge this PR exposing the path as relative now (we are not changing the behavior here since it was the way it currently works) and make the improvement on being explicit in the other PR with fresh eyes and deeper research about why it's not working as expected.

Copy link
Member

@ericholscher ericholscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks pretty much ready. It's definitely something we should be testing heavily during QA steps, and ahead of deploy if we have the energy. Really excited to see it go out!

Comment on lines +69 to +75
# IMO, if there are multiple config files,
# the build should fail immediately communicating this to the user.
# This can be achived by unhandle the exception here
# and leaving `on_failure` Celery handle to deal with it.
#
# In case there is no config file, we should continue the build
# because Read the Docs will automatically create one for it.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems reasonable. Is there a good way to implement the suggested logic? I guess raise a different exception for multiple configs that we raise (or just don't catch)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I documented this here in this way because I want to change the internal behavior and also communicate these errors better to the user. I'm opening a new issue with all these findings to continue working after this, since it will require some refactor and more design decisions.

@humitos
Copy link
Member Author

humitos commented Jan 19, 2023

@ericholscher thanks for all these reviews 👍🏼 . I opened #9918 with all the notes we took from the conversations on this PR to not forget them and decide which of them we want to prioritize. I tried to link as many thing there as I could.

I'm going to merge this PR now and ping the rest of the team so they can do some QA locally before deploying it next week.

@humitos humitos merged commit 239d890 into main Jan 19, 2023
@humitos humitos deleted the humitos/standard-output-directory branch January 19, 2023 10:26
hoodmane added a commit to hoodmane/pyodide that referenced this pull request Jan 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Delay builder.move until after post_build job
3 participants