Build: standardize output directory for artifacts #9888

humitos · 2023-01-11T14:43:01Z

Always expose the final artifacts under _readthedocs/<format> directory, where <format> can be one of the following:

html
htmlzip
pdf
epub
json

This allows all users of the platform to have the same behavior. No matter if they are using the default Sphinx/MkDocs builders or a custom one by using build.commands.

Besides, this commit removes the "intermediary" directory used before uploading the files to S3. This allows users to modify the output files directly from _readthedocs/<format> using build.jobs.post_build, for example.

Some important notes

BaseBuilder.type is removed since it was not required anymore.
There is no old_artifact_path anymore since there is no intermediary folder.
The sphinx-build command is executed from path where the repository is cloned, instead from the the directory where the conf.py lives. This is to simplify the code and be able to use relative path for the build output directory (e.g. sphinx-build -b html ... docs/ _readthedocs/html).
A new BaseSphinx._post_build method was added to cleanup the output directory (leave only one .epub, .html and .pdf file before moving forward).
These changes should allow us to support all formats for build.commands. We would need to check if these standard directories exist, set these formats in the version and upload their content.
Upload step now validate there is one and only one PDF, ePUB and ZIP file when they are built. Currently, we only support one file per version. In the future we could remove this restriction. It fails silently for now, which is not great. We need to refactor this code a little to fail the build properly and communicate this to the user. I think it's the correct way of doing it.
This could also allow to support other ways to build PDF (SimplePDF or tectonic, see PDF builder: tectonic #9598) and also ZIP (e.g. zundler, see https://github.com/readthedocs/meta/issues/77#issuecomment-1301927685).

Use cases manually tested from `test-builds`

mkdocs-python-tags
build-commands
all-formats
latest
build-jobs-post-build
singlehtml
htmldir

Closes #9179

Always expose the final artifacts under `_readthedocs/<format>` directory, where `<format>` can be one of the following: - html - htmlzip - pdf - epub - json This allows all users of the platform to have the same behavior. No matter if they are using the default Sphinx/MkDocs builders or a custom one by using `build.commands`. Besides, this commit removes the "intermediary" directory used _before_ uploading the files to S3. This allows users to modify the output files directly from `_readthedocs/<format>` using `build.jobs.post_build`, for example. - `BaseBuilder.type` is removed since it was not required anymore. - There is no `old_artifact_path` anymore since there is no intermediary folder. - The `sphinx-build` command is executed from path where the repository is cloned, instead from the the directory where the `conf.py` lives. This is to simplify the code and be able to use relative path for the build output directory (e.g. `sphinx-build -b html ... docs/ _readthedocs/html`). - A new `BaseSphinx._post_build` method was added to cleanup the output directory (leave only one .epub, .html and .pdf file before moving forward). - These changes should allow us to support _all formats_ for `build.commands`. We would need to check if these standard directories exist, set these formats in the version and upload their content. - Upload step now validate there is one and only one PDF, ePUB and ZIP file when they are built. Currently, we only support one file per version. In the future we could remove this restriction. It fails silently for now, which is not great. We need to refactor this code a little to fail the build properly and communicate this to the user. I think it's _the correct way_ of doing it. - This could also allow to support other ways to build PDF (SimplePDF or `tectonic`, see #9598) and also ZIP (e.g. zundler, see readthedocs/meta#77 (comment)). Closes #9179

readthedocs/doc_builder/backends/sphinx.py

ericholscher · 2023-01-16T23:01:41Z

The sphinx-build command is executed from path where the repository is cloned, instead from the the directory where the conf.py lives. This is to simplify the code and be able to use relative path for the build output directory (e.g. sphinx-build -b html ... docs/ _readthedocs/html).

This one scares me a little bit, as I mentioned in the call. Users are doing lots of weird things with the working directory in conf.py, mostly to get various things added to the PYTHONPATH or similar. I do think we probably want to avoid this if we can easily undo it.

Upload step now validate there is one and only one PDF, ePUB and ZIP file when they are built. Currently, we only support one file per version. In the future we could remove this restriction. It fails silently for now, which is not great. We need to refactor this code a little to fail the build properly and communicate this to the user. I think it's the correct way of doing it.

This is a nice improvement. Silent failure is definitely not a great UX in general, and I'm glad we're working to remove these as we go.

I think the next step here would probably be supporting a zip file, to allow people to upload multiple? Or support multiple, and zip them on our side. But not a problem for this PR :)

ericholscher

I'm very excited about this change, and think it's a great one. Other than my 2 comments above, I just had a few small nitpicks as I read through things. Feel free to disregard some of them, I am just a bit hesitent to change too much all at once, given the side of this already underlying change.

Really think this is a great step forward overall though. We should coordinate this change w/ the rclone changes @stsewd is making. I don't think they'll conflict too badly, but definitely changing a lot of similar code.

readthedocs/doc_builder/backends/sphinx.py

ericholscher · 2023-01-16T23:17:09Z

readthedocs/doc_builder/base.py

-            safe_rmtree(self.old_artifact_path)
-            log.info('Removing old artifact path.', path=self.old_artifact_path)
+    def _post_build(self):
+        """Execute extra steps (e.g. create ZIP, rename PDF, etc) after building if required."""


Does Python support empty function bodies?!

I generally still think it's cleaner to have a pass or raise NotImplementedError here to be more explicit.

He, yes. It does. I usually add a pass as well. However, it seems that Black removed it. We cannot have a raise NotImplementedError because it's always called by the builder and some of the subclass don't implement it.

I'm not sure if there is a way to tell Black that I want to keep the pass in this line. That's probably what we need here.

ericholscher · 2023-01-16T23:20:53Z

readthedocs/projects/tasks/builds.py

        log.info(
            "Store build artifacts finished.",
            time=(timezone.now() - time_before_store_build_artifacts).seconds,
        )

-        # NOTE: we are updating the db version instance *only* when
        # HTML are built successfully.


Seems we deleted half this comment. Is it not true anymore?

Ah. I see it got missed in a comment refactor. This line needs to be moved down below :)

I got confused by this comment. I'm not sure if there is something I have to do here or not 😅

Yes, the comment was 2 lines, and only 1 line was moved a few lines down. The comment is basically saying we only hit the API when html passes, which is medium-weird, but we don't want to change with this refactor.

which is medium-weird, but we don't want to change with this refactor

Yes, this is wrong, in my opinion. I noticed this when I did the Celery refactor. This is something I want to change soon, if possible.

readthedocs/projects/tasks/builds.py

readthedocs/projects/tests/test_build_tasks.py

stsewd · 2023-01-17T00:36:18Z

We should coordinate this change w/ the rclone changes @stsewd is making. I don't think they'll conflict too badly, but definitely changing a lot of similar code.

This doesn't conflict with the rclone PR.

humitos · 2023-01-17T09:47:18Z

@ericholscher

This one scares me a little bit, as I mentioned in the call. Users are doing lots of weird things with the working directory in conf.py, mostly to get various things added to the PYTHONPATH or similar. I do think we probably want to avoid this if we can easily undo it.

This is definitely a good point.

I can revert this change, but it will produce different relative output paths depending on the location of the conf.py for the project --which could be confusing:

a project with docs/conf.py will execute this command: sphinx-build -b html [...] . ../_readthedocs/html
another project with src/docs/user/conf.py will call: sphinx-build -b html [...] . ../../../_readthedocs/html

Note we will be keeping the source_directory as . but they will have different output_directory. I decided to go the way I wrote the code because it's more explicit, easier to read and the command it's consistent across all the projects (instead of changing the output_directory, this implementation changes the source_directory which makes it explicit, since the cwd is not exposed to users)

In theory, people should be modifying the PYTHONPATH based on the location of the conf.py file (since their code is running from inside it) instead of cwd; but I know that's not true in practice 😄

humitos · 2023-01-17T09:50:32Z

@ericholscher

I think the next step here would probably be supporting a zip file, to allow people to upload multiple? Or support multiple, and zip them on our side. But not a problem for this PR :)

I prefer supporting uploading and serving multiple .pdf as-is. Otherwise, people will have to download a big .zip just to read one small PDF. Being able to read just one PDF is a good feature. Also, giving people the ability to "download all the PDF as .zip" is another good feature.

In summary, I want both of your suggestions 😄

readthedocs/doc_builder/backends/sphinx.py

I proposed to changed this CWD to be the one where the repository was cloned, but that could lead to some unexpected behavior. So, it's better to be conservative here and keep the CWD as it was. There are some small downsides for this decision, but it's better than breaking people's builds :)

humitos · 2023-01-17T12:51:44Z

@ericholscher I addressed most of the feedback (thanks!). I reverted the CWD change and I added an exception to raise when there is no PDF found. I repeated some QA locally and it seems that things keep working fine. Note that I tried to make incremental commits to allow easier reviews.

Please, let me know if there are more changes to be made on this PR. If there are things we want to address later, we can create issues otherwise.

humitos · 2023-01-18T11:03:11Z

@ericholscher

I wonder about using a full path for the readthedocs/html directory? I agree that it's not a great UX, but probably nicer output to just have it be explicit. Even better if we can just do something like $HOME/rtd-builds/<project>/<version>/_readthedocs/html or even $BUILD_ROOT/_readthedocs/html??

I really like this idea of $BUILD_ROOT 💯 . I'd probably name it differently and also include the _readthedocs/ part on it. This is my proposal ¹:

$READTHEDOCS_OUTPUT=<path where the repository was clonned>/_readthedocs/

Then, the Sphinx command executed would be:

python -m sphinx -T -E -j auto -b singlehtml -d _build/doctrees -D language=en . $READTHEDOCS_OUTPUT/html

I think it's pretty clear, short and easy to understand 👍🏼 . It also doesn't change the contract we have with build.commands since it keeps using the _readthedocs/html folder from inside where the repository was cloned. Ideally, we should use a path completely separated from the "user's repository" to avoid potential collisions, but that will break the contract we currently have.

we could also have one variable per output format type we support: $READTHEDOCS_OUTPUT_HTML, $READTHEDOCS_OUTPUT_HTMLZIP, $READTHEDOCS_OUTPUT_PDF, etc... ↩

Co-authored-by: Eric Holscher <25510+ericholscher@users.noreply.github.com>

humitos · 2023-01-18T12:08:39Z

I opened #9913 to continue the work making the output path an environment variable because I found it will require some more extra work, and I'd like to not block this particular because of that. I'm happy to merge this PR exposing the path as relative now (we are not changing the behavior here since it was the way it currently works) and make the improvement on being explicit in the other PR with fresh eyes and deeper research about why it's not working as expected.

ericholscher

I think this looks pretty much ready. It's definitely something we should be testing heavily during QA steps, and ahead of deploy if we have the energy. Really excited to see it go out!

ericholscher · 2023-01-18T23:50:05Z

readthedocs/doc_builder/backends/sphinx.py

+            # IMO, if there are multiple config files,
+            # the build should fail immediately communicating this to the user.
+            # This can be achived by unhandle the exception here
+            # and leaving `on_failure` Celery handle to deal with it.
+            #
+            # In case there is no config file, we should continue the build
+            # because Read the Docs will automatically create one for it.


This seems reasonable. Is there a good way to implement the suggested logic? I guess raise a different exception for multiple configs that we raise (or just don't catch)?

Yes. I documented this here in this way because I want to change the internal behavior and also communicate these errors better to the user. I'm opening a new issue with all these findings to continue working after this, since it will require some refactor and more design decisions.

humitos · 2023-01-19T10:25:54Z

@ericholscher thanks for all these reviews 👍🏼 . I opened #9918 with all the notes we took from the conversations on this PR to not forget them and decide which of them we want to prioritize. I tried to link as many thing there as I could.

I'm going to merge this PR now and ping the rest of the team so they can do some QA locally before deploying it next week.

readthedocs/readthedocs.org#9888

humitos requested a review from a team as a code owner January 11, 2023 14:43

humitos requested a review from benjaoming January 11, 2023 14:43

auto-assign bot assigned humitos Jan 11, 2023

humitos requested review from ericholscher, stsewd and agjohnson and removed request for benjaoming January 11, 2023 14:43

humitos force-pushed the humitos/standard-output-directory branch from db78cdf to faec2a2 Compare January 11, 2023 14:59

humitos commented Jan 11, 2023

View reviewed changes

readthedocs/doc_builder/backends/sphinx.py Outdated Show resolved Hide resolved

humitos added 5 commits January 11, 2023 17:05

Build: decideif there is an output type based on the path existence

0666df6

Test: update build tests to match changes

b494c64

Test: check if the file exist before continue

d2bf3f0

Build: simplify the code by running the commands inside the container

2ab29ba

Test: add checks for more commands

faa611f

This was referenced Jan 13, 2023

Testing hidden folders with RTD 2i2c-org/team-compass#585

Closed

Docs: explain how _readthedocs/<format> directories work, and add examples #9894

Open

ericholscher reviewed Jan 16, 2023

View reviewed changes

humitos commented Jan 17, 2023

View reviewed changes

readthedocs/doc_builder/backends/sphinx.py Outdated Show resolved Hide resolved

humitos added 6 commits January 17, 2023 13:11

Storage: use constants to make explicit artifact types

c80416a

PDF builder: raise an error if the PDF file is not found

67f0958

Build: use relative_output_dir and absolute_output_dir

f54ca20

HTMLZip build: use zip inside the Docker container to build it

da328fd

Minor changes about docstring and final dot in a sentence :)

782179f

humitos and others added 4 commits January 18, 2023 12:15

Sphinx builder: better default

1f50a3b

Update readthedocs/doc_builder/backends/sphinx.py

a88bf14

Co-authored-by: Eric Holscher <25510+ericholscher@users.noreply.github.com>

Update readthedocs/projects/tests/test_build_tasks.py

280e8e4

Co-authored-by: Eric Holscher <25510+ericholscher@users.noreply.github.com>

Minor changes to fix tests

e78442a

humitos mentioned this pull request Jan 18, 2023

Build: use environment variable $READTHEDOCS_OUTPUT to define output directory #9913

Merged

Docstring: explain why there is an exception handling at this place

c301c0d

humitos requested a review from ericholscher January 18, 2023 17:41

ericholscher approved these changes Jan 18, 2023

View reviewed changes

humitos mentioned this pull request Jan 19, 2023

Build: notes from output directory standardization #9918

Closed

humitos merged commit 239d890 into main Jan 19, 2023

humitos deleted the humitos/standard-output-directory branch January 19, 2023 10:26

humitos mentioned this pull request Jan 23, 2023

Build: handle symlinks in a centralized way #9800

Closed

mwtoews mentioned this pull request Jan 24, 2023

ReadTheDocs: update output directory for artefacts OSGeo/PROJ#3591

Merged

ryanking13 mentioned this pull request Jan 25, 2023

Backports for 0.22.1 pyodide/pyodide#3487

Merged

hoodmane added a commit to hoodmane/pyodide that referenced this pull request Jan 25, 2023

Try to fix docs deploy

5325f58

readthedocs/readthedocs.org#9888

jobovy mentioned this pull request Jan 26, 2023

Troubleshoot ReadTheDocs jupyterlite/jupyterlite#964

Merged

bmhan12 mentioned this pull request Jan 31, 2023

Fix doxygen on ReadTheDocs (+ return to srun for MPI tests on tioga) LLNL/axom#1016

Merged

This was referenced Jan 31, 2023

Fix path to RTD hosted output directory for doxygen content LLNL/RAJA#1437

Closed

Change path to generated doxygen due to RTD change LLNL/RAJA#1438

Closed

The real fix for the recent RTD change.... LLNL/RAJA#1439

Merged

humitos mentioned this pull request Feb 2, 2023

Make PDF and EPUB opt-in, rather than opt-out? #8359

Closed

humitos mentioned this pull request Jun 8, 2023

Multiple pdf files for single project #2045

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Build: standardize output directory for artifacts #9888

Build: standardize output directory for artifacts #9888

humitos commented Jan 11, 2023 •

edited

Loading

ericholscher commented Jan 16, 2023 •

edited

Loading

ericholscher left a comment

ericholscher Jan 16, 2023

humitos Jan 17, 2023

ericholscher Jan 16, 2023

humitos Jan 17, 2023

ericholscher Jan 17, 2023

humitos Jan 18, 2023

stsewd commented Jan 17, 2023

humitos commented Jan 17, 2023

humitos commented Jan 17, 2023

humitos commented Jan 17, 2023 •

edited

Loading

humitos commented Jan 18, 2023 •

edited

Loading

humitos commented Jan 18, 2023

ericholscher left a comment

ericholscher Jan 18, 2023

humitos Jan 19, 2023

humitos commented Jan 19, 2023

Build: standardize output directory for artifacts #9888

Build: standardize output directory for artifacts #9888

Conversation

humitos commented Jan 11, 2023 • edited Loading

Some important notes

Use cases manually tested from test-builds

ericholscher commented Jan 16, 2023 • edited Loading

ericholscher left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stsewd commented Jan 17, 2023

humitos commented Jan 17, 2023

humitos commented Jan 17, 2023

humitos commented Jan 17, 2023 • edited Loading

humitos commented Jan 18, 2023 • edited Loading

Footnotes

humitos commented Jan 18, 2023

ericholscher left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

humitos commented Jan 19, 2023

humitos commented Jan 11, 2023 •

edited

Loading

Use cases manually tested from `test-builds`

ericholscher commented Jan 16, 2023 •

edited

Loading

humitos commented Jan 17, 2023 •

edited

Loading

humitos commented Jan 18, 2023 •

edited

Loading