Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sphinx-build -jauto has nondeterministic html results #6714

Closed
bmwiedemann opened this issue Oct 2, 2019 · 29 comments · Fixed by #12888
Closed

sphinx-build -jauto has nondeterministic html results #6714

bmwiedemann opened this issue Oct 2, 2019 · 29 comments · Fixed by #12888

Comments

@bmwiedemann
Copy link
Contributor

bmwiedemann commented Oct 2, 2019

Describe the bug
html rendering with -jauto has nondeterministic results, except when run on a 1-core-VM

To Reproduce
Steps to reproduce the behavior:

osc checkout openSUSE:Factory/kernel-docs && cd $_
osc build --noservice
# extract html/Documentation/output/core-api/refcount-vs-atomic.html
# compare to those from another build

or maybe

git clone --depth 1 git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
cd linux-2.6
make htmldocs

Expected behavior
output should be deterministic

Your project
https://github.com/torvalds/linux/blob/master/Documentation/conf.py

Screenshots

-<p>Il kernel è scritto nel linguaggio di programmazione C <a class="reference internal" href="../../../process/programming-language.html#href_anchor" id="a_idN"><span>[c-language]</span></a>.
-Più precisamente, il kernel viene compilato con <code class="docutils literal notranslate"><span class="pre">gcc</span></code> <a class="reference internal" href="../../../process/programming-language.html#href_anchor" id="a_idN"><span>[gcc]</span></a> usando
-l’opzione <code class="docutils literal notranslate"><span class="pre">-std=gnu89</span></code> <a class="reference internal" href="../../../process/programming-language.html#href_anchor" id="a_idN"><span>[gcc-c-dialect-options]</span></a>: il dialetto GNU
+<p>Il kernel è scritto nel linguaggio di programmazione C <a class="reference internal" href="#href_anchor" id="a_idN"><span>[c-language]</span></a>.
+Più precisamente, il kernel viene compilato con <code class="docutils literal notranslate"><span class="pre">gcc</span></code> <a class="reference internal" href="#href_anchor" id="a_idN"><span>[gcc]</span></a> usando
+l’opzione <code class="docutils literal notranslate"><span class="pre">-std=gnu89</span></code> <a class="reference internal" href="#href_anchor" id="a_idN"><span>[gcc-c-dialect-options]</span></a>: il dialetto GNU

Environment info

  • OS: openSUSE Tumbleweed
  • Python version: 3.7.3
  • Sphinx version: 2.2.0
  • Sphinx extensions: websupport, devhelp, htmlhelp, jsmath, serializinghtml

Additional context
Got same result when adding -E

This bug was found while working on reproducible builds for openSUSE.

@AA-Turner AA-Turner added this to the some future version milestone Sep 29, 2022
@bmwiedemann
Copy link
Contributor Author

This is still an issue with Sphinx-7.2.6

So far affected are 3 openSUSE packages: kernel-docs qemu guake

One simple reproducer is
https://github.com/Guake/guake/blob/master/docs/Makefile
where I get varying results from

make -O -j4 -C docs html man

even without -jauto and with -E

@bmwiedemann
Copy link
Contributor Author

I found a workaround for guake: using different -d doctrees paths helps there.

@bmwiedemann
Copy link
Contributor Author

This is likely a duplicate of #2946

@bmwiedemann
Copy link
Contributor Author

Sphinx 7.4.5 still causes non-determinism in our kernel-docs build with -jauto from https://github.com/torvalds/linux/blob/c763c4339688/Documentation/sphinx/parallel-wrapper.sh

@jayaddison
Copy link
Contributor

This is still an issue with Sphinx-7.2.6

So far affected are 3 openSUSE packages: kernel-docs qemu guake

One simple reproducer is https://github.com/Guake/guake/blob/master/docs/Makefile where I get varying results from

make -O -j4 -C docs html man

even without -jauto and with -E

@bmwiedemann would it be possible to prepare an even-more-minimal repro example to replicate the problem? The reason I ask is that I'm aware of a separate race-condition issue ( Guake/guake#2219 ) that occurs when building guake using that same Makefile, and I want to try to eliminate that so that we can more easily determine the root cause in Sphinx.

@bmwiedemann
Copy link
Contributor Author

I think, there may be two different issues.

  1. There is Race conditions when running multiple Sphinx instances using the same doctree #2946 which covers the guake case - there is parallelism external to sphinx and two different sphinx processed don't take care to play nicely with the shared doctrees cache files.

  2. There is kernel-docs where a single sphinx -jauto call does all the processing. The latest diff is https://rb.zq1.de/compare.factory-20240904/diffs/kernel-source-compare.out - this one would probably the better base for building a reproducer. Not sure, I will have time for it soon.

@jayaddison
Copy link
Contributor

2. There is kernel-docs where a single sphinx -jauto call does all the processing. The latest diff is https://rb.zq1.de/compare.factory-20240904/diffs/kernel-source-compare.out - this one would probably the better base for building a reproducer.

From inspecting the diff: most, perhaps all, of the differences appear to be related to the toctree.

@jayaddison

This comment was marked as resolved.

@jayaddison
Copy link
Contributor

So far affected are 3 openSUSE packages: kernel-docs qemu guake

I'm able to replicate this locally for qemu using Sphinx v7.4.7 and also Sphinx v8.0.2, and in both cases with and without sphinx-rtd-theme enabled (in-theory not compatible with Sphinx 8.x; but in practice it can build).

This appears to eliminate sphinx-rtd-theme as a factor (its layout.html does make use of toctree, so initially I considered it a candidate) -- while allowing the possibility that the same root cause affects both qemu and linux (the latter does not use sphinx-rtd-theme by default).

Differing the parallelism level apperas to make the problem manifest more readily (mixing -j5 and -j10 between subsequent builds allowed me to replicate the problem reliably on my test system).

e.g.

$ sphinx-build -j5 -b html docs _build.0
...
$ sphinx-build -j10 -b html docs _build.1
...
$ diff -rq _build.0 _build.1
...
Files _nonrtd.0/specs/ppc-spapr-numa.html and _nonrtd.1/specs/ppc-spapr-numa.html differ
...

The problem is apparent in the sidebar navigation menu; a toctree that is displayed as part of the theme layout. In essence, the navigation hierarchy elements indicated as current seem to differ between subsequent builds.

Build-0 Build-1
image image
image image

This seems similar to #12409; these issues may be duplicates.

@jayaddison
Copy link
Contributor

A naive first-guess at the cause: distributing parallel tasks for each of the nodes of a tree datastructure and then re-combining them in a deterministic manner that is agnostic to both thread-count and also to task-processing-order isn't entirely trivial; race conditions could occur, particularly during collection and recombination of results.

jayaddison added a commit to jayaddison/sphinx that referenced this issue Sep 11, 2024
Bugreports sphinx-doc#6714 and sphinx-doc#12409 indicate that the table-of-contents
collection process is not currently implemented in a way that
guarantees deterministic resolution in the presence of parallelism.

Until such time as we can implement that, disable parallel reads
for the `TocTreeCollector` so that the tree is resolved serially.
@jayaddison

This comment was marked as resolved.

@jayaddison
Copy link
Contributor

Note: it seems that #12885 would effectively turn off all read parallelism when building any Sphinx application that uses a toctree. That isn't something that I had anticipated when drafting the change.

@jayaddison

This comment was marked as outdated.

@jayaddison
Copy link
Contributor

The following command may be useful to attempt comparative builds of test-toctree-glob, while collecting some verbose build logs:

$ rm -rf _build.*; sphinx-build -v -j5 -b html tests/roots/test-toctree-glob _build.5 > build5.txt; sphinx-build -v -j10 -b html tests/roots/test-toctree-glob _build.10 > build10.txt; diff -rq _build.*;

Diffs in the output do not always occur, but repeating that command a few times will generally result in a nondeterministic pair of builds.

I'm currently focusing on this section of code that appears to be important during the toctree build:

toc = build_toc(doctree)
if toc:
app.env.tocs[docname] = toc
else:
app.env.tocs[docname] = nodes.bullet_list('')
app.env.toc_num_entries[docname] = numentries[0]

@jayaddison
Copy link
Contributor

The call to note_toctree here seems like a problem candidate:

note_toctree(app.env, docname, toctreenode)

Basically, I'm working on the assumption that parts of the toctree are being assembled by somewhat-independent worker subprocesses (within the single Unix-ish Python program process) -- and that if they do not provide, or if they alter, information that each other rely on in order to build a consistent tree, then problems will result.

Commenting-out these lines seems to provide for reproducible output, but loses the valuable nesting information in the HTML navigation tree:

while d in parent and d not in ancestors:
ancestors.append(d)
d = parent[d]

@jayaddison
Copy link
Contributor

(a potentially-important connection between note_toctree and _get_toctree_ancestors is the env.toctree_includes environment attribute, which note_toctree modifies as a side-effect during the collector phase. that seems potentially problematic, because the ordering of those side-effects could be important)

@jayaddison

This comment was marked as resolved.

@jayaddison
Copy link
Contributor

At the risk of seeming spammy (I feel like I'm adding a lot of commentary without many conclusions) - I think that there is a tree traversal order problem here.

I can't express precisely why yet, but I feel that the index document requires processing first; processing of other documents appears to pollute the environment state in a way that causes errors if that does not happen.

Root/index-document first seems to imply breadth-first traversal order to me -- that is, begin with all documents at level 1, then process all documents at level 2, and so on -- rather than exploring from root to a leaf node before iterating elsewhere (depth-first).

@jayaddison
Copy link
Contributor

s/errors/incorrect output

@jayaddison
Copy link
Contributor

@AA-Turner @picnixz @chrisjsewell I'd be very glad for any help solving this.

There is definitely a tricky problem somewhere related to toctree ordering and parallelism in HTML builds -- and whatever it is feels at the limit or beyond my knowledge of the Sphinx internals.

I believe that #12886 provides a test case that should help explore/confirm possible solutions -- it passes sometimes, but not always, when run locally (and that, I hope, is an indicator that it may be accurate, since the bug itself is also intermittent).

@khanxmetu
Copy link
Contributor

khanxmetu commented Sep 15, 2024

The following command may be useful to attempt comparative builds of test-toctree-glob, while collecting some verbose build logs:

$ rm -rf _build.*; sphinx-build -v -j5 -b html tests/roots/test-toctree-glob _build.5 > build5.txt; sphinx-build -v -j10 -b html tests/roots/test-toctree-glob _build.10 > build10.txt; diff -rq _build.*;

Diffs in the output do not always occur, but repeating that command a few times will generally result in a nondeterministic pair of builds.

@jayaddison Thank you for suggesting a minimal reproducible example and sharing your thoughts. I was able to replicate and after going through the code I have figured out the root cause.

The issue is indeed related to parallelism and toctree building but more precisely about the order in which documents are read/parsed before writing phase begins and an assumption made during toctree generation.

During the read phase, one of the attributes that is populated (when doctree-read emitted) is BuildEnvironment.toctree_includes which is essentially a dictionary mapping doc->children docs with respect to toctree. The order of this dictionary is dependent on the order in which files are read/parsed and therefore affected by parallelism.

In the write phase toctrees are independently resolved (via Builder.env.get_and_resolve_doctree() in Builder.write()) and by that time all docs have been read and parsed. It internally calls _get_toctree_ancestors passing in the above BuildEnvironment.toctree_includes dictionary. The main issue here is the assumption of _get_toctree_ancestors that the toctree is indeed a tree (meaning each node/doc has a unique parent) and when there could be multiple parents/paths to root ancestor it tracks just one path which depends on the order of keys in BuildEnvironment.toctree_includes.
This can be seen in the following loop which replaces any prior parent found for a given doc where parent: dict[str, str] is doc->parentdoc mapping:

for p, children in toctree_includes.items():
parent |= dict.fromkeys(children, p)

The fix is trivial. We can ensure that the path is deterministic (lexicographically greatest parent chosen) in all cases by sorting Builder.env.toctree_includes before write phase begins.

Edit:
Another deterministic approach could be to choose the deepest path. Though it would increase the computation time depending on number of such cases where there are multiple parents but still can be considered.

@bmwiedemann
Copy link
Contributor Author

I tested #12888 with our kernel-docs usecase and found that it avoids all the toctree diffs. However there is still two other diffs left:

--- old//usr/share/doc/kernel/html/rst/translations/zh_CN/dev-tools/gcov.html   2024-09-19 20:01:21.450883175 +0000
+++ new//usr/share/doc/kernel/html/rst/translations/zh_CN/dev-tools/gcov.html   2024-09-19 20:01:21.458883214 +0000
@@ -361,7 +361,7 @@
 <section id="a-collect-on-build-sh">
 <h2>附录A:collect_on_build.sh<a class="headerlink" href="#href_anchor" title="Link to this heading">¶</a></h2>
 <p>用于在编译机上收集覆盖率元文件的示例脚本
-(见 <a class="reference internal" href="#href_anchor"><span class="std std-ref">编译机和测试机分离 a.</span></a> )</p>
+(见 <a class="reference internal" href="../../zh_TW/dev-tools/gcov.html#href_anchor"><span class="std std-ref">编译机和测试机分离 a.</span></a> )</p>
 <div class="highlight-sh notranslate"><div class="highlight"><pre><span></span><span class="ch">#!/bin/bash</span>

The other is in /usr/share/doc/kernel/html/rst/objects.inv

I checked both were also part of the original https://rb.zq1.de/compare.factory-20240904/diffs/kernel-source-compare.out but hard to spot between all the others.

@khanxmetu
Copy link
Contributor

khanxmetu commented Sep 19, 2024

I tested #12888 with our kernel-docs usecase and found that it avoids all the toctree diffs. However there is still two other diffs left:

--- old//usr/share/doc/kernel/html/rst/translations/zh_CN/dev-tools/gcov.html   2024-09-19 20:01:21.450883175 +0000
+++ new//usr/share/doc/kernel/html/rst/translations/zh_CN/dev-tools/gcov.html   2024-09-19 20:01:21.458883214 +0000
@@ -361,7 +361,7 @@
 <section id="a-collect-on-build-sh">
 <h2>附录A:collect_on_build.sh<a class="headerlink" href="#href_anchor" title="Link to this heading">¶</a></h2>
 <p>用于在编译机上收集覆盖率元文件的示例脚本
-(见 <a class="reference internal" href="#href_anchor"><span class="std std-ref">编译机和测试机分离 a.</span></a> )</p>
+(见 <a class="reference internal" href="../../zh_TW/dev-tools/gcov.html#href_anchor"><span class="std std-ref">编译机和测试机分离 a.</span></a> )</p>
 <div class="highlight-sh notranslate"><div class="highlight"><pre><span></span><span class="ch">#!/bin/bash</span>

The other is in /usr/share/doc/kernel/html/rst/objects.inv

I checked both were also part of the original https://rb.zq1.de/compare.factory-20240904/diffs/kernel-source-compare.out but hard to spot between all the others.

Good catch. I believe this is due to multiple definitions of the target _gcov-test_zh (see
Documentation/translations/zh_CN/dev-tools/gcov.rst and Documentation/translations/zh_TW/dev-tools/gcov.rst). Instead of sharing the same target names, the references in zh_TW folder should point to corresponding _TW targets and the references in zh_CN folder should point to corresponding _CN targets.

Ideally, our priority here should be to warn the user, ensuring determinism for this case doesn't bring much meaning in my opinion.

@bmwiedemann
Copy link
Contributor Author

For us, the build process happens automatic and you have to actively go look for the log to see any warning. About 5 steps of navigation to get to https://build.opensuse.org/public/build/openSUSE:Factory/standard/x86_64/kernel-source:kernel-docs/_log - so few people will see any warnings.

As a user, I would prefer for a link to always point to the local reference, even if another document duplicated it. Seems nicer than to declare this undefined behavior as wanted. But if that is too hard, then it is "GIGO" (Garbage In => Garbage Out)

@bmwiedemann
Copy link
Contributor Author

bmwiedemann commented Sep 20, 2024

Sent a patch for Documentation/translations/zh_CN/dev-tools/gcov.rst =
https://lore.kernel.org/linux-doc/20240920070144.26947-1-bernhard+linux-doc@lsmod.de/T/#u

That still leaves objects.inv as unreproducible. Is that file actually needed in the output?

@khanxmetu
Copy link
Contributor

As a user, I would prefer for a link to always point to the local reference, even if another document duplicated it. Seems nicer than to declare this undefined behavior as wanted. But if that is too hard, then it is "GIGO" (Garbage In => Garbage Out)

Having a preference for labels within the same doc is not too hard to implement, but if we account for this case we should account for other cases like:

  1. Should the labels defined in the docs under common ancestor folder be also considered as local? Let's say there are duplicated labels being referenced outside the same document for instance _kernel_doc_zh defined in zh_CN/doc-guide/kernel-doc.rst was similarly defined in zh_TW/doc-guide/kernel-doc.rst(does not exist atm) and being referenced in zh_CN/process/submit-checklist.rst and zh_TW/process/submit-checklist.rst. This would require significant work I believe.
  2. What should happen when there are duplicated label definitions under the same locality? We can ensure determinism by choosing lexicographically smallest/greatest location but this is arbitrary and doesn't bring much meaning in my opinion as I said.

@bmwiedemann
Copy link
Contributor Author

I found that our python-Sphinx package also packages the test output. This produced
https://rb.zq1.de/compare.factory-20240904/diffs/python-Sphinx-compare.out with missing entries in index.html

@jayaddison
Copy link
Contributor

That still leaves objects.inv as unreproducible. Is that file actually needed in the output?

@bmwiedemann the objects.inv file is an Intersphinx inventory, used during creation of inter-project references, and it is currently not an optional part of the HTML build output (it is always generated). I think the issue you describe was previously reported in #12001 -- and from there is linked to a possible fix for the QEMU project in https://gitlab.com/qemu-project/qemu/-/issues/2190 .

I'll re-open #12001 since there may be a way to fix it centrally in Sphinx; it's closely-related and arguably a subset of this issue, but I think is worth tracking with a distinct issue.

@bmwiedemann
Copy link
Contributor Author

Found another strange effect with our python 3.12 package that has a Python-3.12.5/Doc/Makefile JOBS=auto setting
https://rb.zq1.de/other/python312-doc-20241001.diff.txt

intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this issue Oct 7, 2024
We found that sphinx parallel processing would randomly pick
one or the other
sphinx-doc/sphinx#6714 (comment)

Signed-off-by: Bernhard M. Wiedemann <bwiedemann@suse.de>
Reviewed-by: Yanteng Si <siyanteng@linux.dev>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20240920070144.26947-1-bernhard+linux-doc@lsmod.de
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this issue Oct 14, 2024
We found that sphinx parallel processing would randomly pick
one or the other
sphinx-doc/sphinx#6714 (comment)

Signed-off-by: Bernhard M. Wiedemann <bwiedemann@suse.de>
Reviewed-by: Yanteng Si <siyanteng@linux.dev>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20240920070144.26947-1-bernhard+linux-doc@lsmod.de
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 9, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.