
Race conditions when running multiple Sphinx instances using the same doctree #2946

Open
mgorny opened this issue on Sep 10, 2016 · 5 comments
Labels: internals:parallel, type:bug, type:enhancement

Comments

mgorny (Contributor) commented on Sep 10, 2016

Projects using Sphinx (and Makefiles) often define multiple targets for different doc builders. For example, the LLVM project defines a target for manpages and for HTML docs. Since those targets are independent, make is allowed to run them in parallel (and does that). Since they use the same source tree, they use the same doctree root as well.

Lately, I've noticed that the two parallel Sphinx instances 'fight' over the doctree. They ignore each other and create the doctrees independently, i.e. the same files are first written by one Sphinx instance and then overwritten by the second. While normally this only wastes work (it's awfully inefficient), I've actually seen a race condition cause a major failure (sorry, I don't have any logs for it): one Sphinx instance opened one of the files for reading while the other was still writing it.

We've attempted to find a good solution on the build system side but could find none that was satisfactory. If we switched to separate doctrees, that would waste a good caching opportunity. Getting two independent targets serialized is not trivial in CMake, and the LLVM people aren't happy with me adding fake dependencies to enforce serialization. But even if we did that, there's still the general problem that other people will be unaware of the issue and will keep hitting it in the future.

Therefore, I think it would be best to fix this on the Sphinx side. The simplest solution would be to lock some file in the doctree directory. Given the current design, the environment pickle seems to be the best file to lock. However, that would require redesigning the application to keep the file open rather than closing it immediately after reading/writing (which would also help avoid race conditions).
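
Roughly, I imagine something like the following. This is only a sketch, assuming POSIX advisory locks (fcntl.flock) and the environment pickle being stored as environment.pickle inside the doctree directory; the helper names are made up for illustration, not existing Sphinx API:

```python
import fcntl
import os

def lock_environment(doctreedir):
    """Open (or create) the environment pickle and take an exclusive lock on it."""
    path = os.path.join(doctreedir, "environment.pickle")
    fd = os.open(path, os.O_RDWR | os.O_CREAT)    # read/write, create if missing
    env_file = os.fdopen(fd, "r+b")
    fcntl.flock(env_file, fcntl.LOCK_EX)          # blocks until the other instance releases it
    return env_file

def unlock_environment(env_file):
    """Release the lock and close the handle after the new pickle is written."""
    fcntl.flock(env_file, fcntl.LOCK_UN)
    env_file.close()
```

With an exclusive lock like this, the second sphinx-build instance would simply block until the first one has finished writing the updated pickle, instead of the two racing over the same files.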

mgorny (Contributor, Author) commented on Sep 11, 2016

Well, I've tried to give it a shot… and I can't find a sane way of doing it with the current design.

The big problem is that the pickles are used quite inconsistently. The pickled environment is first read in Sphinx._init_env(), which is called by Sphinx.__init__(), i.e. pretty early. However, it is not updated until Builder.build(), and it is only pickled again afterwards. So to get the result I need, I would have to hold a lock on the environment from Sphinx._init_env() until Builder.build() finishes.

To do that, I would have to keep the environment file open during that period. That's quite doable — I was thinking of opening it directly in Sphinx._init_env(), and keeping the open file in BuildEnvironment. This would have the extra advantage of not having to repeatedly pass the path around. In this case, Sphinx._init_env() would open the file r+ (i.e. open R/W or create), and lock it. BuildEnvironment.frompickle() would just read the open file (if non-empty), and BuildEnvironment.topickle() would rewind and write it. Then it could be unlocked and closed.
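
As a rough sketch, frompickle()/topickle() would then operate on that kept-open handle along these lines (hypothetical free functions for illustration; the real methods work on a file path today):

```python
import pickle

def env_frompickle(env_file):
    """Read the pickled environment from the open, locked handle, if it is non-empty."""
    env_file.seek(0)
    data = env_file.read()
    return pickle.loads(data) if data else None   # None means a fresh environment

def env_topickle(env_file, env):
    """Rewind and rewrite the pickled environment through the same handle."""
    env_file.seek(0)
    env_file.truncate()
    pickle.dump(env, env_file)
    env_file.flush()
```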

The problem with all that is that Python doesn't do RAII properly, so there's no clean way to ensure that the file is closed on an exception without redesigning how Sphinx is used. Probably the whole class would have to be turned into a context manager and used via with, the way files are usually opened. Otherwise, I'm afraid some corner case may leave the file open (and locked), deadlocking another process.
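
For illustration, a context manager along these lines would guarantee the unlock even when the build raises; again, this is just a sketch reusing the hypothetical helpers above, not a proposed patch:

```python
import contextlib
import fcntl
import os

@contextlib.contextmanager
def locked_environment(path):
    """Hold an exclusive lock on the environment pickle for the duration of the block."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT)
    env_file = os.fdopen(fd, "r+b")
    try:
        fcntl.flock(env_file, fcntl.LOCK_EX)
        yield env_file
    finally:
        fcntl.flock(env_file, fcntl.LOCK_UN)   # released on success and on exceptions alike
        env_file.close()

# Illustrative usage only (BuildEnvironment arguments elided):
#
# with locked_environment(os.path.join(doctreedir, "environment.pickle")) as f:
#     env = env_frompickle(f) or BuildEnvironment(...)
#     ...  # read the sources, run the builder
#     env_topickle(f, env)
```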

tk0miya (Member) commented on Sep 14, 2016

Agreed. As you said, Sphinx is not designed to be invoked in parallel.
Is there any reason not to use the -j option of sphinx-build?
I think that option would help you.

About locking: the period from BuildEnvironment.frompickle() to BuildEnvironment.topickle() is very long. I wonder whether it really helps parallelism; the second Sphinx process would wait until the first one has finished reading all the documents.

tk0miya added the type:enhancement label on Sep 14, 2016
tk0miya added this to the "some future version" milestone on Sep 14, 2016
mgorny (Contributor, Author) commented on Sep 14, 2016

> Agreed. As you said, Sphinx is not designed to be invoked in parallel.
> Is there any reason not to use the -j option of sphinx-build?
> I think that option would help you.

The problem is that we are actually using two different builders (manpage and HTML), so -j doesn't really help here. Even if Sphinx had an option to use two different builders in one run, that wouldn't integrate well with build systems where it is desired to have separate targets.

> About locking: the period from BuildEnvironment.frompickle() to BuildEnvironment.topickle() is very long.
> I wonder whether it really helps parallelism; the second Sphinx process would wait until the first one
> has finished reading all the documents.

Well, the lack of parallelism is not as much of an issue as the risk of a race condition resulting in malformed documents or a crash.

zenhack commented on Sep 6, 2017

fwiw, this affects https://github.com/pazz/alot as well. I maintain the AUR package and am currently just passing -j1 to make. -j1 isn't a big deal for alot, because it only uses make for the docs, but it took a bit of work to track down what was going on, and a user's build log suggests this is worth reporting, if for nothing else than better error messages:

https://ci.virtapi.org/job/Arch_Package_alot/35/console

ehabkost commented

This is affecting the QEMU project as well. See the build failure at:
https://app.shippable.com/github/ehabkost/qemu/runs/24/8/console

ehabkost added a commit to ehabkost/qemu-hacks that referenced this issue Oct 12, 2019
sphinx-build is buggy when multiple processes are using the same
doctree directory in parallel.  See the 3-year-old Sphinx bug
report at: sphinx-doc/sphinx#2946

Instead of avoiding parallel builds or adding some kind of
locking, I'm using the simplest solution: just using a different
doctree cache for each builder.

Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
qemu-deploy pushed a commit to qemu/qemu that referenced this issue Oct 17, 2019
sphinx-build is buggy when multiple processes are using the same
doctree directory in parallel.  See the 3-year-old Sphinx bug
report at: sphinx-doc/sphinx#2946

Instead of avoiding parallel builds or adding some kind of
locking, I'm using the simplest solution: just using a different
doctree cache for each builder.

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 20191014150133.14318-1-ehabkost@redhat.com
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>