A regression was introduced in commit 9e2572b: previously with PyMuPDF or pdf2image rasterizer implementations, it was possible to run nougat in a multiprocessing pool, so that multiple PDFs could be parsed at the same time.
With pypdfium2 this is no longer possible. Running with multiprocessing results in errors like this:
ERROR:root:daemonic processes are not allowed to have children
ERROR:root:list index out of range
WARNING:root:Image not found
ERROR:root:list index out of range
WARNING:root:Image not found
ERROR:root:list index out of range
WARNING:root:Image not found
It happens because pypdfium2's Document.render method creates a process pool internally, and in Python it is not possible to nest multiprocessing pools (at least not with the built-in implementation): pool workers are daemonic processes, and daemonic processes are not allowed to start children. Although it is possible to set n_processes to 1 in Document.render, there is no option to avoid creating sub-processes altogether.
For comparison, the torch.utils.data.DataLoader class solves this by allowing num_workers to be set to 0 and handling it as a special case, in which data is loaded in the main process. I guess, though, that it might be more difficult to solve this from the pypdfium2 side than to switch back to the earlier pdf2image implementation, unless there is a good reason to use pypdfium2?
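To illustrate the kind of special case meant here, the following is a minimal sketch of a renderer that creates no child processes when n_processes is 0. The names render_page and render_all are hypothetical stand-ins, not pypdfium2 or nougat APIs, and the extra daemon check is an assumption about what a safe fallback could look like.

```python
from multiprocessing import Pool, current_process


def render_page(index):
    # Stand-in for real page rendering; returns a label instead of a bitmap.
    return f"page-{index}"


def render_all(page_indices, n_processes=0):
    """Render pages, creating no child processes when n_processes is 0."""
    page_indices = list(page_indices)
    # Daemonic processes (such as the workers of an outer Pool) cannot start
    # children, so fall back to in-process rendering there as well.
    if n_processes == 0 or current_process().daemon:
        return [render_page(i) for i in page_indices]
    with Pool(n_processes) as pool:
        return pool.map(render_page, page_indices)
```

With n_processes=0 the function is safe to call from inside an outer pool worker, because it never tries to spawn anything.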
I regret to say that the document-level pdf.render() API was an inherent design mistake, since it implies transferring bitmaps across processes. Also, as you have noticed here, pypdfium2 providing an API with a "hidden" process pool is kind of problematic. pdf.render() is deprecated for these reasons; however, callers are encouraged to implement their own parallelization without bitmap transfer.
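One way a caller might do this is sketched below, under stated assumptions: a single caller-owned pool with one job per PDF, where each worker opens its own document, renders the pages in-process, and writes PNG files, so only file paths cross the process boundary. The helpers build_jobs, render_pdf_to_dir and render_many are hypothetical names; the page-level calls (PdfDocument, page.render(scale=...), to_pil()) assume a pypdfium2 version that provides them, plus Pillow.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def build_jobs(pdf_paths, out_root):
    """Pair each PDF with its own output directory, named after the file stem."""
    jobs = []
    for pdf_path in pdf_paths:
        stem = os.path.splitext(os.path.basename(pdf_path))[0]
        jobs.append((pdf_path, os.path.join(out_root, stem)))
    return jobs


def render_pdf_to_dir(job):
    """Worker: open the PDF here, render every page in-process, save PNGs."""
    import pypdfium2 as pdfium  # imported lazily; assumed to be installed

    pdf_path, out_dir = job
    os.makedirs(out_dir, exist_ok=True)
    pdf = pdfium.PdfDocument(pdf_path)
    paths = []
    for index in range(len(pdf)):
        out_path = os.path.join(out_dir, f"page_{index:04d}.png")
        pdf[index].render(scale=2).to_pil().save(out_path)
        paths.append(out_path)
    return paths  # only short strings are sent back, never bitmaps


def render_many(pdf_paths, out_root, max_workers=2):
    """Render several PDFs in parallel using one pool owned by the caller."""
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(render_pdf_to_dir, build_jobs(pdf_paths, out_root)))
```

render_many should be called from a non-daemonic process, typically under an `if __name__ == "__main__":` guard. Because the workers never create pools of their own, the nesting problem described above does not arise.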