Pypdfium2 clashes with multiprocessing support #110

Open
Vidminas opened this issue Sep 21, 2023 · 2 comments

Comments

@Vidminas

A regression was introduced in commit 9e2572b: previously with PyMuPDF or pdf2image rasterizer implementations, it was possible to run nougat in a multiprocessing pool, so that multiple PDFs could be parsed at the same time.

With pypdfium2 this is no longer possible. Running with multiprocessing results in errors like this:

ERROR:root:daemonic processes are not allowed to have children
ERROR:root:list index out of range
WARNING:root:Image not found
ERROR:root:list index out of range
WARNING:root:Image not found
ERROR:root:list index out of range
WARNING:root:Image not found

This happens because pypdfium2's Document.render method contains these lines:

with mp.Pool(n_processes, **pool_kwargs) as pool:
    yield from pool.imap(_parallel_renderer_job, page_indices)

and Python's built-in multiprocessing does not allow nested pools: pool workers are daemonic processes, and daemonic processes cannot start children. Although n_processes can be set to 1 in Document.render, there is no option to skip creating subprocesses altogether.
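The failure can be reproduced without pypdfium2. The following is a minimal sketch, not code from nougat or pypdfium2: inner_job, outer_job and is_daemonic are hypothetical names, and outer_job stands in for any library call that opens its own pool. It assumes CPython, where the check is an assert (so it is skipped under python -O).

```python
import multiprocessing as mp


def inner_job(x):
    return x * 2


def is_daemonic():
    """Return True when the current process is a daemon, as Pool workers are."""
    return mp.current_process().daemon


def outer_job(n):
    # Runs inside a worker of the outer pool and tries to open a second pool,
    # which is what a hidden pool inside a library call amounts to.
    try:
        with mp.Pool(2) as pool:
            return pool.map(inner_job, range(n))
    except AssertionError as exc:
        # CPython refuses to start children from a daemonic process.
        return f"daemonic={is_daemonic()}: {exc}"


if __name__ == "__main__":
    with mp.Pool(2) as pool:
        print(pool.map(outer_job, [3]))
```

Run as a script, the outer worker should report the same "daemonic processes are not allowed to have children" message that appears in the log above.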

For comparison, the torch.utils.data.DataLoader class solves this by allowing num_workers to be set to 0 and handling it as a special case:

def _get_iterator(self) -> '_BaseDataLoaderIter':
    if self.num_workers == 0:
        return _SingleProcessDataLoaderIter(self)
    else:
        self.check_worker_number_rationality()
        return _MultiProcessingDataLoaderIter(self)

I guess it would be harder to fix this on the pypdfium2 side than to switch back to the earlier pdf2image implementation. Is there a strong reason to prefer pypdfium2?
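For illustration, the same special case applied to page rendering might look like the sketch below. This is not pypdfium2's API: render_pages and render_one are hypothetical names, and render_one is assumed to be any picklable function that renders a single page by index.

```python
import multiprocessing as mp


def render_pages(render_one, page_indices, n_processes=0):
    """Yield render_one(i) for each index; n_processes=0 stays in-process."""
    if n_processes == 0:
        # Special case: no pool is created, so this path is safe to call
        # from inside a daemonic worker of an outer pool.
        for index in page_indices:
            yield render_one(index)
    else:
        with mp.Pool(n_processes) as pool:
            yield from pool.imap(render_one, page_indices)
```

With n_processes=0 the generator can be consumed from within an outer pool's worker, which is the case that currently fails.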

@lukas-blecher
Contributor

I switched to pypdfium2 because pdf2image depends on poppler.
I'll have a look.

@mara004

mara004 commented Oct 17, 2023

Hi, pypdfium2 maintainer here.

You can simply use the page-level rendering method, which does not use multiprocessing:

n_pages = len(pdf)
page = pdf[i]
image = page.render(...).to_...(...)

I regret to say that the document-level pdf.render() API was an inherent design mistake, since it implies transferring bitmaps across processes. Also, as you have noticed here, it is problematic for pypdfium2 to provide an API with a "hidden" process pool. pdf.render() is deprecated for these reasons; callers are instead encouraged to implement their own parallelization without bitmap transfer.
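One way a caller could do this is sketched below, assuming pypdfium2 4.x with Pillow installed. Each worker handles one whole PDF with the page-level API, writes PNG files, and returns only the file paths, so no bitmaps cross process boundaries. render_pdf, render_many, page_png_path, the file naming scheme and the default scale are illustrative assumptions, not part of pypdfium2 or nougat.

```python
import multiprocessing as mp
from functools import partial
from pathlib import Path


def page_png_path(out_dir, pdf_path, index):
    """Hypothetical naming scheme: <out_dir>/<pdf stem>_<page>.png."""
    return Path(out_dir) / f"{Path(pdf_path).stem}_{index:04d}.png"


def render_pdf(pdf_path, out_dir, scale=2.0):
    """Render every page in-process and return only the output paths."""
    import pypdfium2 as pdfium  # imported lazily, inside the worker

    pdf = pdfium.PdfDocument(str(pdf_path))
    paths = []
    for index in range(len(pdf)):
        page = pdf[index]
        # Page-level rendering: no process pool is created here.
        image = page.render(scale=scale).to_pil()
        target = page_png_path(out_dir, pdf_path, index)
        image.save(target)
        paths.append(str(target))
    return paths


def render_many(pdf_paths, out_dir, n_processes=2):
    """One pool over documents; workers send back short strings, not bitmaps."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    job = partial(render_pdf, out_dir=out_dir)
    with mp.Pool(n_processes) as pool:
        return pool.map(job, pdf_paths)
```

Because render_pdf never creates a pool of its own, it can also be called directly from a worker of an existing pool.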
