Pypdfium2 clashes with multiprocessing support #110

Vidminas · 2023-09-21T21:12:03Z

A regression was introduced in commit 9e2572b. Previously, with the PyMuPDF or pdf2image rasterizer implementations, it was possible to run nougat in a multiprocessing pool so that multiple PDFs could be parsed at the same time.

With pypdfium2 this is no longer possible. Running with multiprocessing results in errors like this:

ERROR:root:daemonic processes are not allowed to have children
ERROR:root:list index out of range
WARNING:root:Image not found
ERROR:root:list index out of range
WARNING:root:Image not found
ERROR:root:list index out of range
WARNING:root:Image not found

This happens because pypdfium2's Document.render method contains these lines:

with mp.Pool(n_processes, **pool_kwargs) as pool:
    yield from pool.imap(_parallel_renderer_job, page_indices)

and in Python it is not possible to nest multiprocessing pools (at least not with the built-in implementation): pool workers are daemonic processes, and daemonic processes are not allowed to have children. Although it is possible to set n_processes to 1 in Document.render, there is no option to avoid creating sub-processes altogether.

For comparison, the torch.utils.data.DataLoader class solves this by allowing num_workers to be set to 0 and handling it as a special case:

def _get_iterator(self) -> '_BaseDataLoaderIter':
    if self.num_workers == 0:
        return _SingleProcessDataLoaderIter(self)
    else:
        self.check_worker_number_rationality()
        return _MultiProcessingDataLoaderIter(self)

but I guess it might be more difficult to solve this from the pypdfium2 side than to switch back to the earlier pdf2image implementation, unless there is a good reason to use pypdfium2?
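
To illustrate the special case described above, here is a minimal sketch of a helper that skips the pool entirely when asked for zero workers, or when it detects that it is already running inside a daemonic pool worker. The names `in_daemonic_worker`, `map_maybe_parallel`, and `square` are hypothetical and are not part of pypdfium2 or nougat; this only shows the general pattern, not how either library implements it.

```python
import multiprocessing as mp


def in_daemonic_worker() -> bool:
    """Return True when the current process is daemonic (e.g. a Pool worker)."""
    return mp.current_process().daemon


def square(x):
    """Stand-in for a per-page job; must be a module-level function to be picklable."""
    return x * x


def map_maybe_parallel(func, items, n_processes=2):
    """Apply func to items, creating a pool only when that is allowed and requested.

    Hypothetical helper: n_processes == 0 means "no sub-processes at all",
    mirroring the num_workers == 0 convention of the PyTorch DataLoader.
    """
    if n_processes <= 0 or in_daemonic_worker():
        # Serial path: safe to call from inside another pool's worker.
        return [func(item) for item in items]
    with mp.Pool(n_processes) as pool:
        return pool.map(func, items)
```

Calling `map_maybe_parallel(square, [1, 2, 3], n_processes=0)` runs everything in the calling process, so it can be used from inside an outer pool without triggering the "daemonic processes are not allowed to have children" error.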


lukas-blecher · 2023-09-22T11:31:00Z

I switched to pypdfium2 because of the poppler dependency of pdf2image.
I'll have a look.

mara004 · 2023-10-17T11:26:22Z

Hi, pypdfium2 maintainer here.

You can simply use the page-level rendering method, which does not use multiprocessing:

n_pages = len(pdf)
page = pdf[i]
image = page.render(...).to_...(...)

I regret to say that the document-level pdf.render() API was an inherent design mistake, since it implies transferring bitmaps across processes. Also, as you have noticed here, pypdfium2 providing an API with a "hidden" process pool is kind of problematic. pdf.render() is deprecated for these reasons; however, callers are encouraged to implement their own parallelization without bitmap transfer.
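
As a rough sketch of what caller-side parallelization without bitmap transfer could look like: each worker opens one PDF, renders its pages serially with the page-level API, writes the images to disk, and returns only the file paths. This assumes pypdfium2 4.x (`PdfDocument`, `page.render(scale=...)`, `to_pil()`) with Pillow installed. The function names, the `scale` default, and the file naming scheme are hypothetical choices for illustration, not something prescribed by pypdfium2 or nougat.

```python
import multiprocessing as mp
from pathlib import Path


def page_output_path(out_dir, pdf_path, index):
    """Build the PNG path for one page (hypothetical naming scheme)."""
    return Path(out_dir) / f"{Path(pdf_path).stem}_{index:04d}.png"


def render_pdf_to_files(job):
    """Render all pages of one PDF serially; return file paths, not bitmaps."""
    pdf_path, out_dir, scale = job
    # Imported inside the worker so each process loads the library itself.
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(pdf_path)
    paths = []
    try:
        for index in range(len(pdf)):
            page = pdf[index]
            # Page-level rendering: no process pool is created here.
            image = page.render(scale=scale).to_pil()
            target = page_output_path(out_dir, pdf_path, index)
            image.save(target)
            paths.append(str(target))
            page.close()
    finally:
        pdf.close()
    return paths


def render_many(pdf_paths, out_dir, scale=2.0, n_processes=2):
    """Distribute whole PDFs over a single pool; only strings cross processes."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    jobs = [(str(path), str(out_dir), scale) for path in pdf_paths]
    with mp.Pool(n_processes) as pool:
        return pool.map(render_pdf_to_files, jobs)
```

Because the only pool is the outer one in `render_many`, nothing is nested, and the data sent back to the parent is a list of short path strings rather than pixel buffers. On platforms that use the spawn start method, `render_many` should be called from under an `if __name__ == "__main__":` guard.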

mara004 mentioned this issue Oct 17, 2023

predict.py Unhandled Exception #96

Closed

Vidminas mentioned this issue Nov 11, 2023

Render PDF pages individually #173

Open

This was referenced Nov 23, 2023

[m1 Pro] I get a warning about memory leaks and not sure how to procced. #162

Open

Library not available: "Cannot close object, library is destroyed..." pypdfium2-team/pypdfium2#281

Closed
