Pypdfium capability advice needed #331

PriyaranjanKS · 2024-12-03T04:34:32Z

Hi All,

I have been working on an invoice search application for internal use within my organization. There are 14 years of invoices, collated and spread over 9,000 documents. The total page count across the documents comes to 12 million. As the first step, I have to split the individual files, extract metadata, and store it in the database for the search application to fetch the details.

I had been using pdfplumber and PyPDF to do the splitting and the metadata extraction. However, the time taken per page is almost 5 seconds, which is too slow at this scale. Then I tried PyMuPDF, which performed the splitting and extraction in 0.2 seconds per page. That seemed really fast, but its licensing conditions seemed a bit vague. I am okay with procuring a commercial license, but the legal clauses are not well documented on their site.
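
To put those per-page timings in perspective, here is a rough back-of-the-envelope calculation. It assumes a single sequential process and that the quoted per-page times hold across the whole archive; `total_days` is just an illustrative helper name. Under those assumptions, 5 seconds per page works out to roughly 694 days and 0.2 seconds per page to roughly 28 days.

```python
# Rough estimate of total sequential runtime for the whole archive,
# using the per-page timings quoted in the post above.
TOTAL_PAGES = 12_000_000
SECONDS_PER_DAY = 24 * 60 * 60


def total_days(seconds_per_page: float, pages: int = TOTAL_PAGES) -> float:
    """Wall-clock days needed to process `pages` one after another."""
    return seconds_per_page * pages / SECONDS_PER_DAY


if __name__ == "__main__":
    for label, secs in [("about 5 s/page", 5.0), ("about 0.2 s/page", 0.2)]:
        print(f"{label}: roughly {total_days(secs):.0f} days")
```

Running several worker processes in parallel would divide these figures accordingly, provided disk and database throughput keep up.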

Hence I am looking to see whether pypdfium2 can perform operations such as splitting a PDF document and extracting content using regular expressions and bounding boxes on a massive scale (12 million pages), at a speed comparable to PyMuPDF. I did see the benchmarks here, where it seems to have fared well.

Looking forward to your suggestions.

Answered by mara004

mara004 · 2024-12-03T15:16:10Z

mara004
Dec 3, 2024
Maintainer

Just try it and see how it performs. I never tested pypdfium2 on such a large scale.
Text extraction performance should be OK, according to the py-pdf benchmark.
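
One way to "just try it" is to time the real per-document step on a small sample before committing to a library. The following is a minimal sketch of such a harness, not a rigorous benchmark; `time_per_item`, `summarize`, and `fake_process` are hypothetical names, and the stand-in workload should be swapped for the actual split-and-extract function.

```python
import statistics
import time
from typing import Callable, Dict, Iterable, List


def time_per_item(process: Callable[[object], object],
                  items: Iterable[object]) -> List[float]:
    """Return the wall-clock seconds `process` took for each item."""
    timings = []
    for item in items:
        start = time.perf_counter()
        process(item)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings: List[float]) -> Dict[str, float]:
    """Median and worst-case seconds per item, plus items per second."""
    total = sum(timings)
    return {
        "median_s": statistics.median(timings),
        "max_s": max(timings),
        "items_per_s": len(timings) / total if total > 0 else float("inf"),
    }


if __name__ == "__main__":
    # Stand-in workload (hypothetical): replace with a function that opens
    # one sample PDF and runs the real split-and-extract step on it.
    def fake_process(path):
        time.sleep(0.001)

    print(summarize(time_per_item(fake_process, ["a.pdf", "b.pdf", "c.pdf"])))
```

Sampling documents from different years is probably worthwhile, since scanned or unusually large files can skew the worst-case figure.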

If you have any concrete usage questions, feel free to ask.
However, note that (unlike pymupdf) pypdfium2 itself does not provide layout analysis.
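
As a starting point for concrete usage, here is a minimal sketch of the two operations asked about: copying selected pages into a new PDF, and extracting text inside a bounding box for regular-expression matching. It assumes the pypdfium2 v4 helper API (`PdfDocument`, `import_pages`, `get_textpage`, `get_text_bounded`). The function names, the regular expression, and any coordinates or file names are hypothetical and would need adapting to the real invoices. Box coordinates are in PDF canvas units with the origin at the bottom left.

```python
import re
from typing import List, Optional

# Hypothetical pattern; real invoices will need their own expression.
INVOICE_NO = re.compile(r"Invoice\s*(?:No\.?|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE)


def find_invoice_number(text: str) -> Optional[str]:
    """Return the first invoice number found in `text`, or None."""
    match = INVOICE_NO.search(text)
    return match.group(1) if match else None


def split_pages(src_path: str, dest_path: str, page_indices: List[int]) -> None:
    """Copy the given zero-based pages of src_path into a new PDF at dest_path."""
    import pypdfium2 as pdfium  # lazy import: the regex helper works without it

    src = pdfium.PdfDocument(src_path)
    dest = pdfium.PdfDocument.new()
    try:
        dest.import_pages(src, pages=page_indices)
        dest.save(dest_path)
    finally:
        dest.close()
        src.close()


def text_in_box(src_path: str, page_index: int,
                left: float, bottom: float, right: float, top: float) -> str:
    """Extract the text inside one rectangle of one page."""
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(src_path)
    try:
        page = pdf[page_index]
        textpage = page.get_textpage()
        try:
            return textpage.get_text_bounded(
                left=left, bottom=bottom, right=right, top=top
            )
        finally:
            # Close child objects before the parent document.
            textpage.close()
            page.close()
    finally:
        pdf.close()
```

A call might look like `find_invoice_number(text_in_box("batch.pdf", 0, 300, 700, 600, 800))`, with the file name and box chosen purely for illustration. Since there is no layout analysis, the box has to be known in advance, for example from a fixed invoice template.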

IANAL and this is not legal advice, but for what it's worth I think you may be able to use pymupdf within your organization under the AGPL (i.e. without purchasing an Artifex license) as long as you don't distribute your project as a closed-source application or web service. Internal use (regardless of whether it is commercial or not) should be fine. Anyone (whether company or end user) is free to use an (A)GPL-licensed project for anything they like; only distribution of the software you build on top of it is subject to requirements.

I.e. the "commercial" licensing option is meant for developers who distribute/sell a closed-source application with pymupdf. As for merely using pymupdf in a commercial setting (e.g. a document processing pipeline), the free AGPL licensing option should be OK.
