Pypdfium capability advice needed #331
Hi all, I have been working on an invoice search application for internal use within my organization. There are 14 years of invoices, collated and spread over 9,000 documents. The total page count across the documents comes to 12 million. As a first step, I have to split the individual files, extract metadata, and store it in a database for the search application to query. I had been using pdfplumber and PyPDF for the splitting and metadata extraction. However, the time taken per page is almost 5 seconds, which is too slow at this volume. Then I tried PyMuPDF, which performed the splitting and extraction in 0.2 seconds per page. That seemed really fast, but the licensing conditions seemed a bit vague. I am okay with procuring their commercial license, but the legal clauses are not well documented on their site. Hence I am looking to see whether pypdfium2 can perform these operations at massive scale (12 million pages), i.e. splitting PDF documents and extracting content using regular expressions and bounding boxes, at a speed comparable to PyMuPDF. I did see the benchmarks here, where it seems to have fared well. Looking forward to your suggestions.
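For scale, the per-page timings quoted above can be turned into a rough total. This is only a back-of-the-envelope sketch: it assumes a single process and a constant per-page cost across all 12 million pages, neither of which is confirmed here.

```python
TOTAL_PAGES = 12_000_000   # page count quoted in the post
SECONDS_PER_DAY = 86_400


def total_days(seconds_per_page, workers=1):
    """Estimated wall-clock days, assuming perfect scaling across workers."""
    return TOTAL_PAGES * seconds_per_page / workers / SECONDS_PER_DAY


if __name__ == "__main__":
    # 5 s/page and 0.2 s/page are the two timings mentioned in the post.
    for label, secs in [("5 s/page", 5.0), ("0.2 s/page", 0.2)]:
        print(f"{label}: {total_days(secs):.1f} days on one process")
```

At 5 seconds per page this comes to nearly two years of single-process runtime, versus roughly four weeks at 0.2 seconds per page, which is why the per-page speed matters so much here.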
Replies: 1 comment
Just try it and see how it performs. I never tested pypdfium2 on such a large scale. Text extraction performance should be OK, according to the py-pdf benchmark. However, note that (unlike pymupdf) pypdfium2 itself does not provide layout analysis. If you have any concrete usage questions, feel free to ask.

IANAL and this is not legal advice, but for what it's worth I think you may be able to use pymupdf within your organization under the AGPL (i.e. without purchasing an Artifex license) as long as you don't distribute your project as a closed-source application or web service. Internal use (regardless of whether it is commercial or not) should be fine. Anyone (whether company or end user) is free to use an (A)GPL-licensed project for anything they like; only distribution of the software you build on top of it is covered by requirements. I.e. the "commercial" licensing option is meant for developers who distribute/sell a closed-source application with pymupdf. As for merely using pymupdf in a commercial setting (e.g. a document processing pipeline), the free AGPL licensing option should be OK.
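As a starting point for trying it out, here is a minimal sketch of splitting plus regex/bounding-box extraction. It assumes the pypdfium2 v4 helper API (`PdfDocument`, `import_pages`, `get_textpage`, `get_text_bounded`). The invoice-number pattern, the one-page-per-file split, and the names `find_invoice_number` and `split_and_extract` are hypothetical and only for illustration; it has not been benchmarked at this scale.

```python
import re
from pathlib import Path

# Hypothetical pattern; real invoices will need their own regex.
INVOICE_RE = re.compile(
    r"Invoice\s*(?:No\.?|Number|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE
)


def find_invoice_number(text):
    """Return the first invoice number found in text, or None."""
    match = INVOICE_RE.search(text)
    return match.group(1) if match else None


def split_and_extract(src_path, out_dir, bbox=None):
    """Save each page as its own PDF and yield (page_index, invoice_number).

    bbox is (left, bottom, right, top) in PDF points, origin at bottom-left.
    """
    import pypdfium2 as pdfium  # lazy import: the regex helper works without it

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    pdf = pdfium.PdfDocument(src_path)
    try:
        for i in range(len(pdf)):
            page = pdf[i]
            textpage = page.get_textpage()
            if bbox is None:
                text = textpage.get_text_range()  # whole page
            else:
                left, bottom, right, top = bbox
                text = textpage.get_text_bounded(
                    left=left, bottom=bottom, right=right, top=top
                )
            textpage.close()
            page.close()

            # Copy the single page into a fresh document and save it.
            single = pdfium.PdfDocument.new()
            single.import_pages(pdf, pages=[i])
            single.save(str(out_dir / f"page_{i + 1:06d}.pdf"))
            single.close()

            yield i, find_invoice_number(text)
    finally:
        pdf.close()
```

Timing this on a few representative documents (e.g. with `time.perf_counter()` around the loop) should show fairly quickly whether the per-page cost is in the same range as what you measured with pymupdf. Since the documents are independent, they can likely be distributed across worker processes as well.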