Images not extracted #1368
Comments
The missing ones might be inline images:
The operators are:
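For context: in the PDF format, inline images are embedded directly in a page's content stream between the `BI` (begin image), `ID` (image data) and `EI` (end image) operators, instead of being stored as separate XObjects. Below is a rough sketch for spotting them. It assumes the content stream has already been decompressed into bytes. `find_inline_images` is a hypothetical helper, not part of PyPDF2, and the regex is only a heuristic, since raw image data can itself contain the bytes `EI`.

```python
import re

# Heuristic pattern (an assumption, not a full PDF tokenizer):
#   BI <key/value pairs> ID <raw image data> EI
# Each operator must be delimited by whitespace or the stream boundary.
_INLINE_IMAGE = re.compile(
    rb"(?<![^\s])BI\s(.*?)\sID\s(.*?)\sEI(?![^\s])",
    re.DOTALL,
)


def find_inline_images(content: bytes):
    """Return a list of (dictionary_bytes, data_bytes) tuples,
    one per inline image found in a decoded content stream."""
    return [(m.group(1), m.group(2)) for m in _INLINE_IMAGE.finditer(content)]


# A tiny hand-written content stream with one 1x1 grayscale inline image.
sample = b"q 1 0 0 1 0 0 cm BI /W 1 /H 1 /CS /G /BPC 8 ID \x00 EI Q"
found = find_inline_images(sample)
```

A count of zero XObject images together with a non-zero count here would be consistent with the "inline images" explanation, but a proper check would need a real content stream parser.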
Examples: There are a lot of images with width=1 and height=1:
And some with just small dimensions:
https://arxiv.org/pdf/2201.00214.pdf contains even more missing images.
PyMuPDF extracted a couple of images for https://arxiv.org/pdf/2201.00029.pdf whereas PyPDF2 didn't extract any. Here is how to extract images with PyMuPDF: https://github.com/py-pdf/benchmarks/blob/main/benchmark.py#L185-L198
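For readers who don't want to follow the link, here is a minimal sketch of image extraction with PyMuPDF. It is not necessarily what the linked benchmark does. `extract_images_pymupdf` and `image_filename` are hypothetical names, and PyMuPDF (`fitz`) is imported lazily so the helpers can be defined without it installed.

```python
from pathlib import Path


def image_filename(page_index: int, xref: int, ext: str) -> str:
    """Build an output file name (hypothetical naming scheme)."""
    return f"page{page_index}_xref{xref}.{ext}"


def extract_images_pymupdf(pdf_path: str, out_dir: str) -> list:
    """Write every image PyMuPDF lists on each page to out_dir
    and return the paths that were written."""
    import fitz  # PyMuPDF; assumed to be installed separately

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with fitz.open(pdf_path) as doc:
        for page_index, page in enumerate(doc):
            # Each entry is a tuple whose first item is the image's xref.
            for img in page.get_images(full=True):
                xref = img[0]
                info = doc.extract_image(xref)  # dict with "image" and "ext"
                target = out / image_filename(page_index, xref, info["ext"])
                target.write_bytes(info["image"])
                written.append(target)
    return written
```

Usage would look like `extract_images_pymupdf("2201.00029.pdf", "images")`. An image referenced from several pages is written once per page.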
Not sure if this helps, but I thought I'd bisect this to see if it's a regression. My test script is based on a snippet from above, but goes through all pages:

    from PyPDF2 import PdfReader
    import sys

    reader = PdfReader("2201.00151.pdf")

    # Collect the images from every page, not just the first one.
    images = []
    for page in reader.pages:
        images.extend(page.images)

    # A non-zero exit code marks the commit as bad for the bisect.
    if not images:
        print("bad")
        sys.exit(1)
    else:
        print("good")

I went as far back as 85b3e87 where
Oh, and I seem to have missed the docstring saying this doesn't work with inline images: https://github.com/py-pdf/PyPDF2/blob/0b2b3ec997b81d6bb22c5c834ef5ba9e4f2330e7/PyPDF2/_page.py#L378
When trying to extract the images, PyPDF2 didn't capture them.
Environment
$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.11.0
Code + PDF
Use this PDF: https://arxiv.org/pdf/2201.00151.pdf
Expected were two images.