Images not extracted #1368

Closed
MartinThoma opened this issue Sep 26, 2022 · 7 comments · Fixed by #1834 or #1850
Labels
is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF

Comments

@MartinThoma
Member

When trying to extract the images, PyPDF2 didn't capture them.

Environment

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.11.0

Code + PDF

Use this PDF: https://arxiv.org/pdf/2201.00151.pdf

>>> from PyPDF2 import PdfReader
>>> reader = PdfReader("2201.00151.pdf")
>>> reader.pages[1].images
[]

Two images were expected on that page.

@MartinThoma
Member Author

The missing ones might be inline images:

An inline image object uses a special syntax to express the data for a small image directly within the content stream.

The operators are:

  • BI: Begin an inline image object.
  • ID: Begin the image data for an inline image object.
  • EI: End an inline image object.
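As a rough illustration of how these three operators delimit an image, the sketch below scans a raw content stream for `BI ... ID ... EI` spans. The function name `find_inline_images` is hypothetical and this is not how PyPDF2 parses content streams; a real parser has to tokenize the stream, because the bytes `EI` can also occur inside the binary image data.

```python
import re

# Naive pattern: header between BI and ID, data between ID and EI.
# Illustration only; binary data containing " EI" would end the match early.
INLINE_IMAGE = re.compile(rb"\bBI\s(.*?)\sID\s(.*?)\sEI\b", re.DOTALL)


def find_inline_images(content: bytes):
    """Return a list of (header_bytes, data_bytes), one per inline image found."""
    return [
        (match.group(1).strip(), match.group(2))
        for match in INLINE_IMAGE.finditer(content)
    ]


# A stream shaped like the 1x1 image masks seen in this PDF.
sample = (
    b"q 4.8 0 0 -279.6 368.28 8056.48 cm\n"
    b"BI\n/IM true\n/W 1\n/H 1\n/BPC 1\nID \x80\nEI Q\n"
)
for header, data in find_inline_images(sample):
    print(header, len(data))
```

Images stored this way never appear in the page's `/XObject` resources, which would explain why a lookup limited to XObjects returns an empty list.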

@MartinThoma
Member Author

Examples:

There are a lot of images with width=1 and height=1:

BI
/IM true
/W 1
/H 1
/BPC 1
ID <binary image data omitted>
EI Q
q 4.8 0 0 -279.6 368.28 8056.48 cm
BI
/IM true
/W 1
/H 1
/BPC 1
ID <binary image data omitted>

And some with just small dimensions:

BI
/CS/R124
/W 40
/H 40
/BPC 8
/F/Fl
ID <Flate-compressed binary image data omitted>
EI
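The headers in these examples use the abbreviated keys that the PDF specification allows inside inline images (`/W`, `/H`, `/BPC`, `/CS`, `/F`, `/IM`). A minimal sketch of expanding them into the full names used by regular image XObjects follows. `parse_inline_header` is a hypothetical helper that only handles simple name, number, and boolean values (no arrays or nested dictionaries); it is not PyPDF2 code.

```python
import re

# Abbreviations permitted in inline image dictionaries (PDF 32000-1, section 8.9.7).
KEY_ABBREVIATIONS = {
    "BPC": "BitsPerComponent",
    "CS": "ColorSpace",
    "D": "Decode",
    "DP": "DecodeParms",
    "F": "Filter",
    "H": "Height",
    "IM": "ImageMask",
    "I": "Interpolate",
    "W": "Width",
}
FILTER_ABBREVIATIONS = {
    "AHx": "ASCIIHexDecode",
    "A85": "ASCII85Decode",
    "LZW": "LZWDecode",
    "Fl": "FlateDecode",
    "RL": "RunLengthDecode",
    "CCF": "CCITTFaxDecode",
    "DCT": "DCTDecode",
}
# A token is either a /Name or a bare word such as a number or boolean.
TOKEN = re.compile(r"/[^\s/]+|[^\s/]+")


def parse_inline_header(header: str) -> dict:
    """Map abbreviated key/value pairs to their long names (scalar values only)."""
    tokens = TOKEN.findall(header)
    result = {}
    for key, value in zip(tokens[0::2], tokens[1::2]):
        name = KEY_ABBREVIATIONS.get(key.lstrip("/"), key.lstrip("/"))
        if value.startswith("/"):
            value = value[1:]
            if name == "Filter":
                value = FILTER_ABBREVIATIONS.get(value, value)
        elif value in ("true", "false"):
            value = value == "true"
        else:
            try:
                value = int(value)
            except ValueError:
                pass  # leave anything unexpected as the raw token
        result[name] = value
    return result


print(parse_inline_header("/CS/R124\n/W 40\n/H 40\n/BPC 8\n/F/Fl"))
```

For the 40x40 example, `/CS/R124` refers to a named color space in the page resources, so the sketch keeps `R124` as-is rather than trying to resolve it.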

@MartinThoma
Member Author

https://arxiv.org/pdf/2201.00214.pdf contains even more missing images

@MartinThoma MartinThoma added the is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF label Sep 26, 2022
@MartinThoma
Member Author

MartinThoma commented Sep 27, 2022

PyMuPDF extracted a couple of images for https://arxiv.org/pdf/2201.00029.pdf whereas PyPDF2 didn't extract any.

Here is how to extract images with PyMuPDF: https://github.com/py-pdf/benchmarks/blob/main/benchmark.py#L185-L198

@stchris

stchris commented Nov 9, 2022

Not sure if this helps, but I thought I'd bisect this to see if it's a regression. My test script is based on a snippet from above, but goes through all pages:

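import sys

from PyPDF2 import PdfReader

reader = PdfReader("2201.00151.pdf")

# Collect the images from every page, not just the second one.
images = []
for page in reader.pages:
    images.extend(page.images)

# Exit non-zero when nothing was extracted so a bisect can mark the commit as bad.
if not images:
    print("bad")
    sys.exit(1)
print("good")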

I went as far back as 85b3e87 where PageObject.images was implemented and it seems to be consistently faulty.

@stchris

stchris commented Nov 9, 2022

Oh and I seem to have missed the docstring saying this doesn't work with inline images: https://github.com/py-pdf/PyPDF2/blob/0b2b3ec997b81d6bb22c5c834ef5ba9e4f2330e7/PyPDF2/_page.py#L378

@pubpub-zz
Collaborator

2nd page: [attached image]
4th image: [attached image: inline4]

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue May 18, 2023
pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue May 20, 2023
MartinThoma pushed a commit that referenced this issue Jun 13, 2023
MartinThoma pushed a commit that referenced this issue Jun 18, 2023
Take the number of colors into account for PNG images

Properly process the mask for transparency:

Fixes #1787
Fixes #1599

Adds support for inline image extraction:

Fixes #1368
Fixes #1863

Additional changes:

* Process TIFF predictor 2
* Upgrades Pillow requirement to version 9.5 for Python 3.11
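
For readers unfamiliar with the "TIFF predictor 2" item above: with this predictor each sample is stored as the difference from the sample to its left in the same color component, so decoding is a running sum along each row. The following is a minimal sketch assuming 8 bits per component; `undo_tiff_predictor_2` is a hypothetical name and this is not the code from the referenced commit.

```python
def undo_tiff_predictor_2(data: bytes, columns: int, colors: int = 1) -> bytes:
    """Reverse horizontal differencing for rows of 8-bit samples."""
    row_length = columns * colors
    out = bytearray(data)
    for row_start in range(0, len(out), row_length):
        row_end = min(row_start + row_length, len(out))
        # The first pixel of each row is stored as-is; every later sample
        # adds the already-decoded sample one pixel to its left.
        for i in range(row_start + colors, row_end):
            out[i] = (out[i] + out[i - colors]) & 0xFF
    return bytes(out)


# One grayscale row: the stored differences 10, +2, +3, +0 decode to 10, 12, 15, 15.
print(list(undo_tiff_predictor_2(bytes([10, 2, 3, 0]), columns=4)))
```

Other bit depths need the samples unpacked before the sum is applied, which this sketch leaves out.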