Images not extracted #1368

MartinThoma · 2022-09-26T14:53:29Z

When trying to extract the images, PyPDF2 didn't capture them.

Environment

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.11.0

Code + PDF

Use this PDF: https://arxiv.org/pdf/2201.00151.pdf

>>> from PyPDF2 import PdfReader
>>> reader = PdfReader("2201.00151.pdf")
>>> reader.pages[1].images
[]

Two images were expected.

MartinThoma · 2022-09-26T15:04:49Z

The missing ones might be inline images:

An inline image object uses a special syntax to express the data for a small image directly within the content stream.

The operators are:

BI: Begin an inline image object.
ID: Begin the image data for an inline image object.
EI: End an inline image object.
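
To make the three operators concrete, here is a minimal sketch of how a content stream could be scanned for inline images. It is not how PyPDF2 does it; `find_inline_images` is a hypothetical helper, and the scan is deliberately naive: it assumes the parameters are simple `/Key value` pairs (no arrays or dictionaries) and that the byte sequence whitespace + `EI` + whitespace never occurs inside the image data, which real binary data can violate.

```python
import re

# Naive pattern: BI <parameters> ID <one whitespace byte> <data> <whitespace> EI
_INLINE_IMAGE = re.compile(
    rb"(?<!\S)BI\s+(.*?)\s+ID\s(.*?)\sEI(?!\S)", re.DOTALL
)
# A token is either a /Name or a bare value; names may follow each other
# without whitespace, as in "/CS/R124".
_TOKEN = re.compile(rb"/[^\s/]+|[^\s/]+")


def find_inline_images(content: bytes):
    """Return a list of (parameters, raw_data) pairs, one per inline image."""
    results = []
    for match in _INLINE_IMAGE.finditer(content):
        tokens = _TOKEN.findall(match.group(1))
        # Tokens alternate between key and value: /W, 1, /H, 1, ...
        params = dict(zip(tokens[0::2], tokens[1::2]))
        results.append((params, match.group(2)))
    return results


if __name__ == "__main__":
    stream = b"q\nBI\n/IM true\n/W 1\n/H 1\n/BPC 1\nID \x80\nEI Q\n"
    for params, data in find_inline_images(stream):
        print(params, len(data))
```

A robust parser would instead tokenize the content stream and use the image parameters (and any filter) to decide where the data ends, rather than searching for `EI`.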

MartinThoma · 2022-09-26T15:17:14Z

Examples:

There are a lot of images with width=1 and height=1:

BI
/IM true
/W 1
/H 1
/BPC 1
ID �
EI Q
q 4.8 0 0 -279.6 368.28 8056.48 cm
BI
/IM true
/W 1
/H 1
/BPC 1
ID �

And some with just small dimensions:

BI
/CS/R124
/W 40
/H 40
/BPC 8
/F/Fl
ID <binary Flate-compressed image data omitted>
EI
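
In the second example, /F/Fl is the abbreviated form of the FlateDecode filter, so the bytes between ID and EI are zlib-compressed samples. A rough sketch of decoding such data with the standard library follows. `decode_flate_inline_image` is a hypothetical helper, not part of PyPDF2, and it assumes that no predictor is set and that the caller already knows the number of colour components (for a named colour space such as /R124 that has to be looked up in the page resources).

```python
import zlib


def decode_flate_inline_image(data, width, height, bits_per_component, components):
    """Inflate /Fl data and split it into one bytes object per image row."""
    raw = zlib.decompress(data)
    # Each row of samples is padded to a whole number of bytes.
    row_bytes = (width * components * bits_per_component + 7) // 8
    if len(raw) < row_bytes * height:
        raise ValueError("decoded data is shorter than /W, /H and /BPC imply")
    return [raw[i * row_bytes:(i + 1) * row_bytes] for i in range(height)]


if __name__ == "__main__":
    # Stand-in for real image data: a 3x2 grayscale image, 8 bits per sample.
    sample = zlib.compress(bytes(range(6)))
    for row in decode_flate_inline_image(sample, 3, 2, 8, 1):
        print(row)
```

The length check is a cheap sanity test: if the decoded size does not match /W, /H and /BPC, the component count or the end of the data was probably guessed wrong.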

MartinThoma · 2022-09-26T15:33:24Z

https://arxiv.org/pdf/2201.00214.pdf contains even more missing images

MartinThoma · 2022-09-27T16:18:06Z

PyMuPDF extracted a couple of images for https://arxiv.org/pdf/2201.00029.pdf whereas PyPDF2 didn't extract any.

Here is how to extract images with PyMuPDF: https://github.com/py-pdf/benchmarks/blob/main/benchmark.py#L185-L198

stchris · 2022-11-09T08:33:18Z

Not sure if this helps, but I thought I'd bisect this to see if it's a regression. My test script is based on a snippet from above, but goes through all pages:

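import sys

from PyPDF2 import PdfReader

# Collect the images from every page, not just the second one.
reader = PdfReader("2201.00151.pdf")
images = []
for page in reader.pages:
    images.extend(page.images)

# Exit with a non-zero status when nothing was extracted,
# so the revision under test counts as bad.
if not images:
    print("bad")
    sys.exit(1)
print("good")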

I went as far back as 85b3e87, where PageObject.images was implemented, and the result is consistently faulty, so this does not look like a regression.

stchris · 2022-11-09T08:36:39Z

Oh and I seem to have missed the docstring saying this doesn't work with inline images: https://github.com/py-pdf/PyPDF2/blob/0b2b3ec997b81d6bb22c5c834ef5ba9e4f2330e7/PyPDF2/_page.py#L378

pubpub-zz · 2023-05-18T20:40:38Z

2nd page, 4th image: [image not preserved]


Take the number of colors into account for PNG images.
Properly process the mask for transparency: Fixes #1787, Fixes #1599.
Adds support for inline images extraction: Fixes #1368, Fixes #1863.
Additional changes:
* Process TIFF predictor 2
* Upgrades Pillow requirement to version 9.5 for Python 3.11

MartinThoma added the is-bug label (from a user's perspective, this is a bug: a violation of the expected behavior with a compliant PDF) Sep 26, 2022

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue May 18, 2023

extract Inline Images

2009a07

closes py-pdf#1368

pubpub-zz mentioned this issue May 18, 2023

BUG: Fix RGB FlateEncode Images(PNG) and transparency #1834

Merged

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue May 20, 2023

Add Inline Image extraction

5fd8135

closes py-pdf#1368

pubpub-zz mentioned this issue May 20, 2023

ENH: Extraction of inline images #1850

Merged

MartinThoma closed this as completed in #1850 Jun 13, 2023

MartinThoma pushed a commit that referenced this issue Jun 13, 2023

ENH: Extraction of inline images (#1850)

2eee5ed

Closes #1368
