BUG: Fix RGB FlateEncode Images(PNG) and transparency #1834

Merged
merged 40 commits into from
Jun 18, 2023


pubpub-zz
Collaborator

@pubpub-zz pubpub-zz commented May 6, 2023

Take the number of colors into account for PNG images
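To illustrate why the number of colors matters: when a FlateDecode stream uses a PNG predictor, the decompressed data is a sequence of rows, each prefixed by one filter-type byte, and the row length depends on `/Columns`, `/Colors` and `/BitsPerComponent`. The sketch below is not pypdf's code; `split_png_rows` is a hypothetical helper that only shows how a wrong `colors` value misplaces the row boundaries.

```python
import math


def split_png_rows(data: bytes, columns: int, colors: int,
                   bits_per_component: int = 8):
    """Split PNG-predicted data into (filter_type, row_bytes) pairs.

    Hypothetical helper for illustration only.
    """
    # Bytes of sample data per row, rounded up to a whole byte.
    row_length = math.ceil(columns * colors * bits_per_component / 8)
    stride = row_length + 1  # one extra byte per row for the filter type
    if len(data) % stride != 0:
        raise ValueError("data length does not match the row geometry")
    return [
        (data[i], data[i + 1:i + stride])
        for i in range(0, len(data), stride)
    ]


# Two rows of a 2-pixel-wide RGB image, both with filter type 0 (None).
data = bytes([0, 1, 2, 3, 4, 5, 6,
              0, 7, 8, 9, 10, 11, 12])
rows = split_png_rows(data, columns=2, colors=3)
```

With `colors=3` the data splits into two 6-byte rows. Assuming a single color component instead gives a 3-byte stride that no longer divides the data, so the helper raises `ValueError`.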

Properly process the mask for transparency:

Fixes #1787
Fixes #1599
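As a rough picture of what turning a mask into transparency means at the byte level, the sketch below interleaves an 8-bit grayscale mask with 8-bit RGB samples to produce RGBA. It is only an illustration under those assumptions: `apply_soft_mask` and its `invert` flag (for a mask stored with a reversed scale) are hypothetical names, not part of pypdf.

```python
def apply_soft_mask(rgb: bytes, mask: bytes, invert: bool = False) -> bytes:
    """Combine RGB samples and a same-size grayscale mask into RGBA bytes.

    Hypothetical helper: assumes 8 bits per sample and identical dimensions.
    """
    if len(rgb) != 3 * len(mask):
        raise ValueError("mask must have exactly one byte per RGB pixel")
    alpha = bytes(255 - a for a in mask) if invert else mask
    out = bytearray(4 * len(mask))
    out[0::4] = rgb[0::3]  # red
    out[1::4] = rgb[1::3]  # green
    out[2::4] = rgb[2::3]  # blue
    out[3::4] = alpha      # mask value becomes the alpha channel
    return bytes(out)


# One fully transparent red pixel and one half-transparent green pixel.
rgba = apply_soft_mask(bytes([255, 0, 0, 0, 255, 0]), bytes([0, 128]))
```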

Adds support for inline images extraction:

Fixes #1368
Fixes #1863

Additional changes:

  • Process TIFF predictor 2
  • Upgrades Pillow requirement to version 9.5 for Python 3.11
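For readers unfamiliar with the TIFF predictor 2 mentioned in the list above: it is horizontal differencing, where each sample is stored as the difference from the same component of the previous pixel in the row. The following is a minimal sketch of undoing it, assuming 8-bit samples in chunky (interleaved) layout; `undo_tiff_predictor2` is a hypothetical name and this is not the code merged in this PR.

```python
def undo_tiff_predictor2(data: bytes, columns: int, colors: int) -> bytes:
    """Reverse TIFF horizontal differencing for 8-bit samples.

    Hypothetical helper for illustration only.
    """
    row_length = columns * colors
    if row_length == 0 or len(data) % row_length != 0:
        raise ValueError("data length does not match the row geometry")
    out = bytearray(data)
    for row_start in range(0, len(out), row_length):
        # The first pixel of each row is stored as-is; every later sample
        # is added to the same component of the pixel to its left.
        for i in range(row_start + colors, row_start + row_length):
            out[i] = (out[i] + out[i - colors]) & 0xFF
    return bytes(out)


# One row, two RGB pixels: (10, 20, 30) followed by differences (+2, -2, +5).
decoded = undo_tiff_predictor2(bytes([10, 20, 30, 2, 254, 5]),
                               columns=2, colors=3)
```

The difference `-2` is stored as `254`, which is why the addition is taken modulo 256.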

The number of colors was not taken into account when processing PNG images.

Also properly process the mask for transparency.

closes py-pdf#1787
@codecov

codecov bot commented May 6, 2023

Codecov Report

❗ No coverage uploaded for pull request base (main@2eee5ed). Click here to learn what that means.
Patch coverage: 88.69% of modified lines in pull request are covered.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1834   +/-   ##
=======================================
  Coverage        ?   93.79%           
=======================================
  Files           ?       34           
  Lines           ?     6919           
  Branches        ?     1364           
=======================================
  Hits            ?     6490           
  Misses          ?      280           
  Partials        ?      149           
Impacted Files Coverage Δ
pypdf/filters.py 93.86% <87.73%> (ø)
pypdf/_page.py 92.83% <100.00%> (ø)
pypdf/_utils.py 100.00% <100.00%> (ø)
pypdf/constants.py 100.00% <100.00%> (ø)


@pubpub-zz
Collaborator Author

pubpub-zz commented May 6, 2023

Test images:
from resources/labeled-edges-center-image.pdf (PNG + transparency with reversed scale) (!! updated)
labeled2

from pdf_font_garbled.pdf:
2nd page watermark (RGB+mask PNG):
i0

Image45, part of the resources of TPL2 on the 2nd page (JP2 file, zipped so it can be stored) (JPG + transparency => JP2)
p65.zip

@pubpub-zz
Collaborator Author

TIFF predictor 2:
from https://corpora.tika.apache.org/base/docs/govdocs1/977/977609.pdf
image :
pdftiffimage

@MartinThoma MartinThoma added the is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF label May 7, 2023
@MartinThoma MartinThoma changed the title BUG : fix RGB FlateEncode Images(PNG) and transparency BUG: Fix RGB FlateEncode Images(PNG) and transparency May 7, 2023
@pubpub-zz
Collaborator Author

images possible updates
i0
Im0

@MartinThoma
Member

@pubpub-zz This PR introduces two changes: (1) The bugfix and (2) the new interface.

Could you please make a second PR just with the bugfix? I feel like that one can be merged today/tomorrow, but I want to have another look at the new interface.

@pubpub-zz
Collaborator Author

> @pubpub-zz This PR introduces two changes: (1) The bugfix and (2) the new interface.
>
> Could you please make a second PR just with the bugfix? I feel like that one can be merged today/tomorrow, but I want to have another look at the new interface.

For tracking, I've split out the new images interface and two follow-up PRs: inline_images and a new function to replace images. However, I'm not inclined to handle only the bug fixes separately: I worked on all the changes at the same time, and I would not like to introduce regressions because a modification was not carried over correctly.

@MartinThoma MartinThoma added the soon PRs that are almost ready to be merged, issues that get solved pretty soon label Jun 14, 2023
@MartinThoma
Member

@pubpub-zz I messed something up when I fixed the merge conflicts 🙈

__________________________________ test_cmyk ___________________________________

    @pytest.mark.enable_socket()
    def test_cmyk():
        """Decode cmyk with transparency"""
        url = "https://corpora.tika.apache.org/base/docs/govdocs1/972/972174.pdf"
        name = "tika-972174.pdf"
        reader = PdfReader(BytesIO(get_pdf_from_url(url, name=name)))
        url_png = "https://user-images.githubusercontent.com/4083478/238288207-b77dd38c-34b4-4f4f-810a-bf9db7ca0414.png"
        name_png = "tika-972174_p0-im0.png"
        refimg = Image.open(
            BytesIO(get_pdf_from_url(url_png, name=name_png))
        )  # not a pdf but it works
        data = reader.pages[0].images[0]
        assert ".jp2" in data.name
>       assert list(data.image.getdata()) == list(refimg.getdata())
E       assert [(255, 255, 255, 0),\n (255, 255, 255, 0),\n (255, 255, 255,

@MartinThoma
Member

@pubpub-zz Can you please help me? I have no idea where to look 🙈

@pubpub-zz
Collaborator Author

Under analysis.

@pubpub-zz
Collaborator Author

Should be fixed. I had to upgrade Pillow to 9.5 for Python 3.11.

@MartinThoma MartinThoma merged commit 68e2cf0 into py-pdf:main Jun 18, 2023
@MartinThoma
Member

Thank you 🙏

I'm quickly checking if any of the other PRs is ready to merge and will make a release in less than an hour. This time we have awesome improvements for image handling thanks to your efforts. That was really good work 👍 🤗

MartinThoma added a commit that referenced this pull request Jun 18, 2023
New Features (ENH):
-  Extraction of inline images (#1850)
-  Add capability to replace image (#1849)
-  Extend images interface by returning an ImageFile(File) class (#1848)
-  Add set_data to EncodedStreamObject (#1854)

Bug Fixes (BUG):
-  Fix RGB FlateEncode Images(PNG) and transparency (#1834)
-  Generate static appearance for fields (#1864)

[Full Changelog](3.9.1...3.10.0)
@pubpub-zz pubpub-zz deleted the rgb_png&transparency branch September 2, 2023 09:49