Enable image compression #1546

Closed
finevine opened this issue Jan 12, 2023 · 6 comments · Fixed by #1849
Labels
is-feature A feature request

Comments

@finevine

Explanation

I want to replace images in a PDF with compressed ones.
Getting the images and saving them to disk works like a charm with the example in the docs.
But I cannot change them in the PDF.

Code Example

How would your feature be used?

from pypdf import PdfReader, PdfWriter

reader = PdfReader(input_file_path, strict=False)
writer = PdfWriter()
for page in reader.pages:
    # Requested feature: assigning to page.images; compress() stands for any image compressor
    page.images = [compress(image_file_object) for image_file_object in page.images]
    writer.add_page(page)
...  # your new feature in action!

I have found a bunch of code that aims to implement this feature:

but it doesn't work as expected.

@MartinThoma MartinThoma changed the title Replace images in a page.images Enable image compression Jan 12, 2023
@MartinThoma MartinThoma added the is-feature A feature request label Jan 12, 2023
@MartinThoma
Member

@pubpub-zz Running the following code compresses all content streams with DEFLATE, including the images, right?

from pypdf import PdfReader, PdfWriter

reader = PdfReader("example.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.compress_content_streams()  # This is CPU intensive!
    writer.add_page(page)

with open("out.pdf", "wb") as f:
    writer.write(f)

@MartinThoma
Member

There are also quite a lot of image compression algorithms.

@finevine
Author

finevine commented Jan 13, 2023

@pubpub-zz Running the following code compresses all content streams with DEFLATE, including the images, right?

from pypdf import PdfReader, PdfWriter

reader = PdfReader("example.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.compress_content_streams()  # This is CPU intensive!
    writer.add_page(page)

with open("out.pdf", "wb") as f:
    writer.write(f)

Yes, it does, but as mentioned in the docs, it does not work on all PDF files (at least it does not increase the size as GS does).


+-----------+-------------------------------------------+-------+--------+---------+-------+---------+---------+----------+---------+------------------------------+
| Size (MB) |                   type                    | pypdf | ratio  | quality |  gs   |  ratio  | quality | pike-pdf |  ratio  |           quality            |
+-----------+-------------------------------------------+-------+--------+---------+-------+---------+---------+----------+---------+------------------------------+
|       108 | Img and OCR                               |   8.6 | 92.04% | good    | 137.3 | -27.13% | good    |      163 | -50.93% | good                         |
|        20 | Traditional text and images               |    20 | 0.00%  | good    |  12.3 | 38.50%  | good    |       11 | 45.00%  | good for some images         |
|        44 | Repeated huge photos in a word doc to pdf |    44 | 0.00%  | good    |   4.4 | 90.00%  | good    |       21 | 52.27%  | very poor on repeated images |
|      56.6 | 400 pages of a book converted to images   |  56.6 | 0.00%  | good    |  69.2 | -22.26% | good    |     53.5 | 5.48%   | good                         |
|       5.8 | 96 pages of text and images               |   3.6 | 37.93% | good    |     3 | 48.28%  | good    |      2.4 | 58.62%  | OK except for alpha layer    |
+-----------+-------------------------------------------+-------+--------+---------+-------+---------+---------+----------+---------+------------------------------+

The pypdf column uses the method you quoted.
The gs column uses https://github.com/theeko74/pdfc (a call to Ghostscript).
The pike-pdf column uses more or less https://github.com/theeko74/pdfc (resizing big images with Pillow: image.resize((width/2, height/2), Image.BILINEAR)).
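
For anyone checking the ratio columns in the table above: they appear to be the relative size reduction, so a negative value means the output grew. A minimal sketch of that calculation follows; the function name is made up for illustration and is not part of pypdf, Ghostscript, or pikepdf.

```python
def size_reduction_percent(original_size: float, compressed_size: float) -> float:
    """Return the size reduction in percent; a negative value means the file grew."""
    return (original_size - compressed_size) / original_size * 100


# First table row: the 108 (size units as in the table) "Img and OCR" file.
for tool, size in [("pypdf", 8.6), ("gs", 137.3), ("pike-pdf", 163)]:
    print(f"{tool}: {size_reduction_percent(108, size):.2f}%")
```

With the first row this reproduces the 92.04%, -27.13% and -50.93% figures reported in the table.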

@finevine finevine reopened this Jan 13, 2023
@pubpub-zz
Collaborator

@pubpub-zz Running the following code compresses all content streams with DEFLATE, including the images, right?

from pypdf import PdfReader, PdfWriter

reader = PdfReader("example.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.compress_content_streams()  # This is CPU intensive!
    writer.add_page(page)

with open("out.pdf", "wb") as f:
    writer.write(f)

Having had a quick look at the code, it seems that only the page content is deflated, but not the image XObjects, which is where the big images should be.

@finevine,
I dislike the idea of implementing lossy compression: there would be too many options. However, it might be possible to implement a visitor function to replace the images.
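
To make the lossless side of that distinction concrete, here is a minimal sketch using Python's standard zlib module (which implements DEFLATE). It is not pypdf's internal code, and the sample content stream is made up for illustration.

```python
import zlib

# Made-up, highly repetitive PDF-like content stream (an assumption for illustration).
content_stream = b"BT /F1 12 Tf 72 720 Td (Hello world) Tj ET\n" * 200

deflated = zlib.compress(content_stream, level=9)
restored = zlib.decompress(deflated)

# Lossless: the round trip gives back exactly the original bytes.
assert restored == content_stream
print(len(content_stream), "->", len(deflated), "bytes")
```

The only real knob here is the compression level, whereas a lossy image re-encode would need choices such as quality and target resolution, which is the "too many options" concern.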

@finevine
Author

finevine commented Jan 13, 2023 via email

@pubpub-zz
Collaborator

pubpub-zz commented Jan 31, 2023

Some notes/ideas about image setting
(just a draft to be cleaned up):

from pypdf import PdfReader, PdfWriter
from pypdf.generic import NameObject, NullObject
from PIL import Image
from io import BytesIO

w = PdfWriter()
w.append("resources/labeled-edges-center-image.pdf")  # copy the source pages into the writer

for p in w.pages:
    for image_file_object in p.images:
        print(image_file_object.name)
        # Re-encode the image with Pillow as a one-page PDF held in memory
        ii = Image.open(BytesIO(image_file_object.data))
        b = BytesIO()
        ii.save(b, "pdf", quality=60, resolution=19.0, optimize=True)
        rrr = PdfReader(b)
        # XObject name: the image name without its file extension
        n = NameObject("/" + "".join(image_file_object.name.split(".")[:-1]))
        # raw_get returns the indirect reference instead of the resolved object
        ind = p["/Resources"]["/XObject"].raw_get(n)
        w._objects[ind.idnum] = NullObject()  # to cleanup file
        # Point the page at a clone of the image XObject that Pillow wrote as "/image"
        p["/Resources"]["/XObject"][n] = (
            rrr.pages[0]["/Resources"]["/XObject"]["/image"].clone(w).indirect_reference
        )
w.write("tt.pdf")

edit : code updated

MartinThoma pushed a commit that referenced this issue Jun 13, 2023
Having the capability to replace images trivially extends to compressing a PDF file size by reducing the contained images.

Closes #1546