Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Remove only raster graphics #2208

Closed
pbrus opened this issue Sep 20, 2023 · 11 comments · Fixed by #2214
Closed

ENH: Remove only raster graphics #2208

pbrus opened this issue Sep 20, 2023 · 11 comments · Fixed by #2214
Assignees
Labels
Has MCVE A minimal, complete and verifiable example helps a lot to debug / understand feature requests is-feature A feature request

Comments

@pbrus
Copy link

pbrus commented Sep 20, 2023

The documentation contains two examples regarding image extraction:

This works as expected for pages containing embedded raster graphics. For example:

from pypdf import PdfReader, PdfWriter


# PDF: https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf
reader = PdfReader("GeoBase_NHNC1_Data_Model_UML_EN.pdf")

page = reader.pages[0]
count = 0

for image_file_object in page.images:
	with open(str(count) + image_file_object.name, "wb") as fp:
		fp.write(image_file_object.data)
		count += 1

writer = PdfWriter()
writer.add_page(page)
writer.remove_images()

with open("out.pdf", "wb") as f:
	writer.write(f)

As a result we get two images and a new PDF document without these images. When we change the page from 0 to the last one:

page = reader.pages[18]

We get only reduced PDF document:

image

Some background images have disappeared (they look like vector graphics) and they are not store in separate files.

This might look like a bug but I suppose that you are aware of that (that's the reason to request a new feature instead of a bug). Nevertheless, it could be great to have a feature allowing to extract only raster graphics to separate files and don't touch vector graphics (or extract them but also with text which might be very hard as I guess).

@pubpub-zz
Copy link
Collaborator

remove_images() is designed to remove all images(and drawings)
This is a new feature. Feasability under analysis

@pubpub-zz pubpub-zz added is-feature A feature request Has MCVE A minimal, complete and verifiable example helps a lot to debug / understand feature requests labels Sep 20, 2023
@pubpub-zz
Copy link
Collaborator

@MartinThoma
I would like your opinion:
a) looking deeper in the code I've observed that inline images are not deleted with remove_images() does not remove inline images. This does not sound normal to me
b) I would propose the following prototype:

XOBJECT_IMAGES = 1
INLINE_IMAGES = 2
DRAWING_IMAGES  = 4
ALL_IMAGES = 7
def remove_images(self, image_type:int = ALL_IMAGES;ignore_byte_string_object: Optional[bool] = None) -> None:

your opinion

@MartinThoma MartinThoma changed the title Remove only raster graphics ENH: Remove only raster graphics Sep 24, 2023
@MartinThoma
Copy link
Member

a) looking deeper in the code I've observed that inline images are not deleted with remove_images() does not remove inline images. This does not sound normal to me

Yes, that is a bug. It should delete inline images (I vaguely remember that this was also documented somewhere)

@pubpub-zz
Copy link
Collaborator

b) I would propose the following prototype:

XOBJECT_IMAGES = 1
INLINE_IMAGES = 2
DRAWING_IMAGES  = 4
ALL_IMAGES = 7
def remove_images(self, image_type:int = ALL_IMAGES;ignore_byte_string_object: Optional[bool] = None) -> None:

your opinion?

@MartinThoma
Copy link
Member

Making it more flexible is a similar issue as with #1831 .

The solution proposed there is to add a function which gets metadata about the annotation and returns True (delete) or False (don't delete).

That is for sure the more flexible design and allows easier extension.

I'm currently writing a proposal going in that direction :-) Give me a few minutes xD You're too fast xD

@MartinThoma
Copy link
Member

MartinThoma commented Sep 24, 2023

What do you think about the following:

class ImageType(Enum):
    XOBJECT_IMAGES = auto()
    INLINE_IMAGES = auto()
    DRAWING_IMAGES  = auto()

@dataclass
class ImageData:
    image_type: ImageType
    ... # potentially more information, e.g. the name, the size, the encoding, position in the page, reference to the image object, ...

def constant_true(image_data: ImageData) -> bool:
    return True

def remove_images(
    self,
    ignore_byte_string_object: Optional[bool] = None,
    to_delete: Callable[[ImageData], bool] = constant_true
) -> None: ...

@MartinThoma
Copy link
Member

I'm looking for similar functions:

@MartinThoma
Copy link
Member

Weird little finding: I thought None might be a bad default as it doesn't sound like "all". The ellipsis ... might be clearer, but I have only seen it as a placeholder so far. Turns out, one can use it:

def mul(a, b=...):
    if b == ...:
        return "all indices"
    return a*b

print(mul(2))

I'm not sure if that is a good idea though 😅

@pubpub-zz
Copy link
Collaborator

pubpub-zz commented Sep 24, 2023

class ImageType(Enum):
    XOBJECT_IMAGES = auto()
    INLINE_IMAGES = auto()
    DRAWING_IMAGES  = auto()

@dataclass
class ImageData:
    image_type: ImageType
    ... # potentially more information, e.g. the name, the size, the encoding, position in the page, reference to the image object, ...

def constant_true(image_data: ImageData) -> bool:
    return True

def remove_images(
    self,
    ignore_byte_string_object: Optional[bool] = None,
    to_delete: Callable[[ImageData], bool] = constant_true
) -> None: ...

ignore_byte_string_object is deprecating and I'm pretty sure rarely used. I would rather put to_delete in second pos.
with the type we can identify if to_delete has been forgotten.

also I Propose to have class ImageType(Flag):

@pubpub-zz
Copy link
Collaborator

Weird little finding: I thought None might be a bad default as it doesn't sound like "all". The ellipsis ... might be clearer, but I have only seen it as a placeholder so far. Turns out, one can use it:

def mul(a, b=...):
    if b == ...:
        return "all indices"
    return a*b

print(mul(2))

I'm not sure if that is a good idea though 😅

None is more common for me.

@MartinThoma
Copy link
Member

also I Propose to have class ImageType(Flag)

Sounds good 👍

None is more common for me.

Also good 👍

ignore_byte_string_object is deprecating and I'm pretty sure rarely used. I would rather put to_delete in second pos.
with the type we can identify if to_delete has been forgotten.

Then it's fine to me 👍

What do you think about using the delete_indices solution?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Has MCVE A minimal, complete and verifiable example helps a lot to debug / understand feature requests is-feature A feature request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants