ENH: Accelerate image list keys generation #2014

pubpub-zz · 2023-07-25T18:28:50Z

closes #1987

pubpub-zz · 2023-07-25T18:31:26Z

@MartinThoma
I've got 2 mypy errors I do not understand. Can you have a look, please? 😘

codecov · 2023-07-26T18:56:34Z

Codecov Report

Patch coverage: 100.00% and project coverage change: -0.02% ⚠️

Comparison is base (890c93a) 94.03% compared to head (88c8bb2) 94.01%.
Report is 6 commits behind head on main.

❗ Current head 88c8bb2 differs from pull request most recent head c756267. Consider uploading reports for the commit c756267 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2014      +/-   ##
==========================================
- Coverage   94.03%   94.01%   -0.02%     
==========================================
  Files          33       33              
  Lines        7076     7090      +14     
  Branches     1413     1418       +5     
==========================================
+ Hits         6654     6666      +12     
- Misses        263      264       +1     
- Partials      159      160       +1

| Files Changed | Coverage Δ | |
|---|---|---|
| pypdf/_page.py | `93.61% <100.00%> (-0.15%)` | ⬇️ |
| pypdf/_utils.py | `99.17% <100.00%> (+<0.01%)` | ⬆️ |

... and 1 file with indirect coverage changes

pubpub-zz · 2023-07-26T19:08:18Z

@MartinThoma
I've found a nice fix. Now it's all yours 😀

MartinThoma · 2023-07-28T11:58:10Z

I've tested it with https://github.com/py-pdf/pypdf/files/12160419/table_redacted.pdf and now I get:

Traceback (most recent call last):
  File "/home/moose/Github/py-pdf/pypdf/sample-files/foo.py", line 20, in <module>
    run("table_redacted.pdf")
  File "/home/moose/Github/py-pdf/pypdf/sample-files/foo.py", line 13, in run
    for image in page.images:
  File "/home/moose/Github/py-pdf/pypdf/pypdf/_page.py", line 2633, in __iter__
    yield self[i]
          ~~~~^^^
  File "/home/moose/Github/py-pdf/pypdf/pypdf/_page.py", line 2629, in __getitem__
    return self.get_function(lst[index])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/moose/Github/py-pdf/pypdf/pypdf/_page.py", line 532, in _get_image
    return self.inline_images[id]
           ~~~~~~~~~~~~~~~~~~^^^^
KeyError: '~0~'

stefan6419846 · 2023-07-28T12:00:41Z

Which code did you use for testing? Did you remove page.inline_images = dict()?

MartinThoma · 2023-07-28T12:11:30Z

Thank you - I forgot that 🙈

MartinThoma · 2023-07-28T12:14:30Z

Before (current main):

4.24s: 009-pdflatex-geotopo/GeoTopo.pdf
2.88s: 009-pdflatex-geotopo/GeoTopo-komprimiert.pdf

With this PR:

2.01s: 009-pdflatex-geotopo/GeoTopo.pdf
0.44s: 009-pdflatex-geotopo/GeoTopo-komprimiert.pdf

Good work 🎉
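
For reference, timings like the ones above could be collected with a small script along these lines. This is a minimal sketch, not the script used for the numbers above: `time_call` and `count_images` are hypothetical helper names, and it assumes pypdf is installed and the sample files are available locally.

```python
import time
from typing import Any, Callable, Tuple


def time_call(func: Callable[..., Any], *args: Any) -> Tuple[float, Any]:
    """Run func(*args) once and return (elapsed seconds, result)."""
    start = time.perf_counter()
    result = func(*args)
    return time.perf_counter() - start, result


def count_images(pdf_path: str) -> int:
    """Iterate over every page's images, which exercises the image list keys."""
    # Imported lazily so the timing helper is usable without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return sum(1 for page in reader.pages for _ in page.images)


def report(pdf_path: str) -> str:
    """Format one line in the same style as the timings above."""
    elapsed, _ = time_call(count_images, pdf_path)
    return f"{elapsed:.2f}s: {pdf_path}"
```

Calling `print(report("009-pdflatex-geotopo/GeoTopo.pdf"))` on each branch would give comparable before/after lines. A single run is noisy, so repeating the measurement a few times is advisable.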

pubpub-zz · 2023-07-28T13:04:05Z

> I've tested it with https://github.com/py-pdf/pypdf/files/12160419/table_redacted.pdf and now I get:
>
> Traceback (most recent call last):
>   File "/home/moose/Github/py-pdf/pypdf/sample-files/foo.py", line 20, in <module>
>     run("table_redacted.pdf")
>   File "/home/moose/Github/py-pdf/pypdf/sample-files/foo.py", line 13, in run
>     for image in page.images:
>   File "/home/moose/Github/py-pdf/pypdf/pypdf/_page.py", line 2633, in __iter__
>     yield self[i]
>           ~~~~^^^
>   File "/home/moose/Github/py-pdf/pypdf/pypdf/_page.py", line 2629, in __getitem__
>     return self.get_function(lst[index])
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/home/moose/Github/py-pdf/pypdf/pypdf/_page.py", line 532, in _get_image
>     return self.inline_images[id]
>            ~~~~~~~~~~~~~~~~~~^^^^
> KeyError: '~0~'

Can you clarify the code you are using? `page.inline_images = dict()` is normally not required.

MartinThoma · 2023-07-28T14:02:04Z

Yes, I was adding page.inline_images = dict(). That was leading to the error. However, I would not consider this a blocker as this is modifying pypdf behavior in an unexpected way.

I want to have a final look after work, but so far it seems like a great improvement. I'll likely merge it as it is :-)
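
One plausible reading of the `KeyError: '~0~'` above, sketched as a toy model: the image keys are generated cheaply, while the inline images themselves are parsed only on first access. This is not pypdf's actual code. `LazyPage`, `image_keys`, `get_image`, and `parse_calls` are hypothetical names, and the assumption that `None` marks "not parsed yet" is an assumption too.

```python
from typing import Dict, List, Optional


class LazyPage:
    """Toy stand-in for a page; NOT pypdf's actual implementation."""

    def __init__(self, inline_count: int) -> None:
        self._inline_count = inline_count
        # None means "not parsed yet"; a dict means "already parsed".
        self.inline_images: Optional[Dict[str, bytes]] = None
        self.parse_calls = 0

    def image_keys(self) -> List[str]:
        # Cheap: build the keys without decoding any image data.
        return [f"~{i}~" for i in range(self._inline_count)]

    def _parse_inline_images(self) -> Dict[str, bytes]:
        # Stands in for the expensive step; it only creates placeholder bytes.
        self.parse_calls += 1
        return {key: b"image-data" for key in self.image_keys()}

    def get_image(self, key: str) -> bytes:
        # Parse lazily, and only if nothing has been stored yet.
        if self.inline_images is None:
            self.inline_images = self._parse_inline_images()
        return self.inline_images[key]


page = LazyPage(inline_count=2)
keys = page.image_keys()          # available before anything is parsed
first = page.get_image(keys[0])   # first access triggers the single parse

clobbered = LazyPage(inline_count=2)
clobbered.inline_images = dict()  # looks "already parsed", so parsing is skipped
try:
    clobbered.get_image("~0~")
    clobbered_error = None
except KeyError as exc:
    clobbered_error = exc
```

In this model, assigning an empty dict from outside makes the key list and the image store disagree, which would be consistent with the error only appearing when `page.inline_images = dict()` was added to the test script.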

## What's new

### New Features (ENH)
- Accelerate image list keys generation (#2014)
- Use `cryptography` for encryption/decryption as a fallback for PyCryptodome (#2000)
- Extract LaTeX characters (#2016)
- ASCIIHexDecode.decode now returns bytes instead of str (#1994)

### Bug Fixes (BUG)
- Add RunLengthDecode filter (#2012)
- Process /Separation ColorSpace (#2007)
- Handle single element ColorSpace list (#2026)
- Process lookup decoded as TextStringObjects (#2008)

### Robustness (ROB)
- Cope with garbage collector during cloning (#1841)

### Maintenance (MAINT)
- Cleanup of annotations (#1745)

[Full Changelog](3.13.0...3.14.0)

ENH : accelerate image list keys generation

30dacb9

closes py-pdf#1987

mypy

88c8bb2

MartinThoma changed the title ~~ENH : accelerate image list keys generation~~ ENH: Accelerate image list keys generation Jul 28, 2023

MartinThoma reviewed Jul 28, 2023

Update pypdf/_page.py

c756267

MartinThoma merged commit 94f23f9 into py-pdf:main Jul 28, 2023

pubpub-zz deleted the iss1987 branch September 2, 2023 09:45
