ENH: Accelerate image list keys generation #2014

Merged 3 commits on Jul 28, 2023

pubpub-zz
Collaborator

closes #1987

@pubpub-zz
Collaborator Author

@MartinThoma
I've got 2 mypy errors I do not understand. Can you have a look, please? 😘

@codecov

codecov bot commented Jul 26, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: -0.02% ⚠️

Comparison is base (890c93a) 94.03% compared to head (88c8bb2) 94.01%.
Report is 6 commits behind head on main.

❗ Current head 88c8bb2 differs from pull request most recent head c756267. Consider uploading reports for the commit c756267 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2014      +/-   ##
==========================================
- Coverage   94.03%   94.01%   -0.02%     
==========================================
  Files          33       33              
  Lines        7076     7090      +14     
  Branches     1413     1418       +5     
==========================================
+ Hits         6654     6666      +12     
- Misses        263      264       +1     
- Partials      159      160       +1     
| Files Changed | Coverage Δ |
|---|---|
| pypdf/_page.py | 93.61% <100.00%> (-0.15%) ⬇️ |
| pypdf/_utils.py | 99.17% <100.00%> (+<0.01%) ⬆️ |

... and 1 file with indirect coverage changes


@pubpub-zz
Collaborator Author

@MartinThoma
I've found a nice fix. Now it's all yours 😀

@MartinThoma
Member

I've tested it with https://github.com/py-pdf/pypdf/files/12160419/table_redacted.pdf and now I get:

Traceback (most recent call last):
  File "/home/moose/Github/py-pdf/pypdf/sample-files/foo.py", line 20, in <module>
    run("table_redacted.pdf")
  File "/home/moose/Github/py-pdf/pypdf/sample-files/foo.py", line 13, in run
    for image in page.images:
  File "/home/moose/Github/py-pdf/pypdf/pypdf/_page.py", line 2633, in __iter__
    yield self[i]
          ~~~~^^^
  File "/home/moose/Github/py-pdf/pypdf/pypdf/_page.py", line 2629, in __getitem__
    return self.get_function(lst[index])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/moose/Github/py-pdf/pypdf/pypdf/_page.py", line 532, in _get_image
    return self.inline_images[id]
           ~~~~~~~~~~~~~~~~~~^^^^
KeyError: '~0~'

@MartinThoma changed the title from "ENH : accelerate image list keys generation" to "ENH: Accelerate image list keys generation" on Jul 28, 2023
@stefan6419846
Collaborator

Which code did you use for testing? Did you remove page.inline_images = dict()?

@MartinThoma
Member

Thank you - I forgot that 🙈

@MartinThoma
Member

Before (current main):

  • 4.24s: 009-pdflatex-geotopo/GeoTopo.pdf
  • 2.88s: 009-pdflatex-geotopo/GeoTopo-komprimiert.pdf

With this PR:

  • 2.01s: 009-pdflatex-geotopo/GeoTopo.pdf
  • 0.44s: 009-pdflatex-geotopo/GeoTopo-komprimiert.pdf

Good work 🎉
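
For readers who want to reproduce a comparison like the one above, here is a minimal timing sketch. How the figures in this thread were actually measured is not stated, so `time_call` and `speedup` are hypothetical helpers and only the four timings are taken from the comment.

```python
import time
from typing import Callable, Dict, Tuple


def time_call(func: Callable[[], object], repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def speedup(before_s: float, after_s: float) -> float:
    """Ratio of the old runtime to the new runtime."""
    return before_s / after_s


# Timings copied from the comment above: (current main, with this PR).
reported: Dict[str, Tuple[float, float]] = {
    "GeoTopo.pdf": (4.24, 2.01),
    "GeoTopo-komprimiert.pdf": (2.88, 0.44),
}

for name, (before_s, after_s) in reported.items():
    print(f"{name}: {speedup(before_s, after_s):.1f}x faster")
```

With the reported numbers this works out to roughly 2.1x for GeoTopo.pdf and 6.5x for the compressed variant. In practice, `time_call` would wrap whatever loop is being benchmarked, for example one that iterates over `page.images` for every page of a file.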

@pubpub-zz
Collaborator Author

> I've tested it with https://github.com/py-pdf/pypdf/files/12160419/table_redacted.pdf and now I get:
>
> KeyError: '~0~' (same traceback as in the earlier comment)

Can you clarify the code you are using? page.inline_images = dict() is normally not required.

@MartinThoma
Member

Yes, I was adding page.inline_images = dict(). That was leading to the error. However, I would not consider this a blocker, as setting that attribute by hand modifies pypdf behavior in an unexpected way.

I want to have a final look after work, but so far it seems like a great improvement. I'll likely merge it as it is :-)
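
One way such a KeyError can arise is sketched below. It assumes, without confirmation from this thread, that image keys are generated cheaply and separately from a lazily filled `inline_images` mapping. `SketchPage` and its methods are hypothetical and are not pypdf's code; only the attribute name `inline_images` and the key `'~0~'` come from the traceback.

```python
from typing import Dict, Iterator, List, Optional


class SketchPage:
    """Hypothetical page model; names and logic are illustrative only."""

    def __init__(self, inline_count: int) -> None:
        self._inline_count = inline_count
        # Assumption: None means "inline images not extracted yet".
        self.inline_images: Optional[Dict[str, bytes]] = None

    def image_keys(self) -> List[str]:
        # Cheap key generation: counts inline images without decoding them.
        return [f"~{i}~" for i in range(self._inline_count)]

    def get_image(self, key: str) -> bytes:
        if self.inline_images is None:
            # Expensive extraction, deferred until an image is requested.
            self.inline_images = {k: b"<image data>" for k in self.image_keys()}
        return self.inline_images[key]

    def iter_images(self) -> Iterator[bytes]:
        for key in self.image_keys():
            yield self.get_image(key)


# Untouched page: keys and mapping stay in sync.
untouched = SketchPage(inline_count=1)
images_ok = list(untouched.iter_images())

# Manually overridden page: the empty dict is not None, so extraction is
# skipped, but the key list still contains '~0~'.
overridden = SketchPage(inline_count=1)
overridden.inline_images = dict()
try:
    list(overridden.iter_images())
    error_key = None
except KeyError as exc:
    error_key = exc.args[0]

print(images_ok, error_key)
```

Under these assumptions, the override leaves the key list and the mapping out of sync, which matches the symptom reported here: removing the manual assignment makes the error disappear.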

@MartinThoma MartinThoma merged commit 94f23f9 into py-pdf:main Jul 28, 2023
MartinThoma added a commit that referenced this pull request Jul 29, 2023
## What's new

### New Features (ENH)
-  Accelerate image list keys generation (#2014)
-  Use `cryptography` for encryption/decryption as a fallback for PyCryptodome (#2000)
-  Extract LaTeX characters (#2016)
-  ASCIIHexDecode.decode now returns bytes instead of str (#1994)

### Bug Fixes (BUG)
-  Add RunLengthDecode filter (#2012)
-  Process /Separation ColorSpace (#2007)
-  Handle single element ColorSpace list (#2026)
-  Process lookup decoded as TextStringObjects (#2008)

### Robustness (ROB)
-  Cope with garbage collector during cloning (#1841)

### Maintenance (MAINT)
-  Cleanup of annotations (#1745)

[Full Changelog](3.13.0...3.14.0)
@pubpub-zz pubpub-zz deleted the iss1987 branch September 2, 2023 09:45
Linked issue: Provide public interface for skipping inline page images