PdfReadError: Too many lookup values while extracting image #2889

Closed
michelcrypt4d4mus opened this issue Oct 4, 2024 · 1 comment · Fixed by #2900
Labels
is-robustness-issue: From a user's perspective, this is about robustness
workflow-images: From a user's perspective, image handling is the affected feature/workflow

michelcrypt4d4mus commented Oct 4, 2024

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
macOS-13.6.7-arm64-arm-64bit

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.3.1, crypt_provider=('pycryptodome', '3.20.0'), PIL=10.4.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader
reader = PdfReader('New York State 100AnnvBook140701final10.pdf')

for page in reader.pages:
    print(page)
    for image in page.images:
        print(image)
        print(image.image)

New York State 100AnnvBook140701final10.pdf

Traceback

Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/Users/uzor/Library/Caches/pypoetry/virtualenvs/clown-sort-BrYcfkKs-py3.11/lib/python3.11/site-packages/pypdf/_page.py", line 2454, in __iter__
    yield self[i]
          ~~~~^^^
  File "/Users/uzor/Library/Caches/pypoetry/virtualenvs/clown-sort-BrYcfkKs-py3.11/lib/python3.11/site-packages/pypdf/_page.py", line 2450, in __getitem__
    return self.get_function(lst[index])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/uzor/Library/Caches/pypoetry/virtualenvs/clown-sort-BrYcfkKs-py3.11/lib/python3.11/site-packages/pypdf/_page.py", line 490, in _get_image
    imgd = _xobj_to_image(cast(DictionaryObject, xobjs[id]))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/uzor/Library/Caches/pypoetry/virtualenvs/clown-sort-BrYcfkKs-py3.11/lib/python3.11/site-packages/pypdf/filters.py", line 791, in _xobj_to_image
    img, image_format, extension, _ = _handle_flate(
                                      ^^^^^^^^^^^^^^
  File "/Users/uzor/Library/Caches/pypoetry/virtualenvs/clown-sort-BrYcfkKs-py3.11/lib/python3.11/site-packages/pypdf/_xobj_image_helpers.py", line 218, in _handle_flate
    raise PdfReadError(
pypdf.errors.PdfReadError: Too many lookup values: Expected 8, got 1016.

stefan6419846 (Collaborator) commented

Thanks for your report. Your PDF file seems to contain some invalid color lookup tables, especially on page 13. The first broken LUT is too big, the following LUTs are too small.

Replacing lines 212 to 224 of pypdf/_xobj_image_helpers.py (inside _handle_flate) with the following code seems to fix it:

                if len(lookup) != expected_count:
                    if len(lookup) < expected_count:
                        logger_warning(
                            f"Not enough lookup values: Expected {expected_count}, got {len(lookup)}.",
                            __name__
                        )
                        lookup += bytes([0] * (expected_count - len(lookup)))
                    elif not check_if_whitespace_only(lookup[expected_count:]):
                        logger_warning(
                            f"Too many lookup values: Expected {expected_count}, got {len(lookup)}.",
                            __name__
                        )
                    lookup = lookup[:expected_count]

This right-pads the lookup table with null bytes if there are not enough values and always truncates entries beyond the expected count. Both cases now emit warnings instead of hard errors, and the "too many" warning is skipped when the excess bytes are whitespace only.
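
For readers who want to try the padding and truncation behavior outside of pypdf, here is a minimal standalone sketch of the same idea. The names normalize_lookup, is_whitespace_only and PDF_WHITESPACE are hypothetical and are not part of pypdf. The whitespace set is an assumption based on the PDF whitespace characters, and the standard logging module stands in for pypdf's logger_warning helper.

```python
import logging

logger = logging.getLogger(__name__)

# PDF whitespace bytes (NUL, TAB, LF, FF, CR, SPACE). Assumption: this roughly
# matches what pypdf's check_if_whitespace_only treats as whitespace.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def is_whitespace_only(data: bytes) -> bool:
    """Hypothetical stand-in for pypdf's check_if_whitespace_only."""
    return all(byte in PDF_WHITESPACE for byte in data)


def normalize_lookup(lookup: bytes, expected_count: int) -> bytes:
    """Pad or truncate a color lookup table to exactly expected_count bytes."""
    if len(lookup) < expected_count:
        logger.warning(
            "Not enough lookup values: Expected %d, got %d.",
            expected_count, len(lookup),
        )
        # Right-pad with null bytes up to the expected size.
        return lookup + bytes(expected_count - len(lookup))
    if len(lookup) > expected_count:
        # Only warn if the surplus carries real data, not just whitespace.
        if not is_whitespace_only(lookup[expected_count:]):
            logger.warning(
                "Too many lookup values: Expected %d, got %d.",
                expected_count, len(lookup),
            )
        # Always drop entries that are out of bounds.
        return lookup[:expected_count]
    return lookup


# Sizes taken from the error message above: expected 8, got 1016.
print(len(normalize_lookup(b"\xff" * 1016, 8)))  # 8
print(normalize_lookup(b"\x01\x02\x03", 8))
```

The sketch returns a new bytes object rather than mutating in place, so it can be tested in isolation. The patch above instead reassigns lookup inside _handle_flate.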

stefan6419846 added the workflow-images and is-robustness-issue labels Oct 4, 2024