Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using PdfReader causes a crash #2761

Closed
Avgor46 opened this issue Jul 19, 2024 · 9 comments · Fixed by #2800
Closed

Using PdfReader causes a crash #2761

Avgor46 opened this issue Jul 19, 2024 · 9 comments · Fixed by #2800
Labels
is-robustness-issue From a users perspective, this is about robustness PdfMerger The PdfMerger component is affected

Comments

@Avgor46
Copy link

Avgor46 commented Jul 19, 2024

Hi!

I've been fuzzing PdfReader with a sydr-fuzz via langchain project and found few errors. The question is should the user handle python errors from the pypdf library or is it a bug in pypdf? The necessary information to reproduce one of them is provided below.

Environment

$ python3 -m platform
Linux-5.15.0-56-generic-x86_64-with-glibc2.31

$ python3 -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.3.0, crypt_provider=('cryptography', '3.1'), PIL=none

Code + PDF

This is a minimal, complete example that shows the issue:

#! /usr/bin/env python3

import pypdf
from pypdf.errors import EmptyFileError, PdfReadError, PdfStreamError
import sys

def TestOneInput(fname):
  try:
    pdf_reader = pypdf.PdfReader(fname)
    for page_number, page in enumerate(pdf_reader.pages):
        page.extract_text()
  except (EmptyFileError, PdfReadError, PdfStreamError):
      pass

if __name__ == "__main__":
    if len(sys.argv) < 2:
        exit(1)
    TestOneInput(sys.argv[1])

PoC

crash-b26d05712a29b241ac6f9dc7fff57428ba2d1a04.pdf

Traceback

This is the complete traceback I see:

Traceback (most recent call last):
  File "/fuzz/./reproducer.py", line 18, in <module>
    TestOneInput(sys.argv[1])
  File "/fuzz/./reproducer.py", line 10, in TestOneInput
    for page_number, page in enumerate(pdf_reader.pages):
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_page.py", line 2296, in __iter__
    for i in range(len(self)):
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_page.py", line 2227, in __len__
    return self.length_function()
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_doc_common.py", line 353, in get_num_pages
    self._flatten()
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_doc_common.py", line 1125, in _flatten
    self._flatten(obj, inherit, **addt)
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_doc_common.py", line 1125, in _flatten
    self._flatten(obj, inherit, **addt)
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_doc_common.py", line 1125, in _flatten
    self._flatten(obj, inherit, **addt)
  [Previous line repeated 986 more times]
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_doc_common.py", line 1122, in _flatten
    obj = page.get_object()
  File "/usr/local/lib/python3.9/dist-packages/pypdf/generic/_base.py", line 284, in get_object
    return self.pdf.get_object(self)
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_reader.py", line 383, in get_object
    retval = self.cache_get_indirect_object(
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_reader.py", line 545, in cache_get_indirect_object
    return self.resolved_objects.get((generation, idnum))
RecursionError: maximum recursion depth exceeded in comparison

@stefan6419846
Copy link
Collaborator

Thanks for the report.

In general, I would consider it to be okay for completely broken PDF files to throw any exception. For the specific PDF file and the cyclic page tree, we might want to provide a dedicated error message. Feel free to submit a corresponding PR.

@pubpub-zz
Copy link
Collaborator

@Avgor46
You have apparently created the pdf file by your own. If so, for me the issue is morelikely on the generation side to have build a damaged pdf : for information your file induces an exception also in pdfbox.

@stefan6419846
Copy link
Collaborator

@MartinThoma
Copy link
Member

In general, I would consider it to be okay for completely broken PDF files to throw any exception.

I agree, but there is an important addition: Infinite loops as in GHSA-xcjx-m2pj-8g79 are not ok. pypdf should fail with an exception, not with a hanging application. A RecursionError might be ok, but (without having checked that specific case) refactoring of the code could also lead to an infinite loop which would no longer be ok.

@Avgor46
Copy link
Author

Avgor46 commented Jul 20, 2024

Ok, so, I've found several other PoCs that cause different exceptions (AttributeError, ValueError, ...) in library code. I see 3 ways to close this issue:

  1. Consider completely broken PDF files as erroneous inputs that may cause any exception.
  2. If you want, I will attach them here or create new issues, so you will be able to investigate them.
  3. I can examine them and try to fix or catch exceptions in some places and convert them into pypdf exceptions. But I am not familiar with codebase, so it may take some time.

What should we do next?

@stefan6419846
Copy link
Collaborator

In general, it always depends on the actual PDF file which needs investigation - there is not "the way" to go with these cases. Thus (1) might not work here. (3) sounds bad as well, as this might just hide/obfuscate real bugs we might have.

(2) sounds the best, but I am not sure what the best solution about this would be. I prefer not to have lots of mass reports of potentially just broken files at once - neither in this issue nor in separate ones. It just takes its time to further analyze each case and decide on how to go with it, while the number of active/frequent contributors tends to be rather low.

My preferred approach in this case would be to gradually report and solve these issues by opening a small amount of issues each time and ideally getting this solved/analyzed with your support before opening the next batch.

@MartinThoma
Copy link
Member

I've found several other PoCs that cause different exceptions (AttributeError, ValueError, ...) in library code

See #1210 . Let's have the discussion weather it's fine to throw standard exceptions in that discussion :-)

@MartinThoma
Copy link
Member

If you want, I will attach [broken PDF documents] here

Free-to-use PDF document are valuable to pypdf, but also potentially to other libraries. If you have the copyright on those PDFs, please add a PR to add them to https://github.com/py-pdf/sample-files/ :-)

@MartinThoma
Copy link
Member

Looking at https://github.com/py-pdf/pypdf/blob/main/pypdf/errors.py I would say we already decided on custom exceptions. In this specific case, I would have expected ParseError or PdfReadError.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
is-robustness-issue From a users perspective, this is about robustness PdfMerger The PdfMerger component is affected
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants