Using PdfReader causes a crash #2761

Avgor46 · 2024-07-19T13:03:59Z

Hi!

I've been fuzzing PdfReader with a sydr-fuzz via langchain project and found few errors. The question is should the user handle python errors from the pypdf library or is it a bug in pypdf? The necessary information to reproduce one of them is provided below.

Environment

$ python3 -m platform
Linux-5.15.0-56-generic-x86_64-with-glibc2.31

$ python3 -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.3.0, crypt_provider=('cryptography', '3.1'), PIL=none

Code + PDF

This is a minimal, complete example that shows the issue:

#! /usr/bin/env python3

import pypdf
from pypdf.errors import EmptyFileError, PdfReadError, PdfStreamError
import sys

def TestOneInput(fname):
  try:
    pdf_reader = pypdf.PdfReader(fname)
    for page_number, page in enumerate(pdf_reader.pages):
        page.extract_text()
  except (EmptyFileError, PdfReadError, PdfStreamError):
      pass

if __name__ == "__main__":
    if len(sys.argv) < 2:
        exit(1)
    TestOneInput(sys.argv[1])

PoC

crash-b26d05712a29b241ac6f9dc7fff57428ba2d1a04.pdf

Traceback

This is the complete traceback I see:

Traceback (most recent call last):
  File "/fuzz/./reproducer.py", line 18, in <module>
    TestOneInput(sys.argv[1])
  File "/fuzz/./reproducer.py", line 10, in TestOneInput
    for page_number, page in enumerate(pdf_reader.pages):
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_page.py", line 2296, in __iter__
    for i in range(len(self)):
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_page.py", line 2227, in __len__
    return self.length_function()
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_doc_common.py", line 353, in get_num_pages
    self._flatten()
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_doc_common.py", line 1125, in _flatten
    self._flatten(obj, inherit, **addt)
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_doc_common.py", line 1125, in _flatten
    self._flatten(obj, inherit, **addt)
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_doc_common.py", line 1125, in _flatten
    self._flatten(obj, inherit, **addt)
  [Previous line repeated 986 more times]
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_doc_common.py", line 1122, in _flatten
    obj = page.get_object()
  File "/usr/local/lib/python3.9/dist-packages/pypdf/generic/_base.py", line 284, in get_object
    return self.pdf.get_object(self)
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_reader.py", line 383, in get_object
    retval = self.cache_get_indirect_object(
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_reader.py", line 545, in cache_get_indirect_object
    return self.resolved_objects.get((generation, idnum))
RecursionError: maximum recursion depth exceeded in comparison

The text was updated successfully, but these errors were encountered:

stefan6419846 · 2024-07-19T13:15:30Z

Thanks for the report.

In general, I would consider it to be okay for completely broken PDF files to throw any exception. For the specific PDF file and the cyclic page tree, we might want to provide a dedicated error message. Feel free to submit a corresponding PR.

pubpub-zz · 2024-07-19T14:28:14Z

@Avgor46
You have apparently created the pdf file by your own. If so, for me the issue is morelikely on the generation side to have build a damaged pdf : for information your file induces an exception also in pdfbox.

stefan6419846 · 2024-07-19T14:33:04Z

According to https://redirect.github.com/ispras/oss-sydr-fuzz/pull/240/files#diff-0493c80d58164d8423b9d74653bea11bbc054b8797d45fbb5552cf670fa23961R30, this might be part of the pdf.js test suite as well.

MartinThoma · 2024-07-19T19:52:16Z

In general, I would consider it to be okay for completely broken PDF files to throw any exception.

I agree, but there is an important addition: Infinite loops as in GHSA-xcjx-m2pj-8g79 are not ok. pypdf should fail with an exception, not with a hanging application. A RecursionError might be ok, but (without having checked that specific case) refactoring of the code could also lead to an infinite loop which would no longer be ok.

Avgor46 · 2024-07-20T23:03:22Z

Ok, so, I've found several other PoCs that cause different exceptions (AttributeError, ValueError, ...) in library code. I see 3 ways to close this issue:

Consider completely broken PDF files as erroneous inputs that may cause any exception.
If you want, I will attach them here or create new issues, so you will be able to investigate them.
I can examine them and try to fix or catch exceptions in some places and convert them into pypdf exceptions. But I am not familiar with codebase, so it may take some time.

What should we do next?

stefan6419846 · 2024-07-21T08:25:00Z

In general, it always depends on the actual PDF file which needs investigation - there is not "the way" to go with these cases. Thus (1) might not work here. (3) sounds bad as well, as this might just hide/obfuscate real bugs we might have.

(2) sounds the best, but I am not sure what the best solution about this would be. I prefer not to have lots of mass reports of potentially just broken files at once - neither in this issue nor in separate ones. It just takes its time to further analyze each case and decide on how to go with it, while the number of active/frequent contributors tends to be rather low.

My preferred approach in this case would be to gradually report and solve these issues by opening a small amount of issues each time and ideally getting this solved/analyzed with your support before opening the next batch.

MartinThoma · 2024-07-21T10:01:48Z

I've found several other PoCs that cause different exceptions (AttributeError, ValueError, ...) in library code

See #1210 . Let's have the discussion weather it's fine to throw standard exceptions in that discussion :-)

MartinThoma · 2024-07-21T10:04:06Z

If you want, I will attach [broken PDF documents] here

Free-to-use PDF document are valuable to pypdf, but also potentially to other libraries. If you have the copyright on those PDFs, please add a PR to add them to https://github.com/py-pdf/sample-files/ :-)

MartinThoma · 2024-07-21T10:12:13Z

Looking at https://github.com/py-pdf/pypdf/blob/main/pypdf/errors.py I would say we already decided on custom exceptions. In this specific case, I would have expected ParseError or PdfReadError.

closes py-pdf#2761

Avgor46 mentioned this issue Jul 21, 2024

[Community]: Added Simple Exceptions for PyPDFParser langchain-ai/langchain#24478

Closed

stefan6419846 added PdfMerger The PdfMerger component is affected is-robustness-issue From a users perspective, this is about robustness labels Jul 24, 2024

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Aug 14, 2024

Terminal ENH: report PdfReadError instead of RecursionError

2f1a623

closes py-pdf#2761

pubpub-zz mentioned this issue Aug 14, 2024

ENH: report PdfReadError instead of RecursionError #2800

Merged

stefan6419846 closed this as completed in #2800 Aug 14, 2024

stefan6419846 closed this as completed in d9a8c54 Aug 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using PdfReader causes a crash #2761

Using PdfReader causes a crash #2761

Avgor46 commented Jul 19, 2024

stefan6419846 commented Jul 19, 2024

pubpub-zz commented Jul 19, 2024

stefan6419846 commented Jul 19, 2024

MartinThoma commented Jul 19, 2024

Avgor46 commented Jul 20, 2024

stefan6419846 commented Jul 21, 2024

MartinThoma commented Jul 21, 2024

MartinThoma commented Jul 21, 2024

MartinThoma commented Jul 21, 2024

Using PdfReader causes a crash #2761

Using PdfReader causes a crash #2761

Comments

Avgor46 commented Jul 19, 2024

Environment

Code + PDF

PoC

Traceback

stefan6419846 commented Jul 19, 2024

pubpub-zz commented Jul 19, 2024

stefan6419846 commented Jul 19, 2024

MartinThoma commented Jul 19, 2024

Avgor46 commented Jul 20, 2024

stefan6419846 commented Jul 21, 2024

MartinThoma commented Jul 21, 2024

MartinThoma commented Jul 21, 2024

MartinThoma commented Jul 21, 2024