tl;dr: Using PageObject.merge_page when one of the pages' content streams ends in Q causes this error message to pop up when the resulting PDF is opened in Adobe Reader:
For search engines: An error exists on this page. Acrobat may not display the page correctly. Please contact the person who created the PDF document to correct the problem.
Example "PDF with content stream ending in Q": blank_portrait.pdf. It's a single page, blank PDF. Opening in a text editor you can see the contents are a 3-byte stream consisting of q Q.
I use PyPDF along with ReportLab to generate PDFs and end up filling in forms by using the PageObject.merge_page to merge a page with text onto a blank form. I normally use Chrome's PDF viewer or MacOS's Preview to look at PDFs, so I didn't see this error for quite a while until I got feedback from people using Adobe Reader. I used poppler-utils's pdftotext as a guess to see if it might show more info about the error. Testing with a PDF with the error showed something like Syntax Error (415): Unknown operator 'QQ'.
It turns out that the base pdf (blank form) had a content stream ending in Q, as in it was of the format q <rest-of-content-stream> Q. After much digging I finally found this line https://github.com/py-pdf/pypdf/blob/main/pypdf/generic/_data_structures.py#L1259 that was adding the additional Q onto the end of the page's content stream when merging.
I think the solution is just to move the newline before the Q in that line of code. I'll have a PR up shortly, and I have a very basic repro script that should showcase the problem.
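To make the failure mode concrete, here is a minimal sketch of the byte concatenation involved, using plain Python with no pypdf import. The two helper names are hypothetical, and the "before" version is an assumption inferred from the proposed fix (the newline currently sits after the Q rather than before it); it is not copied from the pypdf source.

```python
def isolate_before_fix(data: bytes) -> bytes:
    # Assumed pre-fix behavior: "Q" is appended directly after the existing data,
    # so a stream that already ends in "Q" gets a fused "QQ" token.
    return b"q\n" + data + b"Q\n"


def isolate_after_fix(data: bytes) -> bytes:
    # Proposed fix: put the newline before the "Q" so it is always its own token.
    return b"q\n" + data + b"\nQ"


stream = b"q Q"  # the 3-byte content stream from blank_portrait.pdf
print(isolate_before_fix(stream))  # b'q\nq QQ\n'
print(isolate_after_fix(stream))   # b'q\nq Q\nQ'
```

With the newline moved, the stream tokenizes as q q Q Q, which is a balanced pair of save/restore operators instead of an unknown QQ operator.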
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
macOS-14.4.1-arm64-arm-64bit
# also Alpine Linux
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.1.0, crypt_provider=('cryptography', '42.0.5'), PIL=10.0.1
Code + PDF
This is a minimal, complete example that shows the issue:
from io import BytesIO

from pypdf import PdfWriter, PdfReader
from reportlab.pdfgen import canvas


def patch_isolate_graphics_state():
    from pypdf._utils import b_
    from pypdf.generic._data_structures import ContentStream

    def _isolate_graphics_state(self):
        if self._operations:
            self._operations.insert(0, ([], "q"))
            self._operations.append(([], "Q"))
        elif self._data:
            self._data = b"q\n" + b_(self._data) + b"\nQ"

    ContentStream.isolate_graphics_state = _isolate_graphics_state


# Uncomment this to get a fixed PDF
# patch_isolate_graphics_state()

# generate top page with text via reportlab
packet = BytesIO()
can = canvas.Canvas(packet)
for y in range(50, 750, 15):
    can.drawString(250, y, "Testing, testing, 1, 2, 3...")
can.save()
top_page = PdfReader(packet).pages[0]

# merge
blank_pdf = PdfReader(open("blank_portrait.pdf", "rb"))
bottom_page = blank_pdf.pages[0]
bottom_page.merge_page(top_page)

# write out
output = PdfWriter()
output.add_page(bottom_page)
with open("output.pdf", mode='wb') as outputStream:
    output.write(outputStream)
For some reason you have to print above a certain quantity of text on the PDF before Adobe Reader will complain about there being an error on the page, hence the many can.drawString calls. But the QQ operator will be in the output regardless, even if there is no text on the top PDF - so the stream is still not formatted properly, even though Adobe Reader doesn't pop up an error for a super minimal example.
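Since Adobe Reader only complains above some amount of text, a direct check of the stream bytes is a more reliable way to spot the problem. The following is a rough sketch, assuming you already have the decoded content stream as bytes (for example from something like page.get_contents().get_data() in pypdf). The helper name find_fused_restore_ops is hypothetical, and the whitespace tokenization is deliberately naive: it does not understand PDF string literals or inline images, so treat a hit as a hint rather than proof.

```python
def find_fused_restore_ops(stream: bytes) -> list[bytes]:
    # Split on ASCII whitespace and report tokens made only of repeated "Q",
    # such as b"QQ", which poppler reports as an unknown operator.
    return [
        tok for tok in stream.split()
        if len(tok) > 1 and set(tok) == {ord("Q")}
    ]


broken = b"q\nq QQ\n"   # what the merged stream looks like with the bug
fixed = b"q\nq Q\nQ"    # same stream with the newline placed before "Q"
print(find_fused_restore_ops(broken))  # [b'QQ']
print(find_fused_restore_ops(fixed))   # []
```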
This very basic blank PDF example was produced with MacOS's Preview in 2021, but I have a few others in the wild that were printed from Microsoft Word on MacOS in 2024 that also have the issue. I just can't share them here without quite a bit of cleanup. I've also seen the same issue with forms generated from printing to PDF with Chrome on MacOS.
I'll add the blank PDF here as a test case in my PR.
stefan6419846 added the is-bug and generic labels on Apr 7, 2024.
Input PDF file: blank_portrait.pdf
Output PDF file using latest PyPDF release: output.pdf
Output PDF file after fix: output.pdf