-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MAINT: Simplify file identifiers generation #2003
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2003 +/- ##
==========================================
- Coverage 95.14% 95.13% -0.02%
==========================================
Files 51 51
Lines 8551 8551
Branches 1706 1707 +1
==========================================
- Hits 8136 8135 -1
Misses 261 261
- Partials 154 155 +1 ☔ View full report in Codecov by Sentry. |
return ByteStringObject(_rolling_checksum(stream).encode("utf8")) | ||
def _compute_document_identifier(self) -> ByteStringObject: | ||
md5 = hashlib.md5() | ||
md5.update(str(time.time()).encode("utf-8")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes document-generation non-deterministic, right?
9092a14
to
5fd1e91
Compare
See #2003 Co-authored-by: exiledkingcc <exiledkingcc@gmail.com>
See #2003 Co-authored-by: exiledkingcc <exiledkingcc@gmail.com>
#2003 Co-authored-by: exiledkingcc <exiledkingcc@gmail.com>
#2003 Co-authored-by: exiledkingcc <exiledkingcc@gmail.com>
@@ -1246,7 +1244,7 @@ def generate_file_identifiers(self) -> None: | |||
id2 = self._compute_document_identifier() | |||
else: | |||
id1 = self._compute_document_identifier() | |||
id2 = id1 | |||
id2 = ByteStringObject(id1.original_bytes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
id1 is a ByteStringObject already. So .original_bytes
just returns id1. Then wrapping it in ByteStringObject
doesn't do anything, right?
return ByteStringObject(_rolling_checksum(stream).encode("utf8")) | ||
md5 = hashlib.md5() | ||
md5.update(str(time.time()).encode("utf-8")) | ||
md5.update(str(self.fileobj).encode("utf-8")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is self.fileobj equivalent to self._write_pdf_structure(stream)?
The PDF standard says:
the identifiers are also be used for encryption.
Why this PR improves pypdf:
PdfWriter.encrypt
is called, the identifiers are genearated by uncrypted pdf stream,then
PdfWriter.write
called, the content of pdf file is encrypted, so the hash changed.for encrypted pdf, identifiers must be generated before write to stream, since the identifier will be used to calculate the key,
so the identifiers cannot be the hash of pdf stream content.