[IO] Pdf File close after opening #1624
Conversation
Hi @justinjosephmkj 👋, thanks for the PR :) But I think that's already solved; could you please check if upgrading pypdfium2 helps? CC @mara004 |
@justinjosephmkj Which version of pypdfium2 are you using? That said, explicitly closing the PdfDocument might be advantageous, to immediately release the file handle, and should inherently prevent the message. However, I think it can be done more easily than this PR. Just wrap the current rendering code like so:

```python
try:
    return ...
finally:
    pdf.close()
```

Closing a parent object implicitly closes child objects (e.g. page handles) since v4.11, so this should be safe with respect to the order of closing. I will also re-introduce context manager functionality for this reason.
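For illustration only, here is a minimal sketch of how such a wrap could look in a PDF-reading helper built on pypdfium2; the function name, signature and rendering parameters are assumptions, not doctr's exact code:

```python
import pypdfium2 as pdfium

def read_pdf(path, scale=2):
    # Hypothetical helper: render every page of a PDF to a numpy array.
    pdf = pdfium.PdfDocument(path)
    try:
        # Do the rendering inside try so the finally block always runs.
        return [pdf[i].render(scale=scale).to_numpy() for i in range(len(pdf))]
    finally:
        # Explicitly close the document to release the file handle right away;
        # since pypdfium2 v4.11 this also closes remaining child objects (pages).
        pdf.close()
```

Once context manager support is available again, the same pattern could be written as a with-block around the PdfDocument instead of an explicit try/finally.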
|
(@felixdittrich92 Actually, I doubt whether updating pypdfium2 will help, because the reporter seems to be using a recent version of doctr already.) |
Hi @mara004, pypdfium2 version: 4.28.0.
Your suggested change also solves the issue. I am not using multiprocessing; I was iterating over all the files, but found a single file for which I was able to recreate the issue every time I ran the script. |
Please share this file, if possible – that would be a valuable testing item for us to investigate the issue.
You're encouraged to update to latest (v4.30), but I don't think it will resolve the issue. |
@justinjosephmkj Another question: can you confirm whether this is happening just before the process exits? |
Unfortunately I won't be able to share the file; I will try to see if I can find a file on the internet that causes this issue.
I think the same.
Yes, this happens right before the process exits for a single file. I am getting the output that I am looking for, but it ends with this error message. For a batch of files, the code breaks in between: it gives the output for the last opened pdf file, then this error shows up and the code breaks, which stops further execution of the rest of the batch. |
Sorry, I don't understand that yet. Can you re-phrase? What exactly is happening, and in which order? |
That said, I figured that, if you see this message only on process exit, it should be harmless. What would be bad is if you were getting this midway at runtime or something, because pdfium should stay alive until process exit. |
I have a script that processes hundreds of files one by one. However, the issue that I am encountering occurs randomly with one of these files, causing the script to halt prematurely. As a result, the remaining files in the batch are not processed. The code comes to a halt with this message:
|
@mara004 |
Hoom, that's strange, but thanks for clarifying. And the closing patch actually fixes this halt? Again, a reproducible example (code & pdf(s)) would be needed to investigate. @felixdittrich92 Did you ever encounter any such behavior as described by @justinjosephmkj ? |
Yes the closing patch fixed the issue. |
FWIW, whether this error is shown or not is controlled by a conditional variable |
Hi @justinjosephmkj @mara004 👋, so @justinjosephmkj, for the scope of this PR I would be fine with @mara004's suggestion.
@justinjosephmkj @mara004 I was not able to reproduce any mem leak, but the message appears in these cases:
case 1:
case 2:
Script to reproduce (doctr main branch): see the illustrative sketch below.
Memory usage was ~11-13 GB, no peaks (releasing / garbage collection looks correct). The data is attached :) |
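The actual reproduction script and data are not preserved in this thread. As a rough, purely illustrative sketch (not the original script; file names are placeholders), a loop of this shape exercises the doctr code path that opens PDFs via pypdfium2:

```python
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

# Placeholder file list; the original test data is not reproduced here.
pdf_paths = [f"doc_{i}.pdf" for i in range(1, 6)]

model = ocr_predictor(pretrained=True)

for path in pdf_paths:
    # DocumentFile.from_pdf renders each page to a numpy array via pypdfium2.
    pages = DocumentFile.from_pdf(path)
    result = model(pages)
    print(path, len(result.pages), "pages processed")
```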
With the code change the message disappeared; everything else was the same on my end. |
Firstly, thanks for looking into this @felixdittrich92.
As long as it's only right before exit, that should be harmless, as stated already in #1624 (comment) |
Ah, I just had a mismatched tensorflow version. Doing |
Still no luck with the script. I let it run for several minutes but it doesn't get past 0% ... |
I assume you are running on CPU ? ^^ |
Yes.
That worked, though it still took rather long, even with just the 1 file. However, I didn't get the error message..... |
Strange, I tested it again with each single file: no message. Tested with several bunches: no message. Tested again with all 5 files: the message appears xD @mara004
The message appears for each page in the documents now... It looks to me like it depends on the combination; for example, if I process only some of the files with this script, the message does not appear. |
I ran the script a couple of times, but again can't reproduce. Though, as the issue appears to be time / garbage collection dependent, that doesn't have to mean much... @felixdittrich92 Given that you can, at least sometimes, reproduce the issue, could you please prepend

```python
import pypdfium2.internal as pdfium_i
pdfium_i.DEBUG_AUTOCLOSE.value = True
```

and attach the output of a run that produced the error?
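For context, a minimal standalone run with that flag enabled could look like this (the input file is a placeholder); the flag is meant to make pypdfium2 log its automatic close calls:

```python
import pypdfium2 as pdfium
import pypdfium2.internal as pdfium_i

# Ask pypdfium2 to report when objects are closed automatically, so the
# order and timing of finalizer calls becomes visible in the output.
pdfium_i.DEBUG_AUTOCLOSE.value = True

pdf = pdfium.PdfDocument("sample.pdf")  # placeholder input file
bitmaps = [pdf[i].render(scale=2) for i in range(len(pdf))]
# No explicit close here, so the debug output shows what the finalizers
# clean up on garbage collection and at interpreter shutdown.
```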
|
Here you go :)
|
@mara004 yeah 😅 |
@felixdittrich92 Erm, concerning the output:
That's sort of what the finalizers do, but they're always active as a safety net, unless you close explicitly. |
|
Ah, I see. That's unfortunate, as what I am interested in is, of course, the error together with the debug info, because either alone is sort of useless. |
@felixdittrich92 @mara004 Thank you both for looking into this issue. I have updated pypdfium2 to the latest version; the message now
only shows up right before the process exits, but it appears for all pdf files.
The issue is not causing a memory leak as of now, but the message is still coming up. |
As you can see yourself, there is no related change between these versions. |
FWIW, I still do not get the message despite adding this, with Fedora-provided Python 3.11.6. That said, I looked into the weakref module again and found the finalizer's writable atexit attribute, which controls whether the callback is still invoked at interpreter shutdown:

```diff
diff --git a/src/pypdfium2/internal/bases.py b/src/pypdfium2/internal/bases.py
index af34c30a..4f14ce73 100644
--- a/src/pypdfium2/internal/bases.py
+++ b/src/pypdfium2/internal/bases.py
@@ -72,6 +72,7 @@ class AutoCloseable (AutoCastable):
         # NOTE this function captures the value of the `parent` property at finalizer installation time - if it changes, detach the old finalizer and create a new one
         assert self._finalizer is None
         self._finalizer = weakref.finalize(self._obj, _close_template, self._close_func, self.raw, repr(self), self._autoclose_state, self.parent, *self._ex_args, **self._ex_kwargs)
+        self._finalizer.atexit = False
 
     def _detach_finalizer(self):
         self._finalizer.detach()
```

Log trailer without patch
Log trailer with patch
^ The only problem I see with this patch is that, if you embed a Python interpreter in another process that keeps running after the interpreter has shut down, not invoking finalizers at interpreter shutdown would cause an actual memory leak. The same goes if you get the "cannot close object" warning in an embedded interpreter. |
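As a standalone illustration of the stdlib behavior the patch relies on (plain weakref, not pypdfium2 code): a finalizer created with weakref.finalize normally also runs at interpreter shutdown, unless its atexit flag is cleared.

```python
import weakref

class Resource:
    pass

def _cleanup(name):
    # Runs when the Resource is garbage-collected, or at interpreter
    # shutdown if the finalizer's atexit flag is still True.
    print(f"cleaning up {name}")

res = Resource()
fin = weakref.finalize(res, _cleanup, "res")

# With the default (atexit=True), _cleanup would also be invoked at
# interpreter exit if the object is still alive. Clearing the flag
# opts out of that shutdown-time call:
fin.atexit = False

# Explicit / GC-driven cleanup still works as usual:
del res  # _cleanup("res") runs here once the object is collected
```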
Applied your suggested change but this doesn't change anything - same messages (also with debug flag) as before :) |
That's very weird... |
Nope, I cloned your repo and installed it in editable mode to apply your suggestion :) |
@justinjosephmkj Could you update your PR with the suggestion from #1624 (comment), please? :) Also, change the pypdfium2 pin at Line 41 in 674a875
to "pypdfium2>=4.11.0,<5.0.0", and the same here: Line 28 in 674a875
Thanks :) |
This reverts commit b21db56. Revert details : mindee#1624 (comment)
@felixdittrich92 I have updated the code in the PR. |
@justinjosephmkj Could you please update the version identifier as mentioned in #1624 (comment) |
@felixdittrich92 Updated.
|
.conda/meta.yaml (outdated)

```diff
@@ -25,7 +25,7 @@ requirements:
     - pillow >=9.2.0
     - h5py >=3.1.0, <4.0.0
     - opencv >=4.5.0, <5.0.0
-    - pypdfium2-team::pypdfium2_helpers >=4.0.0, <5.0.0
+    - ypdfium2-team::pypdfium2_helpers >=4.0.0, <5.0.0
```
@justinjosephmkj It should be pypdfium2 (the "p" is missing) and 4.11.0 😅
pypdfium2-team::pypdfium2_helpers >=4.11.0, <5.0.0
@felixdittrich92 Done.
Revert "Pdf File close after opening" This reverts commit b21db56. Revert details : mindee#1624 (comment) Closing pdf after opening. Versions changes Versions changes Versions changes
9ae74dc
to
07583e4
Compare
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@ Coverage Diff @@
## main #1624 +/- ##
==========================================
+ Coverage 96.25% 96.42% +0.17%
==========================================
Files 163 164 +1
Lines 7707 7745 +38
==========================================
+ Hits 7418 7468 +50
+ Misses 289 277 -12
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry. |
Thanks for the Updates @justinjosephmkj 👍
I'm still confused about what may be the cause of this issue and why it only hits some setups. After all, pypdfium2 is simply a caller of atexit.register() / weakref.finalize(), and if that's just how these APIs end up (mis-)behaving for some users, I don't see what we could do about that from our side... Is there any chance this happens only with specific versions, or builds, of Python? |
@mara004 which OS do you use ? |
I'm on Fedora 37 with system-provided python 3.11.6. |
Ubuntu 22.04 LTS and python from conda. |
Hmm, tried python 3.11.9 from conda on Fedora and again couldn't reproduce. Also, pypdfium2's own test suite does not show the issue on GH actions. |
Strange 😅 |
Fix for issue #1623