Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[IO] Pdf File close after opening #1624

Merged

Conversation

justinjosephmkj
Copy link
Contributor

Fix for the issue #1623

@felixdittrich92
Copy link
Contributor

Hi @justinjosephmkj 👋,

Thanks for the PR :)

But i think that's already solved could you please check if upgrading pypdfium2 solves your issue ?

pypdfium2-team/pypdfium2#281

CC @mara004

@felixdittrich92 felixdittrich92 marked this pull request as draft May 31, 2024 10:45
@mara004
Copy link
Contributor

mara004 commented May 31, 2024

@justinjosephmkj Which version of pypdfium2 are you using?
Is there any caller-side multiprocessing around DocumentFile.from_pdf()?

That said, explicitly closing the PdfDocument might be advantageous, to immediately release the file handle, and should inherently prevent the Cannot close object, library is destroyed issue, for whatever reason it may be happening here.

However, I think it can be done more easily than this PR. Just wrap the current return ... in

try:
    return ...
finally:
    pdf.close()

Closing a parent object implicitly closes child objects (e.g. page handles) since v4.11, so this should be safe with respect to the order of closing. I will also re-introduce context manager functionality for this reason.1

Footnotes

  1. At the time where closing order was unprotected, I decided not to provide the with-block because it tempted callers to close the PDF, but not its child objects. (When only using garbage collection, order is ensured automatically by weakref.finalize() API contract.) Anyway, this is no longer an issue.

@mara004
Copy link
Contributor

mara004 commented May 31, 2024

(@felixdittrich92 Actually, I doubt whether updating pypdfium2 will help, because the reporter says to be using doctr v0.8.1, which does linear rendering in the first place, so pypdfium2-team/pypdfium2#281 isn't really applicable, unless the caller is doing multiprocessing.)

@justinjosephmkj
Copy link
Contributor Author

Hi @justinjosephmkj 👋,

Thanks for the PR :)

But i think that's already solved could you please check if upgrading pypdfium2 solves your issue ?

pypdfium2-team/pypdfium2#281

CC @mara004

Hi,
I am using
Name: pypdfium2
Version: 4.28.0

@mara004 Version: 4.28.0,

    try:
        pdf = pdfium.PdfDocument(file, password=password, autoclose=True)
        return [page.render(scale=scale, rev_byteorder=rgb_mode, **kwargs).to_numpy() for page in pdf]
    finally:
        pdf.close()

This is also solving the issue.

I am not using multiprocessing; I was using iteration for all the files but found a single file for which I was able to recreate the issue every time I ran the script.

@mara004
Copy link
Contributor

mara004 commented May 31, 2024

I was using iteration for all the files but found a single file for which I was able to recreate the issue every time I ran the script.

Please share this file, if possible – that would be a valuable testing item for us to investigate the issue.
I'm afraid I can't debug this unless I'm able to reproduce the issue.

Version: 4.28.0

You're encouraged to update to latest (v4.30), but I don't think it will resolve the issue.

@mara004
Copy link
Contributor

mara004 commented May 31, 2024

@justinjosephmkj Another question: can you confirm whether this is happening just before the process exits?

@justinjosephmkj
Copy link
Contributor Author

@mara004

I was using iteration for all the files but found a single file for which I was able to recreate the issue every time I ran the script.

Please share this file, if possible – that would be a valuable testing item for us to investigate the issue. I'm afraid I can't debug this unless I'm able to reproduce the issue.

Unfortunately I wont be able to share the file, will try to see if I can find a file that might cause this issue from the internet.

Version: 4.28.0

You're encouraged to update to latest (v4.30), but I don't think it will resolve the issue.

I think the same

Another question: can you confirm whether this is happening just before the process exits?

Yes, this happens right before the process exits for a single file. I am getting the output that I am looking for but it ends with this error message. For a batch of files, the code breaks in between but its giving the output for the last opened pdf file and then this error shows up and the code breaks which stops further execution of the rest of the batch.

@mara004
Copy link
Contributor

mara004 commented Jun 1, 2024

For a batch of files, the code breaks in between but its giving the output for the last opened pdf file and then this error shows up and the code breaks which stops further execution of the rest of the batch.

Sorry, I don't understand that yet. Can you re-phrase? What exactly is happening, and in which order?
In particular, what precisely do you mean by the two mentions of "the code breaks" (e.g. what error do you see) ?

@mara004
Copy link
Contributor

mara004 commented Jun 1, 2024

That said, I figured that, if you see this message only on process exit, it should be harmless.
It would just mean that the exit handler to destroy pdfium hit while there were still objects that hadn't been garbage collected / finalized yet.
In that case, we're not allowed to call the object closing APIs anymore, but we also don't have to worry about the leak if the process is about to terminate anyway.
However, the strange thing is that I believed python would call exit handlers only after all garbage objects have been finalized, and I never had this error yet (nor most other pypdfium2 users, it seems).

What would be bad is if you were getting this midway at runtime or something, because pdfium should stay alive until process exit.

@justinjosephmkj
Copy link
Contributor Author

That said, I figured that, if you see this message only on process exit, it should be harmless. It would just mean that the exit handler to destroy pdfium hit while there were still objects that hadn't been garbage collected / finalized yet. In that case, we're not allowed to call the object closing APIs anymore, but we also don't have to worry about the leak if the process is about to terminate anyway. However, the strange thing is that I believed python would call exit handlers only after all garbage objects have been finalized, and I never had this error yet (nor most other pypdfium2 users, it seems).

What would be bad is if you were getting this midway at runtime or something, because pdfium should stay alive until process exit.

For a batch of files, the code breaks in between but its giving the output for the last opened pdf file and then this error shows up and the code breaks which stops further execution of the rest of the batch.

Sorry, I don't understand that yet. Can you re-phrase? What exactly is happening, and in which order? In particular, what precisely do you mean by the two mentions of "the code breaks" (e.g. what error do you see) ?

@mara004

I have a script that processes 100s of files one by one. However, the issue that I am encountering occurs randomly with one of these files, causing the script to halt prematurely. As a result, the remaining files in the batch are not processed.

The code comes to halt with this message

Cannot close object, library is destroyed

@justinjosephmkj
Copy link
Contributor Author

justinjosephmkj commented Jun 1, 2024

That said, I figured that, if you see this message only on process exit, it should be harmless. It would just mean that the exit handler to destroy pdfium hit while there were still objects that hadn't been garbage collected / finalized yet. In that case, we're not allowed to call the object closing APIs anymore, but we also don't have to worry about the leak if the process is about to terminate anyway. However, the strange thing is that I believed python would call exit handlers only after all garbage objects have been finalized, and I never had this error yet (nor most other pypdfium2 users, it seems).

What would be bad is if you were getting this midway at runtime or something, because pdfium should stay alive until process exit.

@mara004
I am using a loop to iteratively process multiple files. However, during runtime, the message 'Cannot close object, library is destroyed' appears, causing the script to halt prematurely. As a result, the remaining files in the batch are not processed.

@mara004
Copy link
Contributor

mara004 commented Jun 1, 2024

Hoom, that's strange, but thanks for clarifying. And the closing patch actually fixes this halt?

Again, a reproducible example (code & pdf(s)) would be needed to investigate.

@felixdittrich92 Did you ever encounter any such behavior as described by @justinjosephmkj ?

@justinjosephmkj
Copy link
Contributor Author

Hoom, that's strange, but thanks for clarifying. And the closing patch actually fixes this halt?

Again, a reproducible example (code & pdf(s)) would be needed to investigate.

@felixdittrich92 Did you ever encounter any such behavior as described by @justinjosephmkj ?

@mara004

Yes the closing patch fixed the issue.

@mara004
Copy link
Contributor

mara004 commented Jun 1, 2024

FWIW, whether this error is shown or not is controlled by a conditional variable LIBRARY_AVAILABLE, which is a mutable bool (ctypes.c_bool()).
This is only ever set at the following 3 places: 0, 1, 2.
I don't see how this could go wrong... – unless maybe something has messed with the memory storing the bool?

@felixdittrich92
Copy link
Contributor

felixdittrich92 commented Jun 2, 2024

Hi @justinjosephmkj @mara004 👋,

Soooo

@justinjosephmkj For the scope of this PR i would be fine with @mara004 suggestion:

pdf = pdfium.PdfDocument(file, password=password)
# Workaround ref.: https://github.com/mindee/doctr/issues/1623
try:
    return [page.render(scale=scale, rev_byteorder=rgb_mode, **kwargs).to_numpy() for page in pdf]
finally:
    pdf.close()

@justinjosephmkj @mara004 I was not able to reproduce any mem leak but the message appears if:

  1. the last file is finally processed
  2. trough a keyboard interrupt

case 2:

(doctr-dev) felix@felix-Z790-AORUS-MASTER:~/Desktop/doctr$ USE_TORCH=1 python3 /home/felix/Desktop/doctr/test_scripts/test_pdf.py
^CTraceback (most recent call last):
  File "/home/felix/Desktop/doctr/test_scripts/test_pdf.py", line 36, in <module>
    doc = DocumentFile.from_pdf(path)
  File "/home/felix/Desktop/doctr/doctr/io/reader.py", line 40, in from_pdf
    return read_pdf(file, **kwargs)
  File "/home/felix/Desktop/doctr/doctr/io/pdf.py", line 42, in read_pdf
    return [page.render(scale=scale, rev_byteorder=rgb_mode, **kwargs).to_numpy() for page in pdf]
  File "/home/felix/Desktop/doctr/doctr/io/pdf.py", line 42, in <listcomp>
    return [page.render(scale=scale, rev_byteorder=rgb_mode, **kwargs).to_numpy() for page in pdf]
  File "/home/felix/anaconda3/envs/doctr-dev/lib/python3.10/site-packages/pypdfium2/_helpers/page.py", line 428, in render
    pdfium_c.FPDF_RenderPageBitmap(*render_args)
KeyboardInterrupt
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!

case 1:

Processing: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:12<00:00,  2.46s/it]
-> Cannot close object, library is destroyed. This may cause a memory leak!

script to reproduce (doctr main branch):

import time

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

from tqdm import tqdm

paths = [
    "/home/felix/Desktop/2305.02577v1.pdf",
    "/home/felix/Desktop/602.pdf",
    "/home/felix/Desktop/1262.pdf",
    "/home/felix/Desktop/doctr_test_data/Versicherungsbedingungen-08-2021.pdf",
    "/home/felix/Desktop/doctr_test_data/vitstr_paper.pdf",
] * 1000

model = (
    ocr_predictor(
        det_arch="fast_base",
        reco_arch="parseq",
        pretrained=True,
        det_bs=8,
        reco_bs=1024,
    )
    .cuda()  # remove without GPU
    .half()  # remove without GPU
)

for path in tqdm(paths, desc="Processing", total=len(paths)):
    start = time.time()
    doc = DocumentFile.from_pdf(path)
    pdf_len = len(doc)
    result = model(doc)
    json_res = result.export()
    res_len = len(json_res["pages"])
    print(time.time() - start)
    assert pdf_len == res_len, f"pdf_len: {pdf_len}, res_len: {res_len}"

memory was ~11 - 13 GB usage - no peaks (releasing / garabage collection looks correct)

data is attached :)
data_share.zip

@felixdittrich92
Copy link
Contributor

with the code change the message disappeared everything else was the same on my end

@mara004
Copy link
Contributor

mara004 commented Jun 2, 2024

Firstly, thanks for looking into this @felixdittrich92.
Unfortunately, I didn't manage to get your sample running yet - fails with some attribute error (see attachment): log.txt
(I tried to use tensorflow rather than pytorch because the latter is so big with all its dependencies.)
A slightly more minimalistic example, using only pypdfium2 itself, would be great...

I was not able to reproduce any mem leak but the message appears if:

  • the last file is finally processed
  • trough a keyboard interrupt

As long as it's only on before exit, that should be harmless, as stated already in #1624 (comment)
However, that sounds like atexit.register() does not promise that exit handlers be called after garbage object finalizers.
Anyway, I would probably just leave things as-is in this regard, because it doesn't seem worth installing a global object weakref cache to walk through before destroying the library...

@mara004
Copy link
Contributor

mara004 commented Jun 2, 2024

Unfortunately, I didn't manage to get your sample running yet - fails with some attribute error

Ah, I just had a mismatched tensorflow version. Doing pip3 install -e .[tf] rather than pip3 install -e .; pip3 install tensorflow fixed it 😅 Retrying now...

@mara004
Copy link
Contributor

mara004 commented Jun 2, 2024

Still no luck with the script. I let it run for several minutes but it doesn't get past 0% ...

@felixdittrich92
Copy link
Contributor

Still no luck with the script. I let it run for several minutes but it doesn't get past 0% ...

I assume you are running on CPU ? ^^
You should be able to reproduce it also with only 1 file (for example 602.pdf has only 3 pages)

@mara004
Copy link
Contributor

mara004 commented Jun 2, 2024

I assume you are running on CPU ? ^^

Yes.

You should be able to reproduce it also with only 1 file (for example 602.pdf has only 3 pages)

That worked, though it still took rather long, even with just the 1 file. However, I didn't get the error message.....

@felixdittrich92
Copy link
Contributor

I assume you are running on CPU ? ^^

Yes.

You should be able to reproduce it also with only 1 file (for example 602.pdf has only 3 pages)

That worked, though it still took rather long, even with just the 1 file. However, I didn't get the error message.....

Strange, i tested it also again with each single file .. no message .. tested with several bunchs .. no message .. tested again with all 5 file .. message appears xD

@mara004
script only to test the pypdfium2 part:

import time

from tqdm import tqdm

from doctr.io import DocumentFile

paths = [
    "/home/felix/Desktop/2305.02577v1.pdf",
    "/home/felix/Desktop/602.pdf",
    "/home/felix/Desktop/1262.pdf",
    "/home/felix/Desktop/doctr_test_data/Versicherungsbedingungen-08-2021.pdf",
    "/home/felix/Desktop/doctr_test_data/vitstr_paper.pdf",
]

for path in tqdm(paths, desc="Processing", total=len(paths)):
    start = time.time()
    doc = DocumentFile.from_pdf(path)
    pdf_len = len(doc)
    print(len(doc))
    print(time.time() - start)

The message appears for each page in the documents now... looks to me like with autoclose=True the closing isn't awaited / or it's to fast for the pdfium backend -- @mara004 could this be possible ?
By using the ocr_predictor you have some delay that seems to be the reason why it doesn't appears all the time

for example if i process with this script only "/home/felix/Desktop/602.pdf", which has 4 pages the message appears 4 times ^^

@mara004
Copy link
Contributor

mara004 commented Jun 3, 2024

I ran the script a couple of times, but again can't reproduce. Though, as the issue appears to be time / garbage collection dependent, that doesn't have to mean much...

@felixdittrich92 Given that you can, at least sometimes, reproduce the issue, could you please prepend

import pypdfium2.internal as pdfium_i
pdfium_i.DEBUG_AUTOCLOSE.value = True

and attach the output of a run that produced the error?

The message appears for each page in the documents now... looks to me like with autoclose=True the closing isn't awaited / or it's to fast for the pdfium backend -- @mara004 could this be possible ?

autoclose unfortunately is misleading. It just means, if the input is a byte buffer, to close that buffer when the document is closed (whether automatically or explicitly). Rather obsolete when I come to think of it; perhaps I should just remove that param...

@felixdittrich92
Copy link
Contributor

I ran the script a couple of times, but again can't reproduce. Though, as the issue appears to be time / garbage collection dependent, that doesn't have to mean much...

@felixdittrich92 Given that you can, at least sometimes, reproduce the issue, could you please prepend

import pypdfium2.internal as pdfium_i
pdfium_i.DEBUG_AUTOCLOSE.value = True

and attach the output of a run that produced the error?

The message appears for each page in the documents now... looks to me like with autoclose=True the closing isn't awaited / or it's to fast for the pdfium backend -- @mara004 could this be possible ?

autoclose unfortunately is misleading. It just means, if the input is a byte buffer, to close that buffer when the document is closed (whether automatically or explicitly). Rather obsolete when I come to think of it; perhaps I should just remove that param...

Here you go :)

Processing:   0%|                                                                                                                                                                               | 0/5 [00:00<?, ?it/s]19
0.7914893627166748
Processing:  20%|█████████████████████████████████▍                                                                                                                                     | 1/5 [00:00<00:03,  1.26it/s]3
0.02114415168762207
6
0.036025047302246094
Close (auto) <PdfPage uuid:333e76e7>
Close (auto) <PdfPage uuid:4b3cc566>
Close (auto) <PdfPage uuid:b706c0f4>
Close (auto) <PdfPage uuid:f9093a3d>
Close (auto) <PdfPage uuid:8e615b76>
Close (auto) <PdfPage uuid:4dd695af>
Close (auto) <PdfPage uuid:233a80ea>
Close (auto) <PdfPage uuid:0de4c8e0>
Close (auto) <PdfPage uuid:8394319f>
Close (auto) <PdfPage uuid:d2deeca4>
Close (auto) <PdfPage uuid:01602212>
Close (auto) <PdfPage uuid:4ea048f9>
Close (auto) <PdfPage uuid:78a2e272>
Close (auto) <PdfPage uuid:6d8fe051>
Close (auto) <PdfPage uuid:067c7284>
Close (auto) <PdfPage uuid:51e1682b>
Close (auto) <PdfPage uuid:267dd5af>
Close (auto) <PdfPage uuid:901b5056>
Close (auto) <PdfPage uuid:25733567>
Close (auto) <PdfPage uuid:9437fe80>
Close (auto) <PdfPage uuid:1bc110fd>
Close (auto) <PdfPage uuid:0bd464e8>
Close (auto) <PdfPage uuid:53a3c703>
Close (auto) <PdfPage uuid:e5dab713>
Close (auto) <PdfPage uuid:95aeb5aa>
Close (auto) <PdfPage uuid:4afdcda9>
Close (auto) <PdfPage uuid:fac8b29a>
Close (auto) <PdfPage uuid:771244dc>
Close (auto) <PdfPage uuid:597d6e69>
Close (auto) <PdfPage uuid:8ea275cb>
Close (auto) <PdfPage uuid:688eb2f6>
Close (auto) <PdfPage uuid:4e298037>
Close (auto) <PdfPage uuid:f898e06c>
20
0.13844823837280273
Processing:  80%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                                 | 4/5 [00:00<00:00,  4.93it/s]16
0.09407329559326172
Processing: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:01<00:00,  4.62it/s]
Close (auto) <PdfPage uuid:e3c3896b>
Close (auto) <PdfPage uuid:29e9deff>
Close (auto) <PdfPage uuid:db94fb58>
Close (auto) <PdfPage uuid:1097da0b>
Close (auto) <PdfPage uuid:0ba7b27f>
Close (auto) <PdfPage uuid:c83f918b>
Close (auto) <PdfPage uuid:f4b33617>
Close (auto) <PdfPage uuid:0933d63b>
Close (auto) <PdfPage uuid:e737b1f5>
Close (auto) <PdfPage uuid:8e84a54e>
Close (auto) <PdfPage uuid:cb49229e>
Close (auto) <PdfPage uuid:83ed8321>
Close (auto) <PdfPage uuid:0476ffdf>
Close (auto) <PdfPage uuid:5ee10383>
Close (auto) <PdfPage uuid:0edbe852>
Close (auto) <PdfPage uuid:3c450b88>
Close (auto) <PdfPage uuid:bc3f33f0>
Close (auto) <PdfPage uuid:954516f5>
Close (auto) <PdfPage uuid:081b092d>
Close (auto) <PdfPage uuid:ddd88b13>
Close (auto) <PdfPage uuid:86c9d1f2>
Close (auto) <PdfPage uuid:c2c96fe6>
Close (auto) <PdfPage uuid:f2bb8611>
Close (auto) <PdfPage uuid:9bf25db7>
Close (auto) <PdfPage uuid:7da65012>
Close (auto) <PdfPage uuid:bb4188a4>
Close (auto) <PdfPage uuid:9ea12db3>
Close (auto) <PdfPage uuid:6203ce20>
Close (auto) <PdfPage uuid:c47cb474>
Close (auto) <PdfPage uuid:daafbdb6>
Close (auto) <PdfDocument uuid:60b0a002 from '/home/felix/Desktop/doctr_test_data/vitstr_paper.pdf'>
Close (auto) <PdfPage uuid:ba5a1396>
Close (auto) <PdfDocument uuid:f3566c65 from '/home/felix/Desktop/doctr_test_data/Versicherungsbedingungen-08-2021.pdf'>
Close (auto) <PdfDocument uuid:125811aa from '/home/felix/Desktop/1262.pdf'>
Close (auto) <PdfDocument uuid:d0aa77e1 from '/home/felix/Desktop/602.pdf'>
Close (auto) <PdfDocument uuid:4b5964a9 from '/home/felix/Desktop/2305.02577v1.pdf'>
Destroy PDFium (auto)



----------------------------------------------------------------------------------------------------------------------------------------------------------------------

(doctr-dev) felix@felix-Z790-AORUS-MASTER:~/Desktop/doctr$ USE_TORCH=1 python3 /home/felix/Desktop/doctr/test_scripts/test_pdf.py
Processing:   0%|                                                                                                                                                                                           | 0/5 [00:00<?, ?it/s]19
0.7971949577331543
Processing:  20%|███████████████████████████████████▊                                                                                                                                               | 1/5 [00:00<00:03,  1.25it/s]3
0.021633386611938477
6
0.03635716438293457
20
0.14321637153625488
Processing:  80%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                                   | 4/5 [00:00<00:00,  4.87it/s]16
0.09462189674377441
Processing: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:01<00:00,  4.57it/s]
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
(doctr-dev) felix@felix-Z790-AORUS-MASTER:~/Desktop/doctr$ 

@felixdittrich92
Copy link
Contributor

@mara004 yeah 😅
autoclose sounds to me like a context manager under the hood which takes the job to close the file

@mara004
Copy link
Contributor

mara004 commented Jun 3, 2024

@felixdittrich92 Erm, concerning the output:

  • This is from two separate invocations, right?
  • In the first one, you added the debug info, but the error did not appear.
  • In the second one, where the error appeared, it seems like you may have forgotten to add the DEBUG_AUTOCLOSE ? Or did you, but there were no debug messages?

autoclose sounds to me like a context manager under the hood which takes the job to close the file

That's sort of what the finalizers do, but they're always active as a safety net, unless you close explicitly.
I agree that naming was unfortunate. I will probably remove autoclose in v5 and just ask the caller to close the buffer directly rather than via PdfDocument param.

@felixdittrich92
Copy link
Contributor

@felixdittrich92 Erm, concerning the output:

  • This is from two separate invocations, right?
  • In the first one, you added the debug info, but the error did not appear.
  • In the second one, where the error appeared, it seems like you may have forgotten to add the DEBUG_AUTOCLOSE ? Or did you, but there were no debug messages?

autoclose sounds to me like a context manager under the hood which takes the job to close the file

That's sort of what the finalizers do, but they're always active as a safety net, unless you close explicitly. I agree that naming was unfortunate. I will probably remove autoclose in v5 and just ask the caller to close the buffer directly rather than via PdfDocument param.

  • yes (with the provided script the message appears every time) first - with debug flag / sec - without
  • correct with setting the debug flag the message doesn't pop up / without it does all the time 😅

@mara004
Copy link
Contributor

mara004 commented Jun 4, 2024

Ah, I see. That's unfortunate, as what I am interested in is, of course, the error with debug info, because both alone is sort of useless.
I am guessing this is again simply due to the non-deterministic nature of garbage collection.
Feel free to report back if anyone manages to get the error with debug info.

@justinjosephmkj
Copy link
Contributor Author

@felixdittrich92 @mara004 Thank you both for looking into this issue, I have updated Name: pypdfium2 to
Version: 4.30.0. Now the message

-> Cannot close object, library is destroyed. This may cause a memory leak!

only shows up right before the process exits.

This will give the message for all pdf files.

gc.set_threshold(10, 1, 1)

The issue not causing an memory leak as of now but the message is coming up.

@mara004
Copy link
Contributor

mara004 commented Jun 5, 2024

I have updated Name: pypdfium2 to Version: 4.30.0. Now the message [...] only shows up right before the process exits.

As you can see yourself, there is no related change between 4.28...4.30, so any difference in runtime closing behavior would be coincidence.

gc.set_threshold(10, 1, 1)

FWIW, I still do not get the message despite adding this, with Fedora-provided Python 3.11.6.


That said, I looked into the weakref module again and found the finalizer atexit attribute.
From theoretical considerations I think setting this to False should make the message disappear – can you test the following patch?

diff --git a/src/pypdfium2/internal/bases.py b/src/pypdfium2/internal/bases.py
index af34c30a..4f14ce73 100644
--- a/src/pypdfium2/internal/bases.py
+++ b/src/pypdfium2/internal/bases.py
@@ -72,6 +72,7 @@ class AutoCloseable (AutoCastable):
         # NOTE this function captures the value of the `parent` property at finalizer installation time - if it changes, detach the old finalizer and create a new one
         assert self._finalizer is None
         self._finalizer = weakref.finalize(self._obj, _close_template, self._close_func, self.raw, repr(self), self._autoclose_state, self.parent, *self._ex_args, **self._ex_kwargs)
+        self._finalizer.atexit = False
     
     def _detach_finalizer(self):
         self._finalizer.detach()
Log trailer without patch
Processing: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  1.89it/s]
Close (auto) <PdfPage uuid:59274ba0>
Close (auto) <PdfPage uuid:21e77cb9>
Close (auto) <PdfPage uuid:0b0d4032>
Close (auto) <PdfPage uuid:7d9abaa4>
Close (auto) <PdfPage uuid:1cad57bb>
Close (auto) <PdfPage uuid:82719a61>
Close (auto) <PdfPage uuid:2ad3461b>
Close (auto) <PdfPage uuid:8b6a231b>
Close (auto) <PdfPage uuid:90eac779>
Close (auto) <PdfPage uuid:14502067>
Close (auto) <PdfPage uuid:70236b43>
Close (auto) <PdfPage uuid:2c5202f4>
Close (auto) <PdfPage uuid:062634ce>
Close (auto) <PdfPage uuid:6fe20317>
Close (auto) <PdfPage uuid:c46c9f58>
Close (auto) <PdfPage uuid:2cf8174f>
Close (auto) <PdfDocument uuid:fd18cba8 from '/home/mara/tmp/doctr_test/data_share/vitstr_paper.pdf'>
Close (auto) <PdfPage uuid:4b707cdd>
Close (auto) <PdfPage uuid:0c33d589>
Close (auto) <PdfPage uuid:67fe8583>
Close (auto) <PdfPage uuid:f131e6de>
Close (auto) <PdfPage uuid:540d464e>
Close (auto) <PdfPage uuid:54fefd72>
Close (auto) <PdfPage uuid:c7df474b>
Close (auto) <PdfPage uuid:db291b39>
Close (auto) <PdfPage uuid:e84780a4>
Close (auto) <PdfPage uuid:53f3d845>
Close (auto) <PdfPage uuid:1941572b>
Close (auto) <PdfPage uuid:28d17c95>
Close (auto) <PdfPage uuid:69f0b448>
Close (auto) <PdfPage uuid:3b7b6d77>
Close (auto) <PdfPage uuid:e5bf0d5f>
Close (auto) <PdfPage uuid:d7a3fea0>
Close (auto) <PdfPage uuid:0edeba1e>
Close (auto) <PdfPage uuid:8161fbe8>
Close (auto) <PdfPage uuid:f8e338de>
Close (auto) <PdfPage uuid:b9483dbc>
Close (auto) <PdfDocument uuid:2a85bfa8 from '/home/mara/tmp/doctr_test/data_share/Versicherungsbedingungen-08-2021.pdf'>
Close (auto) <PdfPage uuid:3d209c98>
Close (auto) <PdfPage uuid:97fe884c>
Close (auto) <PdfPage uuid:00f9c604>
Close (auto) <PdfPage uuid:c7340a01>
Close (auto) <PdfPage uuid:083c8360>
Close (auto) <PdfDocument uuid:fc6ce085 from '/home/mara/tmp/doctr_test/data_share/1262.pdf'>
Close (auto) <PdfDocument uuid:24a33527 from '/home/mara/tmp/doctr_test/data_share/602.pdf'>
Close (auto) <PdfDocument uuid:9967944c from '/home/mara/tmp/doctr_test/data_share/2305.02577v1.pdf'>
Destroy PDFium (auto)
Log trailer with patch
Processing: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  1.92it/s]
Destroy PDFium (auto)

^ The only problem I see with this patch is, if you embed a Python interpreter in another process that keeps running after the interpreter, not invoking finalizers at interpreter shutdown would cause an actual memory leak. Same goes if you get the "cannot close object" warning in an embedded interpreter.
But as long as Python is hosted in an own process, it should be fine...

@felixdittrich92
Copy link
Contributor

self._finalizer.atexit = False
pypdfium2                 4.30.0+10.gff2ff4b.dirty          pypi_0    pypi

Applied your suggested change but this doesn't change anything - same messages (also with debug flag) as before :)

@mara004
Copy link
Contributor

mara004 commented Jun 6, 2024

Applied your suggested change but this doesn't change anything - same messages (also with debug flag) as before :)

That's very weird...
You don't have a conflicting build of pypdfium2 installed, by any chance? like from conda?

@felixdittrich92
Copy link
Contributor

felixdittrich92 commented Jun 7, 2024

Applied your suggested change but this doesn't change anything - same messages (also with debug flag) as before :)

That's very weird... You don't have a conflicting build of pypdfium2 installed, by any chance? like from conda?

Nope cloned your repo and installed it in editable mode to apply your suggestion :)

@felixdittrich92
Copy link
Contributor

@justinjosephmkj Could you update your PR with the suggestion from #1624 (comment) please ? :)
And update

"pypdfium2>=4.0.0,<5.0.0",

to "pypdfium2>=4.11.0,<5.0.0",
same here:
- pypdfium2-team::pypdfium2_helpers >=4.0.0, <5.0.0

Thanks :)

justinjosephmkj added a commit to justinjosephmkj/doctr that referenced this pull request Jun 9, 2024
@justinjosephmkj
Copy link
Contributor Author

justinjosephmkj commented Jun 9, 2024

@felixdittrich92 I have updated the code in the PR.

@felixdittrich92
Copy link
Contributor

@justinjosephmkj Could you please update the version identifier as mentioned in #1624 (comment)
also ? And remove the autoclose=True ? :)

@justinjosephmkj
Copy link
Contributor Author

@felixdittrich92 Updated.

@justinjosephmkj Could you please update the version identifier as mentioned in #1624 (comment) also ? And remove the autoclose=True ? :)

.conda/meta.yaml Outdated
@@ -25,7 +25,7 @@ requirements:
- pillow >=9.2.0
- h5py >=3.1.0, <4.0.0
- opencv >=4.5.0, <5.0.0
- pypdfium2-team::pypdfium2_helpers >=4.0.0, <5.0.0
- ypdfium2-team::pypdfium2_helpers >=4.0.0, <5.0.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@justinjosephmkj
pypdfium2 and 4.11.0 😅

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pypdfium2-team::pypdfium2_helpers >=4.11.0, <5.0.0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@felixdittrich92 felixdittrich92 self-assigned this Jun 9, 2024
@felixdittrich92 felixdittrich92 added type: bug Something isn't working topic: build Related to dependencies and build module: io Related to doctr.io labels Jun 9, 2024
@felixdittrich92 felixdittrich92 added this to the 0.9.0 milestone Jun 9, 2024
@felixdittrich92 felixdittrich92 marked this pull request as ready for review June 9, 2024 20:22
Revert "Pdf File close after opening"

This reverts commit b21db56.

Revert details : mindee#1624 (comment)

Closing pdf after opening.

Versions changes

Versions changes

Versions changes
Copy link

codecov bot commented Jun 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.42%. Comparing base (9c92f5c) to head (07583e4).
Report is 6 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1624      +/-   ##
==========================================
+ Coverage   96.25%   96.42%   +0.17%     
==========================================
  Files         163      164       +1     
  Lines        7707     7745      +38     
==========================================
+ Hits         7418     7468      +50     
+ Misses        289      277      -12     
Flag Coverage Δ
unittests 96.42% <100.00%> (+0.17%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@felixdittrich92 felixdittrich92 changed the title Pdf File close after opening [IO] Pdf File close after opening Jun 10, 2024
Copy link
Contributor

@felixdittrich92 felixdittrich92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the Updates @justinjosephmkj 👍

@felixdittrich92 felixdittrich92 merged commit 3f8bd01 into mindee:main Jun 10, 2024
81 checks passed
@mara004
Copy link
Contributor

mara004 commented Jun 12, 2024

I'm still confused about what may be the cause of this issue, why .atexit = False reportedly does not prevent it, and why I can't reproduce.

After all, pypdfium2 is simply a caller of atexit.register() / weakref.finalize(), and if that's just how these APIs end up (mis-)behaving for some users, I don't see what we could do about that from our side...

Is there any chance this happens only with specific versions, or builds, of Python?

@felixdittrich92
Copy link
Contributor

@mara004 which OS do you use ?
Mh tried it on python 3.10 and 3.11 same thing so i don't think so 😅

@mara004
Copy link
Contributor

mara004 commented Jun 12, 2024

which OS do you use ?

I'm on Fedora 37 with system-provided python 3.11.6.
Same question to you (which OS?) ;) And are you using a linux distro build of python, an official binary from python.org, or a conda-provided python?

@felixdittrich92
Copy link
Contributor

felixdittrich92 commented Jun 13, 2024

which OS do you use ?

I'm on Fedora 37 with system-provided python 3.11.6. Same question to you (which OS?) ;) And are you using a linux distro build of python, an official binary from python.org, or a conda-provided python?

Ubuntu 22.04 LTS and python from conda.
So in this case it would be close to test it in a container again ^^

@mara004
Copy link
Contributor

mara004 commented Jun 13, 2024

Hmm, tried python 3.11.9 from conda on Fedora and again couldn't reproduce.
I'll test on my friends' Ubuntu 20 and 22 devices at some point.

Also, pypdfium2's own test suite does not show the issue on GH actions.

@felixdittrich92
Copy link
Contributor

Hmm, tried python 3.11.9 from conda on Fedora and again couldn't reproduce. I'll test on my friends' Ubuntu 20 and 22 devices at some point.

Also, pypdfium2's own test suite does not show the issue on GH actions.

Strange 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module: io Related to doctr.io topic: build Related to dependencies and build type: bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

-> Cannot close object, library is destroyed. This may cause a memory leak!
3 participants