PERF: Use bytearray instead of b"" in encode_pdfdocencoding #2325

zuypt · 2023-12-04T04:27:30Z

Since bytes objects (b"") are immutable, each += in the for loop makes Python allocate a new object and copy the data accumulated so far. This causes a hang or a very long runtime when handling very large strings, for example when using add_js to add a very large piece of JavaScript code.
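To make the pattern concrete, here is a minimal sketch of a per-character encoder written both ways. The table `_CHAR_TO_BYTE` and both function names are hypothetical stand-ins for illustration; this is not pypdf's actual `encode_pdfdocencoding` code, only the same loop shape.

```python
from typing import Dict

# Hypothetical reverse table (ASCII only), standing in for a real
# character-to-byte mapping. Not pypdf's actual table.
_CHAR_TO_BYTE: Dict[str, int] = {chr(i): i for i in range(128)}


def encode_with_bytes(text: str) -> bytes:
    """Slow pattern: every += builds a new bytes object and copies the old one."""
    out = b""
    for ch in text:
        out += bytes((_CHAR_TO_BYTE[ch],))
    return out


def encode_with_bytearray(text: str) -> bytes:
    """Faster pattern: a bytearray grows in place, so appends are cheap."""
    out = bytearray()
    for ch in text:
        out.append(_CHAR_TO_BYTE[ch])
    return bytes(out)  # convert once at the end
```

Both functions return the same bytes; only the cost of building the result differs as the input grows.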

codecov · 2023-12-04T04:33:17Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (40e25ec) 94.37% compared to head (e3ec6cc) 94.37%.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #2325   +/-   ##
=======================================
  Coverage   94.37%   94.37%           
=======================================
  Files          43       43           
  Lines        7660     7660           
  Branches     1515     1515           
=======================================
  Hits         7229     7229           
  Misses        267      267           
  Partials      164      164


stefan6419846 · 2023-12-04T07:09:28Z

Could you please update the title to use the recommended naming scheme? https://pypdf.readthedocs.io/en/latest/dev/intro.html#commit-messages

MartinThoma · 2023-12-04T10:10:02Z

@zuypt Do you have an example that shows the difference? (It could be a toy-example - I'm just curious :-) )

zuypt · 2023-12-04T15:47:41Z

> Could you please update the title to use the recommended naming scheme? https://pypdf.readthedocs.io/en/latest/dev/intro.html#commit-messages

I'm too lazy; if someone has permission, please help.

zuypt · 2023-12-04T15:48:22Z

> @zuypt Do you have an example that shows the difference? (It could be a toy-example - I'm just curious :-) )

Just create a PdfWriter, then call add_js with a very large string and you will see. This is a pretty common Python programming error.
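A rough sketch of that reproduction might look like the following. The helper `time_add_js` and the filler script are made up for illustration, and the sketch assumes the encoding cost may only show up when the document is written, so it times `add_js` and `write` together. It returns `None` if pypdf is not installed.

```python
import io
import time
from typing import Optional


def time_add_js(n_chars: int) -> Optional[float]:
    """Time adding a large JavaScript string to a PDF (hypothetical helper)."""
    try:
        from pypdf import PdfWriter
    except ImportError:
        return None  # pypdf not available; nothing to measure

    # Filler script of roughly n_chars characters (illustrative only).
    script = "var x = 1;\n" * max(1, n_chars // 11)

    writer = PdfWriter()
    writer.add_blank_page(width=72, height=72)

    start = time.perf_counter()
    writer.add_js(script)
    writer.write(io.BytesIO())  # serialize in memory so strings get encoded
    return time.perf_counter() - start


if __name__ == "__main__":
    for size in (10_000, 100_000, 1_000_000):
        print(size, time_add_js(size))
```

Running it on a pypdf version from before and after this change, with increasing sizes, should make the difference visible if the description above is accurate.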

MartinThoma · 2023-12-04T19:58:38Z

I've already adjusted the title

MartinThoma · 2023-12-04T20:04:13Z

import timeit

def benchmark_bytes_concat():
    # Immutable bytes: each += creates a new object and copies the old data.
    result = b""
    for _ in range(100000):
        result += b"a"

def benchmark_bytearray_concat():
    # Mutable bytearray: += extends the existing buffer in place.
    result = bytearray()
    for _ in range(100000):
        result += b"a"

if __name__ == "__main__":
    # Total seconds for 100 runs of each variant.
    bytes_time = timeit.timeit(benchmark_bytes_concat, number=100)
    bytearray_time = timeit.timeit(benchmark_bytearray_concat, number=100)

    print(f"Empty Bytes Literal Time: {bytes_time:.1f}")
    print(f"bytearray Time: {bytearray_time:.1f}")

shows:

Empty Bytes Literal Time: 21.4
bytearray Time: 0.5
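As a follow-up sketch, the same comparison can be run at several sizes to see how each approach scales, with `b"".join` added as another common idiom. The function names here are illustrative, and no timings are claimed because they depend on the machine and the Python version.

```python
import timeit
from typing import Callable, Dict


def build_bytes(n: int) -> bytes:
    result = b""
    for _ in range(n):
        result += b"a"  # copies the accumulated data on every iteration
    return result


def build_bytearray(n: int) -> bytes:
    result = bytearray()
    for _ in range(n):
        result += b"a"  # extends the buffer in place
    return bytes(result)


def build_join(n: int) -> bytes:
    # Collect the pieces first, then concatenate once.
    return b"".join(b"a" for _ in range(n))


def time_builders(n: int, repeat: int = 3) -> Dict[str, float]:
    """Return the best-of-`repeat` time in seconds for each builder at size n."""
    builders: Dict[str, Callable[[int], bytes]] = {
        "bytes +=": build_bytes,
        "bytearray +=": build_bytearray,
        'b"".join': build_join,
    }
    return {
        name: min(timeit.repeat(lambda f=func: f(n), number=1, repeat=repeat))
        for name, func in builders.items()
    }


if __name__ == "__main__":
    for size in (50_000, 100_000, 200_000):
        print(size, time_builders(size))
```

If the `bytes +=` time grows much faster than the input size while the other two grow roughly in proportion, that matches the repeated-copy explanation given in the PR description.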

MartinThoma · 2023-12-04T20:11:52Z

@zuypt Thanks for your contribution! If you want, I can add you to https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html

MartinThoma · 2023-12-04T20:12:02Z

It will be part of the next release on Sunday.

zuypt · 2023-12-06T08:59:23Z

> @zuypt Thanks for your contribution! If you want, I can add you to https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html

Sure. Thanks for the recognition.

@pubpub-zz

## What's new

### Bug Fixes (BUG)
- Cope with deflated images with CMYK Black Only (#2322) by @pubpub-zz
- Handle indirect objects as parameters for CCITTFaxDecode (#2307) by @stefan6419846
- check words length in _cmap type1_alternative function (#2310) by @Takher

### Robustness (ROB)
- Relax flate decoding for too many lookup values (#2331) by @stefan6419846
- Let _build_destination skip in case of missing /D key (#2018) by @nickryand

### Documentation (DOC)
- Note in reading form data (#2338) by @MartinThoma
- Pull Request prefixes and size by @MartinThoma
- Add https://github.com/zuypt for #2325 as a contributor by @MartinThoma
- Fix docstring for RunLengthDecode.decode (#2302) by @stefan6419846

### Maintenance (MAINT)
- Enable `disallow_any_generics` and add missing generics (#2278) by @nilehmann

### Testing (TST)
- Centralize file downloads (#2324) by @MartinThoma

### Code Style (STY)
- Fix typo "steam" → "stream" (#2327) by @stefan6419846
- Run black by @MartinThoma
- Make Traceback in bug report template uppercase (#2304) by @stefan6419846

[Full Changelog](3.17.1...3.17.2)

Update _base.py

e3ec6cc


MartinThoma changed the title ~~Update _base.py~~ PERF: Update _base.py Dec 4, 2023

MartinThoma changed the title ~~PERF: Update _base.py~~ PERF: Use bytearray instead of b"" in encode_pdfdocencoding Dec 4, 2023

MartinThoma merged commit 6cb5343 into py-pdf:main Dec 4, 2023
14 checks passed

MartinThoma added a commit that referenced this pull request Dec 6, 2023

DOC: Add https://github.com/zuypt for #2325 as a contributor

98e476a
