BUG: Ignore UTF-8 decode errors #1865

talibhmukadam · 2023-06-01T09:18:14Z

Problem
Some PDFs contain Latin characters, and reading them with pypdf raises the following exception:

    text = page.extract_text()
  File "/Users/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pypdf/_page.py", line 1851, in extract_text
    return self._extract_text(
  File "/Users/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pypdf/_page.py", line 1356, in _extract_text
    content = ContentStream(content, pdf, "bytes")
  File "/Users/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pypdf/generic/_data_structures.py", line 877, in __init__
    self.__parse_content_stream(stream_bytes)
  File "/Users/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pypdf/generic/_data_structures.py", line 943, in __parse_content_stream
    operands.append(read_object(stream, None, self.forced_encoding))
  File "/Users/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pypdf/generic/_data_structures.py", line 1053, in read_object
    return read_string_from_stream(stream, forced_encoding)
  File "/Users/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pypdf/generic/_utils.py", line 107, in read_string_from_stream
    msg = rf"Unexpected escaped string: {tok.decode('utf8')}"
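
For illustration, the failure on the last traceback line can be reproduced with the standard library alone. This is a minimal sketch, not the merged diff (which is not shown in this thread): `describe_escape` is a hypothetical helper that only mirrors the message format from the traceback, and the byte `0x83` is taken from the error reported in the comments below. It assumes the fix amounts to passing a lenient `errors` handler to `bytes.decode`.

```python
# A lone 0x83 byte is a UTF-8 continuation byte, so it cannot start a character.
bad = b"\x83"

# Strict decoding (the default) raises UnicodeDecodeError.
try:
    bad.decode("utf8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = str(exc)

# errors="ignore" silently drops undecodable bytes.
ignored = bad.decode("utf-8", errors="ignore")

# errors="replace" substitutes U+FFFD, which keeps a visible marker in the message.
replaced = bad.decode("utf-8", errors="replace")


def describe_escape(tok: bytes) -> str:
    """Hypothetical helper: build the warning text without raising on bad bytes."""
    return f"Unexpected escaped string: {tok.decode('utf-8', errors='ignore')}"


print(strict_error)
print(repr(ignored), repr(replaced))
print(describe_escape(bad))
```

With `errors="ignore"` the warning message can lose the offending byte entirely; `errors="replace"` is an alternative if a visible placeholder is preferred.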

codecov · 2023-06-01T09:29:34Z

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (81a58da) 93.42% compared to head (ed2319e) 93.42%.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1865   +/-   ##
=======================================
  Coverage   93.42%   93.42%           
=======================================
  Files          34       34           
  Lines        6634     6634           
  Branches     1303     1303           
=======================================
  Hits         6198     6198           
  Misses        284      284           
  Partials      152      152

Impacted Files	Coverage Δ
pypdf/generic/_utils.py	`100.00% <100.00%> (ø)`


tasfiqul-ghani

I'm facing this issue: 'utf-8' codec can't decode byte 0x83 in position 0: invalid start byte
This PR will fix the issue.

tasfiqul-ghani · 2023-06-01T10:18:04Z

@MartinThoma Please merge the PR. It will fix a major issue. For some PDFs we are getting this error:
'utf-8' codec can't decode byte 0x83 in position 0: invalid start byte

pubpub-zz · 2023-06-01T17:21:20Z

@tasfiqul-ghani
Can you share a PDF that shows the issue?


MartinThoma · 2023-06-03T07:11:35Z

Thank you for your contribution @talibhmukadam 🙏

If you want, I can add you to https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html :-)

pubpub-zz · 2023-06-03T07:15:29Z

@talibhmukadam / @tasfiqul-ghani
I'm still interested in a sample file. I would like to check that nothing else is hidden behind this 🤔

talibhmukadam · 2023-06-03T08:48:39Z

@MartinThoma , @pubpub-zz thank you guys for reviewing and merging the fix so fast. If I could ask, what would be a possible ETA to release the new version of pypdf with this fix? I am sorry to rush you but it is blocking us from releasing a new feature 😢

@MartinThoma , Yes, please feel free to add me to the contributor's list. 😄

@pubpub-zz , I would love to share the PDF file, but the few PDF files that produced the errors are resumes containing PII, which my organization won't allow me to share. I hope you understand.
If I can reproduce this error on any other PDF file, I will definitely share it with you.

MartinThoma · 2023-06-04T18:51:17Z

I'm creating a release at the moment. It will be on PyPI in less than 2 hours.

However, if you want to make sure that the fix stays in pypdf, we need to get a sample file. Otherwise it could happen in future that another change breaks it again (but I also understand the PII restrictions 😢 )

MartinThoma · 2023-06-04T18:51:57Z

> Yes, please feel free to add me to the contributor's list. 😄

Which name should I use and should I link to some profile (e.g. your Github profile?)

Deprecations (DEP)
- Deprecate PdfMerger (#1866)

Bug Fixes (BUG)
- Ignore UTF-8 decode errors (#1865)

Robustness (ROB)
- Handle missing /Type entry in Page tree (#1859)

[Full Changelog](3.9.0...3.9.1)

pubpub-zz · 2023-06-04T18:58:11Z

@talibhmukadam / @tasfiqul-ghani
Would you agree to share the file privately? If so, please email it to @MartinThoma (info@martin-thoma.de)

ignore decode errors

4edb954

talibhmukadam changed the title ~~ignore decode errors~~ BUG : Ignore UTF-8 decode errors Jun 1, 2023

tasfiqul-ghani approved these changes Jun 1, 2023


pubpub-zz approved these changes Jun 1, 2023



Update pypdf/generic/_utils.py

3ff63fc

Co-authored-by: pubpub-zz <4083478+pubpub-zz@users.noreply.github.com>

MartinThoma changed the title ~~BUG : Ignore UTF-8 decode errors~~ BUG: Ignore UTF-8 decode errors Jun 3, 2023

Merge branch 'main' into main

ed2319e

MartinThoma added the labels soon (PRs that are almost ready to be merged, issues that get solved pretty soon) and is-bug (From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF) Jun 3, 2023

MartinThoma merged commit 686adec into py-pdf:main Jun 3, 2023
