<class 'UnicodeDecodeError'> returned a result with an error set #144

qq774724635 · 2022-06-24T08:26:58Z

Environment: Python 3.7, pip 22.1.2, pdf2docx 0.5.4, PyMuPDF 1.20.0, python-docx 0.8.11

Code to reproduce:

Error:

qq774724635 · 2022-06-24T08:28:31Z

Let me upload a few more screenshots.

dothinking · 2022-06-24T08:44:56Z

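This screenshot of the error doesn't reveal the problem. Please post a screenshot of the part of the traceback that shows where the error originally occurred.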

qq774724635 · 2022-06-24T08:46:33Z

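For now it looks like a problem with the PDF's encoding. My PDF is ANSI-encoded; converting it to UTF-8 works, but after the conversion there is no content left.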

qq774724635 · 2022-06-24T08:50:24Z

dothinking · 2022-06-24T09:06:34Z

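The source of the problem is the fourth line from the bottom: get_texttrace(). It is a method of PyMuPDF, the upstream PDF-processing library, so it is beyond what I can control.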

https://pymupdf.readthedocs.io/en/latest/functions.html#Page.get_texttrace
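To check whether the failure really comes from PyMuPDF rather than pdf2docx, one option is to call get_texttrace() on each page directly and record which pages raise. The following is only a rough sketch: `failing_pages` is a hypothetical helper name, and it assumes nothing about the pages beyond a `get_texttrace()` method that raises `SystemError` on failure, as seen in this thread.

```python
def failing_pages(pages):
    """Return (index, message) for every page whose get_texttrace() raises SystemError.

    `pages` is any iterable of objects exposing get_texttrace(),
    for example an opened PyMuPDF document.
    """
    failed = []
    for index, page in enumerate(pages):
        try:
            page.get_texttrace()
        except SystemError as exc:
            # Same exception type that the pdf2docx traceback reports.
            failed.append((index, str(exc)))
    return failed
```

With PyMuPDF installed, something like `failing_pages(fitz.open("sample.pdf"))` should work, since a PyMuPDF document can be iterated page by page; the file name here is a placeholder.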

dothinking · 2022-06-24T09:08:09Z

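For now it looks like a problem with the PDF's encoding. My PDF is ANSI-encoded; converting it to UTF-8 works, but after the conversion there is no content left.

What do you mean by the PDF being ANSI-encoded, and how did you convert it to UTF-8?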

ranger2001 · 2022-07-08T08:45:18Z

I ran into a similar problem with a PDF file created by MS Office 2007.

dothinking · 2022-07-21T15:29:38Z

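I ran into a similar problem with a PDF file created by MS Office 2007.

Would it be possible to upload your PDF or send it to my email so I can look for the cause? Thanks.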

ddzzhen · 2022-07-25T13:03:49Z

This looks very similar to my issue. It seems to be an encoding problem that prevents the get_texttrace function from running.
#155

ranger2001 · 2022-07-27T03:00:51Z

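I ran into a similar problem with a PDF file created by MS Office 2007.

Would it be possible to upload your PDF or send it to my email so I can look for the cause? Thanks.
I only just saw this; hope it helps.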
2.PDF

dothinking · 2022-07-30T10:36:46Z

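@ranger2001 Thanks for providing the test file.

I can confirm that the problem is in get_texttrace(), a method provided by PyMuPDF, the upstream PDF-processing library, which in turn wraps the corresponding function in MuPDF, one level further upstream. A complete fix has to wait for an official fix from them, which will probably take a long time.

As a temporary workaround, you can ignore the error with try-except. For example, go to line 70 of RawPageFitz.py (...\site-packages\pdf2docx\page\RawPageFitz.py) and wrap the get_texttrace() call: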

try:
    spans = self.page_engine.get_texttrace()
except SystemError:
    # logging.warning('Ignore hidden text checking due to UnicodeDecodeError in upstream library.')
    spans = []
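
If editing files under site-packages is not practical, a similar effect might be achieved by wrapping the method at runtime before running the conversion. This is only a sketch and not part of pdf2docx: `ignore_system_error` and `patch_get_texttrace` are hypothetical helper names, and the patch assumes that PyMuPDF is importable as `fitz` and that pdf2docx calls `get_texttrace()` on `fitz.Page` objects.

```python
import functools
import logging


def ignore_system_error(func, fallback=list):
    """Wrap func so that a SystemError returns fallback() instead of propagating."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except SystemError as exc:
            logging.warning("Ignoring hidden-text check: %s", exc)
            return fallback()
    return wrapper


def patch_get_texttrace():
    """Patch fitz.Page.get_texttrace in place; return False if PyMuPDF is missing."""
    try:
        import fitz  # PyMuPDF
    except ImportError:
        return False
    fitz.Page.get_texttrace = ignore_system_error(fitz.Page.get_texttrace)
    return True
```

Calling `patch_get_texttrace()` once, before creating the pdf2docx converter, would make every failing call return an empty list, which matches the `spans = []` fallback above. Other exception types still propagate.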

dothinking · 2022-07-30T10:50:05Z

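pdf2docx uses get_texttrace() to detect hidden text and then decides, as needed, whether to write it to the converted docx. For example, some scanned PDF books, especially older documents, hide an OCR text layer behind the scanned image layer so that the text can be copied and searched. In that case, setting the parameter ocr=0 or ocr=1 selects whether only the images or only the text is written to the docx (avoiding overlapping images and text); see #132.

Overall, hidden text does not need to be considered in the vast majority of cases, and get_texttrace() appears to have problems only with certain Chinese fonts, so the temporary fix above is suitable for most situations.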
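
To illustrate the hidden-text check, here is a minimal sketch of how spans might be separated into visible and hidden text. It is not the pdf2docx implementation. It assumes each span is a dict with a "type" key and that the value 3 marks hidden (ignored) text, which is my reading of the PyMuPDF documentation for Page.get_texttrace; `split_hidden` is a hypothetical helper name.

```python
HIDDEN_TEXT_TYPE = 3  # assumption: span type 3 means hidden ("ignore") text


def split_hidden(spans):
    """Return (visible, hidden) lists of span dicts based on their 'type' key."""
    visible, hidden = [], []
    for span in spans:
        if span.get("type") == HIDDEN_TEXT_TYPE:
            hidden.append(span)
        else:
            visible.append(span)
    return visible, hidden
```

With an empty span list, as produced by the try-except workaround, both results are empty, so no text is treated as hidden.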

dothinking added the input needed and upstream labels Jun 30, 2022

dothinking self-assigned this Jun 30, 2022

dothinking removed the input needed label Jul 30, 2022

dothinking added the bug label Jul 30, 2022

dothinking mentioned this issue Jul 30, 2022

UnicodeDecodeError problem when converting Chinese pdf #155

Closed

dothinking added a commit that referenced this issue Aug 11, 2022

workaround for UnicodeDecodeError issue: #144, #155

7208ce3

dothinking mentioned this issue Aug 11, 2022

读取表格时间异常且读取错误 A Runtime error occurred when reading the table #158

Closed

dothinking closed this as completed Jan 13, 2024

dothinking mentioned this issue Jan 19, 2024

SystemError: <built-in function Page_get_texttrace> returned a result with an error set #168

Closed
