<class 'UnicodeDecodeError'> returned a result with an error set #144
Comments
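The problem can't be identified from this error screenshot. Please post a screenshot of the part of the traceback that shows where the error originally occurred.
For now it looks like a PDF format issue. My PDF is ANSI-encoded; converting it to UTF-8 works, but after the conversion the file no longer has any content.
The source of the problem is the fourth line from the bottom of the traceback: https://pymupdf.readthedocs.io/en/latest/functions.html#Page.get_texttrace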
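Since the traceback points at `Page.get_texttrace`, one way to narrow things down is to call it on every page and record which ones fail. The following is only a rough sketch: the helper name `find_failing_pages` and the stub classes are hypothetical, and it assumes the failure surfaces as `SystemError`, as the issue title suggests. With PyMuPDF installed, the pages would come from `fitz.open(path)`.

```python
def find_failing_pages(pages):
    """Return (index, error text) for each page whose get_texttrace() fails.

    `pages` is any iterable of objects with a get_texttrace() method,
    e.g. a PyMuPDF document:  find_failing_pages(fitz.open("input.pdf"))
    """
    failing = []
    for index, page in enumerate(pages):
        try:
            page.get_texttrace()
        except SystemError as exc:  # assumed error type, taken from the issue title
            failing.append((index, str(exc)))
    return failing


# Hypothetical stand-ins so the sketch runs without PyMuPDF or a PDF file.
class GoodPage:
    def get_texttrace(self):
        return [{"chars": []}]


class BadPage:
    def get_texttrace(self):
        raise SystemError("<class 'UnicodeDecodeError'> returned a result with an error set")


report = find_failing_pages([GoodPage(), BadPage(), GoodPage()])
for index, message in report:
    print(f"page {index}: {message}")
```

Page indices here are 0-based, matching how PyMuPDF numbers pages.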
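What do you mean by the PDF being ANSI-encoded, and how did you convert it to UTF-8?
I ran into a similar problem with a PDF file created by MS Office 2007.
Would you be able to upload your PDF, or send it to my email, so I can look for the cause? Thanks.
This looks very similar to my problem. It seems to be an encoding issue, and the conversion cannot run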
@ranger2001 Thanks for providing the test file. It is confirmed to be […] As a temporary workaround, you can use:

    try:
        spans = self.page_engine.get_texttrace()
    except SystemError:
        # logging.warning('Ignore hidden text checking due to UnicodeDecodeError in upstream library.')
        spans = []
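The same fallback can be written as a small standalone helper, which may be easier to try outside the library. This is a minimal sketch rather than the actual pdf2docx code: the function name `get_spans_or_empty` and the stub page class are hypothetical, and the only upstream call it relies on is `Page.get_texttrace()`.

```python
import logging


def get_spans_or_empty(page):
    """Return page.get_texttrace(), or an empty list if the call fails.

    An empty list means the hidden-text check is simply skipped for this page.
    """
    try:
        return page.get_texttrace()
    except SystemError as exc:
        logging.warning("Ignoring hidden text check for this page: %s", exc)
        return []


# Hypothetical stand-in for a page that triggers the upstream error.
class FailingPage:
    def get_texttrace(self):
        raise SystemError("<class 'UnicodeDecodeError'> returned a result with an error set")


spans = get_spans_or_empty(FailingPage())
print(len(spans))
```

Catching only `SystemError` keeps unrelated failures visible, at the cost of missing hidden text on the affected pages.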
Overall, hidden text does not need to be considered in the vast majority of cases, and …
Environment: Python 3.7, pip 22.1.2, pdf2docx 0.5.4, PyMuPDF 1.20.0, python-docx 0.8.11
Code to reproduce:
Error output: