UnicodeDecodeError when read() special characters - non-UTF-8 env [fixed] #14
Hi Yang Liu, this may be a Python / pandas question rather than a TagUI for Python question. When I run the code below (with the pandas code stripped out), it works even with ® in the output.
Modified code without pandas:
import tagui as t
t.init()
t.url("https://www.lazada.sg/")
t.url('https://www.lazada.sg/shop-traditional-laptops/')
for i in list(range(1,40)):
    path = '(//div[@class="c16H9d"]//a/text())[' + str(i) +']'
    name = t.read(path)
    path = '(//span[@class="c13VH6"]/text())[' + str(i) +']'
    price = t.read(path)
    print(name, price, i)
|
Adding on, I have tried your original code and it ran without errors. Forming a dataframe as in the code below also runs without issues for me. Can you try it and let me know where the error happens and what the error message is? If the error happens within Python, you can use a normal try-except. If the error happens within TagUI, I will need more replication details and an example of the error to see where to dig further.
import pandas as pd
import tagui as t
t.init()
t.url('https://www.lazada.sg/shop-traditional-laptops/')
result_list = []
for i in range(1,40):
    path = '(//div[@class="c16H9d"]//a/text())[' + str(i) +']'
    name = t.read(path)
    path = '(//span[@class="c13VH6"]/text())[' + str(i) +']'
    price = t.read(path)
    result_list.append({'index': i, 'name': name, 'price': price})
result_df = pd.DataFrame(result_list, columns = ['index', 'name', 'price'])
|
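For the Python-side try-except route, a minimal sketch could look like the following. `read_item` is a hypothetical stand-in for `t.read(path)` so that the snippet runs without a browser, and it assumes the failure surfaces as a `UnicodeDecodeError` inside the loop; the real error may be raised elsewhere.

```python
def read_item(i):
    """Hypothetical stand-in for t.read(path); item 3 simulates a bad read."""
    if i == 3:
        # 0xAE is not a valid UTF-8 start byte, so this raises UnicodeDecodeError
        return b"NVIDIA\xae".decode("utf-8")
    return "laptop %d" % i


def crawl(indices):
    """Read every item, recording failures instead of stopping at the first one."""
    results, failed = [], []
    for i in indices:
        try:
            name = read_item(i)
        except UnicodeDecodeError as err:
            failed.append((i, str(err)))  # keep the index so it can be retried
            continue
        results.append({"index": i, "name": name})
    return results, failed


results, failed = crawl(range(1, 6))
print(len(results), "items read; failed indices:", [i for i, _ in failed])
```

Keeping the failed indices in a list means a long crawl still reports exactly which products were skipped.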
The above was run in a Jupyter notebook. I also wasn't able to replicate the error by running your original script from the terminal with the python3 command.
import pandas as pd
df = pd.DataFrame(columns=['index','name','price'])
import tagui as t
t.init()
t.url("https://www.lazada.sg/")
t.url('https://www.lazada.sg/shop-traditional-laptops/')
df = df.iloc[0:0]
for i in list(range(1,40)):
    path = '(//div[@class="c16H9d"]//a/text())[' + str(i) +']'
    name = t.read(path)
    path = '(//span[@class="c13VH6"]/text())[' + str(i) +']'
    price = t.read(path)
    df.loc[i]=[i]+[name]+[price]
t.close()
Until you can share the error details and how to replicate the environment / error, some other ideas you can try is using |
Thanks Yang Liu, for the error details! 👍 The difference in our environments could be between Anaconda-installed Python and a direct installation, or between Windows and my laptop (macOS). Either way, it has to be resolved within TagUI for Python. I have an idea of where to tackle this, and will test and try to deploy a new package version that fixes it while remaining Python 2/3 compatible. |
This change explicitly specifies Unicode UTF-8 as the encoding for both Python 2 and 3, and for both read and write operations on files. More details at issue #14. This helps avoid potential issues when TagUI outputs UTF-8 encoded characters from the extended ASCII set (e.g. a webpage has special characters) but the user's Python locale preferred encoding is not UTF-8, resulting in errors in I/O between TagUI for Python and TagUI.
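To illustrate the kind of change this commit message describes, here is a minimal sketch (not the actual package code) of file I/O with the encoding stated explicitly. It uses `io.open`, which accepts an `encoding` argument on both Python 2 and 3; the file name and the helper names `write_text` and `read_text` are made up for the example.

```python
import io
import os
import shutil
import tempfile


def write_text(path, text):
    # Explicit encoding: does not depend on the locale's preferred encoding
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(text)


def read_text(path):
    with io.open(path, "r", encoding="utf-8") as f:
        return f.read()


tmp_dir = tempfile.mkdtemp()
try:
    path = os.path.join(tmp_dir, "example_output.txt")
    write_text(path, u"Windows 10 Home (64-bit) NVIDIA\u00ae")
    text = read_text(path)
finally:
    shutil.rmtree(tmp_dir)  # clean up the temporary directory

# Print a boolean rather than the text itself, so a non-UTF-8 console cannot fail here
print(text == u"Windows 10 Home (64-bit) NVIDIA\u00ae")
```

Without the `encoding` argument, Python 3's `open` falls back to the locale's preferred encoding, which is where a non-UTF-8 environment can diverge from what was written.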
Hi Yang Liu, can you try the new v1.4 version? TagUI writes in UTF-8, but your laptop's preferred encoding may be something else, so reading special characters from TagUI results in an error. This v1.4 version explicitly uses UTF-8 in both Python 2 and 3 environments, regardless of the user's local preferred encoding. Internal reference - made a commit with following comments
|
The problem can be solved practically by:
|
You just lose the characters that cannot be decoded. |
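The snippet from this comment is not preserved here. As a rough illustration only, assuming the workaround was along the lines of Python's `errors="ignore"` decoding option, the effect looks like this (the byte string is a made-up example of text encoded in something other than UTF-8):

```python
# "®" encoded as Latin-1 is the single byte 0xAE, which is not valid UTF-8 on its own
raw = u"NVIDIA\u00ae GeForce".encode("latin-1")

try:
    raw.decode("utf-8")  # strict decoding, the default
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors="ignore" silently drops any bytes that cannot be decoded
lossy = raw.decode("utf-8", errors="ignore")

print(strict_ok, lossy)  # False NVIDIA GeForce
```

The "®" disappears without any signal, which is the trade-off discussed in the replies.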
Cool stuff, thanks @convexset! This workaround is not something I can adopt for now because, at this stage, I need to address newly known gaps where possible. Losing characters or data without error feedback to the user can lead to situations where users run marathon jobs for more than 10 hours, only to find out much later that they missed some data and need to re-run. I'll check back with you later on to see if you can commit a PR for edge cases that don't have addressable solutions. |
Hi Ken, Sorry for the late reply as I was overseas and just came back to Singapore. Thank you for the fix, I've tested that it solves my issue : ) Best regards, |
Hi Yang Liu, no problem at all and welcome back to sunny island paradise :) Thanks for raising this and providing the replication details. This is an important issue to fix before the package gets used by more people, because it involves losing data and is a fundamental problem. If you encounter more edge-case issues let me know, and I'll see whether I can put in something to address them. When nothing more can be done, I'll invite @convexset to commit his solution to drop characters on an unrecoverable data read error, with a warning message. |
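A sketch of what dropping characters with a warning message could look like, so that a long-running job does not lose data silently. This is an assumption about the shape of such a fallback, not code from the package; `decode_with_warning` is a hypothetical helper.

```python
import warnings


def decode_with_warning(raw, encoding="utf-8"):
    """Decode strictly first; on failure, drop bad bytes and warn the caller."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as err:
        text = raw.decode(encoding, errors="ignore")
        warnings.warn(
            "dropped undecodable bytes near position %d: %s" % (err.start, err.reason)
        )
        return text


# Capture warnings here only to show that one is actually emitted
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    clean = decode_with_warning(b"price: \xa3100")

print(clean, "| warnings emitted:", len(caught))
```

Trying the strict decode first means well-formed data never goes through the lossy path, and the warning carries the byte position for later inspection.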
I still get the error when running sample.py. My environment is Chinese Windows 11 with Python 3.8.1. |
It occurs in r.init() |
Looks like there could be some characters that are not in UTF-8 format. Could you check whether another computer, or a friend's, can run this? Then compare the two PCs to see which setting enables UTF-8. This package has been used by many Chinese users, so there should be a solution. |
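One way to compare two machines is to print the encodings Python picks up on each. The sketch below uses only standard-library calls; `encoding_report` is a made-up helper name, and whether a difference here explains this particular error is an assumption to verify.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python uses by default in this environment."""
    return {
        "preferred": locale.getpreferredencoding(False),  # default for open() on Python 3
        "default": sys.getdefaultencoding(),
        "filesystem": sys.getfilesystemencoding(),
        "stdout": getattr(sys.stdout, "encoding", None),
    }


report = encoding_report()
for key in sorted(report):
    print(key, "=", report[key])
```

On Python 3.7 and later, setting the environment variable `PYTHONUTF8=1` (or running `python -X utf8`) turns on UTF-8 mode, which may be worth trying on a Windows setup whose locale encoding is not UTF-8.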
Hi Ken,
I've encountered an error while crawling the products from Lazada. The error is due to the "R" symbol inside a certain product name: Windows 10 Home (64-bit) NVIDIA®
Attaching the Python code below:
crawlProduct.txt
TagUI stopped once it hit the error, how can I handle the error so that it will continue to crawl the next product?
Best regards,
Yang Liu