UnicodeDecodeError when read() special characters - non-UTF-8 env [fixed] #14

Closed
yangliu8912 opened this issue Jun 22, 2019 · 14 comments

@yangliu8912

Hi Ken,

I've encountered an error while crawling the products from Lazada. The error is due to the "R" symbol inside a certain product name: Windows 10 Home (64-bit) NVIDIA®

Attaching the Python code below:
crawlProduct.txt

TagUI stopped once it hit the error. How can I handle the error so that it continues to crawl the next product?

Best regards,
Yang Liu

@kensoh
Member

kensoh commented Jun 22, 2019

Hi Yang Liu, this may be a Python / pandas question rather than a TagUI for Python question. When I ran the code below (with the pandas code stripped out), it worked even with ® in the output.

modified code without pandas

import tagui as t
t.init()
t.url("https://www.lazada.sg/")
t.url('https://www.lazada.sg/shop-traditional-laptops/')

for i in list(range(1,40)):
    path = '(//div[@class="c16H9d"]//a/text())[' + str(i) +']'
    name = t.read(path)
    path = '(//span[@class="c13VH6"]/text())[' + str(i) +']'
    price = t.read(path)
    print(name, price, i)

output

HP Notebook - 14-cm0109au / 4GB / 128 SSD / 14" / WIN10 $549.00 1
Acer Swift 3 SF314-54 Thin and Light Narrow Border Design Laptop - 8th Generation i5 Processor $965.00 2
ASUS ZenBook 14 UX431FA-AM055T (Utopia Blue) $1,298.00 3
Laptop Ultra-thin Laptop 15.6 Inch Business Portable Laptop Durable Windows 10 4GB+64GB  Slim Black Notebook Computer Office Laptop (Wireless mouse for gift) $259.00 4
KawhiMall Ultrathin 14 inch Laptop Z8350 computer 2GB/4GB RAM+32GB/64GB ROM notebookQuad core 1.44GHZ 1366*768 TN wifi Bluetooth $209.99 5
NEW MICROSOFT SURFACE LAPTOP 2 BUSINESS- Intel i7 8GB RAM 256GB SSD in Matte Black $2,066.00 6
(Refurbished) Lenovo ThinkPad 11E 11.6" 11e Chromebook /Quad-Core CPU /4 GB RAM /16 GB SSD /Intel HD Graphics $218.00 7
Certified Refurbished Lenovo Thinkpad T430 Laptop (Core i5 3rd Gen/4 GB/320 GB/Windows 10 Pro) $327.99 8
HP Notebook - 14s-cf1020tx/ i5-8265U/ 8 GB RAM/ 1 TB 5400 rpm SATA/ AMD Radeon™ 530 Graphics/WIN 10 $949.00 9
[Online Exclusive] Acer Swift 3 SF314-54 Thin & Light Narrow Border Laptop - Latest Model $766.00 10
(Refurbished) Lenovo ThinkPad Yoga 260 12.5-inch - i5 6th Gen 8GB Ram 192GB SSD $489.90 11
VOYO i8 Max Tablet PC MTK X20 Deca-core 4GB RAM 64GB ROM 10.1" 1920*1200 IPS Android 7.0 LTE WCDMA Tablet with WiFi Bluetooth - intl $267.00 12
[Laptop] Microsoft New Surface Laptop 2 $1,528.00 13
HP Pavilion - 14-ce0090tx Intel i5-8250U Nvidia MX150 8GB DDR4 RAM 1TB Hard drive 128GB SSD $849.00 14
Acer Swift 3 SF314-56G-71GP(Blue) 14"/ i7-8565U/ NVIDIA GEFORCE MX250/ 8GB DDR4 RAM/ 256GB SSD + 1TB HDD $1,466.00 15
HP ELITEBOOK 840 G5 i5 (3UP98PA) $2,408.00 16
Dell Latitude E7450 14inch Business Laptop Computer, Intel Core i7-5600U #2.6Ghz 8GB RAM, 256GB SSD, 802.11ac, Bluetooth, HDMI, USB 3.0, Windows 10 Professional Refurbished $488.00 17
GOOD Ultra-thin Laptop PC 14.1-inch Netbook 1366*768P Display pixel 4GB+64GB $269.99 18
[Refurbished] Lenovo ThinkPad X240 Ultrabook / Intel Core i5 / 4GB RAM / 128GB SSD / Windows 10 Pro Laptop (Black) $259.00 19
Dell XPS 13 9380 -i5 8GB 256 | Stunning inside and out $1,899.00 20
CELE Tablet 10.1 inch Tablet 4GB RAM 64GB ROM for Android 7.0 Phablet Tablet Pc $90.53 21
Acer Spin 3 SP314-52-57FR Convertible Laptop (Grey) $1,198.00 22
[GSS] Microsoft Surface Laptop 2 Platinum i5/8gb/256gb $1,728.00 23
[NEW ARRIVAL] Acer Aspire 3 A314-32 14-Inch Laptop $568.00 24
YEPO 737A6 15.6-Inch 1080P HD Notebook 6+64G Gaming Working Laptop For Windows10 0.3MP Camera Notebook Computer $343.53 25
VOYO i8 Max Tablet PC MTK X20 Deca-core 4GB RAM 64GB ROM 10.1" 1920*1200 IPS Android 7.0 LTE WCDMA Tablet with WiFi Bluetooth $247.00 26
Lenovo Ideapad S130 -11.6 HD -N4000 -4GB DDR4 -32G EMMC $399.00 27
HP Laptop 14s-df0004TU/ i3-8130U/ 4 GB RAM/ 128 GB SSD/ WIN 10 $799.00 28
KawhiMall 14 inch Laptop Computer  Intel Cherry trail Z8350 Quad-Core 1.92GHz Notebook Durable LPDDR3 2GB EMMC 32GB With wifi bluetooth Exquisite Gadget Working Supply Learning Tools $219.00 29
Dell Latitude E7450 14inch Business Laptop Computer, Intel Core i7-5600U #2.6Ghz 16GB RAM, 256GB SSD, 802.11ac, Bluetooth, HDMI, USB 3.0, Windows 10 Professional Refurbished $538.00 30
[Promotion] 14.1 Inch EZbook 2 Notebook 1920x1080 FHD 4GB+64GB Laptop Computer EU Charger $288.88 31
original Jumper EZbook X4 Notebook 14.0 inch Windows 10 Home Version Intel Celeron J3455 Quad Core 1.5GHz (6GB RAM 128GB )SSD 2.0MP Front Camera Dual Band 4600mAh Built-in $411.00 32
Acer Aspire 5 A515-52G-58V9 15.6-Inch Narrow Bezel INTEL i5 with NVIDIA Graphics Card and Intel Optane Memory Laptop $998.00 33
Acer Swift 3 SF314-54G Thin and Light Narrow Border Design Laptop - 8th Generation i7 Processor with NVIDIA Graphics Card $1,398.00 34
Asus L203MA-FD071T $499.00 35
Lenovo ThinkPad T540P 15.6in LED Laptop i5-4300M #2.6Ghz 8GB DDR3 240GB SSD Win 10 Pro One Month Warranty-Refurbished $488.00 36
ASUS X411UF-BV070T i5-8250U Processor 1.6GHz (6M Cache, Up to 3.4GHz) Windows 10 Home (64-bit) NVIDIA® GeForce® GT MX130 with 2GB DDR5 14.0” LED-backlit Ultra Slim HD 1,366 x 768 Display 8GB DDR4 RAM & 1TB SATA HDD $928.00 37
BELLE 14 inch for Windows 10 Redstone OS Notebook PC Laptop 1920*1080P HD Display US Plug - intl $209.90 38
[Online Exclusive] Acer Swift 1 SF114-32-C5FL (Gold) Thin and Light Narrow-Bezel Display Laptop $598.00 39

@kensoh
Member

kensoh commented Jun 22, 2019

Adding on, I have tried your original code and it ran without errors. Forming a dataframe as in the code below also runs without issues for me. Can you try it and let me know where the error happens and what the error message is?

If the error happens within Python, you can use a normal try/except. If the error happens within TagUI, I will need more replication details and an example of the error to see where to dig further.
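
For the Python-side case, here is a minimal sketch of a try/except wrapper that records a failed read and moves on to the next product. The `read` parameter stands in for `t.read`; `crawl_products` is a hypothetical name used only for illustration, and the sketch assumes the error surfaces as a `UnicodeDecodeError` raised by the read call.

```python
def crawl_products(read, first=1, last=39):
    """Read name and price for each product index, skipping items that fail.

    `read` is any callable that takes an XPath string, e.g. tagui's t.read.
    """
    results, failed = [], []
    for i in range(first, last + 1):
        name_path = '(//div[@class="c16H9d"]//a/text())[' + str(i) + ']'
        price_path = '(//span[@class="c13VH6"]/text())[' + str(i) + ']'
        try:
            name = read(name_path)
            price = read(price_path)
        except UnicodeDecodeError as e:
            # Record the failure instead of stopping the whole crawl
            failed.append((i, str(e)))
            continue
        results.append({'index': i, 'name': name, 'price': price})
    return results, failed
```

Calling `crawl_products(t.read)` would return the rows that were read plus a list of the indexes that failed, so skipped products stay visible rather than disappearing silently.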

import pandas as pd
import tagui as t
t.init()
t.url('https://www.lazada.sg/shop-traditional-laptops/')

result_list = []
for i in range(1,40):
    path = '(//div[@class="c16H9d"]//a/text())[' + str(i) +']'
    name = t.read(path)
    path = '(//span[@class="c13VH6"]/text())[' + str(i) +']'
    price = t.read(path)
    result_list.append({'index': i, 'name': name, 'price': price})

result_df = pd.DataFrame(result_list, columns = ['index', 'name', 'price'])

@kensoh kensoh changed the title Error handling when encountered undefined character Error handling when encountered undefined character - need replication / error details Jun 22, 2019
@kensoh kensoh changed the title Error handling when encountered undefined character - need replication / error details Error handling when encountered undefined character - need eg of error Jun 22, 2019
@kensoh kensoh added the query label Jun 22, 2019
@kensoh
Member

kensoh commented Jun 22, 2019

The above was with Jupyter notebook. I also wasn't able to replicate the error by running your original script from the terminal with the python3 command.

import pandas as pd
df = pd.DataFrame(columns=['index','name','price'])

import tagui as t
t.init()
t.url("https://www.lazada.sg/")


t.url('https://www.lazada.sg/shop-traditional-laptops/')

df = df.iloc[0:0]

for i in list(range(1,40)):
    path = '(//div[@class="c16H9d"]//a/text())[' + str(i) +']'
    name = t.read(path)
    path = '(//span[@class="c13VH6"]/text())[' + str(i) +']'
    price = t.read(path)
    df.loc[i]=[i]+[name]+[price]

t.close()

Until you can share the error details and how to replicate the environment / error, another idea you can try is adding # -*- coding: utf-8 -*- at the top of your script.
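
Note that the coding declaration only tells Python how to decode the source file itself; it does not change the encoding used for file I/O at run time. A quick sketch for checking what the environment actually uses, with the standard library only (`describe_encodings` is a hypothetical helper name):

```python
import locale
import sys

def describe_encodings():
    """Collect the encodings Python uses by default in this environment."""
    return {
        # In Python 3, what open() uses when no encoding argument is given
        'preferred': locale.getpreferredencoding(False),
        # Used by str.encode() / bytes.decode() when called with no arguments
        'default': sys.getdefaultencoding(),
        # Encoding of the terminal stream; may be None when output is redirected
        'stdout': getattr(sys.stdout, 'encoding', None),
    }

for key, value in sorted(describe_encodings().items()):
    print(key, value)
```

If `preferred` is something other than UTF-8 (for example a Windows code page), files written in UTF-8 can fail to decode when opened without an explicit encoding.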

@yangliu8912
Author

Hi Ken,

Thank you so much for your reply. I've tried taking out the pandas code but I still encounter the same error. Please kindly see the error details below:

errorDetails

I've also tried using # -*- coding: utf-8 -*- at the first line of my script but it did not work.

Best regards,
Yang Liu

@kensoh
Member

kensoh commented Jun 24, 2019

Thanks Yang Liu, for the error details! 👍 The difference in our environments could be between Anaconda-installed Python vs direct installation, or maybe Windows vs my laptop (macOS).

But it has to be resolved within TagUI for Python. I have an idea where to tackle this, and will test and try to deploy a new package version that fixes it while remaining Python 2/3 compatible.

kensoh added a commit that referenced this issue Jun 24, 2019
This change explicitly specifies unicode utf-8 as the encoding for both Python 2 and 3, and for read and write operations to files. More details at issue #14.

This helps avoid potential issues when TagUI outputs UTF-8 encoded characters outside the ASCII range (eg webpage has special characters) but the user's Python locale preferred encoding is not UTF-8, resulting in an error in I/O between TagUI for Python and TagUI.
@kensoh
Member

kensoh commented Jun 24, 2019

Hi Yang Liu, can you try pip install tagui --upgrade, restart the Jupyter notebook kernel and try again? I was not able to replicate it in my environment but suspect the cause is related to this.

TagUI writes in UTF-8 but your laptop's preferred encoding may be something else, so reading special characters from TagUI results in an error. Version 1.4 explicitly uses UTF-8 in both Python 2 and 3 environments, regardless of the user's locale preferred encoding.
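
The mismatch can be reproduced in isolation with a small sketch. It assumes, purely for illustration, that the reader's locale encoding is cp1252 (the actual locale in this report was not confirmed), and `demo_locale_mismatch` is a hypothetical name. `io.open` is used because it accepts an `encoding` argument on both Python 2 and 3.

```python
import io
import os
import tempfile

def demo_locale_mismatch(text=u"NVIDIA\u00ae GeForce\u00ae 14.0\u201d"):
    """Write text as UTF-8, then read it back with two different encodings."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        # Writer side: always UTF-8
        with io.open(path, "w", encoding="utf-8") as f:
            f.write(text)
        # Reader side with a mismatched encoding (stand-in for a non-UTF-8 locale)
        try:
            with io.open(path, "r", encoding="cp1252") as f:
                f.read()
            wrong_codec_failed = False
        except UnicodeDecodeError:
            wrong_codec_failed = True
        # Reader side with the encoding stated explicitly
        with io.open(path, "r", encoding="utf-8") as f:
            round_trip = f.read()
    finally:
        os.remove(path)
    return wrong_codec_failed, round_trip == text

print(demo_locale_mismatch())
```

Stating the encoding on both the write and the read side removes the dependency on whatever the machine's locale happens to be.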

@kensoh kensoh added bug and removed query labels Jun 24, 2019
@kensoh kensoh changed the title Error handling when encountered undefined character - need eg of error Error handling when encountered special characters - non-UTF-8 env [fixed] Jun 24, 2019
@convexset

The problem can be solved practically by:

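def read_text_lossy():  # hypothetical wrapper so the return statements are valid
    try:
        return t.text()
    except UnicodeDecodeError as e:
        # e.args[0] is the codec name, e.args[1] the raw bytes that failed to decode
        return e.args[1].decode(e.args[0], errors='ignore')
    # except ...:
    #     whatever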

@convexset

You just lose the characters that cannot be decoded.
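
To make the trade-off concrete, here is a small sketch comparing `errors='ignore'` with `errors='replace'`. The byte string is an illustrative stand-in for TagUI output, not captured from a real run, and ASCII is used only as an example of a codec that cannot decode the bytes.

```python
# Bytes as a UTF-8 writer would emit them for a name containing the (R) sign
raw = u"NVIDIA\u00ae GeForce".encode("utf-8")

# 'ignore' silently drops the bytes that cannot be decoded
ignored = raw.decode("ascii", errors="ignore")

# 'replace' substitutes U+FFFD for each undecodable byte, leaving a visible marker
replaced = raw.decode("ascii", errors="replace")

print(ignored)
print(replaced.count(u"\ufffd"), "replacement characters")
```

With `'replace'` the result still differs from the source text, but the substitution can be detected afterwards, which `'ignore'` does not allow.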

@kensoh
Member

kensoh commented Jun 25, 2019

Cool stuff, thanks @convexset! This workaround can't be adopted for now because, at this stage, newly discovered gaps should be properly fixed where possible. Losing characters or data without any error feedback can lead to situations where users run marathon jobs for more than 10 hours, only to find out much later that they missed some data and need to re-run. I'll check back with you later to see if you can commit a PR for edge cases that have no proper solution.

@kensoh kensoh changed the title Error handling when encountered special characters - non-UTF-8 env [fixed] UnicodeDecodeError when encounter special characters - non-UTF-8 env [fixed] Jun 28, 2019
@kensoh kensoh changed the title UnicodeDecodeError when encounter special characters - non-UTF-8 env [fixed] UnicodeDecodeError when read() special characters - non-UTF-8 env [fixed] Jun 28, 2019
@yangliu8912
Author

Hi Ken,

Sorry for the late reply as I was overseas and just came back to Singapore. Thank you for the fix, I've tested that it solves my issue : )

Best regards,
Yang Liu

@kensoh
Member

kensoh commented Jul 1, 2019

Hi Yang Liu, no problem at all and welcome back to sunny island paradise :) Thanks for raising this and providing the replication details. This is an important issue to fix before the package gets used by more people, because it involves losing data and is a fundamental problem that has to be fixed.

If you encounter more edge-case issues, let me know and I'll see whether I can put in something to address the problem. When nothing more can be done, I'll invite @convexset to commit his solution to drop characters on an unrecoverable data read error, with a warning message.
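
A rough sketch of what such a fallback could look like, assuming the raw bytes are available to the caller. `decode_with_warning` is a hypothetical helper, not part of the package.

```python
import warnings

def decode_with_warning(raw, encoding="utf-8"):
    """Decode bytes strictly; on failure, drop bad bytes and warn the caller."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as e:
        # Lossy fallback: undecodable bytes are removed from the result
        text = raw.decode(encoding, errors="ignore")
        # e.start / e.end describe only the first undecodable span
        warnings.warn(
            "dropped undecodable bytes, first at position %d-%d (%s)"
            % (e.start, e.end, e.reason)
        )
        return text
```

The warning keeps the data loss visible, so a long-running job does not silently produce incomplete results.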

@wwwfzt

wwwfzt commented Aug 20, 2022

I still get the error when running sample.py. My environment is Chinese Windows 11 with Python 3.8.1.
[RPA][ERROR] - 'utf-8' codec can't decode bytes in position 82-83: invalid continuation byte

@wwwfzt

wwwfzt commented Aug 20, 2022

It occurs in r.init()

@kensoh
Member

kensoh commented Aug 20, 2022

Looks like there could be some characters that are not in UTF-8 format. Could you check if another computer or your friend can run this? And compare your PCs to see what setting to set to enable UTF-8. This package has been used by many Chinese users, so there should be a solution.
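
One way to narrow this down, sketched under assumptions: check whether bytes that fail as UTF-8 decode cleanly under a legacy code page such as GBK, which would suggest that some file or command output is being written in the system locale. `first_working_encoding` is a hypothetical helper, and the sample bytes are illustrative rather than taken from the failing run.

```python
def first_working_encoding(raw, candidates=("utf-8", "gbk", "cp1252")):
    """Return (encoding, text) for the first candidate that decodes raw."""
    for encoding in candidates:
        try:
            return encoding, raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return None, None

# Two Chinese characters as a GBK-locale program would write them
sample = u"\u4f60\u597d".encode("gbk")
encoding, text = first_working_encoding(sample)
print(encoding)
```

A successful decode is only a hint, since permissive code pages accept most byte sequences. Separately, Python 3.7+ has a UTF-8 mode (environment variable `PYTHONUTF8=1`, or `python -X utf8`) that makes UTF-8 the default text encoding; whether it resolves this particular error has not been confirmed in this thread.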
