UnicodeDecodeError when read() special characters - non-UTF-8 env [fixed] #14

yangliu8912 · 2019-06-22T15:57:19Z

Hi Ken,

I've encountered an error while crawling the products from Lazada. The error is due to the "R" symbol inside a certain product name: Windows 10 Home (64-bit) NVIDIA®

I've attached the Python code below:
crawlProduct.txt

TagUI stopped once it hit the error. How can I handle the error so that it continues to crawl the next product?

Best regards,
Yang Liu

kensoh · 2019-06-22T16:56:57Z

Hi Yang Liu, this may be a Python / pandas question rather than a TagUI for Python question. When I ran the code below (with the pandas code stripped out), it worked even with ® in the output.

modified code without pandas

import tagui as t

t.init()
# Open the home page first, then the laptop listing page
t.url("https://www.lazada.sg/")
t.url('https://www.lazada.sg/shop-traditional-laptops/')

# XPath positions are 1-based; read the first 39 products
for i in range(1, 40):
    path = '(//div[@class="c16H9d"]//a/text())[' + str(i) + ']'
    name = t.read(path)
    path = '(//span[@class="c13VH6"]/text())[' + str(i) + ']'
    price = t.read(path)
    print(name, price, i)

t.close()

output

HP Notebook - 14-cm0109au / 4GB / 128 SSD / 14" / WIN10 $549.00 1
Acer Swift 3 SF314-54 Thin and Light Narrow Border Design Laptop - 8th Generation i5 Processor $965.00 2
ASUS ZenBook 14 UX431FA-AM055T (Utopia Blue) $1,298.00 3
Laptop Ultra-thin Laptop 15.6 Inch Business Portable Laptop Durable Windows 10 4GB+64GB  Slim Black Notebook Computer Office Laptop (Wireless mouse for gift) $259.00 4
KawhiMall Ultrathin 14 inch Laptop Z8350 computer 2GB/4GB RAM+32GB/64GB ROM notebookQuad core 1.44GHZ 1366*768 TN wifi Bluetooth $209.99 5
NEW MICROSOFT SURFACE LAPTOP 2 BUSINESS- Intel i7 8GB RAM 256GB SSD in Matte Black $2,066.00 6
(Refurbished) Lenovo ThinkPad 11E 11.6" 11e Chromebook /Quad-Core CPU /4 GB RAM /16 GB SSD /Intel HD Graphics $218.00 7
Certified Refurbished Lenovo Thinkpad T430 Laptop (Core i5 3rd Gen/4 GB/320 GB/Windows 10 Pro) $327.99 8
HP Notebook - 14s-cf1020tx/ i5-8265U/ 8 GB RAM/ 1 TB 5400 rpm SATA/ AMD Radeon™ 530 Graphics/WIN 10 $949.00 9
[Online Exclusive] Acer Swift 3 SF314-54 Thin & Light Narrow Border Laptop - Latest Model $766.00 10
(Refurbished) Lenovo ThinkPad Yoga 260 12.5-inch - i5 6th Gen 8GB Ram 192GB SSD $489.90 11
VOYO i8 Max Tablet PC MTK X20 Deca-core 4GB RAM 64GB ROM 10.1" 1920*1200 IPS Android 7.0 LTE WCDMA Tablet with WiFi Bluetooth - intl $267.00 12
[Laptop] Microsoft New Surface Laptop 2 $1,528.00 13
HP Pavilion - 14-ce0090tx Intel i5-8250U Nvidia MX150 8GB DDR4 RAM 1TB Hard drive 128GB SSD $849.00 14
Acer Swift 3 SF314-56G-71GP(Blue) 14"/ i7-8565U/ NVIDIA GEFORCE MX250/ 8GB DDR4 RAM/ 256GB SSD + 1TB HDD $1,466.00 15
HP ELITEBOOK 840 G5 i5 (3UP98PA) $2,408.00 16
Dell Latitude E7450 14inch Business Laptop Computer, Intel Core i7-5600U #2.6Ghz 8GB RAM, 256GB SSD, 802.11ac, Bluetooth, HDMI, USB 3.0, Windows 10 Professional Refurbished $488.00 17
GOOD Ultra-thin Laptop PC 14.1-inch Netbook 1366*768P Display pixel 4GB+64GB $269.99 18
[Refurbished] Lenovo ThinkPad X240 Ultrabook / Intel Core i5 / 4GB RAM / 128GB SSD / Windows 10 Pro Laptop (Black) $259.00 19
Dell XPS 13 9380 -i5 8GB 256 | Stunning inside and out $1,899.00 20
CELE Tablet 10.1 inch Tablet 4GB RAM 64GB ROM for Android 7.0 Phablet Tablet Pc $90.53 21
Acer Spin 3 SP314-52-57FR Convertible Laptop (Grey) $1,198.00 22
[GSS] Microsoft Surface Laptop 2 Platinum i5/8gb/256gb $1,728.00 23
[NEW ARRIVAL] Acer Aspire 3 A314-32 14-Inch Laptop $568.00 24
YEPO 737A6 15.6-Inch 1080P HD Notebook 6+64G Gaming Working Laptop For Windows10 0.3MP Camera Notebook Computer $343.53 25
VOYO i8 Max Tablet PC MTK X20 Deca-core 4GB RAM 64GB ROM 10.1" 1920*1200 IPS Android 7.0 LTE WCDMA Tablet with WiFi Bluetooth $247.00 26
Lenovo Ideapad S130 -11.6 HD -N4000 -4GB DDR4 -32G EMMC $399.00 27
HP Laptop 14s-df0004TU/ i3-8130U/ 4 GB RAM/ 128 GB SSD/ WIN 10 $799.00 28
KawhiMall 14 inch Laptop Computer  Intel Cherry trail Z8350 Quad-Core 1.92GHz Notebook Durable LPDDR3 2GB EMMC 32GB With wifi bluetooth Exquisite Gadget Working Supply Learning Tools $219.00 29
Dell Latitude E7450 14inch Business Laptop Computer, Intel Core i7-5600U #2.6Ghz 16GB RAM, 256GB SSD, 802.11ac, Bluetooth, HDMI, USB 3.0, Windows 10 Professional Refurbished $538.00 30
[Promotion] 14.1 Inch EZbook 2 Notebook 1920x1080 FHD 4GB+64GB Laptop Computer EU Charger $288.88 31
original Jumper EZbook X4 Notebook 14.0 inch Windows 10 Home Version Intel Celeron J3455 Quad Core 1.5GHz (6GB RAM 128GB )SSD 2.0MP Front Camera Dual Band 4600mAh Built-in $411.00 32
Acer Aspire 5 A515-52G-58V9 15.6-Inch Narrow Bezel INTEL i5 with NVIDIA Graphics Card and Intel Optane Memory Laptop $998.00 33
Acer Swift 3 SF314-54G Thin and Light Narrow Border Design Laptop - 8th Generation i7 Processor with NVIDIA Graphics Card $1,398.00 34
Asus L203MA-FD071T $499.00 35
Lenovo ThinkPad T540P 15.6in LED Laptop i5-4300M #2.6Ghz 8GB DDR3 240GB SSD Win 10 Pro One Month Warranty-Refurbished $488.00 36
ASUS X411UF-BV070T i5-8250U Processor 1.6GHz (6M Cache, Up to 3.4GHz) Windows 10 Home (64-bit) NVIDIA® GeForce® GT MX130 with 2GB DDR5 14.0” LED-backlit Ultra Slim HD 1,366 x 768 Display 8GB DDR4 RAM & 1TB SATA HDD $928.00 37
BELLE 14 inch for Windows 10 Redstone OS Notebook PC Laptop 1920*1080P HD Display US Plug - intl $209.90 38
[Online Exclusive] Acer Swift 1 SF114-32-C5FL (Gold) Thin and Light Narrow-Bezel Display Laptop $598.00 39

kensoh · 2019-06-22T17:11:28Z

Adding on, I have tried your original code and it ran without errors. I'm also able to form a dataframe without issues using the code below. Can you try it and let me know where the error happens and what the error message is?

If the error happens within Python, you can use a normal try/except block. If it happens within TagUI, I will need more replication details and an example of the error to see where to dig further.

import pandas as pd
import tagui as t
t.init()
t.url('https://www.lazada.sg/shop-traditional-laptops/')

result_list = []
for i in range(1,40):
    path = '(//div[@class="c16H9d"]//a/text())[' + str(i) +']'
    name = t.read(path)
    path = '(//span[@class="c13VH6"]/text())[' + str(i) +']'
    price = t.read(path)
    result_list.append({'index': i, 'name': name, 'price': price})

result_df = pd.DataFrame(result_list, columns = ['index', 'name', 'price'])
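
A minimal sketch of the try/except approach mentioned above, so that one bad item does not stop the whole crawl. The helper name crawl_items and its read_fn parameter are hypothetical and not part of TagUI for Python; in a real script read_fn would presumably be t.read. Failures are recorded rather than skipped silently, so they can be reviewed or retried.

```python
def crawl_items(read_fn, name_xpath, price_xpath, count):
    """Read `count` name/price pairs, continuing past items that fail.

    read_fn is any callable that takes an XPath string and returns text
    (hypothetical parameter; t.read is the assumed real-world argument).
    """
    results, failures = [], []
    for i in range(1, count + 1):
        try:
            name = read_fn('(' + name_xpath + ')[' + str(i) + ']')
            price = read_fn('(' + price_xpath + ')[' + str(i) + ']')
        except UnicodeDecodeError as error:
            # Keep the index and reason so the item can be retried later.
            failures.append((i, str(error)))
            continue
        results.append({'index': i, 'name': name, 'price': price})
    return results, failures
```

Assumed usage: results, failures = crawl_items(t.read, '//div[@class="c16H9d"]//a/text()', '//span[@class="c13VH6"]/text()', 39). This only helps when the exception actually reaches the calling script.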

kensoh · 2019-06-22T18:00:08Z

The above was run in a Jupyter notebook. I also wasn't able to replicate the error when running your original script from the terminal with the python3 command.

import pandas as pd
df = pd.DataFrame(columns=['index','name','price'])

import tagui as t
t.init()
t.url("https://www.lazada.sg/")


t.url('https://www.lazada.sg/shop-traditional-laptops/')

df = df.iloc[0:0]

for i in list(range(1,40)):
    path = '(//div[@class="c16H9d"]//a/text())[' + str(i) +']'
    name = t.read(path)
    path = '(//span[@class="c13VH6"]/text())[' + str(i) +']'
    price = t.read(path)
    df.loc[i]=[i]+[name]+[price]

t.close()

Until you can share the error details and how to replicate the environment and error, another idea you can try is adding # -*- coding: utf-8 -*- at the top of your script.

yangliu8912 · 2019-06-23T07:14:05Z

Hi Ken,

Thank you so much for your reply. I've tried taking out the pandas code but I still encounter the same error. Please kindly see the error details below:

I've also tried using # -*- coding: utf-8 -*- at the first line of my script but it did not work.

Best regards,
Yang Liu

kensoh · 2019-06-24T01:50:37Z

Thanks for the error details, Yang Liu! 👍 The difference between our environments could be Anaconda-installed Python versus a direct installation, or Windows versus my laptop (macOS).

But it has to be resolved within TagUI for Python. I have an idea of where to tackle this, and will test and try to deploy a new package version that fixes it while remaining Python 2/3 compatible.

This change explicitly specifies unicode utf-8 as the encoding for both Python 2 and 3, and for read and write operations to files. More details at issue #14. This helps potential issues when TagUI outputs utf-8 encoded characters from extended ASCII set (eg webpage has special characters) but user Python locale preferred encoding is not utf-8, resulting in error in I/O between TagUI for Python and TagUI.

kensoh · 2019-06-24T12:09:17Z

Hi Yang Liu, can you run pip install tagui --upgrade, restart the Jupyter notebook kernel, and try again? I was not able to replicate the error in my environment, but I suspect the cause is the following.

TagUI writes in UTF-8, but your laptop's preferred encoding may be something else, so reading special characters from TagUI results in an error. Version v1.4 explicitly uses UTF-8 in both Python 2 and 3 environments, regardless of the user's local preferred encoding.

Internal reference - made a commit with following comments

This change explicitly specifies unicode utf-8 as the encoding for both Python 2 and 3, and for read and write operations to files. More details at issue #14.

This helps potential issues when TagUI outputs utf-8 encoded characters from extended ASCII set (eg webpage has special characters) but user Python locale preferred encoding is not utf-8, resulting in error in I/O between TagUI for Python and TagUI.
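
To illustrate the mismatch described in the commit comment, here is a small self-contained sketch. It is not the actual change made in TagUI for Python; the helper names write_utf8 and read_with are hypothetical, and cp1252 stands in for a non-UTF-8 preferred encoding such as a Western European Windows default. The sample string borrows the ® and ” characters from the product name in this thread.

```python
import io
import os
import tempfile

def write_utf8(path, text):
    # Mimics a tool that always writes UTF-8, as TagUI is described to do.
    with io.open(path, 'w', encoding='utf-8') as handle:
        handle.write(text)

def read_with(path, encoding):
    # An explicit encoding overrides the locale's preferred encoding.
    with io.open(path, 'r', encoding=encoding) as handle:
        return handle.read()

sample = u'NVIDIA\u00ae GeForce\u00ae 14.0\u201d'
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
write_utf8(path, sample)

# Reading with the encoding that was used for writing round-trips cleanly.
print(read_with(path, 'utf-8') == sample)

# Reading the same bytes as cp1252 fails: the UTF-8 form of the right
# double quote ends in byte 0x9d, which Python's cp1252 codec leaves undefined.
try:
    read_with(path, 'cp1252')
except UnicodeDecodeError as error:
    print('cp1252 failed:', error)
```

Calling open() without an encoding argument behaves like read_with(path, locale.getpreferredencoding(False)), which is why the same script can work on one machine and fail on another.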

convexset · 2019-06-24T15:43:25Z

The problem can be solved practically by:

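def read_page_text():  # hypothetical wrapper so that `return` is valid
    try:
        return t.text()
    except UnicodeDecodeError as e:
        # e.args is (encoding, raw_bytes, start, end, reason)
        return e.args[1].decode(e.args[0], errors='ignore')
    # except ...:
    #     whatever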

convexset · 2019-06-24T15:44:15Z

You just lose the characters that cannot be decoded.

kensoh · 2019-06-25T11:46:38Z

Cool stuff, thanks @convexset! I won't adopt this workaround for now because, at this stage, newly discovered gaps should be addressed at the source where possible. Losing characters or data without any error feedback can lead to situations where users run marathon jobs for more than 10 hours, only to find out much later that some data is missing and the job needs a re-run. I'll check back with you later to see if you can commit a PR for edge cases that have no addressable solution.

yangliu8912 · 2019-07-01T06:00:38Z

Hi Ken,

Sorry for the late reply as I was overseas and just came back to Singapore. Thank you for the fix, I've tested that it solves my issue : )

Best regards,
Yang Liu

kensoh · 2019-07-01T06:25:38Z

Hi Yang Liu, no problem at all and welcome back to sunny island paradise :) Thanks for raising this and providing the replication details. This is an important issue to fix before the package gets used by more people, because it involves losing data and is a fundamental problem that has to be fixed.

If you encounter more edge cases, let me know and I'll see whether I can put in something to address them. When nothing more can be done, I'll invite @convexset to commit his solution, which drops characters on an unrecoverable data read error, together with a warning message.
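
A rough sketch of what "drop characters, but with a warning message" could look like. This is an illustration only, not code from the package; the function name decode_with_warning is hypothetical.

```python
import warnings

def decode_with_warning(raw, encoding='utf-8'):
    """Decode bytes, dropping undecodable bytes but warning when that happens."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as error:
        # Fall back to a lossy decode, and tell the user that data was lost.
        text = raw.decode(encoding, errors='ignore')
        warnings.warn(
            'dropped undecodable bytes while decoding as %s: %s'
            % (encoding, error.reason),
            UnicodeWarning,
        )
        return text
```

Because the fallback is announced through the warnings module, a long-running job can either log the loss or turn it into a hard failure with warnings.simplefilter('error').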

wwwfzt · 2022-08-20T16:01:39Z

I still get the error when running sample.py. My environment is Chinese Windows 11 with Python 3.8.1.
[RPA][ERROR] - 'utf-8' codec can't decode bytes in position 82-83: invalid continuation byte

wwwfzt · 2022-08-20T16:02:03Z

I still get the error when running sample.py. My environment is Chinese Windows 11 with Python 3.8.1.
[RPA][ERROR] - 'utf-8' codec can't decode bytes in position 82-83: invalid continuation byte

It occurs in r.init().

kensoh · 2022-08-20T22:35:07Z

It looks like some characters are not valid UTF-8. Could you check whether another computer, or a friend's, can run this, and compare the two PCs to see which setting enables UTF-8? This package has been used by many Chinese users, so there should be a solution.
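
One possible way to compare two machines is to print the encoding settings Python reports on each. The sketch below uses only the standard library; the function name encoding_report is hypothetical. On Python 3.7+, setting the environment variable PYTHONUTF8=1 (or running with -X utf8) switches Python to UTF-8 mode, but whether that resolves the r.init() error reported above has not been confirmed in this thread.

```python
import locale
import sys

def encoding_report():
    """Collect the encoding settings that commonly differ between machines."""
    return {
        'preferred_encoding': locale.getpreferredencoding(False),
        'default_encoding': sys.getdefaultencoding(),
        'filesystem_encoding': sys.getfilesystemencoding(),
        'stdout_encoding': getattr(sys.stdout, 'encoding', None),
        # utf8_mode is 1 when Python runs with -X utf8 or PYTHONUTF8=1 (3.7+).
        'utf8_mode': getattr(sys.flags, 'utf8_mode', None),
    }

for key, value in sorted(encoding_report().items()):
    print(key, '=', value)
```

A preferred_encoding other than UTF-8 on the failing machine (for example a legacy Windows code page) would be consistent with the mismatch discussed earlier in this issue.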

kensoh changed the title ~~Error handling when encountered undefined character~~ Error handling when encountered undefined character - need replication / error details Jun 22, 2019

kensoh changed the title ~~Error handling when encountered undefined character - need replication / error details~~ Error handling when encountered undefined character - need eg of error Jun 22, 2019

kensoh added the query label Jun 22, 2019

kensoh added bug and removed query labels Jun 24, 2019

kensoh changed the title ~~Error handling when encountered undefined character - need eg of error~~ Error handling when encountered special characters - non-UTF-8 env [fixed] Jun 24, 2019

kensoh mentioned this issue Jun 24, 2019

Fatal: Access is denied; did you install phantomjs? - firewall / virus scanner #15

Closed

kensoh changed the title ~~Error handling when encountered special characters - non-UTF-8 env [fixed]~~ UnicodeDecodeError when encounter special characters - non-UTF-8 env [fixed] Jun 28, 2019

kensoh changed the title ~~UnicodeDecodeError when encounter special characters - non-UTF-8 env [fixed]~~ UnicodeDecodeError when read() special characters - non-UTF-8 env [fixed] Jun 28, 2019

kensoh mentioned this issue Jul 6, 2019

Sample code to highlight major advantages of TagUI for Python #25

Closed

kensoh closed this as completed Jul 12, 2019

kensoh mentioned this issue Oct 10, 2019

[TAGUI][ERROR] - failed downloading from.. macOS + Python 3 issue [fixed] #6

Closed

kensoh mentioned this issue Dec 3, 2019

Automation example - RedMart rescheduling or repeating groceries order #24

Closed
