This package provides functions for:
- decoding the byte content of an HTML document into Unicode text
- detecting the encoding of an HTML document's byte content
- normalizing an encoding name to its canonical form, according to the WHATWG HTML standard
Feel free to give feedback in Telegram groups: @grablab and @grablab_ru.
pip install -U unicodec
Download a web document with urllib and convert its content to Unicode.
from urllib.request import urlopen
from unicodec import decode_content, detect_content_encoding
res = urlopen("http://lib.ru")
rawdata = res.read()
data = decode_content(rawdata, content_type_header=res.headers["content-type"])
print(data[:70])
print(detect_content_encoding(rawdata, res.headers["content-type"]))
Output:
<html><head><title>Lib.Ru: Библиотека Максима Мошкова</title></head><b
koi8-r
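For reference, the charset that the `content_type_header` argument carries can be inspected with the standard library alone. The sketch below is not part of unicodec and says nothing about how `decode_content` works internally; `charset_from_content_type` is a hypothetical helper name, and the sample header and bytes are made up so the example runs offline.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lowercased charset parameter of a Content-Type value, or None.

    Hypothetical helper for illustration only; not a unicodec function.
    """
    msg = Message()
    msg["content-type"] = header_value
    return msg.get_content_charset()


# Made-up sample data, so the example runs without network access.
header = "text/html; charset=KOI8-R"
rawdata = "<title>Библиотека</title>".encode("koi8-r")

charset = charset_from_content_type(header)
print(charset)
# Decode with the declared charset; fall back to UTF-8 if none is declared.
print(rawdata.decode(charset or "utf-8", errors="replace"))
```

With urllib the same value is available as `res.headers.get_content_charset()`, because `res.headers` is an `email.message.Message` subclass. A header can be missing or wrong, so this sketch is no substitute for `detect_content_encoding`, which also receives the raw bytes.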
Download a web document with urllib3 and convert its content to Unicode.
from urllib3 import PoolManager
from unicodec import decode_content, detect_content_encoding
res = PoolManager().urlopen("GET", "http://lib.ru")
rawdata = res.data
data = decode_content(rawdata, content_type_header=res.headers["content-type"])
print(data[:70])
print(detect_content_encoding(rawdata, res.headers["content-type"]))
Output:
<html><head><title>Lib.Ru: Библиотека Максима Мошкова</title></head><b
koi8-r
Convert encoding names to their canonical form (according to the WHATWG HTML standard).
from unicodec.normalization import normalize_encoding_name
for name in ["iso8859-1", "utf8", "cp1251"]:
    print("{} -> {}".format(name, normalize_encoding_name(name)))
Output:
iso8859-1 -> windows-1252
utf8 -> utf-8
cp1251 -> windows-1251
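To make the idea of a canonical name concrete, here is a minimal sketch of label lookup, assuming a small hand-written subset of the WHATWG label table. It is not the implementation of `normalize_encoding_name`; `LABELS` and `canonical_name` are hypothetical names, and the real table is much larger.

```python
# Hypothetical, deliberately partial label table (label -> canonical name).
# Every entry below appears in the WHATWG table of encoding labels.
LABELS = {
    "utf-8": "utf-8",
    "utf8": "utf-8",
    "windows-1252": "windows-1252",
    "iso-8859-1": "windows-1252",
    "iso8859-1": "windows-1252",
    "latin1": "windows-1252",
    "ascii": "windows-1252",
    "windows-1251": "windows-1251",
    "cp1251": "windows-1251",
    "koi8-r": "koi8-r",
    "koi8": "koi8-r",
}

# ASCII whitespace as defined by WHATWG: tab, LF, FF, CR, space.
ASCII_WHITESPACE = "\t\n\x0c\r "


def canonical_name(label):
    """Return the canonical name for a label, or None if it is not in the table."""
    # Trim ASCII whitespace, then match case-insensitively.
    # Note: str.lower() is slightly broader than ASCII-only lowercasing.
    key = label.strip(ASCII_WHITESPACE).lower()
    return LABELS.get(key)


for name in ["iso8859-1", " UTF8 ", "cp1251", "no-such-encoding"]:
    print("{!r} -> {}".format(name, canonical_name(name)))
```

The point of the mapping is that legacy labels such as `iso8859-1` and `latin1` resolve to `windows-1252`, which is why the output above shows `windows-1252` rather than an ISO-8859-1 name.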