lxml will not parse unicode xml #17

Open
wants to merge 1 commit into
base: develop

Conversation

cappy123abc

This is my first pull request, so be nice! I get the following error when trying to parse the returned XML: ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
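
For context, here is a minimal sketch of one common workaround for this error: lxml rejects a Python unicode string that carries an XML encoding declaration, but accepts the same document as bytes. The sample document and the helper name `parse_xml_text` are made up for illustration and are not taken from this pull request; the fallback import only keeps the sketch runnable where lxml is not installed.

```python
try:
    from lxml import etree  # the parser discussed in this thread
except ImportError:
    # Fallback so the sketch still runs without lxml; the stdlib parser is
    # more lenient and does not raise the ValueError quoted above.
    import xml.etree.ElementTree as etree

# Hypothetical response text: a unicode string with an encoding declaration.
XML_TEXT = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<customer><name>Søréñ</name></customer>"
)


def parse_xml_text(xml_text, encoding="utf-8"):
    """Encode text to bytes before handing it to the parser.

    Hypothetical helper. The encoding used here has to match the one
    named in the XML declaration, otherwise the parser will misread
    the non-ASCII characters.
    """
    if isinstance(xml_text, bytes):
        return etree.fromstring(xml_text)
    return etree.fromstring(xml_text.encode(encoding))


root = parse_xml_text(XML_TEXT)
print(root.findtext("name"))
```

Encoding at the boundary keeps the declaration and the actual byte encoding in agreement, so the header itself does not need to be edited.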

@sharoonthomas

@cappy123abc thanks for the PR.

Changing the encoding in the XML header is fine, but my guess is that all the content inside the XML is still Unicode encoded as UTF-8. There is a test file which generates XML and prints it to the terminal. Perhaps we could add one more test with a name containing Unicode characters, like Søréñ, and check how the system treats the XML?
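
A test along those lines might look like the sketch below. It does not use the project's actual test file or XML builder, which are not shown in this thread; `build_customer_xml`, the element names, and the test class are placeholders. It assumes Python 3 and falls back to the standard library parser if lxml is not installed.

```python
import unittest

try:
    from lxml import etree
except ImportError:
    # Fallback so the sketch runs without lxml installed.
    import xml.etree.ElementTree as etree


def build_customer_xml(name):
    """Serialize a small <customer> document to UTF-8 bytes (placeholder builder)."""
    root = etree.Element("customer")
    etree.SubElement(root, "name").text = name
    return etree.tostring(root, encoding="utf-8")


class UnicodeNameTest(unittest.TestCase):
    def test_round_trip(self):
        payload = build_customer_xml("Søréñ")
        # The serializer should hand back bytes, not a unicode string.
        self.assertIsInstance(payload, bytes)
        # The non-ASCII characters are present as UTF-8 encoded bytes.
        self.assertIn("Søréñ".encode("utf-8"), payload)
        # Parsing the bytes recovers the original name unchanged.
        parsed = etree.fromstring(payload)
        self.assertEqual(parsed.findtext("name"), "Søréñ")
```

Saved in a test module, this can be run with `python -m unittest`.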

@cappy123abc
Author

OK, I'll add a test. I was looking at the docs at http://lxml.de/parsing.html, where the author writes: "You should generally avoid converting XML/HTML data to unicode before passing it into the parsers. It is both slower and error prone." That seems odd to me, since so much XML must have international characters in it and Unicode is the most elegant way to deal with them. What does he expect, ASCII? :)

3 participants