read_html not reading some UTF-8 characters properly from a file #293

Closed
wkmor1 opened this issue Apr 3, 2020 · 4 comments
Labels
bug an unexpected problem or unintended behavior

Comments

@wkmor1

wkmor1 commented Apr 3, 2020

In xml2 1.3.0, some UTF-8 characters are not read properly from a file:

# xml2 1.2.5
tmp <- tempfile()
cat("<p>’</p>", file = tmp)
xml2::read_html(tmp, encoding = "UTF-8")
#> {html_document}
#> <html>
#> [1] <body><p>’</p></body>

# xml2 1.3.0
tmp <- tempfile()
cat("<p>’</p>", file = tmp)
xml2::read_html(tmp, encoding = "UTF-8")
#> {html_document}
#> <html>
#> [1] <body><p>â\u0080\u0099</p></body>
@hadley
Member

hadley commented Apr 3, 2020

This appears to be breaking pkgdown sites 😞

@jayhesselberth
Contributor

jayhesselberth commented Apr 4, 2020

git bisect report using the reprex above, starting at 1.2.5 and ending at 1.3.0:

03adb8f5695383270a029128a112ad645bde2df3 is the first bad commit
commit 03adb8f5695383270a029128a112ad645bde2df3
Author: Jim Hester <james.f.hester@gmail.com>
Date:   Fri Mar 27 09:47:17 2020 -0400

    Convert doc_parse_file

 R/RcppExports.R     |  4 ----
 R/xml_parse.R       |  2 +-
 src/RcppExports.cpp | 14 --------------
 src/init.c          |  4 ++--
 src/xml2_doc.cpp    | 31 +++++++++++++++++++------------
 src/xml2_init.cpp   |  2 +-
 6 files changed, 23 insertions(+), 34 deletions(-)

This is the offending line:

strncmp(encoding, "", 0) == 0 ? NULL : encoding,

Not sure what the strncmp is guarding against but seems encoding and NULL are swapped?

@jimhester
Member

Fixed by 6543857

@jimhester
Member

jimhester commented Apr 4, 2020

The issue was that strncmp(encoding, "anything", 0) always returns 0, because strncmp() compares at most n characters, and n is 0 in this case.

Anyway since C strings are NUL terminated (and therefore always have at least one character) I can just check if encoding[0] == '\0', which is what happens now.

Sorry for breaking this, I will likely do a release early next week to fix it.
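
For readers following along, the behavior described above can be reproduced in isolation. The following is a minimal sketch, not the actual xml2 source: the helper names `encoding_or_null_buggy`, `encoding_or_null_fixed`, and `demo` are hypothetical and exist only to contrast the two checks quoted in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper mirroring the check from the offending line.
 * With n == 0, strncmp() compares no characters and returns 0,
 * so this returns NULL for every input, including "UTF-8". */
const char *encoding_or_null_buggy(const char *encoding) {
    return strncmp(encoding, "", 0) == 0 ? NULL : encoding;
}

/* Hypothetical helper mirroring the fix described above: a C string
 * is empty exactly when its first byte is the NUL terminator. */
const char *encoding_or_null_fixed(const char *encoding) {
    return encoding[0] == '\0' ? NULL : encoding;
}

/* Small self-check contrasting the two helpers. */
void demo(void) {
    /* Buggy check: a requested encoding is silently dropped. */
    assert(encoding_or_null_buggy("UTF-8") == NULL);
    assert(encoding_or_null_buggy("") == NULL);

    /* Fixed check: only the empty string maps to NULL. */
    assert(encoding_or_null_fixed("") == NULL);
    assert(strcmp(encoding_or_null_fixed("UTF-8"), "UTF-8") == 0);
}
```

Calling `demo()` from a `main()` should complete without tripping an assertion. If the parser receives NULL in place of "UTF-8", it has to fall back to some default encoding, which would be consistent with the garbled output in the reprex.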
