read_html not reading some UTF-8 characters properly from a file #293

wkmor1 · 2020-04-03T13:45:33Z

In version 1.3.0, some characters are not read properly from a file:

# xml2 1.2.5
tmp <- tempfile()
cat("’", file = tmp)
xml2::read_html(tmp, encoding = "UTF-8")
#> {html_document}
#> <html>
#> [1] <body><p>’</p></body>

# xml2 1.3.0
tmp <- tempfile()
cat("’", file = tmp)
xml2::read_html(tmp, encoding = "UTF-8")
#> {html_document}
#> <html>
#> [1] <body><p>â\u0080\u0099</p></body>
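The mangled output above is consistent with the three UTF-8 bytes of ’ (U+2019, encoded as E2 80 99) each being decoded as a separate single-byte character, the way ISO-8859-1 would decode them. A minimal C++ sketch of that misdecoding follows; `latin1_to_utf8` is a hypothetical helper written for illustration and is not part of xml2 or libxml2.

```cpp
#include <string>

// Hypothetical helper (not from xml2): treat every input byte as an
// ISO-8859-1 character, i.e. as the code point with the same value,
// and re-encode the result as UTF-8.
std::string latin1_to_utf8(const std::string& bytes) {
  std::string out;
  for (unsigned char b : bytes) {
    if (b < 0x80) {
      // ASCII bytes are encoded unchanged.
      out += static_cast<char>(b);
    } else {
      // Code points U+0080..U+00FF need two UTF-8 bytes.
      out += static_cast<char>(0xC0 | (b >> 6));
      out += static_cast<char>(0x80 | (b & 0x3F));
    }
  }
  return out;
}
```

Feeding it the bytes `"\xE2\x80\x99"` yields the UTF-8 encoding of U+00E2 (â), U+0080 and U+0099, which is the `â\u0080\u0099` sequence printed by the 1.3.0 reprex.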

hadley · 2020-04-03T20:37:44Z

This appears to be breaking pkgdown sites 😞

jayhesselberth · 2020-04-04T18:36:33Z

git bisect report using the reprex above, starting at 1.2.5 and ending at 1.3.0:

03adb8f5695383270a029128a112ad645bde2df3 is the first bad commit
commit 03adb8f5695383270a029128a112ad645bde2df3
Author: Jim Hester <james.f.hester@gmail.com>
Date:   Fri Mar 27 09:47:17 2020 -0400

    Convert doc_parse_file

 R/RcppExports.R     |  4 ----
 R/xml_parse.R       |  2 +-
 src/RcppExports.cpp | 14 --------------
 src/init.c          |  4 ++--
 src/xml2_doc.cpp    | 31 +++++++++++++++++++------------
 src/xml2_init.cpp   |  2 +-
 6 files changed, 23 insertions(+), 34 deletions(-)

This is the offending line:

xml2/src/xml2_doc.cpp

Line 197 in 1079c51

strncmp(encoding, "", 0) == 0 ? NULL : encoding,

Not sure what the strncmp is guarding against, but it seems encoding and NULL are swapped?

jimhester · 2020-04-04T19:10:59Z

Fixed by 6543857

jimhester · 2020-04-04T19:13:48Z

The issue was that strncmp(encoding, "anything", 0) is always == 0, because strncmp() compares at most n characters, 0 in this case.

Anyway, since C strings are NUL-terminated (and therefore always have at least one character), I can just check whether encoding[0] == '\0', which is what happens now.

Sorry for breaking this, I will likely do a release early next week to fix it.
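The difference between the two checks described in this comment can be sketched in isolation. The function names below are hypothetical and exist only for this illustration; the conditional expressions mirror the quoted line 197 and the `encoding[0] == '\0'` check mentioned as the fix, but this is not the actual xml2 source.

```cpp
#include <cstring>

// Hypothetical wrapper around the check quoted from line 197.
// With a length of 0, strncmp compares no characters and returns 0,
// so the condition is always true and nullptr is returned for every
// input, including a non-empty encoding such as "UTF-8".
const char* encoding_or_null_buggy(const char* encoding) {
  return std::strncmp(encoding, "", 0) == 0 ? nullptr : encoding;
}

// Hypothetical wrapper around the check described as the fix.
// A C string is empty exactly when its first character is the
// terminating NUL, so only "" maps to nullptr.
const char* encoding_or_null_fixed(const char* encoding) {
  return encoding[0] == '\0' ? nullptr : encoding;
}
```

Calling `encoding_or_null_buggy("UTF-8")` returns a null pointer, which would explain why the explicit `encoding = "UTF-8"` argument in the reprex had no effect, while `encoding_or_null_fixed("UTF-8")` returns the pointer it was given.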

Bisaloo mentioned this issue Apr 3, 2020

Broken characters with pkgdown 1.5.0 r-lib/pkgdown#1284

Closed

jimhester added the "bug" label (an unexpected problem or unintended behavior) Apr 3, 2020

wkmor1 mentioned this issue Apr 3, 2020

pkgdown does not recognize non-ASCII characters when deploying to Travis r-lib/pkgdown#1287

Closed

cjyetman mentioned this issue Apr 3, 2020

Fix encoding in website's home RMI-PACTA/r2dii.match#185

Closed

jayhesselberth mentioned this issue Apr 3, 2020

Fix syntax highlighting on windows r-lib/pkgdown#1288

Merged

jimhester closed this as completed Apr 4, 2020

jayhesselberth mentioned this issue Apr 8, 2020

unicode literals in README.md are mangled r-lib/pkgdown#1294

Closed
