Fix encoding in website's home #185

Closed
maurolepore opened this issue Apr 1, 2020 · 13 comments
Labels
bug an unexpected problem or unintended behavior

Comments

@maurolepore
Contributor

https://2degreesinvesting.github.io/r2dii.match/

Relates to https://github.com/2DegreesInvesting/r2dii.data/issues/36

@maurolepore maurolepore added the bug an unexpected problem or unintended behavior label Apr 1, 2020
@cjyetman
Member

cjyetman commented Apr 2, 2020

These are the characters that seem to get messed up (non-breaking space and curly quotes)...

Unicode code point   character   UTF-8 (hex.)   name
U+00A0               (space)     c2 a0          NO-BREAK SPACE
U+2018               ‘           e2 80 98       LEFT SINGLE QUOTATION MARK
U+2019               ’           e2 80 99       RIGHT SINGLE QUOTATION MARK

Seems like when README.Rmd gets processed into README.md, those characters are converted into something appropriate. But when README.md is converted to index.html, they get converted improperly.

So that points to the gh-pages process, where a virtual server is spun up and the pkgdown package does its magic. Will take some time to dig into that.

@cjyetman
Member

cjyetman commented Apr 2, 2020

Locally, I can run pkgdown::build_site() on my macOS system, and those characters come out properly in the index.html. So maybe something in the workflow file pkgdown.yaml needs to be adjusted to make sure the R environment there uses UTF-8?
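
One way to test that hunch would be to print the locale from inside the R session that builds the site. This is only a sketch using base R; where exactly to call it in the workflow is an assumption, not something taken from the actual pkgdown.yaml.

```r
# Report the character-type locale and whether R treats it as UTF-8
ctype <- Sys.getlocale("LC_CTYPE")
is_utf8 <- isTRUE(l10n_info()[["UTF-8"]])
message("LC_CTYPE: ", ctype, " | UTF-8 locale: ", is_utf8)
```

If this reports a UTF-8 locale on the CI runner, the locale is probably not the culprit and the problem is further down the toolchain.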

@maurolepore
Contributor Author

Thanks a lot! That gives me some ideas for fixing it. You've done more than enough.

Right now I build the site on GitHub Actions. I think the action used macOS and I changed it to Ubuntu. I see the problem locally on my Ubuntu machine. So I may fix it quickly by building on macOS, then think about a solution for Ubuntu.

@maurolepore
Contributor Author

We discussed this on Slack and these are some notes:

  • My comment above is wrong. The CI uses macOS, so the problem is elsewhere.
  • A quick fix might be to edit the .md directly. That still leaves the underlying problem that causes the .Rmd to render oddly.

@cjyetman
Member

cjyetman commented Apr 3, 2020

I did a bunch of hunting around and... I'm fairly certain this is caused by a bug in xml2::read_html which reads a string literal in UTF-8 correctly, but reads a file incorrectly...

# pkgdown::deploy_to_branch
#   ↳ pkgdown::build_site
#     ↳ pkgdown:::build_site_local
#       ↳ pkgdown::build_home
#         ↳ pkgdown:::build_home_md
#           ↳ pkgdown:::render_md
#             ↳ pkgdown:::markdown
#               ↳ xml2:::read_html
#                 ↳ xml2:::read_html.default
# Write a UTF-8 string to a temp file byte-for-byte, then read it back with base R
text <- "<body>brûlée 鬼 test 'stuff' and ‘PACTA’ 2°C &euro;</body>"
f <- tempfile()
utf8 <- enc2utf8(text)
con <- file(f, open = "w+", encoding = "native.enc")
writeLines(utf8, con = con, useBytes = TRUE)
close(con)
readLines(f, encoding = "UTF-8")
#> [1] "<body>brûlée 鬼 test 'stuff' and ‘PACTA’ 2°C &euro;</body>"
# Parsing the string literal works
xml2::read_html(text)
#> {html_document}
#> <html>
#> [1] <body>brûlée 鬼 test 'stuff' and ‘PACTA’ 2°C €</body>
# Parsing the same content from the file garbles the curly quotes
xml2::read_html(f)
#> {html_document}
#> <html>
#> [1] <body>brûlée 鬼 test 'stuff' and â\u0080\u0098PACTAâ\u0080\u0099 2°C  ...
unlink(f)

# Same garbling after a round trip through xml2::write_html
text <- "<body>brûlée 鬼 test 'stuff' and ‘PACTA’ 2°C &euro;</body>"
xml2::write_html(xml2::read_html(text), 'test.html')
xml2::read_html('test.html')
#> {html_document}
#> <html>
#> [1] <body>brûlée 鬼 test 'stuff' and â\u0080\u0098PACTAâ\u0080\u0099 2°C  ...
readLines('test.html', encoding = "UTF-8")
#> [1] "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">"
#> [2] "<html><body>brûlée 鬼 test 'stuff' and ‘PACTA’ 2°C €</body></html>"
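
The â\u0080\u0098 in the garbled output matches what you would get if the three UTF-8 bytes of ‘ (e2 80 98) were each decoded as a separate Latin-1 character. Whether that is exactly what happens inside xml2 is an assumption; this base R sketch only reproduces the same pattern without xml2.

```r
# The curly quotes around PACTA, encoded as UTF-8 bytes
original <- enc2utf8("\u2018PACTA\u2019")
bytes <- charToRaw(original)

# Reinterpret the same bytes as ISO-8859-1 (Latin-1) and convert to UTF-8.
# Each byte becomes its own character, so one quote turns into three.
garbled <- iconv(rawToChar(bytes), from = "ISO-8859-1", to = "UTF-8")

# The code points of the garbled string equal the original byte values
code_points <- utf8ToInt(garbled)
same_as_bytes <- identical(code_points, as.integer(bytes))

nchar(original)  # characters in the intended string
nchar(garbled)   # characters after the wrong decoding
```

The plain ASCII letters survive because their byte values mean the same thing in both encodings, which fits the observation that only the non-ASCII characters are affected.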

maurolepore added a commit that referenced this issue Apr 3, 2020
@cjyetman
Member

cjyetman commented Apr 3, 2020

and this is my pretty minimal reprex of this issue...

install.packages('usethis')
install.packages('pkgdown')
usethis::create_package(getwd(), fields = NULL, rstudio = FALSE, open = FALSE)

usethis::use_readme_md(open = FALSE)
write("brûlée 鬼 test 'stuff' and ‘PACTA’ 2°C &euro;", file = "README.md", append = TRUE)
usethis::use_pkgdown()
pkgdown::build_home(preview = TRUE)

On my macOS, the resulting index.html looks ok.
On a brand new RStudio Cloud instance, I get the same encoding garbling.
On Windows 10, I get this error...

> pkgdown::build_home(preview = TRUE)
-- Building home ---------------------------------------------------------------
  Writing 'authors.html'
UTF-8 decoding error in C:/Users/cjyetman/Documents/test4/README.md at byte offset 390 (fb).
The input must be a UTF-8 encoded text.
Error: pandoc document conversion failed with error 92
Error: [ENOENT] Failed to remove 'C:/Users/cjyetman/AppData/Local/Temp/RtmpSmRuLn/file71036f51e07.html': no such file or directory

🤷‍♂
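
A note on the Windows error: byte fb is "û" in Latin-1 and Windows-1252, whereas UTF-8 encodes "û" as c3 bb. That is consistent with write() having saved README.md in the native Windows encoding, though this is an assumption and not something verified on that machine. The sketch below checks both byte sequences with base R and writes a file as UTF-8 explicitly, mirroring the writeLines(..., useBytes = TRUE) approach from the earlier reprex.

```r
# "û" as a single Latin-1 byte versus its UTF-8 encoding
latin1_u <- rawToChar(as.raw(0xfb))
utf8_u <- enc2utf8("\u00fb")

latin1_is_valid <- validUTF8(latin1_u)  # the lone fb byte is not valid UTF-8
utf8_is_valid <- validUTF8(utf8_u)
utf8_bytes <- charToRaw(utf8_u)

# Write UTF-8 bytes unchanged, regardless of the native encoding
f <- tempfile(fileext = ".md")
con <- file(f, open = "w", encoding = "native.enc")
writeLines(enc2utf8("br\u00fbl\u00e9e"), con = con, useBytes = TRUE)
close(con)

# Read the raw bytes back and confirm the file is valid UTF-8
file_is_utf8 <- validUTF8(rawToChar(readBin(f, "raw", n = file.size(f))))
unlink(f)
```

If this holds, the Windows failure is a separate problem from the garbling seen on RStudio Cloud: pandoc rejects the input before xml2 ever reads it.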

@cjyetman
Member

cjyetman commented Apr 3, 2020

here's an even more minimal reprex that still mangles the '...' in the example README that it creates when run on RStudio Cloud...

usethis::create_package(getwd(), fields = NULL, rstudio = FALSE, open = FALSE)
usethis::use_readme_md(open = FALSE)
pkgdown::build_home(preview = TRUE)

@maurolepore
Contributor Author

That's awesome! Thanks CJ! Do you plan to open an issue in xml2?

(I'm reopening because I closed this unintentionally through sloppy use of the word "fix" in a commit message.)

@cjyetman
Member

cjyetman commented Apr 3, 2020

If you have the bandwidth to do it, please feel free to use this reprex. Not sure if/when I'll get around to it. Feel like I maxed out my time to screw around with this today for the next week or so. 😉

@maurolepore maurolepore reopened this Apr 3, 2020
@cjyetman
Member

cjyetman commented Apr 3, 2020

actually looks like it's a regression in xml2 v1.3.0 and it's already been reported...
r-lib/xml2#293
r-lib/pkgdown#1284
r-lib/pkgdown#1287

@maurolepore
Contributor Author

Best case scenario ;)

@cjyetman
Member

cjyetman commented Apr 9, 2020

Looks like this is probably fixed in r-lib/xml2@6543857

and it's on CRAN already as xml2 v1.3.1
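
To see whether a given machine already has the fix, something like the following could be run before building the site. It takes 1.3.1 as the first fixed release, as the comment above suggests; has_fix() is a hypothetical helper written for this note, not part of xml2 or pkgdown.

```r
# First release assumed to contain the fix
fixed_in <- package_version("1.3.1")

# Hypothetical helper: TRUE if `version` is at least the fixed release
has_fix <- function(version) {
  package_version(as.character(version)) >= fixed_in
}

# Only inspect the installed version if xml2 is actually available
if (requireNamespace("xml2", quietly = TRUE)) {
  installed <- utils::packageVersion("xml2")
  message("xml2 ", format(installed), " includes the fix: ", has_fix(installed))
}
```

On a machine still on v1.3.0, running install.packages("xml2") and rebuilding the site should be enough to confirm whether the garbling goes away.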

@maurolepore
Contributor Author


2 participants