Control character warning for weird column name in tibble in list #574

Closed
oloverm opened this issue Feb 7, 2019 · 9 comments

@oloverm

oloverm commented Feb 7, 2019

I've got a tibble where one of the column names is ComparedtoPHECentres(2015)valueorpercentiles (not my choice). If I print it within a list, and my console isn't wide enough to fit that column, I get a warning from fansi::strwrap_ctl(). It also prints the column name as �ÿþComparedtoPHECentres(2015)valueorpercentiles�ÿþ. I don't know if it's the length of the column name or the fact that it's got parentheses, or why it only happens if it's an element in a list.

library(pacman)

p_load(dplyr, fingertipsR)

df <- fingertips_data(91523, ParentAreaTypeID = 104) %>% 
    as_tibble()

list(df)
# [[1]]
# # A tibble: 972 x 26
#    IndicatorID IndicatorName ParentCode ParentName AreaCode AreaName
#          <int> <chr>         <chr>      <chr>      <chr>    <chr>   
#  1       91523 All new STI ~ NA         NA         E920000~ England 
#  2       91523 All new STI ~ E92000001  England    E450000~ London ~
#  3       91523 All new STI ~ E92000001  England    E450000~ West Mi~
#  4       91523 All new STI ~ E92000001  England    E450000~ North E~
#  5       91523 All new STI ~ E92000001  England    E450000~ Yorkshi~
#  6       91523 All new STI ~ E92000001  England    E450000~ East Mi~
#  7       91523 All new STI ~ E92000001  England    E450000~ East of~
#  8       91523 All new STI ~ E92000001  England    E450000~ North W~
#  9       91523 All new STI ~ E92000001  England    E450000~ South E~
# 10       91523 All new STI ~ E92000001  England    E450000~ South W~
# # ... with 962 more rows, and 20 more variables: AreaType <chr>,
# #   Sex <chr>, Age <chr>, CategoryType <chr>, Category <chr>,
# #   Timeperiod <chr>, Value <dbl>, LowerCI95.0limit <dbl>,
# #   UpperCI95.0limit <dbl>, LowerCI99.8limit <dbl>,
# #   UpperCI99.8limit <dbl>, Count <dbl>, Denominator <dbl>,
# #   Valuenote <chr>, RecentTrend <chr>,
# #   ComparedtoEnglandvalueorpercentiles <chr>,
# #   `�ÿþComparedtoPHECentres(2015)valueorpercentiles�ÿþ` <chr>,
# #   TimeperiodSortable <int>, Newdata <chr>, Comparedtogoal <chr>
# 
# Warning message:
# In fansi::strwrap_ctl(x, width = max(width, 0), indent = indent,  :
#   Encountered a C0 control character, see `?unhandled_ctl`; you can use `warn=FALSE` to turn off these warnings.
@oloverm
Author

oloverm commented Feb 11, 2019

Also, if there are multiple tibbles in the list with the same column names, the warning only shows up once, and that column name is printed normally for all but the first one:

image

@hadley
Member

hadley commented Mar 21, 2019

I think this is the same problem as tidyverse/dbplyr#223, and is probably a bug in base R.

@hlynurhallgrims

hlynurhallgrims commented Apr 30, 2019

I think this is possibly connected to this issue I came across on Stack Overflow, which I could only recreate using readxl.

When I try to recreate the SO issue by creating the tibble using tribble, I don't get the SO error (I only get that with tibbles created by reading from readxl). Instead, I get the same warning as @cucumberry here, but only if the console is too narrow to print all the columns.

my_tibble <- tibble::tribble(~good_column, ~'very bad\ncolumn', ~'terribly\nlong column name here', ~'more', ~'and then even', ~'more than that',
                             1, 2, 3, 4, 5, 6,
                             7, 8, 9, 10, 11, 12)
my_tibble
#> # A tibble: 2 x 6
#>   good_column `very bad\ncolu~ `terribly\nlong~  more `and then even`
#>         <dbl>            <dbl>            <dbl> <dbl>           <dbl>
#> 1           1                2                3     4               5
#> 2           7                8                9    10              11
#> # ... with 1 more variable: `more than that` <dbl>

list(my_tibble, my_tibble)
#> Warning message:
#> In fansi::strwrap_ctl(x, width = max(width, 0), indent = indent,  :
#> Encountered a C0 control character, see `?unhandled_ctl`; you can use `warn=FALSE` to turn off these warnings.

I should add that the above is not a reprex as rendering the reprex didn't render the error, no matter how wide or narrow the console and viewer panes were.

Much like the screenshot above from @cucumberry, the ÿþ mark also shows up in one of the column names in the first printing of the tibble, but not the second (See picture).

image

But on to the SO example I linked to above

The same bad characters result in a different error when the tibble in question is the result of being read in from Excel through readxl::read_excel(). Here's the link to the Excel file in question if anyone is interested.

Here I get a different error.

all_sheets <- readxl::excel_sheets(path = here::here("data", "Posti-Letto-Istat.xls"))

all_sheets %>% 
  purrr::map(.x = .,
             .f = ~readxl::read_excel(path  = here::here("data", "Posti-Letto-Istat.xls"),
                                      sheet = .x,
                                      skip  = 4))
#> [[1]]
#> Error in nchar(x[is_na], type = "width") : 
#>   invalid multibyte string, element 1

Of course any measure to get rid of the bad characters before printing the list of tibbles fixes this.

all_sheets <- readxl::excel_sheets(path = here::here("data", "Posti-Letto-Istat.xls"))

all_sheets %>% 
  purrr::map(.x = .,
             .f = ~readxl::read_excel(path  = here::here("data", "Posti-Letto-Istat.xls"),
                                      sheet = .x,
                                      skip  = 4)) %>% 
  purrr::map(janitor::clean_names)
# This prints just fine, obviously

Maybe I'm mistaken and it's not connected, but I figured I'd mention it if there's a chance that it is.

@krlmlr
Member

krlmlr commented Mar 21, 2020

@brodieG: What's the best way to deal with unsanitized user input (in the form of borked column names) for display? I'm happy with printing a demangled version and mentioning in the output that some of the names were mangled originally. Can I safely strip_sgr(warn = FALSE) and then compare if the names changed?

Also, when reviewing the wrapping we need to take a look at why column names with spaces distort the output, at least in RStudio, when they appear in the footer of a tibble (too many columns).

krlmlr added this to the 3.0.0 milestone Mar 21, 2020

@brodieG

brodieG commented Mar 21, 2020

You probably want strip_ctl(warn=FALSE) as strip_sgr will only do the formatting sequences, but otherwise it should be relatively safe to do as you suggest. This will not address the invalid multi-byte sequences mentioned above.

In re: warn=FALSE, keep in mind the warning is there b/c fansi does not understand the semantics of the control sequences in the context in question. So for strip_ctl it won't even warn for C0 control sequences because it knows they are one byte long and can strip them without caring about what they do to cursor position, etc., but it will warn for malformed (or correctly formed but unsupported) CSI sequences because it doesn't necessarily know where they end and might be stripping stuff it should not or not stripping stuff it should.

So in short strip_ctl(., warn=FALSE) and compare output is probably fine, or even strip_ctl(., warn=FALSE, ctl=c('all', 'sgr', 'nl')) if you want to allow the known controls [1].
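
A minimal sketch of the strip-and-compare check described above, assuming the fansi package is installed. The helper name `flag_mangled_names` and the sample names are hypothetical and only for illustration; this is not how tibble actually handles the problem.

```r
library(fansi)

# Hypothetical helper: strip all control sequences from a vector of names
# and report which names were changed by the stripping.
flag_mangled_names <- function(nms) {
  # ctl = "all" (the default) removes newlines, other C0 controls,
  # SGR/CSI sequences and other escapes; warn = FALSE silences warnings
  clean <- fansi::strip_ctl(nms, warn = FALSE)
  data.frame(
    original = nms,
    clean = clean,
    changed = clean != nms,
    stringsAsFactors = FALSE
  )
}

# Illustrative input: a plain name, a name with a newline, and a name
# wrapped in an SGR colour sequence
flag_mangled_names(c("good_column", "very bad\ncolumn", "\033[31mred\033[0m"))
```

A caller could then print the `clean` names and mention in the output that the rows with `changed == TRUE` were altered for display. As noted above, this does not address invalid multi-byte sequences.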

In re: spaces in footer, is this something fansi related? If so, could you give me an example, I skimmed the thread and couldn't quite tell what you were referencing.

Footnotes

  1. this will leave in unknown but syntactically valid SGR sequences, which then later may cause other functions to emit warnings.

@krlmlr
Member

krlmlr commented Mar 21, 2020

Thanks. I don't think that column names should contain any controls -- will proceed.

Related to names with spaces, the following is an example where the tibble is too wide to fit one line and the "with ... more variables" is shown in the footer. Names are wrapped, and the first name in each footer row is printed badly. Not sure whose responsibility this is. (The SGR codes are stripped in the reprex, I can replicate in a terminal and in RStudio.)

library(tidyverse)

N <- 16
data <- tibble(letter = letters[1:N], i = 1:N)

cross <- crossing(data, j = 1:N)

row <-
  cross %>%
  filter(i >= j) %>%
  group_by(j) %>%
  summarize(name = paste(letter, collapse = " ")) %>%
  ungroup() %>%
  select(name, j) %>%
  deframe()

tbl <- tibble(!!!row)

options(crayon = TRUE)
fmt <- format(tbl)
fmt
#> [1] "# A tibble: 1 x 16"                                                                                                                                                                                                                                                                                             
#> [2] "  `a b c d e f g … `b c d e f g h … `c d e f g h i … `d e f g h i j …"                                                                                                                                                                                                                                          
#> [3] "             <int>            <int>            <int>            <int>"                                                                                                                                                                                                                                          
#> [4] "1                1                2                3                4"                                                                                                                                                                                                                                          
#> [5] "# … with 12 more variables: `e f g h i j k l m n o p` <int>, `f g h i j k l m n\n#   o p` <int>, `g h i j k l m n o p` <int>, `h i j k l m n o p` <int>, `i j k\n#   l m n o p` <int>, `j k l m n o p` <int>, `k l m n o p` <int>, `l m n o\n#   p` <int>, `m n o p` <int>, `n o p` <int>, `o p` <int>, p <int>"
cat(fmt, sep = "\n")
#> # A tibble: 1 x 16
#>   `a b c d e f g … `b c d e f g h … `c d e f g h i … `d e f g h i j …
#>              <int>            <int>            <int>            <int>
#> 1                1                2                3                4
#> # … with 12 more variables: `e f g h i j k l m n o p` <int>, `f g h i j k l m n
#> #   o p` <int>, `g h i j k l m n o p` <int>, `h i j k l m n o p` <int>, `i j k
#> #   l m n o p` <int>, `j k l m n o p` <int>, `k l m n o p` <int>, `l m n o
#> #   p` <int>, `m n o p` <int>, `n o p` <int>, `o p` <int>, p <int>

Created on 2020-03-21 by the reprex package (v0.3.0)

@brodieG

brodieG commented Mar 21, 2020

Ah, I see. strwrap_ctl has no concept of words beyond white-space delimited tokens. There is no parsing of strings to detect quoted tokens or anything of the sort. This is the same as with strwrap. One way to solve it might be to replace the column names with equal length, space-less strings, wrap that, compute the lengths of the resulting strings, and substring the original based on those lengths.
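
The masking idea above could be sketched roughly as follows, using base R's strwrap for simplicity. The helper name `wrap_keep_tokens` is hypothetical, and the sketch assumes the tokens are joined with ", " and contain no ANSI sequences (otherwise the character positions would have to be computed differently).

```r
# Hypothetical helper (not part of fansi or tibble): wrap a comma-separated
# list of tokens without breaking inside a token that contains spaces.
wrap_keep_tokens <- function(tokens, width = 40) {
  # Original text, and a same-length copy with in-token whitespace masked
  orig   <- paste(tokens, collapse = ", ")
  masked <- paste(gsub("[[:space:]]", "_", tokens), collapse = ", ")

  # strwrap() can now only break at the single spaces between tokens
  lines <- strwrap(masked, width = width)

  # Each line break consumed exactly one space, so line i + 1 starts
  # nchar(line i) + 1 characters after the start of line i
  n      <- nchar(lines, type = "chars")
  starts <- cumsum(c(1, head(n + 1, -1)))

  # Cut the original (unmasked) text at the same positions
  substring(orig, starts, starts + n - 1)
}

# Illustrative footer entries, similar to the ones in the example above
tokens <- c("`e f g h i j k l m n o p` <int>", "`f g h i j k l m n o p` <int>",
            "`g h i` <int>", "p <int>")
cat(wrap_keep_tokens(tokens, width = 40), sep = "\n")
```

Because the mask replaces each whitespace character with exactly one other character, the wrapped line lengths map directly back onto the original string. A token longer than the width still ends up on a line of its own rather than being split.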

@krlmlr
Member

krlmlr commented Mar 28, 2020

I can't replicate the original problem in R 3.6.3.

krlmlr closed this as completed Mar 28, 2020

@github-actions
Contributor

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.

github-actions bot locked and limited conversation to collaborators Mar 29, 2021