Cannot read CSV or Parquet with dplyr::tbl #133

cy-james-lee · 2024-03-29T17:24:58Z

Hello, it seems that the simple operation is not working as shown on the manual: https://duckdb.org/docs/api/r#dbplyr

Here is my reprex for both formats:

library(duckdb)
#> Loading required package: DBI
conn_duck <- dbConnect(duckdb())
arrow::write_parquet(mtcars, "~/mtcars.parquet")
dplyr::tbl(conn_duck, "~/mtcars.parquet")
#> Error in `db_query_fields.DBIConnection()`:
#> ! Can't query fields.
#> ℹ Using SQL: SELECT * FROM (FROM ~/mtcars.parquet) q01 WHERE (0 = 1)
#> Caused by error:
#> ! {"exception_type":"Parser","exception_message":"syntax error at or near \"~/\"","position":"20","error_subtype":"SYNTAX_ERROR"}

write.csv(mtcars, "~/mtcars.csv")
dplyr::tbl(conn_duck, '~/mtcars.csv')
#> Error in `db_query_fields.DBIConnection()`:
#> ! Can't query fields.
#> ℹ Using SQL: SELECT * FROM (FROM ~/mtcars.csv) q02 WHERE (0 = 1)
#> Caused by error:
#> ! {"exception_type":"Parser","exception_message":"syntax error at or near \"~/\"","position":"20","error_subtype":"SYNTAX_ERROR"}

^{Created on 2024-03-29 with reprex v2.1.0}

krlmlr · 2024-03-29T22:29:59Z

Thanks. Can you please try path_expand("~/...") or normalizePath("~/...") in the dpyr::tbl() call?

cy-james-lee · 2024-04-03T14:58:11Z

@krlmlr I did try with absolute paths. It still didn't work. However, I found out that duckdb was expecting single quote around the file paths. If I do sprintf("'%s'", filepath), it works.

library(duckdb)
#> Loading required package: DBI
#> Loading required package: DBI
conn_duck <- dbConnect(duckdb())
file_pq <- path.expand("~/mtcars.parquet")
arrow::write_parquet(mtcars, file_pq)
dplyr::tbl(conn_duck, file_pq)
#> Error in `collect()`:
#> ! Failed to collect lazy table.
#> Caused by error:
#> ! rapi_prepare: Failed to prepare query SELECT "C:/Users/user1/Documents/mtcars.parquet".*
#> FROM "C:/Users/user1/Documents/mtcars.parquet"
#> LIMIT 11
#> Error: Binder Error: Referenced table "C:/Users/user1/Documents/mtcars.parquet" not found!
#> Candidate tables: "mtcars"
dplyr::tbl(conn_duck, sprintf("'%s'", file_pq))
#> # Source:   SQL [?? x 11]
#> # Database: DuckDB v0.10.0 [user1@Windows 10 x64:R 4.3.3/:memory:]
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # ℹ more rows

file_csv <- path.expand("~/mtcars.csv")
write.csv(mtcars, file_csv)
dplyr::tbl(conn_duck, file_csv)
#> Error in `collect()`:
#> ! Failed to collect lazy table.
#> Caused by error:
#> ! rapi_prepare: Failed to prepare query SELECT "C:/Users/user1/Documents/mtcars.csv".*
#> FROM "C:/Users/user1/Documents/mtcars.csv"
#> LIMIT 11
#> Error: Binder Error: Referenced table "C:/Users/user1/Documents/mtcars.csv" not found!
#> Candidate tables: "mtcars"
dplyr::tbl(conn_duck, sprintf("'%s'", file_csv))
#> # Source:   SQL [?? x 12]
#> # Database: DuckDB v0.10.0 [user1@Windows 10 x64:R 4.3.3/:memory:]
#>    column00      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Mazda RX4    21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2 Mazda RX4 …  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3 Datsun 710   22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4 Hornet 4 D…  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5 Hornet Spo…  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6 Valiant      18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7 Duster 360   14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8 Merc 240D    24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9 Merc 230     22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10 Merc 280     19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # ℹ more rows

Created on 2024-04-03 with reprex v2.1.0

krlmlr · 2024-04-05T13:08:48Z

Thanks, good catch! Do we want to add this to the documentation?

cy-james-lee · 2024-04-05T14:32:14Z

Wouldn't it make more sense to update/add function to handle these file paths?

This seems like a package design decision. What I'm understanding is the package handles files in the same directory differently than other directories.

If a file in the same directory is provided, dplyr::tbl automatically adds single quotes.
example: dplyr::tbl(conn_duck, "mtcars.parquet")

However, for other absolute file paths, the user needs to manually add single quotes.
example: dplyr::tbl(conn_duck, "'C:/Users/user1/Documents/mtcars.parquet'")

krlmlr · 2024-04-05T14:51:34Z

duckdb 0.10.1 has tbl_file() that should work out of the box, totally forgot about that. 🙃

No quotes should be needed with that function.

cy-james-lee · 2024-04-05T14:53:23Z

I was just writing to let you know that I didn't know about the tbl_file function. Was there a reason why tbl_file and tbl_query functions needed to be created separately instead of implementing them into tbl?

krlmlr · 2024-04-05T14:54:40Z

tbl() is designed to work with table names or queries. Parquet or CSV files are neither. The new functions help disambiguate this and be explicit about the intent.

thisisnic · 2024-06-09T09:31:15Z

I ended up here as I was going to report a bug but realised my issue is a duplicate of this one (i.e. I should be using tbl_file() instead of tbl()). Not sure if this reprex adds anything to the current discussion but adding it just in case it's helpful - it's hard to work out that the wrong function is the source of the problem here, as there are some circumstances in which it works and others in which it doesn't.

library(duckdb)
#> Loading required package: DBI
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

tf <- tempfile(fileext = ".csv")
readr::write_csv(mtcars, tf)

con <- dbConnect(duckdb())

# filter without duckdb
readr::read_csv(tf) |>
  filter(hp > 250)
#> Rows: 32 Columns: 11
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (11): mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 2 × 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  15.8     8   351   264  4.22  3.17  14.5     0     1     5     4
#> 2  15       8   301   335  3.54  3.57  14.6     0     1     5     8

# raises an error
tbl(con, tf) |>
  filter(hp > 250) |>
  collect()
#> Error in `collect()`:
#> ! Failed to collect lazy table.
#> Caused by error:
#> ! rapi_prepare: Failed to prepare query SELECT "/tmp/Rtmp8d9qdv/file7c7e83242c8f2.csv".*
#> FROM "/tmp/Rtmp8d9qdv/file7c7e83242c8f2.csv"
#> WHERE (hp > 250.0)
#> Error: Binder Error: Referenced table "/tmp/Rtmp8d9qdv/file7c7e83242c8f2.csv" not found!
#> Candidate tables: "file7c7e83242c8f2"

# works when we do a no-op mutate before the filter
tbl(con, tf) |>
  mutate(hp = hp + 0) |>
  filter(hp > 250) |>
  collect()
#> # A tibble: 2 × 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  15.8     8   351   264  4.22  3.17  14.5     0     1     5     4
#> 2  15       8   301   335  3.54  3.57  14.6     0     1     5     8

eitsupi · 2024-06-11T16:00:38Z

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cannot read CSV or Parquet with dplyr::tbl #133

Cannot read CSV or Parquet with dplyr::tbl #133

cy-james-lee commented Mar 29, 2024

krlmlr commented Mar 29, 2024

cy-james-lee commented Apr 3, 2024

krlmlr commented Apr 5, 2024

cy-james-lee commented Apr 5, 2024

krlmlr commented Apr 5, 2024

cy-james-lee commented Apr 5, 2024

krlmlr commented Apr 5, 2024

thisisnic commented Jun 9, 2024 •

edited

Loading

eitsupi commented Jun 11, 2024

krlmlr commented Aug 16, 2024

Cannot read CSV or Parquet with dplyr::tbl #133

Cannot read CSV or Parquet with dplyr::tbl #133

Comments

cy-james-lee commented Mar 29, 2024

krlmlr commented Mar 29, 2024

cy-james-lee commented Apr 3, 2024

krlmlr commented Apr 5, 2024

cy-james-lee commented Apr 5, 2024

krlmlr commented Apr 5, 2024

cy-james-lee commented Apr 5, 2024

krlmlr commented Apr 5, 2024

thisisnic commented Jun 9, 2024 • edited Loading

eitsupi commented Jun 11, 2024

krlmlr commented Aug 16, 2024

thisisnic commented Jun 9, 2024 •

edited

Loading