Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot read CSV or Parquet with dplyr::tbl #133

Closed
cy-james-lee opened this issue Mar 29, 2024 · 10 comments · Fixed by #211
Closed

Cannot read CSV or Parquet with dplyr::tbl #133

cy-james-lee opened this issue Mar 29, 2024 · 10 comments · Fixed by #211

Comments

@cy-james-lee
Copy link

Hello, it seems that the simple operation is not working as shown on the manual: https://duckdb.org/docs/api/r#dbplyr

Here is my reprex for both formats:

library(duckdb)
#> Loading required package: DBI
conn_duck <- dbConnect(duckdb())
arrow::write_parquet(mtcars, "~/mtcars.parquet")
dplyr::tbl(conn_duck, "~/mtcars.parquet")
#> Error in `db_query_fields.DBIConnection()`:
#> ! Can't query fields.
#> ℹ Using SQL: SELECT * FROM (FROM ~/mtcars.parquet) q01 WHERE (0 = 1)
#> Caused by error:
#> ! {"exception_type":"Parser","exception_message":"syntax error at or near \"~/\"","position":"20","error_subtype":"SYNTAX_ERROR"}

write.csv(mtcars, "~/mtcars.csv")
dplyr::tbl(conn_duck, '~/mtcars.csv')
#> Error in `db_query_fields.DBIConnection()`:
#> ! Can't query fields.
#> ℹ Using SQL: SELECT * FROM (FROM ~/mtcars.csv) q02 WHERE (0 = 1)
#> Caused by error:
#> ! {"exception_type":"Parser","exception_message":"syntax error at or near \"~/\"","position":"20","error_subtype":"SYNTAX_ERROR"}

Created on 2024-03-29 with reprex v2.1.0

@krlmlr
Copy link
Collaborator

krlmlr commented Mar 29, 2024

Thanks. Can you please try path_expand("~/...") or normalizePath("~/...") in the dpyr::tbl() call?

@cy-james-lee
Copy link
Author

@krlmlr I did try with absolute paths. It still didn't work. However, I found out that duckdb was expecting single quote around the file paths. If I do sprintf("'%s'", filepath), it works.

library(duckdb)
#> Loading required package: DBI
#> Loading required package: DBI
conn_duck <- dbConnect(duckdb())
file_pq <- path.expand("~/mtcars.parquet")
arrow::write_parquet(mtcars, file_pq)
dplyr::tbl(conn_duck, file_pq)
#> Error in `collect()`:
#> ! Failed to collect lazy table.
#> Caused by error:
#> ! rapi_prepare: Failed to prepare query SELECT "C:/Users/user1/Documents/mtcars.parquet".*
#> FROM "C:/Users/user1/Documents/mtcars.parquet"
#> LIMIT 11
#> Error: Binder Error: Referenced table "C:/Users/user1/Documents/mtcars.parquet" not found!
#> Candidate tables: "mtcars"
dplyr::tbl(conn_duck, sprintf("'%s'", file_pq))
#> # Source:   SQL [?? x 11]
#> # Database: DuckDB v0.10.0 [user1@Windows 10 x64:R 4.3.3/:memory:]
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # ℹ more rows

file_csv <- path.expand("~/mtcars.csv")
write.csv(mtcars, file_csv)
dplyr::tbl(conn_duck, file_csv)
#> Error in `collect()`:
#> ! Failed to collect lazy table.
#> Caused by error:
#> ! rapi_prepare: Failed to prepare query SELECT "C:/Users/user1/Documents/mtcars.csv".*
#> FROM "C:/Users/user1/Documents/mtcars.csv"
#> LIMIT 11
#> Error: Binder Error: Referenced table "C:/Users/user1/Documents/mtcars.csv" not found!
#> Candidate tables: "mtcars"
dplyr::tbl(conn_duck, sprintf("'%s'", file_csv))
#> # Source:   SQL [?? x 12]
#> # Database: DuckDB v0.10.0 [user1@Windows 10 x64:R 4.3.3/:memory:]
#>    column00      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Mazda RX4    21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2 Mazda RX4 …  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3 Datsun 710   22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4 Hornet 4 D…  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5 Hornet Spo…  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6 Valiant      18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7 Duster 360   14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8 Merc 240D    24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9 Merc 230     22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10 Merc 280     19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # ℹ more rows

Created on 2024-04-03 with reprex v2.1.0

@krlmlr
Copy link
Collaborator

krlmlr commented Apr 5, 2024

Thanks, good catch! Do we want to add this to the documentation?

@cy-james-lee
Copy link
Author

Wouldn't it make more sense to update/add function to handle these file paths?

This seems like a package design decision. What I'm understanding is the package handles files in the same directory differently than other directories.

If a file in the same directory is provided, dplyr::tbl automatically adds single quotes.
example: dplyr::tbl(conn_duck, "mtcars.parquet")

However, for other absolute file paths, the user needs to manually add single quotes.
example: dplyr::tbl(conn_duck, "'C:/Users/user1/Documents/mtcars.parquet'")

@krlmlr
Copy link
Collaborator

krlmlr commented Apr 5, 2024

duckdb 0.10.1 has tbl_file() that should work out of the box, totally forgot about that. 🙃

No quotes should be needed with that function.

@cy-james-lee
Copy link
Author

I was just writing to let you know that I didn't know about the tbl_file function. Was there a reason why tbl_file and tbl_query functions needed to be created separately instead of implementing them into tbl?

@krlmlr
Copy link
Collaborator

krlmlr commented Apr 5, 2024

tbl() is designed to work with table names or queries. Parquet or CSV files are neither. The new functions help disambiguate this and be explicit about the intent.

@thisisnic
Copy link

thisisnic commented Jun 9, 2024

I ended up here as I was going to report a bug but realised my issue is a duplicate of this one (i.e. I should be using tbl_file() instead of tbl()). Not sure if this reprex adds anything to the current discussion but adding it just in case it's helpful - it's hard to work out that the wrong function is the source of the problem here, as there are some circumstances in which it works and others in which it doesn't.

library(duckdb)
#> Loading required package: DBI
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

tf <- tempfile(fileext = ".csv")
readr::write_csv(mtcars, tf)

con <- dbConnect(duckdb())

# filter without duckdb
readr::read_csv(tf) |>
  filter(hp > 250)
#> Rows: 32 Columns: 11
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (11): mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 2 × 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  15.8     8   351   264  4.22  3.17  14.5     0     1     5     4
#> 2  15       8   301   335  3.54  3.57  14.6     0     1     5     8

# raises an error
tbl(con, tf) |>
  filter(hp > 250) |>
  collect()
#> Error in `collect()`:
#> ! Failed to collect lazy table.
#> Caused by error:
#> ! rapi_prepare: Failed to prepare query SELECT "/tmp/Rtmp8d9qdv/file7c7e83242c8f2.csv".*
#> FROM "/tmp/Rtmp8d9qdv/file7c7e83242c8f2.csv"
#> WHERE (hp > 250.0)
#> Error: Binder Error: Referenced table "/tmp/Rtmp8d9qdv/file7c7e83242c8f2.csv" not found!
#> Candidate tables: "file7c7e83242c8f2"

# works when we do a no-op mutate before the filter
tbl(con, tf) |>
  mutate(hp = hp + 0) |>
  filter(hp > 250) |>
  collect()
#> # A tibble: 2 × 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  15.8     8   351   264  4.22  3.17  14.5     0     1     5     4
#> 2  15       8   301   335  3.54  3.57  14.6     0     1     5     8

@eitsupi
Copy link
Contributor

eitsupi commented Jun 11, 2024

See also #38

I would recommend using something like the read_csv() function based on the standard DuckDB functions.

tbl(con, "read_csv('mtcars.csv')")

So, if we understand that this is embedded in SQL, we can immediately imagine that other functions such as delta_scan() can be used here as well.

@krlmlr
Copy link
Collaborator

krlmlr commented Aug 16, 2024

Thanks. I renamed tbl_query() to tbl_function() and added more examples to the docs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants