
consider using disk.frame as a backend to dbplyr #163

Open
kendonB opened this issue Sep 12, 2019 · 8 comments
Labels: enhancement, revenue (Potential business model and revenue generating models)

Comments

kendonB commented Sep 12, 2019

disk.frame seems a natural fit as an alternative backend for dbplyr. Have you considered using the infrastructure you've set up this way? The disk.frame package would then act like the database access package, and all of the front-end interaction would go through dplyr.

The interaction for the user would be substantially the same but the benefits would be:

  1. Wider adoption
  2. Support from tidyverse developers
  3. A tried-and-true, standardised interface

https://db.rstudio.com/dplyr/
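For readers unfamiliar with the pattern being proposed, a minimal sketch of the standardised dbplyr interface, using RSQLite as the backend (the table and column names are just mtcars for illustration):

```r
library(DBI)
library(dplyr)

# Open an in-memory SQLite database and copy a data frame into it
con <- dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars, "mtcars")

# The same dplyr verbs work against the remote table; dbplyr
# translates them to SQL, and the query only runs on collect()
result <- tbl(con, "mtcars") %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

print(result)
dbDisconnect(con)
```

The proposal is that disk.frame could sit where RSQLite sits here, with dplyr as the shared front end.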

@kendonB kendonB changed the title consider remaking disk.frame as a backend to dbplyr consider using disk.frame as a backend to dbplyr Sep 12, 2019
@xiaodaigh xiaodaigh added enhancement revenue Potential business model and revenue generating models labels Sep 12, 2019
@xiaodaigh (Collaborator)

Thanks for bringing this to my attention. Are you saying that we should set up disk.frame like a remote server database? That could be interesting.

I probably need time to digest this. Some quick thoughts:

  • The attraction of disk.frame for me is the minimal setup required, which is install.packages("disk.frame"). So setting up a remote database might run counter to that. But it's worth exploring, because disk.frame allows you to run native R functions. A managed disk.frame service in the cloud could be a valid business idea.
  • dbplyr is used to connect to databases and expose dplyr verbs for easy manipulation. For this aspect, disk.frame already exposes dplyr verbs, so for the use-case where no remote server is involved, it is pretty similar.
  • I want to introduce a disk.frame.db package where group_by is done in one step rather than over two stages. So unless a remote server is involved, there is minimal advantage for disk.frame.
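For comparison, the existing non-dbplyr interface already looks like this (a sketch; the exact chunk and grouping semantics depend on the disk.frame version):

```r
library(disk.frame)
library(dplyr)
setup_disk.frame()  # start the parallel workers

# Convert an in-memory data frame into an on-disk, chunked disk.frame
df <- as.disk.frame(mtcars)

# dplyr verbs are applied chunk-wise; collect() brings results back
df %>%
  filter(mpg > 20) %>%
  select(mpg, cyl, wt) %>%
  collect()
```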

kendonB (Author) commented Sep 12, 2019

The database wouldn't necessarily need to be remote to work with dbplyr, I believe. The advantage would be that you wouldn't have to maintain the dplyr interface.

@xiaodaigh (Collaborator)

you wouldn't have to maintain the dplyr interface

I think you do, actually. This is because dbplyr translates the dplyr verbs into SQL (or whichever query language the DB uses). This translation step doesn't happen automatically; it needs to be written by hand and customised for each DB. So there would need to be customisation for disk.frame as well. Unless my understanding of dbplyr is all wrong.
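The translation step described above can be seen directly with dbplyr's translate_sql() (here simulating a generic backend; each real database backend can override these translations for its own dialect):

```r
library(dbplyr)

# dbplyr turns an R expression into SQL text for a given backend;
# simulate_dbi() stands in for a real connection
sql <- translate_sql(mean(x, na.rm = TRUE), con = simulate_dbi())
print(sql)
```

A disk.frame backend would need an equivalent layer, or, as discussed below, it could skip translation entirely by executing the verbs on data.frames directly.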

Unless it's remote, the attraction isn't there, because I want setting up a disk.frame to be minimal hassle. A managed disk.frame.db would be really nice for the occasional large-data workload, or as part of a pipeline.

kendonB (Author) commented Sep 12, 2019

I also don't have much understanding of dbplyr, but this page: https://db.rstudio.com/best-practices/select-interface/ seems to suggest that the translation happens through a "consistent set of functions that work across all connections" provided via the DBI package.
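That "consistent set of functions" is the DBI generics: a backend implements them once, and every DBI-based tool works on top. A sketch using RSQLite (any DBI backend would accept the same calls):

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)

# The same generics (dbListTables, dbGetQuery, ...) work for any
# DBI backend; only the connection object differs
print(dbListTables(con))
counts <- dbGetQuery(con, "SELECT cyl, COUNT(*) AS n FROM mtcars GROUP BY cyl")
print(counts)

dbDisconnect(con)
```

Note, though, that DBI itself only passes SQL strings through; the verb-to-SQL translation lives in dbplyr, as discussed in the next comment.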

@xiaodaigh (Collaborator)

From what I can see, DBI can send SQL statements as strings to databases.


It doesn't use dplyr, so I deduce that dbplyr must translate the dplyr verbs into SQL strings, as seems to be the case in the post.

Currently, disk.frame doesn't support SQL (and I don't plan to add it), so there would be some minor maintenance work to "translate" the dplyr verbs. The great thing is that disk.frame just uses data.frames, so no real translation is necessary: all I need to do is faithfully execute the dplyr verbs as defined. The non-dbplyr version of disk.frame can already do this.

So I think a disk.frame for dbplyr makes the most sense as a remote database.

kmishra9 commented May 8, 2020

While I think a disk.frame db would be great, I'm curious how it would be different from setting up a SQLite database (which is fairly reasonable and has very few dependencies). In particular, I think the latter option would probably have much higher peak performance due to indexing.

@xiaodaigh (Collaborator)

I'm curious how it would be different from setting up a SQLite database

With disk.frame you can apply native R functions efficiently, but SQLite is more limited in that regard. But you can test out the speed? Or show me a dataset and a set of operations in SQLite, and I can compare the speed.
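The "native R functions" point is the key difference: with disk.frame an arbitrary R function can be mapped over the on-disk chunks, which a SQL engine like SQLite cannot do without extensions. A sketch using cmap() (the function applied here is just illustrative; note it operates within each chunk, not globally):

```r
library(disk.frame)
setup_disk.frame()

df <- as.disk.frame(mtcars)

# Apply an arbitrary R function to each chunk of the disk.frame
scaled <- cmap(df, function(chunk) {
  chunk$mpg_z <- scale(chunk$mpg)[, 1]  # z-score within the chunk
  chunk
})
collect(scaled)
```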

kmishra9 commented Feb 16, 2021

I actually agree that disk.frame as a dbplyr backend makes sense and would be about as performant as possible when using custom R functions (SQLite is definitely more limited and slightly more of a headache to manage)! For peak performance, SQLite might still win, so there's just a tradeoff for folks to consider (similar to data.table vs dplyr!)
