skimr against a database #358
Comments
Hi JD! This is something that we could eventually support. It would be easier to do in v2, which has a simpler codebase. I'm currently on pat leave, and I had hoped to spend some time on skimr, but it's going slowly. The last time I talked to @elinw, we were targeting a v2 release at the beginning of next year. After that, we can start fleshing out features like this. Thanks!
Congratulations on the new Mini Mighty Quinn! What I'm hearing you say is that if I want to hack around the edges on this I should be using the I'm going to also look at the code base for
There may be some opportunities to expand on those functions by passing multiple calculations in one query, instead of one query per field in a given view or table. Let me know if you'd like me to lend a hand on this one.
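As a rough illustration of the "multiple calculations in one query" idea, here is a minimal sketch using dplyr and dbplyr. It assumes an in-memory SQLite database as a stand-in for a real backend, and the table name `cars` and the chosen statistics are made up for the example; it is not how skimr or any existing helper functions are implemented.

```r
library(dplyr)
library(dbplyr)

# Assumption: in-memory SQLite stands in for a real database backend.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "cars", mtcars[, c("mpg", "hp", "wt")])
cars_db <- tbl(con, "cars")

# A single summarise() covers every column, so dbplyr builds one SELECT
# instead of one query per field.
one_query <- cars_db %>%
  summarise(across(
    everything(),
    list(
      mean = ~ mean(.x, na.rm = TRUE),
      min = ~ min(.x, na.rm = TRUE),
      max = ~ max(.x, na.rm = TRUE)
    )
  ))

show_query(one_query)             # inspect the generated SQL
wide_stats <- collect(one_query)  # one row, one column per field/statistic pair

DBI::dbDisconnect(con)
```

The result comes back as a single wide row (columns such as `mpg_mean` or `hp_max`), which would still need reshaping in R into one row per variable.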
That's a grand idea, Edgar. I think refactoring those to do multiple fields would be a helpful step. Any help is always appreciated. Plus I'll learn a few things along the way.
I just merged updates into the master and develop branches so that skimr will work with dplyr 0.8.0. I'm going to see how it goes with v2, but I'd suggest that if you are working on this idea, you pull down the rc branch for dplyr and work from there.
Now the updates are in the v2 branch.
I've been looking at this a tiny bit. Right now this works
Because Then you'd want to have skimmers that are for databases, so limited to what can be done there. I think we would likely have to do, or require users to do, some preprocessing. For example, we'd need the variable names and the types. You can get those with a prior round trip to the database, while keeping it as generic as possible (supporting multiple databases means that queries that are too specific become a problem).
Then we use that with our skimmers to go do the work of calculating the statistics. Then back to R to update the resulting data frame, and on to the next.
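The two-step flow described above (one round trip for names and types, then per-column statistics computed in the database) could look roughly like the sketch below. Everything here is an assumption for illustration: the in-memory SQLite backend, the table `items`, and the helper `skim_one()` are hypothetical and not part of skimr. It pulls a single row to learn the R classes, since that only relies on generic `LIMIT`-style behaviour.

```r
library(dplyr)
library(dbplyr)

# Assumption: in-memory SQLite stands in for a real database backend.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
items <- data.frame(
  id = 1:4,
  price = c(1.5, 2, NA, 4),
  label = c("a", "b", "b", NA),
  stringsAsFactors = FALSE
)
DBI::dbWriteTable(con, "items", items)
items_db <- tbl(con, "items")

# Round trip 1: fetch one row just to learn column names and R classes.
proto <- collect(head(items_db, 1))
col_types <- vapply(proto, function(x) class(x)[1], character(1))

# Hypothetical helper: pick database-friendly statistics by column type.
skim_one <- function(tbl_db, col, type) {
  v <- rlang::sym(col)
  if (type %in% c("integer", "numeric")) {
    out <- summarise(
      tbl_db,
      n_missing = sum(ifelse(is.na(!!v), 1L, 0L), na.rm = TRUE),
      mean = mean(!!v, na.rm = TRUE)
    )
  } else {
    out <- summarise(
      tbl_db,
      n_missing = sum(ifelse(is.na(!!v), 1L, 0L), na.rm = TRUE),
      n_unique = n_distinct(!!v)
    )
  }
  # Round trip 2..n: compute in the database, then label the row in R.
  mutate(collect(out), variable = col, type = type, .before = 1)
}

result <- bind_rows(Map(
  function(col, type) skim_one(items_db, col, type),
  names(col_types), col_types
))

DBI::dbDisconnect(con)
```

This issues one query per column, matching the "back to R ... and on to the next" description; batching columns of the same type into one query would cut the number of round trips.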
I'd like to upvote this suggestion/issue, if such a thing is possible. This is a fabulous package. If it were more database compatible, the improvement in speed would make it that much better. For the particular database I'm using, the following snippet of code runs reasonably well, with an odbc database connection using dplyr, dbplyr, etc.
Thanks! V2 complicates this a bit. To get a collection of skimmer functions for a type of data in your data frame, we rely on S3. The function See here: Lines 197 to 200 in 966b865
And see here: This was a particular design decision to make skimr much more extensible. Instead of us having to create some sort of dispatch system for every possible class of data (like in v1), we can let other developers define their own methods. If we're going to a DB, how do we get the types of columns that we are summarizing?
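One possible answer to the type question, sketched under assumptions: if a zero-row (or one-row) prototype of the table is pulled into R first, ordinary S3 dispatch can run on its columns without touching the full data. The generic `db_skimmers()` and its methods below are hypothetical names invented for this example, not skimr's API, and the prototype is built locally to stand in for what a database query would return.

```r
# Hypothetical generic that dispatches on the class of a prototype column.
db_skimmers <- function(column) UseMethod("db_skimmers")

# Each method returns the names of statistics assumed to be computable in SQL.
db_skimmers.numeric <- function(column) c("n_missing", "mean", "min", "max")
db_skimmers.character <- function(column) c("n_missing", "n_unique")
db_skimmers.default <- function(column) "n_missing"

# Assumption: this zero-row data frame stands in for a prototype fetched
# from the database (for example, a query limited to zero rows).
proto <- data.frame(
  price = numeric(0),
  label = character(0),
  flag = logical(0),
  stringsAsFactors = FALSE
)

# Dispatch happens on the empty columns, so no data needs to leave the database.
chosen <- lapply(proto, db_skimmers)
```

Because dispatch stays in S3, other developers could still add methods for their own column classes, which keeps the extensibility argument made above.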
Hey kind skimr folks. For the last year or so I've been pondering how nice it would be to run skim against my big ol' Redshift database. I suspect I'm not the only one who's thought about this. The general idea would be to have as much code as possible execute on the database using the magic of dbplyr.
Today I decided to spend some time trying to understand the skimr code base and think about what might be necessary in order to refactor the code into functions that can play with the limited subset of functions that dbplyr can execute in SQL.
I've done a quick pass through the code, and before I started really digging into this I wanted to see if any of you wise folks had given this thought or maybe seen something else implemented elsewhere. It seems like a database friendly skimr would provide a lot of value, but it's not a trivial exercise to refactor skimr.
Any input you all have would be much appreciated.