Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

skimr against a database #358

Open
CerebralMastication opened this issue Dec 2, 2018 · 9 comments
Open

skimr against a database #358

CerebralMastication opened this issue Dec 2, 2018 · 9 comments

Comments

@CerebralMastication
Copy link

CerebralMastication commented Dec 2, 2018

Hey kind skimr folks. For the last year or so I've been pondering how nice it would be to run skim against my big ol' Redshift database. I suspect I'm not the only one who's thought about this. The general idea would be to have as much code as possible execute on the database using the magic of dbplyr.

Today I decided to spend some time trying to understand the skimr code base and think about what might be necessary in order to refactor the code into functions that can play with the limited subset of functions that dbplyr can execute in SQL.

I've done a quick pass through the code and before I started really digging into this I wanted to see if any of you wise folks had given this thought or maybe seen something else implemented elsewhere. It seems like a database friendly skimr would provide a lot of value, it's not a trivial exercise to refactor skimr.

Any input you all have would be much appreciated.

@michaelquinn32
Copy link
Collaborator

Hi JD!

This is something that we could eventually support. It would be easier to do v2, which has a simpler codebase. I'm currently on pat leave, and I had hoped to spend some time on skimr, but it's going slowly. The last time I talked to @elinw, we were targeting a v2 release at the beginning of next year.

After that, we can start fleshing out features like this.

Thanks!

@CerebralMastication
Copy link
Author

Congratulations on the new new Mini Mighty Quinn!

What I'm hearing you say is that if I want to hack around the edges on this I should be using the v2 branch :)

I'm going to also look at the code base for dbplot as some of the functions @edgararuiz has put in there may be helpful. In particular these:

db_compute_bins() - Returns a data frame with the bins and count per bin
db_compute_count() - Returns a data frame with the count per discrete value
db_compute_raster() - Returns a data frame with the results per x/y intersection
db_compute_boxplot() - Returns a data frame with boxplot calculations

@edgararuiz-zz
Copy link

There may be some opportunities to expand on those functions by passing multiple calculations in one query instead of one per field in a given view or table. Let me know if you'd like me to lend a hand on this one.

@CerebralMastication
Copy link
Author

There may be some opportunities to expand on those functions by passing multiple calculations in one query instead of one per field in a given view or table. Let me know if you'd like me to lend a hand on this one.

That's a grand idea Edgar. I think refactoring those to do multiple fields would be a helpful step. Any help is always appreciated. Plus I'll learn a few things along the way.

@elinw
Copy link
Collaborator

elinw commented Jan 6, 2019

I just merged updates into the master and develop branches so that skimr will work with dplyr 0.8.0. I'm going to see how it goes with v2 but I'd suggest if you are working on this idea that you pull down the rc branch for dplyr and work from there.

@elinw
Copy link
Collaborator

elinw commented Jan 9, 2019

Now the updates are in the v2 branch.
So I am hoping to release a final v1 pretty soon and definitely before the new dplyr eta of February 1.
That's also around the same time that I hope to release v2 which is looking pretty good.

@elinw
Copy link
Collaborator

elinw commented Jan 13, 2019

I've been looking at this a tiny bit. Right now this works

# not in the data base
flights_db %>% select(flights_db$ops$vars) %>% skim()

# in the data base
flights_db %>%  db_compute_count(dep_delay) 

Because skim() in v2 is totally focused on data frames I'm wondering if we would want to create skim_db() that would work like skim() but instead of checking to see if data is a data frame, it would check to see if it is a data base or more specifically a dbplyr connection to a data base. Probably by checking for "tbl_sql" in the classes.

Then you'd want to have skimmers that are for databases, so limited to what can be done there.

I think we would likely have to do or require users to do some preprocessing. For example we'd need the variable names and the types. You can get those with a prior round trip to the database, and trying to make it as generic as possible (because supporting multiple databases would make doing queries that are too specific a problem).

db_head <- flights_db %>% head(0) %>% as.data.frame()
db_vars <- colnames(db_head)
db_classes <- lapply(db_head, class)

Then we use that with our skimmers to go do the work of calculating the statistics.

Then back to R to update the resulting data frame and on to the next.
Then print.
And that will replicate what happens in skim() but then we could try to make the performance in both versions better.

@tedmoorman
Copy link

tedmoorman commented Aug 18, 2020

I'd like to upvote this suggestion/issue, if such a thing is possible. This is a fabulous package. If it were more database compatible, the improvement in speed would make it that much better. For the particular database I'm using, the following snippet of code runs reasonably well, with an odbc database connection using dplyr, dplyr, etc.

my_db %>% select(<a few columns>) %>% distinct_all() %>% collect() %>% select_if(~!bit64::is.integer64(.)) %>% skim()

@michaelquinn32
Copy link
Collaborator

Thanks!

V2 complicates this a bit. To get a collection of skimmer functions for a type of data in your data frame, we rely on S3. The function get_skimmers() is a generic, and each type of data returns a different method.

See here:

skimr/R/skim_with.R

Lines 197 to 200 in 966b865

get_final_skimmers <- function(column, data, local_skimmers, append) {
defaults <- get_skimmers(data[[column]])
all_classes <- skim_class(data[[column]])
locals <- get_local_skimmers(all_classes, local_skimmers)

And see here:
https://github.com/ropensci/skimr/blob/master/R/get_skimmers.R

This was a particular design decision to make skimmer much more extensible. Instead of us having to create some sort of dispatch system for every possible class of data (like in v1), we can let other developers define their own methods.

If we're going to a DB, how do we get the types of columns that we are summarizing?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants