Error on empty data.frames #23

Closed
petermeissner opened this issue Oct 9, 2017 · 9 comments
Labels
pr_pending PR has been raised v1.1.2

Comments

@petermeissner

petermeissner commented Oct 9, 2017

Hey,

Thanks for the package. It is very useful and very handy - we love the summary and the reporting!

What irritates me is the following:

I have two data.frames, e.g.:

library(dataCompareR)

df_1 <- data.frame(a = character(0), b = integer(0))
df_2 <- data.frame(a = character(0), b = integer(0))

rCompare(df_1, df_2)
## Running rCompare...
## Error in checkEmpty(df1)  : ERROR : One or more dataframes are empty

Obviously this is not a bug but intended behaviour (right?) BUT I would argue that

  1. both data.frames are valid
  2. they are equal (same columns, same data). Why impose on the user that data is only valid if it's filled?
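
Both points can be checked in base R, without dataCompareR. A minimal sketch, reusing the two data.frames from above:

```r
# Two zero-row data.frames with the same columns and types
df_1 <- data.frame(a = character(0), b = integer(0))
df_2 <- data.frame(a = character(0), b = integer(0))

# 1. Both are valid data.frames: two columns, no rows
is.data.frame(df_1) && is.data.frame(df_2)
dim(df_1)

# 2. They are equal: same column names, same types, same (empty) data
identical(df_1, df_2)
```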

I would suggest either redesigning the function so that it handles 0-row data.frames just like any other data.frame, or letting the user suppress this error by setting a parameter (e.g.: rCompare(df_1, df_2, do_not_error_on_emty_df = TRUE)).

What do you think?

@robne1982
Collaborator

Thanks for the feedback @petermeissner. I agree there's no reason why it needs to be this way. In fact, I can think of examples where this could be frustrating: someone runs a long piece of R code expecting a dataCompareR output at the end, only to get an error because one data.frame is empty.

I don't think this should be that difficult to change, so will look to change the behaviour in the next update.

Note that we're just about to merge dev > master and update the CRAN package to fix a few of the dplyr deprecation warnings that I'm sure you've seen, so any changes around empty data.frames will come in the next version.

@RjLi13
Contributor

RjLi13 commented Jul 27, 2018

I'm curious about the original reasoning for including the check for empty data frames as part of validating the data before comparison, and which comparison functions might break if the checkEmpty calls were removed. It would help with understanding the codebase!

@robne1982
Collaborator

Good question - we generally went with the approach of validating the data passed to rCompare upfront, to avoid having to validate the data in every function downstream. As to why we specifically excluded datasets with no rows, I'm not sure a lot of thought went into the options.

Based on the code-base and the most useful experience for users, I'd probably suggest:

  • we should allow comparison of data.frames with no rows. My guess is this should not lead to too many changes in the code base - I'd expect the dplyr filters etc to work, but the best way to understand is just to remove the validation and have a go

  • whilst it may be useful in some scenarios, allowing comparison of data.frames with no columns is likely to be harder, as I'd anticipate more problems with the current code base. We could either keep the current error in this scenario, or catch the situation and handle it in a separate workflow

In either case I'd like to ensure we produce valid output objects where possible - in the longer term, I'd like to work on #8, and so avoiding bespoke output in these scenarios will help a lot!

@robne1982
Collaborator

Working on this now - bit harder than I expected, as follows:

  • some reorganisation of the validation was needed to let data.frames with columns but no rows through. The difficulty is catching the case where someone calls rCompare(a,b) but a or b is NA at runtime, because these very easily end up converted to empty data.frames. I split the validation in two: part 1 validates the arguments for the case where a or b is NA; later we assess nrow and ncol, and error if ncol=0
  • after this all the unit tests pass; however, the output is not very useful. Printing gives:
> nocoldf <- data.frame(Car = character(),
+                       Date = as.Date(character()),
+                       Model = character(),
+                       stringsAsFactors = FALSE)
> rCompare(nocoldf, nocoldf)
Running rCompare...
All columns were compared, all rows were compared 
No variables match

I dislike this message - it is unclear.

and for summary

Meta Summary
============


|Dataset Name |Number of Rows |Number of Columns |
|:------------|:--------------|:-----------------|
|nocoldf      |0              |3                 |
|nocoldf      |0              |3                 |


Variable Summary
================

Number of columns in common: 3  
Number of columns only in nocoldf: 0  
Number of columns only in nocoldf: 0  
Number of columns with a type mismatch: 0  
No match key used, comparison is by row



Row Summary
===========

Total number of rows read from nocoldf: 0  
Total number of rows read from nocoldf: 0    
Number of rows in common: 2  
Number of rows dropped from nocoldf: 0  
Number of rows dropped from  nocoldf: 0  


Data Values Comparison Summary
==============================

Number of columns compared with ALL rows equal: 1  
Number of columns compared with SOME rows unequal: 0  
Number of columns with missing value differences: 0  

Columns with all rows equal : 

The line Number of rows in common: 2 is plain wrong.

And I'm struggling to see how this can be interpreted!

Number of columns compared with ALL rows equal: 1  
Number of columns compared with SOME rows unequal: 0  
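
The two-part validation described at the top of this comment could look roughly like the sketch below. The function names and error messages are hypothetical and are not dataCompareR's actual internals; the is.data.frame() test in part 1 is an assumed stand-in for the NA check.

```r
# Hypothetical helper names; not dataCompareR's actual internals.

# Part 1: reject arguments that are not data.frames at all (for example NA)
# before anything can coerce them into an empty data.frame.
validate_argument <- function(x, name) {
  if (!is.data.frame(x)) {
    stop("ERROR : ", name, " is not a data.frame", call. = FALSE)
  }
  invisible(TRUE)
}

# Part 2: check dimensions. Zero rows are allowed, zero columns are not.
validate_dimensions <- function(x, name) {
  if (ncol(x) == 0) {
    stop("ERROR : ", name, " has no columns", call. = FALSE)
  }
  invisible(TRUE)
}

validate_input <- function(x, name) {
  validate_argument(x, name)
  validate_dimensions(x, name)
}

no_rows <- data.frame(Car = character(), Model = character(),
                      stringsAsFactors = FALSE)

validate_input(no_rows, "no_rows")                     # passes: columns but no rows
try(validate_input(NA, "a"), silent = TRUE)            # fails in part 1
try(validate_input(data.frame(), "b"), silent = TRUE)  # fails in part 2
```

Keeping the NA check separate means that the later nrow/ncol test only ever sees real data.frames.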

@robne1982
Collaborator

Picking them up one by one, the line Number of rows in common: 2 is a simple fix, caused by the fact that we use seq(1:nrow) in the case where there's no match key. Oddly (or maybe not) the output of seq(1:0) is 1,2.
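
The behaviour can be reproduced in base R. The sketch below also shows seq_len(), the usual idiom for "1 to n, or nothing when n is 0"; whether the actual fix used it is not stated here, so treat that part as an assumption.

```r
n <- 0L        # number of rows in an empty data.frame

1:n            # counts down from 1 to 0, giving the length-2 vector c(1, 0)
seq(1:n)       # seq() on a length-2 vector acts like seq_along(): 1 2
seq_len(n)     # integer(0), an empty index

length(seq(1:n))    # 2
length(seq_len(n))  # 0
```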

@robne1982
Collaborator

After some work, summary now looks decent

> summary(rCompare(nocoldf, nocoldf))
Running rCompare...
dataCompareR is generating the summary...

Data Comparison
===============

Date comparison run: 2018-08-24 13:38:07  
Comparison run on R version 3.3.3 (2017-03-06)  
With dataCompareR version 0.1.1  


Meta Summary
============


|Dataset Name |Number of Rows |Number of Columns |
|:------------|:--------------|:-----------------|
|nocoldf      |0              |3                 |
|nocoldf      |0              |3                 |


Variable Summary
================

Number of columns in common: 3  
Number of columns only in nocoldf: 0  
Number of columns only in nocoldf: 0  
Number of columns with a type mismatch: 0  
No match key used, comparison is by row



Row Summary
===========

Total number of rows read from nocoldf: 0  
Total number of rows read from nocoldf: 0    
Number of rows in common: 0  
Number of rows dropped from nocoldf: 0  
Number of rows dropped from  nocoldf: 0  


Data Values Comparison Summary
==============================

No rows were compared, so no summary can be provided

@robne1982
Collaborator

However, I do not like the print output

> print(rCompare(nocoldf, nocoldf))
Running rCompare...
All columns were compared, all rows were compared 
No variables match

@robne1982
Collaborator

When comparisons happen, print displays

> print(rCompare(iris, iris))
Running rCompare...
All columns were compared, all rows were compared 
All compared variables match 
 Number of rows compared: 150 
 Number of columns compared: 5

or

> print(rCompare(iris, iris[1:140,]))
Running rCompare...
All columns were compared, 10 row(s) were dropped from comparison
All compared variables match 
 Number of rows compared: 140 
 Number of columns compared: 5

@robne1982
Collaborator

Modified the code so that the current unit tests pass and we get

> print(rCompare(nocoldf, nocoldf))
Running rCompare...
All columns were compared,  no rows compared because at least one table has no rows 
No variables match
> print(rCompare(nocoldf, nocoldf, keys = "Car"))
Running rCompare...
All columns were compared,  no rows compared because at least one table has no rows 
No variables match
> print(rCompare(nocoldf, nocoldf, keys = c("Car","Date")))
Running rCompare...
All columns were compared,  no rows compared because at least one table has no rows 
No variables match
> print(rCompare(nocoldf, nocoldf, keys = c("Car","Date","Model")))
Running rCompare...
All columns were compared,  no rows compared because at least one table has no rows 
No variables match
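
The three row messages seen in the print output in this thread could be produced by logic along these lines. This is only an illustrative sketch: row_message() and its arguments are hypothetical and are not taken from the dataCompareR source.

```r
# Hypothetical helper, not dataCompareR code: picks the row part of the
# print() message from the row counts of the two inputs.
row_message <- function(nrow_a, nrow_b, rows_dropped = 0) {
  if (nrow_a == 0 || nrow_b == 0) {
    "no rows compared because at least one table has no rows"
  } else if (rows_dropped > 0) {
    paste(rows_dropped, "row(s) were dropped from comparison")
  } else {
    "all rows were compared"
  }
}

row_message(0, 0)           # the empty-table case
row_message(150, 140, 10)   # the iris vs iris[1:140,] case
row_message(150, 150)       # the iris vs iris case
```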
