Error on empty data.frames #23

petermeissner · 2017-10-09T10:29:55Z

Hey,

thanks for the package, it is very useful and very handy - we love the summary and the reporting!

What irritates me is the following:

I have two data.frames, e.g.:

library(dataCompareR)

df_1 <- data.frame(a = character(0), b = integer(0))
df_2 <- data.frame(a = character(0), b = integer(0))

rCompare(df_1, df_2)
## Running rCompare...
## Error in checkEmpty(df1)  : ERROR : One or more dataframes are empty

Obviously this is not a bug but intended behaviour (right?), BUT I would argue that

- both data.frames are valid
- they are equal (same columns, same data). Why impose on the user that data is only valid if it's filled?
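
The claim that both inputs are valid and equal can be checked with base R alone. A minimal sketch, using only the two data.frames from the example above and no dataCompareR functions (the variable names `dims_1`, `same_object`, etc. are just for illustration):

```r
# Two zero-row data.frames with identical column names and types
df_1 <- data.frame(a = character(0), b = integer(0))
df_2 <- data.frame(a = character(0), b = integer(0))

# Both are well-formed data.frames: zero rows, two columns
dims_1 <- dim(df_1)
dims_2 <- dim(df_2)

# Base R considers them equal, both strictly and via all.equal()
same_object <- identical(df_1, df_2)
same_values <- isTRUE(all.equal(df_1, df_2))
```

Both checks come back `TRUE`, which is the behaviour one might reasonably expect a comparison tool to report as well.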

I would suggest to either redesign the function to make it handle 0 row data.frames just like any other data.frame or allow the user to prevent this error by setting a parameter (e.g.: rCompare(df_1, df_2, do_not_error_on_emty_df = TRUE)).

What do you think?


robne1982 · 2017-10-12T12:01:00Z

Thanks for the feedback @petermeissner. I agree there's no reason why it needs to be this way. In fact, I can think of examples where this could be frustrating: people running a long piece of R code and expecting to find a dataCompareR output at the end, only to get an error because one data.frame is empty.

I don't think this should be that difficult to change, so will look to change the behaviour in the next update.

Note that we're just about to merge dev > master and update the CRAN package to fix a few of the dplyr deprecation warnings that I'm sure you've seen, so any changes around empty data.frames will come in the next version.

RjLi13 · 2018-07-27T06:33:22Z

I'm curious about the original reasoning for including the check for empty data frames as part of validating the data before comparison, and which comparison functions might break if the checkEmpty calls were removed. It would help with understanding the codebase!

robne1982 · 2018-08-15T08:10:58Z

Good question - we generally went with the approach of validating the data passed to rCompare upfront, to avoid having to validate the data in every function downstream. As to why we specifically excluded datasets with no rows, I'm not sure a lot of thought went into the options.

Based on the code base and the most useful experience for users, I'd probably suggest:

- We should allow comparison of data.frames with no rows. My guess is this should not lead to too many changes in the code base - I'd expect the dplyr filters etc. to work, but the best way to understand is just to remove the validation and have a go.
- Whilst it may be useful in some scenarios, allowing comparison of data.frames with no columns is likely to be harder, as I'd anticipate more problems with the current code base. We could either keep the current error in this scenario, or catch the situation and handle it in a separate workflow.

In either case I'd like to ensure we produce valid output objects where possible - in the longer term, I'd like to work on #8, and so avoiding bespoke output in these scenarios will help a lot!
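
The split suggested above (allow zero rows, keep an error for zero columns) could be sketched roughly as follows. This is only an illustration in base R: `check_dims` is a hypothetical helper, not a function in dataCompareR, and the error messages are made up for the example.

```r
# Hypothetical helper (not part of dataCompareR): accept data.frames with
# zero rows, but stop when there are no columns to compare.
check_dims <- function(df, name = "data.frame") {
  if (!is.data.frame(df)) {
    stop(name, " is not a data.frame")
  }
  if (ncol(df) == 0) {
    stop(name, " has no columns")
  }
  # Zero rows is allowed; return whether any rows exist so that
  # downstream code can adapt its output
  invisible(nrow(df) > 0)
}

no_rows <- data.frame(a = character(0), b = integer(0))
no_cols <- data.frame()

# Passes validation even though there is nothing to compare row-wise
has_rows <- check_dims(no_rows, "no_rows")

# Still raises an error; capture the message instead of aborting
err <- tryCatch(
  check_dims(no_cols, "no_cols"),
  error = function(e) conditionMessage(e)
)
```

Returning a flag rather than erroring on zero rows keeps the output object valid, which fits the aim of avoiding bespoke output in these scenarios.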

robne1982 · 2018-08-24T09:53:32Z

Working on this now - a bit harder than I expected, as follows:

- Some reorganisation of the validation was needed to get data.frames with cols but no rows to validate. The issue is catching when someone calls rCompare(a,b) but a or b is NA at runtime. It's very easy for these to end up converted to empty data.frames. I split the validation in two: part 1 validates the arguments for the case where a or b is NA. Then later we assess nrow and ncol, and error if ncol=0.
- After this all the unit tests pass; however, the output is not very useful, getting

> nocoldf <- data.frame(Car = character(),
+                       Date = as.Date(character()),
+                       Model = character(),
+                       stringsAsFactors = FALSE)
> rCompare(nocoldf, nocoldf)
Running rCompare...
All columns were compared, all rows were compared 
No variables match

I dislike this message - it is unclear.

and for summary

Meta Summary
============


|Dataset Name |Number of Rows |Number of Columns |
|:------------|:--------------|:-----------------|
|nocoldf      |0              |3                 |
|nocoldf      |0              |3                 |


Variable Summary
================

Number of columns in common: 3  
Number of columns only in nocoldf: 0  
Number of columns only in nocoldf: 0  
Number of columns with a type mismatch: 0  
No match key used, comparison is by row



Row Summary
===========

Total number of rows read from nocoldf: 0  
Total number of rows read from nocoldf: 0    
Number of rows in common: 2  
Number of rows dropped from nocoldf: 0  
Number of rows dropped from  nocoldf: 0  


Data Values Comparison Summary
==============================

Number of columns compared with ALL rows equal: 1  
Number of columns compared with SOME rows unequal: 0  
Number of columns with missing value differences: 0  

Columns with all rows equal :

The line Number of rows in common: 2 is plain wrong.

And I'm struggling to see how this can be interpreted!

Number of columns compared with ALL rows equal: 1  
Number of columns compared with SOME rows unequal: 0

robne1982 · 2018-08-24T09:56:36Z

Picking them up one by one, the line Number of rows in common: 2 is a simple fix, caused by the fact that we use seq(1:nrow) in the case where there's no match key. Oddly (or maybe not) the output of seq(1:0) is 1,2.
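
To make the `seq(1:0)` behaviour concrete, here is a small base R sketch. `seq_len()` is shown as one common way to avoid the problem; it is not confirmed here as the fix that was actually applied in the package.

```r
n <- 0L  # number of rows in an empty data.frame

# 1:0 counts down, giving c(1, 0); seq() on a vector of length 2 then
# behaves like seq_along() and returns one index per element
bad_index <- seq(1:n)

# seq_len() returns an empty integer vector when n is 0
good_index <- seq_len(n)

length(bad_index)   # 2
length(good_index)  # 0
```

For any `n >= 1` the two expressions give the same indices, so the difference only shows up in the zero-row case.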

robne1982 · 2018-08-24T12:40:49Z

After some work, summary now looks decent

> summary(rCompare(nocoldf, nocoldf))
Running rCompare...
dataCompareR is generating the summary...

Data Comparison
===============

Date comparison run: 2018-08-24 13:38:07  
Comparison run on R version 3.3.3 (2017-03-06)  
With dataCompareR version 0.1.1  


Meta Summary
============


|Dataset Name |Number of Rows |Number of Columns |
|:------------|:--------------|:-----------------|
|nocoldf      |0              |3                 |
|nocoldf      |0              |3                 |


Variable Summary
================

Number of columns in common: 3  
Number of columns only in nocoldf: 0  
Number of columns only in nocoldf: 0  
Number of columns with a type mismatch: 0  
No match key used, comparison is by row



Row Summary
===========

Total number of rows read from nocoldf: 0  
Total number of rows read from nocoldf: 0    
Number of rows in common: 0  
Number of rows dropped from nocoldf: 0  
Number of rows dropped from  nocoldf: 0  


Data Values Comparison Summary
==============================

No rows were compared, so no summary can be provided

robne1982 · 2018-08-24T12:44:21Z

However, I do not like the print output

> print(rCompare(nocoldf, nocoldf))
Running rCompare...
All columns were compared, all rows were compared 
No variables match

robne1982 · 2018-08-24T13:02:15Z

When comparisons happen, print displays

> print(rCompare(iris, iris))
Running rCompare...
All columns were compared, all rows were compared 
All compared variables match 
 Number of rows compared: 150 
 Number of columns compared: 5

or

> print(rCompare(iris, iris[1:140,]))
Running rCompare...
All columns were compared, 10 row(s) were dropped from comparison
All compared variables match 
 Number of rows compared: 140 
 Number of columns compared: 5

robne1982 · 2018-08-24T13:24:03Z

Modified the code so that the current unit tests pass and we get

> print(rCompare(nocoldf, nocoldf))
Running rCompare...
All columns were compared,  no rows compared because at least one table has no rows 
No variables match
> print(rCompare(nocoldf, nocoldf, keys = "Car"))
Running rCompare...
All columns were compared,  no rows compared because at least one table has no rows 
No variables match
> print(rCompare(nocoldf, nocoldf, keys = c("Car","Date")))
Running rCompare...
All columns were compared,  no rows compared because at least one table has no rows 
No variables match
> print(rCompare(nocoldf, nocoldf, keys = c("Car","Date","Model")))
Running rCompare...
All columns were compared,  no rows compared because at least one table has no rows 
No variables match

robne1982 self-assigned this Oct 12, 2017

robne1982 added the enhancement label Oct 12, 2017

robne1982 added the hacktoberfest label Oct 25, 2017

robne1982 added help wanted and removed hacktoberfest labels Nov 7, 2017

robne1982 added the v1.1.2 label May 14, 2018

robne1982 mentioned this issue Aug 15, 2018 in: Provide boolean outputs that report on equality for certain conditions #8 (Open)

robne1982 mentioned this issue Aug 24, 2018 in: Allows data frames with 0 rows to be compared #57 (Merged)

robne1982 added the pr_pending label (PR has been raised) and removed the enhancement and help wanted labels Aug 24, 2018

robne1982 closed this as completed Aug 31, 2018
