
Explicitly handling missingness in join columns #2499

Closed · adkabo opened this issue Oct 24, 2020 · 31 comments · Fixed by #2504

Labels: breaking (The proposed change is breaking.), feature
Milestone: 1.0

Comments

@adkabo commented Oct 24, 2020

Julia has a great philosophy of taking missingness seriously. For example, unlike in Pandas or Postgres, sum([1,2,missing]) gives missing. However, this philosophy hasn't yet been applied to all of the functions in the JuliaData ecosystem. I'll give an example to illustrate.

My goal is to find the relationship between age and salary. To find out, I will combine observations from two datasets.

One with age,

DataFrame([(name="John", employer=missing, age=40)])

and one with salary.

DataFrame([(name="John", employer="Julia Computing", salary=9999)])

In a complete-data environment, if these observations correspond to the same person, I want one row in the joined dataframe; on the other hand, if they correspond to different people, I want zero rows in the join.

Given the missingness in the "employer" column, we don't know whether these observations refer to the same person. So when I join them on [:name, :employer], we cannot know the right answer. Yet

innerjoin(
    DataFrame([(name="John", employer=missing, age=40)]),
    DataFrame([(name="John", employer="Julia Computing", salary=9999)]), 
    on=[:name, :employer]
)

makes a decision implicitly, returning an empty dataframe. If the observations correspond to the same person, this result -- failing to match the two observations -- is a false negative.

To avoid drawing mistaken conclusions from an analysis, I would like to extend the practice of enforcing explicit handling of missing values to joins. So I would like to get an error in this case by default, and to have to tell innerjoin() explicitly how it should handle missing values.

The passmissing() and skipmissing() patterns used elsewhere in JuliaData are a great reassurance that Julia is looking out for missing-data problems. For joins, I would like to consider:

  • In an outer join, should rows with missingness all be dropped from the output or all be kept in?
  • In an inner join, should they be considered matches or nonmatches?
  • What if there is missingness on both sides or one side only?

I'm not sure if something like passmissing(innerjoin)(a,b) would work, or if it should be more like innerjoin(a,b, missingrule=:drop) or something else. But I do want to start the conversation about it.


#2243 has some discussion of missingness and joins, mainly focused on the a.fillna(b) use case.

bkamins added the "breaking" (The proposed change is breaking.) and "feature" labels Oct 24, 2020
bkamins added this to the 1.x milestone Oct 24, 2020
@bkamins (Member) commented Oct 24, 2020

Thank you for commenting on it. The API is that by default joins are performed using isequal test (technically matching hash values). This way we can make joins fast. But the consequence is that missing is treated just as a separate value. Note that this is the same as what we do in groupby (although technically you could say that if we see missing we do not know to which group it should fall into).

However, I understand this request and we can add new functionality in which you could opt in to handle missings in a special way (indeed, what you write makes sense in general). I am adding it as a feature to implement in the future (probably post-1.0 release, as we would implement it in a non-breaking way). In the meantime, the discussion about the preferred API can continue in this issue.

@adkabo (Author) commented Oct 24, 2020

The API is that by default joins are performed using isequal test (technically matching hash values). This way we can make joins fast.

I like that joins are fast. I want to add a dispatch for joining columns of type Array{Union{T,Missing}} that raises an error unless missingness is explicitly handled. This wouldn't add any runtime overhead for joining complete data columns, nor for joining nullable columns when the missingness behavior is specified.
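
A minimal sketch of what such a dispatch-based guard could look like, showing only the default error path (the function name and error text are hypothetical, not an actual DataFrames.jl API):

# Hypothetical sketch: the check only exists for columns whose eltype admits
# missing, so joins on complete-eltype columns pay no runtime cost.
check_join_column(col::AbstractVector) = col   # eltype cannot hold missing: nothing to check
function check_join_column(col::AbstractVector{>:Missing})
    error("join column may contain missing values; please specify how missingness should be handled")
end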

you could say that if we see missing we do not know to which group it should fall into

That's a great point that merits more thinking. If missing is a separate group identifier, that is helpful relative to silently dropping the values, but it has the side effect of excluding observations from the groups where they really belong -- which could be nonuniform.

I add it as a feature to add in the future (probably post 1.0 release, as we would implement it in a non-breaking way)

I'm glad the stability of a 1.0 release is taken seriously. That said, the primary goal of this issue is to intentionally break unsafe code by raising an error -- which would ideally happen before 1.0. If more leniency is desired after that release, it's easy to switch to a more permissive model without breaking code.

bkamins modified the milestones: 1.x → 1.0 Oct 24, 2020
@bkamins (Member) commented Oct 24, 2020

I want to add a dispatch for joining columns of type Array{Union{T,Missing}}

The problem is that such columns might not contain missing (we know only the eltype, not the contents). With missing we normally do not throw errors based on type, only on value. (Having said that, any(ismissing, col) is relatively cheap, so we could do this.)
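
A minimal sketch of the value-based check being suggested here, as opposed to a purely type-based one (the helper name and error message are hypothetical):

using DataFrames

# Hypothetical guard: only error if a join key column actually contains
# missing, not merely because its eltype is Union{T, Missing}.
function assert_no_missing_keys(df::AbstractDataFrame, on::Vector{Symbol})
    for col in on
        if any(ismissing, df[!, col])
            throw(ArgumentError("join column :$col contains missing values; " *
                                "specify explicitly how they should be matched"))
        end
    end
    return nothing
end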

the primary goal of this issue is to intentionally break unsafe code by raising an error -- which would ideally happen before 1.0.

I understand this. In general we have a tension between two completely opposing points of view:

  1. some users (like you) want DataFrames.jl to be strict.
  2. some users want exactly the opposite (and make it do the operation that is most commonly right by default)

E.g. for groupby we have skipmissing kwarg, but it is false by default.
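
For reference, a minimal illustration of that groupby keyword (output elided):

using DataFrames

df = DataFrame(employer = ["Julia Computing", missing, "Julia Computing"], salary = [1, 2, 3])

groupby(df, :employer)                      # default skipmissing=false: missing forms its own group
groupby(df, :employer; skipmissing = true)  # rows with a missing key are dropped from the grouping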

In general, while I perfectly understand your reasoning, for the above reasons this is not a simple decision (if we wanted to go this way we should consult the community first and get consensus). An additional point is that the current behavior does not lead to irreversible data corruption: you just get more results than you would with a strict approach, and they are easy to filter out later. Also, as commented, the approach most consistent with groupby would be to have a skipmissing kwarg set to false by default (and if we went this way the change would be non-breaking, so it becomes just a feature request that can be implemented later).

In general this request is somewhat similar to asking Dict to throw an error when someone wants to use missing as a key. As I have said, I perfectly understand the logic behind wanting this (the true value behind missing might or might not be the same as an already existing key, and if you get a second missing you do not know whether it is the same or a different value), but people would probably find this approach too strict (though it is the most logically consistent one).

Looking at it from another angle, the question is: how often would the behavior we currently have lead to significant problems? (Leaving aside the logical purity of your proposal, which I appreciate and agree with.)

@nalimilan - can you please comment on what you think here.

I am tagging it for a decision before 1.0 release.

@adkabo (Author) commented Oct 24, 2020

An additional point is that the current behavior does not lead to irreversible data corruption: you just get more results than you would with a strict approach, and they are easy to filter out later.

This is true in the case of groupby: a missing group appears. In the case of innerjoin(), my original example shows that it does corrupt the data by falsely failing to match rows and dropping them from the output without any indication.

In general we have a tension between two completely opposing points of view:

  1. some users (like you) want DataFrames.jl to be strict.
  2. some users want exactly the opposite (and make it do the operation that is most commonly right by default)

Perhaps there's a way for users to specify how strict they want to be.

using DataFrames.Strict

vs

using DataFrames.Permissive

and using DataFrames could pick one of these or neither. Or using the "context" pattern,

ctx = DataFrames.strict()
c = innerjoin(ctx, a,b)

@adkabo (Author) commented Oct 24, 2020

By the way @bkamins , I want to thank you for considering this suggestion. Missingness handling is a big decision, and regardless of how DataFrames.jl ultimately decides to proceed, I'm appreciative that you're taking the time to think seriously about the varied needs of the JuliaData community.

@bkamins (Member) commented Oct 24, 2020

my original example shows that it does corrupt the data by falsely failing to match rows and dropping them from the output without any indication.

Indeed it drops them, but there is no way to "properly fix it" without filling in the missings (in your approach you would throw an error in this case, but then the user gets nothing - actually this is something I do in #2494 by default, but there it is less debatable).

Perhaps there's a way for users to specify how strict they want to be.

We avoid keeping global state in DataFrames.jl. If we add this, it will probably be via a kwarg. The decision to make is what we want as the default (if we keep the current default this issue is not urgent, but if we decide that the default is what you propose we need to implement it now).

bkamins mentioned this issue Oct 24, 2020
@nalimilan (Member):

Thanks for spotting this. Indeed the presence of missing values in the join keys seems more dangerous than e.g. in groupby, as it can silently drop rows with missing values. AFAICT this is most problematic for inner joins: in other types of joins a row with missing will still be present in the result, so missing is propagated somehow (like with groupby).

I suspect that in practice people don't perform joins on columns with missing values very frequently, and that when they do so it's often due to a mistake. We could deprecate this in 0.22 and see whether we get complaints.

A keyword argument would have to be added to disable the deprecation warning. But there are several possible behaviors to choose from: throw an error (default); treat missing as a separate value equal to itself (current behavior); treat missing as a separate value not equal to itself (equivalent to skipping missing values for inner joins); skip rows with missing. Maybe we only need to support the first two behaviors, in which case a single Boolean argument would be enough (not sure what to call it: allowmissing?). Skipping rows with missing values can be done manually using dropmissing, and treating missing as not equal to itself could be allowed later via an additional argument if it's really needed.

@bkamins (Member) commented Oct 25, 2020

If we went for this I would do allowmissing with meaning:

  • if true: treat missing as a separate value equal to itself (current behavior)
  • if false (default): throw an error

I will ask for opinions on Slack.

@pdeffebach (Contributor):

Do any languages support joining where missings are equal? IIRC Stata treats all missings as different or throws an error.

I would be most comfortable with always throwing an error and not allowing an option to do otherwise.

@nalimilan (Member):

FWIW dplyr supports this and does the same as we currently do by default:

na_matches: Should ‘NA’ and ‘NaN’ values match one another?

          The default, ‘"na"’, treats two ‘NA’ or ‘NaN’ values as
          equal, like ‘%in%’, ‘match()’, ‘merge()’.

          Use ‘"never"’ to always treat two ‘NA’ or ‘NaN’ values as
          different, like joins for database sources, similarly to
          ‘merge(incomparables = FALSE)’.

(Looks like they anticipated having to support more than two options but so far didn't add them. :-)

We should also consider NaN, which could also be a source of bugs.

@tbeason (Contributor) commented Oct 25, 2020

I just don't see how the example in the OP can do anything other than what it currently does. You are labelling it a false negative only because you constructed the example. Suppose instead that I hand you those two datasets: now how certain are you that those observations correspond to each other? Suppose you observe another John who worked at a different place (or the same place!). Then what? Real data is messy like this.

I think the typical way (at least, what I've encountered and what I usually do) to merge datasets when you have missing keys is a multi-step merge. You first merge based on what you know gives the best match, then from the remaining unmatched subsets you might match on different -- less good -- criteria. But you might also not do this second step. It requires assumptions and outside knowledge. This isn't something the package should do for you.

If anything like this is implemented I would want it to be off by default (put me in the strict camp I guess).

@oxinabox (Contributor):

I would be unfavor of doing any(ismissing, col) || error(...) right now for the 1.0 release.

Since turning errors into non-errors is not breaking.
And current behaviour is weird.
We can error for now and work out exactly how we want the future API to look later.

The deprecation is fiddly.
I think we would need to introduce a new singleton sentinel for it and replace missing with that sentinel.

@bkamins (Member) commented Oct 25, 2020

unfavor

I understand you wanted to say "in favor". Right?


This is what pandas does (essentially what we do now):

>>> df
    id  x
0  0.0  0
1  NaN  1
2  2.0  2
3  3.0  3
4  4.0  4
5  5.0  5
6  6.0  6
7  7.0  7
8  8.0  8
9  9.0  9
>>> df2
    id  y
0  0.0  0
1  1.0  1
2  NaN  2
3  3.0  3
4  4.0  4
5  5.0  5
6  6.0  6
7  7.0  7
8  8.0  8
9  9.0  9
>>> pd.merge(df, df2, on='id', how='inner')
    id  x  y
0  0.0  0  0
1  NaN  1  2
2  3.0  3  3
3  4.0  4  4
4  5.0  5  5
5  6.0  6  6
6  7.0  7  7
7  8.0  8  8
8  9.0  9  9
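
For comparison, a rough DataFrames.jl analogue of that pandas snippet (a sketch; with the behavior current at the time of this discussion, the two missing keys match each other just like the NaN rows above):

using DataFrames

df  = DataFrame(id = [0, missing, 2, 3], x = 0:3)
df2 = DataFrame(id = [0, 1, missing, 3], y = 0:3)

innerjoin(df, df2, on = :id)   # the missing keys match each other, mirroring the pandas result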

In summary we essentially have three options then:

  1. throw an error (and I understand that the community preference is to be breaking and choose it as the default)
  2. treat missings as equal (this is what we already have so we can keep it)
  3. treat missings as distinct (this would have to be implemented and requires special handling - we could add it later)

So we would add a kwarg that would most likely take Symbols as values: :error (the default), :equal, :distinct. What could be the name for it? missingmatching?

@pdeffebach (Contributor):

As mentioned above, I don't think we should allow other options besides erroring at all. We can always add it later.

@bkamins (Member) commented Oct 25, 2020

I don't think we should allow other options besides erroring at all. We can always add it later.

But why? If we make error the default, then I think it is better to allow users to get the current behavior by using a kwarg (so that they can easily update their code). What downsides of this approach do you see?

@pdeffebach (Contributor):

Only that exposing a larger API means more to maintain.

I also find it hard to motivate real-world examples where matching missings would be a good idea. At the very least this is a vote against allowing option 2. I would imagine that the vast majority of use cases are ones where users would like to see an error.

@nalimilan (Member):

Given how simple it is to support the current behavior using a keyword argument, it would be quite extreme to completely stop supporting it.

@bkamins (Member) commented Oct 25, 2020

  • R and pandas support this behavior (so even if it is not super consistent, it could be useful for someone who wants to port code one-to-one from e.g. pandas to DataFrames.jl).

@pdeffebach (Contributor):

Okay, then I think we should support the keyword arguments. :error, :equal, and :distinct seem good. matchmissing seems like a good name to me.
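
A sketch of what calls could look like under this proposal, reusing the dataframes from the OP (keyword name and values are as suggested in this thread, not necessarily the final API):

using DataFrames

df1 = DataFrame([(name = "John", employer = missing, age = 40)])
df2 = DataFrame([(name = "John", employer = "Julia Computing", salary = 9999)])

innerjoin(df1, df2, on = [:name, :employer])                           # matchmissing = :error by default: would throw
innerjoin(df1, df2, on = [:name, :employer], matchmissing = :equal)    # current behavior: missing matches missing
innerjoin(df1, df2, on = [:name, :employer], matchmissing = :distinct) # possible later option: missing matches nothing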

@bkamins (Member) commented Oct 25, 2020

@adkabo - in summary, we will then most probably follow your recommendation (let us wait a few days for feedback). Thank you for raising the issue.

@tbeason (Contributor) commented Oct 25, 2020

I guess maybe I misunderstood exactly what was on the table here? I do support the proposal if it is essentially just about the matching-ness of missing: either missings are ignored completely, they throw errors, or they sort of match together.

Can someone clarify what would happen, though, in the OP's example? Either of the non-erroring options still returns empty, yes?

@nalimilan (Member):

Do any languages implement the "distinct" solution? Otherwise it's probably not worth supporting that. (I couldn't find what Stata does.)

Can someone clarify what would happen, though, in the OP's example? Either of the non-erroring options still returns empty, yes?

Yes.

@oxinabox (Contributor) commented Oct 25, 2020

:distinct sounds useful for cross joins.
It also lines up with the mental model of it could be anything

@bkamins (Member) commented Oct 25, 2020

It also lines up with the mental model of it could be anything

This is what I meant at the top of this discussion. We will not implement it now, but as it is the most logically consistent approach I would not rule it out in the design (just like R did).

@nalimilan (Member):

Cross joins are a very special kind of join (I almost wouldn't consider them joins). They don't need to compare values, so I don't think :distinct applies there (each row is distinct). Actually, crossjoin already doesn't take the same arguments and doesn't use the same implementation.

@nalimilan (Member):

Another data point: in SQL inner joins, NULL is treated as distinct from all other values, which is considered a trap (see Wikipedia). However, NULLs are preserved by left/right/outer/cross joins.

@ppalmes (Contributor) commented Oct 25, 2020

I think in a typical relational database, columns that you use as primary or foreign keys should be unique, distinct, and not empty. We are complicating the processing because that assumption does not hold: by making the API flexible, we are letting the dataframe paper over mistakes in data encoding. Some basic assumptions about relational data should be met, and if they are not, the API should not be changed, because that will just make the design messy.

@ppalmes (Contributor) commented Oct 25, 2020

The data should be fixed, not the algorithm for joining tables. Perhaps the join operation should throw an error if the key column has missing values, because the table violates the requirements for joining on keys, or skip the rows with missing values, because the algorithm should not guess. Keys with missing values lack the information needed to relate one table to another.

@xiaodaigh (Contributor):

I support throwing an error, but then allowing an option to enable matching.

@Nosferican (Contributor):

I favor following SQL behavior.

@bkamins (Member) commented Oct 29, 2020

See #2504 for a proposed implementation (finalization is pending approval of #2503).
