
0.12: join with on= fails with TimeStamp and Int64 MultiIndex #5647

Closed
bluefir opened this issue Dec 5, 2013 · 14 comments · Fixed by #30444
Labels
good first issue · MultiIndex · Needs Tests (unit test(s) needed to prevent regressions)
bluefir commented Dec 5, 2013

I encountered a very strange problem in 0.12: a straightforward join fails that works just fine in 0.11.

df1

<class 'pandas.core.frame.DataFrame'>
Int64Index: 71958 entries, 0 to 104014
Data columns (total 3 columns):
FactorDate 71958 non-null values
FundSecNo 71958 non-null values
SourceFundSecNo 71958 non-null values
dtypes: datetime64[ns](1), int64(2)

df1.dtypes

FactorDate datetime64[ns]
FundSecNo int64
SourceFundSecNo int64
dtype: object

df2

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 94868 entries, (2011-10-18 00:00:00, 283) to (2011-10-19 00:00:00, 127419)
Data columns (total 2 columns):
DemeanGroupId 94868 non-null values
NormalizationGroupId 94868 non-null values
dtypes: float64(1), int64(1)

for ind in df2.index.levels:
    print(type(ind))

<class 'pandas.tseries.index.DatetimeIndex'>
<class 'pandas.core.index.Int64Index'>

test = df1.join(df2, on=['FactorDate', 'SourceFundSecNo'])
test

<class 'pandas.core.frame.DataFrame'>
Int64Index: 71958 entries, 0 to 104014
Data columns (total 5 columns):
FactorDate 71958 non-null values
FundSecNo 71958 non-null values
SourceFundSecNo 71958 non-null values
DemeanGroupId 0 non-null values
NormalizationGroupId 0 non-null values
dtypes: datetime64[ns](1), float64(2), int64(2)

date = datetime.datetime(2011, 10, 18)
sfsn = 283
test.loc[(test.FactorDate == date) & (test.SourceFundSecNo == sfsn)]
          FactorDate  FundSecNo  SourceFundSecNo  DemeanGroupId  NormalizationGroupId

1 2011-10-18 00:00:00 283 283 NaN NaN
17933 2011-10-18 00:00:00 65213 283 NaN NaN

df2.loc[(date, sfsn)]

DemeanGroupId 350171
NormalizationGroupId 35
Name: (2011-10-18 00:00:00, 283), dtype: float64

I can attach the frames, but I forgot how to do it :-)

jreback commented Dec 5, 2013

Can you put up a small reproducible example, if possible?

bluefir commented Dec 5, 2013

I tried quickly but couldn't come up with one that exhibits the same behavior. Can I put the frames into an HDF5 file and somehow ship it to you guys? This would be the easiest way.

jreback commented Dec 6, 2013

Sure. Show the complete code to get them out, put the file on a Dropbox-like service, and post the link.

bluefir commented Dec 6, 2013

https://www.dropbox.com/s/c0hq6uqn8y5vkbk/issue5647.h5

store = pd.HDFStore('issue5647.h5', mode='r')
df1 = store.select('df1')
df2 = store.select('df2')
store.close()

jreback commented Dec 6, 2013

I think you want something like this. The problem is that it doesn't know how to match up 'SourceFundSecNo' with 'FundSecNo'.

In [72]: x = df2.reset_index()

In [73]: x['SourceFundSecNo'] = x['FundSecNo']

In [74]: x.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 94868 entries, 0 to 94867
Data columns (total 5 columns):
FactorDate              94868  non-null values
FundSecNo               94868  non-null values
DemeanGroupId           94868  non-null values
NormalizationGroupId    94868  non-null values
SourceFundSecNo         94868  non-null values
dtypes: datetime64[ns](1), float64(1), int64(3)
In [75]: df1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 71958 entries, 0 to 104014
Data columns (total 3 columns):
FactorDate         71958  non-null values
FundSecNo          71958  non-null values
SourceFundSecNo    71958  non-null values
dtypes: datetime64[ns](1), int64(2)

inner join

In [76]: pd.merge(df1,x,on=['FactorDate','SourceFundSecNo']).info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 71958 entries, 0 to 71957
Data columns (total 6 columns):
FactorDate              71958  non-null values
FundSecNo_x             71958  non-null values
SourceFundSecNo         71958  non-null values
FundSecNo_y             71958  non-null values
DemeanGroupId           71958  non-null values
NormalizationGroupId    71958  non-null values
dtypes: datetime64[ns](1), float64(1), int64(4)

An outer join is, I think, what you want (i.e. many-to-many):

In [77]: pd.merge(df1,x,on=['FactorDate','SourceFundSecNo'],how='outer').info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 103922 entries, 0 to 103921
Data columns (total 6 columns):
FactorDate              103922  non-null values
FundSecNo_x             71958  non-null values
SourceFundSecNo         103922  non-null values
FundSecNo_y             103922  non-null values
DemeanGroupId           103922  non-null values
NormalizationGroupId    103922  non-null values
dtypes: datetime64[ns](1), float64(3), int64(2)

bluefir commented Dec 6, 2013

Well, as you can see, the index in df2 is unique (at least it's supposed to be; by the way, is there a function that checks for that, similar to is_lexsorted?). So the idea is to propagate the unique values keyed by (FactorDate, FundSecNo) in df2 into each (FactorDate, SourceFundSecNo) in df1; in other words, many-to-one. This is a pretty standard idea.

I can work around that, sure, but it looks ugly and wastes CPU and memory. The problem is that it worked perfectly in 0.11 but doesn't in 0.12. Do you observe the same behavior? If so, why? Was the 0.11 behavior wrong? join() is just syntactic sugar for merge(), right? Every toy example I tried to come up with in 0.12 worked as expected.
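[Editor's note] The many-to-one lookup being described can be sketched with made-up stand-in values (the original frames are not available here); on pandas versions where this issue is fixed, `join` with `on=` against the MultiIndex fills the right-hand columns as expected:

```python
import pandas as pd

# df2-like frame: a unique (FactorDate, FundSecNo) MultiIndex (values made up).
right = pd.DataFrame(
    {"DemeanGroupId": [350171, 1510157], "NormalizationGroupId": [35, 151]},
    index=pd.MultiIndex.from_tuples(
        [(pd.Timestamp("2011-10-18"), 283), (pd.Timestamp("2011-10-18"), 285)],
        names=["FactorDate", "FundSecNo"],
    ),
)

# df1-like frame: plain integer index, join keys as ordinary columns.
left = pd.DataFrame({
    "FactorDate": pd.to_datetime(["2011-10-18", "2011-10-18"]),
    "FundSecNo": [283, 65213],
    "SourceFundSecNo": [283, 283],
})

# Many-to-one: each left row picks up the unique right row whose index
# matches its (FactorDate, SourceFundSecNo) pair.
joined = left.join(right, on=["FactorDate", "SourceFundSecNo"])
print(joined)
```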

jreback commented Dec 6, 2013

join's default is 'left'; I think you just need 'outer' here. Not sure why it's different in 0.11; I don't think it changed.

you can do

df2.index.is_unique
(which is True here)

In [84]: df1.join(df2,on=['FactorDate','SourceFundSecNo'],how='outer')
Out[84]: 
               FactorDate  FundSecNo  SourceFundSecNo  DemeanGroupId  NormalizationGroupId
0     2011-10-18 00:00:00        280            12432        1520161                   152
1883  2011-10-18 00:00:00      12432            12432        1520161                   152
1     2011-10-18 00:00:00        283              283         350171                    35
17933 2011-10-18 00:00:00      65213              283         350171                    35
2     2011-10-18 00:00:00        285              285        1510157                   151
3     2011-10-18 00:00:00        286              286        1510157                   151
4     2011-10-18 00:00:00        287              287         350161                    35
5     2011-10-18 00:00:00        294              294         350175                    35
6     2011-10-18 00:00:00        300              300         350174                    35

bluefir commented Dec 6, 2013

At the risk of beating a dead horse: I do want 'left'! The fact that 'left' returns no matches at all while 'outer' returns all of them just doesn't make sense. The 0.11 behavior did make sense. How can we make sense of test2 below given test3?

test2 = pd.merge(df1, df2, how='left', left_on=['FactorDate', 'SourceFundSecNo'], right_index=True)
test2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 71958 entries, 0 to 104014
Data columns (total 5 columns):
FactorDate 71958 non-null values
FundSecNo 71958 non-null values
SourceFundSecNo 71958 non-null values
DemeanGroupId 0 non-null values
NormalizationGroupId 0 non-null values
dtypes: datetime64[ns](1), float64(2), int64(2)

test3 = pd.merge(df1, df2, how='outer', left_on=['FactorDate', 'SourceFundSecNo'], right_index=True)
test3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103922 entries, 0 to 104014
Data columns (total 5 columns):
FactorDate 103922 non-null values
FundSecNo 71958 non-null values
SourceFundSecNo 103922 non-null values
DemeanGroupId 103922 non-null values
NormalizationGroupId 103922 non-null values
dtypes: datetime64[ns](1), float64(3), int64(1)

jreback commented Dec 6, 2013

Not sure... maybe Andy (@hayd) can chime in?

hayd commented Dec 6, 2013

So I think we can see this from the heads of df1 and df2; the left join doesn't look right (it has only NaNs):

In [11]: pd.merge(df1.head(), df2.head(), how='outer', left_on=['FactorDate', 'SourceFundSecNo'], right_index=True)
Out[11]: 
           FactorDate  FundSecNo  SourceFundSecNo  DemeanGroupId  NormalizationGroupId
0 2011-10-18 00:00:00        280            12432            NaN                   NaN
1 2011-10-18 00:00:00        283              283         350171                    35
2 2011-10-18 00:00:00        285              285        1510157                   151
3 2011-10-18 00:00:00        286              286        1510157                   151
4 2011-10-18 00:00:00        287              287            NaN                   NaN
4 2011-10-19 00:00:00        NaN              283         350171                    35
4 2011-10-19 00:00:00        NaN              285        1510157                   151

[7 rows x 5 columns]
# inner
In [12]: pd.merge(df1.head(), df2.head(), left_on=['FactorDate', 'SourceFundSecNo'], right_index=True)
Out[12]: 
           FactorDate  FundSecNo  SourceFundSecNo  DemeanGroupId  NormalizationGroupId
1 2011-10-18 00:00:00        283              283         350171                    35
2 2011-10-18 00:00:00        285              285        1510157                   151
3 2011-10-18 00:00:00        286              286        1510157                   151 

[3 rows x 5 columns]
In [13]: pd.merge(df1.head(), df2.head(), how='left', left_on=['FactorDate', 'SourceFundSecNo'], right_index=True)
Out[13]: 
           FactorDate  FundSecNo  SourceFundSecNo  DemeanGroupId  NormalizationGroupId
0 2011-10-18 00:00:00        280            12432            NaN                   NaN
1 2011-10-18 00:00:00        283              283            NaN                   NaN
2 2011-10-18 00:00:00        285              285            NaN                   NaN
3 2011-10-18 00:00:00        286              286            NaN                   NaN
4 2011-10-18 00:00:00        287              287            NaN                   NaN

[5 rows x 5 columns]

Clearly these shouldn't all be NaNs...
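[Editor's note] On affected versions, the buggy column-vs-MultiIndex matching can be sidestepped by flattening the right frame's index and merging on plain columns, as suggested earlier in the thread. A minimal sketch with made-up stand-in data:

```python
import pandas as pd

# Stand-ins for df1/df2 (values are made up for illustration).
df1 = pd.DataFrame({
    "FactorDate": pd.to_datetime(["2011-10-18", "2011-10-18"]),
    "SourceFundSecNo": [283, 12432],
})
df2 = pd.DataFrame(
    {"DemeanGroupId": [350171]},
    index=pd.MultiIndex.from_tuples(
        [(pd.Timestamp("2011-10-18"), 283)],
        names=["FactorDate", "FundSecNo"],
    ),
)

# Flatten the MultiIndex into columns, rename the key so it lines up with
# df1's column, then do an ordinary column-on-column left merge.
flat = df2.reset_index().rename(columns={"FundSecNo": "SourceFundSecNo"})
out = df1.merge(flat, on=["FactorDate", "SourceFundSecNo"], how="left")
print(out)
```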

@jreback modified the milestones: 0.15.0, 0.14.0 Apr 4, 2014
@altaurog

I seem to be having the same problem with a datetime and string (object) MultiIndex. Here's my minimal reproducible example:

In [1]: import pandas as pd
In [2]: dfa = pd.DataFrame([['2012-08-02', 'L', 10],
   ...:                     ['2012-08-02', 'J', 15],
   ...:                     ['2013-04-06', 'L', 20],
   ...:                     ['2013-04-06', 'J', 25],],
   ...:                      columns=['x','y','a'])
In [3]: dfa['x'] = pd.to_datetime(dfa['x'])
In [4]: dfb = pd.DataFrame([['2012-08-02','J',1],
   ...:                     ['2013-04-06','L',2]],
   ...:                      columns=['x','y','z'], index=[2,4,])
In [5]: dfb['x'] = pd.to_datetime(dfb['x'])
In [6]: pd.merge(dfa, dfb)  # produces correct result
Out[6]: 
    a          x  y  z
0  15 2012-08-02  J  1
1  20 2013-04-06  L  2

[2 rows x 4 columns]
In [7]: dfb.set_index(['x','y']).join(dfa.set_index(['x','y']))  # correct
Out[7]: 
              z   a
x          y       
2012-08-02 J  1  15
2013-04-06 L  2  20

[2 rows x 2 columns]
In [8]: dfb.join(dfa.set_index(['x','y']), on=['x','y'])  # does not produce correct result
Out[8]: 
           x  y  z   a
2 2012-08-02  J  1 NaN
4 2013-04-06  L  2 NaN

[2 rows x 4 columns]

The problem is, I'd like to retain the index on dfb, so the best option for me would have been the one which doesn't work. As a workaround I can reset the index on dfb and set the index again after the merge:

In [9]: pd.merge(dfa, dfb.reset_index()).set_index('index')
Out[9]: 
        a          x  y  z
index                     
2      15 2012-08-02  J  1
4      20 2013-04-06  L  2

[2 rows x 4 columns]

This seems to be related to the fact that it is a MultiIndex and that there is a datetime in the index. The join() method works just fine if the index is just a simple datetime and it also works with the MultiIndex if I replace the datetime column with int values.

In [10]: dfb[['x','z']].join(dfa.loc[1:2].set_index(['x']), on=['x'])
Out[10]: 
           x  z   a  y
2 2012-08-02  1  15  J
4 2013-04-06  2  20  L

[2 rows x 4 columns]
In [11]: dfa['x'] = [0,0,1,1]
In [12]: dfb['x'] = [0,1]
In [13]: dfb.join(dfa.set_index(['x','y']), on=['x','y'])
Out[13]: 
   x  y  z   a
2  0  J  1  15
4  1  L  2  20

[2 rows x 4 columns]

@altaurog

I neglected to mention I'm using pandas 0.13.1

@jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
dsm054 commented Nov 13, 2018

Is this still a problem? Even as far back as 0.18.1, which is the oldest pandas I can easily install, I get saner results for both @hayd's and @altaurog's examples.

jreback commented Nov 13, 2018

Just use merge_asof here.
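[Editor's note] merge_asof does ordered, as-of matching rather than exact equality, so it is a different tool than the join above. A minimal sketch with hypothetical data; both frames must be sorted on the `on` key, and `by=` adds an exact-match group column:

```python
import pandas as pd

# Hypothetical frames (names and values are made up for illustration).
quotes = pd.DataFrame({
    "time": pd.to_datetime(["2011-10-18", "2011-10-18", "2011-10-19"]),
    "sec": [283, 285, 283],
    "group": [35, 151, 35],
})

trades = pd.DataFrame({
    "time": pd.to_datetime(["2011-10-18", "2011-10-19"]),
    "sec": [283, 283],
})

# For each trade, take the most recent quote at or before its time,
# matching exactly on "sec" via by=.
out = pd.merge_asof(trades, quotes, on="time", by="sec")
print(out)
```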

@mroeschke added the good first issue and Needs Tests (unit test(s) needed to prevent regressions) labels and removed the Bug label Sep 29, 2019
@jreback modified the milestones: Contributions Welcome, 1.0 Dec 26, 2019