
0.12: join with on= fails with TimeStamp and Int64 MultiIndex #5647

Closed
bluefir opened this issue Dec 5, 2013 · 14 comments · Fixed by #30444
Labels
good first issue · MultiIndex · Needs Tests (unit test(s) needed to prevent regressions)
bluefir commented Dec 5, 2013

I encountered a very strange problem in 0.12: a straightforward join fails that works just fine in 0.11.

df1

<class 'pandas.core.frame.DataFrame'>
Int64Index: 71958 entries, 0 to 104014
Data columns (total 3 columns):
FactorDate 71958 non-null values
FundSecNo 71958 non-null values
SourceFundSecNo 71958 non-null values
dtypes: datetime64[ns](1), int64(2)

df1.dtypes

FactorDate datetime64[ns]
FundSecNo int64
SourceFundSecNo int64
dtype: object

df2

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 94868 entries, (2011-10-18 00:00:00, 283) to (2011-10-19 00:00:00, 127419)
Data columns (total 2 columns):
DemeanGroupId 94868 non-null values
NormalizationGroupId 94868 non-null values
dtypes: float64(1), int64(1)

for ind in df2.index.levels:
    print(type(ind))

<class 'pandas.tseries.index.DatetimeIndex'>
<class 'pandas.core.index.Int64Index'>

test = df1.join(df2, on=['FactorDate', 'SourceFundSecNo'])
test

<class 'pandas.core.frame.DataFrame'>
Int64Index: 71958 entries, 0 to 104014
Data columns (total 5 columns):
FactorDate 71958 non-null values
FundSecNo 71958 non-null values
SourceFundSecNo 71958 non-null values
DemeanGroupId 0 non-null values
NormalizationGroupId 0 non-null values
dtypes: datetime64[ns](1), float64(2), int64(2)

date = datetime.datetime(2011, 10, 18)
sfsn = 283
test.loc[(test.FactorDate == date) & (test.SourceFundSecNo == sfsn)]
          FactorDate  FundSecNo  SourceFundSecNo  DemeanGroupId  NormalizationGroupId

1 2011-10-18 00:00:00 283 283 NaN NaN
17933 2011-10-18 00:00:00 65213 283 NaN NaN

df2.loc[(date, sfsn)]

DemeanGroupId 350171
NormalizationGroupId 35
Name: (2011-10-18 00:00:00, 283), dtype: float64

I can attach the frames, but I forgot how to do it :-)

jreback commented Dec 5, 2013

Can you put up a small reproducible example, if possible?

bluefir commented Dec 5, 2013

I tried quickly but couldn't come up with one that exhibits the same behavior. Can I put the frames into an HDF5 file and somehow ship it to you guys? This would be the easiest way.

jreback commented Dec 6, 2013

Sure. Show the complete code to get them out, put the file on a Dropbox-like service, and post the link.

bluefir commented Dec 6, 2013

https://www.dropbox.com/s/c0hq6uqn8y5vkbk/issue5647.h5

store = pd.HDFStore('issue5647.h5', mode='r')
df1 = store.select('df1')
df2 = store.select('df2')
store.close()

jreback commented Dec 6, 2013

I think you want something like this. The problem is that it doesn't know how to match up 'SourceFundSecNo' with 'FundSecNo'.

In [72]: x = df2.reset_index()

In [73]: x['SourceFundSecNo'] = x['FundSecNo']

In [74]: x.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 94868 entries, 0 to 94867
Data columns (total 5 columns):
FactorDate              94868  non-null values
FundSecNo               94868  non-null values
DemeanGroupId           94868  non-null values
NormalizationGroupId    94868  non-null values
SourceFundSecNo         94868  non-null values
dtypes: datetime64[ns](1), float64(1), int64(3)
In [75]: df1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 71958 entries, 0 to 104014
Data columns (total 3 columns):
FactorDate         71958  non-null values
FundSecNo          71958  non-null values
SourceFundSecNo    71958  non-null values
dtypes: datetime64[ns](1), int64(2)

inner join

In [76]: pd.merge(df1,x,on=['FactorDate','SourceFundSecNo']).info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 71958 entries, 0 to 71957
Data columns (total 6 columns):
FactorDate              71958  non-null values
FundSecNo_x             71958  non-null values
SourceFundSecNo         71958  non-null values
FundSecNo_y             71958  non-null values
DemeanGroupId           71958  non-null values
NormalizationGroupId    71958  non-null values
dtypes: datetime64[ns](1), float64(1), int64(4)

An outer join is, I think, what you want (i.e. many-to-many):

In [77]: pd.merge(df1,x,on=['FactorDate','SourceFundSecNo'],how='outer').info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 103922 entries, 0 to 103921
Data columns (total 6 columns):
FactorDate              103922  non-null values
FundSecNo_x             71958  non-null values
SourceFundSecNo         103922  non-null values
FundSecNo_y             103922  non-null values
DemeanGroupId           103922  non-null values
NormalizationGroupId    103922  non-null values
dtypes: datetime64[ns](1), float64(3), int64(2)

bluefir commented Dec 6, 2013

Well, as you can see, the index in df2 is unique (at least it's supposed to be; by the way, is there a function that checks for that, similar to is_lexsorted?). So the idea is to propagate the unique values keyed by (FactorDate, FundSecNo) in df2 into each (FactorDate, SourceFundSecNo) in df1; in other words, many-to-one. This is a pretty standard idea.

I can work around that, sure, but it looks ugly and wastes CPU and memory. The problem is that it worked perfectly in 0.11 but doesn't in 0.12. Do you observe the same behavior? If so, why? Was the 0.11 behavior wrong? join() is just syntactic sugar for merge(), right? Every toy example I tried to come up with in 0.12 worked as expected.
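[Editor's note] The many-to-one lookup being described can be sketched with made-up stand-in values (the original frames are not available here); on pandas versions where this issue is fixed, `join` with `on=` against the MultiIndex fills the right-hand columns as expected:

```python
import pandas as pd

# df2-like frame: a unique (FactorDate, FundSecNo) MultiIndex (values made up).
right = pd.DataFrame(
    {"DemeanGroupId": [350171, 1510157], "NormalizationGroupId": [35, 151]},
    index=pd.MultiIndex.from_tuples(
        [(pd.Timestamp("2011-10-18"), 283), (pd.Timestamp("2011-10-18"), 285)],
        names=["FactorDate", "FundSecNo"],
    ),
)

# df1-like frame: plain integer index, join keys as ordinary columns.
left = pd.DataFrame({
    "FactorDate": pd.to_datetime(["2011-10-18", "2011-10-18"]),
    "FundSecNo": [283, 65213],
    "SourceFundSecNo": [283, 283],
})

# Many-to-one: each left row picks up the unique right row whose index
# matches its (FactorDate, SourceFundSecNo) pair.
joined = left.join(right, on=["FactorDate", "SourceFundSecNo"])
print(joined)
```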

jreback commented Dec 6, 2013

join's default is 'left'; I think you just need 'outer' here. Not sure why it's different in 0.11; I don't think it changed.

you can do

df2.index.is_unique
(which is True here)

In [84]: df1.join(df2,on=['FactorDate','SourceFundSecNo'],how='outer')
Out[84]: 
               FactorDate  FundSecNo  SourceFundSecNo  DemeanGroupId  NormalizationGroupId
0     2011-10-18 00:00:00        280            12432        1520161                   152
1883  2011-10-18 00:00:00      12432            12432        1520161                   152
1     2011-10-18 00:00:00        283              283         350171                    35
17933 2011-10-18 00:00:00      65213              283         350171                    35
2     2011-10-18 00:00:00        285              285        1510157                   151
3     2011-10-18 00:00:00        286              286        1510157                   151
4     2011-10-18 00:00:00        287              287         350161                    35
5     2011-10-18 00:00:00        294              294         350175                    35
6     2011-10-18 00:00:00        300              300         350174                    35

bluefir commented Dec 6, 2013

At the risk of beating a dead horse: I do want 'left'! The fact that 'left' returns no matches at all while 'outer' returns all of them just doesn't make sense. The 0.11 behavior did make sense. How can we make sense of test2 below given test3?

test2 = pd.merge(df1, df2, how='left', left_on=['FactorDate', 'SourceFundSecNo'], right_index=True)
test2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 71958 entries, 0 to 104014
Data columns (total 5 columns):
FactorDate 71958 non-null values
FundSecNo 71958 non-null values
SourceFundSecNo 71958 non-null values
DemeanGroupId 0 non-null values
NormalizationGroupId 0 non-null values
dtypes: datetime64[ns](1), float64(2), int64(2)

test3 = pd.merge(df1, df2, how='outer', left_on=['FactorDate', 'SourceFundSecNo'], right_index=True)
test3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103922 entries, 0 to 104014
Data columns (total 5 columns):
FactorDate 103922 non-null values
FundSecNo 71958 non-null values
SourceFundSecNo 103922 non-null values
DemeanGroupId 103922 non-null values
NormalizationGroupId 103922 non-null values
dtypes: datetime64[ns](1), float64(3), int64(1)

jreback commented Dec 6, 2013

Not sure... maybe Andy (@hayd) can chime in?

hayd commented Dec 6, 2013

So I think we can see this from the heads of df1 and df2; the left join doesn't look right (it has only NaNs):

In [11]: pd.merge(df1.head(), df2.head(), how='outer', left_on=['FactorDate', 'SourceFundSecNo'], right_index=True)
Out[11]: 
           FactorDate  FundSecNo  SourceFundSecNo  DemeanGroupId  NormalizationGroupId
0 2011-10-18 00:00:00        280            12432            NaN                   NaN
1 2011-10-18 00:00:00        283              283         350171                    35
2 2011-10-18 00:00:00        285              285        1510157                   151
3 2011-10-18 00:00:00        286              286        1510157                   151
4 2011-10-18 00:00:00        287              287            NaN                   NaN
4 2011-10-19 00:00:00        NaN              283         350171                    35
4 2011-10-19 00:00:00        NaN              285        1510157                   151

[7 rows x 5 columns]
# inner
In [12]: pd.merge(df1.head(), df2.head(), left_on=['FactorDate', 'SourceFundSecNo'], right_index=True)
Out[12]: 
           FactorDate  FundSecNo  SourceFundSecNo  DemeanGroupId  NormalizationGroupId
1 2011-10-18 00:00:00        283              283         350171                    35
2 2011-10-18 00:00:00        285              285        1510157                   151
3 2011-10-18 00:00:00        286              286        1510157                   151 

[3 rows x 5 columns]
In [13]: pd.merge(df1.head(), df2.head(), how='left', left_on=['FactorDate', 'SourceFundSecNo'], right_index=True)
Out[13]: 
           FactorDate  FundSecNo  SourceFundSecNo  DemeanGroupId  NormalizationGroupId
0 2011-10-18 00:00:00        280            12432            NaN                   NaN
1 2011-10-18 00:00:00        283              283            NaN                   NaN
2 2011-10-18 00:00:00        285              285            NaN                   NaN
3 2011-10-18 00:00:00        286              286            NaN                   NaN
4 2011-10-18 00:00:00        287              287            NaN                   NaN

[5 rows x 5 columns]

Clearly these shouldn't all be NaNs...
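[Editor's note] On affected versions, the buggy column-vs-MultiIndex matching can be sidestepped by flattening the right frame's index and merging on plain columns, as suggested earlier in the thread. A minimal sketch with made-up stand-in data:

```python
import pandas as pd

# Stand-ins for df1/df2 (values are made up for illustration).
df1 = pd.DataFrame({
    "FactorDate": pd.to_datetime(["2011-10-18", "2011-10-18"]),
    "SourceFundSecNo": [283, 12432],
})
df2 = pd.DataFrame(
    {"DemeanGroupId": [350171]},
    index=pd.MultiIndex.from_tuples(
        [(pd.Timestamp("2011-10-18"), 283)],
        names=["FactorDate", "FundSecNo"],
    ),
)

# Flatten the MultiIndex into columns, rename the key so it lines up with
# df1's column, then do an ordinary column-on-column left merge.
flat = df2.reset_index().rename(columns={"FundSecNo": "SourceFundSecNo"})
out = df1.merge(flat, on=["FactorDate", "SourceFundSecNo"], how="left")
print(out)
```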

@jreback modified the milestones: 0.15.0, 0.14.0 Apr 4, 2014
@altaurog

I seem to be having the same problem with a datetime and string (object) MultiIndex. Here's my minimal reproducible example:

In [1]: import pandas as pd
In [2]: dfa = pd.DataFrame([['2012-08-02', 'L', 10],
   ...:                     ['2012-08-02', 'J', 15],
   ...:                     ['2013-04-06', 'L', 20],
   ...:                     ['2013-04-06', 'J', 25],],
   ...:                      columns=['x','y','a'])
In [3]: dfa['x'] = pd.to_datetime(dfa['x'])
In [4]: dfb = pd.DataFrame([['2012-08-02','J',1],
   ...:                     ['2013-04-06','L',2]],
   ...:                      columns=['x','y','z'], index=[2,4,])
In [5]: dfb['x'] = pd.to_datetime(dfb['x'])
In [6]: pd.merge(dfa, dfb)  # produces correct result
Out[6]: 
    a          x  y  z
0  15 2012-08-02  J  1
1  20 2013-04-06  L  2

[2 rows x 4 columns]
In [7]: dfb.set_index(['x','y']).join(dfa.set_index(['x','y']))  # correct
Out[7]: 
              z   a
x          y       
2012-08-02 J  1  15
2013-04-06 L  2  20

[2 rows x 2 columns]
In [8]: dfb.join(dfa.set_index(['x','y']), on=['x','y'])  # does not produce correct result
Out[8]: 
           x  y  z   a
2 2012-08-02  J  1 NaN
4 2013-04-06  L  2 NaN

[2 rows x 4 columns]

The problem is, I'd like to retain the index on dfb, so the best option for me would have been the one which doesn't work. As a workaround I can reset the index on dfb and set the index again after the merge:

In [9]: pd.merge(dfa, dfb.reset_index()).set_index('index')
Out[9]: 
        a          x  y  z
index                     
2      15 2012-08-02  J  1
4      20 2013-04-06  L  2

[2 rows x 4 columns]

This seems to be related to the fact that it is a MultiIndex and that there is a datetime in the index. The join() method works just fine if the index is just a simple datetime and it also works with the MultiIndex if I replace the datetime column with int values.

In [10]: dfb[['x','z']].join(dfa.loc[1:2].set_index(['x']), on=['x'])
Out[10]: 
           x  z   a  y
2 2012-08-02  1  15  J
4 2013-04-06  2  20  L

[2 rows x 4 columns]
In [11]: dfa['x'] = [0,0,1,1]
In [12]: dfb['x'] = [0,1]
In [13]: dfb.join(dfa.set_index(['x','y']), on=['x','y'])
Out[13]: 
   x  y  z   a
2  0  J  1  15
4  1  L  2  20

[2 rows x 4 columns]

@altaurog

I neglected to mention I'm using pandas 0.13.1

@jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
dsm054 commented Nov 13, 2018

Is this still a problem? Even as far back as 0.18.1, which is the oldest pandas I can easily install, I get saner results for both @hayd's and @altaurog's examples.

jreback commented Nov 13, 2018

Just use merge_asof here.
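[Editor's note] merge_asof does ordered, as-of matching rather than exact equality, so it is a different tool than the join above. A minimal sketch with hypothetical data; both frames must be sorted on the `on` key, and `by=` adds an exact-match group column:

```python
import pandas as pd

# Hypothetical frames (names and values are made up for illustration).
quotes = pd.DataFrame({
    "time": pd.to_datetime(["2011-10-18", "2011-10-18", "2011-10-19"]),
    "sec": [283, 285, 283],
    "group": [35, 151, 35],
})

trades = pd.DataFrame({
    "time": pd.to_datetime(["2011-10-18", "2011-10-19"]),
    "sec": [283, 283],
})

# For each trade, take the most recent quote at or before its time,
# matching exactly on "sec" via by=.
out = pd.merge_asof(trades, quotes, on="time", by="sec")
print(out)
```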

@mroeschke added the good first issue and Needs Tests (unit test(s) needed to prevent regressions) labels and removed the Bug label Sep 29, 2019
@jreback modified the milestones: Contributions Welcome, 1.0 Dec 26, 2019