Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed regression of Multi index with NaN #25424

Closed
wants to merge 1 commit into from
Closed

Fixed regression of Multi index with NaN #25424

wants to merge 1 commit into from

Conversation

hksonngan
Copy link
Contributor

@hksonngan hksonngan commented Feb 24, 2019

@codecov
Copy link

codecov bot commented Feb 24, 2019

Codecov Report

Merging #25424 into master will decrease coverage by <.01%.
The diff coverage is 84.21%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25424      +/-   ##
==========================================
- Coverage   91.25%   91.24%   -0.01%     
==========================================
  Files         172      172              
  Lines       52973    53007      +34     
==========================================
+ Hits        48338    48366      +28     
- Misses       4635     4641       +6
Flag Coverage Δ
#multiple 89.82% <84.21%> (-0.01%) ⬇️
#single 41.76% <55.26%> (+0.01%) ⬆️
Impacted Files Coverage Δ
pandas/core/indexes/multi.py 95.26% <84.21%> (-0.36%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9eec9b8...7f7a12a. Read the comment docs.

@codecov
Copy link

codecov bot commented Feb 24, 2019

Codecov Report

Merging #25424 into master will decrease coverage by 50.04%.
The diff coverage is 45%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #25424       +/-   ##
===========================================
- Coverage   91.73%   41.69%   -50.05%     
===========================================
  Files         173      173               
  Lines       52856    52876       +20     
===========================================
- Hits        48490    22048    -26442     
- Misses       4366    30828    +26462
Flag Coverage Δ
#multiple ?
#single 41.69% <45%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/indexes/multi.py 34.23% <45%> (-61.39%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/core/categorical.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.35%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-95.46%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.17%) ⬇️
pandas/io/sas/sas_xport.py 0% <0%> (-90.15%) ⬇️
... and 131 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3855a27...b2879de. Read the comment docs.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are adding a significant amount of non trivial code here - which likely duplicates a fair amount of existing

likely this is change needs some work

@hksonngan
Copy link
Contributor Author

@jreback
I want to check NaN in MultiIndex, but function isna() isn't implemented in MultiIndex.
How can I change?
and I change from np.where() to use pd.core.algorithms.isin() but can not pass the test cases.
Should I use directly np.isin() or modify algorithms.isin()?

@pep8speaks
Copy link

pep8speaks commented Feb 25, 2019

Hello @hksonngan! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-03-14 21:47:12 UTC

@gfyoung gfyoung added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Regression Functionality that used to work in a prior pandas version MultiIndex labels Feb 25, 2019
@jreback
Copy link
Contributor

jreback commented Mar 3, 2019

@hksonngan you are adding an amazing amount of complexity here. Please step thru the setting code; This is likely a very small change.

@hksonngan
Copy link
Contributor Author

@jreback I think one problem remains when set value multi-index with NaN, get_loc, will crash. I will investigate this.

@hksonngan
Copy link
Contributor Author

@jreback I add code to pass of the case when setting value with NaN index, but I don't know why I get a lot of linting errors when I rebase. I don't change these files. Do you know why?

@hksonngan hksonngan mentioned this pull request Mar 13, 2019
@hksonngan
Copy link
Contributor Author

@jreback ping

@jreback
Copy link
Contributor

jreback commented Mar 14, 2019

@hksonngan this is a massive amount of code. I am not even sure what you are trying to do your fix is VERY complicated. Please point out exactly where the issue is.

@hksonngan
Copy link
Contributor Author

hksonngan commented Mar 15, 2019

@hksonngan this is a massive amount of code. I am not even sure what you are trying to do your fix is VERY complicated. Please point out exactly where the issue is.

@jreback if we have NaN in MultiIndex as:

df = pd.DataFrame(
    [
        ['A', np.nan, 1.23, 4.56],
        ['A', 'G', 1.32, 4.65],
        ['A', 'D', 9.87, 10.54],
    ],
    columns=['pivot_0', 'pivot_1', 'col_1', 'col_2'],
)
df.set_index(['pivot_0', 'pivot_1'], inplace=True)
                                 col_1   col_2
pivot_0  pivot_1
A            NaN           1.23     4.56
              G                1.32     4.65
              D                9.87    10.54

Now I set new df.at[('A', 'F'), 'col_2'] = 0.0 I get wrong result in col_1 as:

                                 col_1   col_2
pivot_0  pivot_1
A            NaN           1.23      4.56
              G                1.32      4.65
              D                9.87    10.54
              F                 1.23      0.0

The right result is must F NaN 0.0
This is because get_indexer indexer = self._base.get_indexer(self, lab_ints) return from hash with duplicate value for index NaN and new as F here.
For above example indexer = [-1 4 6 -1]

@hksonngan
Copy link
Contributor Author

Or I add new MultiIndex with NaN, I get exception like this code:

df = pd.DataFrame(
    [
        ['A', 'G', 1.32, 4.65],
        ['A', 'D', 9.87, 10.54],
    ],
    columns=['pivot_0', 'pivot_1', 'col_1', 'col_2'],
)
df.set_index(['pivot_0', 'pivot_1'], inplace=True)
df.at[('A', np.nan), 'col_2'] = 0.0 # Get exception in here

@hksonngan
Copy link
Contributor Author

hksonngan commented Mar 15, 2019

  • So I fixed simple by check 'NaN' exist or not in MultiIndex with function _hasnans, because when add new value to DataFrame, the get_indexer return from hash with dulicate value -1, as NaN value and new value is same result from hashtable.
  • Adding new value when have NaN in index will get the exception at line 684 index.pyx lab_int was not calculated for NaN value at class MultiIndex with function _engine() line 1234 . So I compensate with position of NaN in levels from line 688 index.pyx, I raise exception to call function _hasnans, and recalculate position of NaN in function _engine() line 1234
  • When MultiIndex initialize, also call function _hasnans, because I want to cache this value. If the new index NaN, I can raise exception in get_loc() as above.

@hksonngan
Copy link
Contributor Author

close as I will investigate another way to solve this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate MultiIndex Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MultiIndex Bug Copying Values Incorrectly When Adding Values To Index
4 participants