fix the function find_common_types bug #25320

ghost · 2019-02-14T13:25:32Z

`types[0]` can raise a KeyError when `types` is a `pd.Series`. See issue #25270.

closes #xxxx
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

WillAyd

Haven't reviewed all of the failures but this doesn't seem right given this is a very generic function. Does the error affect things outside of SparseDataFrame? If not then seems like the issue needs to be addressed directly there

WillAyd · 2019-02-15T04:12:10Z

Also please add test(s) - should be the first part to any PR

ghost · 2019-02-15T13:36:01Z

Ee, how to add test(s)? 😄

jreback · 2019-02-16T16:41:53Z

pandas/core/dtypes/cast.py

@@ -1075,7 +1075,7 @@ def find_common_type(types):

    Parameters
    ----------
-    types : list of dtypes
+    types :  list_like


jreback · 2019-02-16T16:42:06Z

pandas/core/dtypes/cast.py

@@ -1090,7 +1090,7 @@ def find_common_type(types):
    if len(types) == 0:
        raise ValueError('no types given')

-    first = types[0]
+    first = types[:1]


if you are changing this, you must have a failing test case, can you pls add it
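A failing test for this change could look roughly like the sketch below. It is only an illustration under stated assumptions: `first_dtype` is a hypothetical stand-in for the `first = ...` line inside `find_common_type`, and the test name is made up. A real test in the pandas suite would call `find_common_type` itself.

```python
import numpy as np
import pandas as pd


def first_dtype(types):
    # Hypothetical stand-in for the `first = ...` line in find_common_type,
    # using the positional form proposed later in this PR.
    return [t for t in types][0]


def test_first_dtype_with_nonzero_integer_labels():
    # dtypes of a frame whose integer column labels do not start at 0
    df = pd.DataFrame(np.zeros((3, 4), dtype="int64"), columns=[1, 2, 3, 4])
    types = df.dtypes

    # Label-based lookup fails because there is no label 0 ...
    try:
        types[0]
    except KeyError:
        raised = True
    else:
        raised = False
    assert raised

    # ... while taking the first element by position still works.
    assert first_dtype(types) == np.dtype("int64")
```

The first half of the test documents the failure on the old code path, and the second half pins the behaviour the fix is supposed to provide.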

The last modification could not pass the tests, so I fixed it; now the tests pass.

codecov · 2019-02-17T02:56:54Z

Codecov Report

Merging #25320 into master will decrease coverage by 49.99%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #25320      +/-   ##
==========================================
- Coverage   91.72%   41.72%     -50%     
==========================================
  Files         173      173              
  Lines       52831    52831              
==========================================
- Hits        48457    22042   -26415     
- Misses       4374    30789   +26415

Flag	Coverage Δ
#multiple	`?`
#single	`41.72% <100%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/dtypes/cast.py	`48.83% <100%> (-39.34%)`	⬇️
pandas/io/formats/latex.py	`0% <0%> (-100%)`	⬇️
pandas/core/categorical.py	`0% <0%> (-100%)`	⬇️
pandas/io/sas/sas_constants.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/plotting.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/converter.py	`0% <0%> (-100%)`	⬇️
pandas/io/formats/html.py	`0% <0%> (-99.35%)`	⬇️
pandas/core/groupby/categorical.py	`0% <0%> (-95.46%)`	⬇️
pandas/io/sas/sas7bdat.py	`0% <0%> (-91.17%)`	⬇️
pandas/io/sas/sas_xport.py	`0% <0%> (-90.15%)`	⬇️
... and 130 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b144f66...ccca752. Read the comment docs.

ghost · 2019-02-17T04:22:33Z

In issue #25270, @rasbt raised this question. He reports that "the Pandas SparseDataFrame method to_coo() (and possibly others) cannot handle sparse dataframes if the column names are integer types and don't start at 0. If the column names start at 0 or are string types, this is not an issue."

Yes, he is right. I tracked down the KeyError and found a bug in the function find_common_type(). Below I show how the error occurs.

# example from @rasbt

import pandas as pd
import numpy as np

ary = np.array([[1, 0, 0, 3],
                [1, 0, 2, 0],
                [0, 4, 0, 0]])

df = pd.DataFrame(ary)
df.columns = [1, 2, 3, 4]

dfs = pd.SparseDataFrame(df,
                         default_fill_value=0)

# DOES NOT WORK:

dfs.to_coo() # raises KeyError: 0

Now if we check:

In [12]: dfs.dtypes
Out[12]: 
1    int64
2    int64
3    int64
4    int64
dtype: object
In [13]: type(dfs.dtypes)
Out[13]: pandas.core.series.Series

As we can see, dfs.dtypes is not a list but a Series, and it is passed to the function find_common_type():

# pandas/core/dtypes/cast.py, in find_common_type(types), around line 1093
def find_common_type(types):
    """
    Find a common data type among the given dtypes.

    Parameters
    ----------
    types : list of dtypes

    Returns
    -------
    pandas extension or numpy dtype

    See Also
    --------
    numpy.find_common_type

    """

    if len(types) == 0:
        raise ValueError('no types given')

    first = types[0]  # fine for a list, but a pd.Series can raise a KeyError here

Let's check the statement first = types[0]:

In [20]: dfs.dtypes[0]
---------------------------------------------------------------
KeyError                      Traceback (most recent call last)
<ipython-input-20-4d14dd9f5c73> in <module>()
----> 1 dfs.dtypes[0]

~/anaconda3/lib/python3.7/site-packages/pandas/core/series.py in __getitem__(self, key)
    765         key = com._apply_if_callable(key, self)
    766         try:
--> 767             result = self.index.get_value(self, key)
    768 
    769             if not is_scalar(result):

~/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
   3116         try:
   3117             return self._engine.get_value(s, k,
-> 3118                                           tz=getattr(series.dtype, 'tz', None))
   3119         except KeyError as e1:
   3120             if len(self) > 0 and self.inferred_type in ['integer', 'boolean']:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 0

Yes, it raises a KeyError. So using types[0] to get the first item of types fails in this case.

In @rasbt's example there are two cases that do work.
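The same lookup behaviour can be seen without `SparseDataFrame`, on the dtypes of an ordinary `DataFrame`. The snippet below is a small sketch, assuming only public pandas indexers (`.iloc` and `.loc`); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((3, 4), dtype="int64"), columns=[1, 2, 3, 4])
types = df.dtypes  # a Series indexed by the column labels 1..4

# Integer keys are looked up as labels on an integer index,
# and no column is labelled 0, so types[0] has nothing to find.
has_label_zero = 0 in types.index

first_by_position = types.iloc[0]  # explicit positional access
first_by_label = types.loc[1]      # explicit label access
```

Both explicit indexers return the dtype of the first column, which is what `first` is meant to hold.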

# WORKS (1)

dfs2 = dfs.copy()
dfs2.columns = [0, 1, 2, 3]
dfs2.to_coo()

# WORKS (2)

dfs3 = dfs.copy()
dfs3.columns = [str(i) for i in dfs3.columns]
dfs3.to_coo()

In fact, dfs.columns is the same as dfs.dtypes.index. dfs.dtypes, dfs2.dtypes and dfs3.dtypes are all Series, and a Series looks items up through its index.

In [10]: dfs.dtypes.index
Out[10]: Int64Index([1, 2, 3, 4], dtype='int64')

In [11]: dfs2.dtypes.index
Out[11]: Int64Index([0, 1, 2, 3], dtype='int64')

In [12]: dfs3.dtypes.index
Out[12]: Index(['1', '2', '3', '4'], dtype='object')

Using types[0] behaves differently depending on Series.index.dtype. (Of course, the same is true for pd.DataFrame.) The 0 in types[0] is treated as a key (label) when dfs.dtypes.index.dtype is int64, but as a position, like list[0], when dfs.dtypes.index.dtype is object (str).

So first = types[0] can't handle all of these cases, while first = [t for t in types][0] handles them simply.

Of course, first = types[:1] is not the right way, because it can't pass the tests, although it does solve @rasbt's case.

After my first commit some checks were not successful, but the newest update passes the tests.
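As a rough comparison of position-based alternatives, the sketch below checks that the list comprehension and a plain iterator agree for both a list and a Series. `first_item` is a hypothetical helper, not something that exists in pandas, and it assumes `types` is non-empty (the function already checks `len(types) == 0` before this point).

```python
import numpy as np
import pandas as pd


def first_item(types):
    # Hypothetical helper: first element by position for any non-empty iterable,
    # without building an intermediate list.
    return next(iter(types))


as_list = [np.dtype("int64"), np.dtype("float64")]
# Same dtypes in a Series whose integer labels do not start at 0.
as_series = pd.Series(as_list, index=[1, 2], dtype=object)

from_comprehension = [t for t in as_series][0]
from_iterator = first_item(as_series)
from_iloc = as_series.iloc[0]  # only available on pandas objects
```

Iteration ignores the index labels entirely, which is why it works the same way for a list and for a Series.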
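One likely reason the slice does not work, shown here with a plain list: `types[:1]` returns a one-element container rather than the element itself, so `first` would no longer be a dtype. This is a small sketch with illustrative variable names, not the failing pandas test.

```python
import numpy as np

types = [np.dtype("int64"), np.dtype("int64")]

first_item = types[0]    # a dtype
first_slice = types[:1]  # a one-element list wrapping that dtype

slice_is_dtype = isinstance(first_slice, np.dtype)  # the slice is a list
unwrapped = first_slice[0]                          # the dtype is one level down
```

Any later code that expects `first` to be a single dtype would receive a container instead.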

jreback · 2019-03-20T02:05:31Z

closing as stale. if you want to keep working, merge master and ping

fix the function find_common_types bug

b5fe3f7

WillAyd requested changes Feb 15, 2019

gfyoung requested a review from jreback February 15, 2019 09:22

gfyoung added the Sparse (Sparse Data Type) and Algos (Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) labels Feb 15, 2019

jreback requested changes Feb 16, 2019

ghost changed the title ~~fix the function find_common_types bug~~ fix the function find_common_types bug Feb 17, 2019

update find_common_type() statement about first.

ccca752

jreback closed this Mar 20, 2019
