Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH/API: accept list-like percentiles in describe (WIP) #7088

Merged
merged 1 commit into from
May 14, 2014

Conversation

TomAugspurger
Copy link
Contributor

Closes #4196

This is for frames. I'm going to refactor this into generic since to cover series / frames.

A couple questions:

  • API wise, I added a new kwarg percentiles. For backwards compat, we keep the percentile_width kwarg. I changed the default percentile_width from 50 to None (but the default output is the same) Cases:
    1. You specify percentile_width and percentiles -> ValueError
    2. You specify neither percentile_width nor percentiles -> percentile_width set to 50 and same as before
    3. You specify one of those, everything goes as expected.
  • I'm accepting either decimals (e.g. [0.25, .5, .75]) or percentages (e.g. [25, 50, 75]). Those two are equivalent in output. Should I move this logic to the .quantile to be more consistent?
  • I'm choosing to not sort the provided percentiles. It's easy for the user to sort, but hard for them to unsort if for some reason they want it in a specific order. I'd rather not add another kwarg.

@jreback jreback added this to the 0.14.1 milestone May 9, 2014
@TomAugspurger
Copy link
Contributor Author

I'm close on this... a quick question though. There's also a describe for object types (strs or datetime). Notice the different index order for the last one (this is all current behavior):

# strs only
In [31]: df2 = pd.DataFrame({"C2": ['a', 'a', 'b', 'c']})

In [32]: df2.describe()
Out[32]: 
       C2
count   4
unique  3
top     a
freq    2

[4 rows x 1 columns]

# datetime only
In [28]: df = DataFrame({"C1": pd.date_range('2010-01-01', periods=4, freq='D')})

In [29]: df
Out[29]: 
          C1
0 2010-01-01
1 2010-01-02
2 2010-01-03
3 2010-01-04

[4 rows x 1 columns]

In [30]: df.describe()
Out[30]: 
                         C1
count                     4
unique                    4
first   2010-01-01 00:00:00
last    2010-01-04 00:00:00
top     2010-01-01 00:00:00
freq                      1

[6 rows x 1 columns]

# mix of timestamp and strs
In [33]: df = pd.concat([df, df2], axis=1)

In [35]: df.describe()
Out[35]: 
                         C1   C2
count                     4    4
first   2010-01-01 00:00:00  NaN
freq                      1    2
last    2010-01-04 00:00:00  NaN
top     2010-01-01 00:00:00    a
unique                    4    3

[6 rows x 2 columns]

So the index gets sorted. Is it worth breaking backwards compat to keep the index in a sensible order? I'm not sure.

@jreback
Copy link
Contributor

jreback commented May 10, 2014

yeh these should be in a sensible order I think
u can put it in API changes

@TomAugspurger
Copy link
Contributor Author

Moved to generic (I'm not sure it was worth it; the code got pretty messy with a bunch of if / else.), updated docs. Should be good when travis says so.

@jreback
Copy link
Contributor

jreback commented May 10, 2014

ok, in theory ndim>=3 should work via .apply FYI

you can put tests in test_generic.py (you can do specific tests or have it create them generically)

# dtypes: numeric only, numeric mixed, objects only
data = self._get_numeric_data()
if self.ndim > 1:
if len(data.columns) == 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do this as len(data._info_axis)

@TomAugspurger
Copy link
Contributor Author

@jreback was there anything else you saw here? I think it''s ready.

@jreback
Copy link
Contributor

jreback commented May 11, 2014

didn't realize their is an argument percentile_width

I think u should just rename it to percentiles and make it what u have for percentiles (and if it's a scalar then the meaning is unchanged)

I think too confusing with that argument (which is prob not used much at all) - yours is much more useful

@TomAugspurger
Copy link
Contributor Author

Should we do any warning / deprecation? I should be able to handle that very easily.

On May 11, 2014, at 3:46 PM, "jreback" <notifications@gh.neting.ccmailto:notifications@github.com> wrote:

didn't realize their is an argument percentile_width

I think u should just rename it to percentiles and make it what u have for percentiles (and if it's a scalar then the meaning is unchanged)

I think too confusing with that argument (which is prob not used much at all) - yours is much more useful


Reply to this email directly or view it on GitHubhttps://github.com//pull/7088#issuecomment-42782700.

@jreback
Copy link
Contributor

jreback commented May 11, 2014

sure why don't I deprecate perentile_width and replace with percentile

otherwise functionality is the same

@TomAugspurger
Copy link
Contributor Author

Added a note about this deprecation to ##6581.

Anything else?

@TomAugspurger
Copy link
Contributor Author

@jorisvandenbossche Does my deprecation note here look ok? That's how the numpy guide said to do it for objects. I assumed it was similar for keyword arguments.

@@ -3478,6 +3478,152 @@ def _convert_timedeltas(x):

return np.abs(self)

_shared_docs['describe'] = """
Generate various summary statistics of self, excluding
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the of self is not very clear for people not knowing the self-concept, maybe just leave it out?

@jorisvandenbossche
Copy link
Member

Some comments:

  • is it necessary that both the percentage-style (50) and number-style (0.5) are allowed? This seems to make it a bit complex to me (what if you use the first style, but want to give 0.5%, then you have to use the other style and it becomes confusing I think)
  • the deprecation of percentile_width is not in the release notes
  • About the deprecation note in the docstring. I think this is OK, but another option is to keep the parameter in the parameter section, but start its explanation with like 'Deprecated and will be removed. Use instead ... ' (the numpy docstring standard does not really mention this how to do, I think in pandas we mostly mention it in the parameter section itself instead of a seperate note (although we maybe mostly forget it ...), eg http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_html.html?highlight=deprecated).

@TomAugspurger
Copy link
Contributor Author

@jorisvandenbossche thanks for the comments.

I was hoping that accepting both percentages and raw decimals would be less confusing, since I can never remember which we expect.

I actually had a longer reply written and then I realized why it was confusing. I'll switch it back to just expecting decimals between [0, 1], which is at least consistent with quantile.

@jorisvandenbossche
Copy link
Member

@TomAugspurger I think quantile was originally designed this way to match R, however, now it dismatches np.percentile which uses [0,100]. In any case, too late to change now in pandas I think, and indeed most important to be consistent within pandas with quantile.

@jorisvandenbossche
Copy link
Member

BTW, nice and informative FutureWarning! +1

@jreback jreback modified the milestones: 0.14.0, 0.14.1 May 13, 2014
TomAugspurger pushed a commit that referenced this pull request May 14, 2014
ENH/API: accept list-like percentiles in describe (WIP)
@TomAugspurger TomAugspurger merged commit f26e668 into pandas-dev:master May 14, 2014
@jreback
Copy link
Contributor

jreback commented May 14, 2014

@TomAugspurger why are you using Counter rather than pd.core.algorithm.value_counts() again?

Counter DOES not sort, while value_counts does.

On windows test_generic/test_describe_object is failing on the datetimes because of the arbitrary order from Counter (as the counts are all 1 for some reason it picks the 2nd one).

======================================================================
FAIL: test_describe_objects (pandas.tests.test_generic.TestDataFrame)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win-amd64-2.7\pandas\tests\test_generic.py", line 997, in test_describe_objects
    assert_frame_equal(result, expected)
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win-amd64-2.7\pandas\util\testing.py", line 573, in assert_frame_equal
    check_exact=check_exact)
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win-amd64-2.7\pandas\util\testing.py", line 520, in assert_series_equal
    assert_almost_equal(left.values, right.values, check_less_precise)
  File "testing.pyx", line 58, in pandas._testing.assert_almost_equal (pandas\src\testing.c:2465)
  File "testing.pyx", line 93, in pandas._testing.assert_almost_equal (pandas\src\testing.c:1793)
  File "testing.pyx", line 139, in pandas._testing.assert_almost_equal (pandas\src\testing.c:2338)
AssertionError: Timestamp('2010-01-02 00:00:00') != Timestamp('2010-01-01 00:00:00')

----------------------------------------------------------------------
Ran 6971 tests in 194.439s

FAILED (SKIP=130, failures=1)
2.7\pandas\core\generic.py(3576)describe_categorical_1d()
-> if data.dtype == object:
(Pdb) n
> c:\users\jeff reback\documents\github\pandas\build\lib.win-amd64-2.7\pandas\core\generic.py(3585)describe_categorical_1d()
-> elif issubclass(data.dtype.type, np.datetime64):
(Pdb)
> c:\users\jeff reback\documents\github\pandas\build\lib.win-amd64-2.7\pandas\core\generic.py(3586)describe_categorical_1d()
-> names = ['count', 'unique']
(Pdb) p data
0   2010-01-01
1   2010-01-02
2   2010-01-03
3   2010-01-04
Name: C1, dtype: datetime64[ns]
(Pdb) n
> c:\users\jeff reback\documents\github\pandas\build\lib.win-amd64-2.7\pandas\core\generic.py(3587)describe_categorical_1d()
-> asint = data.dropna().values.view('i8')
(Pdb) n
> c:\users\jeff reback\documents\github\pandas\build\lib.win-amd64-2.7\pandas\core\generic.py(3588)describe_categorical_1d()
-> objcounts = compat.Counter(asint)
(Pdb) p asint
array([1262304000000000000, 1262390400000000000, 1262476800000000000,
       1262563200000000000], dtype=int64)
(Pdb) l
3583                        result += [top, freq]
3584
3585                elif issubclass(data.dtype.type, np.datetime64):
3586                    names = ['count', 'unique']
3587                    asint = data.dropna().values.view('i8')
3588 ->                 objcounts = compat.Counter(asint)
3589                    result = [data.count(), len(objcounts)]
3590                    if result[1] > 0:
3591                        top, freq = objcounts.most_common(1)[0]
3592                        names += ['first', 'last', 'top', 'freq']
3593                        result += [lib.Timestamp(asint.min()),
(Pdb) p compat.Counter(asint)
Counter({1262390400000000000: 1, 1262563200000000000: 1, 1262304000000000000: 1, 1262476800000000000: 1})
(Pdb) p asint
array([1262304000000000000, 1262390400000000000, 1262476800000000000,
       1262563200000000000], dtype=int64)
(Pdb) p pd.algorithms.value_counts
*** AttributeError: AttributeError("'module' object has no attribute 'algorithms'",)
(Pdb) p pd.core.algorithms.value_counts
<function value_counts at 0x00000000076FDBA8>
(Pdb) p pd.core.algorithms.value_counts(asint)
1262304000000000000    1
1262563200000000000    1
1262476800000000000    1
1262390400000000000    1
dtype: int64
(Pdb) p pd.core.algorithms.value_counts(asint,sort=True)
1262304000000000000    1
1262563200000000000    1
1262476800000000000    1
1262390400000000000    1
dtype: int64
(Pdb)

@TomAugspurger
Copy link
Contributor Author

Ahh I missed that one. I'll switch it over to use value counts and fix the test so that it isn't ambiguous.

@jreback
Copy link
Contributor

jreback commented May 14, 2014

awesome just put up a pr and I can test

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: add quantiles kw to Series.describe to create dynamic quantiles
3 participants