-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: add "level=" argument to MultiIndex.unique() #17897
API: add "level=" argument to MultiIndex.unique() #17897
Conversation
pandas/core/indexes/multi.py
Outdated
@@ -896,20 +896,24 @@ def _get_level_values(self, level): | |||
Parameters | |||
---------- | |||
level : int level | |||
unique : bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to add a version added tag
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(done)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need a versionadded tag
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added it, and then removed it on @jorisvandenbossche 's suggestion because this is an internal method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the default value in the doc string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
u can remove the versionadded tag here
From #17881 , @jreback 's comments:
I'm not a fan either...
... but I think this would be pretty confusing ("used" as in "this label appears in the index" vs. "used" as "this label is used multiple times") |
Alternatives:
|
how about |
you mean "levels" or "labels"? Anyway, not following you: |
57aa7e6
to
8a69543
Compare
Codecov Report
@@ Coverage Diff @@
## master #17897 +/- ##
==========================================
- Coverage 91.23% 91.22% -0.02%
==========================================
Files 163 163
Lines 50105 50107 +2
==========================================
- Hits 45715 45708 -7
- Misses 4390 4399 +9
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #17897 +/- ##
==========================================
- Coverage 91.38% 91.34% -0.05%
==========================================
Files 164 164
Lines 49797 49812 +15
==========================================
- Hits 45508 45501 -7
- Misses 4289 4311 +22
Continue to review full report at Codecov.
|
doc/source/whatsnew/v0.21.0.txt
Outdated
@@ -792,6 +792,7 @@ Other API Changes | |||
- Pandas no longer registers matplotlib converters on import. The converters | |||
will be registered and used when the first plot is draw (:issue:`17710`) | |||
- Setting on a column with a scalar value and 0-len index now raises a ``ValueError`` (:issue:`16823`) | |||
- :func:`MultiIndex.get_level_values` now supports the `unique` argument (:issue:`17896`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should unique have double ticks here, ``unique``?
Answering @jorisvandenbossche from here:
I'm open to alternatives, but let me summarize my rationale for working on
All this said, an alternative which I would find elegant enough would be to add an argument |
Two additional elements in favor of
|
OK, trying to clarify my rationale for objecting adding it to So indeed we have to distinguish the
You could indeed see it that way (unique "effective values"), but I would rather see it as the used "unique labels". If you look at it like that, it clearly pertains to the first .. In that sense, suppose we had a method like
This actually sounds like a nice alternative. |
or |
Good point. Still, it's recent and hence probably not very well known. And in principle less efficient for short indexes with multiple long levels. Anyway, I'm OK with changing this PR to |
seems reasonable. alternatively, |
e5a4635
to
3617e2a
Compare
I'm afraid it would be confusing (I would love to swap its name with |
pandas/core/indexes/multi.py
Outdated
@@ -945,6 +951,34 @@ def get_level_values(self, level): | |||
values = self._get_level_values(level) | |||
return values | |||
|
|||
def unique(self, level=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add the level=None
kw to Index.unique
as well (for compat. It should raise if its not None).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, would you mention it in the (Index
) docs?
pandas/tests/indexes/test_multi.py
Outdated
expected = Index(['foo', 'bar', 'baz', 'qux'], | ||
name='first') | ||
tm.assert_index_equal(result, expected) | ||
assert result.name == 'first' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is already done by assert_index_equal
pandas/tests/indexes/test_multi.py
Outdated
arrays = [['a', 'b', 'b'], [2, np.nan, 2]] | ||
index = pd.MultiIndex.from_arrays(arrays) | ||
values = index.unique(level=1) | ||
expected = np.array([2], dtype=np.int64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a normal index, the NaN is included in the uniques.
pandas/tests/indexes/test_multi.py
Outdated
index = pd.MultiIndex.from_arrays(arrays) | ||
values = index.unique(level=1) | ||
expected = np.array([2], dtype=np.int64) | ||
tm.assert_numpy_array_equal(values.values, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are you testing here the .values and not comparing with an Index object ?
pandas/core/indexes/multi.py
Outdated
unique : bool | ||
if True, drop duplicated values | ||
|
||
.. versionadded:: 0.21.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not needed for an internal method IMO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, I see that actually @jreback asked to add it above. But still, I don't think this should be added for internal private functions, as internal code should never have to care about the pandas version
pandas/core/indexes/multi.py
Outdated
|
||
Parameters | ||
---------- | ||
level : int, optional, defaults None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should you be able to both specify int or level name ?
a07e2d8
to
f0f7874
Compare
The first |
can you rebase and will take a look. |
pandas/core/indexes/base.py
Outdated
def unique(self): | ||
def unique(self, level=None): | ||
if level not in {0, self.name, None}: | ||
raise ValueError("Level {} not found".format(level)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the doc string for unique updated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, because you never replied to my question
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure what the question is
pandas/core/indexes/multi.py
Outdated
@@ -896,20 +896,24 @@ def _get_level_values(self, level): | |||
Parameters | |||
---------- | |||
level : int level | |||
unique : bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need a versionadded tag
pandas/tests/indexes/test_base.py
Outdated
@@ -1484,6 +1484,24 @@ def test_get_level_values(self): | |||
result = index_with_name.get_level_values('a') | |||
tm.assert_index_equal(result, index_with_name) | |||
|
|||
def test_unique(self): | |||
idx = pd.Index([2, 3, 2, 1], name='my_index') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add the issue number
pandas/tests/indexes/test_base.py
Outdated
def test_unique(self): | ||
idx = pd.Index([2, 3, 2, 1], name='my_index') | ||
expected = pd.Index([2, 3, 1], name='my_index') | ||
for level in 0, 'my_index', None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you integrete with the other tests of unique in indexes/common.py and/or tests/test_base.py
pandas/tests/indexes/test_multi.py
Outdated
@@ -2269,6 +2271,20 @@ def test_unique(self): | |||
exp = pd.MultiIndex.from_arrays([['a'], ['a']]) | |||
tm.assert_index_equal(res, exp) | |||
|
|||
# GH #17896 - with level= argument |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try with named level, try on an already unique level as well. ideally even more corner cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this a separate test and parametrize
f0f7874
to
284649a
Compare
doc/source/whatsnew/v0.22.0.txt
Outdated
@@ -100,7 +100,7 @@ Indexing | |||
|
|||
- Bug in :func:`PeriodIndex.truncate` which raises ``TypeError`` when ``PeriodIndex`` is monotonic (:issue:`17717`) | |||
- Bug in ``DataFrame.groupby`` where key as tuple in a ``MultiIndex`` were interpreted as a list of keys (:issue:`17979`) | |||
- | |||
- :func:`MultiIndex.unique` now supports the ``level=`` argument (:issue:`17896`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move to other enhancements
pandas/core/indexes/base.py
Outdated
def unique(self): | ||
def unique(self, level=None): | ||
if level not in {0, self.name, None}: | ||
raise ValueError("Level {} not found".format(level)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the doc string for unique updated?
pandas/core/indexes/multi.py
Outdated
@@ -896,20 +896,24 @@ def _get_level_values(self, level): | |||
Parameters | |||
---------- | |||
level : int level | |||
unique : bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the default value in the doc string
pandas/core/indexes/multi.py
Outdated
@@ -896,20 +896,24 @@ def _get_level_values(self, level): | |||
Parameters | |||
---------- | |||
level : int level | |||
unique : bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
u can remove the versionadded tag here
pandas/tests/indexes/common.py
Outdated
@@ -329,6 +329,25 @@ def test_duplicates(self, indices): | |||
assert not idx.is_unique | |||
assert idx.has_duplicates | |||
|
|||
def test_unique(self): | |||
# GH 17896 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
really like to use the indices fixture here to test all Index types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It assigns no name to the indexes... shall I change it so it does?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can try that
pandas/tests/indexes/test_multi.py
Outdated
@@ -2269,6 +2271,20 @@ def test_unique(self): | |||
exp = pd.MultiIndex.from_arrays([['a'], ['a']]) | |||
tm.assert_index_equal(res, exp) | |||
|
|||
# GH #17896 - with level= argument |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this a separate test and parametrize
pandas/tests/indexes/test_multi.py
Outdated
tm.assert_index_equal(result, expected) | ||
|
||
# With already unique level | ||
mi = pd.MultiIndex.from_arrays([[1, 3, 2, 4], [1, 1, 1, 2]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test with each level of the mi
pandas/tests/indexes/test_multi.py
Outdated
tm.assert_index_equal(result, expected) | ||
|
||
@pytest.mark.xfail(reason='GH 17924 (returns Int64Index with float data)') | ||
def test_unique_with_nans(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this test
103d192
to
d5f4324
Compare
The second commit changes the behavior of If instead the current behavior is preferred, that commit can be dropped, and I will just need to exclude the |
how did this come up? why do you think that the consistency should diverge here? |
My new test compares
I'm not in favor of breaking consistency (between |
d5f4324
to
861867c
Compare
The behavior of |
( @jreback : ping) |
Will take a look at this tomorrow! |
pandas/core/indexes/base.py
Outdated
level : int or str, optional, default None | ||
only return values from specified level (for MultiIndex) | ||
|
||
.. versionadded:: 0.21.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0.22.0
pandas/tests/indexes/common.py
Outdated
with tm.assert_raises_regex(ValueError, msg): | ||
indices.unique(level=level) | ||
|
||
def test_unique_na(self, indices): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are not using indices here, so don't add it as an argument
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good in general!
Added a bunch of minor comments
doc/source/whatsnew/v0.22.0.txt
Outdated
@@ -24,6 +24,7 @@ Other Enhancements | |||
|
|||
- Better support for :func:`Dataframe.style.to_excel` output with the ``xlsxwriter`` engine. (:issue:`16149`) | |||
- :func:`pandas.tseries.frequencies.to_offset` now accepts leading '+' signs e.g. '+1h'. (:issue:`18171`) | |||
- :func:`MultiIndex.unique` now supports the ``level=`` argument (:issue:`17896`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add briefly what it does, something like ".. to get the unique values of a single index level" ?
pandas/core/indexes/category.py
Outdated
@@ -361,7 +361,9 @@ def is_monotonic_decreasing(self): | |||
return Index(self.codes).is_monotonic_decreasing | |||
|
|||
@Appender(base._shared_docs['unique'] % _index_doc_kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be the shared docs of 'index_unique' ?
pandas/core/indexes/base.py
Outdated
Series.unique | ||
""") | ||
|
||
@Appender(base._shared_docs['index_unique'] % _index_doc_kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the _index_doc_kwargs
are doing something here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, you're right. But it was already the case, so I thought it was a choice of consistency ("you don't need to change this line if you change the docs making the interpolation meaningful"). Shall I remove it?
pandas/core/indexes/base.py
Outdated
@@ -3757,8 +3757,32 @@ def drop(self, labels, errors='raise'): | |||
indexer = indexer[~mask] | |||
return self.delete(indexer) | |||
|
|||
@Appender(base._shared_docs['unique'] % _index_doc_kwargs) | |||
def unique(self): | |||
base._shared_docs['index_unique'] = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be in _index_shared_docs
instead of base._shared_docs
? (because base
are the ones shared with series)
pandas/core/indexes/base.py
Outdated
Parameters | ||
---------- | ||
level : int or str, optional, default None | ||
only return values from specified level (for MultiIndex) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Start sentence with capital letter
pandas/core/indexes/multi.py
Outdated
@@ -943,6 +947,15 @@ def get_level_values(self, level): | |||
values = self._get_level_values(level) | |||
return values | |||
|
|||
@Appender(base._shared_docs['index_unique'] % _index_doc_kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
base._shared_docs -> index_shared_docs
pandas/core/indexes/base.py
Outdated
@Appender(base._shared_docs['index_unique'] % _index_doc_kwargs) | ||
def unique(self, level=None): | ||
if level not in {0, self.name, None}: | ||
raise ValueError("Level {} not found".format(level)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have _get_level_number
for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or rather _validate_index_level
which is used in _get_level_number
@@ -963,19 +963,21 @@ def test_get_level_values(self): | |||
exp = CategoricalIndex([1, 2, 3, 1, 2, 3]) | |||
tm.assert_index_equal(index.get_level_values(1), exp) | |||
|
|||
def test_get_level_values_na(self): | |||
@pytest.mark.xfail(reason='GH 17924 (returns Int64Index with float data)') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we xfail this? (I mean, why don't we keep asserting for now that it is float)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assertions compare a correctly built index (float64
data in Float64Index
) with an invalid index (float64
data in Int64Index
, due to #17924 ). Maybe I'm missing something, but it's not obvious to me what we would assert - plus, I see it as an added value to xfail
for a buggy behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
fbf9eff
to
337e942
Compare
needs a rebase |
337e942
to
50f199d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last minor comments, otherwise looks good!
pandas/core/indexes/category.py
Outdated
@Appender(_index_shared_docs['index_unique'] % _index_doc_kwargs) | ||
def unique(self, level=None): | ||
if level not in {0, self.name, None}: | ||
raise ValueError("Level {} not found".format(level)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use here _validate_index_level
as well?
@@ -963,19 +963,21 @@ def test_get_level_values(self): | |||
exp = CategoricalIndex([1, 2, 3, 1, 2, 3]) | |||
tm.assert_index_equal(index.get_level_values(1), exp) | |||
|
|||
def test_get_level_values_na(self): | |||
@pytest.mark.xfail(reason='GH 17924 (returns Int64Index with float data)') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
50f199d
to
feb65ed
Compare
ping |
Thanks! |
git diff master -u -- "*.py" | flake8 --diff