-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optionally disallow duplicate labels #28394
Conversation
I think this is ready for an initial review, to decide if this direction / behavior is OK. I'd recommend starting with the new docs section for the behavior, and cc @jorisvandenbossche @jreback @jbrockmendel @jschendel @datapythonista @ others I'm sure. |
This looks very useful, thanks for the work on this. I can't think of a better API, so happy with your proposal. Just wondering if in the future we may want |
Yeah I think that's a possibility. But yes let's leave it for later :) |
doc/source/user_guide/duplicates.rst
Outdated
or column labels. This may be a bit confusing at first. If you're familiar with | ||
SQL, you know that row labels are similar to a primary key on a table, and you | ||
would never want duplicates in a SQL table. But one of pandas' roles is to clean | ||
mess, real-world data before it goes to some downstream system. And real-world |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mess --> messy?
I'm +1 on this. A couple passing comments:
One thing I noticed is that this doesn't appear to validate when you directly assign a new index: In [2]: s = pd.Series(list('abc'), allow_duplicate_labels=False)
In [3]: s.index = [0, 0, 0]
In [4]: s
Out[4]:
0 a
0 b
0 c
dtype: object
In [5]: s.allows_duplicate_labels
Out[5]: False |
Once a kwarg is in the Series constructor it can be difficult to get it out (see: fastpath). Could we put this in stealth mode for a while with something like:
Not a strong opinion, just thinking out loud. |
Just out of curiosity why do you think adding this as a keyword argument to the Series / DataFrame constructor is the right approach? Would that subsequently guard against doing something like |
I share this concern. Do others? After using it for a few days I'm getting used to it, but still need to stop and think.
Yes. My backwards-compatible plan here is to have the setter allow
Code-wise, it's not much effort to support. It just makes the logic a bit more complicated, so I'm holding off for now.
Do we anticipate needing to remove this? Certainly, it can be marked as experimental if there's trepidation.
Initially, I was thinking it'd be a property on Index. I think that's more natural, since it's making a statement about the Index. But users interact more with Series / DataFrame, so I think it's more ergonomic to place it there. After implementing things, putting it on NDFrame is nice because we have |
I don't anticipate needing to remove this, but I do anticipate other conceptually similar flags being implemented. In that scenario, I don't like the idea of all of these becoming kwargs in the Series constructor. |
This aids in the implementation of pandas-dev#28394. Over there, I'm having issues with using `NDFrame.__finalize__` to copy attributes, in part because getattribute on NDFrame is so complicated. This simplifies this because we only need to look in NDFrame.attrs, which is just a plain dictionary. Aside from the addition of a public NDFrame.attrs dictionary, there aren't any user-facing API changes.
9cc17b7
to
05e238d
Compare
commit 67a3263 Author: Tom Augspurger <tom.w.augspurger@gmail.com> Date: Fri Oct 18 08:05:04 2019 -0500 fixup name commit e6183cd Author: Tom Augspurger <tom.w.augspurger@gmail.com> Date: Fri Oct 18 07:05:33 2019 -0500 fixup Index.name commit d1826bb Author: Tom Augspurger <tom.w.augspurger@gmail.com> Date: Thu Oct 17 13:45:30 2019 -0500 REF: Store metadata in attrs dict This aids in the implementation of pandas-dev#28394. Over there, I'm having issues with using `NDFrame.__finalize__` to copy attributes, in part because getattribute on NDFrame is so complicated. This simplifies this because we only need to look in NDFrame.attrs, which is just a plain dictionary. Aside from the addition of a public NDFrame.attrs dictionary, there aren't any user-facing API changes.
5b7662f
to
d771b80
Compare
I'm seeing some strange mypy issues with these changes. Lots of
L206 is In fact, I'm seeing these mypy errors with just the following diff from master diff --git a/pandas/core/frame.py b/pandas/core/frame.py
index ef4e3e064d..7445951536 100644
--- a/pandas/core/frame.py
+++ b/pandas/core/frame.py
@@ -407,6 +407,7 @@ class DataFrame(NDFrame):
columns: Optional[Axes] = None,
dtype: Optional[Dtype] = None,
copy: bool = False,
+ allow_duplicate_labels: bool = True,
):
if data is None:
data = {}
diff --git a/pandas/core/generic.py b/pandas/core/generic.py
index d59ce8db9b..64a38ca4eb 100644
--- a/pandas/core/generic.py
+++ b/pandas/core/generic.py
@@ -207,6 +207,7 @@ class NDFrame(PandasObject, SelectionMixin):
dtype: Optional[Dtype] = None,
attrs: Optional[Mapping[Optional[Hashable], Any]] = None,
fastpath: bool = False,
+ allow_duplicate_labels: bool = True,
):
if not fastpath:
diff --git a/pandas/core/series.py b/pandas/core/series.py
index 3e9d3d5c04..8d6beb42c4 100644
--- a/pandas/core/series.py
+++ b/pandas/core/series.py
@@ -203,7 +203,7 @@ class Series(base.IndexOpsMixin, generic.NDFrame):
# Constructors
def __init__(
- self, data=None, index=None, dtype=None, name=None, copy=False, fastpath=False
+ self, data=None, index=None, dtype=None, name=None, copy=False, fastpath=False, allow_duplicate_labels=True,
):
# we are called internally, so short-circuit Any ideas what's going wrong (perhaps @simonjayhawkins or @WillAyd have a guess)? |
What version of mypy / pytest are you using? Seems strange for that to pop up if that is your only diff |
Working around a strange typing issue. See pandas-dev#28394 (comment) for more, but the types on these were being inferred incorrectly by mypy with just the addition of the `allows_duplicate_labels` kwarg.
pytest 5.2.1 Indeed, it's quite strange. Opened #29205 to add explicit types to the tests that were failing. |
on master
with the changes in this PR..
|
same inferred type with mypy 0.730 and 0.740 |
@TomAugspurger status of this |
Still interested in working on it, but not happening for for 1.0.
…On Wed, Jan 1, 2020 at 11:21 AM Jeff Reback ***@***.***> wrote:
@TomAugspurger <https://github.com/TomAugspurger> status of this
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#28394?email_source=notifications&email_token=AAKAOIWMFJXLSIRR6U3C5D3Q3TGKRA5CNFSM4IVZNIOKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEH5JALQ#issuecomment-570069038>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAKAOISNENLHU2JZMPFPUC3Q3TGKRANCNFSM4IVZNIOA>
.
|
Just the windows s3 failures, which I think that #35856 is tracking. |
CI is green now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice updates
a few inline questions
- can u ensure that this can survive a round trip pickle (flags)
- worth separating out index and columns flags for duplicates? eg i almost always care about column duplicates but may want to allow row dupes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a couple of inline comments
can u ensure this survives a round trip pickle (flags)
- worth having separate flags for index and columns ? eg disallow column but allow for rows i would say is common
the reason to do this now is that it would be very hard to change later
i think my comments got duped as internet not great |
Possibly. The current system is future-compatible with that though, |
Added the pickle test. |
All green with the pickle test now. I've also added tests that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
couple of minor comments and questions. looks good
pandas/core/indexes/base.py
Outdated
@@ -483,6 +483,26 @@ def _simple_new(cls, values, name: Label = None): | |||
def _constructor(self): | |||
return type(self) | |||
|
|||
def _maybe_check_unique(self): | |||
if not self.is_unique: | |||
# TODO: position, value, not too large. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah maybe an arg to show the first 10 duplicates
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's truncated, following our usual repr's settings. I'll remove the todo
In [3]: s = pd.Series(1, index=[0] * 200).set_flags(allows_duplicate_labels=False)
---------------------------------------------------------------------------
DuplicateLabelError Traceback (most recent call last)
<ipython-input-3-ece63ed32110> in <module>
----> 1 s = pd.Series(1, index=[0] * 200).set_flags(allows_duplicate_labels=False)
~/sandbox/pandas/pandas/core/generic.py in set_flags(self, copy, allows_duplicate_labels)
349 df = self.copy(deep=copy)
350 if allows_duplicate_labels is not None:
--> 351 df.flags["allows_duplicate_labels"] = allows_duplicate_labels
352 return df
353
~/sandbox/pandas/pandas/core/flags.py in __setitem__(self, key, value)
103 if key not in self._keys:
104 raise ValueError(f"Unknown flag {key}. Must be one of {self._keys}")
--> 105 setattr(self, key, value)
106
107 def __repr__(self):
~/sandbox/pandas/pandas/core/flags.py in allows_duplicate_labels(self, value)
90 if not value:
91 for ax in obj.axes:
---> 92 ax._maybe_check_unique()
93
94 self._allows_duplicate_labels = value
~/sandbox/pandas/pandas/core/indexes/base.py in _maybe_check_unique(self)
491 msg += "\n{}".format(duplicates)
492
--> 493 raise DuplicateLabelError(msg)
494
495 def _format_duplicate_message(self):
DuplicateLabelError: Index has duplicates.
positions
label
0 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
In [10]: s = pd.Series(1, index=np.arange(100).repeat(10)).set_flags(allows_duplicate_labels=False)
---------------------------------------------------------------------------
DuplicateLabelError Traceback (most recent call last)
<ipython-input-10-452fca37a730> in <module>
----> 1 s = pd.Series(1, index=np.arange(100).repeat(10)).set_flags(allows_duplicate_labels=False)
~/sandbox/pandas/pandas/core/generic.py in set_flags(self, copy, allows_duplicate_labels)
349 df = self.copy(deep=copy)
350 if allows_duplicate_labels is not None:
--> 351 df.flags["allows_duplicate_labels"] = allows_duplicate_labels
352 return df
353
~/sandbox/pandas/pandas/core/flags.py in __setitem__(self, key, value)
103 if key not in self._keys:
104 raise ValueError(f"Unknown flag {key}. Must be one of {self._keys}")
--> 105 setattr(self, key, value)
106
107 def __repr__(self):
~/sandbox/pandas/pandas/core/flags.py in allows_duplicate_labels(self, value)
90 if not value:
91 for ax in obj.axes:
---> 92 ax._maybe_check_unique()
93
94 self._allows_duplicate_labels = value
~/sandbox/pandas/pandas/core/indexes/base.py in _maybe_check_unique(self)
491 msg += "\n{}".format(duplicates)
492
--> 493 raise DuplicateLabelError(msg)
494
495 def _format_duplicate_message(self):
DuplicateLabelError: Index has duplicates.
positions
label
0 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
1 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
2 [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
3 [30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
4 [40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
... ...
95 [950, 951, 952, 953, 954, 955, 956, 957, 958, ...
96 [960, 961, 962, 963, 964, 965, 966, 967, 968, ...
97 [970, 971, 972, 973, 974, 975, 976, 977, 978, ...
98 [980, 981, 982, 983, 984, 985, 986, 987, 988, ...
99 [990, 991, 992, 993, 994, 995, 996, 997, 998, ...
[100 rows x 1 columns]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. thanks @TomAugspurger
Thanks!
…On Wed, Sep 2, 2020 at 10:00 PM Jeff Reback ***@***.***> wrote:
Merged #28394 <#28394> into
master.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#28394 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAKAOIV6VNVMWLXV2KW75ZLSD4BD5ANCNFSM4IVZNIOA>
.
|
This adds a property to NDFrame to disallow duplicate labels. This fixes a vexing issue with using pandas for ETL pipelines, where accidentally introducing duplicate labels can lead to confusing downstream behavior (e.g.
NDFrame.__getitem__
not reducing dimensionality).When set (via the construction with
allow_duplicate_labels=False
or afterward via.allows_duplicate_labels=False
), the presence of duplicate labels causes a DuplicateLabelError exception to be raised:This property is preserved through operations (using
_metadata
and__finalize__
).API design questions
I dislike that in combination with
.allows_duplicates
, since we'd have a property and a method that only differ by ans
. Perhaps something likeBut do people know that "allows_duplicates" is considered metadata?
TODO:
__finalize__
(we may be changing when we call it, and what we pass)Apologies for using the PR for discussion, but we needed a way to compare this to #28334.