Optionally disallow duplicate labels (pandas-dev#28394)

kesmit13 · Nov 2, 2020 · 4b03ee6 · 4b03ee6
1 parent da91aa5
commit 4b03ee6
Show file tree

Hide file tree

Showing 24 changed files with 1,227 additions and 10 deletions.
diff --git a/doc/source/reference/frame.rst b/doc/source/reference/frame.rst
@@ -37,6 +37,7 @@ Attributes and underlying data
    DataFrame.shape
    DataFrame.memory_usage
    DataFrame.empty
+   DataFrame.set_flags
 
 Conversion
 ~~~~~~~~~~
@@ -276,6 +277,21 @@ Time Series-related
    DataFrame.tz_convert
    DataFrame.tz_localize
 
+.. _api.frame.flags:
+
+Flags
+~~~~~
+
+Flags refer to attributes of the pandas object. Properties of the dataset (like
+the date is was recorded, the URL it was accessed from, etc.) should be stored
+in :attr:`DataFrame.attrs`.
+
+.. autosummary::
+   :toctree: api/
+
+   Flags
+
+
 .. _api.frame.metadata:
 
 Metadata

diff --git a/doc/source/reference/general_utility_functions.rst b/doc/source/reference/general_utility_functions.rst
@@ -37,6 +37,7 @@ Exceptions and warnings
 
    errors.AccessorRegistrationWarning
    errors.DtypeWarning
+   errors.DuplicateLabelError
    errors.EmptyDataError
    errors.InvalidIndexError
    errors.MergeError

diff --git a/doc/source/reference/series.rst b/doc/source/reference/series.rst
@@ -39,6 +39,8 @@ Attributes
    Series.empty
    Series.dtypes
    Series.name
+   Series.flags
+   Series.set_flags
 
 Conversion
 ----------
@@ -527,6 +529,19 @@ Sparse-dtype specific methods and attributes are provided under the
    Series.sparse.from_coo
    Series.sparse.to_coo
 
+.. _api.series.flags:
+
+Flags
+~~~~~
+
+Flags refer to attributes of the pandas object. Properties of the dataset (like
+the date is was recorded, the URL it was accessed from, etc.) should be stored
+in :attr:`Series.attrs`.
+
+.. autosummary::
+   :toctree: api/
+
+   Flags
 
 .. _api.series.metadata:
 

diff --git a/doc/source/user_guide/duplicates.rst b/doc/source/user_guide/duplicates.rst
@@ -0,0 +1,210 @@
+.. _duplicates:
+
+****************
+Duplicate Labels
+****************
+
+:class:`Index` objects are not required to be unique; you can have duplicate row
+or column labels. This may be a bit confusing at first. If you're familiar with
+SQL, you know that row labels are similar to a primary key on a table, and you
+would never want duplicates in a SQL table. But one of pandas' roles is to clean
+messy, real-world data before it goes to some downstream system. And real-world
+data has duplicates, even in fields that are supposed to be unique.
+
+This section describes how duplicate labels change the behavior of certain
+operations, and how prevent duplicates from arising during operations, or to
+detect them if they do.
+
+.. ipython:: python
+
+   import pandas as pd
+   import numpy as np
+
+Consequences of Duplicate Labels
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Some pandas methods (:meth:`Series.reindex` for example) just don't work with
+duplicates present. The output can't be determined, and so pandas raises.
+
+.. ipython:: python
+   :okexcept:
+
+   s1 = pd.Series([0, 1, 2], index=['a', 'b', 'b'])
+   s1.reindex(['a', 'b', 'c'])
+
+Other methods, like indexing, can give very surprising results. Typically
+indexing with a scalar will *reduce dimensionality*. Slicing a ``DataFrame``
+with a scalar will return a ``Series``. Slicing a ``Series`` with a scalar will
+return a scalar. But with duplicates, this isn't the case.
+
+.. ipython:: python
+
+   df1 = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=['A', 'A', 'B'])
+   df1
+
+We have duplicates in the columns. If we slice ``'B'``, we get back a ``Series``
+
+.. ipython:: python
+
+   df1['B']  # a series
+
+But slicing ``'A'`` returns a ``DataFrame``
+
+
+.. ipython:: python
+
+   df1['A']  # a DataFrame
+
+This applies to row labels as well
+
+.. ipython:: python
+
+   df2 = pd.DataFrame({"A": [0, 1, 2]}, index=['a', 'a', 'b'])
+   df2
+   df2.loc['b', 'A']  # a scalar
+   df2.loc['a', 'A']  # a Series
+
+Duplicate Label Detection
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You can check whether an :class:`Index` (storing the row or column labels) is
+unique with :attr:`Index.is_unique`:
+
+.. ipython:: python
+
+   df2
+   df2.index.is_unique
+   df2.columns.is_unique
+
+.. note::
+
+   Checking whether an index is unique is somewhat expensive for large datasets.
+   Pandas does cache this result, so re-checking on the same index is very fast.
+
+:meth:`Index.duplicated` will return a boolean ndarray indicating whether a
+label is repeated.
+
+.. ipython:: python
+
+   df2.index.duplicated()
+
+Which can be used as a boolean filter to drop duplicate rows.
+
+.. ipython:: python
+
+   df2.loc[~df2.index.duplicated(), :]
+
+If you need additional logic to handle duplicate labels, rather than just
+dropping the repeats, using :meth:`~DataFrame.groupby` on the index is a common
+trick. For example, we'll resolve duplicates by taking the average of all rows
+with the same label.
+
+.. ipython:: python
+
+   df2.groupby(level=0).mean()
+
+.. _duplicates.disallow:
+
+Disallowing Duplicate Labels
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. versionadded:: 1.2.0
+
+As noted above, handling duplicates is an important feature when reading in raw
+data. That said, you may want to avoid introducing duplicates as part of a data
+processing pipeline (from methods like :meth:`pandas.concat`,
+:meth:`~DataFrame.rename`, etc.). Both :class:`Series` and :class:`DataFrame`
+*disallow* duplicate labels by calling ``.set_flags(allows_duplicate_labels=False)``.
+(the default is to allow them). If there are duplicate labels, an exception
+will be raised.
+
+.. ipython:: python
+   :okexcept:
+
+   pd.Series(
+       [0, 1, 2],
+       index=['a', 'b', 'b']
+   ).set_flags(allows_duplicate_labels=False)
+
+This applies to both row and column labels for a :class:`DataFrame`
+
+.. ipython:: python
+   :okexcept:
+
+   pd.DataFrame(
+       [[0, 1, 2], [3, 4, 5]], columns=["A", "B", "C"],
+   ).set_flags(allows_duplicate_labels=False)
+
+This attribute can be checked or set with :attr:`~DataFrame.flags.allows_duplicate_labels`,
+which indicates whether that object can have duplicate labels.
+
+.. ipython:: python
+
+   df = (
+       pd.DataFrame({"A": [0, 1, 2, 3]},
+                    index=['x', 'y', 'X', 'Y'])
+         .set_flags(allows_duplicate_labels=False)
+   )
+   df
+   df.flags.allows_duplicate_labels
+
+:meth:`DataFrame.set_flags` can be used to return a new ``DataFrame`` with attributes
+like ``allows_duplicate_labels`` set to some value
+
+.. ipython:: python
+
+   df2 = df.set_flags(allows_duplicate_labels=True)
+   df2.flags.allows_duplicate_labels
+
+The new ``DataFrame`` returned is a view on the same data as the old ``DataFrame``.
+Or the property can just be set directly on the same object
+
+
+.. ipython:: python
+
+   df2.flags.allows_duplicate_labels = False
+   df2.flags.allows_duplicate_labels
+
+When processing raw, messy data you might initially read in the messy data
+(which potentially has duplicate labels), deduplicate, and then disallow duplicates
+going forward, to ensure that your data pipeline doesn't introduce duplicates.
+
+
+.. code-block:: python
+
+   >>> raw = pd.read_csv("...")
+   >>> deduplicated = raw.groupby(level=0).first()  # remove duplicates
+   >>> deduplicated.flags.allows_duplicate_labels = False  # disallow going forward
+
+Setting ``allows_duplicate_labels=True`` on a ``Series`` or ``DataFrame`` with duplicate
+labels or performing an operation that introduces duplicate labels on a ``Series`` or
+``DataFrame`` that disallows duplicates will raise an
+:class:`errors.DuplicateLabelError`.
+
+.. ipython:: python
+   :okexcept:
+
+   df.rename(str.upper)
+
+This error message contains the labels that are duplicated, and the numeric positions
+of all the duplicates (including the "original") in the ``Series`` or ``DataFrame``
+
+Duplicate Label Propagation
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In general, disallowing duplicates is "sticky". It's preserved through
+operations.
+
+.. ipython:: python
+   :okexcept:
+
+   s1 = pd.Series(0, index=['a', 'b']).set_flags(allows_duplicate_labels=False)
+   s1
+   s1.head().rename({"a": "b"})
+
+.. warning::
+
+   This is an experimental feature. Currently, many methods fail to
+   propagate the ``allows_duplicate_labels`` value. In future versions
+   it is expected that every method taking or returning one or more
+   DataFrame or Series objects will propagate ``allows_duplicate_labels``.
diff --git a/doc/source/user_guide/index.rst b/doc/source/user_guide/index.rst
@@ -33,6 +33,7 @@ Further information on any specific method can be obtained in the
     reshaping
     text
     missing_data
+    duplicates
     categorical
     integer_na
     boolean

diff --git a/doc/source/whatsnew/v1.2.0.rst b/doc/source/whatsnew/v1.2.0.rst
@@ -13,6 +13,53 @@ including other versions of pandas.
 Enhancements
 ~~~~~~~~~~~~
 
+.. _whatsnew_120.duplicate_labels:
+
+Optionally disallow duplicate labels
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:class:`Series` and :class:`DataFrame` can now be created with ``allows_duplicate_labels=False`` flag to
+control whether the index or columns can contain duplicate labels (:issue:`28394`). This can be used to
+prevent accidental introduction of duplicate labels, which can affect downstream operations.
+
+By default, duplicates continue to be allowed
+
+.. ipython:: python
+
+   pd.Series([1, 2], index=['a', 'a'])
+
+.. ipython:: python
+   :okexcept:
+
+   pd.Series([1, 2], index=['a', 'a']).set_flags(allows_duplicate_labels=False)
+
+Pandas will propagate the ``allows_duplicate_labels`` property through many operations.
+
+.. ipython:: python
+   :okexcept:
+
+   a = (
+       pd.Series([1, 2], index=['a', 'b'])
+         .set_flags(allows_duplicate_labels=False)
+   )
+   a
+   # An operation introducing duplicates
+   a.reindex(['a', 'b', 'a'])
+
+.. warning::
+
+   This is an experimental feature. Currently, many methods fail to
+   propagate the ``allows_duplicate_labels`` value. In future versions
+   it is expected that every method taking or returning one or more
+   DataFrame or Series objects will propagate ``allows_duplicate_labels``.
+
+See :ref:`duplicates` for more.
+
+The ``allows_duplicate_labels`` flag is stored in the new :attr:`DataFrame.flags`
+attribute. This stores global attributes that apply to the *pandas object*. This
+differs from :attr:`DataFrame.attrs`, which stores information that applies to
+the dataset.
+
 Passing arguments to fsspec backends
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
@@ -53,6 +100,8 @@ For example:
 
 Other enhancements
 ^^^^^^^^^^^^^^^^^^
+
+- Added :meth:`~DataFrame.set_flags` for setting table-wide flags on a ``Series`` or ``DataFrame`` (:issue:`28394`)
 - :class:`Index` with object dtype supports division and multiplication (:issue:`34160`)
 -
 -

diff --git a/pandas/__init__.py b/pandas/__init__.py
@@ -100,6 +100,7 @@
     to_datetime,
     to_timedelta,
     # misc
+    Flags,
     Grouper,
     factorize,
     unique,