API: Add string extension type #27949

Merged
merged 59 commits into from
Oct 5, 2019
Changes from 23 commits
c24b5b6
API: Add string extension type
TomAugspurger Jul 31, 2019
3ecb5cc
test fixups
TomAugspurger Aug 16, 2019
59a7d39
string dtype
TomAugspurger Aug 16, 2019
7c07070
35 compat
TomAugspurger Aug 16, 2019
9e1a73b
doc
TomAugspurger Aug 16, 2019
16ccad8
fixups
TomAugspurger Aug 16, 2019
1027463
doc
TomAugspurger Aug 16, 2019
aafb53b
doc
TomAugspurger Aug 19, 2019
9cdfe2f
Merge remote-tracking branch 'upstream/master' into ea-string
TomAugspurger Aug 19, 2019
ab49169
fix doc warnings
TomAugspurger Aug 19, 2019
978fb55
fixup docstrings
TomAugspurger Aug 19, 2019
aebc688
fixup docstrings
TomAugspurger Aug 19, 2019
d90d0ad
Merge remote-tracking branch 'upstream/master' into ea-string
TomAugspurger Sep 9, 2019
41dc0f9
lint
TomAugspurger Sep 9, 2019
b783559
Merge remote-tracking branch 'upstream/master' into ea-string
TomAugspurger Sep 16, 2019
13cdddd
typing
TomAugspurger Sep 16, 2019
78c2eaa
removed double assert
TomAugspurger Sep 18, 2019
726d0af
experimental
TomAugspurger Sep 19, 2019
69d24e5
Merge remote-tracking branch 'upstream/master' into ea-string
TomAugspurger Sep 19, 2019
9cd9945
failing
TomAugspurger Sep 19, 2019
070fb76
xfails
TomAugspurger Sep 19, 2019
2b90639
Handle non-ndarray in add
TomAugspurger Sep 19, 2019
381c889
fixup
TomAugspurger Sep 19, 2019
bf82aad
fixup
TomAugspurger Sep 19, 2019
79bd87a
note
TomAugspurger Sep 19, 2019
2af8c81
Merge remote-tracking branch 'upstream/master' into ea-string
TomAugspurger Sep 23, 2019
fd24274
spacing
TomAugspurger Sep 23, 2019
0635ede
warning note
TomAugspurger Sep 23, 2019
d3311ee
update doc
TomAugspurger Sep 23, 2019
dce9258
doc updates
TomAugspurger Sep 23, 2019
0524f7e
update ctor
TomAugspurger Sep 23, 2019
292a8f3
clean up wrapping
TomAugspurger Sep 23, 2019
2c88e3b
clarify
TomAugspurger Sep 23, 2019
1b8c83a
reduce sum
TomAugspurger Sep 23, 2019
f1dad2a
skip reduce sum
TomAugspurger Sep 23, 2019
be95ecb
rename
TomAugspurger Sep 23, 2019
903ea2f
move
TomAugspurger Sep 23, 2019
0e1f479
missed
TomAugspurger Sep 23, 2019
c168ecf
missed
TomAugspurger Sep 23, 2019
d06ba73
fixup rename
TomAugspurger Sep 24, 2019
3ba27c3
fixup
TomAugspurger Sep 24, 2019
fe8ee77
doctest
TomAugspurger Sep 24, 2019
d9f63aa
updates
TomAugspurger Sep 24, 2019
d3c49e2
fixups
TomAugspurger Sep 24, 2019
dcb84f9
Merge remote-tracking branch 'upstream/master' into ea-string
TomAugspurger Sep 24, 2019
43b51cd
length check
TomAugspurger Sep 24, 2019
4fd2d11
unimplement sum
TomAugspurger Sep 24, 2019
713f807
Merge remote-tracking branch 'upstream/master' into ea-string
TomAugspurger Sep 26, 2019
777b295
Merge remote-tracking branch 'upstream/master' into ea-string
TomAugspurger Sep 30, 2019
8714a53
fixup
TomAugspurger Sep 30, 2019
41f234c
Merge remote-tracking branch 'upstream/master' into ea-string
TomAugspurger Oct 1, 2019
dc9ef3c
rename
TomAugspurger Oct 1, 2019
9419af2
rename
TomAugspurger Oct 1, 2019
462b29d
doc updates
TomAugspurger Oct 1, 2019
0391563
fixups
TomAugspurger Oct 1, 2019
129fe29
Merge remote-tracking branch 'upstream/master' into ea-string
TomAugspurger Oct 3, 2019
6aebd8c
move and perf
TomAugspurger Oct 4, 2019
2ee5e30
test is_string_dtype
TomAugspurger Oct 4, 2019
7e92cde
helper
TomAugspurger Oct 4, 2019
9 changes: 8 additions & 1 deletion doc/source/getting_started/basics.rst
@@ -1704,14 +1704,21 @@ built-in string methods. For example:

.. ipython:: python

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'],
dtype="string")
s.str.lower()

Powerful pattern-matching methods are provided as well, but note that
pattern-matching generally uses `regular expressions
<https://docs.python.org/3/library/re.html>`__ by default (and in some cases
always uses them).

.. note::

Prior to pandas 1.0, string methods were only available on ``object``-dtype
``Series``. Pandas 1.0 added :class:`StringDtype`, which is dedicated
to strings. See :ref:`text.types` for more.

Please see :ref:`Vectorized String Methods <text.string_methods>` for a complete
description.

26 changes: 25 additions & 1 deletion doc/source/reference/arrays.rst
@@ -24,6 +24,7 @@ Intervals :class:`IntervalDtype` :class:`Interval` :ref:`api.array
Nullable Integer :class:`Int64Dtype`, ... (none) :ref:`api.arrays.integer_na`
Categorical :class:`CategoricalDtype` (none) :ref:`api.arrays.categorical`
Sparse :class:`SparseDtype` (none) :ref:`api.arrays.sparse`
Text :class:`StringDtype` :class:`str` :ref:`api.arrays.string`
=================== ========================= ================== =============================

Pandas and third-party libraries can extend NumPy's type system (see :ref:`extending.extension-types`).
@@ -460,6 +461,29 @@ and methods if the :class:`Series` contains sparse values. See
:ref:`api.series.sparse` for more.


.. _api.arrays.string:

Text data
---------

When working with text data, where each valid element is a string, we recommend using
:class:`StringDtype` (with the alias ``"string"``).

.. autosummary::
:toctree: api/
:template: autosummary/class_without_autosummary.rst

arrays.StringArray

.. autosummary::
:toctree: api/
:template: autosummary/class_without_autosummary.rst

StringDtype

The ``Series.str`` accessor is available for ``Series`` backed by an :class:`arrays.StringArray`.
See :ref:`api.series.str` for more.


.. Dtype attributes which are manually listed in their docstrings: including
.. it here to make sure a docstring page is built for them
@@ -471,4 +495,4 @@ and methods if the :class:`Series` contains sparse values. See
DatetimeTZDtype.unit
DatetimeTZDtype.tz
PeriodDtype.freq
IntervalDtype.subtype
IntervalDtype.subtype
128 changes: 103 additions & 25 deletions doc/source/user_guide/text.rst
@@ -6,8 +6,66 @@
Working with text data
======================

.. _text.types:

Text Data Types
---------------

.. versionadded:: 1.0.0

There are two main ways to store text data:

1. An ``object``-dtype NumPy array.
2. An :class:`arrays.StringArray` extension type.

We recommend using :class:`arrays.StringArray` to store text data.

Prior to pandas 1.0, ``object`` dtype was the only option. This was unfortunate
for several reasons:

1. You can accidentally store a *mixture* of strings and non-strings in an
``object`` dtype array. It's better to have a dedicated dtype.
2. ``object`` dtype breaks dtype-specific operations like ``select_dtypes``.
There isn't a clear way to select *just* text while excluding non-text
but still object-dtype columns.
3. When reading code, the contents of an ``object`` dtype array are less clear
than those of a ``string`` dtype array.
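
The second point can be illustrated with a small sketch. The frame and its column names below are hypothetical, and the dtype check at the end is only one possible way to pick out string columns, not necessarily the recommended one.

```python
import pandas as pd

# Hypothetical frame: two object-dtype columns, only one of which is pure text.
df = pd.DataFrame({
    "text": pd.Series(["a", "b", "c"], dtype=object),
    "mixed": pd.Series(["x", 1, 2.5], dtype=object),
    "number": [1, 2, 3],
})

# Both columns are object dtype, so selecting on "object" cannot tell them apart.
object_columns = df.select_dtypes(include="object").columns.tolist()

# After an explicit conversion the text column carries a dedicated dtype.
df["text"] = df["text"].astype("string")
string_columns = [
    name for name, dtype in df.dtypes.items()
    if isinstance(dtype, pd.StringDtype)
]
```

Here ``object_columns`` contains both ``text`` and ``mixed``, while ``string_columns`` contains only ``text``.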


.. warning::

StringArray is currently considered experimental.

For backwards compatibility, ``object`` dtype remains the default dtype
inferred for a list of strings:

.. ipython:: python

pd.Series(['a', 'b', 'c'])

To explicitly request ``string`` dtype, specify the ``dtype``:

.. ipython:: python

pd.Series(['a', 'b', 'c'], dtype="string")
pd.Series(['a', 'b', 'c'], dtype=pd.StringDtype())
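
As a quick check that the two spellings are interchangeable, the following sketch compares the resulting dtypes. The variable names are illustrative only.

```python
import pandas as pd

by_alias = pd.Series(["a", "b", "c"], dtype="string")
by_instance = pd.Series(["a", "b", "c"], dtype=pd.StringDtype())

# Both spellings should yield the same extension dtype.
same_dtype = by_alias.dtype == by_instance.dtype
```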

Or use ``astype`` after the ``Series`` or ``DataFrame`` is created:
Member

Not sure of the convention; should Series and DataFrame be ":class:Foo" in this context?

Contributor Author

Don't think we have a formal policy. I vaguely recall a discussion somewhere about doing it ~once per paragraph?


.. ipython:: python

s = pd.Series(['a', 'b', 'c'])
s
s.astype("string")

Everything that follows in the rest of this document applies equally to
``string`` and ``object`` dtype.
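
A minimal sketch of that equivalence, using made-up values: the same ``.str`` call is applied to an ``object`` Series and to its ``string`` counterpart, and the non-missing results are compared.

```python
import pandas as pd

values = ["A", "Aaba", None, "dog"]
as_object = pd.Series(values, dtype=object)
as_string = as_object.astype("string")

# The same accessor call on both; only the result dtype and NA marker differ.
lower_object = as_object.str.lower()
lower_string = as_string.str.lower()

# Compare the non-missing results element by element.
same_values = lower_object.dropna().tolist() == lower_string.dropna().tolist()
# Missing values stay missing in both cases.
same_missing = lower_object.isna().tolist() == lower_string.isna().tolist()
```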

.. _text.string_methods:

String Methods
--------------

Series and Index are equipped with a set of string processing methods
that make it easy to operate on each element of the array. Perhaps most
importantly, these methods exclude missing/NA values automatically. These are
@@ -16,7 +74,8 @@ the equivalent (scalar) built-in string methods:

.. ipython:: python

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'],
dtype="string")
s.str.lower()
s.str.upper()
s.str.len()
@@ -90,7 +149,7 @@ Methods like ``split`` return a Series of lists:

.. ipython:: python

s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'])
s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'], dtype="string")
s2.str.split('_')

Elements in the split lists can be accessed using ``get`` or ``[]`` notation:
@@ -106,6 +165,9 @@ It is easy to expand this to return a DataFrame using ``expand``.

s2.str.split('_', expand=True)

When the original ``Series`` has :class:`StringDtype`, the output columns will all
be :class:`StringDtype` as well.
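
The dtype propagation can be checked with a short sketch like the one below, which reuses the ``s2`` values from above. ``wide`` and ``column_dtypes`` are illustrative names.

```python
import pandas as pd

s2 = pd.Series(["a_b_c", "c_d_e", None, "f_g_h"], dtype="string")
wide = s2.str.split("_", expand=True)

# Collect the dtype of each output column for inspection.
column_dtypes = [str(dtype) for dtype in wide.dtypes]
```

The row that was missing in the input is missing in every output column.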

It is also possible to limit the number of splits:

.. ipython:: python
@@ -125,7 +187,8 @@ i.e., from the end of the string to the beginning of the string:
.. ipython:: python

s3 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca',
'', np.nan, 'CABA', 'dog', 'cat'])
'', np.nan, 'CABA', 'dog', 'cat'],
dtype="string")
s3
s3.str.replace('^.a|dog', 'XX-XX ', case=False)

@@ -136,7 +199,7 @@ following code will cause trouble because of the regular expression meaning of
.. ipython:: python

# Consider the following badly formatted financial data
dollars = pd.Series(['12', '-$10', '$10,000'])
dollars = pd.Series(['12', '-$10', '$10,000'], dtype="string")

# This does what you'd naively expect:
dollars.str.replace('$', '')
@@ -174,15 +237,17 @@ positional argument (a regex object) and return a string.
def repl(m):
return m.group(0)[::-1]

pd.Series(['foo 123', 'bar baz', np.nan]).str.replace(pat, repl)
pd.Series(['foo 123', 'bar baz', np.nan],
dtype="string").str.replace(pat, repl)

# Using regex groups
pat = r"(?P<one>\w+) (?P<two>\w+) (?P<three>\w+)"

def repl(m):
return m.group('two').swapcase()

pd.Series(['Foo Bar Baz', np.nan]).str.replace(pat, repl)
pd.Series(['Foo Bar Baz', np.nan],
dtype="string").str.replace(pat, repl)
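
A standalone variant of the callable replacement is sketched below. The helper ``reverse_match`` and the sample values are hypothetical. ``regex=True`` is passed explicitly because a callable replacement needs regex mode, and the default for ``regex`` is not the same in every pandas release.

```python
import pandas as pd


def reverse_match(match):
    """Return the matched text reversed."""
    return match.group(0)[::-1]


words = pd.Series(["foo 123", "bar baz", None], dtype="string")

# Each run of lowercase letters is passed to the callable as a match object.
reversed_words = words.str.replace(r"[a-z]+", reverse_match, regex=True)
```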

.. versionadded:: 0.20.0

@@ -221,7 +286,7 @@ The content of a ``Series`` (or ``Index``) can be concatenated:

.. ipython:: python

s = pd.Series(['a', 'b', 'c', 'd'])
s = pd.Series(['a', 'b', 'c', 'd'], dtype="string")
s.str.cat(sep=',')

If not specified, the keyword ``sep`` for the separator defaults to the empty string, ``sep=''``:
@@ -234,7 +299,7 @@ By default, missing values are ignored. Using ``na_rep``, they can be given a re

.. ipython:: python

t = pd.Series(['a', 'b', np.nan, 'd'])
t = pd.Series(['a', 'b', np.nan, 'd'], dtype="string")
t.str.cat(sep=',')
t.str.cat(sep=',', na_rep='-')
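
For reference, the two calls can be compared side by side in a self-contained sketch. ``skipped`` and ``filled`` are illustrative names.

```python
import pandas as pd

t = pd.Series(["a", "b", None, "d"], dtype="string")

skipped = t.str.cat(sep=",")             # the missing value is dropped
filled = t.str.cat(sep=",", na_rep="-")  # the missing value becomes "-"
```

With these inputs ``skipped`` is ``"a,b,d"`` and ``filled`` is ``"a,b,-,d"``.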

@@ -279,7 +344,8 @@ the ``join``-keyword.
.. ipython:: python
:okwarning:

u = pd.Series(['b', 'd', 'a', 'c'], index=[1, 3, 0, 2])
u = pd.Series(['b', 'd', 'a', 'c'], index=[1, 3, 0, 2],
dtype="string")
s
u
s.str.cat(u)
@@ -295,7 +361,8 @@ In particular, alignment also means that the different lengths do not need to co

.. ipython:: python

v = pd.Series(['z', 'a', 'b', 'd', 'e'], index=[-1, 0, 1, 3, 4])
v = pd.Series(['z', 'a', 'b', 'd', 'e'], index=[-1, 0, 1, 3, 4],
dtype="string")
s
v
s.str.cat(v, join='left', na_rep='-')
@@ -351,7 +418,8 @@ of the string, the result will be a ``NaN``.
.. ipython:: python

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan,
'CABA', 'dog', 'cat'])
'CABA', 'dog', 'cat'],
dtype="string")

s.str[0]
s.str[1]
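
A shorter sketch of the out-of-range behaviour, with made-up values: indexing past the end of a string yields a missing value rather than an error.

```python
import pandas as pd

s = pd.Series(["A", "Aaba", None, "dog"], dtype="string")

first = s.str[0]   # every non-missing string has a first character
second = s.str[1]  # "A" has no second character, so that slot is missing
```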
@@ -382,7 +450,8 @@ DataFrame with one column per group.

.. ipython:: python

pd.Series(['a1', 'b2', 'c3']).str.extract(r'([ab])(\d)', expand=False)
pd.Series(['a1', 'b2', 'c3'],
dtype="string").str.extract(r'([ab])(\d)', expand=False)

Elements that do not match return a row filled with ``NaN``. Thus, a
Series of messy strings can be "converted" into a like-indexed Series
@@ -395,14 +464,16 @@ Named groups like

.. ipython:: python

pd.Series(['a1', 'b2', 'c3']).str.extract(r'(?P<letter>[ab])(?P<digit>\d)',
expand=False)
pd.Series(['a1', 'b2', 'c3'],
dtype="string").str.extract(r'(?P<letter>[ab])(?P<digit>\d)',
expand=False)

and optional groups like

.. ipython:: python

pd.Series(['a1', 'b2', '3']).str.extract(r'([ab])?(\d)', expand=False)
pd.Series(['a1', 'b2', '3'],
dtype="string").str.extract(r'([ab])?(\d)', expand=False)

can also be used. Note that any capture group names in the regular
expression will be used for column names; otherwise capture group
@@ -413,20 +484,23 @@ with one column if ``expand=True``.

.. ipython:: python

pd.Series(['a1', 'b2', 'c3']).str.extract(r'[ab](\d)', expand=True)
pd.Series(['a1', 'b2', 'c3'],
dtype="string").str.extract(r'[ab](\d)', expand=True)

It returns a Series if ``expand=False``.

.. ipython:: python

pd.Series(['a1', 'b2', 'c3']).str.extract(r'[ab](\d)', expand=False)
pd.Series(['a1', 'b2', 'c3'],
dtype="string").str.extract(r'[ab](\d)', expand=False)
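
The difference between the two ``expand`` settings for a single capture group can be checked directly. This is a small sketch; ``as_frame`` and ``as_series`` are illustrative names.

```python
import pandas as pd

s = pd.Series(["a1", "b2", "c3"], dtype="string")

as_frame = s.str.extract(r"[ab](\d)", expand=True)    # one-column DataFrame
as_series = s.str.extract(r"[ab](\d)", expand=False)  # Series
```

In both cases the last element is missing, because ``"c3"`` does not match the pattern.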

Calling on an ``Index`` with a regex with exactly one capture group
returns a ``DataFrame`` with one column if ``expand=True``.

.. ipython:: python

s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"])
s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"],
dtype="string")
s
s.index.str.extract("(?P<letter>[a-zA-Z])", expand=True)

@@ -471,7 +545,8 @@ Unlike ``extract`` (which returns only the first match),

.. ipython:: python

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])
s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"],
dtype="string")
s
two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'
s.str.extract(two_groups, expand=True)
@@ -489,7 +564,7 @@ When each subject string in the Series has exactly one match,

.. ipython:: python

s = pd.Series(['a3', 'b3', 'c2'])
s = pd.Series(['a3', 'b3', 'c2'], dtype="string")
s

then ``extractall(pat).xs(0, level='match')`` gives the same result as
@@ -510,7 +585,7 @@ same result as a ``Series.str.extractall`` with a default index (starts from 0).

pd.Index(["a1a2", "b1", "c1"]).str.extractall(two_groups)

pd.Series(["a1a2", "b1", "c1"]).str.extractall(two_groups)
pd.Series(["a1a2", "b1", "c1"], dtype="string").str.extractall(two_groups)
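
To make the relationship between ``extract`` and ``extractall`` concrete, here is a hedged, self-contained sketch. The variable names ``first_only``, ``every_match`` and ``first_from_all`` are illustrative.

```python
import pandas as pd

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"], dtype="string")
two_groups = r"(?P<letter>[a-z])(?P<digit>[0-9])"

first_only = s.str.extract(two_groups, expand=True)  # one row per subject
every_match = s.str.extractall(two_groups)           # one row per match

# Selecting match number 0 recovers the rows that ``extract`` returns.
first_from_all = every_match.xs(0, level="match")
```

``every_match`` has four rows here, since ``"a1a2"`` contributes two matches, and its index gains a second level named ``match``.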


Testing for Strings that match or contain a pattern
@@ -521,13 +596,15 @@ You can check whether elements contain a pattern:
.. ipython:: python

pattern = r'[0-9][a-z]'
pd.Series(['1', '2', '3a', '3b', '03c']).str.contains(pattern)
pd.Series(['1', '2', '3a', '3b', '03c'],
dtype="string").str.contains(pattern)

Or whether elements match a pattern:

.. ipython:: python

pd.Series(['1', '2', '3a', '3b', '03c']).str.match(pattern)
pd.Series(['1', '2', '3a', '3b', '03c'],
dtype="string").str.match(pattern)

The distinction between ``match`` and ``contains`` is strictness: ``match``
relies on strict ``re.match``, while ``contains`` relies on ``re.search``.
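
The practical effect shows up on ``"03c"``, where the pattern occurs but not at the start of the string. A small sketch with illustrative variable names:

```python
import pandas as pd

pattern = r"[0-9][a-z]"
codes = pd.Series(["1", "2", "3a", "3b", "03c"], dtype="string")

found_anywhere = codes.str.contains(pattern)  # behaves like re.search
found_at_start = codes.str.match(pattern)     # behaves like re.match
```

``"03c"`` is ``True`` for ``contains`` but ``False`` for ``match``.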
@@ -537,7 +614,8 @@ an extra ``na`` argument so missing values can be considered True or False:

.. ipython:: python

s4 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s4 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'],
dtype="string")
s4.str.contains('A', na=False)
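
The effect of ``na`` is easiest to see by comparing the result with and without it. This sketch uses a shortened, made-up series.

```python
import pandas as pd

s4 = pd.Series(["A", "Aaba", None, "dog"], dtype="string")

default_result = s4.str.contains("A")           # missing input stays missing
filled_result = s4.str.contains("A", na=False)  # missing input counts as False
```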

.. _text.indicator:
@@ -550,7 +628,7 @@ For example if they are separated by a ``'|'``:

.. ipython:: python

s = pd.Series(['a', 'a|b', np.nan, 'a|c'])
s = pd.Series(['a', 'a|b', np.nan, 'a|c'], dtype="string")
s.str.get_dummies(sep='|')
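
The shape of the result can be inspected with a sketch like this. The ``rows`` variable is illustrative, and the cast to ``int`` is only there to make the values easy to read regardless of the indicator dtype.

```python
import pandas as pd

s = pd.Series(["a", "a|b", None, "a|c"], dtype="string")
dummies = s.str.get_dummies(sep="|")

# Normalise to plain ints so the values are easy to inspect.
rows = dummies.astype(int).values.tolist()
```

Each distinct token becomes a column, and the missing row is all zeros.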

String ``Index`` also supports ``get_dummies`` which returns a ``MultiIndex``.