API: change default behaviour of str.match from deprecated extract to match (GH5224) #15257

Closed
12 changes: 0 additions & 12 deletions doc/source/text.rst
@@ -385,18 +385,6 @@ or match a pattern:
 The distinction between ``match`` and ``contains`` is strictness: ``match``
 relies on strict ``re.match``, while ``contains`` relies on ``re.search``.

-.. warning::
-
-   In previous versions, ``match`` was for *extracting* groups,
-   returning a not-so-convenient Series of tuples. The new method ``extract``
-   (described in the previous section) is now preferred.
-
-   This old, deprecated behavior of ``match`` is still the default. As
-   demonstrated above, use the new behavior by setting ``as_indexer=True``.
-   In this mode, ``match`` is analogous to ``contains``, returning a boolean
-   Series. The new behavior will become the default behavior in a future
-   release.
-
 Methods like ``match``, ``contains``, ``startswith``, and ``endswith`` take
 an extra ``na`` argument so missing values can be considered True or False:
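
The strictness distinction described in this hunk can be illustrated without pandas. The sketch below uses only the standard-library `re` module, which the documentation says the two methods rely on; the sample strings and variable names are assumptions chosen for illustration.

```python
import re

pattern = re.compile(r"BAD")
values = ["fooBAD", "BADfoo", "foo"]  # assumed sample data

# re.match only succeeds if the pattern matches at the start of the string
# (the behaviour ``str.match`` is documented to rely on).
matched = [bool(pattern.match(v)) for v in values]

# re.search scans the whole string for a match
# (the behaviour ``str.contains`` is documented to rely on).
contained = [bool(pattern.search(v)) for v in values]

print(matched)    # [False, True, False]
print(contained)  # [True, True, False]
```

Only "BADfoo" satisfies the stricter start-anchored check, while both strings containing "BAD" satisfy the search.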

5 changes: 5 additions & 0 deletions doc/source/whatsnew/v0.20.0.txt
@@ -729,6 +729,11 @@ Other API Changes
 - ``Series.sort_values()`` accepts a one element list of bool for consistency with the behavior of ``DataFrame.sort_values()`` (:issue:`15604`)
 - ``.merge()`` and ``.join()`` on ``category`` dtype columns will now preserve the category dtype when possible (:issue:`10409`)
 - ``SparseDataFrame.default_fill_value`` will be 0, previously was ``nan`` in the return from ``pd.get_dummies(..., sparse=True)`` (:issue:`15594`)
+- The default behaviour of ``Series.str.match`` has changed from extracting
+  groups to matching the pattern. The extracting behaviour was deprecated
+  since pandas version 0.13.0 and can be done with the ``Series.str.extract``
+  method (:issue:`5224`).
+

 .. _whatsnew_0200.deprecations:

57 changes: 15 additions & 42 deletions pandas/core/strings.py
@@ -464,11 +464,9 @@ def rep(x, r):
         return result


-def str_match(arr, pat, case=True, flags=0, na=np.nan, as_indexer=False):
+def str_match(arr, pat, case=True, flags=0, na=np.nan, as_indexer=None):
Contributor

Shouldn't you take this arg out?

Member Author

No, this keyword needs to stay, because it was how people could specify the 'new' behaviour before (although we said we would change this in 0.14, we never did).
So everyone still using match is probably specifying this keyword, AFAIU.

See the removed warning from the documentation in the diff for some context.

In principle we could make it a FutureWarning instead of a UserWarning, so we can remove it later on.

Contributor

OK, this should have been changed a long time ago. There is no reason to keep a dead API around.

And change it to a FutureWarning; we can remove it in the next major version.

Member Author

"no reason to keep a dead API around."

To be clear, this is not a dead API. Although it is ignored after this PR, everybody using this function uses that keyword.
So I certainly won't raise (a FutureWarning is fine, and probably better than a UserWarning anyway).

Contributor

Well, it's going to be removed, so it should for sure use FutureWarning. UserWarning is pretty useless as a warning IMHO (not that FutureWarning is much better, but at least it signals that we are going to remove it).
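
To make the warning-category discussion above concrete, here is a minimal sketch of a deprecated keyword that emits a `FutureWarning`, and of how a caller could record it or escalate it to an error with the standard `warnings` module. The function `match_flag` and its message are hypothetical stand-ins, not pandas code.

```python
import warnings


def match_flag(value, as_indexer=None):
    """Hypothetical stand-in for a function with a deprecated keyword."""
    if as_indexer is not None:
        # FutureWarning signals that the keyword is scheduled for removal.
        warnings.warn("'as_indexer' is deprecated and will be ignored.",
                      FutureWarning, stacklevel=2)
    return bool(value)


# Record the warning instead of printing it.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = match_flag("x", as_indexer=True)
categories = [w.category for w in caught]

# A caller who wants to find leftover uses can turn the warning into an error.
with warnings.catch_warnings():
    warnings.simplefilter("error", FutureWarning)
    try:
        match_flag("x", as_indexer=True)
        raised = False
    except FutureWarning:
        raised = True

print(result, categories, raised)
```

Filtering by category is what makes the choice matter in practice: code that still passes the keyword can be located by escalating only `FutureWarning`, without touching unrelated `UserWarning`s.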

     """
-    Deprecated: Find groups in each string in the Series/Index
-    using passed regular expression.
-    If as_indexer=True, determine if each string matches a regular expression.
+    Determine if each string matches a regular expression.

     Parameters
     ----------
@@ -479,60 +477,35 @@ def str_match(arr, pat, case=True, flags=0, na=np.nan, as_indexer=False):
     flags : int, default 0 (no flags)
         re module flags, e.g. re.IGNORECASE
     na : default NaN, fill value for missing values.
-    as_indexer : False, by default, gives deprecated behavior better achieved
-        using str_extract. True return boolean indexer.

     Returns
     -------
     Series/array of boolean values
-        if as_indexer=True
-    Series/Index of tuples
-        if as_indexer=False, default but deprecated

     See Also
     --------
     contains : analogous, but less strict, relying on re.search instead of
         re.match
-    extract : now preferred to the deprecated usage of match (as_indexer=False)
+    extract : extract matched groups

-    Notes
-    -----
-    To extract matched groups, which is the deprecated behavior of match, use
-    str.extract.
     """

     if not case:
         flags |= re.IGNORECASE

     regex = re.compile(pat, flags=flags)

-    if (not as_indexer) and regex.groups > 0:
-        # Do this first, to make sure it happens even if the re.compile
-        # raises below.
-        warnings.warn("In future versions of pandas, match will change to"
-                      " always return a bool indexer.", FutureWarning,
-                      stacklevel=3)
-
-    if as_indexer and regex.groups > 0:
-        warnings.warn("This pattern has match groups. To actually get the"
-                      " groups, use str.extract.", UserWarning, stacklevel=3)
+    if (as_indexer is False) and (regex.groups > 0):
Contributor

Why aren't you taking this out?

+        raise ValueError("as_indexer=False with a pattern with groups is no "
+                         "longer supported. Use '.str.extract(pat)' instead")
+    elif as_indexer is not None:
+        # Previously, this keyword was used for changing the default but
+        # deprecated behaviour. This keyword is now no longer needed.
+        warnings.warn("'as_indexer' keyword was specified but will be ignored;"
+                      " match now returns a boolean indexer by default.",
Contributor

This should for sure be a FutureWarning. To be honest, I would just raise; there is really no reason to continue supporting this. But if you want to keep it for one more cycle, that's OK too.

+                      UserWarning, stacklevel=3)

-    # If not as_indexer and regex.groups == 0, this returns empty lists
-    # and is basically useless, so we will not warn.
-
-    if (not as_indexer) and regex.groups > 0:
-        dtype = object
-
-        def f(x):
-            m = regex.match(x)
-            if m:
-                return m.groups()
-            else:
-                return []
-    else:
-        # This is the new behavior of str_match.
-        dtype = bool
-        f = lambda x: bool(regex.match(x))
+    dtype = bool
+    f = lambda x: bool(regex.match(x))

     return _na_map(f, arr, na, dtype=dtype)

@@ -1587,7 +1560,7 @@ def contains(self, pat, case=True, flags=0, na=np.nan, regex=True):
         return self._wrap_result(result)

     @copy(str_match)
-    def match(self, pat, case=True, flags=0, na=np.nan, as_indexer=False):
+    def match(self, pat, case=True, flags=0, na=np.nan, as_indexer=None):
         result = str_match(self._data, pat, case=case, flags=flags, na=na,
                            as_indexer=as_indexer)
         return self._wrap_result(result)
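
The control flow that this file's diff introduces can be sketched as a standalone, list-based function. This is a simplified illustration rather than the pandas implementation: the name `match_strings`, the use of `None` as the default `na` fill value, and the `isinstance` check standing in for `_na_map` are all assumptions.

```python
import re
import warnings


def match_strings(values, pat, case=True, flags=0, na=None, as_indexer=None):
    """List-based sketch of the new matching logic (hypothetical helper)."""
    if not case:
        flags |= re.IGNORECASE
    regex = re.compile(pat, flags=flags)

    if as_indexer is False and regex.groups > 0:
        # The old group-extracting mode is no longer available.
        raise ValueError("as_indexer=False with a pattern with groups is no "
                         "longer supported; extract the groups instead")
    elif as_indexer is not None:
        # The keyword is accepted for compatibility but has no effect.
        warnings.warn("'as_indexer' was specified but will be ignored",
                      UserWarning, stacklevel=2)

    # Non-string entries are treated as missing and replaced by `na`.
    return [bool(regex.match(v)) if isinstance(v, str) else na
            for v in values]


values = ["fooBAD__barBAD", None, "foo"]
result = match_strings(values, r".*(BAD[_]+).*(BAD)")
print(result)  # [True, None, False]

try:
    match_strings(values, r".*(BAD[_]+).*(BAD)", as_indexer=False)
    error_raised = False
except ValueError:
    error_raised = True
print(error_raised)  # True
```

Checking `as_indexer is False` rather than `not as_indexer` is what lets the default of `None` pass through silently while an explicit `False` with capture groups fails loudly.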
63 changes: 22 additions & 41 deletions pandas/tests/test_strings.py
@@ -559,64 +559,44 @@ def test_repeat(self):
         exp = Series([u('a'), u('bb'), NA, u('cccc'), NA, u('dddddd')])
         tm.assert_series_equal(result, exp)

-    def test_deprecated_match(self):
-        # Old match behavior, deprecated (but still default) in 0.13
+    def test_match(self):
+        # New match behavior introduced in 0.13
         values = Series(['fooBAD__barBAD', NA, 'foo'])
-
-        with tm.assert_produces_warning():
-            result = values.str.match('.*(BAD[_]+).*(BAD)')
-        exp = Series([('BAD__', 'BAD'), NA, []])
-        tm.assert_series_equal(result, exp)
-
-        # mixed
-        mixed = Series(['aBAD_BAD', NA, 'BAD_b_BAD', True, datetime.today(),
-                        'foo', None, 1, 2.])
-
-        with tm.assert_produces_warning():
-            rs = Series(mixed).str.match('.*(BAD[_]+).*(BAD)')
-        xp = Series([('BAD_', 'BAD'), NA, ('BAD_', 'BAD'),
-                     NA, NA, [], NA, NA, NA])
-        tm.assertIsInstance(rs, Series)
-        tm.assert_series_equal(rs, xp)
-
-        # unicode
-        values = Series([u('fooBAD__barBAD'), NA, u('foo')])
-
-        with tm.assert_produces_warning():
-            result = values.str.match('.*(BAD[_]+).*(BAD)')
-        exp = Series([(u('BAD__'), u('BAD')), NA, []])
+        result = values.str.match('.*(BAD[_]+).*(BAD)')
+        exp = Series([True, NA, False])
         tm.assert_series_equal(result, exp)

-    def test_match(self):
-        # New match behavior introduced in 0.13
         values = Series(['fooBAD__barBAD', NA, 'foo'])
-        with tm.assert_produces_warning():
-            result = values.str.match('.*(BAD[_]+).*(BAD)', as_indexer=True)
+        result = values.str.match('.*BAD[_]+.*BAD')
         exp = Series([True, NA, False])
         tm.assert_series_equal(result, exp)

-        # If no groups, use new behavior even when as_indexer is False.
-        # (Old behavior is pretty much useless in this case.)
+        # test passing as_indexer still works but is ignored
         values = Series(['fooBAD__barBAD', NA, 'foo'])
-        result = values.str.match('.*BAD[_]+.*BAD', as_indexer=False)
         exp = Series([True, NA, False])
+        with tm.assert_produces_warning(UserWarning):
+            result = values.str.match('.*BAD[_]+.*BAD', as_indexer=True)
+        tm.assert_series_equal(result, exp)
+        with tm.assert_produces_warning(UserWarning):
+            result = values.str.match('.*BAD[_]+.*BAD', as_indexer=False)
+        tm.assert_series_equal(result, exp)
+        with tm.assert_produces_warning(UserWarning):
+            result = values.str.match('.*(BAD[_]+).*(BAD)', as_indexer=True)
         tm.assert_series_equal(result, exp)
+        self.assertRaises(ValueError, values.str.match, '.*(BAD[_]+).*(BAD)',
+                          as_indexer=False)

         # mixed
         mixed = Series(['aBAD_BAD', NA, 'BAD_b_BAD', True, datetime.today(),
                         'foo', None, 1, 2.])
-
-        with tm.assert_produces_warning():
-            rs = Series(mixed).str.match('.*(BAD[_]+).*(BAD)', as_indexer=True)
+        rs = Series(mixed).str.match('.*(BAD[_]+).*(BAD)')
         xp = Series([True, NA, True, NA, NA, False, NA, NA, NA])
         tm.assertIsInstance(rs, Series)
         tm.assert_series_equal(rs, xp)

         # unicode
         values = Series([u('fooBAD__barBAD'), NA, u('foo')])
-
-        with tm.assert_produces_warning():
-            result = values.str.match('.*(BAD[_]+).*(BAD)', as_indexer=True)
+        result = values.str.match('.*(BAD[_]+).*(BAD)')
         exp = Series([True, NA, False])
         tm.assert_series_equal(result, exp)

@@ -2610,10 +2590,11 @@ def test_match_findall_flags(self):

         pat = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

-        with tm.assert_produces_warning(FutureWarning):
-            result = data.str.match(pat, flags=re.IGNORECASE)
+        result = data.str.extract(pat, flags=re.IGNORECASE, expand=True)
+        self.assertEqual(result.iloc[0].tolist(), ['dave', 'google', 'com'])

-        self.assertEqual(result[0], ('dave', 'google', 'com'))
+        result = data.str.match(pat, flags=re.IGNORECASE)
+        self.assertEqual(result[0], True)

         result = data.str.findall(pat, flags=re.IGNORECASE)
         self.assertEqual(result[0][0], ('dave', 'google', 'com'))
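
The three assertions in this test hunk (extracted groups, a boolean match, and `findall` tuples) can be reproduced on a single string with the standard `re` module. The pattern is copied from the test; the sample address is an assumption, since the test's `data` fixture is defined outside the hunk.

```python
import re

pat = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
address = "dave@google.com"  # assumed sample value

m = re.match(pat, address, flags=re.IGNORECASE)

groups = list(m.groups())   # what group extraction yields for this element
is_match = bool(m)          # what the new boolean match yields
found = re.findall(pat, address, flags=re.IGNORECASE)

print(groups)    # ['dave', 'google', 'com']
print(is_match)  # True
print(found)     # [('dave', 'google', 'com')]
```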