API: Series.str-accessor infers dtype (and Index.str does not raise on all-NA) #23167

h-vetinari · 2018-10-15T19:13:46Z

closes #23163
closes #23011
closes #23551
~~#23555 / #23556~~

tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Several times while working around str.cat (most recently #22725) @jreback and @WillAyd mentioned that the .str accessor should not work for (e.g.) bytes data. So far, the constructor of the StringMethods object didn't infer for Series -- this PR adds that.

What needs disussion is the following: I've commented out two tests about an encode-decode-roundtrip that cannot work when forbidding bytes data. Should such an encode-decode-cycle explicitly part of the desired functionality for .str (currently there are .str.encode and .str.decode methods), or not? I guess that even if not, this would need a deprecation cycle.

Alternatively, it'd be possible to allow construction of the StringMethods object but raise on inferred_dtype='bytes' for all but the str.decode method.

Finally, it would be possible to partially close #13877 by also excluding 'mixed-integer' from the allowed inferred types, ~~but for floats, the inference-code would need to be able to distinguish between actual floats and missing values (currently, both of those return inferred_type='mixed')~~

pep8speaks · 2018-10-15T19:13:51Z

Hello @h-vetinari! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-05-30 15:17:50 UTC

h-vetinari · 2018-10-15T19:58:42Z

Hm, turns out the .str-on-bytes is used in several more places. Reverting that part for now.

codecov · 2018-10-16T23:25:47Z

Codecov Report

Merging #23167 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #23167      +/-   ##
==========================================
- Coverage   91.84%   91.84%   -0.01%     
==========================================
  Files         174      174              
  Lines       50643    50682      +39     
==========================================
+ Hits        46515    46550      +35     
- Misses       4128     4132       +4

Flag	Coverage Δ
#multiple	`90.38% <100%> (ø)`	⬆️
#single	`41.74% <95.52%> (-0.03%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/strings.py	`98.92% <100%> (+0.05%)`	⬆️
pandas/io/gbq.py	`78.94% <0%> (-10.53%)`	⬇️
pandas/core/frame.py	`97% <0%> (-0.12%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a60d1bd...f62e344. Read the comment docs.

jreback · 2018-10-17T01:20:16Z

doc/source/whatsnew/v0.24.0.txt

@@ -208,6 +208,8 @@ Backwards incompatible API changes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 - A newly constructed empty :class:`DataFrame` with integer as the ``dtype`` will now only be cast to ``float64`` if ``index`` is specified (:issue:`22858`)
+- The `.str`-accessor does not work for anything other than string data anymore. Any other uses (like ``bytes``, for example)


can you reword a bit. something like

The `.str`-accessor will raise on any non-(string or unicode) ``object`` dtype data. Previously this would accept ``bytes`` for example.

I know that this is how you would love to see the .str-accessor work (and it's not like my sentence is better).

The problem is that the .str-accessor has effectively been an .object-accessor for a very long time, and is used as such - even pandas-internally, like .str.join for lists or str.decode for bytes.

Also things like calculating the lengths of a Series of lists/sets is very common - almost 10k views at this SO question.

So it will probably be a longer process to move away from that (and would IMO need some suitable replacements).

ok, then i think we maybe need to just show a warning for non-string/unicode uses. I guess this would generalize to calling anything-other-than-decode on bytes as well (though should raise on that).

So what i'll suggest is that we do an infer pass at the very beginning, rejecting the obvious things, then pass that inferred string around so that the methods can decide what to do, and have several canned responsed, eg.. a standard warning, or an error (for the bytes things).

you may actually want to create a class object around which has the pointers to the data and the inferred state. this is similar to how we handled infer_dtype dispatching.

pandas/core/strings.py

jreback · 2018-10-17T01:24:19Z

pandas/core/strings.py

+                raise AttributeError('Can only use .str accessor with Index, '
+                                     'not MultiIndex')
+
+        # see src/inference.pyx which can contain string values


empty is only ok if its object dtype

jreback · 2018-10-17T01:24:29Z

pandas/core/strings.py

+        # see src/inference.pyx which can contain string values
+        allowed_types = ['string', 'unicode', 'empty',
+                         'mixed', 'mixed-integer']
+        if isinstance(data, ABCSeries):


why does this accept bytes?

See OP:

What needs disussion is the following: I've commented out two tests about an encode-decode-roundtrip that cannot work when forbidding bytes data. Should such an encode-decode-cycle explicitly part of the desired functionality for .str (currently there are .str.encode and .str.decode methods), or not? I guess that even if not, this would need a deprecation cycle.

I originally commented out the only tests in test_strings that needed bytes and completely forbade bytes, but then there were lots of test errors, see e.g. https://travis-ci.org/pandas-dev/pandas/jobs/441825774
Alternatively, it'd be possible to allow construction of the StringMethods object but raise on inferred_dtype='bytes' for all but the str.decode method.

jreback · 2018-10-17T01:24:51Z

pandas/core/strings.py

+        if isinstance(data, ABCSeries):
+            allowed_types = allowed_types + ['bytes']
+
+        values = data if isinstance(data, Index) else data.values


getattr(data, 'values', data)

jreback · 2018-10-17T01:25:35Z

pandas/core/strings.py

+        else:
+            inf_type = lib.infer_dtype(values)
+
+        all_na_obj = is_object_dtype(values.dtype) and data.isna().all()


this is by defintion what 'empty' means, you only need to check if its object type in that case

Unfortunately, that's not True. For all-NA Series/Index of object dtype, the inference still returns 'floating'.

>>> import pandas as pd >>> import numpy as np >>> import pandas._libs.lib as lib >>> lib.infer_dtype(pd.Index([np.nan, np.nan])) 'floating' >>> lib.infer_dtype(pd.Series([np.nan, np.nan])) 'floating' >>> lib.infer_dtype(pd.Series([])) 'floating' >>> lib.infer_dtype(pd.Index([])) 'empty' >>> lib.infer_dtype(pd.Index([np.nan, np.nan], dtype=object)) 'floating' >>> lib.infer_dtype(pd.Series([np.nan, np.nan], dtype=object)) 'floating' >>> lib.infer_dtype(pd.Series([], dtype=object)) 'empty' >>> lib.infer_dtype(pd.Index([], dtype=object)) 'empty'

ok, still separating these out to separate if/elif clauses makes this easier to grok.

This isn't necessary anymore if I use .dropna. Then the only 'mixed' are really mixed (and not caused by e.g. np.nan)

jreback · 2018-10-17T01:26:36Z

pandas/core/strings.py

+
+        values = data if isinstance(data, Index) else data.values
+        if is_categorical_dtype(data.dtype):
+            inf_type = lib.infer_dtype(values.categories)


pls pls don't use abbreviation. inferred_type is the name for this

This came from the existing code. Will change

jreback · 2018-10-17T01:27:46Z

pandas/core/strings.py

+        all_na_obj = is_object_dtype(values.dtype) and data.isna().all()
+
+        # same for Series and Index (that is not MultiIndex)
+        if inf_type not in allowed_types and not all_na_obj:


rather than doing this here, I would for example have an if block that handles the empty (and checks if its object_dtype, then you immediately know if its ok or not)
then pass thru string and unicode. anything else is a raise.

Ok in principle, but it's definitely necessary to allow 'mixed':

>>> lib.infer_dtype(pd.Series(['a', np.nan])) 'mixed'

Disallowing 'mixed-integer' would also break existing code, but I mentioned in the OP:

Finally, it would be possible to partially close #13877 by also excluding 'mixed-integer' from the allowed inferred types.

Check out the example in #13877 - currently the .str-methods are generally permissive about just returning NaN if they can't compute individual values. Arguably, that's also the most desirable behaviour for things like .str.startswith. Could introduce an 'errors'-kwarg where necessary, but all of that needs to happen on method level - the .str-constructor must allow these to pass.

same comment as above, we need a very clear path where each case can be handled independently if need be.

I could infer on data.dropna(), then the answer would be much clearer, I think.

I'd actually like to save the result of all this type inferring in an attribute of the StringMethods object (like ._inferred_type) so that it can be reused by the methods.

Independently, we could also (at a later stage) add the errors-kwarg to .str itself, and the methods then inherit it.

jreback · 2018-10-17T01:29:04Z

pandas/core/strings.py

@@ -2274,7 +2274,8 @@ def cat(self, others=None, sep=None, na_rep=None, join=None):
            result = cat_core(all_cols, sep)

        if isinstance(self._orig, Index):
-            result = Index(result, name=self._orig.name)
+            # add dtype for case that result is all-NA
+            result = Index(result, dtype='object', name=self._orig.name)
        else:  # Series
            result = Series(result, index=data.index, name=self._orig.name)


prob need dtype=object (you don't need quotes FYI)

pandas/tests/test_strings.py

h-vetinari

Thanks for the review. There are still several open questions on how to proceed.

pandas/core/strings.py

h-vetinari · 2018-10-17T06:17:03Z

pandas/core/strings.py

+                raise AttributeError('Can only use .str accessor with Index, '
+                                     'not MultiIndex')
+
+        # see src/inference.pyx which can contain string values


h-vetinari · 2018-10-17T06:20:28Z

pandas/core/strings.py

+        # see src/inference.pyx which can contain string values
+        allowed_types = ['string', 'unicode', 'empty',
+                         'mixed', 'mixed-integer']
+        if isinstance(data, ABCSeries):


See OP:

What needs disussion is the following: I've commented out two tests about an encode-decode-roundtrip that cannot work when forbidding bytes data. Should such an encode-decode-cycle explicitly part of the desired functionality for .str (currently there are .str.encode and .str.decode methods), or not? I guess that even if not, this would need a deprecation cycle.

I originally commented out the only tests in test_strings that needed bytes and completely forbade bytes, but then there were lots of test errors, see e.g. https://travis-ci.org/pandas-dev/pandas/jobs/441825774
Alternatively, it'd be possible to allow construction of the StringMethods object but raise on inferred_dtype='bytes' for all but the str.decode method.

h-vetinari · 2018-10-17T06:21:02Z

pandas/core/strings.py

+
+        values = data if isinstance(data, Index) else data.values
+        if is_categorical_dtype(data.dtype):
+            inf_type = lib.infer_dtype(values.categories)


This came from the existing code. Will change

h-vetinari · 2018-10-17T06:25:32Z

pandas/core/strings.py

+        else:
+            inf_type = lib.infer_dtype(values)
+
+        all_na_obj = is_object_dtype(values.dtype) and data.isna().all()


Unfortunately, that's not True. For all-NA Series/Index of object dtype, the inference still returns 'floating'.

>>> import pandas as pd >>> import numpy as np >>> import pandas._libs.lib as lib >>> lib.infer_dtype(pd.Index([np.nan, np.nan])) 'floating' >>> lib.infer_dtype(pd.Series([np.nan, np.nan])) 'floating' >>> lib.infer_dtype(pd.Series([])) 'floating' >>> lib.infer_dtype(pd.Index([])) 'empty' >>> lib.infer_dtype(pd.Index([np.nan, np.nan], dtype=object)) 'floating' >>> lib.infer_dtype(pd.Series([np.nan, np.nan], dtype=object)) 'floating' >>> lib.infer_dtype(pd.Series([], dtype=object)) 'empty' >>> lib.infer_dtype(pd.Index([], dtype=object)) 'empty'

h-vetinari · 2018-10-17T06:28:14Z

pandas/core/strings.py

+        all_na_obj = is_object_dtype(values.dtype) and data.isna().all()
+
+        # same for Series and Index (that is not MultiIndex)
+        if inf_type not in allowed_types and not all_na_obj:


Ok in principle, but it's definitely necessary to allow 'mixed':

>>> lib.infer_dtype(pd.Series(['a', np.nan])) 'mixed'

Disallowing 'mixed-integer' would also break existing code, but I mentioned in the OP:

Finally, it would be possible to partially close #13877 by also excluding 'mixed-integer' from the allowed inferred types.

Check out the example in #13877 - currently the .str-methods are generally permissive about just returning NaN if they can't compute individual values. Arguably, that's also the most desirable behaviour for things like .str.startswith. Could introduce an 'errors'-kwarg where necessary, but all of that needs to happen on method level - the .str-constructor must allow these to pass.

h-vetinari

Thanks for review; test additions will follow

h-vetinari · 2018-10-17T21:50:53Z

doc/source/whatsnew/v0.24.0.txt

@@ -208,6 +208,8 @@ Backwards incompatible API changes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 - A newly constructed empty :class:`DataFrame` with integer as the ``dtype`` will now only be cast to ``float64`` if ``index`` is specified (:issue:`22858`)
+- The `.str`-accessor does not work for anything other than string data anymore. Any other uses (like ``bytes``, for example)


I know that this is how you would love to see the .str-accessor work (and it's not like my sentence is better).

The problem is that the .str-accessor has effectively been an .object-accessor for a very long time, and is used as such - even pandas-internally, like .str.join for lists or str.decode for bytes.

Also things like calculating the lengths of a Series of lists/sets is very common - almost 10k views at this SO question.

So it will probably be a longer process to move away from that (and would IMO need some suitable replacements).

pandas/tests/test_strings.py

h-vetinari · 2018-10-17T22:21:08Z

pandas/core/strings.py

+        else:
+            inf_type = lib.infer_dtype(values)
+
+        all_na_obj = is_object_dtype(values.dtype) and data.isna().all()


This isn't necessary anymore if I use .dropna. Then the only 'mixed' are really mixed (and not caused by e.g. np.nan)

jreback · 2018-10-23T03:26:09Z

doc/source/whatsnew/v0.24.0.txt

@@ -208,6 +208,8 @@ Backwards incompatible API changes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 - A newly constructed empty :class:`DataFrame` with integer as the ``dtype`` will now only be cast to ``float64`` if ``index`` is specified (:issue:`22858`)
+- The `.str`-accessor does not work for anything other than string data anymore. Any other uses (like ``bytes``, for example)


ok, then i think we maybe need to just show a warning for non-string/unicode uses. I guess this would generalize to calling anything-other-than-decode on bytes as well (though should raise on that).

So what i'll suggest is that we do an infer pass at the very beginning, rejecting the obvious things, then pass that inferred string around so that the methods can decide what to do, and have several canned responsed, eg.. a standard warning, or an error (for the bytes things).

jreback · 2018-10-23T03:29:55Z

doc/source/whatsnew/v0.24.0.txt

@@ -208,6 +208,8 @@ Backwards incompatible API changes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 - A newly constructed empty :class:`DataFrame` with integer as the ``dtype`` will now only be cast to ``float64`` if ``index`` is specified (:issue:`22858`)
+- The `.str`-accessor does not work for anything other than string data anymore. Any other uses (like ``bytes``, for example)


you may actually want to create a class object around which has the pointers to the data and the inferred state. this is similar to how we handled infer_dtype dispatching.

h-vetinari · 2018-10-30T21:15:20Z

@jreback

So what i'll suggest is that we do an infer pass at the very beginning, rejecting the obvious things, then pass that inferred string around so that the methods can decide what to do, and have several canned responsed, eg.. a standard warning, or an error (for the bytes things).

I implemented just that for the bytes case now. PTAL

Side note: I needed to use #23422 already to be able to use the type inference.

jreback · 2018-11-06T02:39:52Z

pandas/core/strings.py

+
+        values = getattr(data, 'values', data)  # Series / Index
+        values = getattr(values, 'categories', values)  # categorical / normal
+        # missing values mess up type inference -> skip


blank line

'mess up' -> obfuscate

jreback · 2018-11-06T02:45:19Z

pandas/core/strings.py

@@ -2400,13 +2411,17 @@ def cat(self, others=None, sep=None, na_rep=None, join=None):
        'side': 'beginning',
        'method': 'split'})
    def split(self, pat=None, n=-1, expand=False):
+        if self._inferred_type in ['bytes']:


make this a method instead and call it where appropriate (or can do a decorator)

def assert_bytes_allowed(self): ....

jreback · 2018-11-06T02:46:30Z

pandas/tests/test_strings.py

+    @pytest.mark.parametrize('other', [Series, Index])
+    def test_str_cat_all_na(self, box, other):
+        # check that all NaNs in caller / target work
+        s = Index(['a', 'b', 'c', 'd'])


can you make this more transparent and don't use abbreviations. This is very hard to read

Not sure I understand you here - this uses the same format as the other tests for str.cat, and I can't see any abbreviations.

jreback · 2018-11-06T02:47:11Z

pandas/tests/test_strings.py

@@ -377,11 +377,36 @@ def test_str_cat_align_mixed_inputs(self, join):
        with tm.assert_raises_regex(ValueError, rgx):
            s.str.cat([t, z], join=join)

-    def test_str_cat_raises(self):
+    @pytest.mark.parametrize('box', [Series, Index])
+    def test_str_cat_raises(self, box):
        # non-strings hiding behind object dtype


need a test that validates that a bytes object is not wokring for all methods but decode

h-vetinari

@jreback

I switched to a decorator like you asked - it's much more elegant IMO, and also uncovered some deficiencies (e.g. the function names were not preserved in the existing wrappers).

pandas/core/strings.py

h-vetinari · 2018-11-07T00:10:09Z

pandas/core/strings.py

@@ -2917,7 +3021,8 @@ def rindex(self, sub, start=0, end=None):
    5    3.0
    dtype: float64
    """)
-    len = _noarg_wrapper(len, docstring=_shared_docs['len'], dtype=int)
+    len = _noarg_wrapper(len, docstring=_shared_docs['len'],
+                         forbidden_types=None, dtype=int)


I'm thinking that len should still be allowed for bytes to preserve a sort of minimal functionality compared to what is working as of v.0.23.4

h-vetinari · 2018-11-07T00:11:31Z

pandas/tests/test_strings.py

+                        ('find', ['some_pattern']),
+                        ('rfind', ['some_pattern']),
+                        # ('index', ['some_pattern']),
+                        # ('rindex', ['some_pattern']),


index and partition and get_dummies were having some errors on Index. Those will have to be investigated (or xfailed)

h-vetinari · 2018-11-07T00:12:00Z

pandas/tests/test_strings.py

+                               + [('floating', [1.0, np.nan, 2.0]),
+                                  ('integer', [1, 2, 3]),
+                                  # ('decimal', [...]),
+                                  # ('complex', [...]),


Filling in these types is WIP, but will be easy.

h-vetinari · 2018-11-07T00:13:11Z

pandas/tests/test_strings.py

+                t.str
+            assert not hasattr(t, 'str')
+
+    def test_api_mi_raises(self):


As far as I can tell, the (existing) error for trying to call MultiIndex.str was not tested so far.

h-vetinari · 2018-11-07T00:13:58Z

pandas/tests/test_strings.py

+        t = box(values, dtype=object)  # object dtype to avoid casting
+
+        bytes_allowed = method_name in ['encode', 'decode', 'len']
+        mixed_allowed = method_name not in ['cat']


The way this test is written/parametrized should make it very transparent which (inferred) dtypes are allowed for which method.

h-vetinari

Added a couple of issue refs for the bugs I found. I thought about splitting these fixes off into separate PRs, but all of them would essentially need the same parametrized-fixture-machinery I'm building in this PR, so I'm leaving them here for now.

h-vetinari · 2018-11-08T00:27:24Z

pandas/core/strings.py

@@ -939,7 +942,7 @@ def str_extractall(arr, pat, flags=0):
    if regex.groups == 0:
        raise ValueError("pattern contains no capture groups")

-    if isinstance(arr, ABCIndex):
+    if isinstance(arr, ABCIndexClass):


This line fixes #23555. (the problem was that a CategoricalIndex is not an instance of ABCIndex)

of course, ABCIndexClass is the base index class.

pandas/core/strings.py

h-vetinari · 2018-11-08T00:30:09Z

pandas/core/strings.py


        # for category, we do the stuff on the categories, so blow it up
        # to the full series again
        # But for some operations, we have to do the stuff on the full values,
        # so make it possible to skip this step as the method already did this
        # before the transformation...
        if use_codes and self._is_categorical:
-            result = take_1d(result, self._orig.cat.codes)
+            # if self._orig is a CategoricalIndex, there is no .cat-accessor
+            result = take_1d(result, Series(self._orig).cat.codes)


This lines fixes #23556

h-vetinari · 2018-11-08T00:31:29Z

pandas/core/strings.py

+    count = _pat_wrapper(str_count, flags=True, name='count')
+    startswith = _pat_wrapper(str_startswith, na=True, name='startswith')
+    endswith = _pat_wrapper(str_endswith, na=True, name='endswith')
+    findall = _pat_wrapper(str_findall, flags=True, name='findall')


Previously, these methods would end up with the name 'str_count', instead of 'count', for example. xref #23551

h-vetinari · 2018-11-08T00:32:50Z

pandas/tests/test_strings.py

+    ('integer', [1, np.nan, 2]),
+    ('mixed-integer-float', [1, np.nan, 2.0]),
+    ('decimal', [Decimal(1), np.nan, Decimal(2)]),
+    # ('complex', [1 + 1j, np.nan, 2 + 2j]),


This and the timedelta64 case below are commented out due to #23554

h-vetinari · 2018-11-08T00:36:58Z

pandas/tests/test_strings.py

+            pytest.xfail(reason='Method not nan-safe on Index; see GH 23558')
+        if (method_name == 'get_dummies' and box == Index
+                and inferred_dtype == 'empty' and dtype == object):
+            pytest.xfail(reason='Need to fortify get_dummies corner case')


This case wasn't an issue up until now, because Index.str raised if the inferred dtype was 'empty'. As such, I haven't opened an issue for it yet.

…ev#23167

h-vetinari · 2018-11-08T22:02:40Z

@jreback
Since all these things are interrelated, the only way I could think of how to split this PR, is adding the (initially mostly xfailed) tests of this PR in #23582, and then all the code fixes can follow independently.

jreback

so this PR is too much going on. pls break apart with a sepearate PR

for the name propagating fixing
.bytes accessor
test structure changes (you already did this i think as a pre-cursor)

jreback · 2018-11-11T17:14:20Z

pandas/_libs/lib.pyx

@@ -2258,6 +2258,8 @@ def tuples_to_object_array(ndarray[object] tuples):
    result = np.empty((n, k), dtype=object)
    for i in range(n):
        tup = tuples[i]


you would need to set the result to np.nan if its a null

np.empty((n, k), dtype=object) would have None in them, which then get used just like np.nan later. But can change of course.

jreback · 2018-11-11T17:14:50Z

pandas/_libs/lib.pyx

@@ -2274,6 +2276,8 @@ def to_object_array_tuples(rows: list):

    k = 0
    for i in range(n):
+        if checknull(rows[i]):


jreback · 2018-11-11T17:14:57Z

pandas/_libs/lib.pyx

@@ -2283,11 +2287,15 @@ def to_object_array_tuples(rows: list):
    try:
        for i in range(n):
            row = rows[i]
+            if checknull(row):


same for these

jreback · 2018-11-11T17:16:10Z

pandas/core/strings.py

@@ -54,6 +55,8 @@ def cat_core(list_of_columns, sep):
    nd.array
        The concatenation of list_of_columns with sep
    """
+    if sep == '':


why dont' you

if sep == '': list_with_sep = list_of_columns else: .... return np.sum(....

jreback · 2018-11-11T17:16:46Z

pandas/core/strings.py

@@ -939,7 +942,7 @@ def str_extractall(arr, pat, flags=0):
    if regex.groups == 0:
        raise ValueError("pattern contains no capture groups")

-    if isinstance(arr, ABCIndex):
+    if isinstance(arr, ABCIndexClass):


of course, ABCIndexClass is the base index class.

pandas/core/strings.py

h-vetinari · 2018-11-11T17:22:42Z

@jreback

so this PR is too much going on. pls break apart with a sepearate PR

Agreed, which is why I split off the tests (which are the main source of changes, and responsible for checking them) in #23582. Let's do that one first, and then I can easily split up this PR.

jreback · 2019-05-12T21:17:18Z

this looked ok, if @pandas-dev/pandas-core wants to finish this up.

h-vetinari · 2019-05-13T05:53:45Z

@jreback
Couldn't respond to your last ping at the time and then forgot. Merged master.

jreback

mainly doc-comments. ping on green.

jreback · 2019-05-16T01:26:54Z

doc/source/user_guide/text.rst

+.. warning::
+
+    Before v.0.25.0, the ``.str``-accessor did only the most rudimentary type checks. Starting with
+    v.0.25.0, the type of the Series is inferred (like it has already been the case for ``Index.str``),


remove what you have in the parens (like it...)

jreback · 2019-05-16T01:27:26Z

doc/source/whatsnew/v0.25.0.rst

+The ``.str``-accessor performs stricter type checks
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Due to the lack of a native string dtype in numpy, :attr:`Series.str` only checked whether the data was of ``object`` dtype.


Reword this in a more positive light. (basically remove up thru numpy)

jreback · 2019-05-16T01:28:56Z

doc/source/whatsnew/v0.25.0.rst

+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Due to the lack of a native string dtype in numpy, :attr:`Series.str` only checked whether the data was of ``object`` dtype.
+From now on, the inferred dtype of the Series is checked to be correct (particularly, not ``'bytes'``), as :attr:`Index.str` does already.


From now on, the inferred dtype of the Series is checked to be correct (particularly, not 'bytes'), as :attr:Index.str does alread

:attr:`Series.str` will now infer the dtype data *within* the Series, in particular, bytes will raise an exception

pandas/core/strings.py

jreback · 2019-05-16T01:35:00Z

doc/source/whatsnew/v0.25.0.rst

+The ``.str``-accessor performs stricter type checks
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Due to the lack of a native string dtype in numpy, :attr:`Series.str` only checked whether the data was of ``object`` dtype.


add the issues that are being closed here (before the examples)

jreback · 2019-05-19T19:32:59Z

@h-vetinari a couple of comments, merge master and ping on green for merge.

h-vetinari · 2019-05-30T15:00:18Z

@jreback
This is green.

Note that I enabled 'bytes' also for methods .str.get and .str.slice, as those are still useful in that case, and to reduce the potential impact of the breaking change.

h-vetinari · 2019-05-30T15:09:05Z

Argh, just double-checked and saw that I had forgotten to save the whatsnew changes...

jreback · 2019-06-01T15:02:37Z

pandas/tests/test_strings.py

-                and inferred_dtype in ['boolean', 'date', 'time']):
-            pytest.xfail(reason='Inferring incorrectly because of NaNs; '
-                         'solved by GH 23167')
+        if (method_name in ['partition', 'rpartition'] and box == Index


nice on removing some of the xfails; ideally we can remove the rest in future PRs

jreback · 2019-06-01T15:03:39Z

thanks @h-vetinari

as you have seen, targeted PRs merge much much faster than giant refactors

h-vetinari · 2019-06-01T15:08:02Z

@jreback: as you have seen, targeted PRs merge much much faster than giant refactors

Fair, but that's not a good reason to completely avoid all refactors. Sometimes, painful(ly large) PRs are necessary.

…n all-NA) (pandas-dev#23167)

jreback added the Strings String extension data type and string data label Oct 17, 2018

jreback requested changes Oct 17, 2018

View reviewed changes

h-vetinari commented Oct 17, 2018

View reviewed changes

jreback requested changes Oct 23, 2018

View reviewed changes

This was referenced Oct 30, 2018

API: fix corner cases of lib.infer_dtype #23421

Closed

API: fix corner case of lib.infer_dtype #23422

Merged

h-vetinari force-pushed the str_infer branch from 434af4b to c8bd2b9 Compare October 30, 2018 21:13

h-vetinari force-pushed the str_infer branch 3 times, most recently from 5f96220 to e1b69af Compare November 5, 2018 19:01

jreback requested changes Nov 6, 2018

View reviewed changes

h-vetinari commented Nov 7, 2018

View reviewed changes

This was referenced Nov 7, 2018

API: __name__ of str.cat.methods wrong #23551

Closed

BUG: lib.infer_type broken for IntervalArray / PeriodArray #23553

Closed

API/DOC: types covered by lib.infer_dtype #23554

Open

h-vetinari force-pushed the str_infer branch from 52aedc5 to d80d702 Compare November 7, 2018 23:26

This was referenced Nov 7, 2018

BUG: CategoricalIndex.str.extractall fails #23555

Closed

BUG: many methods on CategoricalIndex.str are broken #23556

Closed

BUG: Index.str.partition not nan-safe #23558

Closed

h-vetinari commented Nov 8, 2018

View reviewed changes

h-vetinari added a commit to h-vetinari/pandas that referenced this pull request Nov 8, 2018

TST: add method/dtype coverage to str-accessor; precursor to pandas-d…

9800447

…ev#23167

h-vetinari mentioned this pull request Nov 8, 2018

TST: add method/dtype coverage to str-accessor; precursor to #23167 #23582

Merged

3 tasks

jreback requested changes Nov 11, 2018

View reviewed changes

Merge remote-tracking branch 'upstream/master' into str_infer

a68ea5f

Merge remote-tracking branch 'upstream/master' into str_infer

5ae1533

jreback requested changes May 16, 2019

View reviewed changes

jreback reviewed May 16, 2019

View reviewed changes

jreback added this to the 0.25.0 milestone May 19, 2019

h-vetinari added 3 commits May 30, 2019 15:45

Merge remote-tracking branch 'upstream/master' into str_infer

b8f6792

fix merge artefact

0c7e233

Review (jreback)

1168ca2

commit whatsnew changes

ab980ec

h-vetinari added 3 commits May 30, 2019 17:10

retrigger azure

a996889

remove mentions of 'unicode'

4adef35

improve docstring for ._validate

f62e344

jreback reviewed Jun 1, 2019

View reviewed changes

jreback approved these changes Jun 1, 2019

View reviewed changes

jreback merged commit 9a42cbe into pandas-dev:master Jun 1, 2019

h-vetinari deleted the str_infer branch June 1, 2019 15:09

vaibhavhrt pushed a commit to vaibhavhrt/pandas that referenced this pull request Jun 6, 2019

API: Series.str-accessor infers dtype (and Index.str does not raise o…

a2f9013

…n all-NA) (pandas-dev#23167)

h-vetinari mentioned this pull request Jul 26, 2019

DEPR: execute deprecations for str.cat in v1.0 #27611

Merged

3 tasks

simonjayhawkins mentioned this pull request Feb 18, 2020

Pandas 1.0.1 gives 1953 TypeError: Cannot use .str.encode with values of inferred dtype 'bytes'. #32048

Closed

API: Series.str-accessor infers dtype (and Index.str does not raise on all-NA) #23167

API: Series.str-accessor infers dtype (and Index.str does not raise on all-NA) #23167

Conversation

h-vetinari commented Oct 15, 2018 • edited by jreback Loading

pep8speaks commented Oct 15, 2018 • edited Loading

Comment last updated at 2019-05-30 15:17:50 UTC

h-vetinari commented Oct 15, 2018

codecov bot commented Oct 16, 2018 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

h-vetinari left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

h-vetinari left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

h-vetinari commented Oct 30, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

h-vetinari left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

h-vetinari left a comment

Choose a reason for hiding this comment

h-vetinari Nov 8, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

h-vetinari commented Nov 8, 2018

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

h-vetinari commented Nov 11, 2018

jreback commented May 12, 2019

h-vetinari commented May 13, 2019

jreback left a comment

Choose a reason for hiding this comment

h-vetinari commented Oct 15, 2018 •

edited by jreback

Loading

pep8speaks commented Oct 15, 2018 •

edited

Loading

codecov bot commented Oct 16, 2018 •

edited

Loading

h-vetinari commented Oct 30, 2018 •

edited

Loading

h-vetinari Nov 8, 2018 •

edited

Loading