-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: Series.str-accessor infers dtype (and Index.str does not raise on all-NA) #23167
Conversation
Hello @h-vetinari! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2019-05-30 15:17:50 UTC |
Hm, turns out the |
Codecov Report
@@ Coverage Diff @@
## master #23167 +/- ##
==========================================
- Coverage 91.84% 91.84% -0.01%
==========================================
Files 174 174
Lines 50643 50682 +39
==========================================
+ Hits 46515 46550 +35
- Misses 4128 4132 +4
Continue to review full report at Codecov.
|
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -208,6 +208,8 @@ Backwards incompatible API changes | |||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |||
|
|||
- A newly constructed empty :class:`DataFrame` with integer as the ``dtype`` will now only be cast to ``float64`` if ``index`` is specified (:issue:`22858`) | |||
- The `.str`-accessor does not work for anything other than string data anymore. Any other uses (like ``bytes``, for example) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you reword a bit. something like
The `.str`-accessor will raise on any non-(string or unicode) ``object`` dtype data. Previously this would accept ``bytes`` for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that this is how you would love to see the .str
-accessor work (and it's not like my sentence is better).
The problem is that the .str
-accessor has effectively been an .object
-accessor for a very long time, and is used as such - even pandas-internally, like .str.join
for lists or str.decode
for bytes.
Also things like calculating the lengths of a Series of lists/sets is very common - almost 10k views at this SO question.
So it will probably be a longer process to move away from that (and would IMO need some suitable replacements).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, then i think we maybe need to just show a warning for non-string/unicode uses. I guess this would generalize to calling anything-other-than-decode on bytes as well (though should raise on that).
So what i'll suggest is that we do an infer pass at the very beginning, rejecting the obvious things, then pass that inferred string around so that the methods can decide what to do, and have several canned responsed, eg.. a standard warning, or an error (for the bytes things).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you may actually want to create a class object around which has the pointers to the data and the inferred state. this is similar to how we handled infer_dtype dispatching.
pandas/core/strings.py
Outdated
raise AttributeError('Can only use .str accessor with Index, ' | ||
'not MultiIndex') | ||
|
||
# see src/inference.pyx which can contain string values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
empty is only ok if its object dtype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
pandas/core/strings.py
Outdated
# see src/inference.pyx which can contain string values | ||
allowed_types = ['string', 'unicode', 'empty', | ||
'mixed', 'mixed-integer'] | ||
if isinstance(data, ABCSeries): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why does this accept bytes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See OP:
What needs disussion is the following: I've commented out two tests about an encode-decode-roundtrip that cannot work when forbidding
bytes
data. Should such an encode-decode-cycle explicitly part of the desired functionality for.str
(currently there are.str.encode
and.str.decode
methods), or not? I guess that even if not, this would need a deprecation cycle.
I originally commented out the only tests in test_strings
that needed bytes
and completely forbade bytes
, but then there were lots of test errors, see e.g. https://travis-ci.org/pandas-dev/pandas/jobs/441825774
Alternatively, it'd be possible to allow construction of the StringMethods
object but raise on inferred_dtype='bytes'
for all but the str.decode
method.
pandas/core/strings.py
Outdated
if isinstance(data, ABCSeries): | ||
allowed_types = allowed_types + ['bytes'] | ||
|
||
values = data if isinstance(data, Index) else data.values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getattr(data, 'values', data)
pandas/core/strings.py
Outdated
else: | ||
inf_type = lib.infer_dtype(values) | ||
|
||
all_na_obj = is_object_dtype(values.dtype) and data.isna().all() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is by defintion what 'empty' means, you only need to check if its object type in that case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, that's not True. For all-NA Series/Index of object dtype, the inference still returns 'floating'
.
>>> import pandas as pd
>>> import numpy as np
>>> import pandas._libs.lib as lib
>>> lib.infer_dtype(pd.Index([np.nan, np.nan]))
'floating'
>>> lib.infer_dtype(pd.Series([np.nan, np.nan]))
'floating'
>>> lib.infer_dtype(pd.Series([]))
'floating'
>>> lib.infer_dtype(pd.Index([]))
'empty'
>>> lib.infer_dtype(pd.Index([np.nan, np.nan], dtype=object))
'floating'
>>> lib.infer_dtype(pd.Series([np.nan, np.nan], dtype=object))
'floating'
>>> lib.infer_dtype(pd.Series([], dtype=object))
'empty'
>>> lib.infer_dtype(pd.Index([], dtype=object))
'empty'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, still separating these out to separate if/elif clauses makes this easier to grok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't necessary anymore if I use .dropna
. Then the only 'mixed'
are really mixed (and not caused by e.g. np.nan
)
pandas/core/strings.py
Outdated
|
||
values = data if isinstance(data, Index) else data.values | ||
if is_categorical_dtype(data.dtype): | ||
inf_type = lib.infer_dtype(values.categories) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls pls don't use abbreviation. inferred_type
is the name for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This came from the existing code. Will change
pandas/core/strings.py
Outdated
all_na_obj = is_object_dtype(values.dtype) and data.isna().all() | ||
|
||
# same for Series and Index (that is not MultiIndex) | ||
if inf_type not in allowed_types and not all_na_obj: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rather than doing this here, I would for example have an if block that handles the empty (and checks if its object_dtype, then you immediately know if its ok or not)
then pass thru string and unicode. anything else is a raise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok in principle, but it's definitely necessary to allow 'mixed'
:
>>> lib.infer_dtype(pd.Series(['a', np.nan]))
'mixed'
Disallowing 'mixed-integer'
would also break existing code, but I mentioned in the OP:
Finally, it would be possible to partially close #13877 by also excluding
'mixed-integer'
from the allowed inferred types.
Check out the example in #13877 - currently the .str
-methods are generally permissive about just returning NaN
if they can't compute individual values. Arguably, that's also the most desirable behaviour for things like .str.startswith
. Could introduce an 'errors'
-kwarg where necessary, but all of that needs to happen on method level - the .str
-constructor must allow these to pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment as above, we need a very clear path where each case can be handled independently if need be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could infer on data.dropna()
, then the answer would be much clearer, I think.
I'd actually like to save the result of all this type inferring in an attribute of the StringMethods
object (like ._inferred_type
) so that it can be reused by the methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Independently, we could also (at a later stage) add the errors
-kwarg to .str
itself, and the methods then inherit it.
pandas/core/strings.py
Outdated
@@ -2274,7 +2274,8 @@ def cat(self, others=None, sep=None, na_rep=None, join=None): | |||
result = cat_core(all_cols, sep) | |||
|
|||
if isinstance(self._orig, Index): | |||
result = Index(result, name=self._orig.name) | |||
# add dtype for case that result is all-NA | |||
result = Index(result, dtype='object', name=self._orig.name) | |||
else: # Series | |||
result = Series(result, index=data.index, name=self._orig.name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prob need dtype=object (you don't need quotes FYI)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review. There are still several open questions on how to proceed.
pandas/core/strings.py
Outdated
raise AttributeError('Can only use .str accessor with Index, ' | ||
'not MultiIndex') | ||
|
||
# see src/inference.pyx which can contain string values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
pandas/core/strings.py
Outdated
# see src/inference.pyx which can contain string values | ||
allowed_types = ['string', 'unicode', 'empty', | ||
'mixed', 'mixed-integer'] | ||
if isinstance(data, ABCSeries): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See OP:
What needs disussion is the following: I've commented out two tests about an encode-decode-roundtrip that cannot work when forbidding
bytes
data. Should such an encode-decode-cycle explicitly part of the desired functionality for.str
(currently there are.str.encode
and.str.decode
methods), or not? I guess that even if not, this would need a deprecation cycle.
I originally commented out the only tests in test_strings
that needed bytes
and completely forbade bytes
, but then there were lots of test errors, see e.g. https://travis-ci.org/pandas-dev/pandas/jobs/441825774
Alternatively, it'd be possible to allow construction of the StringMethods
object but raise on inferred_dtype='bytes'
for all but the str.decode
method.
pandas/core/strings.py
Outdated
|
||
values = data if isinstance(data, Index) else data.values | ||
if is_categorical_dtype(data.dtype): | ||
inf_type = lib.infer_dtype(values.categories) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This came from the existing code. Will change
pandas/core/strings.py
Outdated
else: | ||
inf_type = lib.infer_dtype(values) | ||
|
||
all_na_obj = is_object_dtype(values.dtype) and data.isna().all() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, that's not True. For all-NA Series/Index of object dtype, the inference still returns 'floating'
.
>>> import pandas as pd
>>> import numpy as np
>>> import pandas._libs.lib as lib
>>> lib.infer_dtype(pd.Index([np.nan, np.nan]))
'floating'
>>> lib.infer_dtype(pd.Series([np.nan, np.nan]))
'floating'
>>> lib.infer_dtype(pd.Series([]))
'floating'
>>> lib.infer_dtype(pd.Index([]))
'empty'
>>> lib.infer_dtype(pd.Index([np.nan, np.nan], dtype=object))
'floating'
>>> lib.infer_dtype(pd.Series([np.nan, np.nan], dtype=object))
'floating'
>>> lib.infer_dtype(pd.Series([], dtype=object))
'empty'
>>> lib.infer_dtype(pd.Index([], dtype=object))
'empty'
pandas/core/strings.py
Outdated
all_na_obj = is_object_dtype(values.dtype) and data.isna().all() | ||
|
||
# same for Series and Index (that is not MultiIndex) | ||
if inf_type not in allowed_types and not all_na_obj: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok in principle, but it's definitely necessary to allow 'mixed'
:
>>> lib.infer_dtype(pd.Series(['a', np.nan]))
'mixed'
Disallowing 'mixed-integer'
would also break existing code, but I mentioned in the OP:
Finally, it would be possible to partially close #13877 by also excluding
'mixed-integer'
from the allowed inferred types.
Check out the example in #13877 - currently the .str
-methods are generally permissive about just returning NaN
if they can't compute individual values. Arguably, that's also the most desirable behaviour for things like .str.startswith
. Could introduce an 'errors'
-kwarg where necessary, but all of that needs to happen on method level - the .str
-constructor must allow these to pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for review; test additions will follow
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -208,6 +208,8 @@ Backwards incompatible API changes | |||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |||
|
|||
- A newly constructed empty :class:`DataFrame` with integer as the ``dtype`` will now only be cast to ``float64`` if ``index`` is specified (:issue:`22858`) | |||
- The `.str`-accessor does not work for anything other than string data anymore. Any other uses (like ``bytes``, for example) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that this is how you would love to see the .str
-accessor work (and it's not like my sentence is better).
The problem is that the .str
-accessor has effectively been an .object
-accessor for a very long time, and is used as such - even pandas-internally, like .str.join
for lists or str.decode
for bytes.
Also things like calculating the lengths of a Series of lists/sets is very common - almost 10k views at this SO question.
So it will probably be a longer process to move away from that (and would IMO need some suitable replacements).
pandas/core/strings.py
Outdated
else: | ||
inf_type = lib.infer_dtype(values) | ||
|
||
all_na_obj = is_object_dtype(values.dtype) and data.isna().all() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't necessary anymore if I use .dropna
. Then the only 'mixed'
are really mixed (and not caused by e.g. np.nan
)
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -208,6 +208,8 @@ Backwards incompatible API changes | |||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |||
|
|||
- A newly constructed empty :class:`DataFrame` with integer as the ``dtype`` will now only be cast to ``float64`` if ``index`` is specified (:issue:`22858`) | |||
- The `.str`-accessor does not work for anything other than string data anymore. Any other uses (like ``bytes``, for example) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, then i think we maybe need to just show a warning for non-string/unicode uses. I guess this would generalize to calling anything-other-than-decode on bytes as well (though should raise on that).
So what i'll suggest is that we do an infer pass at the very beginning, rejecting the obvious things, then pass that inferred string around so that the methods can decide what to do, and have several canned responsed, eg.. a standard warning, or an error (for the bytes things).
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -208,6 +208,8 @@ Backwards incompatible API changes | |||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |||
|
|||
- A newly constructed empty :class:`DataFrame` with integer as the ``dtype`` will now only be cast to ``float64`` if ``index`` is specified (:issue:`22858`) | |||
- The `.str`-accessor does not work for anything other than string data anymore. Any other uses (like ``bytes``, for example) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you may actually want to create a class object around which has the pointers to the data and the inferred state. this is similar to how we handled infer_dtype dispatching.
I implemented just that for the bytes case now. PTAL Side note: I needed to use #23422 already to be able to use the type inference. |
5f96220
to
e1b69af
Compare
pandas/core/strings.py
Outdated
|
||
values = getattr(data, 'values', data) # Series / Index | ||
values = getattr(values, 'categories', values) # categorical / normal | ||
# missing values mess up type inference -> skip |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blank line
'mess up' -> obfuscate
pandas/core/strings.py
Outdated
@@ -2400,13 +2411,17 @@ def cat(self, others=None, sep=None, na_rep=None, join=None): | |||
'side': 'beginning', | |||
'method': 'split'}) | |||
def split(self, pat=None, n=-1, expand=False): | |||
if self._inferred_type in ['bytes']: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this a method instead and call it where appropriate (or can do a decorator)
def assert_bytes_allowed(self):
....
pandas/tests/test_strings.py
Outdated
@pytest.mark.parametrize('other', [Series, Index]) | ||
def test_str_cat_all_na(self, box, other): | ||
# check that all NaNs in caller / target work | ||
s = Index(['a', 'b', 'c', 'd']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make this more transparent and don't use abbreviations. This is very hard to read
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand you here - this uses the same format as the other tests for str.cat
, and I can't see any abbreviations.
pandas/tests/test_strings.py
Outdated
@@ -377,11 +377,36 @@ def test_str_cat_align_mixed_inputs(self, join): | |||
with tm.assert_raises_regex(ValueError, rgx): | |||
s.str.cat([t, z], join=join) | |||
|
|||
def test_str_cat_raises(self): | |||
@pytest.mark.parametrize('box', [Series, Index]) | |||
def test_str_cat_raises(self, box): | |||
# non-strings hiding behind object dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need a test that validates that a bytes object is not wokring for all methods but decode
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I switched to a decorator like you asked - it's much more elegant IMO, and also uncovered some deficiencies (e.g. the function names were not preserved in the existing wrappers).
@@ -2917,7 +3021,8 @@ def rindex(self, sub, start=0, end=None): | |||
5 3.0 | |||
dtype: float64 | |||
""") | |||
len = _noarg_wrapper(len, docstring=_shared_docs['len'], dtype=int) | |||
len = _noarg_wrapper(len, docstring=_shared_docs['len'], | |||
forbidden_types=None, dtype=int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking that len
should still be allowed for bytes
to preserve a sort of minimal functionality compared to what is working as of v.0.23.4
pandas/tests/test_strings.py
Outdated
('find', ['some_pattern']), | ||
('rfind', ['some_pattern']), | ||
# ('index', ['some_pattern']), | ||
# ('rindex', ['some_pattern']), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
index
and partition
and get_dummies
were having some errors on Index
. Those will have to be investigated (or xfailed)
pandas/tests/test_strings.py
Outdated
+ [('floating', [1.0, np.nan, 2.0]), | ||
('integer', [1, 2, 3]), | ||
# ('decimal', [...]), | ||
# ('complex', [...]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filling in these types is WIP, but will be easy.
pandas/tests/test_strings.py
Outdated
t.str | ||
assert not hasattr(t, 'str') | ||
|
||
def test_api_mi_raises(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I can tell, the (existing) error for trying to call MultiIndex.str
was not tested so far.
pandas/tests/test_strings.py
Outdated
t = box(values, dtype=object) # object dtype to avoid casting | ||
|
||
bytes_allowed = method_name in ['encode', 'decode', 'len'] | ||
mixed_allowed = method_name not in ['cat'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way this test is written/parametrized should make it very transparent which (inferred) dtypes are allowed for which method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a couple of issue refs for the bugs I found. I thought about splitting these fixes off into separate PRs, but all of them would essentially need the same parametrized-fixture-machinery I'm building in this PR, so I'm leaving them here for now.
pandas/core/strings.py
Outdated
@@ -939,7 +942,7 @@ def str_extractall(arr, pat, flags=0): | |||
if regex.groups == 0: | |||
raise ValueError("pattern contains no capture groups") | |||
|
|||
if isinstance(arr, ABCIndex): | |||
if isinstance(arr, ABCIndexClass): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line fixes #23555. (the problem was that a CategoricalIndex
is not an instance of ABCIndex
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
of course, ABCIndexClass is the base index class.
pandas/core/strings.py
Outdated
|
||
# for category, we do the stuff on the categories, so blow it up | ||
# to the full series again | ||
# But for some operations, we have to do the stuff on the full values, | ||
# so make it possible to skip this step as the method already did this | ||
# before the transformation... | ||
if use_codes and self._is_categorical: | ||
result = take_1d(result, self._orig.cat.codes) | ||
# if self._orig is a CategoricalIndex, there is no .cat-accessor | ||
result = take_1d(result, Series(self._orig).cat.codes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This lines fixes #23556
count = _pat_wrapper(str_count, flags=True, name='count') | ||
startswith = _pat_wrapper(str_startswith, na=True, name='startswith') | ||
endswith = _pat_wrapper(str_endswith, na=True, name='endswith') | ||
findall = _pat_wrapper(str_findall, flags=True, name='findall') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, these methods would end up with the name 'str_count'
, instead of 'count'
, for example. xref #23551
pandas/tests/test_strings.py
Outdated
('integer', [1, np.nan, 2]), | ||
('mixed-integer-float', [1, np.nan, 2.0]), | ||
('decimal', [Decimal(1), np.nan, Decimal(2)]), | ||
# ('complex', [1 + 1j, np.nan, 2 + 2j]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This and the timedelta64
case below are commented out due to #23554
pandas/tests/test_strings.py
Outdated
pytest.xfail(reason='Method not nan-safe on Index; see GH 23558') | ||
if (method_name == 'get_dummies' and box == Index | ||
and inferred_dtype == 'empty' and dtype == object): | ||
pytest.xfail(reason='Need to fortify get_dummies corner case') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This case wasn't an issue up until now, because Index.str
raised if the inferred dtype was 'empty'
. As such, I haven't opened an issue for it yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this PR is too much going on. pls break apart with a sepearate PR
- for the name propagating fixing
- .bytes accessor
- test structure changes (you already did this i think as a pre-cursor)
pandas/_libs/lib.pyx
Outdated
@@ -2258,6 +2258,8 @@ def tuples_to_object_array(ndarray[object] tuples): | |||
result = np.empty((n, k), dtype=object) | |||
for i in range(n): | |||
tup = tuples[i] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you would need to set the result to np.nan if its a null
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
np.empty((n, k), dtype=object)
would have None
in them, which then get used just like np.nan
later. But can change of course.
pandas/_libs/lib.pyx
Outdated
@@ -2274,6 +2276,8 @@ def to_object_array_tuples(rows: list): | |||
|
|||
k = 0 | |||
for i in range(n): | |||
if checknull(rows[i]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
pandas/_libs/lib.pyx
Outdated
@@ -2283,11 +2287,15 @@ def to_object_array_tuples(rows: list): | |||
try: | |||
for i in range(n): | |||
row = rows[i] | |||
if checknull(row): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same for these
pandas/core/strings.py
Outdated
@@ -54,6 +55,8 @@ def cat_core(list_of_columns, sep): | |||
nd.array | |||
The concatenation of list_of_columns with sep | |||
""" | |||
if sep == '': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why dont' you
if sep == '':
list_with_sep = list_of_columns
else:
....
return np.sum(....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
pandas/core/strings.py
Outdated
@@ -939,7 +942,7 @@ def str_extractall(arr, pat, flags=0): | |||
if regex.groups == 0: | |||
raise ValueError("pattern contains no capture groups") | |||
|
|||
if isinstance(arr, ABCIndex): | |||
if isinstance(arr, ABCIndexClass): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
of course, ABCIndexClass is the base index class.
this looked ok, if @pandas-dev/pandas-core wants to finish this up. |
@jreback |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mainly doc-comments. ping on green.
doc/source/user_guide/text.rst
Outdated
.. warning:: | ||
|
||
Before v.0.25.0, the ``.str``-accessor did only the most rudimentary type checks. Starting with | ||
v.0.25.0, the type of the Series is inferred (like it has already been the case for ``Index.str``), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove what you have in the parens (like it...)
doc/source/whatsnew/v0.25.0.rst
Outdated
The ``.str``-accessor performs stricter type checks | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
Due to the lack of a native string dtype in numpy, :attr:`Series.str` only checked whether the data was of ``object`` dtype. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reword this in a more positive light. (basically remove up thru numpy)
doc/source/whatsnew/v0.25.0.rst
Outdated
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
Due to the lack of a native string dtype in numpy, :attr:`Series.str` only checked whether the data was of ``object`` dtype. | ||
From now on, the inferred dtype of the Series is checked to be correct (particularly, not ``'bytes'``), as :attr:`Index.str` does already. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From now on, the inferred dtype of the Series is checked to be correct (particularly, not
'bytes'
), as :attr:Index.str
does alread
:attr:`Series.str` will now infer the dtype data *within* the Series, in particular,
bytes will raise an exception
doc/source/whatsnew/v0.25.0.rst
Outdated
The ``.str``-accessor performs stricter type checks | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
Due to the lack of a native string dtype in numpy, :attr:`Series.str` only checked whether the data was of ``object`` dtype. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the issues that are being closed here (before the examples)
@h-vetinari a couple of comments, merge master and ping on green for merge. |
@jreback Note that I enabled |
Argh, just double-checked and saw that I had forgotten to save the whatsnew changes... |
and inferred_dtype in ['boolean', 'date', 'time']): | ||
pytest.xfail(reason='Inferring incorrectly because of NaNs; ' | ||
'solved by GH 23167') | ||
if (method_name in ['partition', 'rpartition'] and box == Index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice on removing some of the xfails; ideally we can remove the rest in future PRs
thanks @h-vetinari as you have seen, targeted PRs merge much much faster than giant refactors |
Fair, but that's not a good reason to completely avoid all refactors. Sometimes, painful(ly large) PRs are necessary. |
closes #23163
closes #23011
closes #23551
#23555 / #23556git diff upstream/master -u -- "*.py" | flake8 --diff
Several times while working around
str.cat
(most recently #22725) @jreback and @WillAyd mentioned that the.str
accessor should not work for (e.g.) bytes data. So far, the constructor of theStringMethods
object didn't infer forSeries
-- this PR adds that.What needs disussion is the following: I've commented out two tests about an encode-decode-roundtrip that cannot work when forbidding bytes data. Should such an encode-decode-cycle explicitly part of the desired functionality for.str
(currently there are.str.encode
and.str.decode
methods), or not? I guess that even if not, this would need a deprecation cycle.Alternatively, it'd be possible to allow construction of the
StringMethods
object but raise oninferred_dtype='bytes'
for all but thestr.decode
method.Finally, it would be possible to partially close #13877 by also excluding
'mixed-integer'
from the allowed inferred types,but for floats, the inference-code would need to be able to distinguish between actual floats and missing values (currently, both of those returninferred_type='mixed'
)