API: str.cat will align on Series #20347

h-vetinari · 2018-03-14T14:44:38Z

Fixes issue #18657, fixed existing tests, added new test; all pass.

After I pushed everything and thought about it some more, I realised that one may argue about the default alignment-behavior, and whether it should be changed to join=outer. The behavior as implemented is compatible with the current requirement that everything be of the same length. To me, it is more intuitive that the concatenated other is added to the current series without enlarging it, but I can also see the argument why that restriction is unnecessary.

PS. This is my first PR, tried to follow all the rules. Sorry if I overlooked something.

Edit: Also fixes #20842

pep8speaks · 2018-03-14T14:44:42Z

Hello @h-vetinari! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on May 02, 2018 at 05:33 Hours UTC

h-vetinari · 2018-03-14T15:13:36Z

I just realised that due to the API-change, a what's-new entry is probably needed somewhere?

codecov · 2018-03-14T17:39:59Z

Codecov Report

Merging #20347 into master will increase coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #20347      +/-   ##
==========================================
+ Coverage   91.79%   91.81%   +0.01%     
==========================================
  Files         153      153              
  Lines       49411    49478      +67     
==========================================
+ Hits        45359    45429      +70     
+ Misses       4052     4049       -3

Flag	Coverage Δ
#multiple	`90.21% <100%> (+0.01%)`	⬆️
#single	`41.85% <4%> (-0.06%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/strings.py	`98.63% <100%> (+0.28%)`	⬆️
pandas/util/testing.py	`84.59% <0%> (+0.2%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c4da79b...3f77b80. Read the comment docs.

h-vetinari · 2018-03-15T09:46:18Z

There are some useless commits in the appveyor-queue - how can those be cancelled? I'm guessing I don't have sufficient rights to do it. If someone who can should see this, you can cancel builds for commits (starting with): 41daef8d and d1c5543f.

jreback · 2018-03-15T10:18:21Z

pandas/tests/test_strings.py

@@ -2760,6 +2761,17 @@ def test_str_cat_raises_intuitive_error(self):
        with tm.assert_raises_regex(ValueError, message):
            s.str.cat('    ')

+    def test_str_cat_align(self):
+        # https://github.com/pandas-dev/pandas/issues/18657


needs another case where this would produce nans on some elements (e.g. the original issue)

jreback · 2018-03-15T10:19:39Z

pandas/core/strings.py

@@ -65,6 +78,11 @@ def str_cat(arr, others=None, sep=None, na_rep=None):
        If None, concatenates without any separator.
    na_rep : string or None, default None
        If None, NA in the series are ignored.
+    align : bool or None, default None
+        If used between two Series, determines whether they are aligned


add a versionadded tag

UserWarning -> FutureWarning

jreback · 2018-03-15T10:19:52Z

pandas/core/strings.py

+            and len(others) and isinstance(others, Series)):
+        if align is None:
+            align = False
+            warnings.warn("A future version of pandas will perform alignment "


FutureWarning

jreback · 2018-03-15T10:22:53Z

pandas/core/strings.py

+        if align is None:
+            align = False
+            warnings.warn("A future version of pandas will perform alignment "
+                          "when others is a series. To disable alignment (the "


This is a bit unclear: there is no 'previous behavior'. The default is to not align, but that will change in the future.

The text is the suggestion of @TomAugspurger in the original issue #18657

jreback

need a whatsnew subsection to explain this change, also pls update io.rst

h-vetinari · 2018-03-15T13:57:49Z

@jreback, I assume you meant text.rst and not io.rst. Changed alignment to join='outer', added tests, and updated v0.23.0.txt and text.rst.

TomAugspurger

A few doc comments.

Have you checked the log output from Travis to see if any warnings in the test suite are uncaught? They're printed at the bottom of the test output.

TomAugspurger · 2018-03-15T14:04:45Z

doc/source/text.rst

+their indexes will be aligned before concatenation (if ``align=True``) or not (if ``align=False``). As usual, alignment will expand to the union of both
+indexes, while introducing ``NaN`` for missing values in the respective other series (which can be easily handled with the ``na_rep``-keyword).
+
+If the ``align`` keyword is not passed, the method will currently fall back to the previous behavior (i.e. ``align=False``),


Put this in a .. warning:: directive.

Which part? I added one starting at "If the align keyword is not passed"

TomAugspurger · 2018-03-15T14:05:11Z

doc/source/text.rst

+.. ipython:: python
+
+    base = Series(['a', 'b', 'c', 'd', 'e'])
+	s = base.reindex([0, 1, 2, 3])


Can you indent all these the same?

TomAugspurger · 2018-03-15T14:06:14Z

doc/source/text.rst

+	s.str.cat(t, align=True)
+	s.str.cat(t, align=False, na_rep='')
+
+.. versionadded:: 0.23.0


It's unclear what this versionadded is referring to. The keyword? The .str.cat method? I think it's best to leave that to the API documentation.

TomAugspurger · 2018-03-15T14:06:38Z

doc/source/whatsnew/v0.23.0.txt

+``Series.str.cat`` has gained the ``align`` kwarg
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+So far, the method :meth:`Series.str.cat` did not -- in contrast to most of ``pandas`` -- align :class:`Series` on their index before concatenation (see :issue:`18657`).


"So far, the method" -> "Previously"

TomAugspurger · 2018-03-15T14:07:09Z

doc/source/whatsnew/v0.23.0.txt

+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+So far, the method :meth:`Series.str.cat` did not -- in contrast to most of ``pandas`` -- align :class:`Series` on their index before concatenation (see :issue:`18657`).
+The method has now gained a keyword ``align`` which controls this behavior. If ``False``, the behavior will be as previously. If ``True`` and ``others``


"which controls this behavior" -> "to control alignment"

I think the second sentence and the next paragraph can be simplified.

The default behavior, not aligning, has not changed. If `align` is not specified, a ``FutureWarning`` is issued and the series are not aligned. To silence the warning and not align, specify ``align=False``. To silence the warning and align the Series before concatenating, specify ``align=True``.

Not sure if I understood correctly; removed part with na_rep.

TomAugspurger · 2018-03-15T14:10:47Z

doc/source/whatsnew/v0.23.0.txt

+.. ipython:: python
+
+    base = Series(['a', 'b', 'c', 'd', 'e'])
+	s = base.reindex([0, 1, 2, 3])


Ensure these have the same indentation.

h-vetinari · 2018-03-15T15:47:26Z

@TomAugspurger @jreback Can one of you (tell me how to) remove all but the last of my commits from the appveyor-queue? It's quite far behind as it is, so no need to choke it with unnecessary commits (in Travis new commits automatically supersede old ones - why not in appveyor...?).

TomAugspurger · 2018-03-15T16:19:13Z

Appveyor will auto cancel a build if there are newer commits on the PR branch, but it doesn't show up as canceled until its turn in the queue.


h-vetinari · 2018-03-15T19:38:03Z

@TomAugspurger , re:appveyor: this is not the case, at least not immediately. I can see in https://ci.appveyor.com/project/pandas-dev/pandas/history that some old commits were tried (while the new ones already existed) and ran for 1-2 minutes. I asked because I saw that some other commits were explicitly cancelled by users, but I don't know how to do that.

jorisvandenbossche

Can you also add a test where others is a list of multiple Series ?

h-vetinari · 2018-03-15T21:16:14Z

@jorisvandenbossche I wasn't even aware that's a legal signature? I'm guessing all Series would be concatenated with the same sep (and na_rep, etc.)? And it is only legal if all are Series? What about a mix of Series and ndarray?

jorisvandenbossche · 2018-03-15T21:55:12Z

I'm guessing all Series would be concatenated with the same sep (and na_rep, etc.)?

Yes, I think so

And it is only legal if all are Series? What about a mix of Series and ndarray?

I suppose we should just process each element in the list separately, so then it does not really matter if it is a mixture.
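To make the question about mixed inputs concrete, here is a minimal sketch, assuming a pandas version in which this PR has landed, so that `str.cat` accepts a list in `others` and takes a `join` keyword. The names `s`, `t`, and `arr` are made up for illustration. A Series in the list is matched on its index labels, while a plain ndarray carries no index and is matched by position, so its length has to equal that of the calling Series.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "c"])
# Same labels as s, but in reverse order: aligned on the index.
t = pd.Series(["x", "y", "z"], index=[2, 1, 0])
# No index at all: matched by position against s.
arr = np.array(["1", "2", "3"])

# The same separator is applied between every pair of columns.
result = s.str.cat([t, arr], sep="-")
print(result)
```

The single `sep` (and `na_rep`, if given) applies to every element of the list, which matches the guess made in the question above.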

jorisvandenbossche · 2018-03-15T21:55:49Z

pandas/core/strings.py

+                          "'align=True'", FutureWarning, stacklevel=4)
+        if align:
+            arr, others = arr.align(others, join='outer')
+        arrays = [list(arr), list(others)]


I don't think the conversion to list here is needed (on line 60 they get converted to an array anyhow)

This was part of the _get_array_list-function as I found it - I didn't investigate how it interplays with str_cat (which is the only place it is called from), so I left it as it was.

jreback

more comments. don't worry about the CI.

jreback · 2018-03-15T23:25:36Z

doc/source/text.rst

+Concatenating Series
+--------------------
+
+The method :meth:`Series.str.cat` can be used to concatenate the records of two :class:`Series`. Depending on the value given to the ``align`` keyword,


can you describe the simpler usecase first (IOW just concat with no other, or other is a simple list). These can just be examples, doesn't have to be so much text.

In my defense, so far there was no description about str.cat in text.rst. But I will try to write up an overview. Where should it be placed, in your opinion? I would say directly after the splitting-section (natural opposites).

jreback · 2018-03-15T23:25:44Z

doc/source/text.rst

+their indexes will be aligned before concatenation (if ``align=True``) or not (if ``align=False``). As usual, alignment will expand to the union of both
+indexes, while introducing ``NaN`` for missing values in the respective other series (which can be easily handled with the ``na_rep``-keyword).
+
+.. warning::


you don't need this you are already passing the align keyword. These docs should be written as if a user is seeing them w/o benefit of any past history. Just show what they should do.

Do you mean the warning? The text was verbatim from @TomAugspurger, but I will rework this as well.

It's not necessary here, is my point.

jreback · 2018-03-15T23:27:20Z

doc/source/text.rst

+
+    base = Series(['a', 'b', 'c', 'd', 'e'])
+    s = base.reindex([0, 1, 2, 3])
+    t = base.reindex([3, 0, 4, 1])


rather than showing a bunch of lines like this. Break this up into a conversation. E.g. show the construction of the Series (call it s), then do a cat with no other, then one with a list, finally with a Series.

Wrote a long overview - conversation-style -- in text.rst

jreback · 2018-03-15T23:28:14Z

doc/source/whatsnew/v0.23.0.txt

+``Series.str.cat`` has gained the ``align`` kwarg
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Previously, :meth:`Series.str.cat` did not -- in contrast to most of ``pandas`` -- align :class:`Series` on their index before concatenation (see :issue:`18657`).


make this simpler. This should be just the first section and a previous and new section, see the other whatsnew entries for hints on structure. Add a reference to the docs in text.rst.

I'm not sure what "This should be just the first section and a previous and new section" means exactly, but I tried to copy the style of other whatsnew entries. Reference added.

jreback · 2018-03-15T23:28:26Z

doc/source/text.rst

@@ -429,6 +429,27 @@ String ``Index`` also supports ``get_dummies`` which returns a ``MultiIndex``.

 See also :func:`~pandas.get_dummies`.



need a reference tag here

I don't understand what should be added. Reference to what? And where? - In the "concatenation" section I'm writing?

jreback · 2018-03-15T23:28:47Z

pandas/core/strings.py

@@ -35,19 +35,32 @@
 _shared_docs = dict()


-def _get_array_list(arr, others):
+def _get_array_list(arr, others, align=True):
    from pandas.core.series import Series


can you add a mini-doc string here (its an internal function so doesn't have to be full fledged, but Parameters / Returns)

In the current version, I'm not touching anything about _get_array_list anymore. I agree that docstrings would be good, might add a draft.

jreback · 2018-03-15T23:30:07Z

pandas/core/strings.py

-    def cat(self, others=None, sep=None, na_rep=None):
+    def cat(self, others=None, sep=None, na_rep=None, align=None):
+        from pandas.core.series import Series
+        # FutureWarning for align=None emitted in one place only: str_cat


you don't need any of these comments

jreback · 2018-03-15T23:31:32Z

pandas/core/strings.py

+        if align is not None and align and isinstance(others, Series):
+            # str_cat deals with arrays only;
+            # make sure index is correct here as well for using _wrap_result
+            self._orig, others = self._orig.align(others, join='outer')


why is this needed here? is this not handled in str_cat? (which dispatched to _get_array_list)?

jreback · 2018-03-15T23:32:22Z

pandas/tests/test_strings.py

+        s = base.reindex([1, 3, 0, 2])
+        t = base.reindex([3, 0, 4, 1])
+        expect_rs_aligned = Series(['aa', 'bb', 'cc', 'dd'])
+        expect_rs_unaligned = Series(['ab', 'bd', 'ca', 'dc'])


put a blank line between cases. Add a comment to cases if needed. If you are writing things more than once, pls parameterize.

Overlooked this, will change in next commit. You mean parametrisation of things like the following example?

def rt(**kwargs): r.str.cat(t, **kwargs)

I updated the tests, so that they are separated by lines, commented where necessary, and much easier to read.

h-vetinari · 2018-03-16T00:56:12Z

Is there some pandas default for how often a FutureWarning shows up? I'm not getting the align-warning a second time, even though I verified the code runs through the corresponding if.

h-vetinari · 2018-03-16T01:23:24Z

@jorisvandenbossche
I added the functionality you wanted, but it's potentially a bit unintuitive if there's a mix of ndarray and Series (because treating the elements sequentially would mean that the length of an array would have to match the length of the union of the indexes of all previous Series-like elements). In this case, I demand that the ndarray elements match in length to the Series that's calling str.cat.

It was a substantial rewrite, but the code got better through it (found some bugs, and now it's cleaner).
I added a lot of tests about the expected behavior - please have a look at those tests if something is unclear. Maybe some of these examples should make it into whatsnew or text.rst, but I'm too tired now. Feedback welcome.

h-vetinari · 2018-03-16T01:31:48Z

@jreback
Just saw your comments now (thanks!), will work on them tomorrow (if I can). Writing good docs is surely an important part.
The code changed a fair bit with the last commit -- initially, I left the structure as it was (i.e. str_cat and _get_array_list), but that was very much not optimal (and the reason for some of my comments).

Is str_cat part of the public API or just used to construct str.cat? Because I shifted all the relevant logic to str.cat now, and str_cat just provides the doc-string (i.e. align does not do anything for str_cat).

h-vetinari · 2018-03-16T08:02:20Z

@jorisvandenbossche @TomAugspurger @jreback
regarding legal inputs for others - if list of series are allowed, why not DataFrames too? It would be a very simple addition.

I added functionality and tests for it. If you like how everything so far works, I'll update the docs (just wanna leave the doc-writing for when the code has converged).

jorisvandenbossche · 2018-03-16T08:38:52Z

Is there some pandas default for how often a FutureWarning shows up? I'm not getting the align-warning a second time, even though I verified the code runs through the corresponding if.

That's normal. By default, a FutureWarning is only shown the first time it is triggered from a given location (you can use warnings.filterwarnings to always show it).
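The once-per-location behavior can be reproduced with the standard library alone. This is a minimal sketch, not pandas code: `cat_without_align` is a hypothetical stand-in for any call that emits the deprecation warning, and `count_warnings` is a helper invented here to compare the `"default"` and `"always"` filter actions.

```python
import warnings


def cat_without_align():
    # Hypothetical stand-in for a call that emits the deprecation warning.
    warnings.warn("alignment will change in a future version", FutureWarning)


def count_warnings(action, calls=3):
    """Call the function `calls` times and count the warnings recorded."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter(action)
        for _ in range(calls):
            cat_without_align()
    return len(caught)


# "default": only the first occurrence per location is reported.
default_count = count_warnings("default")
# "always": every occurrence is reported.
always_count = count_warnings("always")
print(default_count, always_count)
```

In a test suite, setting the filter to `"always"` (or using a context manager that does so) is what makes a repeated warning observable.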

jorisvandenbossche · 2018-03-16T09:04:19Z

pandas/core/strings.py

-        return self._wrap_result(result, use_codes=(not self._is_categorical))
+        if align and isinstance(others, Series):
+            # str_cat deals with arrays only
+            data, others = data.align(others, join='outer')


I think we should actually align using a left join. The result of s.str.cat(others) should always preserve the shape and index of s IMO.

Current behaviour:

In [19]: s = pd.Series(['a', 'b'])

In [20]: s.str.cat(pd.Series(['a', 'b'], index=[1, 2]), align=True)
Out[20]:
0    NaN
1     ba
2    NaN
dtype: object

That's how I started out (see OP), but it would be inconsistent with how index-alignment is handled elsewhere -- and being consistent in that trumps shape preservation, IMHO. str.cat is already special anyway, in that it is the only str-method that allows other Series as input.

And with na_rep, the behavior gets very intuitive again, IMO.

In [0]: s = pd.Series(['a', 'b'])

In [1]: t = pd.Series(['a', 'b'], index=[1, 2])

In [2]: s.astype(bool) & t.astype(bool)
Out[2]:
0    False
1     True
2    False
dtype: bool

In [3]: s.str.cat(t, align=True)
Out[3]:
0    NaN
1     ba
2    NaN
dtype: object

In [4]: s.str.cat(t, align=True, na_rep='')
Out[4]:
0     a
1    ba
2     b
dtype: object

In [5]: s.str.cat(t, align=True, na_rep='x')
Out[5]:
0    ax
1    ba
2    xb
dtype: object

I think it depends a lot on the application which behavior is desired. How about exposing a join="inner"|"left"|"right"|"outer" keyword, with default "left" (or "outer"...)?

A keyword with join='left' as the default makes the most sense to me.

Added join-keyword with default 'left'.
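For readers following the left-versus-outer discussion, a small sketch of the two options, assuming a pandas version in which `Series.str.cat` exposes the `join` keyword settled on here. The variable names are illustrative only.

```python
import pandas as pd

s = pd.Series(["a", "b"])
t = pd.Series(["a", "b"], index=[1, 2])

# join="left" keeps the index (and shape) of the calling Series.
left = s.str.cat(t, join="left", na_rep="-")

# join="outer" expands the result to the union of both indexes.
outer = s.str.cat(t, join="outer", na_rep="-")

print(left)
print(outer)
```

With `join="left"` the label `2` from `t` is dropped, while `join="outer"` keeps it and fills the missing left-hand value with `na_rep`.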

jreback · 2018-03-16T21:52:23Z

doc/source/text.rst

+their indexes will be aligned before concatenation (if ``align=True``) or not (if ``align=False``). As usual, alignment will expand to the union of both
+indexes, while introducing ``NaN`` for missing values in the respective other series (which can be easily handled with the ``na_rep``-keyword).
+
+.. warning::


It's not necessary here, is my point.

jreback · 2018-03-16T21:53:32Z

pandas/core/strings.py

+            # first achieve maximum extent of data
+            for x in others:
+                data, _ = data.align(x, join='outer')
+            # then bring elements of others to same size


woa, why are you adding all of this code????

I removed the code additions from str_cat and _get_array_list because that was messy and for several other reasons not a good way to do it -- all the changes are now in str.cat itself.

To be able to align in all the different cases (list of Series, list of np.ndarrays, mixture of both, plus all the other variants), it's necessary to add (some variant of) the code I added. I tried to keep it clean, non-redundant, documented, and nicely recursive. I don't believe the desired functionality can be added with substantially less code (have a look at the test cases). Currently working on the requested doc changes.

h-vetinari · 2018-03-17T05:00:19Z

@jreback @TomAugspurger @jorisvandenbossche
I added documentation (also for things I haven't touched, like _get_array_list), and wrote an extensive section in text.rst outlining the str.cat-method in general, and the expanded usability in particular.

If you would be so kind, please compile and read text.rst first, before you have a look at the code or tests; it should make for easier reading. I could document the code some more, but it should be quite understandable. Despite the length, I think it is not wasteful in terms of lines. It's just not that easy to cover all the desired cases plus all the alignment options.

Feedback welcome.

h-vetinari · 2018-03-17T18:18:36Z

The failure in the Travis-CI is an artefact -- the linter (only in the 2.7 run) complained about pandas/_libs/tslibs/period.pyx:1358:12: W291 trailing whitespace, but neither did I touch this file, nor is there any trailing whitespace around the line where the failure supposedly happened.

h-vetinari · 2018-03-18T10:31:08Z

@jorisvandenbossche

I suppose we should just process each element in the list separately, so then it does not really matter if it is a mixture.

A comment why this is not so easy (and why I align all elements in a list before starting concatenation):
If one starts concatenating right away, then one might get a problem with index changes induced by later list elements.

>>> s = pd.Series(['a', 'b', 'c', 'd'])
>>> t = pd.Series(['d', 'a', 'e', 'b'], index=[3, 0, 4, 1])
>>>
>>> # the following would be equivalent to
>>> # s.str.cat([t.values, t], align=True, join='outer', na_rep='-')
>>> # if list elements in `other` would be concatenated sequentially
>>> s.str.cat(t.values).str.cat(t, align=True, join='outer', na_rep='-')
0    ada
1    bab
2    ce-
3    dbd
4     -e  # <-- missing a NaN for first column!
dtype: object
>>> s.str.cat([t.values, t], align=True, join='outer', na_rep='-')
0    ada
1    bab
2    ce-
3    dbd
4    --e  # <-- this is more sensible behavior, IMO
dtype: object

Even worse, this would mean that any other arrays following in the list would be of the wrong length and trigger a warning...

h-vetinari · 2018-05-02T05:34:53Z

Circle-CI failed due to #20906. Rebased onto upstream and restarted.

h-vetinari · 2018-05-02T06:05:31Z

Annoyingly, there are still unrelated failures in "ci/script_single.sh" of the travis "3.6, NumPy dev" job, but at least that job doesn't fail the build.

=================================== FAILURES ===================================
___________________ TestClipboard.test_round_trip_frame_sep ____________________
self = <pandas.tests.io.test_clipboard.TestClipboard object at 0x7fa9961822e8>
    def test_round_trip_frame_sep(self):
        for dt in self.data_types:
>           self.check_round_trip_frame(dt, sep=',')
pandas/tests/io/test_clipboard.py:81: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/tests/io/test_clipboard.py:77: in check_round_trip_frame
    tm.assert_frame_equal(data, result, check_dtype=False)
pandas/util/testing.py:1304: in assert_frame_equal
    '{shape!r}'.format(shape=right.shape))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
obj = 'DataFrame', message = 'DataFrame shape mismatch', left = '(61, 3)'
right = '(5, 0)', diff = None
[...]
pandas/util/testing.py:1018: AssertionError
_____________________ TestClipboard.test_round_trip_frame ______________________
self = <pandas.tests.io.test_clipboard.TestClipboard object at 0x7fa996182ac8>
    def test_round_trip_frame(self):
        for dt in self.data_types:
>           self.check_round_trip_frame(dt)
pandas/tests/io/test_clipboard.py:91: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/tests/io/test_clipboard.py:77: in check_round_trip_frame
    tm.assert_frame_equal(data, result, check_dtype=False)
pandas/util/testing.py:1304: in assert_frame_equal
    '{shape!r}'.format(shape=right.shape))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
obj = 'DataFrame', message = 'DataFrame shape mismatch', left = '(61, 3)'
right = '(5, 3)', diff = None
[...]
pandas/util/testing.py:1018: AssertionError
--------------------- generated xml file: /tmp/single.xml ----------------------
===== 2 failed, 64 passed, 316 skipped, 25292 deselected in 36.41 seconds ======

h-vetinari · 2018-05-02T06:19:37Z

@TomAugspurger Green! =)

jreback

the checking code needs some work

jreback · 2018-05-02T10:55:12Z

pandas/core/strings.py

+                others = others.copy()
+                others.index = idx
+            return ([others[x] for x in others], fu_wrn)
+        elif isinstance(others, np.ndarray) and others.ndim == 2:


this is wrong

I don't think we can align an ndarray at all like this.
Let's can ndarrays that are > 1 dim.

The DF-constructor works as expected for a 2-dim ndarray, but I haven't checked if this is tested behaviour. (essentially, df == DataFrame(df.values, columns=df.columns, index=df.index))

I would suggest not to can 2-dim ndarrays, because they are necessary to avoid alignment on the deprecation path for join:

[...] To disable alignment (the behavior before v.0.23) and silence this warning, use .values on any Series/Index/DataFrame in others. [...]
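The escape hatch described in that warning text can be sketched as follows, assuming a pandas version where passing a Series aligns on the index. Stripping the index with `.values` leaves nothing to align on, so elements are paired by position. The names below are illustrative.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "d"])
t = pd.Series(["d", "a", "e", "b"], index=[3, 0, 4, 1])

# Passing the Series: elements are matched on index labels.
aligned = s.str.cat(t, na_rep="-")

# Passing the raw values: no index, so elements are matched by position.
positional = s.str.cat(t.values)

print(aligned)
print(positional)
```

The positional form requires `len(t.values) == len(s)`, which is the same-length requirement the method had before alignment was introduced.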

jreback · 2018-05-02T10:58:10Z

pandas/core/strings.py

+                return (los, fu_wrn)
+            # test if there is a mix of list-like and non-list-like (e.g. str)
+            elif (any(is_list_like(x) for x in others)
+                  and any(not is_list_like(x) for x in others)):


you can make this simpler by just checking for all is not list like (eg strings)

anything else will fail thru to the TypeError

jreback · 2018-05-02T10:59:52Z

pandas/core/strings.py

+            others = list(others)  # ensure iterators do not get read twice etc
+            if all(is_list_like(x) for x in others):
+                los = []
+                fu_wrn = False


can u name this parameter just warn

jreback · 2018-05-02T11:01:38Z

pandas/core/strings.py

+                fu_wrn = False
+                while others:
+                    nxt = others.pop(0)  # list-like as per check above
+                    # safety for iterators and other non-persistent list-likes


this whole section needs some work it’s way too hard to read and follow

jreback · 2018-05-02T11:03:23Z

pandas/core/strings.py

+                    is_legal = ((no_deep and nxt.dtype == object)
+                                or all((isinstance(x, compat.string_types)
+                                        or (not is_list_like(x) and isnull(x))
+                                        or x is None)


isnull already checks for None
only 1d objects are valid here (or all scalars)

do this check up front
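A sketch of what such an element check could look like, to illustrate why the separate `x is None` test is redundant. The helper `is_valid_element` is hypothetical and is not the code that was merged; it only uses `pd.isna` and `pandas.api.types.is_list_like`, both public pandas functions.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_list_like


def is_valid_element(x):
    """Hypothetical helper: True for a string or a scalar missing value."""
    if isinstance(x, str):
        return True
    # Guard with is_list_like first: pd.isna on a list returns an array,
    # which has no unambiguous truth value.
    return (not is_list_like(x)) and bool(pd.isna(x))


checks = {
    "string": is_valid_element("a"),
    "None": is_valid_element(None),       # pd.isna(None) is already True
    "NaN": is_valid_element(np.nan),
    "nested list": is_valid_element(["a"]),
    "integer": is_valid_element(1),
}
print(checks)
```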

TomAugspurger · 2018-05-02T11:08:02Z

@h-vetinari, I'll merge this shortly. I'm opening a followup issue.


TomAugspurger · 2018-05-02T11:13:47Z

#20922 for the followup.

Thanks!

h-vetinari · 2018-05-02T11:22:39Z

@TomAugspurger @jreback
Will do the follow-up tonight. Thanks for the patience/reviews and helping get this across the line - guess it was a bit ambitious as a first PR... :)

h-vetinari force-pushed the str_cat_align branch from 9ff31f8 to 8ec2f1a Compare March 15, 2018 08:38

jreback requested changes Mar 15, 2018

jreback added API Design Strings String extension data type and string data labels Mar 15, 2018

jreback changed the title ~~Fixed issue 18657~~ API: str.cat will align on Series Mar 15, 2018

TomAugspurger reviewed Mar 15, 2018

jorisvandenbossche reviewed Mar 15, 2018

jreback requested changes Mar 15, 2018

jorisvandenbossche reviewed Mar 16, 2018

jreback requested changes Mar 16, 2018

h-vetinari added 8 commits May 2, 2018 07:33

Emit FutureWarning only for different indexes

d7587ba

Restrict legal argument combinations; no nesting

7c9735f

Fix edge case for NaN/None; adapted tests; fixes

1aedf72

Revert cat-output-for-cat-caller proposal

2143f19

Removed duplicate tests

fc9aa67

Improve tests/errors for str_cat; fix is_legal

fcdb57b

Avoid deep inspection for known types in _get_series_list

5a237ea

Incorporate review feedback

3f77b80
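For readers following the commit list, here is a minimal sketch of the index alignment that the PR title describes. It passes `join` explicitly so the result does not depend on the version-specific default; the Series contents are made up for illustration.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "d"])
# Same values as s, but in a different order and carrying their own index.
u = pd.Series(["b", "d", "a", "c"], index=[1, 3, 0, 2])

# With an explicit join, `u` is aligned on its index before concatenating.
aligned = s.str.cat(u, join="left")

# Passing the raw values drops the index, so concatenation is positional.
positional = s.str.cat(u.values)

# An outer join keeps every label found in either Series; na_rep fills
# the positions where one side has no value.
v = pd.Series(["z", "a", "b", "d", "e"], index=[-1, 0, 1, 3, 4])
outer = s.str.cat(v, join="outer", na_rep="-")

print(aligned.tolist())
print(positional.tolist())
print(outer.to_dict())
```

The contrast between `aligned` and `positional` is the case the FutureWarning commit targets: the two only differ when the indexes differ.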

h-vetinari force-pushed the str_cat_align branch from 1c06234 to 3f77b80 May 2, 2018 05:33

jreback requested changes May 2, 2018

TomAugspurger mentioned this pull request May 2, 2018

Followup to #20347 (str_cat) #20922

Closed

5 tasks

TomAugspurger merged commit f851699 into pandas-dev:master May 2, 2018

h-vetinari mentioned this pull request May 2, 2018

Follow-up #20347: incorporate review about _get_series_list #20923

Merged

jreback pushed a commit that referenced this pull request May 4, 2018

Follow-up #20347: incorporate review about _get_series_list (#20923)

ef019fa

h-vetinari mentioned this pull request Jun 5, 2018

DOC: fix mistake in Series.str.cat #21330

Merged

h-vetinari mentioned this pull request Jun 19, 2018

ENH: set accessor for Series (WIP) #21547

Closed

h-vetinari mentioned this pull request Jul 17, 2018

DEPR: list of lists in Series.str.cat #21950

Closed

h-vetinari mentioned this pull request Sep 2, 2018

CLN: tests for str.cat #22575

Merged

h-vetinari mentioned this pull request Sep 16, 2018

CLN/ERR: str.cat internals #22725

Merged

3 tasks

This was referenced Oct 5, 2018

sets in str.cat?! #23009

Closed

API: set should not be considered list_like #23061

Closed

h-vetinari mentioned this pull request Oct 16, 2018

Add allow_sets-kwarg to is_list_like #23065

Merged

4 tasks

h-vetinari mentioned this pull request Nov 15, 2018

BUG: concat warning bubbling up through str.cat #23725

Merged

h-vetinari mentioned this pull request Jul 26, 2019

DEPR: execute deprecations for str.cat in v1.0 #27611

Merged

3 tasks

		@@ -429,6 +429,27 @@ String ``Index`` also supports ``get_dummies`` which returns a ``MultiIndex``.

		See also :func:`~pandas.get_dummies`.
