EHN: multi-column explode #40770

iynehz · 2021-04-03T14:21:41Z

closes ENH: Multi-Column explode #39240
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

pep8speaks · 2021-04-03T14:21:47Z

Hello @stphnlyd! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-06-20 15:27:09 UTC

phofl · 2021-04-06T19:22:02Z

pandas/core/frame.py

+        counts0 = self[columns[0]].apply(mylen)
+        for c in columns[1:]:
+            if not all(counts0 == self[c].apply(mylen)):
+                raise ValueError("columns must have matching element counts")


Document too

phofl · 2021-04-06T19:22:30Z

pandas/core/frame.py

+        else:
+            raise ValueError("column must be a scalar, tuple, or list thereof")
+
+        mylen = lambda x: len(x) if is_list_like(x) else -1


This is probably a performance killer, so please do this only if len(columns) > 1. Maybe there is a better way to do this? Also add asv for this case
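A minimal sketch of the guard requested above, assuming a hypothetical standalone helper named `check_matching_counts` (the PR does this inline in `DataFrame.explode`, so the helper and its signature are illustrative only). The per-row `apply` is reached only when more than one column is requested:

```python
import pandas as pd
from pandas.api.types import is_list_like


def check_matching_counts(df: pd.DataFrame, columns: list) -> None:
    """Hypothetical helper: validate element counts across the exploded columns."""
    # Single-column explode needs no cross-column comparison, so skip the
    # row-wise apply entirely in that case.
    if len(columns) <= 1:
        return

    def element_count(x):
        # Non-list-like values (scalars, NaN) get a sentinel count of -1,
        # mirroring the lambda quoted in the diff above.
        return len(x) if is_list_like(x) else -1

    counts0 = df[columns[0]].apply(element_count)
    for c in columns[1:]:
        if not (counts0 == df[c].apply(element_count)).all():
            raise ValueError("columns must have matching element counts")
```

Called with a single column, the helper returns immediately, which is the case the review comment is worried about.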

phofl · 2021-04-06T19:23:07Z

pandas/core/frame.py


        df = self.reset_index(drop=True)
-        result = df[column].explode()
-        result = df.drop([column], axis=1).join(result)
+        result = DataFrame({c: df[c].explode() for c in columns})


This adds overhead for len = 1 case, let's look at asvs here
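One way the single-column versus multi-column cost could be compared is an asv-style benchmark class. The sketch below is an assumption about what such a benchmark might look like, not the benchmark added in this PR; the class name, parameter values, and frame layout are made up. It relies only on the asv convention of a `setup` method plus methods prefixed with `time_`, and it needs a pandas version that includes multi-column explode:

```python
import pandas as pd


class ExplodeColumns:
    # asv runs each time_* method once per parameter value.
    params = [100, 1000]
    param_names = ["n_rows"]

    def setup(self, n_rows):
        # Two list columns with matching element counts, plus a scalar column.
        lists = [list(range(3)) for _ in range(n_rows)]
        self.df = pd.DataFrame({"A": lists, "B": 1, "C": lists})

    def time_single_column(self, n_rows):
        # Baseline: the pre-existing single-column path.
        self.df.explode("A")

    def time_multi_column(self, n_rows):
        # New path: explode two columns at once.
        self.df.explode(["A", "C"])
```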

phofl · 2021-04-06T19:23:28Z

pandas/tests/frame/methods/test_explode.py

+    df1 = df.assign(C=[["a", "b", "c"], "foo", [], ["d", "e", "f"]])
+    df1.columns = list("ABC")
+    with pytest.raises(ValueError, match="columns must have matching element counts"):
+        df1.explode(list("AC"))


Add test where one side is scalar and other side is list
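A sketch of what such a test might look like, assuming the error message quoted in the diff above; the test name and the data are made up for illustration. Row 0 pairs a three-element list in `A` with a scalar string in `C`:

```python
import pandas as pd
import pytest


def test_multi_columns_scalar_vs_list():
    # Hypothetical test: "A" holds a list where "C" holds a scalar (row 0).
    df = pd.DataFrame(
        {
            "A": [[0, 1, 2], [3, 4]],
            "B": 1,
            "C": ["foo", ["d", "e"]],
        }
    )
    with pytest.raises(ValueError, match="columns must have matching element counts"):
        df.explode(["A", "C"])
```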

phofl · 2021-04-06T19:24:01Z

pandas/tests/frame/methods/test_explode.py

+
+
+def test_multi_columns():
+    df = pd.DataFrame(


add gh reference

phofl · 2021-04-06T19:25:11Z

pandas/core/frame.py

+            # mypy: Incompatible types in assignment (expression has type
+            # "List[Union[str, Tuple[Any, ...]]]", variable has type
+            # "List[Union[str, Tuple[Any, ...], List[Union[str, Tuple[Any, ...]]]]]")
+            columns = column  # type: ignore[assignment]


Could you type columns to avoid mypy error?

jreback · 2021-04-07T14:59:53Z

pandas/core/frame.py

+        3    3  1     [d, e]
+        3    4  1     [d, e]
+
+        >>> df.explode(list('AC'))


Add a "Multi-column explode" comment here (and you can comment that the example above is a single-column explode)
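For reference, a short sketch contrasting the two call styles on a small made-up frame (the data is illustrative and not the docstring example from the PR; it requires a pandas version that includes this change):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [[0, 1, 2], "foo", [3, 4]],
        "B": 1,
        "C": [["a", "b", "c"], "bar", ["d", "e"]],
    }
)

# Single column explode: only "A" is expanded; "C" keeps its list values.
single = df.explode("A")

# Multi-column explode: "A" and "C" are expanded together, row by row,
# so their element counts must match on every row.
multi = df.explode(["A", "C"])
print(multi)
```

In both results the original index labels are repeated once per exploded element.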

jreback · 2021-04-07T15:02:32Z

pandas/core/frame.py

+        columns: list[str | tuple]
+        if is_scalar(column) or isinstance(column, tuple):
+            # mypy: List item 0 has incompatible type "Union[str, Tuple[Any, ...],
+            # List[Union[str, Tuple[Any, ...]]]]"; expected


you might be able to cast to remove this warning

Here column can be str or tuple, IMHO cast will make things complex here.

I think you can use assert here, this would remove the mypy warning
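A minimal sketch of the `assert` approach, outside pandas. `_looks_scalar` is a hypothetical stand-in for an opaque runtime check such as `is_scalar`, which a type checker cannot use to narrow a union; the function and alias names are made up:

```python
from typing import Any, List, Tuple, Union

Label = Union[str, Tuple[Any, ...]]


def _looks_scalar(obj: object) -> bool:
    # Hypothetical stand-in: returns a plain bool, so mypy learns nothing
    # about the type of `obj` from it.
    return isinstance(obj, (str, int, float))


def normalize(column: Union[Label, List[Label]]) -> List[Label]:
    columns: List[Label]
    if _looks_scalar(column) or isinstance(column, tuple):
        # The assert narrows `column` to str or tuple for the type checker,
        # so no `type: ignore` is needed on the next line.
        assert isinstance(column, (str, tuple))
        columns = [column]
    else:
        assert isinstance(column, list)
        columns = column
    return columns
```

Note that an assert like this also changes runtime behavior: a scalar label that is neither a `str` nor a `tuple`, such as an `int`, passes the first check but fails the assert.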

jreback · 2021-04-07T15:04:01Z

pandas/core/frame.py

+        elif isinstance(column, list) and all(
+            map(lambda c: is_scalar(c) or isinstance(c, tuple), column)
+        ):
+            if len(column) == 0:


if not len(column)

if not column should be enough?

I see PEP8 recommends if not seq. That should have minimal number of operations on the bytecode level.
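The three spellings discussed here agree for a plain `list`, which is what `column` is known to be on this branch. A small sketch, with a made-up helper name and an illustrative error message:

```python
def require_nonempty(column: list) -> list:
    # An empty list is falsy, so `not column` covers both
    # `len(column) == 0` and `not len(column)`.
    if not column:
        raise ValueError("column must be nonempty")
    return column


for seq in ([], ["A"], ["A", "C"]):
    # All three emptiness checks give the same answer for built-in lists.
    assert (len(seq) == 0) == (not len(seq)) == (not seq)
```

This relies on the truthiness of built-in sequences; it would not carry over to objects such as NumPy arrays, whose truth value is ambiguous.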

jreback · 2021-04-07T15:08:25Z

pandas/tests/frame/methods/test_explode.py

@@ -9,13 +9,34 @@ def test_error():
    df = pd.DataFrame(
        {"A": pd.Series([[0, 1, 2], np.nan, [], (3, 4)], index=list("abcd")), "B": 1}
    )
-    with pytest.raises(ValueError, match="column must be a scalar"):
+    with pytest.raises(


This test is getting big; can you split it into multiple tests (OK to rename the original)? Parametrizing is good as well, if possible.

jreback · 2021-04-07T15:09:09Z

pandas/tests/frame/methods/test_explode.py

+        {
+            "A": pd.Series([[0, 1, 2], np.nan, [], (3, 4)], index=list("abcd")),
+            "B": 1,
+            "C": [["a", "b", "c"], "foo", [], ["d", "e"]],


what about nan elements in C as well?

github-actions · 2021-05-09T00:01:43Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

iynehz · 2021-05-11T13:48:28Z

@jreback @phofl could you please take a look at my changes and let me know if there's still something to update?

phofl · 2021-05-11T22:50:14Z

pandas/tests/frame/methods/test_explode.py

        df.explode(list("AA"))

    df.columns = list("AA")
    with pytest.raises(ValueError, match="columns must be unique"):
        df.explode("A")


+def test_error_multi_columns():


Could you parametrize here?

phofl · 2021-05-11T22:52:15Z

small comments, otherwise lgtm

iynehz · 2021-05-13T16:38:35Z

@phofl I've pushed to address your comments

jreback · 2021-05-24T15:35:26Z

@stphnlyd pls merge master

phofl · 2021-05-24T15:35:50Z

pandas/tests/frame/methods/test_explode.py

+        ),
+    ],
+)
+def test_error_multi_columns(input_dict, input_index, input_subset, error_message):


Can you parametrize only the changing parts?

This test_error_multi_columns() is a new function I added in my PR, not something that already exists there. I added one more test case to this test_error_multi_columns() in this commit.

That was not what I was referring to. Most of your parametrization values are the same for all cases, so no need to put them into the parametrization
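A sketch of the shape being asked for, assuming the two error messages quoted in the diffs above; the data and the chosen cases are illustrative and not the PR's final test. The frame is identical for every case, so it is built once in the test body and only the column subset and the expected message are parametrized:

```python
import numpy as np
import pandas as pd
import pytest


@pytest.mark.parametrize(
    "input_subset, error_message",
    [
        # Row "d" has 2 elements in "A" but 3 in "C".
        (list("AC"), "columns must have matching element counts"),
        # A nested list is neither a scalar, a tuple, nor a list of those.
        ([["A", "C"]], "column must be a scalar, tuple, or list thereof"),
    ],
)
def test_error_multi_columns(input_subset, error_message):
    # Shared input: kept out of the parametrization because it never changes.
    df = pd.DataFrame(
        {
            "A": [[0, 1, 2], np.nan, [], (3, 4)],
            "B": 1,
            "C": [["a", "b", "c"], "foo", [], ["d", "e", "f"]],
        },
        index=list("abcd"),
    )
    with pytest.raises(ValueError, match=error_message):
        df.explode(input_subset)
```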

@phofl updated. btw "Python Dev / actions-310-dev" fails coverage upload, but that's not caused by me.

phofl · 2021-05-24T15:36:05Z

pandas/tests/frame/methods/test_explode.py

+        ),
+    ],
+)
+def test_multi_columns(


Similar here

similar as explained above

iynehz · 2021-06-12T12:22:04Z

@phofl any comments?

simonjayhawkins · 2021-06-12T14:07:07Z

pandas/core/frame.py

+            be str or tuple, and all specified columns their list-like data
+            on same row of the frame must have matching length.
+
+            .. versionadded:: 1.3.0


Suggested change

.. versionadded:: 1.3.0

.. versionadded:: 1.4.0

simonjayhawkins · 2021-06-12T14:09:12Z

@stphnlyd can you add a release note. doc/source/whatsnew/v1.4.0.rst

iynehz · 2021-06-17T15:09:01Z

@simonjayhawkins thanks I've squashed the PR and now target 1.4.0, and I've added release note. The PR checks failed but that's not caused by my changes.

jreback

looks good. is the case where an element is nan in one of the multi-columns and a list in the other one tested / handled?
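A quick way to probe this case, sketched on made-up data. Under the matching-count check quoted earlier in the thread, a NaN is not list-like, so a row with NaN in one column and a multi-element list in the other should be rejected, while NaN in the same row of both columns should pass:

```python
import numpy as np
import pandas as pd

# NaN in the same row of both columns: element counts agree.
both_nan = pd.DataFrame({"A": [[0, 1], np.nan], "C": [["a", "b"], np.nan]})
exploded = both_nan.explode(["A", "C"])

# NaN in one column, a two-element list in the other: counts differ.
mixed = pd.DataFrame({"A": [[0, 1], np.nan], "C": [["a", "b"], ["c", "d"]]})
try:
    mixed.explode(["A", "C"])
    outcome = "no error"
except ValueError as err:
    outcome = str(err)
print(outcome)
```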

jreback · 2021-06-17T15:35:02Z

@phofl good here?

phofl · 2021-06-17T21:09:30Z

Yes lgtm

jreback · 2021-06-18T01:50:30Z

doc/source/whatsnew/v1.4.0.rst

@@ -29,7 +29,7 @@ enhancement2

 Other enhancements
 ^^^^^^^^^^^^^^^^^^
-
+- :meth:`DataFrame.explode` now supports exploding multiple columns. Its ``column`` argument now also accepts a list of str or tuples for exploding on multiple columns at the same time (:issue:`39240`)


ok putting this in 1.3, if you can move and ping on green.

iynehz · 2021-06-20T17:24:29Z

@jreback targeting 1.3 now

jreback · 2021-06-21T13:06:30Z

thanks @stphnlyd very nice

jreback · 2021-06-21T13:06:37Z

@meeseeksdev backport 1.3.x

lumberbot-app · 2021-06-21T13:06:44Z

Something went wrong ... Please have a look at my logs.

Co-authored-by: stphnlyd <stephanloyd9@gmail.com>

iynehz changed the title ~~:multi-column explode~~ EHN: multi-column explode Apr 3, 2021

iynehz force-pushed the explode branch 2 times, most recently from 8537396 to 34a099d Compare April 4, 2021 14:22

phofl requested changes Apr 6, 2021

iynehz force-pushed the explode branch 3 times, most recently from 8bd59de to 527a587 Compare April 7, 2021 14:38

jreback added Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Apr 7, 2021

jreback requested changes Apr 7, 2021

github-actions bot added the Stale label May 9, 2021

lithomas1 removed the Stale label May 11, 2021

phofl reviewed May 11, 2021

iynehz force-pushed the explode branch from 2eb2811 to 29a2c43 Compare May 12, 2021 17:20

iynehz requested review from phofl and jreback May 24, 2021 15:26

phofl reviewed May 24, 2021

iynehz force-pushed the explode branch from 29a2c43 to 8d13866 Compare June 6, 2021 17:16

iynehz requested a review from phofl June 7, 2021 13:40

simonjayhawkins reviewed Jun 12, 2021

iynehz force-pushed the explode branch from 8d13866 to 2957ee7 Compare June 16, 2021 16:17

iynehz force-pushed the explode branch 2 times, most recently from d53d280 to f55ad59 Compare June 17, 2021 13:17

jreback requested changes Jun 17, 2021

jreback approved these changes Jun 18, 2021

jreback added this to the 1.3 milestone Jun 18, 2021

EHN: multi-column explode (pandas-dev#39240)

5f2bff9

iynehz force-pushed the explode branch from f55ad59 to 5f2bff9 Compare June 20, 2021 15:27

iynehz changed the base branch from master to 1.3.x June 20, 2021 16:31

iynehz changed the base branch from 1.3.x to master June 20, 2021 16:38

jreback merged commit 41a94b0 into pandas-dev:master Jun 21, 2021

meeseeksmachine mentioned this pull request Jun 21, 2021

Backport PR #40770 on branch 1.3.x (EHN: multi-column explode) #42155

Merged

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Jun 21, 2021

Backport PR pandas-dev#40770: EHN: multi-column explode

0a8d7d2

jreback pushed a commit that referenced this pull request Jun 21, 2021

Backport PR #40770: EHN: multi-column explode (#42155)

0bfbbdf

Co-authored-by: stphnlyd <stephanloyd9@gmail.com>

neinkeinkaffee pushed a commit to neinkeinkaffee/pandas that referenced this pull request Jun 21, 2021

EHN: multi-column explode (pandas-dev#39240) (pandas-dev#40770)

eb28d78

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021

EHN: multi-column explode (pandas-dev#39240) (pandas-dev#40770)

5ad153e

simonjayhawkins mentioned this pull request Aug 31, 2021

BUG: DataFrame.explode is failing on scalar int value. #43314

Closed
