EHN: multi-column explode #40770

Merged
merged 1 commit into from
Jun 21, 2021

Conversation

Contributor

@iynehz iynehz commented Apr 3, 2021


pep8speaks commented Apr 3, 2021

Hello @stphnlyd! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-06-20 15:27:09 UTC

@iynehz iynehz changed the title from ":multi-column explode" to "EHN: multi-column explode" Apr 3, 2021
@iynehz iynehz force-pushed the explode branch 2 times, most recently from 8537396 to 34a099d Compare April 4, 2021 14:22
counts0 = self[columns[0]].apply(mylen)
for c in columns[1:]:
    if not all(counts0 == self[c].apply(mylen)):
        raise ValueError("columns must have matching element counts")
Member

Document too

else:
    raise ValueError("column must be a scalar, tuple, or list thereof")

mylen = lambda x: len(x) if is_list_like(x) else -1
Member

This is probably a performance killer, so please do this only if len(columns) > 1. Maybe there is a better way to do this? Also add asv for this case
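
One way to read this suggestion is to skip the per-row length check entirely when only one column is exploded. The following is a minimal pure-Python sketch of that idea, not the code that was merged: `element_count` and `check_matching_counts` are hypothetical helpers, rows are modelled as plain dicts rather than a DataFrame, and the `isinstance` test is a simplified stand-in for pandas' `is_list_like`.

```python
def element_count(value):
    """Return len(value) for list/tuple cells, -1 for anything else.

    Simplified stand-in for the `mylen` lambda in the diff; pandas'
    is_list_like accepts more container types than this check does.
    """
    return len(value) if isinstance(value, (list, tuple)) else -1


def check_matching_counts(rows, columns):
    """Raise ValueError if, on any row, the exploded columns disagree in length."""
    if len(columns) < 2:
        # Single-column explode: nothing to compare, so skip the scan.
        return
    first, *rest = columns
    for row in rows:
        expected = element_count(row[first])
        if any(element_count(row[c]) != expected for c in rest):
            raise ValueError("columns must have matching element counts")


rows = [
    {"A": [0, 1, 2], "C": ["a", "b", "c"]},
    {"A": "foo", "C": "bar"},
    {"A": (3, 4), "C": ["d", "e", "f"]},  # 2 elements vs 3
]
check_matching_counts(rows[:2], ["A", "C"])  # counts agree on both rows
check_matching_counts(rows, ["A"])           # single column: check skipped
```

Calling `check_matching_counts(rows, ["A", "C"])` on all three rows raises, because the last row pairs a 2-element tuple with a 3-element list.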


  df = self.reset_index(drop=True)
- result = df[column].explode()
- result = df.drop([column], axis=1).join(result)
+ result = DataFrame({c: df[c].explode() for c in columns})
Member

This adds overhead for len = 1 case, let's look at asvs here
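
For readers following the hunk above, the behaviour being implemented is that the listed columns expand in lockstep, one output row per element, with the original row position repeated. A rough pure-Python sketch of those semantics follows. `explode_rows` is a hypothetical helper working on plain dicts, it assumes element counts already match, and it uses `None` where pandas would use NaN.

```python
def explode_rows(rows, columns):
    """Explode list-like cells of `columns` in lockstep.

    Returns (original_row_position, row_dict) pairs. Assumes the element
    counts already match across `columns` on every row.
    """
    out = []
    for pos, row in enumerate(rows):
        first = row[columns[0]]
        if not isinstance(first, (list, tuple)):
            # Scalar cells pass through as a single row.
            out.append((pos, dict(row)))
            continue
        if len(first) == 0:
            # Empty list-likes yield one row with a missing value
            # (None here; pandas uses NaN).
            out.append((pos, {**row, **{c: None for c in columns}}))
            continue
        for i in range(len(first)):
            # Take the i-th element from every exploded column.
            out.append((pos, {**row, **{c: row[c][i] for c in columns}}))
    return out


rows = [
    {"A": [0, 1], "B": 1, "C": ["a", "b"]},
    {"A": [], "B": 1, "C": []},
    {"A": "foo", "B": 1, "C": "bar"},
]
exploded = explode_rows(rows, ["A", "C"])
```

The non-exploded column `B` is simply copied onto every output row, which corresponds to the join step in the diff.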

df1 = df.assign(C=[["a", "b", "c"], "foo", [], ["d", "e", "f"]])
df1.columns = list("ABC")
with pytest.raises(ValueError, match="columns must have matching element counts"):
    df1.explode(list("AC"))
Member

Add test where one side is scalar and other side is list



def test_multi_columns():
    df = pd.DataFrame(
Member

add gh reference

# mypy: Incompatible types in assignment (expression has type
# "List[Union[str, Tuple[Any, ...]]]", variable has type
# "List[Union[str, Tuple[Any, ...], List[Union[str, Tuple[Any, ...]]]]]")
columns = column # type: ignore[assignment]
Member

Could you type columns to avoid mypy error?

@iynehz iynehz force-pushed the explode branch 3 times, most recently from 8bd59de to 527a587 Compare April 7, 2021 14:38
@jreback jreback added Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Apr 7, 2021
3 3 1 [d, e]
3 4 1 [d, e]

>>> df.explode(list('AC'))
Contributor

add Multi-column explode (and can comment that the above is a single column)
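
To illustrate the distinction the reviewer asks the docstring to spell out, here is a small sketch contrasting the two call forms. It assumes pandas 1.3 or later (where `column` may be a list); the frame is illustrative and is not the one used in the docstring.

```python
import pandas as pd

# Requires pandas >= 1.3, where `column` may be a list of labels.
df = pd.DataFrame(
    {
        "A": [[0, 1, 2], [3, 4]],
        "B": 1,
        "C": [["a", "b", "c"], ["d", "e"]],
    }
)

# Single-column explode: only A is expanded, C keeps its lists.
single = df.explode("A")

# Multi-column explode: A and C are expanded together, row by row.
multi = df.explode(list("AC"))
```

In both results the original index label is repeated once per element, and the scalar column `B` is carried along unchanged.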

columns: list[str | tuple]
if is_scalar(column) or isinstance(column, tuple):
    # mypy: List item 0 has incompatible type "Union[str, Tuple[Any, ...],
    # List[Union[str, Tuple[Any, ...]]]]"; expected
Contributor

you might be able to cast to remove this warning

Contributor Author

Here column can be str or tuple, IMHO cast will make things complex here.

Member

I think you can use assert here, this would remove the mypy warning

elif isinstance(column, list) and all(
    map(lambda c: is_scalar(c) or isinstance(c, tuple), column)
):
    if len(column) == 0:
Contributor

if not len(column)

Member

if not column should be enough?

Contributor Author

I see PEP8 recommends if not seq. That should have minimal number of operations on the bytecode level.
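
The branches discussed in this thread can be sketched in plain Python as below. `normalize_columns` and `is_label` are hypothetical names, `is_label` is a simplified stand-in for `is_scalar(c) or isinstance(c, tuple)`, and the wording of the empty-list error is an assumption made for illustration. The point of interest is the `if not column` test, which relies on an empty list being falsy.

```python
def normalize_columns(column):
    """Return `column` as a list of labels, mirroring the branches in the diff."""

    def is_label(c):
        # Simplified stand-in for `is_scalar(c) or isinstance(c, tuple)`.
        return isinstance(c, (str, int, float, tuple))

    if is_label(column):
        # A single label (scalar or tuple) becomes a one-element list.
        return [column]
    if isinstance(column, list) and all(is_label(c) for c in column):
        if not column:  # an empty list is falsy, so no len() call is needed
            raise ValueError("column must be nonempty")  # illustrative wording
        return column
    raise ValueError("column must be a scalar, tuple, or list thereof")


single = normalize_columns("A")
multi = normalize_columns(["A", ("x", "y")])
```

A nested list such as `[["A"]]` falls through to the final `raise`, since its element is neither a scalar nor a tuple.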

@@ -9,13 +9,34 @@ def test_error():
     df = pd.DataFrame(
         {"A": pd.Series([[0, 1, 2], np.nan, [], (3, 4)], index=list("abcd")), "B": 1}
     )
-    with pytest.raises(ValueError, match="column must be a scalar"):
+    with pytest.raises(
Contributor

this test is getting big, can you split it into multiple ones (ok to rename the original)? parametrizing is good as well if possible

{
    "A": pd.Series([[0, 1, 2], np.nan, [], (3, 4)], index=list("abcd")),
    "B": 1,
    "C": [["a", "b", "c"], "foo", [], ["d", "e"]],
Contributor

what about nan elements in C as well?

Contributor

github-actions bot commented May 9, 2021

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label May 9, 2021
Contributor Author

iynehz commented May 11, 2021

@jreback @phofl could you please take a look at my changes and let me know if there's still something to update?

@lithomas1 lithomas1 removed the Stale label May 11, 2021
        df.explode(list("AA"))

    df.columns = list("AA")
    with pytest.raises(ValueError, match="columns must be unique"):
        df.explode("A")


def test_error_multi_columns():
Member

Could you parametrize here?

Member

phofl commented May 11, 2021

small comments, otherwise lgtm

Contributor Author

iynehz commented May 13, 2021

@phofl I've pushed to address your comments

@iynehz iynehz requested review from phofl and jreback May 24, 2021 15:26
Contributor

jreback commented May 24, 2021

@stphnlyd pls merge master

),
],
)
def test_error_multi_columns(input_dict, input_index, input_subset, error_message):
Member

Can you parametrize only the changing parts?

Contributor Author

This test_error_multi_columns() is a new function I added in my PR, not something that already exists there. I added one more test case to this test_error_multi_columns() in this commit.

Member

That was not what I was referring to. Most of your parametrization values are the same for all cases, so no need to put them into the parametrization

Contributor Author

@phofl updated. btw "Python Dev / actions-310-dev" fails coverage upload, but that's not caused by me..
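
The request in this thread, parametrizing only the values that differ between cases, might look roughly like the sketch below. It assumes pytest and pandas 1.3 or later are installed. The parameter names `input_subset` and `error_message` come from the signature in the diff, while the frame and the two cases are illustrative and are not the ones in the merged test.

```python
import pandas as pd
import pytest


@pytest.mark.parametrize(
    "input_subset, error_message",
    [
        # Row 1 pairs a 2-element list with a 3-element list.
        (list("AC"), "columns must have matching element counts"),
        # A nested list is neither a scalar, a tuple, nor a list of those.
        ([["A"]], "column must be a scalar, tuple, or list thereof"),
    ],
)
def test_error_multi_columns(input_subset, error_message):
    # The frame is identical for every case, so it is built in the test
    # body rather than passed through the parametrization.
    df = pd.DataFrame(
        {
            "A": [[0, 1, 2], [3, 4]],
            "B": 1,
            "C": [["a", "b", "c"], ["d", "e", "f"]],
        }
    )
    with pytest.raises(ValueError, match=error_message):
        df.explode(input_subset)
```

Keeping the shared fixture data out of the parameter list makes each case a single short tuple, so it is easier to see what actually varies.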

),
],
)
def test_multi_columns(
Member

Similar here

Contributor Author

similar as explained above

@iynehz iynehz requested a review from phofl June 7, 2021 13:40
Contributor Author

iynehz commented Jun 12, 2021

@phofl any comments?

be str or tuple, and all specified columns their list-like data
on same row of the frame must have matching length.

.. versionadded:: 1.3.0
Member

Suggested change
- .. versionadded:: 1.3.0
+ .. versionadded:: 1.4.0

@simonjayhawkins
Member

@stphnlyd can you add a release note. doc/source/whatsnew/v1.4.0.rst

@iynehz iynehz force-pushed the explode branch 2 times, most recently from d53d280 to f55ad59 Compare June 17, 2021 13:17
Contributor Author

iynehz commented Jun 17, 2021

@simonjayhawkins thanks I've squashed the PR and now target 1.4.0, and I've added release note. The PR checks failed but that's not caused by my changes.

Contributor

@jreback jreback left a comment

looks good. is the case where an element is nan in one of the multi-columns and a list in the other one tested / handled?

Contributor

jreback commented Jun 17, 2021

@phofl good here?

Member

phofl commented Jun 17, 2021

Yes lgtm

@@ -29,7 +29,7 @@ enhancement2

Other enhancements
^^^^^^^^^^^^^^^^^^
-
- :meth:`DataFrame.explode` now supports exploding multiple columns. Its ``column`` argument now also accepts a list of str or tuples for exploding on multiple columns at the same time (:issue:`39240`)
Contributor

ok putting this in 1.3, if you can move and ping on green.

@jreback jreback added this to the 1.3 milestone Jun 18, 2021
@iynehz iynehz changed the base branch from master to 1.3.x June 20, 2021 16:31
@iynehz iynehz changed the base branch from 1.3.x to master June 20, 2021 16:38
Contributor Author

iynehz commented Jun 20, 2021

@jreback targeting 1.3 now

@jreback jreback merged commit 41a94b0 into pandas-dev:master Jun 21, 2021
Contributor

jreback commented Jun 21, 2021

thanks @stphnlyd very nice

Contributor

jreback commented Jun 21, 2021

@meeseeksdev backport 1.3.x


lumberbot-app bot commented Jun 21, 2021

Something went wrong ... Please have a look at my logs.

jreback pushed a commit that referenced this pull request Jun 21, 2021
Co-authored-by: stphnlyd <stephanloyd9@gmail.com>
neinkeinkaffee pushed a commit to neinkeinkaffee/pandas that referenced this pull request Jun 21, 2021
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Labels
Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Multi-Column explode
6 participants