-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
EHN: multi-column explode #40770
EHN: multi-column explode #40770
Conversation
iynehz
commented
Apr 3, 2021
•
edited
Loading
edited
- closes ENH: Multi-Column explode #39240
- tests added / passed
- Ensure all linting tests pass, see here for how to run them
- whatsnew entry
Hello @stphnlyd! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-06-20 15:27:09 UTC |
8537396
to
34a099d
Compare
pandas/core/frame.py
Outdated
counts0 = self[columns[0]].apply(mylen) | ||
for c in columns[1:]: | ||
if not all(counts0 == self[c].apply(mylen)): | ||
raise ValueError("columns must have matching element counts") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Document too
pandas/core/frame.py
Outdated
else: | ||
raise ValueError("column must be a scalar, tuple, or list thereof") | ||
|
||
mylen = lambda x: len(x) if is_list_like(x) else -1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably a performance killer, so please do this only if len(columns) > 1. Maybe there is a better way to do this? Also add asv for this case
pandas/core/frame.py
Outdated
|
||
df = self.reset_index(drop=True) | ||
result = df[column].explode() | ||
result = df.drop([column], axis=1).join(result) | ||
result = DataFrame({c: df[c].explode() for c in columns}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This adds overhead for len = 1 case, let's look at asvs here
df1 = df.assign(C=[["a", "b", "c"], "foo", [], ["d", "e", "f"]]) | ||
df1.columns = list("ABC") | ||
with pytest.raises(ValueError, match="columns must have matching element counts"): | ||
df1.explode(list("AC")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add test where one side is scalar and other side is list
|
||
|
||
def test_multi_columns(): | ||
df = pd.DataFrame( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add gh reference
pandas/core/frame.py
Outdated
# mypy: Incompatible types in assignment (expression has type | ||
# "List[Union[str, Tuple[Any, ...]]]", variable has type | ||
# "List[Union[str, Tuple[Any, ...], List[Union[str, Tuple[Any, ...]]]]]") | ||
columns = column # type: ignore[assignment] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you type columns to avoid mypy error?
8bd59de
to
527a587
Compare
3 3 1 [d, e] | ||
3 4 1 [d, e] | ||
|
||
>>> df.explode(list('AC')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add Multi-column explode (and can comment that the above is a single column)
pandas/core/frame.py
Outdated
columns: list[str | tuple] | ||
if is_scalar(column) or isinstance(column, tuple): | ||
# mypy: List item 0 has incompatible type "Union[str, Tuple[Any, ...], | ||
# List[Union[str, Tuple[Any, ...]]]]"; expected |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you might be able to cast to remove this warning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here column can be str or tuple, IMHO cast will make things complex here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can use assert here, this would remove the mypy warning
pandas/core/frame.py
Outdated
elif isinstance(column, list) and all( | ||
map(lambda c: is_scalar(c) or isinstance(c, tuple), column) | ||
): | ||
if len(column) == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if not len(column)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if not colum should be enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see PEP8 recommends if not seq
. That should have minimal number of operations on the bytecode level.
@@ -9,13 +9,34 @@ def test_error(): | |||
df = pd.DataFrame( | |||
{"A": pd.Series([[0, 1, 2], np.nan, [], (3, 4)], index=list("abcd")), "B": 1} | |||
) | |||
with pytest.raises(ValueError, match="column must be a scalar"): | |||
with pytest.raises( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this test is getting big, can you split into multiple ones (ok to rename the original), parameterize is good as well if posilbe
{ | ||
"A": pd.Series([[0, 1, 2], np.nan, [], (3, 4)], index=list("abcd")), | ||
"B": 1, | ||
"C": [["a", "b", "c"], "foo", [], ["d", "e"]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about nan elements in C as well?
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this. |
df.explode(list("AA")) | ||
|
||
df.columns = list("AA") | ||
with pytest.raises(ValueError, match="columns must be unique"): | ||
df.explode("A") | ||
|
||
|
||
def test_error_multi_columns(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you parametrize here?
small comments, otherwise lgtm |
@phofl I've pushed to address your comments |
@stphnlyd pls merge master |
), | ||
], | ||
) | ||
def test_error_multi_columns(input_dict, input_index, input_subset, error_message): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you parametrize only the changing parts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test_error_multi_columns() is a new function I added in my PR, not something that already exists there. I added one more test case to this test_error_multi_columns() in this commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was not what I was referring to. Most of your parametrization values are the same for all cases, so no need to put them into the parametrization
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@phofl updated. btw "Python Dev / actions-310-dev" fails coverage upload, but that's not caused by me..
), | ||
], | ||
) | ||
def test_multi_columns( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar as explained above
@phofl any comments? |
be str or tuple, and all specified columns their list-like data | ||
on same row of the frame must have matching length. | ||
|
||
.. versionadded:: 1.3.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.. versionadded:: 1.3.0 | |
.. versionadded:: 1.4.0 |
@stphnlyd can you add a release note. doc/source/whatsnew/v1.4.0.rst |
d53d280
to
f55ad59
Compare
@simonjayhawkins thanks I've squashed the PR and now target 1.4.0, and I've added release note. The PR checks failed but that's not caused by my changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good. is the case where an element is nan in one of the multi-columns and a list in the other one tested / handled?
@phofl good here? |
Yes lgtm |
doc/source/whatsnew/v1.4.0.rst
Outdated
@@ -29,7 +29,7 @@ enhancement2 | |||
|
|||
Other enhancements | |||
^^^^^^^^^^^^^^^^^^ | |||
- | |||
- :meth:`DataFrame.explode` now supports exploding multiple columns. Its ``column`` argument now also accepts a list of str or tuples for exploding on multiple columns at the same time (:issue:`39240`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok putting this in 1.3, if you can move and ping on green.
@jreback targeting 1.3 now |
thanks @stphnlyd very nice |
@meeseeksdev backport 1.3.x |
Something went wrong ... Please have a look at my logs. |
Co-authored-by: stphnlyd <stephanloyd9@gmail.com>