ENH: Series.str.split can return a DataFrame instead of Series of lists #8663

Merged
merged 1 commit into from
Oct 29, 2014

billletson
Contributor

closes #8428.

Adds a flag which, when True, returns a DataFrame whose columns are the positions in the lists generated by the string splitting operation. When False, it returns a 1D numpy array, as before. Defaults to False so as not to break compatibility.

In the case with no splits, returns a single column DataFrame rather than squashing to a Series.
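A minimal sketch of the two return shapes described above, for readers who want to try it. It assumes a recent pandas release, where the frame output is requested with `expand=True`; that keyword name is an assumption relative to this thread, which first proposes `to_df`. The sample data is made up.

```python
import pandas as pd

s = pd.Series(["a_b_c", "d_e", "f"])

# Default behaviour: a Series whose elements are lists of pieces.
as_lists = s.str.split("_")

# Frame behaviour: one column per list position, shorter rows padded with missing values.
as_frame = s.str.split("_", expand=True)

# No separator present anywhere: still a DataFrame, with a single column.
no_split = pd.Series(["abc", "def"]).str.split("_", expand=True)

print(as_lists)
print(as_frame)
print(no_split)
```

The column labels of the frame are the integer positions 0, 1, 2, matching the "index of the lists" wording in the description.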

@jreback added the API Design, Enhancement, Strings, and Reshaping labels Oct 28, 2014
@jreback added this to the 0.15.1 milestone Oct 29, 2014
@jreback
Contributor

jreback commented Oct 29, 2014

looks good. pls add a release note in v0.15.1.txt

I am not sure I love the option to_df, @jorisvandenbossche @TomAugspurger ?

maybe as_frame? (side issue: it's possible we want to deprecate this in 0.16.0 and change it to have as_frame=True by default) - or whatever it is called

@@ -631,6 +631,9 @@ def str_split(arr, pat=None, n=None):
    pat : string, default None
        String or regular expression to split on. If None, splits on whitespace
    n : int, default None (all)
    to_df : Boolean, default False
        If True, returns a DataFrame,
        If False, returns an array with one dimension (elements are lists).
Member

"array with one dimension" -> "series" ? (it does return a series no?)

very minor: "Boolean" doesn't need a capital B (that is more consistent)

@jorisvandenbossche
Member

for the name, I would certainly use frame instead of df (in line with the to_frame method), but as_frame or to_frame are both good for me

edit: maybe as_frame sounds better!

@TomAugspurger
Contributor

👍 for as_frame.

Eventually changing the default to True (to return a frame) sounds good. I think you're saying start deprecating in 0.16 right? And make the change later?

@immerrr
Contributor

immerrr commented Oct 29, 2014

As a pure "what if" option: in the DataFrame.to_dict(self, outtype) method, outtype is one of {'dict', 'list', 'series'}, so outtype={'list', 'frame'} might be better consistency-wise.

@jorisvandenbossche
Member

@immerrr But this outtype kwarg is recently deprecated in favour of orient (which was more in line with from_dict) ...
#7840, http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_dict.html

I don't know if there are other examples in pandas of similar things?
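For context on the keyword being compared here, a short sketch of how `orient` selects the output shape in `DataFrame.to_dict`. It assumes a recent pandas release; the frame contents are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

by_dict = df.to_dict(orient="dict")      # {column: {index: value}}
by_list = df.to_dict(orient="list")      # {column: [values]}
by_series = df.to_dict(orient="series")  # {column: Series}

print(by_dict)
print(by_list)
print(type(by_series["a"]))
```

Each value changes how the same data is laid out, which is a different kind of choice from picking between a Series and a DataFrame as the return container.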

@jreback
Contributor

jreback commented Oct 29, 2014

orient sounds good

@jorisvandenbossche
Member

ah, that is not what I meant :-)

How does the word 'orient' relate to the fact it is a series or expanded to dataframe?

@jreback
Contributor

jreback commented Oct 29, 2014

@jorisvandenbossche that is true. ok, outtype (semantically correct here), the other deprecation was more for consistency.

@jorisvandenbossche
Member

in boxplot you have the return_type kwarg

@jreback
Contributor

jreback commented Oct 29, 2014

ohh, I like that. @billletson want to change to that kw?

@TomAugspurger yes, the idea is to change this in 0.16.0 (or maybe just deprecate and change the default later)
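One common way to stage a "deprecate now, change the default later" transition is to warn only when the caller relies on the default. The sketch below illustrates that pattern and is not the pandas implementation: `split_values` is a hypothetical helper, it works on plain Python lists rather than Series or DataFrame objects, and the warning text is invented.

```python
import warnings

_UNSET = object()  # sentinel: distinguishes "not passed" from an explicit value


def split_values(values, sep=None, return_type=_UNSET):
    """Hypothetical helper: split each string, returning ragged or padded rows."""
    if return_type is _UNSET:
        # Caller relied on the default, so announce the upcoming change.
        warnings.warn(
            "the default return_type will change from 'series' to 'frame' "
            "in a future version; pass return_type explicitly",
            FutureWarning,
            stacklevel=2,
        )
        return_type = "series"
    if return_type not in ("series", "frame"):
        raise ValueError("return_type must be 'series' or 'frame'")

    parts = [v.split(sep) for v in values]
    if return_type == "series":
        return parts  # ragged list of lists
    # 'frame': pad every row to the width of the longest one.
    width = max((len(p) for p in parts), default=0)
    return [p + [None] * (width - len(p)) for p in parts]


print(split_values(["a b", "c"], return_type="frame"))
```

Callers who pass `return_type` explicitly see no warning, so they are unaffected when the default eventually flips.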

@immerrr
Contributor

immerrr commented Oct 29, 2014

Speaking of making it the default, it makes a lot of sense to return frames by default when n is not None, i.e. when you have an upper bound on the number of created columns. When the number of splits is unknown in advance, it seems safer to return lists instead, or it may blow up on a single unfortunate value...
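A small sketch of the concern above, assuming a recent pandas release where the frame output is requested with `expand=True` (a keyword name that does not appear in this thread). The data is invented: two ordinary rows plus one value with 50 space-separated tokens.

```python
import pandas as pd

s = pd.Series(["a b", "c d", " ".join(["x"] * 50)])

# No limit on splits: the single long value sets the width for every row.
unbounded = s.str.split(" ", expand=True)

# With n=1 there is at most one split per value, so at most n + 1 columns.
bounded = s.str.split(" ", n=1, expand=True)

print(unbounded.shape)  # (3, 50)
print(bounded.shape)    # (3, 2)
```

In the unbounded case the two short rows are padded with missing values across 48 extra columns, which is the blow-up being described.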

@billletson
Contributor Author

Revised the kw, added a release note, as well as a couple more test cases.

@jreback
Contributor

jreback commented Oct 29, 2014

looks good to me, @jorisvandenbossche ?

@billletson can you also create a new issue to have the default changed to frame? pls reference this issue (i will mark it as 0.16.0)

@jorisvandenbossche
Member

yep, looks good. @billletson Thanks a lot!

jorisvandenbossche added a commit that referenced this pull request Oct 29, 2014
ENH: Series.str.split can return a DataFrame instead of Series of lists
@jorisvandenbossche merged commit 5cf3d85 into pandas-dev:master Oct 29, 2014
@jreback modified the milestones: 0.15.2, 0.15.1 Oct 30, 2014
Development

Successfully merging this pull request may close these issues.

API/ENH: str.split should return a DataFrame
5 participants