ENH: add StringMethods (.str accessor) to Index, fixes #9068 #9667

Merged
merged 1 commit into from
Apr 10, 2015

Conversation

mortada
Contributor

@mortada mortada commented Mar 17, 2015

fixes #9068

@mortada
Contributor Author

mortada commented Apr 1, 2015

@jreback please let me know what you think of this, thanks

@jreback
Contributor

jreback commented Apr 1, 2015

this looks good
pls add a release note in v0.16.1

@jreback
Contributor

jreback commented Apr 1, 2015

may also want to update the docs to mention u can do this on Index (I think it's in text.rst iirc)

@mortada
Contributor Author

mortada commented Apr 2, 2015

@jreback both release note and text.rst have been updated

@@ -497,6 +498,19 @@ def searchsorted(self, key, side='left'):
        #### needs tests/doc-string
        return self.values.searchsorted(key, side=side)

    # string methods
    def _make_str_accessor(self):
        if not com.is_object_dtype(self.dtype):
Member

Can you add a check here to exclude string methods on MultiIndex objects, too? They have object dtype, too.

Contributor Author

@shoyer you mean a simple check like this?

    if isinstance(self, MultiIndex):
        raise AttributeError(".str accessor is not supported for MultiIndex objects")

Member

Yes, that looks about right

Member

Normally, I would suggest only putting this accessor on the appropriate types directly, but it's a little tricky with MultiIndex because it's a subclass of the generic Index type (which is what string indexes are).

Contributor

I think it's better to check the inferred type,
which will be string, or mixed for a MultiIndex (iirc)

@mortada mortada force-pushed the index_str_methods branch 2 times, most recently from 4b9468b to 78b68cb Compare April 2, 2015 19:13
            raise AttributeError("Can only use .str accessor with string "
                                 "values (i.e. inferred_type is 'string')")
        return StringMethods(self)

Contributor Author

@shoyer @jreback I improved the checking here according to your recommendations - in the case of Series it's the same logic as before and in the case of Index it uses the inferred_type attribute
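
The inferred_type-based check can be illustrated with a minimal sketch. It assumes a reasonably recent pandas in which `Index.inferred_type` and `Index.str` are available; `has_str_accessor` is a hypothetical helper written for this illustration, not part of the PR. Error messages vary between versions, so only the exception type is relied on.

```python
import pandas as pd

str_idx = pd.Index(["a1", "a2", "b1"])
int_idx = pd.Index([1, 2, 3])
multi_idx = pd.MultiIndex.from_tuples([("a", 1), ("b", 2)])

# inferred_type summarises the kind of values an index holds
kinds = [str_idx.inferred_type, int_idx.inferred_type, multi_idx.inferred_type]


def has_str_accessor(index):
    """Hypothetical helper: True if `.str` can be used on this index."""
    try:
        index.str
    except AttributeError:
        return False
    return True


# Only the plain string index is expected to expose .str
supported = [has_str_accessor(i) for i in (str_idx, int_idx, multi_idx)]
```

Checking the inferred type rather than the dtype is what lets a MultiIndex (object dtype, but not string values) be rejected without an explicit isinstance test.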

@sinhrks
Member

sinhrks commented Apr 5, 2015

Nice, but 2 points:

  • Some functions return bool, and Index(dtype=bool) is not very usable (related to API: Index should support __inverse__ ops #8875). It may be better to return np.array in these cases.
  • Should we prohibit return_type="frame" in split?

@jreback
Contributor

jreback commented Apr 5, 2015

@sinhrks can you elaborate on the first point (about bool)? an example?

@mortada

I think you should add tests for all of the .str methods (e.g. to test the return types issue). I think you will have to raise if return_type='frame' is specified.

@mortada
Contributor Author

mortada commented Apr 6, 2015

@jreback just added a check and unit test for the case of Index and return_type='frame'

@sinhrks I'm looking into the bool issue you referenced

@shoyer
Member

shoyer commented Apr 6, 2015

@jreback @mortada some str methods return booleans, e.g., str.isalpha. For series, it makes sense to return a boolean Series. As @sinhrks points out, we probably don't want to return a new Index object for these values, but rather a plain numpy array.

@mortada
Contributor Author

mortada commented Apr 6, 2015

@shoyer, @sinhrks I see, indeed returning np.array in the boolean case seems a lot more useful.

@jreback an example would be:

idx = Index(['a1', 'a2', 'b1', 'b2'])
s = Series(range(4), index=idx)

If we return an np.array then an expression like

s[s.index.str.startswith('a')]

would work naturally, whereas if we return a boolean Index that's not very useful
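
A runnable version of the example above, as a sketch against a recent pandas. The `np.asarray` call is a defensive assumption so the snippet behaves the same whether the accessor hands back an array or some other boolean container.

```python
import numpy as np
import pandas as pd

idx = pd.Index(["a1", "a2", "b1", "b2"])
s = pd.Series(range(4), index=idx)

# Boolean result of a string method called on the index
mask = np.asarray(s.index.str.startswith("a"))

# A plain boolean array can be used directly as a selection mask
selected = s[mask]
```

Here `selected` keeps only the rows whose label starts with "a".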

@mortada
Contributor Author

mortada commented Apr 6, 2015

Just added a check for the boolean Index case, it will now return np.array instead. Please take a look, thanks!

@@ -18,6 +18,7 @@ Enhancements
~~~~~~~~~~~~

- Added ``StringMethods.capitalize()`` and ``swapcase`` which behave as the same as standard ``str`` (:issue:`9766`)
- Added ``StringMethods`` (.str accessor) to ``Index`` (:issue:`9068`)
Member

It would be nice to add an example or two here, both to show off the new feature and to illustrate the behavior with boolean output.

Contributor Author

good idea, will do

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves API Design Strings String extension data type and string data labels Apr 6, 2015
@jreback jreback added this to the 0.16.1 milestone Apr 6, 2015
        if isinstance(self.series, Index):
            # if result is a boolean np.array, return the np.array
            # instead of wrapping it into a boolean Index (GH 8875)
            if result.dtype == bool:
Contributor

use is_bool_dtype(result)

Contributor Author

good catch I'll change this
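
A small sketch of the wrapping rule under discussion, using the public `pandas.api.types.is_bool_dtype` from current pandas. `wrap_for_index` is a hypothetical stand-in for the PR's internal result-wrapping step, not the actual implementation.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_bool_dtype


def wrap_for_index(result):
    """Hypothetical helper: keep boolean results as arrays, wrap the rest."""
    if is_bool_dtype(result):
        # Boolean output stays a plain array so it can be used as a mask
        return np.asarray(result)
    # Everything else is wrapped back into an Index
    return pd.Index(result)


flags = wrap_for_index(np.array([True, False]))
upper = wrap_for_index(np.array(["A", "B"], dtype=object))
```

Using `is_bool_dtype` keeps the test in one shared helper instead of comparing dtypes by hand at each call site.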

@jreback
Contributor

jreback commented Apr 6, 2015

@mortada yeh I think we need to do something for

Index(['foo_bar','bah_baz']).str.split('_',return_type='frame')

@sinhrks
Member

sinhrks commented Apr 6, 2015

@mortada Thanks for quick action:)

@jorisvandenbossche
Member

Didn't look in detail into this, but a quick question: what is the return type? I suppose in general a new index? But what if the method returns multiple elements for each entry? (eg returning a frame for a series)?

@mortada
Contributor Author

mortada commented Apr 6, 2015

@jorisvandenbossche the return type is Index unless the dtype is bool, in which case the return type will be np.array.

if the method returns multiple elements it will be an Index of lists, and won't be a DataFrame. If the user specifies return_type='frame' it will raise. Here's an example:

In [1]: from pandas import Index
In [2]: idx = Index(['a b c', 'd e', 'f'])

In [3]: idx.str.split()
Out[3]: Index([['a', 'b', 'c'], ['d', 'e'], ['f']], dtype='object')

In [4]: idx.str.split(return_type='frame')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-d22b631e0f0f> in <module>()
----> 1 idx.str.split(return_type='frame')

/Users/mortada_mehyar/code/github/pandas/pandas/core/strings.py in split(self, pat, n, return_type)
    987     @copy(str_split)
    988     def split(self, pat=None, n=-1, return_type='series'):
--> 989         result = str_split(self.series, pat, n=n, return_type=return_type)
    990         return self._wrap_result(result)
    991

/Users/mortada_mehyar/code/github/pandas/pandas/core/strings.py in str_split(arr, pat, n, return_type)
    652         raise ValueError("return_type must be {'series', 'frame'}")
    653     if return_type == 'frame' and isinstance(arr, Index):
--> 654         raise ValueError("return_type='frame' is not supported for string "
    655                          "methods on Index")
    656     if pat is None:

ValueError: return_type='frame' is not supported for string methods on Index
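
The non-raising half of this example can be reproduced as a short sketch. It assumes a recent pandas and deliberately omits `return_type`, since later pandas releases no longer accept that keyword.

```python
import pandas as pd

idx = pd.Index(["a b c", "d e", "f"])

# Splitting on whitespace keeps one entry per original label;
# each entry is the list of pieces for that label.
parts = idx.str.split()
pieces = [list(p) for p in parts]
```

The result has the same length as the original index, so it stays one-dimensional even though the entries have different numbers of pieces.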

@mortada mortada force-pushed the index_str_methods branch 2 times, most recently from 4b68a72 to 1b48868 Compare April 6, 2015 17:42
@mortada
Contributor Author

mortada commented Apr 6, 2015

@shoyer added examples to doc
@jreback now using is_bool_dtype

please take a look, thanks.

@jorisvandenbossche
Member

By the way, I think this is a very nice feature, especially useful for cleaning up your column names (stripping whitespace at beginning or end -> typical problem you see on SO, replacing whitespace with _, convert to all lower case, ..) For the docs, maybe it would be useful to also focus on this and give an example where the column names are used (but maybe not needed for this PR)

@jreback
Contributor

jreback commented Apr 8, 2015

@jorisvandenbossche I don't think it's a problem if certain methods which have a return_type='frame' raise if you try to call them on an Index. The only other option is to actually return a frame, as I think the purpose is what you mentioned above: to allow cleaning-type ops on an Index, from which you expect to get an Index back.

Another option is to exclude the methods entirely from the Index (meaning ones that have a default return_type='frame')

@mortada
Contributor Author

mortada commented Apr 8, 2015

@jreback I added return_type='index' as an alias for 'series' as you suggested, please take a look

@jreback
Contributor

jreback commented Apr 8, 2015

lgtm. @jorisvandenbossche

@jorisvandenbossche
Member

@mortada Are you sure you committed it? I still only see the frame/series options (or I am overlooking it completely ..)

@mortada
Contributor Author

mortada commented Apr 8, 2015

@jorisvandenbossche oops I had to do another rebase and missed the last commit, thanks for catching!

@mortada
Contributor Author

mortada commented Apr 9, 2015

@jorisvandenbossche it's fixed, please take a look thanks

@@ -632,9 +632,10 @@ def str_split(arr, pat=None, n=None, return_type='series'):
     pat : string, default None
         String or regular expression to split on. If None, splits on whitespace
     n : int, default None (all)
-    return_type : {'series', 'frame'}, default 'series
+    return_type : {'series', 'index', 'frame'}, default 'series'
Member

Maybe make this {'series'/'index', 'frame'} to make it more clear they are aliases?

@jorisvandenbossche
Member

The thing I am thinking now: won't it be confusing for users to get an Index back even if they supply return_type='series' on an Index.str.split, or to get a Series back with Series.str.split(.., return_type='index')?

I was also thinking of another option: an expand keyword (or another name), that indicates for False: give same dimension back (so for series/index keep it a series/index), and for True: expand series to dataframe.
This would then be a duplicate for return_type of course. But the return_type was only introduced in 0.15.1 (and for partition it is still in a PR), so if we want to change this: better now than later. Or has this ship sailed?

But maybe we shouldn't block this PR with this discussion and open a new issue for this. So I am OK with merging.
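
For readers on later pandas releases: an `expand` keyword along these lines does exist on `str.split` there. The sketch below assumes such a release; the keyword is not part of this PR, and the behaviour shown is that of current pandas rather than of the code under review.

```python
import pandas as pd

idx = pd.Index(["foo_bar", "bah_baz"])

# expand=False (the default): same dimension back, an Index of lists
same = idx.str.split("_")

# expand=True on an Index: one level per piece
expanded = idx.str.split("_", expand=True)

# expand=True on a Series: one column per piece
ser = pd.Series(["foo_bar", "bah_baz"])
frame = ser.str.split("_", expand=True)
```

With a single boolean flag, the result type follows from the caller's object (Index or Series) instead of being named in the argument.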

@jreback
Contributor

jreback commented Apr 9, 2015

@jorisvandenbossche

how about we just do

return_type='same'|'expand' to satisfy this need? (and can be easily back-compat)

but let's do that in another issue / PR

@mortada
Contributor Author

mortada commented Apr 10, 2015

@jreback @jorisvandenbossche I'm happy to do another PR for the return_type

@jreback
Contributor

jreback commented Apr 10, 2015

ok, merging this, but @mortada I would appreciate another issue (and PR!) for the return_type issue.

jreback added a commit that referenced this pull request Apr 10, 2015
ENH: add StringMethods (.str accessor) to Index, fixes #9068
@jreback jreback merged commit 2734fff into pandas-dev:master Apr 10, 2015
@jorisvandenbossche
Member

@mortada really a nice improvement!

I will open a new issue to discuss the return_type thing

@mortada
Contributor Author

mortada commented Apr 10, 2015

@jorisvandenbossche thanks!

Another thing you had mentioned was expanding the docs to cover some questions on SO about cleaning up column names. Could you please point me to some of those SO questions? I can add more examples to the docs.

@jorisvandenbossche
Member

@mortada I don't have a specific SO question on my mind, but you often see questions like: "why does df['my_col'] give a KeyError as it is in the columns?" with response -> "check df.columns, maybe you have starting or trailing spaces in the names" (something you don't easily see in the console output, you can only detect it through the alignment)

So for me, these are typical ways to clean up column names:

  • remove starting/trailing whitespace (eg leftover of reading it in)
  • convert all column names to lower case (eg something typical you would want for writing it to certain databases/on certain platforms to not have to deal with case sensitivity issues)
  • remove all whitespace in the column names, or convert all whitespaces to underscores ("my col" -> "my_col")
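
The three clean-up steps listed above can be chained on `df.columns`. This is a minimal sketch with made-up column names, assuming a pandas version where `str.replace` accepts the `regex` keyword.

```python
import pandas as pd

df = pd.DataFrame({" My Col ": [1, 2], "Other Col": [3, 4]})

# " My Col " has leading/trailing spaces, so df["My Col"] would raise KeyError
missing = "My Col" not in df.columns

df.columns = (
    df.columns
    .str.strip()                         # drop leading/trailing whitespace
    .str.lower()                         # normalise case
    .str.replace(" ", "_", regex=False)  # "my col" -> "my_col"
)
```

After the assignment, `df["my_col"]` and `df["other_col"]` both resolve without surprises.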

@shoyer
Member

shoyer commented Apr 11, 2015

@mortada very nice, thanks!

Labels
API Design Indexing Related to indexing on series/frames, not to indexes themselves Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: add StringMethods (e.g. .str) to Index
5 participants