-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: add StringMethods (.str accessor) to Index, fixes #9068 #9667
Conversation
@jreback please let me know what you think of this, thanks |
this looks good |
may also want to update the docs to mention u can do this on Index (I think it's in text.rst iirc) |
9bab07a
to
00e6aea
Compare
@jreback both release note and |
@@ -497,6 +498,19 @@ def searchsorted(self, key, side='left'): | |||
#### needs tests/doc-string | |||
return self.values.searchsorted(key, side=side) | |||
|
|||
# string methods | |||
def _make_str_accessor(self): | |||
if not com.is_object_dtype(self.dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a check here to exclude string methods on MultiIndex objects, too? They have object dtype, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shoyer you mean a simple check like this?
if isinstance(self, MultiIndex):
raise AttributeError(".str accessor is not supported for MultiIndex objects")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that looks about right
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Normally, I would suggest only putting this accessor on the appropriate types directly, but it's a little tricky with MultiIndex because it's a subclass of the generic Index type (which is what string indexes area)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think better to check the inferred type
which will be string or mixed for s multi index (iirc)
4b9468b
to
78b68cb
Compare
raise AttributeError("Can only use .str accessor with string " | ||
"values (i.e. inferred_type is 'string')") | ||
return StringMethods(self) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, but 2 points:
|
78b68cb
to
e6b0a4b
Compare
@shoyer, @sinhrks I see, indeed returning @jreback an example would be:
If we return an
would work naturally, whereas if we return a boolean |
e6b0a4b
to
23cf6b3
Compare
Just added a check for the |
@@ -18,6 +18,7 @@ Enhancements | |||
~~~~~~~~~~~~ | |||
|
|||
- Added ``StringMethods.capitalize()`` and ``swapcase`` which behave as the same as standard ``str`` (:issue:`9766`) | |||
- Added ``StringMethods`` (.str accessor) to ``Index`` (:issue:`9068`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to add an example or two here, both to show off the new feature and to illustrate the behavior with boolean output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good idea, will do
if isinstance(self.series, Index): | ||
# if result is a boolean np.array, return the np.array | ||
# instead of wrapping it into a boolean Index (GH 8875) | ||
if result.dtype == bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use is_bool_dtype(result)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch I'll change this
@mortada yeh I think we need to do something for
|
@mortada Thanks for quick action:) |
Didn't look in detail into this, but a quick question: what is the return type? I suppose in general a new index? But what if the method returns multiple elements for each entry? (eg returning a frame for a series)? |
@jorisvandenbossche the return type is if the method returns multiple elements it will be an
|
4b68a72
to
1b48868
Compare
By the way, I think this is a very nice feature, especially useful for cleaning up your column names (stripping whitespace at beginning or end -> typical problem you see on SO, replacing whitespace with |
@jorisvandenbossche I don't think its a problem if certain methods which have a Another option is to exclude the methods entirely from the |
1b48868
to
8968009
Compare
@jreback I added |
lgtm. @jorisvandenbossche |
8968009
to
7d734fb
Compare
@mortada Are you sure you committed it? I still only see the frame/series options (or I am overlooking it completely ..) |
7d734fb
to
ed77c72
Compare
@jorisvandenbossche oops I had to do another rebase and missed the last commit, thanks for catching! |
@jorisvandenbossche it's fixed, please take a look thanks |
@@ -632,9 +632,10 @@ def str_split(arr, pat=None, n=None, return_type='series'): | |||
pat : string, default None | |||
String or regular expression to split on. If None, splits on whitespace | |||
n : int, default None (all) | |||
return_type : {'series', 'frame'}, default 'series | |||
return_type : {'series', 'index', 'frame'}, default 'series' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe make this {'series'/'index', 'frame'}
to make it more clear they are aliases?
The thing I am thinking now: won't it be confusing for users to get a Index back even if he/she supplies I was also thinking of another option: an But maybe we shouldn't block this PR with this discussion and open a new issue for this. So I am OK with merging. |
how about we just do
but let's do that in another issue / PR |
@jreback @jorisvandenbossche I'm happy to do another PR for the |
ed77c72
to
f98bcb8
Compare
ok, merging this, but @mortada I would appreciate another issue (and PR!) for the |
ENH: add StringMethods (.str accessor) to Index, fixes #9068
@mortada really a nice improvement! I will open an new issue to discuss the |
@jorisvandenbossche thanks! Another thing you had mentioned was expanding the docs to cover some questions on SO about cleaning up column names. Could you please point me to some of those SO questions? I can add more examples to the docs. |
@mortada I don't have a specific SO question on my mind, but you often see questions like: "why does So for me, this are typical ways to clean up column names:
|
@mortada very nice, thanks! |
fixes #9068