-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: Add string extension type #27949
Conversation
This adds a new extension type 'string' for storing string data. The data model is essentially unchanged from master. String are still stored in an object-dtype ndarray. Scalar elements are still Python strs, and `np.nan` is still used as the string dtype.
Does the .str accessor only show up on string-dtype Series, or also object-coincidentally-all-string Series as in master? |
Also on object-dtype as in master. This PR doesn't change the current behavior at all (modulo new bugs). |
May be a more suitable question for an issue, but |
It'll have to for now (in my first commit I allowed We'll want to think of a migration path for making |
I was curious about how much the string validation (ruling out relative: absolute: code
I believe there are two sources of relative slowdown
|
(CI is passing now). |
are you looking to get this merged quickly? i.e. can this go on the "review later this week" pile or does it belong in the "later today" pile? |
No. This can go in the "when I'm bored" pile. I don't think there's a huge
rush. It'd be nice for 1.0 but not essential.
…On Mon, Aug 19, 2019 at 3:43 PM jbrockmendel ***@***.***> wrote:
are you looking to get this merged quickly? i.e. can this go on the
"review later this week" pile or does it belong in the "later today" pile?
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#27949?email_source=notifications&email_token=AAKAOIULOTZCKS2XW7YFRHDQFMAYPA5CNFSM4IMKKYTKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4UHSDI#issuecomment-522746125>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAKAOIXGPYKMAKIMN55YAO3QFMAYPANCNFSM4IMKKYTA>
.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the intention to have this have an encoding (maybe parameterized)? or is that a unicode type? IOW shouldn't this be string[utf8]
? (and what you have here is a base class for this), we could certianly default dtype='string'
-> string[utf8]
.
also rebase when you have a chance, have only glanced, but looks pretty good so far.
Mmm, this type is for in-memory strings, so I don't think an encoding is necessary. I don't think we're exposing the actual storage anywhere, which is when the encoding would matter. If we had a ByteArray, then yes I think that should be parametrized by encoding (with UTF8 as the default). |
Hello @TomAugspurger! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2019-10-04 14:03:39 UTC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the sake of clarity this doesn't offer any real speed or memory savings over the object dtype right? Just a better way to ensure that an array only contains strings?
I am a bit biased as I have clear preference, but trying to summarize arguments for strings vs text: Reasons to go with "string":
Reasons to go with "text":
SQL has VARCHAR (so also character), Postgres also has a non-SQL-standard TEXT type (but the equivalent of what we are building here is not TEXT but VARCHAR, AFAIU)
IMO, we shouldn't do that now, let's first get some experience with it as an opt-in experimental feature. |
yep i’ll retract my support for Text and go with String but i think we need to make it very clear that for now str is not the same as string |
Yep, I think we'll want And we'll likely want to deprecate I'll make the switch back to StringDtype / StringArray. |
self._validate() | ||
|
||
def _validate(self): | ||
"""Validate that we only store NA or strings.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this pass if you passing a StringArray itself? (and do we infer correctly in is_string_array)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this pass if you passing a StringArray itself?
Done.
and do we infer correctly in is_string_array
We (Cython) actually raises on lib.is_string_array(StringArray)
since it's expecting an ndarray. Not sure what's best here. IIUC, we only use len(values)
and values.dtype
.
else: | ||
dtype = None | ||
|
||
result = arr._constructor_expanddim( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does arr.dtype work here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not always: #27953
thanks @TomAugspurger very nice. IIRC there are a couple of followups. |
did this have an issue to close? |
There's the main String Dtype issue:
#8640.
Leaving that open for part 2, which is storing strings natively, rather
than as Python objects.
…On Sat, Oct 5, 2019 at 6:18 PM Jeff Reback ***@***.***> wrote:
did this have an issue to close?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#27949?email_source=notifications&email_token=AAKAOIRXO5WVDNVFDSNWZILQNEOE3A5CNFSM4IMKKYTKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEAN53TA#issuecomment-538697164>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAKAOIRFZU7SQIODB5URCYDQNEOE3ANCNFSM4IMKKYTA>
.
|
This adds a new extension type 'string' for storing string data.
The data model is essentially unchanged from master. String are still
stored in an object-dtype ndarray. Scalar elements are still Python
strs, and
np.nan
is still used as the string dtype.Things are pretty well contained. The major changes outside the new array are
core/strings.py
to handle things correctly (mostly, returningstring
dtype when there's astring
input.No rush on reviewing this. Just parking it here for now.