API: Add string extension type #27949

TomAugspurger · 2019-08-16T15:22:23Z

This adds a new extension type 'string' for storing string data.

The data model is essentially unchanged from master. Strings are still
stored in an object-dtype ndarray. Scalar elements are still Python
strs, and np.nan is still used as the missing-value marker for the string dtype.

Things are pretty well contained. The major changes outside the new array are

docs
core/strings.py to handle things correctly (mostly, returning string dtype when there's a string input).

No rush on reviewing this. Just parking it here for now.

jbrockmendel · 2019-08-16T16:00:38Z

Does the .str accessor only show up on string-dtype Series, or also object-coincidentally-all-string Series as in master?

TomAugspurger · 2019-08-16T16:42:31Z

Also on object-dtype as in master.

This PR doesn't change the current behavior at all (modulo new bugs).

mroeschke · 2019-08-16T18:34:58Z

May be a more suitable question for an issue, but dtype=str will still point to object in this PR and in the future(?)

TomAugspurger · 2019-08-16T19:01:15Z

dtype=str will still point to object in this PR and in the future(?)

It'll have to for now (in my first commit I allowed dtype='str' as well as dtype='string', but that of course broke things.)

We'll want to think of a migration path for making 'str' mean StringDtype, and probably we'll want to infer ['a', 'b'] as string rather than object.

TomAugspurger · 2019-08-16T19:48:22Z

I was curious about how much the string validation (ruling out pd.array(['a', 1], dtype="string")) costs.

[Two plots, "relative" and "absolute" overhead versus n; images not preserved.]

code:

from timeit import default_timer as tic

import numpy as np
import pandas as pd
import pandas.util.testing as tm

# Input sizes to benchmark.
ns = [0, 10, 100, 1_000, 10_000, 100_000, 1_000_000]
times = []
for n in ns:
    data = tm.makeStringIndex(n).tolist()
    t0 = tic()
    # Baseline: plain object-dtype ndarray, no validation.
    np.array(data, dtype=object)
    t1 = tic()
    # Same data through the validating StringArray constructor.
    pd.core.arrays.StringArray._from_sequence(data)
    t2 = tic()

    times.append((n, t1 - t0, t2 - t1))

df = pd.DataFrame(times, columns=['n', 'numpy', 'pandas']).set_index('n')
# Divide each row by its NumPy time to get the relative overhead.
df.div(df['numpy'], axis=0).plot(
    logx=True, title="Overhead Relative to NumPy"
)

I believe there are two sources of relative slowdown:

Per-value validation that the scalars are strings or NA.
Ensuring that we have a consistent NA value (no mix of None and np.nan). Right now, this is done with an isna() (linear scan over the values) and a setitem if needed. I suspect that this could be rolled into the per-value scalar validation if desired, but I'm less concerned about perf right now.
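The two steps above, and the idea of rolling them into one scan, can be sketched in plain Python. This is only an illustration under stated assumptions, not the pandas implementation: the function names are hypothetical, a list stands in for the object-dtype ndarray, and `float("nan")` stands in for np.nan.

```python
import math

NA = float("nan")  # stand-in for np.nan, the canonical NA marker


def _is_na(value):
    """True for None or a float NaN, the two NA spellings mentioned above."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def validate_two_pass(values):
    """Validate every element, then normalize NA values in a second scan."""
    # Pass 1: every element must be a str or an NA value.
    for v in values:
        if not (isinstance(v, str) or _is_na(v)):
            raise ValueError(f"expected str or NA, got {type(v).__name__}")
    # Pass 2: rewrite any NA spelling to the single canonical marker.
    return [NA if _is_na(v) else v for v in values]


def validate_one_pass(values):
    """Validate and normalize in a single scan over the values."""
    out = []
    for v in values:
        if isinstance(v, str):
            out.append(v)
        elif _is_na(v):
            out.append(NA)
        else:
            raise ValueError(f"expected str or NA, got {type(v).__name__}")
    return out
```

Both functions accept `["a", None]` and reject `["a", 1]`; the single-pass version just avoids the second linear scan.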

TomAugspurger · 2019-08-19T17:29:50Z

(CI is passing now).

jbrockmendel · 2019-08-19T20:43:41Z

are you looking to get this merged quickly? i.e. can this go on the "review later this week" pile or does it belong in the "later today" pile?

TomAugspurger · 2019-08-19T20:52:50Z

No. This can go in the "when I'm bored" pile. I don't think there's a huge rush. It'd be nice for 1.0 but not essential.

jreback

is the intention to have this have an encoding (maybe parameterized)? or is that a unicode type? IOW shouldn't this be string[utf8] ? (and what you have here is a base class for this), we could certainly default dtype='string' -> string[utf8].

also rebase when you have a chance, have only glanced, but looks pretty good so far.

TomAugspurger · 2019-09-08T19:25:35Z

is the intention to have this have an encoding (maybe parameterized)? or is that a unicode type?

Mmm, this type is for in-memory strings, so I don't think an encoding is necessary. I don't think we're exposing the actual storage anywhere, which is when the encoding would matter.

If we had a ByteArray, then yes I think that should be parametrized by encoding (with UTF8 as the default).
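As a small plain-Python illustration of why an encoding only matters once strings are stored as bytes (this is not code from the PR, and the variable names are made up for the example):

```python
text = "na\u00efve"              # in-memory str: five code points, no encoding attached
utf8 = text.encode("utf-8")      # bytes: the codec is now part of the data's meaning
latin1 = text.encode("latin-1")

print(len(text))    # 5 code points
print(len(utf8))    # 6 bytes, because U+00EF takes two bytes in UTF-8
print(len(latin1))  # 5 bytes, because U+00EF is a single byte in Latin-1

# Decoding with the wrong codec does not fail here; it yields different text.
print(utf8.decode("latin-1") == text)  # False
```

The str value is the same in every case; only the byte representations differ, which is the situation a bytes-backed array would have to describe with an encoding parameter.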

pep8speaks · 2019-09-09T14:10:50Z

Hello @TomAugspurger! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-10-04 14:03:39 UTC

WillAyd

For the sake of clarity, this doesn't offer any real speed or memory savings over the object dtype, right? Just a better way to ensure that an array only contains strings?

doc/source/user_guide/text.rst

doc/source/whatsnew/v1.0.0.rst

pandas/core/arrays/string_.py

jorisvandenbossche · 2019-09-30T15:59:11Z

I am a bit biased as I have clear preference, but trying to summarize arguments for strings vs text:

Reasons to go with "string":

It matches our own existing terminology (string methods with str accessor)
It matches Python terminology (str type, the scalar value of this array is a python string)
It matches what some other libraries / languages do (Arrow, Julia (they also have a Char), C++, ..; R uses character)

Reasons to go with "text":

A less subtle difference with "dtype=str" (which for now keeps giving the object dtype)
.. other?

SQL has VARCHAR (so also character), Postgres also has a non-SQL-standard TEXT type (but the equivalent of what we are building here is not TEXT but VARCHAR, AFAIU)

i think would be more amenable to (an alias of string) if we start showing a FutureWarning if passing ‘str’ as a dtype (in preparation for the switch)

IMO, we shouldn't do that now, let's first get some experience with it as an opt-in experimental feature.
But I think we agree the long term goal is to change "str" to mean this StringDtype at some point? For me, that is another reason to go with "string" now, as that will cause less confusion in the future (at some point, "str" and "string" can be aliases then).
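The FutureWarning idea quoted above could look roughly like the following. This is a hypothetical sketch in plain Python, not pandas code: `resolve_dtype`, `_ALIASES`, and the `warn_on_str` flag are invented for illustration, and the thread's conclusion is that no such warning should be added yet.

```python
import warnings

# Hypothetical alias table; these names are illustrative, not pandas internals.
_ALIASES = {"string": "StringDtype", "object": "object"}


def resolve_dtype(name, warn_on_str=False):
    """Map a dtype alias to a dtype name, optionally warning about 'str'."""
    if name == "str":
        if warn_on_str:
            # Signal that the meaning of 'str' may change in a later release.
            warnings.warn(
                "'str' currently means object dtype; it may mean the string "
                "extension dtype in a future version.",
                FutureWarning,
                stacklevel=2,
            )
        return "object"
    return _ALIASES[name]
```

With `warn_on_str=False` the alias behaves as today ('str' resolves to object); flipping the flag would be the first step of a deprecation cycle before 'str' and 'string' become aliases.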

jreback · 2019-09-30T16:04:07Z

yep i’ll retract my support for Text and go with String

but i think we need to make it very clear that for now str is not the same as string

TomAugspurger · 2019-10-01T12:00:24Z

Yep, I think we'll want dtype='str' to mean this new EA dtype in the future.

And we'll likely want to deprecate .str on object-dtype columns. But that's down the road.

I'll make the switch back to StringDtype / StringArray.

jreback · 2019-10-02T12:17:08Z

pandas/core/arrays/string_.py

+        self._validate()
+
+    def _validate(self):
+        """Validate that we only store NA or strings."""


does this pass if you pass a StringArray itself? (and do we infer correctly in is_string_array)?

does this pass if you pass a StringArray itself?

Done.

and do we infer correctly in is_string_array

We (Cython) actually raise on lib.is_string_array(StringArray) since it's expecting an ndarray. Not sure what's best here. IIUC, we only use len(values) and values.dtype.

pandas/core/arrays/string_.py

pandas/core/strings.py

jreback · 2019-10-02T12:22:09Z

pandas/core/strings.py

+    else:
+        dtype = None
+
+    result = arr._constructor_expanddim(


does arr.dtype work here?

Not always: #27953

pandas/tests/arrays/string_/test_text.py

jreback · 2019-10-05T23:18:08Z

thanks @TomAugspurger very nice. IIRC there are a couple of followups.

jreback · 2019-10-05T23:18:29Z

did this have an issue to close?

TomAugspurger · 2019-10-07T11:33:37Z

There's the main String Dtype issue: #8640. Leaving that open for part 2, which is storing strings natively, rather than as Python objects.

TomAugspurger added the Dtype Conversions, Strings, and ExtensionArray labels Aug 16, 2019

test fixups

3ecb5cc

string dtype

59a7d39

TomAugspurger added 9 commits August 16, 2019 14:52

35 compat

7c07070

doc

9e1a73b

fixups

16ccad8

doc

1027463

doc

aafb53b

Merge remote-tracking branch 'upstream/master' into ea-string

9cdfe2f

fix doc warnings

ab49169

fixup docstrings

978fb55

fixup docstrings

aebc688

TomAugspurger mentioned this pull request Aug 28, 2019

Fixing schema handling in arrow-parquet dask/dask#5307

Merged

2 tasks

jreback requested changes Sep 8, 2019

Merge remote-tracking branch 'upstream/master' into ea-string

d90d0ad

lint

41dc0f9

WillAyd reviewed Sep 9, 2019

fixup

8714a53

TomAugspurger added 5 commits October 1, 2019 07:00

Merge remote-tracking branch 'upstream/master' into ea-string

41f234c

rename

dc9ef3c

rename

9419af2

doc updates

462b29d

fixups

0391563

jreback reviewed Oct 2, 2019

jorisvandenbossche mentioned this pull request Oct 3, 2019

API/ENH: dtype='string' / pd.String #8640

Closed

TomAugspurger added this to the 1.0 milestone Oct 3, 2019

TomAugspurger added 4 commits October 3, 2019 08:37

Merge remote-tracking branch 'upstream/master' into ea-string

129fe29

move and perf

6aebd8c

test is_string_dtype

2ee5e30

helper

7e92cde

jreback approved these changes Oct 5, 2019

jreback merged commit 9cfb8b5 into pandas-dev:master Oct 5, 2019

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this pull request Oct 7, 2019

fix mypy error introduced in pandas-dev#27949

9b147ce

WillAyd mentioned this pull request Nov 11, 2019

WIP: Dict Array Extension #29557

Closed

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019

API: Add string extension type (pandas-dev#27949)

b78c9f6

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019

API: Add string extension type (pandas-dev#27949)

c4e1ecd

bongolegend pushed a commit to bongolegend/pandas that referenced this pull request Jan 1, 2020

API: Add string extension type (pandas-dev#27949)

aec946f

simonjayhawkins mentioned this pull request Jul 24, 2020

BUG: ValueError in read_csv when dtype='string' and parse_dates is present #34066

Closed

3 tasks

jorisvandenbossche mentioned this pull request Sep 13, 2020

PERF: StringArray construction #36325

Merged

simonjayhawkins mentioned this pull request May 12, 2021

[ArrowStringArray] TST: parametrize str.extractall tests #41419

Merged

jorisvandenbossche mentioned this pull request Mar 17, 2023

BUG: any() and all() raise with extension strings #51939

Open

3 tasks
