Backport PR #54794 on branch 2.1.x (Infer string storage based on infer_string option) (#54839)

Backport PR #54794: Infer string storage based on infer_string option

Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
meeseeksmachine and phofl authored Aug 29, 2023
1 parent 0084f77 commit 3b7f411
Showing 4 changed files with 20 additions and 3 deletions.
6 changes: 5 additions & 1 deletion doc/source/whatsnew/v2.1.0.rst
@@ -39,11 +39,15 @@ We are collecting feedback on this decision `here <https://github.com/pandas-dev
 Avoid NumPy object dtype for strings by default
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-Previously, all strings were stored in columns with NumPy object dtype.
+Previously, all strings were stored in columns with NumPy object dtype by default.
 This release introduces an option ``future.infer_string`` that infers all
 strings as PyArrow backed strings with dtype ``"string[pyarrow_numpy]"`` instead.
 This is a new string dtype implementation that follows NumPy semantics in comparison
 operations and will return ``np.nan`` as the missing value indicator.
+Setting the option will also infer the dtype ``"string"`` as a :class:`StringDtype` with
+storage set to ``"pyarrow_numpy"``, ignoring the value behind the option
+``mode.string_storage``.
+
 This option only works if PyArrow is installed. PyArrow backed strings have a
 significantly reduced memory footprint and provide a big performance improvement
 compared to NumPy object (:issue:`54430`).
6 changes: 5 additions & 1 deletion pandas/core/arrays/string_.py
@@ -112,7 +112,11 @@ def na_value(self) -> libmissing.NAType | float:  # type: ignore[override]

     def __init__(self, storage=None) -> None:
         if storage is None:
-            storage = get_option("mode.string_storage")
+            infer_string = get_option("future.infer_string")
+            if infer_string:
+                storage = "pyarrow_numpy"
+            else:
+                storage = get_option("mode.string_storage")
         if storage not in {"python", "pyarrow", "pyarrow_numpy"}:
             raise ValueError(
                 f"Storage must be 'python' or 'pyarrow'. Got {storage} instead."
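
The resolution order in this hunk can be sketched outside pandas. The snippet below is a minimal illustration, not the pandas implementation: `resolve_storage`, `VALID_STORAGES`, and the plain `options` dict are hypothetical stand-ins for `StringDtype.__init__` and `get_option`, and the error message is simplified. It shows that an explicit `storage` argument is used as given, that `future.infer_string` takes precedence when `storage` is None, and that `mode.string_storage` is only consulted otherwise.

```python
# Hypothetical stand-in for the storages accepted by the check in the hunk above.
VALID_STORAGES = {"python", "pyarrow", "pyarrow_numpy"}


def resolve_storage(storage, options):
    """Pick a storage name following the order shown in the diff.

    `options` is a plain dict standing in for pandas' option registry
    (an assumption for illustration; pandas itself calls get_option()).
    """
    if storage is None:
        if options.get("future.infer_string", False):
            # The future option wins and mode.string_storage is ignored.
            storage = "pyarrow_numpy"
        else:
            storage = options["mode.string_storage"]
    if storage not in VALID_STORAGES:
        raise ValueError(f"Unsupported storage: {storage!r}")
    return storage


options = {"future.infer_string": True, "mode.string_storage": "python"}
print(resolve_storage(None, options))       # pyarrow_numpy
print(resolve_storage("pyarrow", options))  # pyarrow

options["future.infer_string"] = False
print(resolve_storage(None, options))       # python
```

An explicitly passed storage bypasses both options, which matches the hunk: the option lookups happen only inside the `if storage is None` branch.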
3 changes: 2 additions & 1 deletion pandas/core/config_init.py
@@ -492,7 +492,8 @@ def use_inf_as_na_cb(key) -> None:

 string_storage_doc = """
 : string
-    The default storage for StringDtype.
+    The default storage for StringDtype. This option is ignored if
+    ``future.infer_string`` is set to True.
 """

 with cf.config_prefix("mode"):
8 changes: 8 additions & 0 deletions pandas/tests/series/test_constructors.py
@@ -2115,6 +2115,14 @@ def test_series_string_inference_array_string_dtype(self):
             ser = Series(np.array(["a", "b"]))
         tm.assert_series_equal(ser, expected)

+    def test_series_string_inference_storage_definition(self):
+        # GH#54793
+        pytest.importorskip("pyarrow")
+        expected = Series(["a", "b"], dtype="string[pyarrow_numpy]")
+        with pd.option_context("future.infer_string", True):
+            result = Series(["a", "b"], dtype="string")
+        tm.assert_series_equal(result, expected)
+

 class TestSeriesConstructorIndexCoercion:
     def test_series_constructor_datetimelike_index_coercion(self):
