Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial Backport of string changes for 2.3 release #59513

Open
wants to merge 42 commits into
base: 2.3.x
Choose a base branch
from

Conversation

WillAyd
Copy link
Member

@WillAyd WillAyd commented Aug 14, 2024

No description provided.

@rhshadrach rhshadrach added the Strings String extension data type and string data label Aug 14, 2024
@rhshadrach rhshadrach added this to the 2.2.3 milestone Aug 14, 2024
@WillAyd WillAyd force-pushed the 2.3.0 branch 3 times, most recently from e126806 to a2c3db0 Compare August 15, 2024 17:54
@WillAyd
Copy link
Member Author

WillAyd commented Aug 15, 2024

I'm unsure about the failing tests. Seems like we might have behavior in our string types that works off of what is done in #57361 but guessing we don't want to backport that PR to 2.3?

@WillAyd
Copy link
Member Author

WillAyd commented Aug 21, 2024

A lot of failures seem related to the handling of to_numeric with errors="ignore", which was removed for the 3.0 release.

Prior to #57361, this is what a to_numeric call would yield:

>>> pd.to_numeric('-47393996303418497800', errors="ignore")
'-47393996303418497800'

with all of the backport, it looks like the new behavior casts it to a float:

>>> pd.to_numeric('-47393996303418497800', errors="ignore")
-4.73939963e+19

@WillAyd
Copy link
Member Author

WillAyd commented Aug 22, 2024

@mroeschke could I get some of your help on the test failures? A lot of them look related to NumPy and have to do with how it handles overflows, casting, and some syntax things with numexpr.

Any chance you saw those on main and know what the fix is? https://github.com/pandas-dev/pandas/actions/runs/10514684927/job/29133089833?pr=59513#step:8:2013

@jbrockmendel
Copy link
Member

@WillAyd looking at the failures in the first build, i think #58484 may need to be backported

@mroeschke
Copy link
Member

I don't remember seeing failures related to eval but I'll dig a little bit more.

In addition to what @jbrockmendel mentioned, you'll need #59545 and the nomkl install part of #59553

@jbrockmendel
Copy link
Member

Getting close here

___________________ TestBasic.test_write_index[fastparquet] ____________________
[gw2] linux -- Python 3.12.5 /home/runner/micromamba/envs/test/bin/python3.12
[XPASS(strict)] fastparquet write into index

@bnavigator
Copy link
Contributor

I don't remember seeing failures related to eval but I'll dig a little bit more.

Already reported in May (und unfortunately not picked up by anyone) here: #58548. It tracks the failures in main and sees a fix somewhere between a3e751c and 236d89b.

Fixed by backport here: #59535

@WillAyd
Copy link
Member Author

WillAyd commented Aug 27, 2024

It tracks the failures in main and sees a fix somewhere between a3e751c and 236d89b.

I don't believe that issue was directly solved, but it was at least removed from the test suite as part of changes in https://github.com/pandas-dev/pandas/pull/56709/files#diff-f145e441b5820d235a78589ee9ee6c9c49fea0de6ca659a972b61b6aa29f9df8 that replaced the fixture of:

@pytest.mark.parametrize("dtype", [np.float32, np.float64])

with a fixture that instead uses:

[float, "float32", "float64"]

@lithomas1
Copy link
Member

lithomas1 commented Sep 7, 2024

@WillAyd

Mind if I push some fixes for the non string dtype builds to your branch?
I finally got to taking a look at this PR and I have a couple of fixes locally.

@WillAyd
Copy link
Member Author

WillAyd commented Sep 7, 2024

Sure by all means. FWIW I've been trying to keep the order of cherry picked commits maintained and put any custom patches at the end

WillAyd and others added 22 commits September 20, 2024 13:54
…t pyarrow (pandas-dev#59437)

* TST (string dtype): add test build with future strings enabled without pyarrow

* ensure the build doesn't override the default ones

* uninstall -> remove

* avoid jobs with same env being cancelled

* use different python version for both future jobs

* add some xfails

* fixup xfails

* less strict
* REF (string): de-duplicate _str_map (2)

* mypy fixup
…pandas-dev#59470)

* String dtype: fix convert_dtypes() to convert NaN-string to NA-string

* fix CoW tracking for conversion to python storage strings

* remove xfails
… None) (pandas-dev#59488)

* String dtype: honor mode.string_storage option (and change default to None)

* fix test + explicitly test default

* use 'auto' instead of None
…9505)

* BUG: ArrowEA comparisons with mismatched types

* move whatsnew

* GH ref
…andas-dev#59514)

* REF (string): Move StringArrayNumpySemantics methods to base class

* mypy fixup
* REF (string): remove _str_na_value

* mypy fixup
…ss (pandas-dev#59501)

* REF: move ArrowStringArrayNumpySemantics methods to parent class

* REF: move methods to ArrowStringArray

* mypy fixup

* Fix incorrect double-unpacking

* move methods to subclass
…pandas-dev#59526)

* API (string): return str dtype for .dt methods, DatetimeIndex methods

* mypy fixup
@lithomas1 lithomas1 added the Build Library building on various platforms label Sep 20, 2024
@lithomas1
Copy link
Member

(tagging build so we get some CI signal from the wheel builders as well)

…maybe_converts_object`) if requested (pandas-dev#59487)

* String dtype: maybe_converts_object give precedence to nullable dtype

* update datetimelike input validation

* update tests and remove xfails

* explicitly test pd.array() behaviour (remove xfail)

* fixup allow_2d

* undo changes related to datetimelike input validation

* fix test for str on current main

---------

Co-authored-by: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>
@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Sep 20, 2024

@WillAyd looking into some of the failures, and pushed a few commits:

  • first commits is fixing up a backported PR's commit (feel free to merge that with the actual commit in a next rebase)
  • second and fourth (coming) commit are general fixes to tests (for tests that for some reason (e.g. removed) didn't need a fix on main, but need a fix here). Feel free to combine those all in a single commit
  • third commit is backporting String dtype: still return nullable NA-variant in object inference (maybe_converts_object) if requested #59487, which seemed to be missing here. The actual code change in that PR does not apply on this branch, so in practice this PR has become a test-only change. But those test changes were still needed to get a bunch of tests passing

With those changes, the number of failures went from 916 to around 140

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Build Library building on various platforms Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants