fix(_dtypes): non pandas boolean numpy type was deprecated #927

ThomasDsantos · 2024-07-30T10:33:00Z

Hello everyone!

First, I'm not at all an expert on Parquet files, but while using this library to decode a file that I cannot share, I hit an AttributeError coming from the NumPy dependency:

  File "test.py", line 14, in main
    df = pd.read_parquet(buffer_, engine="fastparquet")
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "venv/lib/python3.12/site-packages/pandas/io/parquet.py", line 667, in read_parquet
    return impl.read(
           ^^^^^^^^^^
  File "venv/lib/python3.12/site-packages/pandas/io/parquet.py", line 402, in read
    parquet_file = self.api.ParquetFile(path, **parquet_kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "venv/lib/python3.12/site-packages/fastparquet/api.py", line 135, in __init__
    self._parse_header(fn, verify)
  File "venv/lib/python3.12/site-packages/fastparquet/api.py", line 236, in _parse_header
    self._set_attrs()
  File "venv/lib/python3.12/site-packages/fastparquet/api.py", line 250, in _set_attrs
    self._dtypes()
  File "venv/lib/python3.12/site-packages/fastparquet/api.py", line 996, in _dtypes
    dtype[col] = np.float_()
                 ^^^^^^^^^
  File "venv/lib/python3.12/site-packages/numpy/__init__.py", line 397, in __getattr__
    raise AttributeError(
AttributeError: `np.float_` was removed in the NumPy 2.0 release. Use `np.float64` instead.. Did you mean: 'float16'?

with these dependencies:

    "pandas==2.2.2",
    "pyarrow==17.0.0",
    "numpy==2.0.0",
    "fastparquet==2024.5.0",

I could downgrade to numpy 1.26.0 as a quick workaround, but this seemed easy to fix, so I opened this PR. Please feel free to comment and discuss, or to fix it another way.

I'm terribly sorry that I can't share the original Parquet file. I tried digging into the file, but I wasn't able to isolate the problem among the metadata and all these corner cases (rg[1][i][3].get(12).get(3) == 0, what? 😂).

I don't know if this helps, but here is some metadata from the file that was causing the error:

# my_parquet.parquet
[...]
org.apache.spark.sql.parquet.row.metadata\x18`{"type":"struct","fields":[{"name":"my_column","type":"boolean","nullable":true,"metadata":{}}]}\x00\x18Zparquet-mr version 1.10.99.7.1.7.1000-14
[...]
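
For reference, the Spark schema embedded in that footer can be inspected with the standard library alone. This is only a sketch: it assumes the JSON string shown above has already been pulled out of the file's key-value metadata, and `nullable_booleans` is a hypothetical helper, not part of fastparquet.

```python
import json

# Schema string copied from the footer metadata shown above.
SPARK_SCHEMA = (
    '{"type":"struct","fields":[{"name":"my_column","type":"boolean",'
    '"nullable":true,"metadata":{}}]}'
)


def nullable_booleans(schema_json: str) -> list[str]:
    """Return the names of top-level fields declared as nullable booleans."""
    schema = json.loads(schema_json)
    return [
        field["name"]
        for field in schema.get("fields", [])
        if field.get("type") == "boolean" and field.get("nullable")
    ]


print(nullable_booleans(SPARK_SCHEMA))
```

Columns reported by a check like this are the ones that would go through the nullable boolean path mentioned in the traceback.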

fastparquet/api.py

martindurant · 2024-07-30T13:20:36Z

Thanks for taking a look - looks like a numpy 2 change that we missed.

When decoding nullable booleans from Parquet files created without pandas, we were hitting an AttributeError on np.float_, which was removed in NumPy 2.0 and replaced by np.float64.
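
A minimal sketch of why the replacement should be safe across versions: `np.float64` exists in both NumPy 1.x and 2.x, while the `np.float_` alias is gone in 2.0. The `dtype` dict and the column name below are illustrative stand-ins and only loosely mirror what `_dtypes` does; this is not the actual fastparquet code.

```python
import numpy as np

# np.float_ was an alias of np.float64 in NumPy 1.x and was removed in 2.0.
has_old_alias = hasattr(np, "float_")
numpy_major = int(np.__version__.split(".")[0])

# Illustrative stand-in for the mapping built in _dtypes (names are hypothetical).
dtype = {}
dtype["my_column"] = np.float64()  # available on NumPy 1.x and 2.x

# A float64 array can represent True/False as 1.0/0.0 and missing values as NaN.
values = np.array([1.0, 0.0, np.nan], dtype=np.float64)

print(numpy_major, has_old_alias, type(dtype["my_column"]).__name__, values.dtype)
```

Using the explicit `np.float64` name avoids any version check in the library itself.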

martindurant · 2024-07-30T14:32:17Z

It seems like the pandas frequency string for "hour" has changed from "H" to "h" (see https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects); will fix in a separate PR.
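
As a rough illustration of that alias change, the sketch below builds an hourly index with the lowercase alias. It assumes pandas 2.2 or newer, where uppercase "H" is deprecated; the date is arbitrary.

```python
import pandas as pd

# Lowercase "h" is the hourly alias going forward; uppercase "H" is
# deprecated as of pandas 2.2.
index = pd.date_range("2024-07-30", periods=3, freq="h")

# Consecutive entries should be exactly one hour apart.
step = index[1] - index[0]
print(step == pd.Timedelta(hours=1))
```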

martindurant reviewed Jul 30, 2024

fix(_dtypes): non pandas boolean numpy type was deprecated

73fbc62

when decoding nullable boolean from parquet files created without pandas we were facing an AttributeError on np.float_ that was deprecated and replaced by np.float64

ThomasDsantos force-pushed the fix/non-pandas-null-booleans branch from 1dda8e6 to 73fbc62 Compare July 30, 2024 13:25

ThomasDsantos requested a review from martindurant July 30, 2024 13:28

martindurant approved these changes Jul 30, 2024

martindurant merged commit 36ed695 into dask:main Jul 30, 2024
19 of 20 checks passed
