fix(_dtypes): non pandas boolean numpy type was deprecated #927

Merged
merged 1 commit into from
Jul 30, 2024


@ThomasDsantos ThomasDsantos commented Jul 30, 2024

Hello everyone!

I'm no expert on Parquet files, but while using this library to decode a file that I cannot share, I hit an AttributeError coming from the NumPy dependency:

  File "test.py", line 14, in main
    df = pd.read_parquet(buffer_, engine="fastparquet")
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "venv/lib/python3.12/site-packages/pandas/io/parquet.py", line 667, in read_parquet
    return impl.read(
           ^^^^^^^^^^
  File "venv/lib/python3.12/site-packages/pandas/io/parquet.py", line 402, in read
    parquet_file = self.api.ParquetFile(path, **parquet_kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "venv/lib/python3.12/site-packages/fastparquet/api.py", line 135, in __init__
    self._parse_header(fn, verify)
  File "venv/lib/python3.12/site-packages/fastparquet/api.py", line 236, in _parse_header
    self._set_attrs()
  File "venv/lib/python3.12/site-packages/fastparquet/api.py", line 250, in _set_attrs
    self._dtypes()
  File "venv/lib/python3.12/site-packages/fastparquet/api.py", line 996, in _dtypes
    dtype[col] = np.float_()
                 ^^^^^^^^^
  File "venv/lib/python3.12/site-packages/numpy/__init__.py", line 397, in __getattr__
    raise AttributeError(
AttributeError: `np.float_` was removed in the NumPy 2.0 release. Use `np.float64` instead.. Did you mean: 'float16'?

This was with the following dependencies:

    "pandas==2.2.2",
    "pyarrow==17.0.0",
    "numpy==2.0.0",
    "fastparquet==2024.5.0",

I could downgrade to numpy 1.26.0 as a quick workaround, but this seemed easy to fix, so I opened this PR. Please feel free to comment and discuss, or to fix it another way.
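For illustration only, here is a minimal sketch of the idea behind the change (not the actual diff): `np.float64` is available on both NumPy 1.x and 2.x, whereas the `np.float_` alias raises the AttributeError shown above on 2.0. The helper name `nullable_float_dtype` is hypothetical and does not exist in fastparquet.

```python
import numpy as np


def nullable_float_dtype():
    """Return a float dtype that can hold NaN for missing values.

    Hypothetical helper: it spells the type as np.float64, which exists
    on NumPy 1.x and 2.x alike, instead of the removed np.float_ alias.
    """
    return np.dtype(np.float64)


# hasattr() returns False when the attribute lookup raises AttributeError,
# which is what NumPy 2.0 does for the removed alias.
has_old_alias = hasattr(np, "float_")

print("NumPy version:", np.__version__)
print("np.float_ still available:", has_old_alias)
print("dtype used instead:", nullable_float_dtype())
```

Checking for the alias with `hasattr` is only there to show the difference between versions; code that simply uses `np.float64` needs no version check at all.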

I'm terribly sorry that I can't share the original parquet file. I tried to dig into the file, but I wasn't able to isolate the problem from the metadata and all these corner cases (rg[1][i][3].get(12).get(3) == 0, what? 😂).

I don't know if this helps, but here is some metadata from the file that was causing the error:

# my_parquet.parquet
[...]
org.apache.spark.sql.parquet.row.metadata\x18`{"type":"struct","fields":[{"name":"my_column","type":"boolean","nullable":true,"metadata":{}}]}\x00\x18Zparquet-mr version 1.10.99.7.1.7.1000-14
[...]
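In case it helps others check whether their own files are affected, here is a small sketch that reads the Spark schema JSON from the excerpt above and lists the nullable boolean fields. It only uses the standard library; the function name `nullable_boolean_fields` is made up for this example, and extracting the JSON string from the file footer is left out.

```python
import json

# Schema JSON copied from the metadata excerpt above.
spark_schema = (
    '{"type":"struct","fields":[{"name":"my_column","type":"boolean",'
    '"nullable":true,"metadata":{}}]}'
)


def nullable_boolean_fields(schema_json):
    """Return the names of top-level fields that are boolean and nullable."""
    schema = json.loads(schema_json)
    return [
        field["name"]
        for field in schema.get("fields", [])
        if field.get("type") == "boolean" and field.get("nullable")
    ]


print(nullable_boolean_fields(spark_schema))
```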

@martindurant
Member

Thanks for taking a look - looks like a numpy 2 change that we missed.

When decoding nullable boolean columns from parquet files created without pandas,
we hit an AttributeError on np.float_, which was removed in NumPy 2.0 and is replaced here by np.float64.
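A rough sketch of why a float dtype comes into play for this case: a plain NumPy boolean array has no way to represent a missing value, so nulls can be carried as NaN in a float64 array instead. This is an illustration of the concept under that assumption, not fastparquet's decoding code, and `decode_nullable_bools` is a hypothetical name.

```python
import numpy as np


def decode_nullable_bools(values):
    """Convert True/False/None values to float64, mapping None to NaN."""
    out = np.empty(len(values), dtype=np.float64)
    for i, value in enumerate(values):
        out[i] = np.nan if value is None else float(value)
    return out


decoded = decode_nullable_bools([True, None, False])
print(decoded.dtype, decoded)
```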
@martindurant
Member

It seems the frequency string for "hour" has changed to "h" (see https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects); I will fix that in a separate PR.
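As a quick sketch of the lowercase alias in use (assuming a pandas release that accepts "h" for hourly frequency, as the linked guide describes):

```python
import pandas as pd

# Build three hourly timestamps using the lowercase "h" alias.
hourly = pd.date_range("2024-07-30", periods=3, freq="h")
step = hourly[1] - hourly[0]

print(hourly)
print("step between entries:", step)
```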

@martindurant martindurant merged commit 36ed695 into dask:main Jul 30, 2024
19 of 20 checks passed