scan_parquet + sink_parquet with same filename PanicException and file truncation #12843

Open
2 tasks done
cmdlineluser opened this issue Dec 1, 2023 · 2 comments
Labels: A-io-parquet (Area: reading/writing Parquet files), A-panic (Area: code that results in panic exceptions), bug (Something isn't working), P-low (Priority: low), python (Related to Python Polars)

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import tempfile
import polars as pl

f = tempfile.NamedTemporaryFile()

df = pl.DataFrame({"A": [1]})
df.write_parquet(f.name)

print(pl.read_parquet(f.name))
# ┌─────┐
# │ A   │
# │ --- │
# │ i64 │
# ╞═════╡
# │ 1   │
# └─────┘

pl.scan_parquet(f.name).sink_parquet(f.name)
# ComputeError: parquet: File out of specification: A parquet file must contain a header and footer with at least 12 bytes

Log output

No response

Issue description

Just noticed Polars lets you use the same filename here, but it ends up truncating the file.

Not sure if this is supposed to be allowed or not, but if not, it should raise and leave the input intact?
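
A caller-side guard along these lines would give the "raise and leave the input intact" behaviour in the meantime. This is only a sketch: ensure_distinct_paths is a hypothetical helper, not a Polars function, and it assumes both paths are local files so that the standard library's os.path.samefile can detect aliases such as relative paths or symlinks.

```python
import os


def ensure_distinct_paths(source, target):
    """Raise if `target` refers to the same file as `source`.

    Hypothetical helper for illustration; not part of Polars.
    """
    # samefile needs both paths to exist. A target that does not exist yet
    # cannot alias the source, so it is safe to write to.
    if os.path.exists(target) and os.path.samefile(source, target):
        raise ValueError(
            f"refusing to sink to {target!r}: it is also the scan source"
        )
```

Calling ensure_distinct_paths(f.name, f.name) before the scan_parquet(...).sink_parquet(...) line in the example above would raise before the file is ever opened for writing.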

Expected behavior

"Work" or raise exception and leave input intact.

Installed versions

--------Version info---------
Polars:               0.19.18
Index type:           UInt32
Platform:             macOS-13.6.1-arm64-arm-64bit
Python:               3.11.6 (main, Nov  2 2023, 04:39:40) [Clang 14.0.0 (clang-1400.0.29.202)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               2023.6.0
gevent:               <not installed>
matplotlib:           <not installed>
numpy:                1.26.2
openpyxl:             <not installed>
pandas:               2.0.3
pyarrow:              12.0.1
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
cmdlineluser added the bug and python labels on Dec 1, 2023

Putnam14 commented Jan 4, 2024

I just got bitten by this too, with pl.scan_parquet(f.name).with_columns(pl.col(pl.Utf8).str.strip_chars()).sink_parquet(f.name).

Using the seek_len function to determine the file size feels sketchy at best in an asynchronous context.

On rust-lang/rust#59359 there's mention of using file.metadata().len() to get the size of a file without requiring a mutable borrow. Maybe that's a better fit here?

I found an older discussion of this, and it does make sense that you wouldn't be able to write to the same file you're streaming from, since it's already open. I do think the error message could be improved here.
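
For anyone who needs the in-place rewrite today, a common workaround is to sink to a temporary file in the same directory and swap it in once the write has finished. The sketch below assumes only that the object passed in has a sink_parquet(path) method, as a Polars LazyFrame does; sink_parquet_in_place is a hypothetical name, not a Polars API, and the swap may behave differently on platforms that refuse to replace a file that is still open or memory-mapped.

```python
import os
import tempfile
from pathlib import Path


def sink_parquet_in_place(lazy_frame, path):
    """Sink `lazy_frame` to `path`, even if the query scans `path` itself.

    Hypothetical helper for illustration; not part of Polars.
    """
    target = Path(path).resolve()
    # Put the temporary file next to the target so os.replace stays on one
    # filesystem and can swap the files in a single rename.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".parquet.tmp")
    os.close(fd)
    try:
        # The source file is only read here; nothing has touched it yet.
        lazy_frame.sink_parquet(tmp_name)
        os.replace(tmp_name, target)
    except BaseException:
        # On failure, drop the partial output and leave the input intact.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
```

With this, the failing call from the report would become sink_parquet_in_place(pl.scan_parquet(f.name), f.name).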

stinodego added the needs triage label on Jan 13, 2024
stinodego added the A-io-parquet label on Jan 21, 2024
@cmdlineluser (author)

This now seems to panic on 1.8.2:

thread '<unnamed>' panicked at crates/polars-io/src/parquet/read/mmap.rs:49:23:
range end index 92 out of range for slice of length 0
PanicException: range end index 92 out of range for slice of length 0
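
The "slice of length 0" would be consistent with the output file being truncated as soon as it is opened for writing, before the reader has fetched the bytes it expects. That is a guess at the mechanism, not something confirmed from the Polars source. The plain-Python sketch below only shows the general truncate-on-open behaviour, with 92 arbitrary bytes standing in for the Parquet file:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.bin")

    # Stand-in for a small Parquet file: 92 bytes of arbitrary data.
    with open(path, "wb") as fh:
        fh.write(b"\x00" * 92)
    size_before = os.path.getsize(path)

    # Opening for writing truncates immediately, before any byte is written.
    with open(path, "wb"):
        size_during = os.path.getsize(path)

print(size_before, size_during)  # prints: 92 0
```

Any reader that still expects the original length at that point will find nothing to read.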

cmdlineluser changed the title from "scan_parquet + sink_parquet with same filename raises exception and truncates file" to "scan_parquet + sink_parquet with same filename PanicException and file truncation" on Sep 29, 2024
deanm0000 added the P-low and A-panic labels and removed the needs triage label on Sep 30, 2024