HDF file compression not working #29310
Comments
Correct URL is https://github.com/pandas-dev/pandas/pull/28890/files |
Hi @WuraolaOyewusi, are you currently working on this issue? If not, do you mind if I give it a shot? |
@yp1996 . Please go ahead. |
@yp1996, are you still working on this? If not, I'd like to work on it! |
@joybhallaa we haven't heard from them in over 2 weeks, so if this is something you'd like to work on I think you're OK to go ahead - please comment 'take' so the issue will be assigned to you |
take |
Sure - you can see #29404 for a starting point if you like |
@MarcoGorelli , sure will do! |
Hey all,
These tests are checking for conditions and returning true or false, but not changing the values for these 2 parameters. I may be wrong about some things, as I am still fairly new to pytest. |
Hi joybhalla, |
@sathyz yes ! |
@sathyz feel free to take this issue ! |
Hi, as a newbie, I would love to take on this issue if it's okay. |
Thank you for the suggestion. I will definitely do that. I will come back if @sathyz doesn't want to work on this. |
take |
So I have looked into the code and here are the problems I found:
with HDFStore(tmpfile, "w", complevel=5, complib='bzip2') as store:
    store.put(gname, df, format='t')
would result in bzip2(5), whereas
with HDFStore(tmpfile, "w", complevel=5, complib='bzip2') as store:
    store.put(gname, df, format='t', complevel=2, complib='zlib')
would result in zlib(2). So my suggestions are
But IMO, doing just the former (and maybe fixing the check) would already resolve this issue. The latter one deserves its own issue. |
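For reference, the store-level versus `put`-level behaviour described above can be checked by reading the filters back through PyTables. This is a minimal sketch, assuming pandas and PyTables (with bzip2 support) are installed; the helper `table_filters`, the key `"df"`, and the file names are hypothetical and chosen only for illustration.

```python
import os
import tempfile

import pandas as pd
import tables


def table_filters(path, key="df"):
    """Return (complib, complevel) of the PyTables node behind a table-format key."""
    # Hypothetical helper: table-format data lives at /<key>/table.
    with tables.open_file(path, mode="r") as h5file:
        node = h5file.get_node(f"/{key}/table")
        return node.filters.complib, node.filters.complevel


df = pd.DataFrame({"a": range(1000), "b": [0.5] * 1000})
results = {}

with tempfile.TemporaryDirectory() as tmpdir:
    # Case 1: compression is set only on the store.
    path = os.path.join(tmpdir, "store_default.h5")
    with pd.HDFStore(path, mode="w", complevel=5, complib="bzip2") as store:
        store.put("df", df, format="t")
    results["store_default"] = table_filters(path)

    # Case 2: put() passes its own complevel/complib for this table.
    path = os.path.join(tmpdir, "put_override.h5")
    with pd.HDFStore(path, mode="w", complevel=5, complib="bzip2") as store:
        store.put("df", df, format="t", complevel=2, complib="zlib")
    results["put_override"] = table_filters(path)

print(results)
```

If the behaviour matches the comment above, the first entry reports bzip2 at level 5 and the second reports zlib at level 2.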
update: pandas/tests/io/pytables/test_store.py::TestHDFStore::test_complibs_default_settings
# Set complib and check to see if compression is disabled
with ensure_clean_path(setup_path) as tmpfile:
    df.to_hdf(tmpfile, "df", complib="zlib")
    result = pd.read_hdf(tmpfile, "df")
    tm.assert_frame_equal(result, df)
    with tables.open_file(tmpfile, mode="r") as h5file:
        for node in h5file.walk_nodes(where="/df", classname="Leaf"):
            assert node.filters.complevel == 0
            assert node.filters.complib is None |
Thanks @quangngd for all the research on this. I'm not an expert in HDF, but it feels like there is one compression setting for the whole store, and another one for each table. If I'm understanding it correctly, the test is checking that the tables are not compressed when using Can you run a |
Tracing back to PR #16355, @jreback had already mentioned this issue in the code review: #16355 (comment), but in the end it was still merged. |
xref #28890 (review)
While updating the performance comparison part of the IO docs, it was found that the compressed .hdf files were the same size as the uncompressed .hdf files.
This seems to be caused by the next lines saving the same files:
We need to see why the complib parameter is being ignored, and fix it so the hdf5 file is saved compressed when used.
CC @datapythonista
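A minimal sketch of how the symptom could be reproduced, assuming pandas with PyTables installed. The frame contents, the labels in `options`, and the file name are made up for illustration and are not the lines from the docs. On the pandas versions discussed in this thread, passing only `complib` is expected to leave the file about as large as the uncompressed one, while passing `complevel` together with `complib` should shrink it; other versions may behave differently.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Highly repetitive data, so real compression is easy to spot in the file size.
df = pd.DataFrame({"a": np.zeros(100_000), "b": np.ones(100_000)})

# Hypothetical labels mapped to the keyword arguments passed to to_hdf.
options = {
    "no options": {},
    "complib only": {"complib": "zlib"},
    "complevel and complib": {"complevel": 9, "complib": "zlib"},
}

sizes = {}
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "data.h5")
    for label, kwargs in options.items():
        # mode="w" overwrites the file, so each size reflects one write only.
        df.to_hdf(path, key="df", mode="w", **kwargs)
        sizes[label] = os.path.getsize(path)

for label, size in sizes.items():
    print(f"{label:>22}: {size} bytes")
```

Comparing file sizes is a coarse check; reading `node.filters` through PyTables, as the test quoted in the comments does, shows directly which filters were applied.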