read_feather method does not restore a dataframe's table-level user-defined properties (attributes or tags or metadata) #324

Closed
techvslife opened this issue Nov 21, 2017 · 5 comments

Comments


techvslife commented Nov 21, 2017

I create a property named someid (it is NOT a column) in a pandas dataframe named df and assign it a value:
df.someid = 24
I make sure the property is there:
print(df.someid) #prints 24
I save the dataframe to a feather file:
df.to_feather("C:/pandas/DataLoss.feather")
I read the feather file back into a dataframe:
df = pd.read_feather("C:/pandas/DataLoss.feather")
I try to retrieve the property from the dataframe:
print(df.someid)
And I do not get its value (24) back. Instead I get this error message:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-99-42d4540986ab> in <module>()
----> 1 print(df.someid)

~\Anaconda3\envs\env_feather\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   3079             if name in self._info_axis:
   3080                 return self[name]
-> 3081             return object.__getattribute__(self, name)
   3082 
   3083     def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'someid'
  1. This may well be by design, although afaik undocumented, but it seems to me that an error should be generated whenever user data is lost--if the feather format is intended to be the native storage format for a pandas dataframe. (However, the inability to store indexes is documented; fwiw, I wish that limitation weren't present either.)
  2. Or it may be there is a newer, better way of specifying user properties (tags / metadata) that I don't know about, in which case perhaps the old method should give a warning and be deprecated.
  3. Note that I could store the property as a column (as a repeating constant), but that uses up significant space as the number of rows increases (as the value repeats on every row)--even if stored as a categorical. That wouldn't be an issue if there were compression, but there appears not to be compression. I could use parquet format instead, and perhaps in the latest release that would be recommended over feather format. (I haven't seen a good comparison of the pros and cons of each.)

Thanks for any assistance or comments. (Apologies in advance if I missed something in the docs or online; I didn't see anything relevant after a reasonable search, except the unapproved use of the undocumented _metadata.)

I realize there are thorny, perhaps unresolvable, problems with dataframe table-level metadata propagation (e.g. how to handle vertical dataframe concatenations), but I think what happens (by design) should be documented. As it is, I don't know what is supposed to be happening by design here.

techvslife changed the title from "Feather does not restore a dataframe's user-defined properties (tags or metadata)" to "read_feather method does not restore a dataframe's user-defined properties (tags or metadata)" on Nov 21, 2017
techvslife changed the title from "read_feather method does not restore a dataframe's user-defined properties (tags or metadata)" to "read_feather method does not restore a dataframe's user-defined properties (attributes or tags or metadata)" on Nov 21, 2017
techvslife changed the title from "read_feather method does not restore a dataframe's user-defined properties (attributes or tags or metadata)" to "read_feather method does not restore a dataframe's table-level user-defined properties (attributes or tags or metadata)" on Nov 21, 2017
Owner

wesm commented Nov 21, 2017

Not even pickle preserves these properties when using pandas, so I don't think this is something we intend to fix in generality:

In [1]: import pandas as pd

In [2]: df = pd.util.testing.makeDataFrame()

In [3]: df.someid = 24

In [4]: import pickle

In [5]: pickle.loads(pickle.dumps(df)).someid
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-71102f1fb31f> in <module>()
----> 1 pickle.loads(pickle.dumps(df)).someid

/home/wesm/anaconda3/envs/arrow-test/lib/python3.5/site-packages/pandas/core/generic.py in __getattr__(self, name)
   3612             if name in self._info_axis:
   3613                 return self[name]
-> 3614             return object.__getattribute__(self, name)
   3615 
   3616     def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'someid'
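
A minimal sketch of the same behavior as a script, plus one possible way around it: keep the value in a plain dict that travels next to the frame rather than on the frame itself. The `bundle` layout and its key names are illustrative assumptions, not a pandas API or a recommendation from this thread.

```python
import pickle

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
df.someid = 24  # ad-hoc instance attribute, not a column

# As in the session above, the attribute is not part of the pickled state,
# so it is missing after a round trip.
roundtrip = pickle.loads(pickle.dumps(df))
attribute_survived = hasattr(roundtrip, "someid")

# Carrying the value in an ordinary dict alongside the frame does survive,
# because pickle serializes the dict's contents in full.
bundle = {"frame": df, "meta": {"someid": df.someid}}
restored = pickle.loads(pickle.dumps(bundle))
restored_someid = restored["meta"]["someid"]
```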

wesm closed this as completed Nov 21, 2017
Author

techvslife commented Nov 21, 2017

Thank you for the speedy response (and for creating pandas!). In that case, I hope that some sort of compression support gets added to feather. (I have to store the property as a column to preserve it, so that means repeating it many, many times.)

Owner

wesm commented Nov 21, 2017

I'm waiting on the R community to get more involved; we have had compression support in Apache Arrow for a very long time.

Author

Great, hope they catch up. (It's a big advantage to have compression, especially when you need to move around a hundred thousand dataframes with three hundred thousand rows each!)

Author

(just fyi: some useful cross-references:
Looks like the "lost attribute" issue popped up as early as 2012:
https://stackoverflow.com/questions/13250499/attributes-to-a-subclass-of-pandas-dataframe-disappear-after-pickle

One of the more comprehensive discussions I've found:
pandas-dev/pandas#2485
)
