Dictionary.save_as_text/load_from_text is dangerous #56
Comments
Yeah, they are not meant to be equivalent (why have 2 equivalent functions?). The text format is meant mostly for human inspection, not as a custom re-implementation of the pickle serialization protocol. The preferred way to save and load objects in gensim is via
Because pickle could be unsafe and the text format can be inspected and modified by humans before it's imported back.
Oh, security in gensim, that's a new one! I'm afraid taking care of that properly would require more effort than just serializing to text in If there are more users with similar needs (or if your solution is sufficiently generic), we can brainstorm some gensim-wide "fix", I'm open to that.
I guess I also wrongly made this assumption. IMHO, the name is misleading. If I can choose between two versions of loading and saving, the UNIX philosophy tells us to go for the text version... :-)
I think a more suitable name for
You're right. I'm in favour of
I am not sure... It might take a little time for me to become familiar with the code, and only after that will I be able to rename it.
I'm coming in a bit late but would like to help: I find the text version of the dictionary extremely useful because I can manipulate it with different (non-Python) tools, e.g., even a spreadsheet. However, I often need to read the new dictionary back in again. After looking at the code, the problem seems to be that reading in an array doesn't reset the Dictionary's num_docs attribute. This is used in filter_extremes, so any call to this method results in zero documents surviving the filter. This also applies when using the load method. I'm not sure if the filter_extremes method needs to re-calculate num_docs in case further calls are made? What I've done is a very quick fix but helps filter_extremes work with loaded dictionaries.
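To illustrate the failure described above, here is a minimal sketch of a document-frequency filter. It is a simplified stand-in, not gensim's actual implementation: the function name, the `dfs` mapping, and the parameter defaults are assumptions made for illustration. It assumes the upper bound is computed as a fraction of `num_docs`, which is why a missing document count filters out everything.

```python
# Simplified stand-in for a document-frequency filter; NOT gensim's actual code.
def filter_extremes_sketch(dfs, num_docs, no_below=5, no_above=0.5):
    """Keep token ids whose document frequency lies in [no_below, no_above * num_docs]."""
    max_df = int(no_above * num_docs)  # absolute upper bound on document frequency
    return {token_id for token_id, df in dfs.items() if no_below <= df <= max_df}


# Hypothetical mapping: token id -> number of documents containing the token.
dfs = {0: 10, 1: 40, 2: 3}

# With the real document count, tokens 0 and 1 survive (token 2 is too rare).
kept_with_count = filter_extremes_sketch(dfs, num_docs=100)

# With num_docs left at 0, the upper bound is int(0.5 * 0) == 0, so nothing survives.
kept_without_count = filter_extremes_sketch(dfs, num_docs=0)

print(kept_with_count)
print(kept_without_count)
```

Under these assumptions the filter itself is fine; the empty result comes purely from the document count not being restored on load.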
Just got bit by this. Tried to save_as_text then load_from_text, but load doesn't respect the format from the save. Possible fixes:
I would like to work on this. I will try to make load_from_text work, ideally without breaking backward compatibility.
save_as_text now writes num_docs on the first line. load_from_text loads it in a backward-compatible way.
Fix Dictionary save_as_text method #56 + fix lint errors
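The approach described in the fix (document count on the first line, older files still readable) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the actual commit: the function names are hypothetical, and the tab-separated `id`, `token`, `document frequency` layout of the remaining lines is assumed for the example.

```python
import io


def write_dictionary_text(stream, num_docs, entries):
    """Write num_docs on the first line, then one tab-separated entry per line."""
    stream.write(f"{num_docs}\n")
    for token_id, token, doc_freq in entries:
        stream.write(f"{token_id}\t{token}\t{doc_freq}\n")


def read_dictionary_text(stream):
    """Read the format above; files without a leading num_docs line still load."""
    num_docs = 0  # fallback for old-style files that carry no document count
    entries = []
    for lineno, line in enumerate(stream):
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        if lineno == 0 and line.strip().isdigit():
            # New-style file: the first line holds only the document count.
            num_docs = int(line.strip())
            continue
        token_id, token, doc_freq = line.split("\t")
        entries.append((int(token_id), token, int(doc_freq)))
    return num_docs, entries


# Round trip through the new format.
new_file = io.StringIO()
write_dictionary_text(new_file, 9, [(0, "human", 2), (1, "computer", 3)])
new_file.seek(0)
roundtrip = read_dictionary_text(new_file)

# Old-style file without the leading count: entries load, num_docs stays 0.
old_file = io.StringIO("0\thuman\t2\n1\tcomputer\t3\n")
legacy = read_dictionary_text(old_file)
```

The reader tells the two formats apart by checking whether the first line is a bare integer; an entry line contains tabs and so never matches.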
Is this fixed by #1402? Do we still need to change the documentation?
Works fine in
I thought that Dictionary.save_as_text and load_from_text were equivalent to Dictionary.save/load, but they aren't. The text format does not keep "num_docs", and after loading a Dictionary from text, several methods do not work anymore::
Everything works as expected. Now the text version (assuming the dct from above)::
This behaviour is not documented anywhere, and I'd think that save_as_text and load_from_text should lead to a fully functional Dictionary object.