Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dictionary.save_as_text/load_from_text is dangerous #56

Closed
heuer opened this issue Oct 11, 2011 · 12 comments
Closed

Dictionary.save_as_text/load_from_text is dangerous #56

heuer opened this issue Oct 11, 2011 · 12 comments
Assignees
Labels
documentation Current issue related to documentation

Comments

@heuer
Copy link

heuer commented Oct 11, 2011

I thought that Dictionary.save_as_text and load_from_text is equivalent to Dictionary.save/load, but it isn't. The text format does not keep the "num_docs" and after loading a Dict from a text, several methods do not work anymore::

>>> dct = Dictionary()
>>> _ = dct.doc2bow(['Hi', 'there'], allow_update=True)
>>> _ = dct.doc2bow(['Hi', 'all'], allow_update=True)
>>> _ = dct.doc2bow(['Hi', 'there', 'world'], allow_update=True)
>>> print dct
Dictionary(4 unique tokens)
>>> print dct.num_docs
3
>>> dct.save('./test.dct')
>>> d2 = Dictionary.load('./test.dct')
>>> print d2.num_docs
3
>>> # Filter extremes
>>> d2.filter_extremes(no_below=1)
>>> print d2
Dictionary(2 unique tokens)
>>> print d2.token2id
{'world': 0, 'all': 1}

Everything works as expected. Now the text version (assuming the dct from above)::

>>> dct.save_as_text('./test.txt')
>>> d2 = Dictionary.load_from_text('./test.txt')
>>> print d2.num_docs
0
>>> print d2 # Everything is fine here, expecting 4 tokens
Dictionary(4 unique tokens)
>>> # Filter extremes
>>> d2.filter_extremes(no_below=1)
>>> print d2 # Whoops, all tokens are gone, expected 2, see above
Dictionary(0 unique tokens)
>>> print d2.token2id
{}

This behaviour is not documented anywhere and I'd think that save_as_text and load_from_text should lead to a fully functional Dictionary object

@piskvorky
Copy link
Owner

Yeah, they are not meant to be equivalent (why have 2 equivalent functions?). The text format is meant mostly for human inspection, not as a custom re-implementation of the pickle serialization protocol.

The preferred way to save and load objects in gensim is via save() and load(). Text is for debugging and portability. You're right that the documentation should mention this; I'll fix it (or you open a pull request if you have the words ready :)

@heuer
Copy link
Author

heuer commented Oct 11, 2011

why have 2 equivalent functions?

Because pickle could be unsafe and the text format can be inspected and modified by humans before it's imported back.

@piskvorky
Copy link
Owner

Oh, security in gensim, that's a new one!

I'm afraid taking care of that properly would require more effort than just serializing to text in Dictionary. This is the first time I hear of this use case -- users usually run experiments with their own code and data -- so at the moment, I would suggest you override functions you deem unsafe for your scenario yourself. Perhaps setting pickle protocol to 0 (ascii) in save() would be enough for your human inspection requirements? Can you say more about these requirements?

If there are more users with similar needs (or if your solution is sufficiently generic), we can brainstorm some gensim-wide "fix", I'm open to that.

@fannix
Copy link

fannix commented Apr 12, 2012

I guess I also wrongly made this assumption. IMHO, the name is misleading. If I can use two versions of loading and saving. The UNIX philosophy tell us to go for the text version... :-)

@fannix
Copy link

fannix commented Apr 12, 2012

I think more suitable name for sav_as_text would be output_mapping or print_mapping

@piskvorky
Copy link
Owner

You're right.

I'm in favour of save_mapping, or write_mapping -- something that evokes opening files (unlike print_mapping). Can you rename it?

@fannix
Copy link

fannix commented Apr 12, 2012

I am not sure... It might take a little time for me to be familiar with the code, and only after that I am able to rename it.

@salmoni
Copy link

salmoni commented Nov 9, 2012

I'm coming in a bit late but would like to help: I find the text version of the dictionary extremely useful because I can manipulate it with different (non-Python) tools. E.g., even a spreadsheet. However, I often need to read the new dictionary back in again. After looking at the code, the problem seems to be that reading in an array doesn't reset the Dictionary's num_docs attribute. This is used in filter_extremes so any call to this method results in zero documents surviving the filter. This also applies when using the load method.

I'm not sure if the filter_extremes method needs to re-calculate num_docs in case further calls are made?

What I've done is a very quick fix but helps filter_extremes work with loaded dictionaries.

@mattsta
Copy link

mattsta commented Feb 22, 2014

Just got bit by this. Tried to save_as_text then load_from_text, but load doesn't respect the format from the save.

Possible fixes:

  • make load_from_text work. (ideal)
  • rename save_as_text to save_human_readable then remove load_from_text.
  • improve documentation to make it clear _as_text aren't for reusable persistence.

@vlejd
Copy link
Contributor

vlejd commented Jun 8, 2017

I would like to work on this. I will try to make the load_from_text work, optimally without breaking the backward compatibility.

vlejd added a commit to vlejd/gensim that referenced this issue Jun 8, 2017
save_as_text now writes num_docs on the first line.
load_as_text loads it in backward compatible way.
vlejd added a commit to vlejd/gensim that referenced this issue Jun 8, 2017
save_as_text now writes num_docs on the first line.
load_as_text loads it in backward compatible way.
menshikh-iv added a commit that referenced this issue Jun 15, 2017
Fix Dictionary save_as_text method #56 + fix lint errors
@vlejd
Copy link
Contributor

vlejd commented Jun 19, 2017

Is this fixed by #1402 ? Do we still need to change the documentation?

@menshikh-iv
Copy link
Contributor

Works fine in 3.0.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Current issue related to documentation
Projects
None yet
Development

No branches or pull requests

7 participants