Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Dictionary save_as_text method #56 + fix lint errors #1402

Merged
merged 5 commits into from
Jun 15, 2017

Conversation

vlejd
Copy link
Contributor

@vlejd vlejd commented Jun 8, 2017

save_as_text now writes num_docs on the first line.
load_as_text loads it in backward compatible way.
Fixed some lint error on the way.

save_as_text now writes num_docs on the first line.
load_as_text loads it in backward compatible way.
@vlejd vlejd force-pushed the 56_fix_dict_save_as_text branch from cac6314 to 2351a90 Compare June 8, 2017 19:23
@@ -157,16 +156,15 @@ def testFilterTokens(self):
d.filter_tokens([0])

expected = {'computer': 0, 'eps': 8, 'graph': 10, 'human': 1,
'interface': 2, 'minors': 11, 'response': 3, 'survey': 4,
'system': 5, 'time': 6, 'trees': 9, 'user': 7}
'interface': 2, 'minors': 11, 'response': 3, 'survey': 4,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use hanging indent (no vertical)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

def test_saveAsText(self):
"""`Dictionary` can be saved as textfile. """
tmpf = get_tmpfile('save_dict_test.txt')
small_text = [["prvé", "slovo"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hanging indent

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
"A survey of user opinion of computer system response time",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hanging indent

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
for document in documents]
for document in documents]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hanging indent

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.


# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once]
for text in texts]
for text in texts]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hanging indent

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. It actually fits on one line.

self.assertEqual(serialized_lines[2][1:], "\tdruhé\t2\n")
self.assertEqual(serialized_lines[3][1:], "\tprvé\t1\n")

def test_loadFromText(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add case with old file format

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first part is for format, the second is for new. I create two different functions for better verbosity.

Hanging indent and split test_loadFromText to test_loadFromText_legacy.
@menshikh-iv
Copy link
Contributor

Thank you @vlejd

Copy link
Owner

@piskvorky piskvorky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clean up! I'd ask for one minor fix re. the logging level.

result.num_docs = int(line.strip())
continue
else:
logging.warning("Text does not contain num_docs on the first line.")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really a warning?

There's no ambiguity (3 columns vs 1 column), so I think INFO (or even DEBUG) should be sufficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The warning is not about number ambiguity, but about the fact that you are using old serialization format.
I put there a warning, because it may cause you errors somewhere down the pipeline (as mentioned by the original issue #56). If it does not make sense, I can change it, but I think warning should be the best.

Copy link
Owner

@piskvorky piskvorky Jun 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps, but the current message doesn't help the users.

What "text"? Why should it contain "num_docs"? What should the user do about it?

This should be a backward compatible change, so "errors down the pipeline" are not acceptable. CC @menshikh-iv

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like: "This dictionary was serialized using old format. Please set the dictionary.num_docs manually or some dictionary methods may not work as intended."

@menshikh-iv
Copy link
Contributor

menshikh-iv commented Jun 22, 2017

@vlejd you PR has a few problems with python >= 3.5 on Windows:

======================================================================
ERROR: `Dictionary` can be loaded from textfile.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python35-x64\lib\site-packages\gensim\test\test_corpora_dictionary.py", line 234, in test_loadFromText
    d = Dictionary.load_from_text(tmpf)
  File "C:\Python35-x64\lib\site-packages\gensim\corpora\dictionary.py", line 358, in load_from_text
    line = utils.to_unicode(line)
  File "C:\Python35-x64\lib\site-packages\gensim\utils.py", line 235, in any2unicode
    return unicode(text, encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 5: invalid continuation byte
======================================================================
ERROR: `Dictionary` can be loaded from textfile in legacy format.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python35-x64\lib\site-packages\gensim\test\test_corpora_dictionary.py", line 220, in test_loadFromText_legacy
    d = Dictionary.load_from_text(tmpf)
  File "C:\Python35-x64\lib\site-packages\gensim\corpora\dictionary.py", line 358, in load_from_text
    line = utils.to_unicode(line)
  File "C:\Python35-x64\lib\site-packages\gensim\utils.py", line 235, in any2unicode
    return unicode(text, encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 5: invalid continuation byte
======================================================================
FAIL: `Dictionary` can be saved as textfile.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python35-x64\lib\site-packages\gensim\test\test_corpora_dictionary.py", line 197, in test_saveAsText
    self.assertEqual(serialized_lines[1][1:], "\tdruh�\t2\n")
AssertionError: '\tdruhé\t2\n' != '\tdruh�\t2\n'
- 	druhé	2
? 	    ^^
+ 	druh�	2
? 	    ^
-------------------- >> begin captured logging << --------------------
gensim.corpora.dictionary: INFO: adding document #0 to Dictionary(0 unique tokens: [])
gensim.corpora.dictionary: INFO: built Dictionary(3 unique tokens: ['druh�', 'prv�', 'slovo']) from 3 documents (total 6 corpus positions)
gensim.corpora.dictionary: INFO: saving dictionary mapping to C:\Users\appveyor\AppData\Local\Temp\1\save_dict_test.txt
--------------------- >> end captured logging << ---------------------
======================================================================

Can you fix this ?

@vlejd
Copy link
Contributor Author

vlejd commented Jun 22, 2017

Not really. I do not have access to windows. But I can change the test to not use strange characters. The correct solution would be to properly set up the encoding which I can not do without testing on windows (it would be a shot into dark).

@vlejd vlejd mentioned this pull request Jun 22, 2017
@piskvorky
Copy link
Owner

-1 on masking the error by removing non-ASCII characters.

@menshikh-iv
Copy link
Contributor

I'll soon get a laptop with windows and try to understand what the problem is @vlejd

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants