-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Dictionary save_as_text method #56 + fix lint errors #1402
Fix Dictionary save_as_text method #56 + fix lint errors #1402
Conversation
save_as_text now writes num_docs on the first line. load_as_text loads it in backward compatible way.
cac6314
to
2351a90
Compare
@@ -157,16 +156,15 @@ def testFilterTokens(self): | |||
d.filter_tokens([0]) | |||
|
|||
expected = {'computer': 0, 'eps': 8, 'graph': 10, 'human': 1, | |||
'interface': 2, 'minors': 11, 'response': 3, 'survey': 4, | |||
'system': 5, 'time': 6, 'trees': 9, 'user': 7} | |||
'interface': 2, 'minors': 11, 'response': 3, 'survey': 4, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use hanging indent (no vertical)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok.
def test_saveAsText(self): | ||
"""`Dictionary` can be saved as textfile. """ | ||
tmpf = get_tmpfile('save_dict_test.txt') | ||
small_text = [["prvé", "slovo"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hanging indent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok.
"The intersection graph of paths in trees", | ||
"Graph minors IV Widths of trees and well quasi ordering", | ||
"Graph minors A survey"] | ||
"A survey of user opinion of computer system response time", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hanging indent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok.
stoplist = set('for a of the and to in'.split()) | ||
texts = [[word for word in document.lower().split() if word not in stoplist] | ||
for document in documents] | ||
for document in documents] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hanging indent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok.
|
||
# remove words that appear only once | ||
all_tokens = sum(texts, []) | ||
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1) | ||
texts = [[word for word in text if word not in tokens_once] | ||
for text in texts] | ||
for text in texts] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hanging indent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. It actually fits on one line.
self.assertEqual(serialized_lines[2][1:], "\tdruhé\t2\n") | ||
self.assertEqual(serialized_lines[3][1:], "\tprvé\t1\n") | ||
|
||
def test_loadFromText(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add case with old file format
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first part is for format, the second is for new. I create two different functions for better verbosity.
Hanging indent and split test_loadFromText to test_loadFromText_legacy.
Thank you @vlejd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clean up! I'd ask for one minor fix re. the logging level.
result.num_docs = int(line.strip()) | ||
continue | ||
else: | ||
logging.warning("Text does not contain num_docs on the first line.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this really a warning?
There's no ambiguity (3 columns vs 1 column), so I think INFO (or even DEBUG) should be sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The warning is not about number ambiguity, but about the fact that you are using old serialization format.
I put there a warning, because it may cause you errors somewhere down the pipeline (as mentioned by the original issue #56). If it does not make sense, I can change it, but I think warning should be the best.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps, but the current message doesn't help the users.
What "text"? Why should it contain "num_docs"? What should the user do about it?
This should be a backward compatible change, so "errors down the pipeline" are not acceptable. CC @menshikh-iv
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like: "This dictionary was serialized using old format. Please set the dictionary.num_docs manually or some dictionary methods may not work as intended."
@vlejd you PR has a few problems with python >= 3.5 on Windows:
Can you fix this ? |
Not really. I do not have access to windows. But I can change the test to not use strange characters. The correct solution would be to properly set up the encoding which I can not do without testing on windows (it would be a shot into dark). |
-1 on masking the error by removing non-ASCII characters. |
I'll soon get a laptop with windows and try to understand what the problem is @vlejd |
save_as_text now writes num_docs on the first line.
load_as_text loads it in backward compatible way.
Fixed some lint error on the way.