fix write method of file requires byte-like object, not str #1750

Merged 3 commits on Dec 5, 2017
12 changes: 7 additions & 5 deletions gensim/scripts/segment_wiki.py
@@ -108,9 +108,7 @@ def segment_and_write_all_articles(file_path, output_file, min_article_character
         Number of parallel workers, max(1, multiprocessing.cpu_count() - 1) if None.
 
     """
-    if output_file is None:
-        outfile = sys.stdout
-    else:
+    if output_file is not None:
         outfile = smart_open(output_file, 'wb')
 
     try:
@@ -122,9 +120,13 @@ def segment_and_write_all_articles(file_path, output_file, min_article_character
                 output_data["section_texts"].append(section_content)
             if (idx + 1) % 100000 == 0:
                 logger.info("processed #%d articles (at %r now)", idx + 1, article_title)
-            outfile.write(json.dumps(output_data) + "\n")
+            if output_file is None:
+                sys.stdout.write(json.dumps(output_data) + "\n")
+            else:
+                outfile.write((json.dumps(output_data) + "\n").encode())
Owner commented:

Definitely not: always use explicit encoding!

In this case, the output must be utf8.

piskvorky (Owner) commented on Dec 5, 2017:

Also, I'd prefer to write utf8 even to stdout (sys.stdout.buffer), because that's the script's contract -- that's what we tell users we output. It's not a special case.
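
A minimal sketch of what this suggestion could look like, not the code that was merged: both destinations are treated as binary streams, and every line is encoded with an explicit utf8 codec. The helper names `open_binary_output` and `write_json_line` are hypothetical, and plain `open` stands in for `smart_open` so the snippet needs only the standard library.

```python
import io
import json
import sys


def open_binary_output(output_file=None):
    """Return a binary stream: the named file, or stdout's byte buffer.

    Hypothetical helper for illustration; plain open() replaces smart_open.
    """
    if output_file is None:
        # Python 3 exposes the underlying byte stream as sys.stdout.buffer;
        # on Python 2, sys.stdout itself already accepts byte strings.
        return getattr(sys.stdout, "buffer", sys.stdout)
    return open(output_file, "wb")


def write_json_line(outfile, output_data):
    """Serialize one record as a JSON line and write it as utf8 bytes."""
    line = json.dumps(output_data) + "\n"
    # Explicit codec: never rely on the interpreter's default encoding.
    outfile.write(line.encode("utf8"))


# Demonstration against an in-memory binary stream.
buf = io.BytesIO()
write_json_line(buf, {"title": u"\u0159", "section_titles": [], "section_texts": []})
written = buf.getvalue()
```

With this shape the write path has no special case for stdout: the same `write_json_line` call serves both destinations, and only the code that opens and closes the stream needs to know which one is in use.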

Contributor Author commented:

>>> help(str.encode)
Help on method_descriptor:

encode(...)
    S.encode(encoding='utf-8', errors='strict') -> bytes

The same applies on Python 2.

piskvorky (Owner) commented on Dec 5, 2017:

No, it's not:

u"ř".encode()

UnicodeEncodeError: 'ascii' codec can't encode character u'\u0159' in position 0: ordinal not in range(128)

(and even if it did, we'd still want to be explicit)
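
To make the difference concrete, here is a small sketch that runs on Python 3. It passes the codec explicitly and uses an explicit "ascii" encode to stand in for Python 2's implicit default, since the traceback above shows the 'ascii' codec being used there. The variable names are only illustrative.

```python
text = u"\u0159"  # the character from the comment above (r with caron)

# Explicit codec: the result does not depend on the interpreter version.
utf8_bytes = text.encode("utf8")

# Emulate Python 2's implicit default by requesting the ascii codec.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```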

     finally:
-        outfile.close()
+        if output_file is not None:
+            outfile.close()
 
 
 def extract_page_xmls(f):