load_etext() returns accent-pruned version of text #102

lissyx · 2018-05-15T14:03:18Z

I came across this issue, and I am not sure whether it's a bug in the lib or some process on Gutenberg itself. The eBook with id 10160 can be seen at http://www.gutenberg.org/cache/epub/10160/pg10160.txt and it does contain accents. Two versions of the text file are hosted on the cache:

the one queried by default : http://aleph.gutenberg.org/1/0/1/6/10160/10160.txt
the one with -8: http://aleph.gutenberg.org/1/0/1/6/10160/10160-8.txt

Now, I am not sure whether this is expected, but the file named 10160.txt does not contain any accents, while 10160-8.txt does contain them. Changing the order of the extensions in

gutenberg/gutenberg/acquire/text.py

Line 66 in e76d96e

extensions = ('.txt', '-8.txt', '-0.txt')

helps work around the issue.


c-w · 2018-05-15T16:46:15Z

Hi @lissyx. Thanks for reaching out and for reporting this.

Do you think it would be worth changing the order of the extensions by default? Are there any cases where it wouldn't be beneficial to load the -8 version of the text?

If changing the order is the right way to go: would you mind making a pull request for this? Thanks!

lissyx · 2018-05-15T18:20:06Z

Thanks @c-w for the quick reply. Part of the problem here is that I have no idea whether this is expected behavior, and I could not find any documentation about it from Project Gutenberg. I extracted a random dataset of 1000 French ebooks, and so far only one has shown this behavior. Do you know what the -8 and -0 are for? I'd be happy to make a PR once we know for sure what the proper fix is :)

hugovk · 2018-05-15T19:30:14Z

File formats other than plain text will have a format-designator appended to the filename, as well as an appropriate file extension. The following list indicates the most common formats likely to be found at Project Gutenberg:

Plain text       12345.txt          12345.zip    (encoding: us-ascii)
8-bit plain text 12345-8.txt        12345-8.zip  (encodings: iso-8859-1, windows-1252, MacRoman, ...)
Big-5            12345-5.txt        12345-5.zip  (encoding: big-5)
Unicode          12345-0.txt        12345-0.zip  (encoding: utf-8)

HTML             12345-h.htm        12345-h.zip
TeX              12345-t.tex        12345-t.zip
XML              12345-x.xml        12345-x.zip
MP3              12345-m-###.mp3    12345-m.zip
RTF              12345-r.rtf        12345-r.zip

PDF              12345-pdf.pdf      12345-pdf.zip
LIT              12345-lit.lit      12345-lit.zip
MS Word Doc      12345-doc.doc      12345-doc.zip
PDB              12345-pdb.pdb      12345-pdb.zip

https://www.gutenberg.org/files/
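
For reference, the plain-text rows of that table can be written down as a small lookup. This is only a sketch: `ENCODINGS_BY_SUFFIX` and `likely_encodings` are hypothetical names, not part of the gutenberg library, and the codec spellings are illustrative.

```python
# Suffix -> encodings, transcribed from the plain-text rows of the table above.
# Names and codec spellings are illustrative, not taken from the gutenberg library.
ENCODINGS_BY_SUFFIX = {
    "-0.txt": ("utf-8",),
    "-8.txt": ("iso-8859-1", "windows-1252", "mac-roman"),
    "-5.txt": ("big5",),
    ".txt": ("us-ascii",),
}


def likely_encodings(filename):
    """Return the encodings the table suggests for a Gutenberg text filename."""
    # Dict order keeps ".txt" last so it cannot shadow "-0.txt", "-8.txt" or "-5.txt".
    for suffix, encodings in ENCODINGS_BY_SUFFIX.items():
        if filename.endswith(suffix):
            return encodings
    return ()


print(likely_encodings("10160-8.txt"))
print(likely_encodings("10160.txt"))
```

Read this way, 10160.txt being accent-free is consistent with the table: the bare .txt file is the us-ascii rendition.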

lissyx · 2018-05-15T19:55:22Z

Thanks @hugovk, I failed to find that. So maybe the order of extensions should be changed to prefer -0.txt, then -8.txt, then .txt?

hugovk · 2018-05-15T20:00:48Z

Yes, that sounds sensible.
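
A minimal sketch of what the proposed ordering means for the URLs that get tried, assuming the mirror layout visible in the links earlier in this thread. `candidate_urls` is a hypothetical helper, not the library's actual function, and it only covers ebook ids with at least two digits.

```python
def candidate_urls(etextno, mirror="http://aleph.gutenberg.org",
                   extensions=("-0.txt", "-8.txt", ".txt")):
    """Yield candidate text URLs for an ebook, most preferred extension first."""
    digits = str(etextno)
    if len(digits) < 2:
        # Single-digit ids use a different directory layout; out of scope here.
        raise ValueError("this sketch only handles ebook ids >= 10")
    # Layout seen in this thread: 10160 -> 1/0/1/6/10160/
    directory = "/".join(digits[:-1]) + "/" + digits
    for ext in extensions:
        yield f"{mirror}/{directory}/{digits}{ext}"


for url in candidate_urls(10160):
    print(url)
```

A caller would request these in order and keep the first one that exists, so the UTF-8 file wins when it is available and the ASCII file is only a last resort.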

MasterOdin · 2018-05-15T22:56:05Z

I'd also like to request that we add a flag, at least for preferring ASCII over other sources. We may also want to document that table within this library so users know what to expect from an ebook.

lissyx · 2018-05-16T06:44:50Z

@MasterOdin Would that be what you had in mind ?
#103

No tests there yet; I'll add them afterwards. I just want to make sure it's good enough first. I'm thinking that changing the default behavior might not be the best idea in the world, but I'd like an external eye on that.

lissyx · 2018-05-16T11:09:27Z

Updated PR with tests.

lissyx · 2018-05-16T12:29:00Z

@c-w There might also be another issue hidden:

python -c 'from gutenberg.acquire import load_etext; print(load_etext(55517, refresh_cache=True)[0:1000])' 
INFO:rdflib:RDFLib Version: 4.2.2
ï»¿The Project Gutenberg EBook of Correspondance, by Ã�mile Zola

This eBook is for the use of anyone anywhere in the United States and most
other parts of the world at no cost and with almost no restrictions
whatsoever.  You may copy it, give it away or re-use it under the terms of
the Project Gutenberg License included with this eBook or online at
www.gutenberg.org.  If you are not located in the United States, you'll have
to check the laws of the country where you are located before using this ebook.

Title: Correspondance
       Lettres de jeunesse

Author: Ã�mile Zola

Release Date: September 10, 2017 [EBook #55517]

Language: French

Character set encoding: UTF-8

*** START OF THIS PROJECT GUTENBERG EBOOK CORRESPONDANCE ***

The file is advertised as UTF-8, but the content is clearly garbled. Digging into load_etext again, it seems requests thinks the encoding is "ISO-8859-1" when the file really is UTF-8. So far, forcing response.encoding = "utf-8" fixes it for me locally.

Given that the file is written out as UTF-8 afterwards, I'd be tempted to add response.encoding = "utf-8" to my PR.
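
The garbling in the output above can be reproduced without any network access by decoding UTF-8 bytes as ISO-8859-1. This is a sketch of the symptom only: the sample line is rebuilt by hand from the header shown above, and the `utf-8-sig` codec (which also drops the leading byte order mark) is a choice made for this example rather than what the PR does.

```python
# The first header line as UTF-8 bytes, including the byte order mark (BOM).
raw = "\ufeffThe Project Gutenberg EBook of Correspondance, by Émile Zola".encode("utf-8")

# Decoding with the wrong codec: each UTF-8 byte becomes one Latin-1 character.
wrong = raw.decode("iso-8859-1")

# Decoding as UTF-8; "utf-8-sig" additionally strips the BOM.
right = raw.decode("utf-8-sig")

print(repr(wrong[:6]))
print(right)
```

The three BOM bytes turn into "ï»¿" and the two bytes of "É" turn into "Ã" plus a control character, which is the same pattern as in the load_etext output.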

lissyx · 2018-05-16T12:52:17Z

I can confirm that multiple mirrors do not send any charset information when downloading, while the main website does:

$ curl -L -v http://www.gutenberg.org/files/55517/55517-0.txt 2>&1 | grep Content-Type
< Content-Type: text/plain; charset=UTF-8
$ curl -L -v http://www.mirrorservice.org/sites/ftp.ibiblio.org/pub/docs/books/gutenberg/5/5/5/1/55517/55517-0.txt 2>&1 | grep Content-Type
< Content-Type: text/plain
$ curl -L -v http://aleph.gutenberg.org/5/5/5/1/55517/55517-0.txt 2>&1 | grep Content-Type
< Content-Type: text/plain
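
The difference between those headers can be checked programmatically. A small sketch using only the standard library; `declared_charset` is a hypothetical helper name and not something the gutenberg library provides.

```python
from email.message import Message


def declared_charset(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the value and returns None when absent.
    return msg.get_content_charset()


print(declared_charset("text/plain; charset=UTF-8"))
print(declared_charset("text/plain"))
```

As far as I know, requests falls back to ISO-8859-1 for text/* responses that declare no charset, which would explain the wrong guess when the file comes from one of the mirrors.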

MasterOdin · 2018-05-16T13:04:06Z

Using the apparent_encoding still sets things right though?

lissyx · 2018-05-16T14:15:44Z

@MasterOdin Yes, it seems to be okay doing it like that.
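
For readers unfamiliar with it, apparent_encoding makes requests guess the encoding from the response bytes instead of trusting the headers. The sketch below is not that detector. It is a much cruder stand-in that tries strict UTF-8 first and falls back to ISO-8859-1, and `decode_with_fallback` is a hypothetical name used only here.

```python
def decode_with_fallback(raw, declared=None):
    """Decode bytes, preferring a declared charset, then UTF-8, then ISO-8859-1."""
    if declared:
        return raw.decode(declared)
    try:
        # Strict UTF-8 rejects most non-UTF-8 input; "utf-8-sig" also drops a BOM.
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this never fails.
        return raw.decode("iso-8859-1")


print(decode_with_fallback("Émile Zola".encode("utf-8")))
print(decode_with_fallback("Émile Zola".encode("iso-8859-1")))
```

Both calls print the name correctly here, but unlike a real detector this approach cannot tell the 8-bit encodings apart from one another.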

Resolves #102

lissyx mentioned this issue May 16, 2018

Changing default ordering of extensions for fetching ebooks #103

Merged

c-w closed this as completed in #103 May 18, 2018

c-w pushed a commit that referenced this issue May 18, 2018

Changing default ordering of extensions for fetching ebooks (#103)

36cc023

Resolves #102
