This repository has been archived by the owner on Jan 12, 2023. It is now read-only.

load_etext() returns accent-pruned version of text #102

Closed
lissyx opened this issue May 15, 2018 · 12 comments

Comments

@lissyx
Contributor

lissyx commented May 15, 2018

I came across this issue, and I am not sure whether it's a bug in the lib or some process on Gutenberg itself. The eBook with id 10160 can be seen at http://www.gutenberg.org/cache/epub/10160/pg10160.txt and it does contain accents. Hosted on the cache there are two versions of the text file, 10160.txt and 10160-8.txt.

Now, I am not sure whether it's expected or not, but the one named 10160.txt does not contain any accents, whereas the 10160-8.txt file does contain them. Changing the order of the extensions in

extensions = ('.txt', '-8.txt', '-0.txt')
helps work around the issue.

@c-w
Owner

c-w commented May 15, 2018

Hi @lissyx. Thanks for reaching out and for reporting this.

Do you think it would be worth changing the order of the extensions by default? Are there any cases where it wouldn't be beneficial to load the -8 version of the text?

If changing the order is the right way to go: would you mind making a pull request for this? Thanks!

@lissyx
Contributor Author

lissyx commented May 15, 2018

Thanks @c-w for the quick reply. Part of the problem here is that I have no idea whether this is expected behavior, and I could not find any related Project Gutenberg documentation. I extracted a random dataset of 1000 French ebooks, and so far only one exposed this behavior. Do you know what the -8 and -0 are for? I'd be happy to make a PR once we know for sure what the proper fix is :)

@hugovk
Collaborator

hugovk commented May 15, 2018

File formats other than plain text will have a format-designator appended to the filename, as well as an appropriate file extension. The following list indicates the most common formats likely to be found at Project Gutenberg:

Plain text       12345.txt          12345.zip    (encoding: us-ascii)
8-bit plain text 12345-8.txt        12345-8.zip  (encodings: iso-8859-1, windows-1252, MacRoman, ...)
Big-5            12345-5.txt        12345-5.zip  (encoding: big-5)
Unicode          12345-0.txt        12345-0.zip  (encoding: utf-8)

HTML             12345-h.htm        12345-h.zip
TeX              12345-t.tex        12345-t.zip
XML              12345-x.xml        12345-x.zip
MP3              12345-m-###.mp3    12345-m.zip
RTF              12345-r.rtf        12345-r.zip

PDF              12345-pdf.pdf      12345-pdf.zip
LIT              12345-lit.lit      12345-lit.zip
MS Word Doc      12345-doc.doc      12345-doc.zip
PDB              12345-pdb.pdb      12345-pdb.zip

https://www.gutenberg.org/files/
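
As a rough illustration of the plain-text rows in the table above, the suffix before `.txt` can be mapped to the encoding family it advertises. The names `TEXT_SUFFIXES` and `encoding_family` are hypothetical and not part of the gutenberg library; this is only a sketch of how the naming convention could be read.

```python
import re

# Suffix -> advertised encoding family, copied from the plain-text rows of the table.
# For "-8" the table lists several possible 8-bit encodings, so the value names a
# family rather than a single codec.
TEXT_SUFFIXES = {
    "": "us-ascii",
    "-8": "8-bit (iso-8859-1, windows-1252, MacRoman, ...)",
    "-5": "big-5",
    "-0": "utf-8",
}


def encoding_family(filename):
    """Return the advertised encoding family for a Gutenberg .txt name, or None."""
    match = re.fullmatch(r"\d+(-[0-9a-z]+)?\.txt", filename)
    if match is None:
        return None  # not a plain-text file name (e.g. 12345-h.htm)
    return TEXT_SUFFIXES.get(match.group(1) or "")


print(encoding_family("10160.txt"))
print(encoding_family("55517-0.txt"))
```

Note that the suffix only says what the file claims to be; it does not guarantee how a given mirror labels the bytes when serving them.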

@lissyx
Contributor Author

lissyx commented May 15, 2018

Thanks @hugovk, I failed to find that. So maybe the order of extensions should be changed in favor of: -0.txt, -8.txt, .txt ?

@hugovk
Collaborator

hugovk commented May 15, 2018

Yes, that sounds sensible.

@MasterOdin
Collaborator

MasterOdin commented May 15, 2018

I'd also like to request that we add a flag, at least for preferring ASCII over other sources. We may also want to document that table within this library so users know what to expect from an ebook.
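
One way the reordering and the requested flag could fit together is sketched below. The function names and the `prefer_ascii` parameter are hypothetical, chosen for illustration only, and are not necessarily what the library ended up with; the two orderings themselves are the ones quoted in this thread.

```python
def candidate_extensions(prefer_ascii=False):
    """Return text-file suffixes in the order they should be tried."""
    if prefer_ascii:
        # Hypothetical opt-in: the original order, plain ASCII first.
        return (".txt", "-8.txt", "-0.txt")
    # Order suggested in this thread: UTF-8 first, then 8-bit, then plain ASCII.
    return ("-0.txt", "-8.txt", ".txt")


def candidate_filenames(etextno, prefer_ascii=False):
    """Build the file names to try for a given ebook id, most preferred first."""
    return [f"{etextno}{ext}" for ext in candidate_extensions(prefer_ascii)]


print(candidate_filenames(10160))
print(candidate_filenames(10160, prefer_ascii=True))
```

Keeping the ASCII-first order behind an explicit flag means callers who rely on accent-free text can still opt in, while the default returns the version with accents.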

@lissyx
Contributor Author

lissyx commented May 16, 2018

@MasterOdin Would that be what you had in mind?
#103

No tests there yet; I'll add them afterwards, I just want to make sure it's good enough. I'm thinking that changing the default behavior might not be the best idea in the world, but I'd like an external eye on that.

@lissyx
Contributor Author

lissyx commented May 16, 2018

Updated PR with tests.

@lissyx
Contributor Author

lissyx commented May 16, 2018

@c-w There might also be another issue hidden:

python -c 'from gutenberg.acquire import load_etext; print(load_etext(55517, refresh_cache=True)[0:1000])' 
INFO:rdflib:RDFLib Version: 4.2.2
The Project Gutenberg EBook of Correspondance, by �mile Zola

This eBook is for the use of anyone anywhere in the United States and most
other parts of the world at no cost and with almost no restrictions
whatsoever.  You may copy it, give it away or re-use it under the terms of
the Project Gutenberg License included with this eBook or online at
www.gutenberg.org.  If you are not located in the United States, you'll have
to check the laws of the country where you are located before using this ebook.

Title: Correspondance
       Lettres de jeunesse

Author: �mile Zola

Release Date: September 10, 2017 [EBook #55517]

Language: French

Character set encoding: UTF-8

*** START OF THIS PROJECT GUTENBERG EBOOK CORRESPONDANCE ***

The file is advertised as UTF-8, but the accented characters come out broken. Digging into load_etext again, it seems requests thinks the encoding is "ISO-8859-1" when the file really is UTF-8. So far, forcing response.encoding = "utf-8" fixes it for me locally.

Given that the file is written out as UTF-8 afterwards, I'd be tempted to add response.encoding = "utf-8" to my PR.
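
The mismatch can be reproduced in isolation, without any network access. The sketch below uses an illustrative sample string and plain `bytes.decode`; it is not the library's code, and the exact garbled characters depend on the terminal and Python version, so it is not meant to match the output above byte for byte.

```python
original = "Émile Zola"

# Bytes as a UTF-8 file would store them.
raw = original.encode("utf-8")

# Decoding with the wrong codec does not raise; it silently yields garbled text,
# because every byte is a valid ISO-8859-1 character.
garbled = raw.decode("iso-8859-1")

# Decoding with the codec the file actually uses restores the accent.
fixed = raw.decode("utf-8")

print(repr(garbled))
print(fixed)
```

With requests, setting response.encoding = "utf-8" before reading response.text has the same effect as the second decode here.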

@lissyx
Contributor Author

lissyx commented May 16, 2018

I can confirm that multiple mirrors do not set any charset information when downloading, while the main website does:

$ curl -L -v http://www.gutenberg.org/files/55517/55517-0.txt 2>&1 | grep Content-Type
< Content-Type: text/plain; charset=UTF-8
$ curl -L -v http://www.mirrorservice.org/sites/ftp.ibiblio.org/pub/docs/books/gutenberg/5/5/5/1/55517/55517-0.txt 2>&1 | grep Content-Type
< Content-Type: text/plain
$ curl -L -v http://aleph.gutenberg.org/5/5/5/1/55517/55517-0.txt 2>&1 | grep Content-Type
< Content-Type: text/plain
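
The difference between those headers can also be checked programmatically. This is a minimal sketch using only the standard library; `declared_charset` and `charset_or_default` are hypothetical helper names, and defaulting to UTF-8 is an assumption that only makes sense for files expected to be UTF-8, such as the -0.txt variant.

```python
from email.message import Message


def declared_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def charset_or_default(content_type, default="utf-8"):
    """Use the declared charset when there is one, otherwise fall back (assumption)."""
    return declared_charset(content_type) or default


print(declared_charset("text/plain; charset=UTF-8"))  # header from the main website
print(declared_charset("text/plain"))                 # header from the mirrors
print(charset_or_default("text/plain"))
```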

@MasterOdin
Collaborator

Using the apparent_encoding still sets things right though?

@lissyx
Contributor Author

lissyx commented May 16, 2018

@MasterOdin Yes, it seems to be okay doing it like that.
