load_etext() returns accent-pruned version of text #102
Comments
Hi @lissyx. Thanks for reaching out and for reporting this. Do you think it would be worth changing the order of the extensions by default? Are there any cases where it wouldn't be beneficial to load the If changing the order is the right way to go: would you mind making a pull request for this? Thanks! |
Thanks @c-w for the quick reply. Part of the problem here is that I have no idea whether this is expected behavior or not, and I could not find any documentation related to Project Gutenberg. I extracted a random dataset of 1000 French ebooks, and so far only one has exposed this behavior. Do you know what the |
|
Thanks @hugovk, I failed to find that. So maybe the order of extensions should be changed in favor of: |
Yes, that sounds sensible. |
I'd also like to request that we at least add a flag for preferring ASCII over other sources. We may also want to document that table within this library, so users know what to expect from an ebook. |
@MasterOdin Would that be what you had in mind? No tests there yet; I'll add them afterwards. I just want to make sure it's good enough. I'm thinking that changing the default behavior might not be the best idea in the world, but I'd like an external eye on that. |
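A minimal sketch of what such a flag could look like, not the library's actual code. The names `candidate_extensions`, `candidate_urls`, and `prefer_ascii` are hypothetical, and only the two file suffixes mentioned in this issue (`.txt` and `-8.txt`) are considered.

```python
def candidate_extensions(prefer_ascii=False):
    """Return file-name suffixes to try, most preferred first.

    Hypothetical helper: '.txt' is the accent-free file and '-8.txt' the
    file that keeps accents, as observed for ebook 10160 in this issue.
    """
    if prefer_ascii:
        return (".txt", "-8.txt")
    return ("-8.txt", ".txt")


def candidate_urls(base_url, etextno, prefer_ascii=False):
    """Build the download URLs to try, in order of preference."""
    return [
        "{}/{}{}".format(base_url, etextno, extension)
        for extension in candidate_extensions(prefer_ascii)
    ]


if __name__ == "__main__":
    base = "http://aleph.gutenberg.org/1/0/1/6/10160"
    for url in candidate_urls(base, 10160):
        print(url)
```

With the flag left off, the accent-preserving file is tried first. Passing `prefer_ascii=True` restores the old order for callers who rely on plain ASCII text.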
Updated PR with tests. |
@c-w There might also be another issue hidden:
Advertised as UTF-8, but the content is indeed kind of broken. Digging again in Given that the file is written as UTF-8 afterwards, I'd be tempted to add |
I can confirm that multiple mirrors do not set any charset information when downloading, while the main website does:
|
Using the apparent_encoding still sets things right though? |
@MasterOdin Yes, it seems to be okay doing it like that. |
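To illustrate the fallback discussed here, a rough sketch using only the standard library: trust the charset declared in the Content-Type header when there is one, and otherwise decode with a fallback encoding. The function names are hypothetical. With requests, the fallback could be the detected `response.apparent_encoding`; here it is simply a parameter.

```python
from email.message import Message


def declared_charset(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    if not content_type:
        return None
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset()


def decode_body(body, content_type, fallback_encoding="utf-8"):
    """Decode raw bytes, preferring the declared charset over the fallback.

    fallback_encoding stands in for a detected encoding (an assumption of
    this sketch); it is only used when the server declares no charset.
    """
    encoding = declared_charset(content_type) or fallback_encoding
    return body.decode(encoding)


if __name__ == "__main__":
    raw = "café".encode("utf-8")
    # A mirror that sends no charset: the fallback encoding is used.
    print(decode_body(raw, "text/plain"))
    # A server that declares the charset: the declaration wins.
    print(decode_body(raw, "text/plain; charset=utf-8"))
```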
I came across this issue, and I am not sure if it's a bug in the lib or some process on Gutenberg itself. The eBook with id 10160 can be seen at http://www.gutenberg.org/cache/epub/10160/pg10160.txt and it does contain accents. Hosted on the cache there are two versions of the text file:
-8: http://aleph.gutenberg.org/1/0/1/6/10160/10160-8.txt

Now, I am not sure if it's expected or not, but the one named 10160.txt does not contain any accents, whereas the 10160-8.txt file does contain them. Changing the order of the extensions in gutenberg/gutenberg/acquire/text.py (line 66 in e76d96e)