JW300 v1c #70

kpu · 2021-09-14T17:03:28Z

https://opus.nlpl.eu/JW300.php

"Version 1c provides proper raw untokenized texts and also fixes some additional problems with language codes. "

thammegowda · 2021-09-15T17:44:47Z

Thanks!
This will be added in the next version.

We just need to change the version in these URLs from v1 to v1c:

Lines 415 to 416 in 8cc7f5b

    bi_url_pat = 'http://opus.nlpl.eu/download.php?f=JW300/v1/xml/%s-%s.xml.gz'
    mon_url_pat = 'http://opus.nlpl.eu/download.php?f=JW300/v1/xml/%s.zip'
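
The version bump described here could be sketched as follows. This is only an illustration, not the project's actual code: the helper `jw300_url_patterns` and the constant `OPUS_JW300_BASE` are hypothetical names, and the bilingual v1c path is assumed to mirror the v1 layout (only the monolingual v1c `.zip` URL is confirmed later in this thread).

```python
from typing import Tuple

# Hypothetical constant: the download prefix with the release left as a placeholder.
OPUS_JW300_BASE = 'http://opus.nlpl.eu/download.php?f=JW300/{version}/xml/'


def jw300_url_patterns(version: str = 'v1c') -> Tuple[str, str]:
    """Return (bilingual, monolingual) URL patterns for one JW300 release.

    Hypothetical helper; it assumes every release uses the same file layout.
    """
    base = OPUS_JW300_BASE.format(version=version)
    bi_url_pat = base + '%s-%s.xml.gz'   # filled with (source, target) language codes
    mon_url_pat = base + '%s.zip'        # filled with a single language code
    return bi_url_pat, mon_url_pat


bi_url_pat, mon_url_pat = jw300_url_patterns('v1c')
print(bi_url_pat % ('af', 'bg'))
print(mon_url_pat % 'en')
```

Keeping the release as a parameter means a later release needs a one-word change rather than edits to both patterns.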

thammegowda · 2021-10-03T19:46:01Z

@kpu and @jorgtied
I tried to use v1c, but I am still getting tokenized text instead of the expected raw/untokenized text.

It's the same with opustools (I get tokenized data):

$ pip install opustools
$ opus_read -d JW300 -s af -t bg -wm moses -r v1c -w jw300.af jw300.bg

$ head -3 jw300.af jw300.bg
==> jw300.af <==
Gesinsbeplanning — die Christelike beskouing
BY DIE eerste Wêreldbevolkingskonferensie in 1974 het die 140 nasies wat vergader het , besluit dat alle egpare “ die basiese reg het om vryelik en op ’ n verantwoordelike wyse te besluit oor die aantal en die spasiëring van hulle kinders en om die inligting , opvoeding en middele te hê om dit te doen ” .
Baie beskou dit as ’ n goeie besluit .

==> jw300.bg <==
Планиране на семейството — християнският възглед
СТО и четиридесетте страни , които участвуваха в първата Световна конференция по въпросите на населението , проведена през 1974 г . , решиха , че всяко семейство „ има основното право да решава свободно и отговорно относно броя на децата си , а също и разликата помежду им , и да има информацията , образованието и средствата да прави това “ .
Много хора смятат , че това решение е добро .

I manually inspected this v1c file: https://opus.nlpl.eu/download.php?f=JW300/v1c/xml/en.zip
It has tokenized words:

<?xml version="1.0" encoding="utf-8"?>
<text>
<s id="1">
    <w id="1.1">“</w>
    <w id="1.2">A</w>
    <w id="1.3">Good</w>
    <w id="1.4">Word</w>
    <w id="1.5">for</w>
    <w id="1.6">the</w>
    <w id="1.7">Witnesses</w>
    <w id="1.8">”</w>
</s>
<s id="2">
    <w id="2.1">THE</w>
    <w id="2.2">preaching</w>
    <w id="2.3">activity</w>
    <w id="2.4">of</w>
    <w id="2.5">Jehovah</w>
    <w id="2.6">’</w>
    <w id="2.7">s</w>
    <w id="2.8">witnesses</w>
    <w id="2.9">is</w>
    <w id="2.10">growing</w>
    <w id="2.11">very</w>
    <w id="2.12">rapidly</w>
    <w id="2.13">.</w>
</s>

I can't figure out what I am missing to get raw, untokenized text.
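
To make the symptom concrete, here is a minimal sketch that reads the `<w>` elements of the first sentence shown above with Python's standard `xml.etree.ElementTree`. The function name `joined_sentences` is hypothetical, and joining tokens with single spaces is an assumption about how the tokenized output ends up looking; it is not taken from the opustools source.

```python
import xml.etree.ElementTree as ET

# First sentence of the inspected file, closed with </text> so it parses on its own.
SAMPLE = '''<text>
<s id="1">
    <w id="1.1">“</w>
    <w id="1.2">A</w>
    <w id="1.3">Good</w>
    <w id="1.4">Word</w>
    <w id="1.5">for</w>
    <w id="1.6">the</w>
    <w id="1.7">Witnesses</w>
    <w id="1.8">”</w>
</s>
</text>'''


def joined_sentences(xml_text):
    """Return (sentence id, space-joined tokens) for each <s> element."""
    root = ET.fromstring(xml_text)
    result = []
    for sent in root.iter('s'):
        tokens = [w.text for w in sent.iter('w')]
        result.append((sent.get('id'), ' '.join(tokens)))
    return result


for sent_id, text in joined_sentences(SAMPLE):
    print(sent_id, text)
```

Because each punctuation mark is its own `<w>` element, the joined text has spaces around the quotation marks, which matches the spacing seen in the `head` output above.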

jorgtied · 2021-10-03T19:49:09Z

Try adding `-p raw` to the command. That should give you the untokenized text. Jörg
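
Applied to the command from the previous comment, this suggestion amounts to one extra pair of arguments. A minimal sketch that assembles the command line in Python; the helper name `opus_read_command` is hypothetical, and only flags that already appear in this thread are used.

```python
import shlex


def opus_read_command(src, tgt, release='v1c', preprocess='raw'):
    """Build the opus_read argument list for JW300 (hypothetical helper)."""
    return [
        'opus_read',
        '-d', 'JW300',
        '-s', src,
        '-t', tgt,
        '-wm', 'moses',
        '-r', release,
        '-p', preprocess,                      # the flag suggested above
        '-w', f'jw300.{src}', f'jw300.{tgt}',
    ]


cmd = opus_read_command('af', 'bg')
print(shlex.join(cmd))
# To run it for real (requires opustools to be installed):
#   import subprocess; subprocess.run(cmd, check=True)
```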

thammegowda · 2021-10-03T20:27:20Z

@jorgtied Works now. Thanks for the quick reply.

thammegowda mentioned this issue Oct 4, 2021

[WIP] v0.3.0 #72

Merged

thammegowda closed this as completed in #72 Oct 21, 2021
