JW300 v1c #70

Closed
kpu opened this issue Sep 14, 2021 · 4 comments · Fixed by #72

Comments

@kpu
Collaborator

kpu commented Sep 14, 2021

https://opus.nlpl.eu/JW300.php

"Version 1c provides proper raw untokenized texts and also fixes some additional problems with language codes."

@thammegowda
Owner

Thanks! This will be added in the next version.

We just need to change the URLs here:

bi_url_pat = 'http://opus.nlpl.eu/download.php?f=JW300/v1/xml/%s-%s.xml.gz'
mon_url_pat = 'http://opus.nlpl.eu/download.php?f=JW300/v1/xml/%s.zip'

from v1 to v1c.
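
A minimal sketch of that substitution, assuming the version string is the only part of the two patterns that changes. The `VERSION` constant and the `jw300_urls` helper are hypothetical names introduced for illustration; only the two URL patterns come from the comment above. The sketch builds the strings and does not check that the resulting URLs exist or what they contain.

```python
# Hypothetical constant: the version segment that would change from 'v1' to 'v1c'.
VERSION = 'v1c'

# Same patterns as quoted above, with the version segment factored out.
bi_url_pat = 'http://opus.nlpl.eu/download.php?f=JW300/' + VERSION + '/xml/%s-%s.xml.gz'
mon_url_pat = 'http://opus.nlpl.eu/download.php?f=JW300/' + VERSION + '/xml/%s.zip'


def jw300_urls(src, tgt):
    """Return (bitext_url, src_mono_url, tgt_mono_url) for one language pair.

    Hypothetical helper: it only fills in the patterns, no network access.
    """
    return bi_url_pat % (src, tgt), mon_url_pat % src, mon_url_pat % tgt


if __name__ == '__main__':
    for url in jw300_urls('af', 'bg'):
        print(url)
```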

@thammegowda
Owner

@kpu and @jorgtied
I tried to use v1c, but I am still getting tokenized text instead of the expected raw/untokenized text.

It's the same with opustools (I still get tokenized data):

$ pip install opustools
$ opus_read -d JW300 -s af -t bg -wm moses -r v1c -w jw300.af jw300.bg

$ head -3 jw300.af jw300.bg
==> jw300.af <==
Gesinsbeplanning — die Christelike beskouing
BY DIE eerste Wêreldbevolkingskonferensie in 1974 het die 140 nasies wat vergader het , besluit dat alle egpare “ die basiese reg het om vryelik en op ’ n verantwoordelike wyse te besluit oor die aantal en die spasiëring van hulle kinders en om die inligting , opvoeding en middele te hê om dit te doen ” .
Baie beskou dit as ’ n goeie besluit .

==> jw300.bg <==
Планиране на семейството — християнският възглед
СТО и четиридесетте страни , които участвуваха в първата Световна конференция по въпросите на населението , проведена през 1974 г . , решиха , че всяко семейство „ има основното право да решава свободно и отговорно относно броя на децата си , а също и разликата помежду им , и да има информацията , образованието и средствата да прави това “ .
Много хора смятат , че това решение е добро .
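
One rough way to confirm that output like this is tokenized is to look for punctuation that stands alone between spaces. The sketch below is a heuristic and not part of opustools; `looks_tokenized` is a hypothetical name, and the punctuation set is an assumption chosen to match the samples above.

```python
import re

# Heuristic: a comma, period, etc. preceded by whitespace and followed by
# whitespace or end of line suggests the text was tokenized.
_SPACED_PUNCT = re.compile(r'\s[,.;:!?](?:\s|$)')


def looks_tokenized(line):
    """Return True if the line contains whitespace-separated punctuation."""
    return _SPACED_PUNCT.search(line) is not None


if __name__ == '__main__':
    print(looks_tokenized('Baie beskou dit as ’ n goeie besluit .'))
    print(looks_tokenized('Baie beskou dit as ’n goeie besluit.'))
```

Lines without any punctuation, such as the headings in the sample, are reported as not tokenized, so the check is only meaningful over a whole file.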

I manually inspected this v1c file: https://opus.nlpl.eu/download.php?f=JW300/v1c/xml/en.zip
It has tokenized words:

<?xml version="1.0" encoding="utf-8"?>
<text>
<s id="1">
    <w id="1.1">“</w>
    <w id="1.2">A</w>
    <w id="1.3">Good</w>
    <w id="1.4">Word</w>
    <w id="1.5">for</w>
    <w id="1.6">the</w>
    <w id="1.7">Witnesses</w>
    <w id="1.8">”</w>
</s>
<s id="2">
    <w id="2.1">THE</w>
    <w id="2.2">preaching</w>
    <w id="2.3">activity</w>
    <w id="2.4">of</w>
    <w id="2.5">Jehovah</w>
    <w id="2.6">’</w>
    <w id="2.7">s</w>
    <w id="2.8">witnesses</w>
    <w id="2.9">is</w>
    <w id="2.10">growing</w>
    <w id="2.11">very</w>
    <w id="2.12">rapidly</w>
    <w id="2.13">.</w>
</s>

I can't figure out what I am missing here to get raw, untokenized text.
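
The manual inspection above can be approximated in code by checking whether the `<s>` elements contain `<w>` children. This is a sketch under assumptions: `is_tokenized` is a hypothetical helper, and the untokenized sample assumes that a raw file stores the sentence text directly inside `<s>`, which this thread does not confirm.

```python
import xml.etree.ElementTree as ET


def is_tokenized(xml_text):
    """Return True if any <s> element has at least one <w> child."""
    root = ET.fromstring(xml_text)
    return any(s.find('w') is not None for s in root.iter('s'))


# Shortened version of the tokenized excerpt quoted above.
tokenized_sample = '''<text>
<s id="1">
    <w id="1.1">A</w>
    <w id="1.2">Good</w>
    <w id="1.3">Word</w>
</s>
</text>'''

# Assumed shape of an untokenized file (not taken from the thread).
raw_sample = '<text><s id="1">A Good Word</s></text>'

if __name__ == '__main__':
    print(is_tokenized(tokenized_sample))
    print(is_tokenized(raw_sample))
```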

@jorgtied

jorgtied commented Oct 3, 2021 via email

@thammegowda
Owner

@jorgtied Works now. Thanks for the quick reply.
