Update deinflect.json #199
ext/bg/lang/deinflect.json
Outdated
]
},
{
"kanaIn": "逝ったり",
I wonder what's with the 〜たり entries in the latter half of the -tara block and, similarly, the 〜たら entries in the -tari block below. Otherwise the new conjugations look fine to me.
ext/bg/lang/deinflect.json
Outdated
]
},
{
"kanaIn": "逝ったら",
Referring to the previous comment: from here until the bottom of the block.
They seem to be present in the original too, so if this turns out to be an error, it would be better to report it to Rikaichamp.
Also, if you used some script to do this, please do share 😄
Hmm, you're right, I didn't notice this. I'll open an issue on rikaichamp to see what's up with that.
https://gist.github.com/toasted-nutbread/798c3f96f570307be2ace92c6da03d10
0b83f4f to e812e76
Looks like this was a bug. A bugfix has been submitted to birchill/10ten-ja-reader#111 and I have rebuilt the JSON data in this commit.
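For anyone who wants to catch this kind of mix-up mechanically, here is a rough sketch of a check. It is only an illustration: the `kanaIn` key is taken from the diff above, but the `{reason: [rule, ...]}` layout, the `EXPECTED_ENDINGS` table, and the `find_misplaced` helper are assumptions made for this example, and it is written for Python 3 rather than the Python 2 used elsewhere in this thread.

```python
# Hypothetical check: flag rules whose "kanaIn" ending does not fit the
# block they are filed under. The data layout here is an assumption.
EXPECTED_ENDINGS = {
    "-tara": ("たら", "だら"),
    "-tari": ("たり", "だり"),
}


def find_misplaced(rules_by_reason):
    """Return (reason, kanaIn) pairs whose ending does not fit the block."""
    misplaced = []
    for reason, endings in EXPECTED_ENDINGS.items():
        for rule in rules_by_reason.get(reason, []):
            # str.endswith accepts a tuple of candidate suffixes
            if not rule["kanaIn"].endswith(endings):
                misplaced.append((reason, rule["kanaIn"]))
    return misplaced


# Made-up sample that mimics the entries questioned in the review
sample = {
    "-tara": [{"kanaIn": "逝ったら"}, {"kanaIn": "逝ったり"}],
    "-tari": [{"kanaIn": "逝ったり"}, {"kanaIn": "逝ったら"}],
}
print(find_misplaced(sample))
```

On the sample data this reports one stray entry per block; on a clean ruleset it should return an empty list.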
Thanks!
@toasted-nutbread @siikamiika for future reference, the script I used to convert
#!/usr/bin/env python2
import codecs
import json
import sys


class Deinflector:
    class Rule:
        def __init__(self, source, target, types, reason):
            self.source = unicode(source)
            self.target = unicode(target)
            self.types = int(types)
            self.reason = int(reason)

    class Result:
        def __init__(self, stem, types, conjugations):
            self.stem = unicode(stem)
            self.types = int(types)
            self.conjugations = list(conjugations)

    def __init__(self, filename=None):
        if filename is None:
            self.close()
        else:
            self.load(filename)

    def close(self):
        self.conjugations = list()
        self.rules = dict()

    def load(self, filename):
        self.close()
        try:
            with codecs.open(filename, 'rb', 'utf-8') as fp:
                lines = [line.strip() for line in fp.readlines()]
            # ignore the first line which is the file header
            del lines[0]
        except IOError:
            return False
        for line in lines:
            fields = line.split('\t')
            fieldCount = len(fields)
            if fieldCount == 1:
                self.conjugations.append(fields[0])
            elif fieldCount == 4:
                rule = self.Rule(*fields)
                sourceLength = len(rule.source)
                if sourceLength not in self.rules:
                    self.rules[sourceLength] = list()
                self.rules[sourceLength].append(rule)
            else:
                self.close()
                return False
        return True

    def deinflect(self, word):
        results = [self.Result(word, 0xff, list())]
        have = {word: 0}
        for result in results:
            for length, group in sorted(self.rules.items(), reverse=True):
                if length > len(result.stem):
                    continue
                for rule in group:
                    if result.types & rule.types == 0 or result.stem[-length:] != rule.source:
                        continue
                    new = result.stem[:len(result.stem) - len(rule.source)] + rule.target
                    if len(new) <= 1:
                        continue
                    if new in have:
                        result = results[have[new]]
                        result.types |= (rule.types >> 8)
                        continue
                    have[new] = len(results)
                    conjugations = [self.conjugations[rule.reason]] + result.conjugations
                    results.append(self.Result(new, rule.types >> 8, conjugations))
        return [
            (result.stem, u', '.join(result.conjugations), result.types) for result in results
        ]

    def validate(self, types, tags):
        for tag in tags:
            valid = (
                types & 1 and tag == 'v1' or
                types & 2 and tag[:2] == 'v5' or
                types & 4 and tag == 'adj-i' or
                types & 8 and tag == 'vk' or
                types & 16 and tag[:3] == 'vs-'
            )
            if valid:
                return True
        return False


d = Deinflector('deinflect.dat')
types = ['v1', 'v5', 'adj-i', 'vk', 'vs-']
items = dict()
for i, l in d.rules.items():
    for r in l:
        modsIn = list()
        for i, t in enumerate(types):
            if r.types & (1 << i):
                modsIn.append(t)
        modsOut = list()
        for i, t in enumerate(types):
            if (r.types >> 8) & (1 << i):
                modsOut.append(t)
        #print '%s -> %s (%s) %s' % (r.source, r.target, d.conjugations[r.reason], ', '.join(mods))
        reason = d.conjugations[r.reason]
        if reason not in items:
            items[reason] = list()
        items[reason].append({
            'expIn': [r.source],
            'expOut': r.target,
            'tagsIn': modsIn,
            'tagsOut': modsOut,
        })
#sys.setdefaultencoding('utf-8')
print json.dumps(items, sort_keys=True, indent=4, encoding='utf-8', ensure_ascii=False).encode('utf-8')
This was used to dump the original data from the crazy format that Rikaichan used.
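In case it helps anyone reading this later without a Python 2 interpreter, below is a condensed Python 3 sketch of just the conversion step. It is not the script that was actually run. It assumes the same layout the script above reads (a header line, then one-field lines naming conjugations, then four-field tab-separated rule lines) and the same bit layout for the type mask (input tags in the low bits, output tags eight bits up). `decode_tags`, `convert`, and the `SAMPLE` data are names and values made up for this example, and rules are emitted in file order rather than grouped by source length.

```python
import json

# Same tag order as the `types` list in the script above: bit 0 is 'v1', etc.
TAGS = ['v1', 'v5', 'adj-i', 'vk', 'vs-']


def decode_tags(mask):
    """Return the tag names whose bits are set in the low five bits of mask."""
    return [tag for bit, tag in enumerate(TAGS) if mask & (1 << bit)]


def convert(text):
    """Group rules from deinflect.dat-style text by conjugation name."""
    names, rules = [], []
    for line in text.splitlines()[1:]:  # the first line is the file header
        fields = line.strip().split('\t')
        if len(fields) == 1:
            names.append(fields[0])
        elif len(fields) == 4:
            source, target, types, reason = fields
            rules.append((source, target, int(types), int(reason)))
        else:
            raise ValueError('unexpected field count: %r' % line)
    items = {}
    for source, target, types, reason in rules:
        items.setdefault(names[reason], []).append({
            'expIn': [source],
            'expOut': target,
            'tagsIn': decode_tags(types),        # low bits
            'tagsOut': decode_tags(types >> 8),  # bits 8 and up
        })
    return items


# Illustrative data only; these masks are not taken from the real file.
SAMPLE = (
    "illustrative header\n"
    "polite\n"
    "-tara\n"
    "かったら\tい\t1028\t1\n"
    "ます\tる\t257\t0\n"
)
print(json.dumps(convert(SAMPLE), sort_keys=True, indent=4, ensure_ascii=False))
```

Python 3 strings are already Unicode, so the `unicode()` calls and the `encoding` argument to `json.dumps` are not needed here.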
@FooSoft @toasted-nutbread Thanks for sharing! I found a 2010 version of
It looks like rikaichamp's ruleset is missing entries for ずる サ行変格活用 verbs (code "vz").
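A quick way to confirm a gap like this is to collect every tag a ruleset mentions and compare that against the tags you expect to see. The sketch below assumes rules shaped like the output of the conversion script above (`tagsIn` and `tagsOut` lists per rule). The `missing_tags` helper, the sample rules, and the expected tag list are hypothetical and only there to show the idea.

```python
def missing_tags(items, expected):
    """Return the expected tags that no rule mentions, in sorted order."""
    seen = set()
    for rules in items.values():
        for rule in rules:
            seen.update(rule['tagsIn'])
            seen.update(rule['tagsOut'])
    return sorted(set(expected) - seen)


# Made-up ruleset fragment; a real check would load the full JSON file.
sample_rules = {
    '-tara': [
        {'expIn': ['かったら'], 'expOut': 'い', 'tagsIn': [], 'tagsOut': ['adj-i']},
        {'expIn': ['たら'], 'expOut': 'る', 'tagsIn': [], 'tagsOut': ['v1']},
    ],
}
print(missing_tags(sample_rules, ['v1', 'adj-i', 'vz']))
```

With this sample, "vz" is the only expected tag that never appears.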
Deinflection rules updated using rikaichamp's ruleset, which has been updated more recently.
Fixes #118