Update deinflect.json #199

toasted-nutbread · 2019-09-01T17:48:30Z

Deinflection rules updated using rikaichamp's ruleset, which has been updated more recently.

Fixes #118

siikamiika · 2019-09-01T18:33:20Z

ext/bg/lang/deinflect.json

+            ]
+        },
+        {
+            "kanaIn": "逝ったり",


I wonder what's with the 〜たり entries in the latter half of the -tara block, and similarly the 〜たら entries in the -tari block below. Otherwise the new conjugations look fine to me.

siikamiika · 2019-09-01T18:34:32Z

ext/bg/lang/deinflect.json

+            ]
+        },
+        {
+            "kanaIn": "逝ったら",


Referring to the previous comment: from here until the bottom of the block.

siikamiika · 2019-09-01T18:35:08Z

They seem to be present in the original too, so if this turns out to be an error, better report it to Rikaichamp

siikamiika · 2019-09-01T18:36:16Z

Also, if you used some script to do this, please do share 😄

toasted-nutbread · 2019-09-01T19:07:55Z

> They seem to be present in the original too, so if this turns out to be an error, better report it to Rikaichamp

Hmm, you're right, I didn't notice this. I'll open an issue on rikaichamp to see what's up with that.

> Also, if you used some script to do this, please do share

https://gist.github.com/toasted-nutbread/798c3f96f570307be2ace92c6da03d10

toasted-nutbread · 2019-09-02T02:17:57Z

Looks like this was a bug. A bugfix has been submitted to birchill/10ten-ja-reader#111 and I have rebuilt the JSON data in this commit.
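
A mix-up like this can be caught mechanically. The following is a minimal Python 3 sketch, not the check that was actually run: it assumes the ruleset is a mapping from reason name to a list of entries with a `kanaIn` field (as in the diff excerpts above), and the `EXPECTED_SUFFIXES` table, the `find_misfiled` helper, and the sample entries are made up for illustration.

```python
# Hypothetical consistency check: every inflected form listed under a reason
# should end with one of the suffixes expected for that reason.
EXPECTED_SUFFIXES = {
    "-tara": ("たら", "だら"),
    "-tari": ("たり", "だり"),
}


def find_misfiled(rules):
    """Return (reason, kanaIn) pairs whose ending does not fit the reason."""
    misfiled = []
    for reason, entries in rules.items():
        suffixes = EXPECTED_SUFFIXES.get(reason)
        if suffixes is None:
            continue  # no expectation recorded for this reason
        for entry in entries:
            if not entry["kanaIn"].endswith(suffixes):
                misfiled.append((reason, entry["kanaIn"]))
    return misfiled


# Made-up sample data reproducing the kind of mix-up discussed above.
sample = {
    "-tara": [{"kanaIn": "逝ったら"}, {"kanaIn": "逝ったり"}],
    "-tari": [{"kanaIn": "逝ったり"}, {"kanaIn": "逝ったら"}],
}

for reason, kana in find_misfiled(sample):
    print(reason, kana)
```

Running it against the real file would only need `json.load` in place of the inline `sample`, assuming the key names match.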

siikamiika · 2019-09-02T04:12:45Z

Thanks!

FooSoft · 2019-09-02T18:41:33Z

@toasted-nutbread @siikamiika for future reference, the script I used to convert deinflect.dat is as follows:

```python
#!/usr/bin/env python2

import codecs
import json
import sys


class Deinflector:
    class Rule:
        def __init__(self, source, target, types, reason):
            self.source = unicode(source)  # inflected ending to match
            self.target = unicode(target)  # ending substituted for it
            self.types = int(types)        # packed type bits (low byte in, high bits out)
            self.reason = int(reason)      # index into Deinflector.conjugations


    class Result:
        def __init__(self, stem, types, conjugations):
            self.stem = unicode(stem)
            self.types = int(types)
            self.conjugations = list(conjugations)


    def __init__(self, filename=None):
        if filename is None:
            self.close()
        else:
            self.load(filename)


    def close(self):
        self.conjugations = list()
        self.rules = dict()


    def load(self, filename):
        self.close()

        try:
            with codecs.open(filename, 'rb', 'utf-8') as fp:
                lines = [line.strip() for line in fp.readlines()]
            # ignore the first line which is the file header
            del lines[0]
        except IOError:
            return False

        for line in lines:
            fields = line.split('\t')
            fieldCount = len(fields)

            if fieldCount == 1:
                # a single field is the name of a conjugation (the "reason")
                self.conjugations.append(fields[0])
            elif fieldCount == 4:
                # four fields form a rule; rules are grouped by source length
                rule = self.Rule(*fields)
                sourceLength = len(rule.source)
                if sourceLength not in self.rules:
                    self.rules[sourceLength] = list()
                self.rules[sourceLength].append(rule)
            else:
                self.close()
                return False

        return True


    def deinflect(self, word):
        results = [self.Result(word, 0xff, list())]
        have = {word: 0}

        # results grows while it is iterated, so derived forms are processed too
        for result in results:
            for length, group in sorted(self.rules.items(), reverse=True):
                if length > len(result.stem):
                    continue

                for rule in group:
                    if result.types & rule.types == 0 or result.stem[-length:] != rule.source:
                        continue

                    new = result.stem[:len(result.stem) - len(rule.source)] + rule.target
                    if len(new) <= 1:
                        continue

                    if new in have:
                        # merge into the existing entry without rebinding the
                        # loop variable, so later rules still see this result
                        existing = results[have[new]]
                        existing.types |= (rule.types >> 8)
                        continue

                    have[new] = len(results)

                    conjugations = [self.conjugations[rule.reason]] + result.conjugations
                    results.append(self.Result(new, rule.types >> 8, conjugations))

        return [
            (result.stem, u', '.join(result.conjugations), result.types) for result in results
        ]


    def validate(self, types, tags):
        for tag in tags:
            valid = (
                types & 1 and tag == 'v1' or
                types & 2 and tag[:2] == 'v5' or
                types & 4 and tag == 'adj-i' or
                types & 8 and tag == 'vk' or
                types & 16 and tag[:3] == 'vs-'
            )

            if valid:
                return True

        return False



d = Deinflector('deinflect.dat')

types = ['v1', 'v5', 'adj-i', 'vk', 'vs-']
items = dict()

for _length, group in d.rules.items():
    for r in group:
        # low byte: types the matched form may have
        modsIn = list()
        for bit, t in enumerate(types):
            if r.types & (1 << bit):
                modsIn.append(t)

        # bits above the low byte: types of the resulting form
        modsOut = list()
        for bit, t in enumerate(types):
            if (r.types >> 8) & (1 << bit):
                modsOut.append(t)

        #print '%s -> %s (%s) %s' % (r.source, r.target, d.conjugations[r.reason], ', '.join(modsIn))

        reason = d.conjugations[r.reason]
        if reason not in items:
            items[reason] = list()

        items[reason].append({
            'expIn': [r.source],
            'expOut': r.target,
            'tagsIn': modsIn,
            'tagsOut': modsOut,
        })

#sys.setdefaultencoding('utf-8')

print json.dumps(items, sort_keys=True, indent=4, encoding='utf-8', ensure_ascii=False).encode('utf-8')
```

FooSoft · 2019-09-02T18:43:01Z

This was used to dump the original data from the crazy format that Rikaichan used.
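
For readers unfamiliar with that format, here is a minimal Python 3 sketch of how one rule line is laid out, as far as the script above reveals it: four tab-separated fields (the ending to match, the replacement ending, a packed type field, and an index into the list of reason names). The low 8 bits of the type field are matched against the types of the form being processed, and the bits above them become the types recorded for the resulting form. The sample line, the reason name, and the `decode_rule` helper are hypothetical and not taken from a real deinflect.dat.

```python
TYPES = ["v1", "v5", "adj-i", "vk", "vs-"]  # bit 0 .. bit 4, as in the script


def decode_bits(mask):
    """List the type names whose bit is set in mask."""
    return [name for bit, name in enumerate(TYPES) if mask & (1 << bit)]


def decode_rule(line, reasons):
    """Split one four-field rule line into a readable dictionary."""
    source, target, types, reason = line.split("\t")
    types = int(types)
    return {
        "expIn": source,
        "expOut": target,
        "tagsIn": decode_bits(types & 0xFF),  # low byte
        "tagsOut": decode_bits(types >> 8),   # bits above the low byte
        "reason": reasons[int(reason)],
    }


# Made-up example: 0x0104 sets bit 2 of the low byte (adj-i) and
# bit 0 of the high byte (v1).
reasons = ["negative"]
rule = decode_rule("ない\tる\t%d\t0" % 0x0104, reasons)
print(rule)
```

Packing both sets of flags into one integer keeps each rule on a single short line, which may be part of why the format is hard to read by eye.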

siikamiika · 2019-09-03T09:06:55Z

@FooSoft @toasted-nutbread Thanks for sharing! I found a 2010 version of deinflect.dat on Google. I wonder whether it's an optimized version of a higher-level format or whether the file was actually edited directly. At least Rikaichamp uses an enum for the deinflection reason.

epistularum · 2020-10-11T15:13:53Z

#910

It looks like rikaichamp's ruleset is missing entries for ずる verbs (サ行変格活用, code "vz").
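
A gap like this could be confirmed by collecting every tag the ruleset uses and comparing it with the tags one expects. The sketch below is Python 3 and rests on assumptions: the `rulesIn`/`rulesOut` key names, the `EXPECTED_TAGS` set, the `tags_in_use` helper, and the sample data are all illustrative and should be checked against the actual deinflect.json.

```python
def tags_in_use(rules, keys=("rulesIn", "rulesOut")):
    """Collect every tag mentioned under the given keys (key names assumed)."""
    used = set()
    for entries in rules.values():
        for entry in entries:
            for key in keys:
                used.update(entry.get(key, []))
    return used


# Assumed list of tags worth checking; adjust to the dictionary's tag set.
EXPECTED_TAGS = {"v1", "v5", "vs", "vk", "vz", "adj-i"}

# Made-up sample ruleset, far smaller than the real one.
sample = {
    "-tara": [{"kanaIn": "たら", "kanaOut": "る", "rulesIn": [], "rulesOut": ["v1"]}],
    "adv": [{"kanaIn": "く", "kanaOut": "い", "rulesIn": [], "rulesOut": ["adj-i"]}],
}

missing = sorted(EXPECTED_TAGS - tags_in_use(sample))
print(missing)
```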

siikamiika reviewed Sep 1, 2019

toasted-nutbread mentioned this pull request Sep 1, 2019

-tari and -tara inflections birchill/10ten-ja-reader#110

Closed

Update deinflect.json

e812e76

toasted-nutbread force-pushed the deinflect-json-update branch from 0b83f4f to e812e76 Compare September 2, 2019 02:16

siikamiika approved these changes Sep 2, 2019

siikamiika merged commit eee89fa into FooSoft:master Sep 2, 2019

toasted-nutbread mentioned this pull request Oct 5, 2019

Deinflector optimization #238

Merged
