This repository has been archived by the owner on Feb 25, 2023. It is now read-only.

Update deinflect.json #199

Merged
siikamiika merged 1 commit into FooSoft:master on Sep 2, 2019

Conversation

toasted-nutbread
Collaborator

Deinflection rules updated using rikaichamp's ruleset, which has been updated more recently.

Fixes #118

]
},
{
"kanaIn": "逝ったり",
Collaborator

I wonder what's with the 〜たり entries in the latter half of the -tara block, and similarly the 〜たら entries in the -tari block below. Otherwise the new conjugations look fine to me.

]
},
{
"kanaIn": "逝ったら",
Collaborator

Referring to the previous comment: this applies from here until the bottom of the block.
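
The mix-up described in these two review comments is the kind of thing a small script could flag. Below is a minimal sketch, assuming the rule data is a mapping from reason name to a list of rule objects with a `kanaIn` field (as in the diff excerpt above). The reason names `-tara` and `-tari`, the table of expected endings, and the helper `find_misfiled_rules` are illustrative assumptions, not part of Yomichan or Rikaichamp.

```python
# Hypothetical consistency check: flags rules whose inflected ending does
# not match the block (reason) they are filed under.

# Assumed mapping from reason name to the endings its rules should have.
# The voiced variants (だら, だり) cover verbs such as 読んだら / 読んだり.
EXPECTED_ENDINGS = {
    "-tara": ("たら", "だら"),
    "-tari": ("たり", "だり"),
}


def find_misfiled_rules(rules_by_reason):
    """Return (reason, kanaIn) pairs whose kanaIn has an unexpected ending."""
    misfiled = []
    for reason, endings in EXPECTED_ENDINGS.items():
        for rule in rules_by_reason.get(reason, []):
            # str.endswith accepts a tuple of candidate suffixes.
            if not rule["kanaIn"].endswith(endings):
                misfiled.append((reason, rule["kanaIn"]))
    return misfiled


# Toy data reproducing the problem from the review comments.
sample = {
    "-tara": [{"kanaIn": "かったら"}, {"kanaIn": "逝ったり"}],
    "-tari": [{"kanaIn": "かったり"}, {"kanaIn": "逝ったら"}],
}
print(find_misfiled_rules(sample))
```

On the toy data this reports one stray entry per block. Reasons not listed in `EXPECTED_ENDINGS` are simply skipped.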

@siikamiika
Collaborator

They seem to be present in the original too, so if this turns out to be an error, better report it to Rikaichamp

@siikamiika
Collaborator

Also, if you used some script to do this, please do share 😄

@toasted-nutbread
Collaborator Author

> They seem to be present in the original too, so if this turns out to be an error, better report it to Rikaichamp

Hmm, you're right, I didn't notice this. I'll open an issue on rikaichamp to see what's up with that.

> Also, if you used some script to do this, please do share

https://gist.github.com/toasted-nutbread/798c3f96f570307be2ace92c6da03d10

@toasted-nutbread
Collaborator Author

Looks like this was a bug. A bugfix has been submitted to birchill/10ten-ja-reader#111 and I have rebuilt the JSON data in this commit.

@siikamiika merged commit eee89fa into FooSoft:master on Sep 2, 2019
@siikamiika
Collaborator

Thanks!

@FooSoft
Owner

FooSoft commented Sep 2, 2019

@toasted-nutbread @siikamiika for future reference, the script I used to convert deinflect.dat is as follows:

#!/usr/bin/env python2

import codecs
import json
import sys


class Deinflector:
    class Rule:
        def __init__(self, source, target, types, reason):
            self.source = unicode(source)
            self.target = unicode(target)
            self.types = int(types)
            self.reason = int(reason)


    class Result:
        def __init__(self, stem, types, conjugations):
            self.stem = unicode(stem)
            self.types = int(types)
            self.conjugations = list(conjugations)


    def __init__(self, filename=None):
        if filename is None:
            self.close()
        else:
            self.load(filename)


    def close(self):
        self.conjugations = list()
        self.rules = dict()


    def load(self, filename):
        self.close()

        try:
            with codecs.open(filename, 'rb', 'utf-8') as fp:
                lines = [line.strip() for line in fp.readlines()]
            # ignore the first line which is the file header
            del lines[0]
        except IOError:
            return False

        for line in lines:
            fields = line.split('\t')
            fieldCount = len(fields)

            if fieldCount == 1:
                self.conjugations.append(fields[0])
            elif fieldCount == 4:
                rule = self.Rule(*fields)
                sourceLength = len(rule.source)
                if sourceLength not in self.rules:
                    self.rules[sourceLength] = list()
                self.rules[sourceLength].append(rule)
            else:
                self.close()
                return False

        return True


    def deinflect(self, word):
        results = [self.Result(word, 0xff, list())]
        have = {word: 0}

        for result in results:
            for length, group in sorted(self.rules.items(), reverse=True):
                if length > len(result.stem):
                    continue

                for rule in group:
                    if result.types & rule.types == 0 or result.stem[-length:] != rule.source:
                        continue

                    new = result.stem[:len(result.stem) - len(rule.source)] + rule.target
                    if len(new) <= 1:
                        continue

                    if new in have:
                        result = results[have[new]]
                        result.types |= (rule.types >> 8)
                        continue

                    have[new] = len(results)

                    conjugations = [self.conjugations[rule.reason]] + result.conjugations
                    results.append(self.Result(new, rule.types >> 8, conjugations))

        return [
            (result.stem, u', '.join(result.conjugations), result.types) for result in results
        ]


    def validate(self, types, tags):
        for tag in tags:
            valid = (
                types & 1 and tag == 'v1' or
                types & 2 and tag[:2] == 'v5' or
                types & 4 and tag == 'adj-i' or
                types & 8 and tag == 'vk' or
                types & 16 and tag[:3] == 'vs-'
            )

            if valid:
                return True

        return False



d = Deinflector('deinflect.dat')

types = ['v1', 'v5', 'adj-i', 'vk', 'vs-']
items = dict()

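for length, group in d.rules.items():
    for r in group:
        # low bits of the mask: tags the rule accepts as input
        modsIn = list()
        for bit, t in enumerate(types):
            if r.types & (1 << bit):
                modsIn.append(t)

        # bits from 8 upward: tags of the deinflected result
        modsOut = list()
        for bit, t in enumerate(types):
            if (r.types >> 8) & (1 << bit):
                modsOut.append(t)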

        #print '%s -> %s (%s) %s' % (r.source, r.target, d.conjugations[r.reason], ', '.join(mods))

        reason = d.conjugations[r.reason]
        if reason not in items:
            items[reason] = list()

        items[reason].append({
            'expIn': [r.source],
            'expOut': r.target,
            'tagsIn': modsIn,
            'tagsOut': modsOut,
        })

#sys.setdefaultencoding('utf-8')

print json.dumps(items, sort_keys=True, indent=4, encoding='utf-8', ensure_ascii=False).encode('utf-8')

@FooSoft
Owner

FooSoft commented Sep 2, 2019

This was used to dump the original data from the crazy format that Rikaichan used.
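
For readers on Python 3, the layout that the script above relies on can be sketched more compactly. This is a reading of that script, not a specification of Rikaichan's format: after a header line, a line with one tab-separated field names a conjugation reason, and a line with four fields is a rule (inflected ending, replacement, type bitmask, reason index). The low bits of the bitmask feed `tagsIn` and the bits from 8 upward feed `tagsOut`. The function names and the sample data below are made up for illustration.

```python
import json

# Tag names in bit order, copied from the script above.
TYPES = ["v1", "v5", "adj-i", "vk", "vs-"]


def decode_tags(mask):
    """Return the tag names whose bit is set in mask."""
    return [tag for bit, tag in enumerate(TYPES) if mask & (1 << bit)]


def convert(lines):
    """Group rules by reason name, mirroring the JSON the script emits."""
    reasons = []  # one-field lines, in file order
    rules = []    # four-field lines, kept until all reasons are known
    for line in lines[1:]:  # skip the header line
        fields = line.strip().split("\t")
        if len(fields) == 1:
            reasons.append(fields[0])
        elif len(fields) == 4:
            source, target, types, reason = fields
            rules.append((source, target, int(types), int(reason)))
        else:
            raise ValueError("unexpected field count: %r" % line)

    items = {}
    for source, target, types, reason in rules:
        items.setdefault(reasons[reason], []).append({
            "expIn": [source],
            "expOut": target,
            "tagsIn": decode_tags(types),
            "tagsOut": decode_tags(types >> 8),
        })
    return items


# Made-up sample in the assumed layout; not real deinflect.dat content.
sample = [
    "header",
    "past",
    "negative",
    "かった\tい\t1024\t0",
    "くない\tい\t1028\t1",
]
print(json.dumps(convert(sample), ensure_ascii=False, indent=4, sort_keys=True))
```

Unlike the original, this sketch keeps rules in file order rather than grouping them by ending length first, so the order of rules within a reason may differ, and it raises an exception on a malformed line instead of returning `False`.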

@siikamiika
Collaborator

@FooSoft @toasted-nutbread Thanks for sharing! I found a 2010 version of deinflect.dat on Google. I wonder if it's an optimized version of a higher-level format or if that file was actually edited directly. At least Rikaichamp uses an enum for the deinflection reason.

@epistularum

#910

It looks like Rikaichamp's ruleset is missing entries for ずる verbs (サ行変格活用, the sa-row irregular conjugation; code "vz").


Successfully merging this pull request may close these issues.

"potential or passive" is wrong for Godan verbs ending in る