This repository has been archived by the owner on Feb 25, 2023. It is now read-only.

add epwing support for meikyou #2

Merged
merged 3 commits into from Feb 17, 2017

Conversation


@ghost ghost commented Feb 17, 2017

Introduction

I use the meikyou epwing dictionary on Firefox with rikaisama, but after over a decade of use I am looking to move away from Firefox and onto Chrome, so I needed a replacement for rikaisama, and yomichan is basically the best option. I've got it to the point where it works very well in yomichan as far as I can tell, so I think this is a good time to make the pull request. Below are notes I took that will (I hope) be helpful in verifying that my work is up to your standards (also, check your koohii PMs). Also, I seriously don't know how you managed to put yourself through the bitmap glyph mappings for the daijirin extractor.

Also I am relatively inexperienced with the licensing stuff so I didn't know what to actually put at the top of the meikyou.go file I added; I will leave that to you.

Regexes

Normal expressions

These are enclosed in 【】.

  • ▼▽ are indicators for less common kanji alternatives that should be removed.
    -- 【忘れな草・▼勿▽忘草】 -> 【忘れな草・勿忘草】

  • Sometimes a term or part of one is enclosed by 〈〉 and should be replaced with the enclosed text. Only 831 headings contain 〈〉.
    -- 【山〈時鳥〉】-> 【山時鳥】

  • Sometimes a term or part of one is enclosed by 《》 and should be replaced with the enclosed text. Only 324 headings contain 《》.
    -- 【大《伯父》・大《叔父》】 -> 【大伯父・大叔父】

  • There are no headings containing nesting of 〈〉 in 《》 or vice-versa.

  • Alternatives are usually delimited by ・. Sometimes alternatives are contained in () following the main entry, so these should be included as well. 3019 headings contain such alternatives enclosed in (). In a small number of headings (118 of them), all the expressions are contained in (). A heading contains at most one ()-enclosed set of alternatives.
    -- 【▼葦(▼蘆・▼葭・▼芦)】 -- remove indicators --> 【葦(蘆・葭・芦)】 -- append four terms --> [葦, 蘆, 葭, 芦]
    -- 【▽普く・▽遍く(▽周く)】 -- remove indicators --> 【普く・遍く(周く)】 -- append three terms --> [普く, 遍く, 周く]
    -- 【(緩り)】 -- append one term --> 緩り
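
The heading rules above can be sketched in Go roughly as follows. This is only an illustration under stated assumptions, not the code in meikyou.go: `splitExpressions` is a hypothetical helper, it receives the text already taken from between 【 and 】, and it accepts both ASCII and full-width parentheses because it is not clear from the notes which form the dictionary data uses.

```go
package main

import (
	"fmt"
	"strings"
)

// stripper removes the ▼▽ indicators and unwraps 〈〉 and 《》 by dropping
// the bracket characters while keeping the enclosed text.
var stripper = strings.NewReplacer(
	"▼", "", "▽", "",
	"〈", "", "〉", "",
	"《", "", "》", "",
)

// parens turns parentheses into the ・ delimiter. A heading holds at most
// one () group, so flattening it loses no information.
// Assumption: both ASCII and full-width parentheses may occur.
var parens = strings.NewReplacer(
	"(", "・", ")", "・",
	"（", "・", "）", "・",
)

// splitExpressions (hypothetical name) takes the text between 【 and 】
// and returns every expression, including alternatives listed in ().
func splitExpressions(heading string) []string {
	cleaned := parens.Replace(stripper.Replace(heading))

	var terms []string
	for _, term := range strings.Split(cleaned, "・") {
		if term != "" { // skip the empty pieces left by leading/trailing delimiters
			terms = append(terms, term)
		}
	}
	return terms
}

func main() {
	fmt.Println(splitExpressions("▼葦(▼蘆・▼葭・▼芦)"))
	fmt.Println(splitExpressions("▽普く・▽遍く(▽周く)"))
	fmt.Println(splitExpressions("(緩り)"))
	fmt.Println(splitExpressions("大《伯父》・大《叔父》"))
}
```

Treating the parentheses as just another delimiter keeps the sketch short; it relies on the observation above that 〈〉 and 《》 never nest and that there is at most one () group per heading.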

Foreign expressions

These are enclosed in [] and contain a word and then optionally following that a country of origin. There are no headings which contain both foreign expressions and normal expressions. There are 5431 foreign expressions.

  • Sometimes 和製 and + wrap a word to obviously indicate that it is wasei-eigo. 和製 should be removed, + replaced with a space, and origin information should be removed. The language of origin information is not important in my opinion. The origin information that appears is: 中国, 朝鮮, イタリア, スペイン, ドイツ, フランス, オランダ, ポルトガル, ギリシア, アラビア, チベット, タガログ, ヘブライ, ヒンディー, マレーシア, ラテン, アフリカーンス, ロシア, ハワイ, マレー, スウェーデン, ノルウェー・デンマーク, フィンランド, サンスクリット, ポーランド.
    -- [material] -> [material]
    -- [mamma・mama] -> [mamma・mama]
    -- [madomoiselleフランス] -> [madomoiselle]
    -- [和製matroosオランダ+pipe] -> [matroos pipe]
    -- [叉焼中国] -> [叉焼]
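
As a rough sketch of the clean-up described above (again an illustration, not the code in meikyou.go), the Go snippet below drops 和製, splits on +, and trims a trailing origin name from each part. `cleanForeign` and `trimOrigin` are hypothetical names, and the snippet assumes the origin always directly follows the word it belongs to, as in the examples.

```go
package main

import (
	"fmt"
	"strings"
)

// origins lists the origin labels named in the notes above.
var origins = []string{
	"中国", "朝鮮", "イタリア", "スペイン", "ドイツ", "フランス", "オランダ",
	"ポルトガル", "ギリシア", "アラビア", "チベット", "タガログ", "ヘブライ",
	"ヒンディー", "マレーシア", "ラテン", "アフリカーンス", "ロシア", "ハワイ",
	"マレー", "スウェーデン", "ノルウェー・デンマーク", "フィンランド",
	"サンスクリット", "ポーランド",
}

// trimOrigin removes one trailing origin label, if present. A word that
// consists only of an origin label is left untouched.
func trimOrigin(word string) string {
	for _, origin := range origins {
		if len(word) > len(origin) && strings.HasSuffix(word, origin) {
			return strings.TrimSuffix(word, origin)
		}
	}
	return word
}

// cleanForeign (hypothetical name) takes the text between [ and ].
func cleanForeign(heading string) string {
	heading = strings.ReplaceAll(heading, "和製", "")
	parts := strings.Split(heading, "+")
	for i, part := range parts {
		parts[i] = trimOrigin(part)
	}
	return strings.Join(parts, " ")
}

func main() {
	fmt.Println(cleanForeign("material"))
	fmt.Println(cleanForeign("madomoiselleフランス"))
	fmt.Println(cleanForeign("和製matroosオランダ+pipe"))
	fmt.Println(cleanForeign("叉焼中国"))
}
```

Trimming only at the end of each part, rather than deleting the labels wherever they occur, avoids damaging an expression that happens to contain one of the origin names in the middle.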

Readings and Other expressions

Normally the reading of the expression(s) precedes 【】 in the heading. However, sometimes other expressions are placed there as well. Most other expressions look like an expression you would normally find enclosed in 【】, but there are 6 "special" other expressions enclosed in bracket characters that do not appear anywhere else.
-- 〔小さい〕
-- 〔大きい〕
-- [夏]
-- [秋]
-- [春]
-- [冬]

These are all obviously very common words so I do not think these are worth addressing. The difference between a match that is a reading and a match that is an other expression can be determined by checking if we found any expressions normally (the same way it is handled for daijirin).
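
The reading/other-expression check can be illustrated with a small Go sketch. The `term` type and `buildTerms` function are hypothetical names for illustration and are not taken from the extractor; the inputs in `main` are made up.

```go
package main

import "fmt"

// term is a hypothetical, simplified stand-in for a dictionary term.
type term struct {
	Expression string
	Reading    string
}

// buildTerms decides how to interpret the text that precedes 【】.
// If expressions were found normally, the prefix is their reading;
// otherwise the prefix is itself the expression and has no reading.
func buildTerms(prefix string, expressions []string) []term {
	if len(expressions) == 0 {
		return []term{{Expression: prefix}}
	}
	terms := make([]term, 0, len(expressions))
	for _, expression := range expressions {
		terms = append(terms, term{Expression: expression, Reading: prefix})
	}
	return terms
}

func main() {
	// Illustrative inputs only.
	fmt.Printf("%+v\n", buildTerms("あまねく", []string{"普く", "遍く", "周く"}))
	fmt.Printf("%+v\n", buildTerms("あまねく", nil))
}
```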

Tags

Tags are wrapped in 〘〙, which are wide bitmap glyphs 45118 and 45119 respectively in the actual Meikyou epwing. Tags are separated by ・. When more than one 〘〙-enclosed set of tags exists in the text field, the sets are on different lines. I am assuming that the convention is to add rules that correspond to the EDICT tags, so I have tried to stick to that as much as possible. That being said, the exportRules code is a bit messy because it does not use regexes the way the daijirin and daijisen extractors do. I am not sure I understand the use of the rules (my guess is that they are used to help with yomichan's deinflector), and if the only rules that matter are "adj-i", "vs", "vk", "v5", and "v1" then the exportRules code I wrote can be simplified a lot more.

These are all the unique tags in Meikyou:
ニ,トニ,他,他上一,他下一,他下二,他五,他四,他サ変,代,副,副ト,副トニ,副助,助動,助動 下一型,助動 下二型,助動 五型,助動 四型,助動 ラ変型,助動 形動型,助動 形型,助動 特活型,動上一,動下一,動下二,動五,動四,動サ変,動特活,名,形,形ク,形シク,形動,形動ナリ,形動トタル,感,接,接助,接尾,接頭,格助,終助,自,自上一,自上二,自下一,自下二,自五,自他,自他上一,自他下一,自他五,自他サ変,自四,自サ変,補動,補動五,補動四,補形,連体,連語

  • ignored tags: 動特活,感,連語,ニ,助動,形動ナリ,形シク,形ク

  • Based on the daijirin and daijisen extractor it appears that nidan, yodan, and godan verbs are all being grouped under the v5 rule so I have stayed consistent with that.
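
For illustration, a regex-based version of the tag handling might look like the Go sketch below. It is not the exportRules code from this pull request. It assumes the glyph markers have already been replaced by 〘 and 〙, that tags inside one set are separated by ・, and that a rule can be chosen from the tag's suffix; `extractTags` and `ruleFor` are hypothetical names, and the suffix mapping is a guess that only follows the grouping described above (nidan, yodan, and godan under v5).

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tagExp matches one 〘…〙 set of tags.
var tagExp = regexp.MustCompile(`〘([^〙]*)〙`)

// extractTags collects the tags from every 〘〙 set in the text field.
// Assumption: tags within a set are separated by ・.
func extractTags(text string) []string {
	var tags []string
	for _, match := range tagExp.FindAllStringSubmatch(text, -1) {
		for _, tag := range strings.Split(match[1], "・") {
			if tag != "" {
				tags = append(tags, tag)
			}
		}
	}
	return tags
}

// ruleFor maps a Meikyou tag to a deinflection rule, or "" if none applies.
// The mapping is an assumption made for this sketch.
func ruleFor(tag string) string {
	switch {
	case tag == "形":
		return "adj-i"
	case strings.HasSuffix(tag, "サ変"):
		return "vs"
	case strings.HasSuffix(tag, "上一"), strings.HasSuffix(tag, "下一"):
		return "v1"
	case strings.HasSuffix(tag, "五"), strings.HasSuffix(tag, "四"), strings.HasSuffix(tag, "二"):
		return "v5"
	}
	return ""
}

func main() {
	for _, tag := range extractTags("〘名・自サ変〙\n〘他下二〙") {
		fmt.Printf("%s -> %q\n", tag, ruleFor(tag))
	}
}
```

Tags such as 名 or 助動 下一型 fall through to the empty rule in this sketch, which mirrors the idea of ignoring tags that do not help deinflection.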

Glyph Tables

I manually created the tables based on the bitmap glyphs dumped from ebfont from your eb project repo on commit 6d0af07d883a239279d4984ce1785debabcf795d (which appears to still be the linked submodule in zero-epwing currently).

Below are several notes I recorded as I was going through the process of creating the narrow and wide tables. Despite the fact that there are many unused characters, I spent a lot of time making sure that the tables are correct, and I've left notes below on particular glyphs where I thought they justified a mention. I primarily used utf8-chartable.de, weblio.jp, kotobank.com, glyphwiki.org, and mdbg.net/chindict/chindict.php?page=radicals to hunt down obscure kanji and other weird characters in the wide table, and extended latin/greek/other characters in the narrow table.

I determined whether or not a glyph is unused based on whether or not dictionary entries came up after grepping for the inline markers on the output of the bundled zero-epwing binary with yomichan-import @ 816e9e6. I didn't look too closely at the zero-epwing code, so in case it unintentionally filters out entries or some kind of text referring to those glyphs, I tried to determine the mappings anyway in case they become relevant in later versions of zero-epwing. Given how nonsensical this format is, I wouldn't be surprised if those glyphs just never got used either, though.
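
The same kind of check could be done programmatically instead of with grep. The Go sketch below collects every glyph marker that occurs in a piece of text. The `{{w_NNNNN}}` form is taken from the entries quoted in these notes, while the `n_` prefix for narrow glyphs and the name `usedGlyphs` are assumptions made for the example.

```go
package main

import (
	"fmt"
	"regexp"
)

// markerExp matches inline glyph markers such as {{w_45095}}.
// Assumption: narrow glyphs use the prefix "n" in the same format.
var markerExp = regexp.MustCompile(`\{\{([nw])_(\d+)\}\}`)

// usedGlyphs returns the set of markers (for example "w_45095") found in text.
func usedGlyphs(text string) map[string]bool {
	used := make(map[string]bool)
	for _, match := range markerExp.FindAllStringSubmatch(text, -1) {
		used[match[1]+"_"+match[2]] = true
	}
	return used
}

func main() {
	entry := "りゅう【隆】(造)\n{{w_45095}}もりあがる。もりあげる。「─起」\n{{w_45096}}さかんになる。「─盛」「興─」 {{w_45676}}\n"
	fmt.Println(usedGlyphs(entry))
}
```

Any code in the tables that never shows up in the resulting set across all entries would count as unused in the sense described above.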

Notes on the narrow table

Unused characters: 41249-41257, 41259-41289, 41291-41312, 41314, 41316, 41318-41319, 41325-41327, 41329-41331, 41333-41334, 41336-41340, 41342, 41505-41507, 41509-41583, 41585-41589, 41591-41597, 41598, 41761-41775, 41777-41830, 41841, 41848-41850, 42021.

  • 41847 is used in the dictionary definition for 踊り字; I am fairly sure this character is actually intended to be the wide character 〻 (U+303B). The bitmap (a thin backwards S) doesn't quite match that character visually, but the dictionary definition suggests that it should indeed be 〻. I have mapped it to 〻, but I am not sure if you are okay with that, so I will leave it to your discretion.

  • 41851 is used in the dictionary definition for クレッシェンド and デクレッシェンド but based on what I looked up the image file it points to is not actually the correct symbol for crescendo. However, "correcting" the entry would mean using U+1D192 which is not too well supported by fonts from what I can tell so I decided to go with a character (U+2227) that visually matched the bitmap glyph given.

  • 41817 I cannot find a matching character for this but it is not used in any dictionary entries so I have left it unmapped.

  • 41852 I have no idea what this is supposed to be but it is also not used in any dictionary entries so I have left it unmapped.

Notes on the wide table

Unused characters: 45089-45094, 45110-45111, 45113-45114, 45133-45138, 45140-45141, 45149, 45345, 45376, 45378, 45388, 45418, 45431, 45637-45638, 45685-45687, 45689-45690, 45858, 45862, 45864.

  • 45120 and 45121 are \/ but it is obvious from the dictionary entry they appear in, 踊り字, that they are meant to actually be the single character ゝ. I believe this is a consequence of the dictionary entry being intended to be read vertically (related http://soudan1.biglobe.ne.jp/qa6874321.html including JIS misery). In any case I've mapped them to \ and / respectively.

  • 45149 I cannot find a matching character for this; I'm not even sure this is actually a CJK character. Since it does not appear in any dictionary entries, I have left it unmapped.

  • 45676 is hard to determine from the glyph bitmap, but I believe it should be 爛. This is the entry for reference: りゅう【隆】(造)\n{{w_45095}}もりあがる。もりあげる。「─起」\n{{w_45096}}さかんになる。「─盛」「興─」 {{w_45676}}\n. 45095 and 45096 are ① and ② respectively. Looking online for dictionary entries for 爛 suggests that this is the intended character.

  • 45681 is the geta character 〓 but it shows up in a dictionary entry for しけいと which is: しけ‐いと【▼{{w_45861}}糸】{{w_45118}}名{{w_45119}}\n繭の外皮からとった粗悪な生糸。しけのいと。しけ。多く織物の横糸などに用いる。「─織」\n. Based on what I found online it looks like this should be 絓糸 and not 〓糸 but I have mapped the glyph bitmap to 〓 despite this since this is how it would show up.

  • 45863 is the character for a musical whole note. However, the actual whole note character is U+1D15D and since this is not too well supported by fonts I went with the white circle character ○ that looks similar enough to the bitmap (which itself does not look quite like the whole note).

  • 45864 is a glyph of three sound waves contained in a box. I couldn't find a Unicode character that matches it closely, but U+1F50A would correspond to essentially the same thing. Since that character isn't well supported by fonts and the glyph does not appear in any dictionary entries, I just left it unmapped.
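
To show how such a table is applied, here is a minimal Go sketch that substitutes wide glyph markers using a handful of the mappings mentioned in these notes. The table is deliberately partial, the names `wideTable` and `replaceWide` are hypothetical, and leaving unmapped markers in place is a choice made for this sketch rather than a statement about what the extractor does.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// wideTable holds a few of the wide glyph mappings discussed above.
var wideTable = map[int]string{
	45095: "①",
	45096: "②",
	45118: "〘",
	45119: "〙",
	45120: "\\",
	45121: "/",
	45676: "爛",
	45863: "○",
}

// wideExp matches wide glyph markers such as {{w_45118}}.
var wideExp = regexp.MustCompile(`\{\{w_(\d+)\}\}`)

// replaceWide swaps every mapped marker for its character and leaves
// unmapped markers untouched so they remain visible.
func replaceWide(text string) string {
	return wideExp.ReplaceAllStringFunc(text, func(marker string) string {
		code, err := strconv.Atoi(wideExp.FindStringSubmatch(marker)[1])
		if err != nil {
			return marker
		}
		if glyph, ok := wideTable[code]; ok {
			return glyph
		}
		return marker
	})
}

func main() {
	fmt.Println(replaceWide("{{w_45118}}名{{w_45119}}"))
	fmt.Println(replaceWide("{{w_45095}}もりあがる。{{w_45149}}"))
}
```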

@ghost ghost changed the title from "add epwing support for meikyo" to "add epwing support for meikyou" on Feb 17, 2017
@FooSoft
Owner

FooSoft commented Feb 17, 2017

Thank you for your excellent work, I can say without any hesitation that this is the cleanest, best documented pull request I have received in all my years of doing open source on GitHub 🥇 I'm sure that your efforts will be highly appreciated by everyone using Yomichan to better their understanding of Japanese. I keep on wanting to expand yomichan-import with support for other dictionaries but it always has to be done at the expense of development time on the actual extension...

The bitmap glyph mappings are honestly a huge pain in the ass. I've managed to scrounge some tables from here and there, but they are not top quality (Daijirin and Daijisen tables have some errors). I've made changes to zero-epwing to dump out font glyph data in all available sizes, and I am planning on creating a simple OCR tool to build these tables automatically. Interestingly enough, there are no libraries that I have found that do a good job with Japanese character recognition; I'm hoping that I can get good results with my method (it will be run offline and the character tables will be hard-coded into the source files like they are now).

Regarding the copyright stuff, I'm not too fussy about it. The file is going to be MIT licensed like the rest of the project, and since you wrote everything in the file you will be credited as the author 👍

@FooSoft FooSoft merged commit 81180c7 into FooSoft:master Feb 17, 2017