This repository has been archived by the owner on Feb 25, 2023. It is now read-only.
Introduction
I use the meikyou epwing dictionary on Firefox with rikaisama, but after over a decade I am moving from Firefox to Chrome, so I needed a replacement for rikaisama, and yomichan is basically the best option. As far as I can tell the dictionary now works very well in yomichan, so I think this is a good time to make the pull request. Below are notes I took that will (I hope) be helpful in verifying that my work is correct to your standards (also, check your koohii PMs). Also, I seriously don't know how you managed to put yourself through the bitmap glyph mappings for the daijirin extractor.
Also, I am relatively inexperienced with the licensing stuff, so I didn't know what to actually put at the top of the meikyou.go file I added; I will leave that to you.
Regexes
Normal expressions
These are enclosed in 【】. ▼ and ▽ are indicators for less common kanji alternatives and should be removed.
--
【忘れな草・▼勿▽忘草】
->【忘れな草・勿忘草】
Sometimes a term or part of one is enclosed by 〈〉 and should be replaced with the enclosed text. Only 831 headings contain 〈〉.
--
【山〈時鳥〉】
->【山時鳥】
Sometimes a term or part of one is enclosed by 《》 and should be replaced with the enclosed text. Only 324 headings contain 《》.
--
【大《伯父》・大《叔父》】
->【大伯父・大叔父】
There are no headings containing nesting of 〈〉 in 《》 or vice-versa.
Alternatives are usually delimited by ・. Sometimes alternatives are contained in () following the main entry, so these should be included as well. 3019 headings contain such alternatives enclosed in (). In a small number of headings (118 of them), all the expressions are contained in (). A heading contains at most one ()-enclosed set of alternatives.
--
【▼葦(▼蘆・▼葭・▼芦)】
-- remove indicators -->【葦(蘆・葭・芦)】
-- append four terms -->[葦, 蘆, 葭, 芦]
--
【▽普く・▽遍く(▽周く)】
-- remove indicators -->【普く・遍く(周く)】
-- append three terms -->[普く, 遍く, 周く]
--
【(緩り)】
-- append one term -->緩り
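The heading rules above can be sketched in Go roughly as follows. This is only an illustration under stated assumptions, not the code in meikyou.go: `splitHeading` and `markerReplacer` are hypothetical names, the input is assumed to be the text already extracted from between 【 and 】, and both ASCII and full-width parentheses are accepted because I am not certain which one the raw data uses.

```go
package main

import "strings"

// markerReplacer (hypothetical name) removes the ▼/▽ rarity indicators,
// unwraps 〈〉 and 《》, and turns parentheses into the ・ delimiter.
// Accepting full-width parentheses as well is an assumption.
var markerReplacer = strings.NewReplacer(
	"▼", "", "▽", "",
	"〈", "", "〉", "",
	"《", "", "》", "",
	"(", "・", ")", "・",
	"（", "・", "）", "・",
)

// splitHeading (hypothetical helper) takes the text between 【 and 】 and
// returns every expression, including alternatives listed in ().
func splitHeading(inner string) []string {
	// A heading holds at most one () group and there is no nesting, so
	// flattening the parentheses into ・ is enough.
	flat := markerReplacer.Replace(inner)

	var expressions []string
	for _, part := range strings.Split(flat, "・") {
		if part != "" { // skip the empty pieces left by "(" and ")"
			expressions = append(expressions, part)
		}
	}
	return expressions
}
```

With this sketch, `splitHeading("▽普く・▽遍く(▽周く)")` yields the three terms 普く, 遍く, and 周く, and `splitHeading("(緩り)")` yields the single term 緩り.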
Foreign expressions
These are enclosed in [] and contain a word, optionally followed by a country of origin. There are no headings which contain both foreign expressions and normal expressions. There are 5431 foreign expressions. The origins that appear are: 中国, 朝鮮, イタリア, スペイン, ドイツ, フランス, オランダ, ポルトガル, ギリシア, アラビア, チベット, タガログ, ヘブライ, ヒンディー, マレーシア, ラテン, アフリカーンス, ロシア, ハワイ, マレー, スウェーデン, ノルウェー・デンマーク, フィンランド, サンスクリット, ポーランド.
--
[material]
->[material]
--
[mamma・mama]
->[mamma・mama]
--
[madomoiselleフランス]
->[madomoiselle]
--
[和製matroosオランダ+pipe]
->[matroos pipe]
--
[叉焼中国]
->[叉焼]
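A minimal Go sketch of the foreign-expression cleanup shown in the examples above. It is an illustration, not the code in meikyou.go: `cleanForeign` and `origins` are hypothetical names, the input is assumed to be the text between [ and ], and treating the origin as a suffix of each "+"-separated part (with 和製 only as a leading prefix) is my reading of the examples.

```go
package main

import "strings"

// origins lists the country-of-origin labels that can follow a word.
var origins = []string{
	"中国", "朝鮮", "イタリア", "スペイン", "ドイツ", "フランス", "オランダ",
	"ポルトガル", "ギリシア", "アラビア", "チベット", "タガログ", "ヘブライ",
	"ヒンディー", "マレーシア", "ラテン", "アフリカーンス", "ロシア", "ハワイ",
	"マレー", "スウェーデン", "ノルウェー・デンマーク", "フィンランド",
	"サンスクリット", "ポーランド",
}

// cleanForeign (hypothetical helper) strips the 和製 prefix and any trailing
// origin label from each "+"-joined part, then joins the parts with spaces.
func cleanForeign(inner string) string {
	inner = strings.TrimPrefix(inner, "和製")
	parts := strings.Split(inner, "+")
	for i, part := range parts {
		for _, origin := range origins {
			trimmed := strings.TrimSuffix(part, origin)
			// Only strip when something is left, so a part that is
			// nothing but an origin label is kept as-is.
			if trimmed != part && trimmed != "" {
				part = trimmed
				break
			}
		}
		parts[i] = part
	}
	return strings.Join(parts, " ")
}
```

Under these assumptions, `cleanForeign("和製matroosオランダ+pipe")` gives `matroos pipe` and `cleanForeign("叉焼中国")` gives `叉焼`. Alternatives joined by ・ are left untouched here and would still need to be split afterwards.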
Readings and Other expressions
Normally the reading of the expression(s) precedes 【】 in the heading. However, sometimes there are also other expressions placed there. Most other expressions look like an expression you would normally find enclosed in 【】, but there are 6 "special" other expressions enclosed in bracket characters that do not appear anywhere else.
-- 〔小さい〕
-- 〔大きい〕
-- [夏]
-- [秋]
-- [春]
-- [冬]
These are all obviously very common words so I do not think these are worth addressing. The difference between a match that is a reading and a match that is an other expression can be determined by checking if we found any expressions normally (the same way it is handled for daijirin).
Tags
Tags are wrapped in 〘〙, which are wide bitmap glyphs 45118 and 45119 respectively in the actual Meikyou epwing. Tags are separated by ・. When more than one 〘〙-enclosed set of tags exists in the text field, they are on different lines. I am assuming that the convention is to add rules that correspond to the EDICT tags, so I have tried to stick to that as much as possible. That being said, the exportRules code is a bit messy due to not using regexes as in the daijirin and daijisen extractors. I am not sure I understand the use of the rules (my guess is that they are used to help with yomichan's deinflector), and if the only rules that matter are "adj-i", "vs", "vk", "v5", and "v1" then the exportRules code I wrote can be simplified a lot more.
These are all the unique tags in Meikyou:
ニ,トニ,他,他上一,他下一,他下二,他五,他四,他サ変,代,副,副ト,副トニ,副助,助動,助動 下一型,助動 下二型,助動 五型,助動 四型,助動 ラ変型,助動 形動型,助動 形型,助動 特活型,動上一,動下一,動下二,動五,動四,動サ変,動特活,名,形,形ク,形シク,形動,形動ナリ,形動トタル,感,接,接助,接尾,接頭,格助,終助,自,自上一,自上二,自下一,自下二,自五,自他,自他上一,自他下一,自他五,自他サ変,自四,自サ変,補動,補動五,補動四,補形,連体,連語
Ignored tags:
動特活,感,連語,ニ,助動,形動ナリ,形シク,形ク
Based on the daijirin and daijisen extractor it appears that nidan, yodan, and godan verbs are all being grouped under the v5 rule so I have stayed consistent with that.
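To make the simplified, regex-based variant concrete, here is a rough Go sketch of mapping tags to rules. It is not the exportRules code from this pull request: `rulesFromText` and `tagGroup` are hypothetical names, matching on tag suffixes is my own assumption, and only the "adj-i", "vs", "v1", and "v5" rules are produced (no カ変 tag appears in the list above, so "vk" is left out).

```go
package main

import (
	"regexp"
	"strings"
)

// tagGroup matches one 〘…〙 group and captures the tags inside it.
var tagGroup = regexp.MustCompile(`〘([^〙]*)〙`)

// rulesFromText (hypothetical helper) collects deinflection rules from
// every 〘…〙 group in an entry's text, without duplicates.
func rulesFromText(text string) []string {
	seen := map[string]bool{}
	var rules []string
	add := func(rule string) {
		if !seen[rule] {
			seen[rule] = true
			rules = append(rules, rule)
		}
	}

	for _, match := range tagGroup.FindAllStringSubmatch(text, -1) {
		for _, tag := range strings.Split(match[1], "・") {
			switch {
			case tag == "形":
				add("adj-i")
			case strings.HasSuffix(tag, "サ変"):
				add("vs")
			case strings.HasSuffix(tag, "上一"), strings.HasSuffix(tag, "下一"):
				add("v1")
			// Godan, yodan, and nidan verbs all go under v5, following
			// the grouping described above.
			case strings.HasSuffix(tag, "五"), strings.HasSuffix(tag, "四"),
				strings.HasSuffix(tag, "上二"), strings.HasSuffix(tag, "下二"):
				add("v5")
			}
		}
	}
	return rules
}
```

Tags that match none of the cases (名, 感, 連語, the 助動 … 型 family, and so on) simply contribute no rule in this sketch.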
Glyph Tables
I manually created the tables based on the bitmap glyphs dumped from ebfont from your eb project repo on commit 6d0af07d883a239279d4984ce1785debabcf795d (which appears to still be the linked submodule in zero-epwing currently).
Below are several notes I recorded as I was going through the process of creating the narrow and wide tables. Despite the fact that there are many unused characters, I spent a lot of time making sure that the tables are correct, and I've left notes below on particular glyphs where I thought they justified a mention. I primarily used utf8-chartable.de, weblio.jp, kotobank.com, glyphwiki.org, and mdbg.net/chindict/chindict.php?page=radicals to hunt down obscure kanji and other weird characters in the wide table, and extended latin/greek/other characters in the narrow table.
I determined whether or not a glyph is unused based on whether or not dictionary entries came up after grepping for the inline markers on the output of the bundled zero-epwing binary with yomichan-import @ 816e9e6. I didn't look too closely at the zero-epwing code, so in case it unintentionally filters out entries or some kind of text referring to those glyphs, I tried to determine the mappings anyway in case they become relevant in later versions of zero-epwing. Given how nonsensical this format is, I wouldn't be surprised if those glyphs just never got used, though.
Notes on the narrow table
Unused characters: 41249-41257, 41259-41289, 41291-41312, 41314, 41316, 41318-41319, 41325-41327, 41329-41331, 41333-41334, 41336-41340, 41342, 41505-41507, 41509-41583, 41585-41589, 41591-41597, 41598, 41761-41775, 41777-41830, 41841, 41848-41850, 42021.
41847 is used in the dictionary definition for 踊り字; I am fairly sure this character is actually intended to be the wide character 〻 (U+303B). The bitmap (a thin backwards S) doesn't quite match that character visually, but the dictionary definition suggests that it should indeed be 〻. I have mapped it to 〻, but I am not sure if you are okay with that, so I will leave it to your discretion.
41851 is used in the dictionary definition for クレッシェンド and デクレッシェンド but based on what I looked up the image file it points to is not actually the correct symbol for crescendo. However, "correcting" the entry would mean using U+1D192 which is not too well supported by fonts from what I can tell so I decided to go with a character (U+2227) that visually matched the bitmap glyph given.
41817 I cannot find a matching character for this but it is not used in any dictionary entries so I have left it unmapped.
41852 I have no idea what this is supposed to be but it is also not used in any dictionary entries so I have left it unmapped.
Notes on the wide table
Unused characters: 45089-45094, 45110-45111, 45113-45114, 45133-45138, 45140-45141, 45149, 45345, 45376, 45378, 45388, 45418, 45431, 45637-45638, 45685-45687, 45689-45690, 45858, 45862, 45864.
45120 and 45121 are \/ but it is obvious from the dictionary entry they appear in, 踊り字, that they are meant to actually be the single character ゝ. I believe this is a consequence of the dictionary entry being intended to be read vertically (related http://soudan1.biglobe.ne.jp/qa6874321.html including JIS misery). In any case I've mapped them to \ and / respectively.
45149 I cannot find a matching character for this; I'm not even sure this is actually a CJK character. Since it does not appear in any dictionary entries, I have left it unmapped.
45676 is hard to determine from the glyph bitmap, but I believe it should be 爛. This is the entry for reference:
りゅう【隆】(造)\n{{w_45095}}もりあがる。もりあげる。「─起」\n{{w_45096}}さかんになる。「─盛」「興─」 {{w_45676}}\n
45095 and 45096 are ① and ② respectively. Looking online for dictionary entries for 爛 suggests that this is the intended character.
45681 is the geta character 〓 but it shows up in a dictionary entry for しけいと which is:
しけ‐いと【▼{{w_45861}}糸】{{w_45118}}名{{w_45119}}\n繭の外皮からとった粗悪な生糸。しけのいと。しけ。多く織物の横糸などに用いる。「─織」\n
Based on what I found online it looks like this should be 絓糸 and not 〓糸, but I have mapped the glyph bitmap to 〓 despite this since this is how it would show up.
45863 is the character for a musical whole note. However, the actual whole note character is U+1D15D, and since this is not too well supported by fonts I went with the white circle character ○, which looks similar enough to the bitmap (which itself does not look quite like the whole note).
45864 is a glyph of three sound waves contained in a box. I couldn't find a unicode character that matches this closely, but U+1F50A would correspond to essentially the same thing. Since it isn't well supported by fonts and it does not appear in any dictionary entries, I just left it unmapped.
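For reference, here is a small Go sketch of how tables like these get applied to the inline markers in the zero-epwing output. It is an illustration under assumptions rather than the code in this pull request: `replaceGlyphs`, `wideTable`, and `narrowTable` are hypothetical names, the tables hold only a handful of the mappings mentioned in these notes, and the `{{n_…}}` form for narrow glyphs is assumed by analogy with the `{{w_…}}` markers in the entries quoted above.

```go
package main

import (
	"regexp"
	"strconv"
)

// glyphMarker matches inline markers such as {{w_45095}}. The n_ prefix
// for narrow glyphs is an assumption.
var glyphMarker = regexp.MustCompile(`\{\{([nw])_(\d+)\}\}`)

// A few wide-table mappings taken from the notes above.
var wideTable = map[int]string{
	45095: "①", 45096: "②",
	45118: "〘", 45119: "〙",
	45120: "\\", 45121: "/",
	45863: "○",
}

// A few narrow-table mappings taken from the notes above.
var narrowTable = map[int]string{
	41847: "〻",
	41851: "\u2227",
}

// replaceGlyphs (hypothetical helper) swaps every known marker for its
// mapped character and leaves unmapped markers untouched.
func replaceGlyphs(text string) string {
	return glyphMarker.ReplaceAllStringFunc(text, func(marker string) string {
		parts := glyphMarker.FindStringSubmatch(marker)
		code, err := strconv.Atoi(parts[2])
		if err != nil {
			return marker
		}
		table := wideTable
		if parts[1] == "n" {
			table = narrowTable
		}
		if glyph, ok := table[code]; ok {
			return glyph
		}
		return marker // e.g. 45149 and 45864, which are left unmapped
	})
}
```

Leaving unmapped markers in place (rather than dropping them) keeps them easy to grep for if a later version of zero-epwing starts emitting them.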