Merging similar results #84
Comments
So first of all, I completely agree, the duplicated glossaries look pretty stupid. This is something I've been thinking about for a while, but I'm not sure what a good way to represent the information would be. As you mentioned in your first example, 毋れ and 无れ should be merged. The problem comes up where 无れ has a tag that the other result does not. Assuming that you can figure out a way to display this information in a more compact manner, the actual implementation should not be difficult. The database by design does not have the concept of grouping in it; that all happens in the |
That looks pretty good. It probably does mean that we can't properly support the Anki |
Ok, thanks for the advice! When I was re-reading your previous reply I realized that the database not having the concept of grouping could become a problem. Does it have If you're worrying about supporting markers like |
The database does not use That being said, I don't think I understand very well why this is required. If several entries have an identical glossary, couldn't grouping just happen on the glossary (or a hash of it)? For your example with 明鏡国語辞典, how would having a field from JMdict help for grouping? It's not like 明鏡国語辞典 has |
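The glossary-based grouping asked about above could look roughly like the sketch below. The entry shape (`expression`, `reading`, `glossary`) and the function name are assumptions made for illustration, not Yomichan's actual data model.

```javascript
// Hypothetical entry shape: {expression, reading, glossary: string[]}.
function groupByGlossary(entries) {
    const groups = new Map();
    for (const entry of entries) {
        // JSON.stringify keeps item boundaries, so ['a b'] and ['a', 'b'] get different keys.
        const key = JSON.stringify(entry.glossary);
        if (!groups.has(key)) {
            groups.set(key, {glossary: entry.glossary, expressions: []});
        }
        groups.get(key).expressions.push({
            expression: entry.expression,
            reading: entry.reading
        });
    }
    return [...groups.values()];
}
```

A real hash could replace the stringified key if memory is a concern. Note that this only merges entries whose glossaries are exactly identical, so it would not help across dictionaries that word their definitions differently.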
Regarding |
Sorry, the paragraph was left halfway when I started to write the next one. What I meant was that 明鏡国語辞典 doesn't have an entry for やむを得ない although JMdict claims that it's the most common form of it, but instead there are 止むを得ない and 已むを得ない. My solution would have been that when scanning やむを得ない and finding a JMdict entry, all entries with the same Another way would be to index every way of writing the word from やむをえない through 止むをえない to whatever and have them all point to the same entry, but this could waste a lot of space. Example: my past experiment with デジタル大辞泉 https://pastebin.com/raw/PbGDFrav (I was trying to find and record words with multiple readings for the same kanji). And yeah, for this particular case, you wouldn't have to generate all those, but they can be useful in similar cases where somebody decides to write some word without some kanji. |
Also, the database would have to be regenerated should the user want to delete JMdict or whatever used to group the entries at that moment (if I understand it correctly). |
I see. Having a If you want to implement using the sequence, I could easily make a small update to Yomichan-Import to include this information and give you a modified converted ZIP file with this extra field included. |
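As a rough illustration of grouping on a sequence after the rows have been retrieved, here is a minimal sketch. It assumes each definition carries a numeric `sequence` field once Yomichan-Import exports it; the field name and the handling of entries without a sequence are assumptions.

```javascript
// Hypothetical definition shape: {expression, reading, sequence?}.
function groupBySequence(definitions) {
    const bySequence = new Map();
    const ungrouped = [];
    for (const definition of definitions) {
        // Entries from dictionaries without sequence data stay as single-item groups.
        if (typeof definition.sequence !== 'number') {
            ungrouped.push([definition]);
            continue;
        }
        if (!bySequence.has(definition.sequence)) {
            bySequence.set(definition.sequence, []);
        }
        bySequence.get(definition.sequence).push(definition);
    }
    return [...bySequence.values(), ...ungrouped];
}
```

Because the grouping happens on retrieved data, the database only needs one extra column for it.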
That should solve the problem if such a sequence came with every dictionary that you wanted to group by. Just to be clear, is the database currently storing duplicates of the glossaries (I assume it is)? Changing the database format to be group-based could free up some space but would also make it less flexible, for example if you wanted to keep the current way of grouping things working (or delete the main dictionary, as said earlier). |
Yes, there is a certain degree of duplication right now in the database, but it's not too bad. I think that the database format should be kept as simple as possible and all of the complicated grouping and other operations should happen after relevant data is retrieved from the database. Making large changes to the database structure (more than simply adding a column or two) causes compatibility problems for users who have old versions of the DB imported. By keeping the underlying format as simple as possible, we can reduce the chance that future changes require changing it. |
I started working on the feature by adding some basic stuff, but I'm not sure what to do with the database format. I modified yomichan-import to add JMdict's Here's the commit (it currently assumes that JMdict is the only dictionary installed and doesn't display any definitions when result output mode is set to merged) |
Looks like a solid start. The database format probably does not have to be versioned since The only (minor) nitpick I have so far is the

    const searcher = {
        'merge': translator.findTermsMerged,
        'split': translator.findTermsSplit,
        'group': translator.findTermsGrouped
    }[options.general.resultOutputMode].bind(translator);

or maybe simply a boring switch statement |
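For comparison, the "boring switch statement" variant might look like the sketch below. The wrapper function `selectSearcher` is a name invented here; the three `findTerms*` methods are the ones named in the comment above.

```javascript
// Hypothetical helper: picks the search method for the configured output mode.
function selectSearcher(translator, mode) {
    switch (mode) {
        case 'merge':
            return translator.findTermsMerged.bind(translator);
        case 'split':
            return translator.findTermsSplit.bind(translator);
        case 'group':
            return translator.findTermsGrouped.bind(translator);
        default:
            throw new Error(`Unknown result output mode: ${mode}`);
    }
}
```

One practical difference: with the object lookup, an unknown mode fails with a `TypeError` from calling `.bind` on `undefined`, while the switch can report which mode was unexpected.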
Also I would add a version function to

    const fixups = [
        () => {},
        () => {},
        () => {},
        () => {},
        () => {
            if (options.general.audioPlayback) {
                options.general.audioSource = 'jpod101';
            } else {
                options.general.audioSource = 'disabled';
            }
        },
        () => {
            options.general.showGuide = false;
        },
        () => {
            if (options.scanning.requireShift) {
                options.scanning.modifier = 'shift';
            } else {
                options.scanning.modifier = 'none';
            }
        }
    ];

The versions run in sequence and the version number is the length of the array. The empty functions are for retired versions (I know nobody is running with them). |
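The "run in sequence, version is the array length" behaviour described above can be sketched as follows. This is an assumption-laden illustration: the `options.version` field and the choice to pass `options` as a parameter (rather than closing over it, as the array above does) are not taken from the Yomichan source.

```javascript
// Hypothetical runner: applies every fixup the stored options have not seen yet.
function applyFixups(options, fixups) {
    const startVersion = options.version || 0;
    for (let i = startVersion; i < fixups.length; ++i) {
        fixups[i](options);
    }
    // The new version number is simply the number of fixups that exist.
    options.version = fixups.length;
    return options;
}
```

Running it a second time is a no-op, because the stored version already equals the array length.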
Thanks! I'll have a look at the areas you mentioned. By the way, should the fixups array also remove the old options or is that handled automatically somewhere? Looking at the array you would imagine that the existence of EDIT: |
Yeah, it's a pretty good system to make sure that users have reasonable settings regardless of what version they are coming from and what big changes have been made on the current one. I just leave old options in place because meh. |
I haven't used Handlebars before but I thought that adding |
You are making it available for the Anki field templates not for the term template itself. Check out the |
I have to research the Anki functionality after hacking together something that works so that I don't break anything in it with my changes. Anyway, to get As for database stuff, looking up every definition with a certain sequence works, but I still need to combine them in a sane way and use the terms and readings to find related definitions from other dictionaries. Then I need to decide what to do with results that cannot be merged with the main dictionary (especially when they intersect with the main dictionary entries or do not have a sequence themselves). I should probably exclude names from this as well, and possibly merge them with each other, so that the result for 田中 would have Tanaka, Denchuu, ... and so on. |
Handlebars is kind of a pain in the ass to work with, but unfortunately there isn't anything better for templating. Even conditionals in it are very limited and difficult to work with. For your second point, I would just add a checkbox to every dictionary that is imported in settings that is called something like "Allow secondary searches". This checkbox would allow results from other dictionaries to be used to search this dictionary. |
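The per-dictionary checkbox idea could be consumed along these lines. This is only a sketch: the option names `enabled` and `allowSecondarySearches` and the shape of the dictionary settings object are hypothetical.

```javascript
// Hypothetical settings shape: {[title]: {enabled, allowSecondarySearches}}.
function dictionariesForSecondarySearch(dictionaries, mainDictionary) {
    return Object.keys(dictionaries).filter((title) =>
        title !== mainDictionary &&
        dictionaries[title].enabled &&
        dictionaries[title].allowSecondarySearches
    );
}
```

The returned titles would be the dictionaries queried with terms and readings taken from the main dictionary's results.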
Update: the feature now works partially. I still need to move some code from How do you feel about the way glossaries are merged? I think it's hacky but can't think of any other way. If you wonder what this is for, I'm going to use it to tag the glossaries with "(... only)". A similar thing is needed for tagging readings with (expressionX, expressionY only). |
I'll take a closer look at this tomorrow, but using |
You're right, |
Regarding the color coding, is light blue used for |
Yeah, light blue comes from the I haven't committed this yet because I'm still working on other areas, but these functions are related:

    // True if the tag belongs to a term (expression or reading) rather than to a definition.
    function dictIsJmdictTermTag(tag) {
        return [
            'P', 'news', 'ichi', 'spec', 'gai',
            'ik', 'iK', 'ok', 'oK', 'ek', 'eK', 'io', 'oik',
            'ateji', 'gikun'
        ].includes(tag);
    }

    // True if at least one of the tags is in the list treated as rare usage.
    function dictJmdictTermTagsRare(tags) {
        const rareTags = ['ik', 'iK', 'ok', 'oK', 'ek', 'eK', 'io', 'oik'];
        for (const tag of tags) {
            if (rareTags.includes(tag)) {
                return true;
            }
        }
        return false;
    }

About the 嗚呼 example, JMdict actually has an element called Before removing the katakana readings it's better to check that the "regular" (hiragana) ones exist. I checked and there are entries with katakana readings only (I guess they're mainly Chinese loan words): http://www.edrdg.org/jmdictdb/cgi-bin/entr.py?svc=jmdict&sid=&e=1909284 |
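The check described above (only drop katakana readings when other readings remain) might be sketched like this. The function names are made up for the example, and "katakana reading" is assumed to mean a string made up entirely of characters from the Unicode Katakana block (U+30A0 to U+30FF, which includes the long vowel mark ー).

```javascript
// True if every character is in the Unicode Katakana block.
function isKatakana(text) {
    return /^[\u30a0-\u30ff]+$/.test(text);
}

// Hypothetical filter: removes katakana readings unless that would leave no reading at all.
function dropKatakanaReadings(readings) {
    const others = readings.filter((reading) => !isKatakana(reading));
    return others.length > 0 ? others : readings;
}
```

With illustrative input `['ああ', 'アア']` the katakana form is dropped, while an entry whose only reading is `'マージャン'` keeps it.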
I would prefer not to have any explicit references to JMdict in the code; meta information about tags is already stored in the exported dictionary. I'm wondering if it would make sense to add a |
I pushed the changes to my fork again. I added the functions because I was not able to deduce from the meta information if the tag belonged to a definition or a term. Should this information be in Sure it would make sense to add |
@FooSoft |
@FooSoft I actually edited the HTML manually, but I was thinking something like this:
「ー」 can be tricky when it replaces う, though, so I must do some testing on (the) dictionary data to see if you can just treat it as う always when a hiragana reading would match. This method would filter out the (in my opinion) redundant reading 「あー」 and avoid leaving terms without readings, but in the case of JMdict just respecting I've also been thinking if the terms and definitions that match the scanned text should be highlighted somehow as well, but that can't be done right now because |
Highlighting the definitions that matched can be tricky because it's unclear what to do in cases where the connection is ambiguous. For example, imagine searching for a kana word that is normally written using Kanji: you search for はやい and get 早い and 速い. Which one should be highlighted? Both? Neither? Hard to tell. Although arguably there could be value in just doing the highlight when you know for sure. The problem is (as you mentioned) that by the time you go through all of the code in deinflection, Yomichan no longer has a good idea of what it was even searching for (other than |
This is looking really good. It does make issue #85 more apparent, but that is only because of all the space being saved on other UI. |
I'll try to start wrapping this up, so here's a list of things I still want to complete. Some of them are not related to this feature and some might be dropped (especially the ones tagged [bloat]). If you see something I forgot or something that shouldn't be done or should be done differently, please comment and I'll update this. Merged mode
Concerns merged mode
General
|
Just in case you didn't know, it's also true for 1.4.0; it's not necessarily something that happens due to this merged mode. As for くれ, in general some of the multitude of verb conjugations are not recognised by Yomichan (especially multiple conjugations stacked together, iirc). A blanket (X)(...)、(X)れ(...)、(X)ろ(...) filter showing you the dictionary entry '(X)る' with the explanation 'Base 1/2', 'Base 4', 'Base 5' respectively (the 'Base Y' info shown only when nothing better (e.g. ば form) is recognised) would solve that (as base 3 is just (X)る and た/て forms are recognised correctly). We would have full coverage, with all conjugations at least redirecting to the correct verb, even if some would have no helpful info on what the conjugation you are pointing to is called. (Still better than not redirecting to the dictionary entry at all.) Just in case you don't know what I mean by bases (since some textbooks/courses etc. just teach how real conjugations work, without going into this mostly academic (but useful for this case) stuff): http://davidbruce.tripod.com/verbchart.html EDIT: Obviously there would also need to be filters for all 9 types of godan verbs (Xす, Xぬ, etc). |
@non-e-moose You're right, it doesn't make much sense to mention the lack of support for those conjugations under this issue. I pretty much just formatted my Yomichan todo list into GitHub Markdown, and that happened to be one of the points. That verb chart seems useful, and there are indeed many ways to learn the verb conjugations. As far as I can tell, Yomichan is based on a parser and terminology similar to the rikai* addons (which in turn reminds me of Tae Kim's grammar guide, but that could be a coincidence). I learned most of my grammar by reading text with MeCab+unidic, which uses the Japanese terminology for verb forms, but translated the terms to English at some point. The English wiki article on Japanese grammar was useful for that. |
Everything sounds good @siikamiika , some thoughts:
I wonder what would be the way to utilize a frequency list for terms that can be written in various ways? Would it be per every Kanji expression? What about looking up frequency for words that are mostly written in Kana, and rarely with Kanji? Is attempting to look up a word by Kana alone even a valid thing to do? Have no idea, just throwing these out there.
The first would probably be fine. Adding an additional selection mode sounds like it would make the code quite a bit more complicated for something most people don't even care about.
Hmm, since 行く is obviously tagged properly (things like 行きます work), this is probably a bug in
How come?
This is something I've been wanting to add for a while but it's annoying in terms of correctly highlighting the correct matched text length. |
I'm not really sure about this, but if you look at a Japanese language corpus like BCCWJ, they index by lemma. It doesn't matter what the actual written form has been when the occurrence has been recorded, only the canonical form (as defined by the corpus) matters. This isn't really compatible with the JMdict way of doing things, but usually JMdict at least includes the lemma (or some of them) used by a corpus like BCCWJ within an To generate a corpus like this from raw text, you will need a morphological analyzer (unreliable) or you can do it by hand. Edit: somebody has tried it automatically here: http://wiki.wareya.moe/Frequency%20lists
This is just something I have implemented in my own project. I've found that when looking up definitions for a word that is written in kana only, the definition in kana only is usually the one I want. It wouldn't take precedence over the length of the term, i.e. you wouldn't get 「か」 first when scanning 「かった」. |
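The ordering described above (prefer kana-only entries when the scanned text is kana, but never over a longer match) can be sketched with a two-level sort. The result shape with `source` (the matched text) and `expression` fields is an assumption for this example, as are the function names.

```javascript
// True if the text consists only of hiragana/katakana (U+3040 to U+30FF).
function isKana(text) {
    return /^[\u3040-\u30ff]+$/.test(text);
}

// Hypothetical result shape: {source: matched text, expression: dictionary form}.
function sortResults(results, scannedText) {
    const preferKana = isKana(scannedText);
    return [...results].sort((a, b) => {
        // Longer matches always win, so 「か」 never outranks a match on 「かった」.
        if (a.source.length !== b.source.length) {
            return b.source.length - a.source.length;
        }
        // Among equally long matches, kana-only expressions come first.
        if (preferKana) {
            return (isKana(a.expression) ? 0 : 1) - (isKana(b.expression) ? 0 : 1);
        }
        return 0;
    });
}
```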
The tag issue in yomichan-import should be fine now. I also added support for NoKanji but didn't implement any complicated filter for readings with katakana or long vowel mark. With that out of the way, implementing compact tags was pretty simple. Here's what it looks like when enabled (in merged mode with mouse over the term 「の」 to show the tags associated with it as well): I guess #85 can be closed when this is merged @non-e-moose ? |
@siikamiika |
changes related to FooSoft/yomichan#84
@non-e-moose how do you determine where line break is used and where it is not? I see that you still have the dictionary tag for the last entry on a new line. |
@FooSoft As every definition can be expected to have a dictionary tag (and a part of speech tag if the dictionary uses them), those are currently the only types of tags compressed by the feature. If newlines should be added only in some cases, those tags could be the ones that warrant it. Seems like some JMdict definitions also have the [uk] (usually kana) tag for every definition, so maybe there could be a check that if there is a certain tag (that is not dictionary or part of speech) for every definition of a result in a certain dictionary, it can be "compressed" as well. @non-e-moose I didn't touch the bullet points yet, and I'm not sure what to do with them. Some users could want these space-saving features separately, but splitting them into at least three different options (compact tags, use newline after certain tags or tags in general, have multi-part glossaries as bullet points or on the same line, also what separator to use etc...) can become confusing. The number of options I change every time I reinstall the addon to reset it is already pretty large. But there are users who don't like it when things change and want a way to keep the look and feel the same, and I respect that. |
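The "tag present on every definition" check suggested above could be sketched as follows. Tags are modelled as plain strings here, which is a simplification; the function name and return shape are hypothetical.

```javascript
// Hypothetical definition shape: {glossary, tags: string[]}.
function extractCommonTags(definitions) {
    if (definitions.length === 0) {
        return {common: [], definitions: []};
    }
    // A tag is common if every definition of the result carries it.
    const common = definitions[0].tags.filter((tag) =>
        definitions.every((definition) => definition.tags.includes(tag))
    );
    return {
        common,
        definitions: definitions.map((definition) => ({
            ...definition,
            tags: definition.tags.filter((tag) => !common.includes(tag))
        }))
    };
}
```

The common tags would then be displayed once for the whole result, and each definition would show only what remains.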
@FooSoft |
@siikamiika that looks amazing! |
That generated card looks pretty sweet. I think color coding would be a good way to convey the same meaning without adding clutter. I wouldn't worry too much about making it possible to disable anything since advanced users can always dive into the Anki templates and do whatever they want. Novice users probably won't care too much about the default look as long as it is visually appealing. Speaking of templates, we are probably going to need to blow away existing Anki template settings if they are unchanged from the default. This has to be done so that people upgrading from previous versions of Yomichan are able to create definitions in merged mode. I'm not sure what the best course of action would be for users who customized their Anki templates; we probably don't want to override their preferences, but at the same time merged mode won't just "work" for them. |
@FooSoft Maybe the template should be written separately for each of the modes? That would require keeping track of three different versions of it but would definitely make it easier for the users. Users who have edited their template could get the old template for both grouped and split mode but a new one for merged mode. |
So while for this one case having split templates would be useful, I don't know how important it would be after the feature is deployed. I doubt that there would be huge changes that would require that the Anki card template has to be remade. I actually created the card template system in order to avoid having to add additional settings to the options page. Novice users would use the defaults while advanced users can just jump into the templates and do whatever they want (arguably at their own risk). Probably just checking to see if a user is running with the default template and if they are upgrading them is the best option. This is probably not the last time that there will be template breakage, so advanced users should get used to basic troubleshooting. The most important thing is to not break people who are running with the defaults (probably 99%+ of users). |
Another thing I thought of -- if the Anki templates are modified, just wrap them in a check to see if they are not in merged mode. Then append an else clause that does merged mode. |
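A naive version of that wrapping might look like the sketch below. It assumes the Anki templates are Handlebars and that a boolean `merge` variable is exposed to them; that variable name is an assumption, not a confirmed part of the template context.

```javascript
// Hypothetical helper: keeps the user's template for split/grouped modes
// and falls through to a merged-mode template otherwise.
function wrapTemplateForMergedMode(userTemplate, mergedTemplate) {
    return [
        '{{#if merge}}',
        mergedTemplate,
        '{{else}}',
        userTemplate,
        '{{/if}}'
    ].join('\n');
}
```

This uses `#if` with the merged branch first, which is the same check as "not in merged mode, else merged" with the branches swapped.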
Edit: https://github.com/siikamiika/yomichan/blob/b59980067a7698199d2466ecbeebc6ad5253ed02/ext/bg/js/options.js#L272 keeps the old templating for other modes even when it hasn't been edited, but I guess actually hard coding the old template for comparison would be overkill (and somebody could upgrade from an older version anyway). The only drawback is that old users don't get compact glossaries in grouped and split modes without resetting their templates first. |
Yup, that is exactly what I mean. I think for users who do not have a modified template (there has only been one version so far), it is probably better to just replace it outright without adding the conditionals. The reason behind this is that it creates a resulting template that is less cluttered and easier to customize should the user decide to do so. |
True, the template is already complicated enough with all three modes added to it so better not make it any worse for all current users. I missed that there haven't been different versions of it yet, so maybe this kind of pattern would work in the future: siikamiika@7e556e8 (the string hash function is ripped off https://stackoverflow.com/a/7616484/2444105). |
Yeah, I think checking for hashes is a good solution. Storing actual text for all previous versions would be pretty terrible. |
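The hash check could be sketched like this. The hash is the common 32-bit string hash from the Stack Overflow answer linked above; `upgradeTemplate` and the list of known hashes are hypothetical names, and the linked commit may structure this differently.

```javascript
// 32-bit string hash (hash * 31 + charCode, wrapped to a signed 32-bit integer).
function hashString(text) {
    let hash = 0;
    for (let i = 0; i < text.length; i++) {
        hash = ((hash << 5) - hash) + text.charCodeAt(i);
        hash |= 0; // force back into the 32-bit integer range
    }
    return hash;
}

// Hypothetical upgrade step: replace the template only if it matches a previous default.
function upgradeTemplate(userTemplate, knownDefaultHashes, currentDefault) {
    return knownDefaultHashes.includes(hashString(userTemplate))
        ? currentDefault
        : userTemplate;
}
```

Only the hashes of earlier default templates need to ship with the extension, not their full text. A customised template that happens to collide with a stored hash would be overwritten, which is unlikely but possible with a 32-bit hash.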
Sometimes, especially when scanning words that are written in kana only, you will get multiple results with almost the same information. This is because many words can be written with different kanji, or have historically been written with different kanji, and Yomichan groups the results by kanji+kana pairs (or kana only).
This makes sense when multiple dictionaries are used, but I believe that many use JMdict English only. In my opinion, the way JMdict groups the entries by default works very well, but you can't use that as-is in Yomichan unless JMdict is the only dictionary installed.
Some examples (I have JMdict and 明鏡国語辞典 installed):
なかれ
すばやい
For なかれ, it would be more optimal if at least 毋れ and 无れ were merged to one and it would be somehow indicated that 无れ is outdated kanji usage. All four (莫れ, 毋れ, 无れ, 勿れ), even the 明鏡国語辞典 entry, could be merged under the same result if the JMdict entry was used as the "pivot" entry, meaning that everything matching the JMdict entry's kanji and reading would be grouped together with it.
Anyway, this gets pretty complicated quickly and I don't know if these suggestions would require changes to the database format, for example. I'd like to know if this feature sounds feasible or crazy, because I want to implement it.