Skip to content
This repository has been archived by the owner on Feb 25, 2023. It is now read-only.

Merging similar results #84

Closed
siikamiika opened this issue Sep 27, 2017 · 59 comments
Closed

Merging similar results #84

siikamiika opened this issue Sep 27, 2017 · 59 comments

Comments

@siikamiika
Copy link
Collaborator

Sometimes, especially when scanning words that are written in kana only, you will get multiple results with almost the same information. This is because many words can be written with different kanji or have been written with different kanji in the history, and Yomichan groups the results by kanji+kana pairs (or kana only).

This makes sense when multiple dictionaries are used, but I believe that many use JMdict English only. In my opinion, the way JMdict groups the entries by default works very well, but you can't use that as-is in Yomichan unless JMdict is the only dictionary installed.

Some examples (I have JMdict and 明鏡国語辞典 installed):

なかれ 2017-09-27_13-37-53
すばやい 2017-09-27_13-39-09

For なかれ, it would be more optimal if at least 毋れ and 无れ were merged to one and it would be somehow indicated that 无れ is outdated kanji usage. All four (莫れ, 毋れ, 无れ, 勿れ), even the 明鏡国語辞典 entry, could be merged under the same result if the JMdict entry was used as the "pivot" entry, meaning that everything matching the JMdict entry's kanji and reading would be grouped together with it.

Anyway, this gets pretty complicated quickly and I don't know if these suggestions would require changes to the database format, for example. I'd like to know if this feature sounds feasible or crazy, because I want to implement it.

@FooSoft
Copy link
Owner

FooSoft commented Sep 27, 2017

So first of all, I completely agree, the duplicated glossaries look pretty stupid. This is something I've been thinking about for a while but I'm not sure what would be a good way to represent the information is.

As you mentioned in your first example, 毋れ and 无れ should be merged. The problem comes up where 无れ has a tag that the other result does not: oK. It therefore stands to reason that we have to have some sort of a way to show an "intersection" of the tags that the definitions have in common, as well as the differences. Maybe you have some ideas, but I haven't figured out a good way to show this visually. We should also have a way to include this information in generated cards, although I think if we make this form of grouping optional, it would be fine to have things like {expression} be a comma-separated list of strings.

Assuming that you can figure out a way to display this information in a more compact manner, the actual implementation should not be difficult. The database by design does not have the concept of grouping in it; that all happens in the Translator layer. The files you would be interested are translator.js, dictionary.js, and the terms.html template.

@siikamiika
Copy link
Collaborator Author

I have a similar project that uses JMdict and this is how I display JMdict entries:
2017-09-28_01-21-34

Adding other dictionaries could look something like this:
2017-09-28_01-25-10

@FooSoft
Copy link
Owner

FooSoft commented Sep 27, 2017

That looks pretty good. It probably does mean that we can't properly support the Anki {tags} marker anymore, but I don't think anyone uses that anyway.

@FooSoft FooSoft mentioned this issue Sep 27, 2017
@siikamiika
Copy link
Collaborator Author

siikamiika commented Sep 27, 2017

Ok, thanks for the advice! When I was re-reading your previous reply I realized that the database not having the concept of grouping could become a problem. Does it have ent_seq (http://www.edrdg.org/jmdict/jmdict_dtd_h.html) or something similar for each JMdict entry? If not, it will not be possible to display all the words and readings associated with a group when you scan 勿れ. Why this can be important is because for expression like やむを得ない there is no entry in 明鏡国語辞典 but its definition could be displayed with the group (because it has 止むを得ない and 已むを得ない).

If you're worrying about supporting markers like {tags}, do you mean that this could become the new standard instead of just a checkbox in options? In that case, should there be an option for deciding which dictionary will be the pivot dictionary? Also, some people might not have JMdict installed so better not design everything in a JMdict-centric way.

@FooSoft
Copy link
Owner

FooSoft commented Sep 28, 2017

The database does not use ent_seq from JMdict -- it does have an auto-incremented primary key for every single row. Changing the database would certainly not be very difficult since Yomichan already has the concept of database version vs. dictionary version. Would ideally have to have some sort of fallback for users with old data in IndexedDB (at least until they reinstall dictionaries).

That being said, I don't think I understand very well why this is required. If several entries have an identical glossary, couldn't grouping just happen on the glossary (or a hash of it)? For your example with 明鏡国語辞典, how would having a field from JMdict help for grouping? It's not like 明鏡国語辞典 has ent_seq from JMdict in it right?

@FooSoft
Copy link
Owner

FooSoft commented Sep 28, 2017

Regarding {tags}, I'm not very much opposed to just killing it in general.

@siikamiika
Copy link
Collaborator Author

siikamiika commented Sep 28, 2017

Sorry, the paragraph was left half way when I started to write the next one. What I meant was that 明鏡国語辞典 doesn't have an entry for やむを得ない although JMdict claims that it's the most common form of it, but instead there are 止むを得ない and 已むを得ない.

My solution would have been that when scanning やむを得ない and finding a JMdict entry, all entries with the same ent_seq would be queried and the 明鏡国語辞典 entry matching 止むを得ない and 已むを得ない would be looked up recursively.

Another way would be to index every way of writing the word from やむをえない through 止むをえない to whatever and have them all point to the same entry, but this could waste a lot of space. Example: my past experiment with デジタル大辞泉 https://pastebin.com/raw/PbGDFrav (I was trying to find record words with multiple readings for the same kanji)

And yeah, for this particular case, you wouldn't have to generate all those but they can be useful in similar cases where somebody decides to write some word without some kanji.

@siikamiika
Copy link
Collaborator Author

Also, the database would have to be regenerated should the user want to delete JMdict or whatever used to group the entries at that moment (if I understand it correctly).

@FooSoft
Copy link
Owner

FooSoft commented Sep 28, 2017

I see. Having a sequence or something similar seems like a reasonable solution to me. We would just make sure sure that not having this field would not cause problems for users who have an older version of the DB w/o this column (should be pretty easy). We definitely would not want to include tons of combinations because IndexedDB is pure garbage and it's pretty much a miracle that it even somewhat works for the amount of data Yomichan is storing in it.

If you want to implement using the sequence, I could easily make a small update to Yomichan-Import to include this information and give you a modified converted ZIP file with this extra field included.

@siikamiika
Copy link
Collaborator Author

That should solve the problem if such sequence came with every dictionary that you wanted to group by.

Just to be clear, is the database currently storing duplicates of the glossaries (I assume it is)? Changing the database format to group based could free up some space but would also make it less flexible, for example if you wanted to keep the current way of grouping things working (or delete the main dictionary as said earlier).

@FooSoft
Copy link
Owner

FooSoft commented Sep 28, 2017

Yes, there is a certain degree of duplication right now in the database, but it's not too bad. I think that the database format should be kept as simple as possible and all of the complicated grouping and other operations should happen after relevant data is retrieved from the database. Making large changes to the database structure (more than simply adding a column or two), causes compatibility problems for users who have old versions of the DB imported. That's why by keeping the underlying format as simple as possible we can reduce the chance that future changes require changing it.

@siikamiika
Copy link
Collaborator Author

I started working on the feature by adding some basic stuff, but I'm not sure what to do with the database format. I modified yomichan-import to add JMdict's Sequence to the end of the JSON list but didn't touch the format which was set to 2.

Here's the commit (it currently assumes that JMdict is the only dictionary installed and doesn't display any definitions when result output mode is set to merged)
siikamiika@b24c705

@FooSoft
Copy link
Owner

FooSoft commented Sep 29, 2017

Looks like a solid start. The database format probably does not have to be versioned since sequence will just be undefined for old dictionaries. As long as the code handles missing sequences, everything should be good.

The only (minor) knitpick I have so far is the searcher function selection code being a little bit busy. I would do something like:

const searcher = {
    'merge': translator.findTermsMerged,
    'split': translator.findTermsSplit,
    'group': translator.findTermsGrouped
}[options.general.resultOutputMode].bind(translator);

or maybe simply a boring switch statement

@FooSoft
Copy link
Owner

FooSoft commented Sep 29, 2017

Also I would add a version function to options.js (there are already a bunch of them there in that array), that upgrade people who still have groupResults instead of resultOutputMode. Specifically this array:

    const fixups = [
        () => {},
        () => {},
        () => {},
        () => {},
        () => {
            if (options.general.audioPlayback) {
                options.general.audioSource = 'jpod101';
            } else {
                options.general.audioSource = 'disabled';
            }
        },
        () => {
            options.general.showGuide = false;
        },
        () => {
            if (options.scanning.requireShift) {
                options.scanning.modifier = 'shift';
            } else {
                options.scanning.modifier = 'none';
            }
        }
    ];

The versions run in sequence and the version number is the length of the array. The empty functions are for retired versions (I know nobody is running with them).

@siikamiika
Copy link
Collaborator Author

siikamiika commented Sep 29, 2017

Thanks! I'll have a look at the areas you mentioned. By the way, should the fixups array also remove the old options or is that handled automatically somewhere? Looking at the array you would imagine that the existence of options.general.requireShift would reset the modifier to shift on every startup.

EDIT:
Oh, each of them is only run once ever. Sneaky options.version++

@FooSoft
Copy link
Owner

FooSoft commented Sep 29, 2017

Yeah, it's a pretty good system to make sure that users have reasonable settings regardless of what version they are coming from and what big changes have been made on the current one. I just leave old options in place because meh.

@siikamiika
Copy link
Collaborator Author

siikamiika commented Sep 30, 2017

I haven't used Handlebars before but I thought that adding merged here would make it available in the terms.html template. It doesn't work, but removing grouped works as I'd expect ({{#if grouped}}...{{/if}} doesn't render even when mode is set to grouped). When I enable debug information in options, merged doesn't appear under results in the popup. What could be the problem?

@FooSoft
Copy link
Owner

FooSoft commented Sep 30, 2017

You are making it available for the Anki field templates not for the term template itself. Check out the termsShow method inside display.js.

@siikamiika
Copy link
Collaborator Author

I have to research the Anki functionality after hacking together something that works so that I don't break anything in it with my changes. Anyway, to get merged available to the template I had to add merged=../merged to this line. Didn't realize that the partial got grouped from the parent object like that.

As for database stuff, looking up every definition with a certain sequence works, but I still need to combine them in a sane way and use the terms and readings to find related definitions from other dictionaries. Then I need to decide what to do with results that cannot be merged with the main dictionary (especially when they intersect with the main dictionary entries or do not have a sequence themselves). I should probably exclude names from this as well, and possibly merge them with each other, so that the result for 田中 would have Tanaka, Denchuu, ... and so on.

@FooSoft
Copy link
Owner

FooSoft commented Sep 30, 2017

Handlebars is kind of a pain in the ass to work with, but unfortunately there isn't anything better for templating. Even conditionals in it are very limited and difficult to work with.

For your second point, I would just add a checkbox to every dictionary that is imported in settings that is called something like "Allow secondary searches". This checkbox would allow results from other dictionaries to be used to search this dictionary.

@siikamiika
Copy link
Collaborator Author

Update: the feature now works partially. I still need to move some code from translator.js to dictionary.js and fix some issues (handling other dictionaries, for instance).

How do you feel about the way glossaries are merged? I think it's hacky but can't think of any other way. If you wonder what this is for, I'm going to use it to tag the glossaries with "(... only)". A similar thing is needed for tagging readings with (expressionX, expressionY only).

@FooSoft
Copy link
Owner

FooSoft commented Oct 1, 2017

I'll take a closer look at this tomorrow, but using || as a delimiter for glossaries potentially invites unexpected behavior if a dictionary actually has those characters in it. Why not just have glossaries as a normal array and then when you need to key on it use JSON.stringify to build your index?

@siikamiika
Copy link
Collaborator Author

You're right, JSON.stringify would be better for this. I did check that || doesn't appear in the current JMdict, but it's not guaranteed in the future or with other dictionaries. I'll wait for the review before expanding on these changes too much.

@siikamiika
Copy link
Collaborator Author

Regarding an idea from the review, here's the worst case for listing all expression/reading combinations:
2017-10-03_00-03-27

For comparison, with separate expressions and readings:
2017-10-03_00-09-53

Cases like the following could be made better by filtering out katakana (and long vowel marks) or listing them separately without the kanji:
2017-10-03_00-23-16

I also noticed that there might be some incorrect part of speech taggings in the database, but I need read about the format before creating an issue about that (to yomichan-import). Example:
2017-10-03_00-14-39
I don't think that the second definition should have n-adv or n-t.

@FooSoft
Copy link
Owner

FooSoft commented Oct 2, 2017

Regarding the color coding, is light blue used for P terms and light gray for arch or something? Looks pretty sexy. I'm not too concerned about the number of results, it is something that can be reduced just by removing Katakana as you mentioned. I'm not sure that removing the long vowel mark is a good thing to do since it's showing that it is a legitimate way of writing the term's reading (or writing it in Kana, I guess).

@siikamiika
Copy link
Collaborator Author

Yeah, light blue comes from the P tag and uses the same color as the tag's background. Light gray is anything in the expression or its reading that is rare. arch is actually associated with the definitions instead.

I haven't committed this yet because I'm still working on other areas, but these functions are related:

function dictIsJmdictTermTag(tag) {
    return [
        'P',
        'news',
        'ichi',
        'spec',
        'gai',
        'ik',
        'iK',
        'ok',
        'oK',
        'ek',
        'eK',
        'io',
        'oik',
        'ateji',
        'gikun'
    ].includes(tag);
}

function dictJmdictTermTagsRare(tags) {
    const rareTags = [
        'ik',
        'iK',
        'ok',
        'oK',
        'ek',
        'eK',
        'io',
        'oik'
    ];
    for (const tag of tags) {
        if (rareTags.includes(tag)) {
            return true;
        }
    }
    return false;
}

About the 嗚呼 example, JMdict actually has an element called re_nokanji that is used for the katakana readings in the entry. Even so, listing 嗚呼[ああ] and 嗚呼[あー] separately feels redundant to me if you only consider the pronunciation and the fact that you're reading a dictionary entry, but ああ vs. あー written somewhere (without the kanji) can have a different "feeling" to it.

Before removing the katakana readings it's better to check that the "regular" (hiragana) ones exist. I checked and there are entries with katakana readings only (I guess they're mainly Chinese loan words): http://www.edrdg.org/jmdictdb/cgi-bin/entr.py?svc=jmdict&sid=&e=1909284

@FooSoft
Copy link
Owner

FooSoft commented Oct 3, 2017

I would prefer not to have any explicit references to Jmdict in the code; meta information about tags is already stored in the exported dictionary. I'm wondering if it would make sense to add a score field in addition to order for tags in order to judge whether or not a term is frequent, normal, or obscure.

@siikamiika
Copy link
Collaborator Author

I pushed the changes to my fork again. I added the functions because I was not able to deduce from the meta information if the tag belonged to a definition or a term. Should this information be in Category or a new field?

Sure it would make sense to add score, I think I've seen some symbols in J-J dictionaries indicating that a term is out of fashion as well. But in case of JMdict, sometimes a term can be popular while it's tagged with oK. Some words also have only reading tagged popular but you will lose that information when you index by term+reading. (Not like that was consistent in JMdict, though. Even 擤む above has the headword tagged popular.)

@non-e-moose
Copy link

@FooSoft
I also think that removing the katakana is a bad idea, though for a different reason. It shows you that it is normal to write a word in katakana with no added meaning at all (as opposed to when katakana is used for words normally hiragana/kanji only, it adds emphasis).
Also, in the 二 example
31099538-fa69691e-a7ce-11e7-880a-8d13f83911f9
removing katakana wouldn't change anything. Not that it would be a big problem to me, but this could look really confusing to some people.
Though I admit that barring exceptions like this the current way of showing readings using furigana looks slick.

@siikamiika
Copy link
Collaborator Author

@FooSoft I actually edited the HTML manually, but I was thinking something like this:

  1. Generate all possible term+reading combinations
  2. For each term+reading combination, if it includes katakana or 「ー」 and there is another reading for the same term in plain hiragana, add the reading to a list of terms to show without kanji and remove it from the combinations

「ー」 can be tricky when it replaces う, though, so I must do some testing on (the) dictionary data to see if you can just treat it as う always when a hiragana reading would match.

This method would filter out the (in my opinion) redundant reading 「あー」 and avoid leaving terms without readings, but in the case of JMdict just respecting re_nokanji would also prevent most of the unwanted katakana readings and reduce some duplication in the database.


I've also been thinking if the terms and definitions that match the scanned text should be highlighted somehow as well, but that can't be done right now because definition.source is in the conjugated form. On the other hand, there is also the grouped mode exactly for that with no distractions (except having to look at multiple results in some cases), so it might not be a good idea.

@FooSoft
Copy link
Owner

FooSoft commented Oct 7, 2017

Highlighting the definitions that matched can be tricky because it's unclear what to do in cases where the connection is ambiguous. For example, imagine searching for a kana word that is normally written using Kanji: you search for はやい and get 早い and 速い. Which one should be highlighted? Both? Neither? Hard to tell. Although arguably there could be value in just doing the highlight when you know for sure.

The problem is (as you mentioned) that by the time you go through all of the code in deinflection, Yomichan no longer has a good idea of what it was even searching for (other than source). This is something I hope to address whenever I get around to dealing with #58 .

@siikamiika
Copy link
Collaborator Author

Ok, I'll keep the possible feature in mind. Anyway, I would vote for highlighting everything that matches what's scanned, which is all for はやい in JMdict. (Other way could be to dim everything that doesn't match.)

Meanwhile I added voice and tag support to merged mode. I don't like what happens when there are many tags, but it's manageable. A quick demo:
peek 2017-10-07 07-34

@FooSoft
Copy link
Owner

FooSoft commented Oct 7, 2017

This is looking really good. It does make issue #85 more apparent, but that is only because of all the space being saved on other UI.

@siikamiika
Copy link
Collaborator Author

siikamiika commented Oct 9, 2017

I'll try to start wrapping this up, so here's a list of things I still want to complete. Some of them are not related to this feature and some might be dropped (especially the ones tagged [bloat]).

If you see something I forgot or something that shouldn't be done or should be done differently, please comment and I'll update this.

Merged mode

  • associate popular/frequency/irregular tags with terms, rest with definitions (yomichan-import/JMdict)
  • allow secondary searches for specific dictionaries:
    (...)
    1.	勿れ (scan this)
    2.	勿れ、莫れ、毋れ、无れ <-- by sequence, already supported
    		1.[JMdict]...
    		2.(勿れ only)[明鏡国語辞典]なかれ【▼勿れ・▼莫れ】... <-- missing 莫れ
    3.	勿れ、莫れ、毋れ、无れ
    		1.[JMdict]...
    		2.(勿れ, 莫れ only)... <-- final result, look up 勿れ、莫れ、毋れ、无れ, possibly skipping 勿れ (original scan)
    

    AND

    1. やむを得ない (scan this)
    2. やむを得ない、止むを得ない、已むを得ない
      1.[JMdict]...
    3. やむを得ない、止むを得ない、已むを得ない
      1.[JMdict]...
      2.(止むを得ない, 已むを得ない only) [明鏡国語辞典]やむを‐えな・い【▼已むを得ない・▽止むを得ない】〘連語〙 <-- doesn't include the original scan
  • support innocent corpus (frequencies in general) in merged mode
  • voice hotkey Alt+P behavior: play first term or add another key for selecting the term?
  • hide superfluous readings (カタカナ and 長音符[ちょーおんぷ]) (but test with ぶち壊す) (respect re_nokanji)
  • check for regressions with Anki support and other modes; try to fit Anki to merged mode
  • [bloat] highlight matches in current entry (or dim non-matches)
  • [bloat/another feature-ish] group with the same dictionary, not others (kana scan -> group by kana; kanji scan -> group by kanji) (useful for names)

Concerns merged mode

  • [bloat] temporary access to other modes for easier flashcard creation
    in merged mode: click expression for temporary grouped mode; prompt expression when clicking on definition for split mode
    in grouped mode: click definition for split mode; click some button for grouped mode (has problems, how to select the correct sequence?)
    in split mode: click some button for grouped mode

General

  • [bug] 行ってまいります doesn't match 行く (v5k-s); also くれ doesn't match くれる (v1-s) <-- *-s problem only?
  • [bug] incorrect POS taggings (ポリ)
    when done, consider grouping tags; remember that dictionaries must be grouped for this (even when their priority is default 0)
  • sort kana only first (higher score)
  • support half width katakana (カタカナ)

@non-e-moose
Copy link

non-e-moose commented Oct 9, 2017

[bug] 行ってまいります doesn't match 行く (v5k-s)

Just in case you didn't know, it's also true for 1.4.0, it's not necessarily something that happens due to this merged mode.
As we can see from 行ってまいります, some irregular verbs are not recognised correctly, try highlighting 行って with 1.4.0. I think the other irregular verbs apart from いく are する and くる (in many conjugations) and ~っしゃる verbs changing to ~っしゃいます for the masu form. Not sure if that is a comprehensive list or if it is even worth adding them though.

As for くれ, in general some of the multitude of verb conjugations are not recognised by yomichan (especially multiple conjugations stacked together iirc). A blanket (X)(...)、(X)れ(...)、(X)ろ(...) filter showing you dictionary entry '(X)る' with explanation 'Base 1/2’ 'Base 4’, 'Base 5’ respectively (the 'base Y' info shown only when nothing better (e.g. ば form) is recognised) would solve that (as base 3 is just (X)る and た/て forms are recognised correctly). We would have full coverage, with all conjugations at least redirecting to the correct verb, even if some would have no helpful info what the conjugation you are point to is called. (Still better than not redirecting to the dictionary entry at all.)

Just in case you don't know what I mean by bases (since some textbooks/courses etc. just teach how real conjugations work, without going into this mostly academic (but useful for this case) stuff: http://davidbruce.tripod.com/verbchart.html

EDIT: Obviously there would also need to be filters for all 9 types of godan verbs (Xす, Xぬ, etc).

@siikamiika
Copy link
Collaborator Author

@non-e-moose You're right, it doesn't make much sense to mention the lack of support for those conjugations under this issue. I just pretty much formatted my Yomichan todo list into Github markdown, and that happened to be one of the points.

That verb chart seems useful, and there are indeed many ways to learn the verb conjugations. As far as I can tell, Yomichan is based on a parser and terminology similar to rikai* addons (which in turn reminds me of Tae Kim's grammar guide but that could be a coincidence). I learned most of my grammar reading text using MeCab+unidic which uses the Japanese terminology for verb forms but translated them to English at some point. The English wiki article on Japanese grammar was useful for that.

@FooSoft
Copy link
Owner

FooSoft commented Oct 9, 2017

Everything sounds good @siikamiika , some thoughts:

support innocent corpus (frequencies in general) in merged mode

I wonder what would be the way to utilize a frequency list for terms that can be written in various ways? Would it be per every Kanji expression? What about looking up frequency for words that are mostly written in Kana, and rarely with Kanji? Is attempting to look up a word by Kana alone even a valid thing to do? Have no idea, just throwing these out there.

voice hotkey Alt+P behavior: play first term or add another key for selecting the term?

The first would probably be fine. Adding an additional selection mode sounds like it would make the code quite a bit more complicated for something most people don't even care about.

[bug] 行ってまいります doesn't match 行く (v5k-s); also くれ doesn't match くれる (v1-s) <-- *-s problem only?

Hmm, since 行く is obviously tagged properly (things like 行きます works), this is probably a bug in deinflector.js where it does not properly descend deinflection tree in some cases.

sort kana only first (higher score)

How come?

support half width katakana (カタカナ)

This is something I've been wanting to add for a while but it's annoying in terms of correctly highlighting the correct matched text length.

@siikamiika
Copy link
Collaborator Author

siikamiika commented Oct 9, 2017

I wonder what would be the way to utilize a frequency list for terms that can be written in various ways?

I'm not really sure about this, but if you look at a Japanese language corpus like BCCWJ, they index by lemma. It doesn't matter what the actual written form has been when the occurrence has been recorded, only the canonical form (as defined by the corpus) matters. This isn't really compatible with the JMdict way of doing things, but usually JMdict at least includes the lemma (or some of them) used by a corpus like BCCWJ within an <entry>. Some examples:
2017-10-10_02-21-40
2017-10-10_02-22-19
2017-10-10_02-40-50

To generate a corpus like this from raw text, you will need a morphological analyzer (unreliable) or you can do it by hand.

Edit: somebody has tried it automatically here: http://wiki.wareya.moe/Frequency%20lists

How come?

This is just something I have implemented in my own project. I've found that when looking up definitions for a word that is written in kana only, the definition in kana only is usually the one I want. It wouldn't take precedence over the length of the term, i.e. you wouldn't get 「か」 first when scanning 「かった」.

@siikamiika
Copy link
Collaborator Author

The tag issue in yomichan-import should be fine now. I also added support for NoKanji but didn't implement any complicated filter for readings with katakana or long vowel mark.

With that out of the way, implementing compact tags was pretty simple. Here's what it looks like when enabled (in merged mode with mouse over the term 「の」 to show the tags associated with it as well):
2017-10-12_10-03-57

I guess #85 can be closed when this is merged @non-e-moose ?

@non-e-moose
Copy link

non-e-moose commented Oct 13, 2017

@siikamiika
Yeah, it's exactly what I wanted.
Thanks a lot.
EDIT: Out of curiosity, did you also change bullet points for | or whatever else (to do away with the unnecessary line breaks)?
EDIT2: I'm sure you already hate me, but what do you think about not using a ine break after the tags?
Something like:
yomi

FooSoft added a commit to FooSoft/yomichan-import that referenced this issue Oct 13, 2017
@FooSoft
Copy link
Owner

FooSoft commented Oct 13, 2017

@non-e-moose how do you determine where line break is used and where it is not? I see that you still have the dictionary tag for the last entry on a new line.

@siikamiika
Copy link
Collaborator Author

@FooSoft As every definition can be expected to have a dictionary tag (and a part of speech tag if the dictionary uses them), those are currently the only types tags compressed by the feature. If newlines should be added only in some cases, those tags could be the ones that warrant it.

Seems like some JMdict definitions also have the [uk] (usually kana) tag for every definition, so maybe there could be a check that if there is a certain tag (that is not dictionary or part of speech) for every definition of a result in a certain dictionary, it can be "compressed" as well.

@non-e-moose I didn't touch the bullet points yet, but I'm not sure what to do with them. Some users could want these space-saving features separately but splitting them into at least three different options (compact tags, use newline after certain tags or tags in general, have multi-part glossaries as bullet points or on the same line, also what separator to use etc...) can become confusing. The amount of options I change every time I reinstall the addon to reset it is already pretty large. But there are users who don't like if things change and want a way to keep the look and feel the same, and I respect that.

@siikamiika
Copy link
Collaborator Author

Update: secondary searches based on main dictionary entry and the rest of the space-saving features are now added. A screenshot to demonstrate both of them:
2017-10-15_10-08-58
There is a checkbox for "Compact glossaries" that both makes lists become horizontal and removes the line break after tags.

They were initially separate checkboxes but I removed the line break one after realizing that having compact glossaries off forced line breaks after tags. At least it's simpler this way.

Anki and regression testing next, I'll drop the rest of the features listed above (for now at least). Before them I want to take a look at other areas such as sorting, but I might need a break somewhere in between 😄 Anyway, it's been very educational to contribute and get code reviewed, thank you!

@non-e-moose
Copy link

non-e-moose commented Oct 15, 2017

@FooSoft
You are right, it was inconsistent, kinda thought that new dictionary could get a line break, but now I think what siikamiika did is better.
Also, it's completely unrelated, but I wanted to tell you that lately yomichan has gained a lot of traction in some language learning circles.
@siikamiika
Very good that you added the option not to use it, that way people already used to how it's now will also be satisfied. Anyway thanks, I really think that it is an important feature, I've seen people bitching about it (though why would they bitch about it on some forum instead of trying to ask for the feature is beyond me). For me it's very important because I use a bigass font, so with the line breaks barely anything fits on my screen.

@FooSoft
Copy link
Owner

FooSoft commented Oct 15, 2017

@siikamiika that looks amazing!

@siikamiika
Copy link
Collaborator Author

siikamiika commented Oct 17, 2017

Anki should work in merged mode now. Compact glossaries also change the note layout. The usual screenshot:
2017-10-17_11-20-48

One thing to note is that there is currently no way to get termTags such as [P] or [news] to the flashcards. I wonder what would be a good way to support them. Should they be after terms as text or should they use the same color coding instead, and should it be possible to disable them somehow?

@FooSoft
Copy link
Owner

FooSoft commented Oct 17, 2017

That generated card looks pretty sweet. I think color coding would be a good way to convey the same meaning without adding clutter. I wouldn't worry too much about making it possible to disable anything since advanced users can always dive into the Anki templates and do whatever they want. Novice users probably won't care too much about the default look as long as it is visually appealing.

Speaking of templates, we are probably going to need to blow away existing Anki template settings if they are unchanged from the default. This has to be done so that people upgrading from previous versions of Yomichan are able to create definitions in merged mode. I'm not sure what the best course of action would be for users who customized their Anki templates; we probably don't want to override their preferences, but at the same time merged mode won't just "work" for them.

@siikamiika
Copy link
Collaborator Author

@FooSoft Maybe the template should be written separately for each of the modes? That would require keeping track of three different versions of it but would definitely make it easier for the users. Users who have edited their template could get the old template for both grouped and split mode but a new one for merged mode.

@FooSoft
Copy link
Owner

FooSoft commented Oct 18, 2017

So while for this one case having split templates would be useful, I don't know how important it would be after the feature is deployed. I doubt that there would be huge changes that would require that the Anki card template has to be remade.

I actually created the card template system in order to avoid having to add additional settings to the options page. Novice users would use the defaults while advanced users can just jump into the templates and do whatever they want (arguably at their own risk).

Probably just checking to see if a user is running with the default template and if they are upgrading them is the best option. This is probably not the last time that there will be template breakage, so advanced users should get used to basic troubleshooting. The most important thing is to not break people who are running with the defaults (probably 99%+ of users).

@FooSoft
Copy link
Owner

FooSoft commented Oct 18, 2017

Another thing I thought of -- if the Anki templates are modified, just wrap them in a check to see if they are not in merged mode. Then append an else clause that does merged mode.

@siikamiika
Copy link
Collaborator Author

siikamiika commented Oct 18, 2017

So assuming Handlebars lets you wrap the partials ({{#*inline "..."}}) to {{if}}...{{else}}, (tested and it works) do you mean something like {{if merge}} + newTemplate + {{else}} + oldModifiedTemplate + {{/if}}?

Edit: https://github.com/siikamiika/yomichan/blob/b59980067a7698199d2466ecbeebc6ad5253ed02/ext/bg/js/options.js#L272 keeps the old templating for other modes even when it hasn't been edited, but I guess actually hard coding the old template for comparison would be overkill (and somebody could upgrade from an older version anyway). The only drawback is that old users don't get compact glossaries to grouped and split modes without resetting their templates first.

@FooSoft
Copy link
Owner

FooSoft commented Oct 18, 2017

Yup, that is exactly what I mean.

I think for users who do not have a modified template (there has only been one version so far), it is probably better to just replace it outright without adding the conditionals. The reason behind this is that it creates a resulting template that is less cluttered and easier to customize should the user decide to do so.

@siikamiika
Copy link
Collaborator Author

True, the template is already complicated enough with all three modes added to it so better not make it any worse for all current users. I missed that there haven't been different versions of it yet, so maybe this kind of pattern would work in the future: siikamiika@7e556e8 (the string hash function is ripped off https://stackoverflow.com/a/7616484/2444105).

@FooSoft
Copy link
Owner

FooSoft commented Oct 18, 2017

Yeah, I think checking for hashes is a good solution. Storing actual text for all previous versions would be pretty terrible.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants