This repository has been archived by the owner on Feb 25, 2023. It is now read-only.

changes related to FooSoft/yomichan#84 #11

Merged
merged 11 commits into from
Oct 13, 2017

Conversation

siikamiika
Contributor

No description provided.

@@ -102,6 +103,7 @@ func (terms dbTermList) crush() dbRecordList {
strings.Join(t.Rules, " "),
t.Score,
t.Glossary,
t.Sequence,
Owner

I don't know off-hand if the sequence values for JMDict are 0-based or 1-based, but if they are 0-based then this should probably be set to t.Sequence + 1. The reason for this is so that a sequence value of 0 can become a sentinel for "no sequence defined".

Contributor Author

Seems like they start at 1000000, but I don't know how it's determined. In Yomichan, the sentinel value is currently set to -1.

Owner

Using -1 works as well, but in that case you'll have to make sure that all the other dictionaries that do not have the concept of Sequence have it initialized to that value.

Owner

Yeah, thinking about it more, -1 is probably the most explicit choice (since it's obviously an invalid sequence). You'll just have to make sure that the EPWING and ENAMDICT parsers explicitly set this value (since the default of 0 would be a valid value).
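
The zero-value concern can be shown with a minimal sketch. It assumes a cut-down `dbTerm` with only two fields; `noSequence`, `newTerm`, and `hasSequence` are hypothetical names used for illustration and are not taken from yomichan-import.

```go
package main

import "fmt"

// noSequence is a hypothetical sentinel meaning "no sequence defined".
const noSequence = -1

// dbTerm is a reduced stand-in for the struct in common.go.
type dbTerm struct {
	Expression string
	Sequence   int
}

// newTerm is a hypothetical helper for parsers whose source has no sequence
// numbers: it sets the sentinel explicitly instead of relying on Go's zero value.
func newTerm(expression string) dbTerm {
	return dbTerm{Expression: expression, Sequence: noSequence}
}

// hasSequence reports whether the term carries a real sequence number.
func hasSequence(t dbTerm) bool {
	return t.Sequence != noSequence
}

func main() {
	implicit := dbTerm{Expression: "猫"} // Sequence silently defaults to 0
	explicit := newTerm("猫")            // Sequence is -1
	fmt.Println(hasSequence(implicit), hasSequence(explicit))
}
```

A term built without the helper looks as if it had sequence 0, which is why every parser lacking sequence data would need to set the sentinel itself.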

common.go Outdated
if len(term.TermTags) == 0 {
return tags
} else {
return tags + "\t" + strings.Join(term.TermTags, " ")
Owner

Is the idea that tags and termTags are separated by a tab and then the extension decides which is which at runtime? I think it would be cleaner to add an actual field to the export (similar to Sequence). The output of yomichan-import should be ready for consumption by the extension and there should not be any additional parsing taking place after that.

Contributor Author

Yes, I chose this approach to make backwards compatibility with dictionaries exported by older versions of yomichan-import easier. This is what's done in Yomichan: siikamiika/yomichan@4fb983a. If a new field for termTags is added, should it come after tags or at the end like sequence?

Owner

Yeah, adding a field at the end will allow us to keep compatibility since older versions of Yomichan will just ignore it when destructuring the array.
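
A rough sketch of why a trailing field keeps older readers working. The record layout here is simplified and assumed for illustration (the real `crush` also emits rules, score, and glossary), and `readOld` is a hypothetical stand-in for an older consumer that only looks at the positions it knows.

```go
package main

import (
	"fmt"
	"strings"
)

// dbTerm is a reduced stand-in; the positional layout below is assumed.
type dbTerm struct {
	Expression string
	Reading    string
	Tags       []string
	Sequence   int
	TermTags   []string
}

// crush flattens a term into a positional record. TermTags goes last, so
// the positions of the existing fields do not change.
func crush(t dbTerm) []interface{} {
	return []interface{}{
		t.Expression,
		t.Reading,
		strings.Join(t.Tags, " "),
		t.Sequence,
		strings.Join(t.TermTags, " "), // new trailing field
	}
}

// readOld mimics an older consumer: it reads the first four positions and
// never looks at anything after them.
func readOld(record []interface{}) (expression, reading, tags string, sequence int) {
	return record[0].(string), record[1].(string), record[2].(string), record[3].(int)
}

func main() {
	record := crush(dbTerm{
		Expression: "食べる",
		Reading:    "たべる",
		Tags:       []string{"vt"},
		Sequence:   1000000,
		TermTags:   []string{"P"},
	})
	expression, reading, tags, sequence := readOld(record)
	fmt.Println(expression, reading, tags, sequence, len(record))
}
```

Had the new field been inserted right after the tags instead, every later position would shift and the old reader would pick up the wrong values.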

edict.go Outdated
term.Score += 100
case "P":
term.Score += 500
case "arch", "iK":
case "iK", "ik":
Owner

Hah, there are both iK and ik? Do they mean different things? Oh crazy EDICT.

edict.go Outdated
"v2r-s", "v2s-s", "v2t-k", "v2t-s", "v2w-s", "v2y-k", "v2y-s", "v2z-s", "v4b", "v4h", "v4k", "v4m", "v4r", "v4s", "v4t", "v5aru",
"v5b", "v5g", "v5k", "v5k-s", "v5m", "v5n", "v5r-i", "v5r", "v5s", "v5t", "v5u", "v5u-s", "vi", "vk", "vn", "vr", "vs-c", "vs-i",
"vs", "vs-s", "vt", "vz":
tag.Category = "pos"
Owner

What exactly does pos mean? Is it an abbreviation for something? Prefer the full name since it makes it easier to understand.

Contributor Author

There are currently no categories that have multiple words, so should I use partOfSpeech, part-of-speech, or something else? Anyway, I think it must be compatible with CSS classes, so it can't have spaces.

Owner

Yeah, having spaces would definitely not be good. I think partOfSpeech would be the most consistent way to name this.
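
A small sketch of the naming constraint, under stated assumptions: `categoryFor` covers only a handful of the tags from the diff, and `isClassSafe` is a hypothetical check that a category name works as a single CSS class token. How the extension actually builds class names is not shown in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// categoryFor maps a small, illustrative subset of EDICT tags to a
// category name; unknown tags get an empty category.
func categoryFor(tag string) string {
	switch tag {
	case "v5b", "v5g", "v5k", "vi", "vt":
		return "partOfSpeech"
	default:
		return ""
	}
}

// isClassSafe reports whether name can serve as one CSS class token:
// it must be non-empty and contain no whitespace.
func isClassSafe(name string) bool {
	return name != "" && !strings.ContainsAny(name, " \t\n\r\f")
}

func main() {
	fmt.Println(categoryFor("vt"), isClassSafe(categoryFor("vt")))
	fmt.Println(isClassSafe("part of speech"))
}
```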

common.go Outdated
@@ -77,6 +77,7 @@ type dbTerm struct {
Expression string
Reading string
Tags []string
Owner

At the point where we have termTags, it would probably make sense to rename tags to be something that more specifically reflects what they are tagging.

@siikamiika
Contributor Author

Instead of hard-coding Sequence = -1 for everything, I added a sequence counter to epwing.go that increments every time extractor.extractTerms is called. JMnedict had Sequence already. The rikai solution isn't perfect, but I'm not familiar enough with the format to say if it's possible to infer the original "sequence" from it (unless you hash by glossary or something).
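
The counter idea might look roughly like the sketch below. Everything here is assumed for illustration: the `extractor` type, the `extractTerms` signature, the `exportEntries` loop, and the starting value of 0 are not taken from epwing.go.

```go
package main

import "fmt"

// dbTerm is a reduced stand-in for the struct in common.go.
type dbTerm struct {
	Expression string
	Sequence   int
}

// extractor is a hypothetical stand-in for an EPWING extractor.
type extractor struct{}

// extractTerms pretends that one dictionary entry yields one term per
// heading variant, all tagged with the sequence number passed in.
func (extractor) extractTerms(headings []string, sequence int) []dbTerm {
	terms := make([]dbTerm, 0, len(headings))
	for _, h := range headings {
		terms = append(terms, dbTerm{Expression: h, Sequence: sequence})
	}
	return terms
}

// exportEntries bumps the counter once per extractTerms call, so terms
// that come from the same entry share a sequence number.
func exportEntries(entries [][]string) []dbTerm {
	var (
		ex       extractor
		terms    []dbTerm
		sequence int // starting value is an assumption
	)
	for _, headings := range entries {
		terms = append(terms, ex.extractTerms(headings, sequence)...)
		sequence++
	}
	return terms
}

func main() {
	for _, t := range exportEntries([][]string{{"見る", "観る"}, {"猫"}}) {
		fmt.Println(t.Expression, t.Sequence)
	}
}
```

The numbers only group terms within one export; unlike JMdict sequence values, they carry no meaning across dictionaries.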

@FooSoft
Owner

FooSoft commented Oct 13, 2017

Looks good, thanks for the updates!

@FooSoft FooSoft merged commit cc4140f into FooSoft:dev Oct 13, 2017