Ambiguous multiword expressions with ambiguous tokenisation
Seems to work – represented within lexc now; hfst-tokenise also
supports forms on the analyses now
Ambiguous multiword expressions need reorganising after CG
The module cg-mwesplit takes wordforms from readings and turns them into
new cohorts
Unknown words
The set-difference method only works for words without
flag diacritics (even though we should be working only on the form-side?)
and leads to binary blow-up: With only lower unknowns, we get 45M;
lower+upper gives 67M, while no unknowns gives 27M
Fixed instead by treating empty analyses as unknown-tokens in
hfst-tokenise, and outputting unmatched strings with a prefix
Treat input that's within superblanks as unmatched
probably requires a change in hfst-tokenise itself
Try >1 space for ambiguous MWEs? – represented within lexc now
Try set-difference-unknowns method with regular hfst commands?
Moved here from top of gramcheck tokeniser header.
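To illustrate the cg-mwesplit step mentioned above (wordforms on readings become new cohorts), here is a rough Python sketch. It is a simplified model, not the actual implementation: the `Reading` and `Cohort` classes, the `mwe_split` function, and the rule that a cohort is only split when exactly one reading remains are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Reading:
    """One part of an analysis (hypothetical model, not the CG stream format)."""
    lemma: str
    tags: List[str]
    wordform: Optional[str] = None  # set when the analysis carries its own form


@dataclass
class Cohort:
    """A surface wordform plus its readings; each reading is a chain of parts."""
    wordform: str
    readings: List[List[Reading]]


def mwe_split(cohort: Cohort) -> List[Cohort]:
    """Turn a disambiguated multiword cohort into one cohort per part.

    Assumption: splitting only happens when a single reading is left and
    every part of that reading carries its own wordform.
    """
    if len(cohort.readings) != 1:
        return [cohort]  # still ambiguous, leave untouched
    parts = cohort.readings[0]
    if len(parts) < 2 or any(p.wordform is None for p in parts):
        return [cohort]  # nothing to split on
    # Each part becomes a new cohort with one plain reading.
    return [Cohort(p.wordform, [[Reading(p.lemma, p.tags)]]) for p in parts]


# Made-up example input: a two-token expression analysed as two separate words.
cohort = Cohort(
    "ab cd",
    [[Reading("ab", ["N"], "ab"), Reading("cd", ["V"], "cd")]],
)
for new_cohort in mwe_split(cohort):
    print(new_cohort.wordform, new_cohort.readings[0][0].tags)
```

In this model, a cohort that still has several readings after CG is passed through unchanged, so the split never discards an analysis.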
@unhammer, @lynnda-hill - FYI