Status for pmatch-based analysis/tokenisation #52

snomos · 2022-02-09T13:37:44Z

Issues:

Ambiguous input
- Seems to work fine
Ambiguous multiword expressions with ambiguous tokenisation
- Seems to work – represented within lexc now; hfst-tokenise also
  supports forms on the analyses now
Ambiguous multiword expressions need reorganising after CG
- The module cg-mwesplit takes wordforms from readings and turns them into
  new cohorts
Unknown words
- The set-difference method only works for words without
  flag diacritics (even though we should be working only on the form side?)
  and blows up the binary: with only lower unknowns we get 45M,
  with lower+upper 67M, and with no unknowns 27M
- Fixed instead by treating empty analyses as unknown-tokens in
  hfst-tokenise, and outputting unmatched strings with a prefix
Treat input that's within superblanks as unmatched
- probably requires a change in hfst-tokenise itself
Try >1 space for ambiguous MWEs? – represented within lexc now
Try set-difference-unknowns method with regular hfst commands?
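
A toy illustration of the "Unknown words" fix above, in Python rather than hfst-tokenise itself. This is only a sketch of the idea, not how hfst-tokenise is implemented: the dict lexicon, the `\w+` fallback pattern, the `tokenise` function and the `UNMATCHED:` prefix are all made up for the example (the actual prefix is not stated here). Matches with an empty analysis list are treated as unknown tokens, and anything that nothing matches is output with a prefix.

```python
import re
from typing import Dict, List, Tuple

UNMATCHED_PREFIX = "UNMATCHED:"   # hypothetical; the real prefix is not given in this issue
WORD = re.compile(r"\w+")         # hypothetical fallback pattern for word-like strings

Token = Tuple[str, str, List[str]]  # (kind, form, analyses)


def tokenise(text: str, lexicon: Dict[str, List[str]]) -> List[Token]:
    """Greedy longest-match tokenisation over a toy lexicon."""
    out: List[Token] = []
    pending = ""  # run of characters that nothing has matched so far

    def flush() -> None:
        nonlocal pending
        if pending:
            out.append(("unmatched", UNMATCHED_PREFIX + pending, []))
            pending = ""

    i = 0
    while i < len(text):
        # Longest lexicon entry starting at position i (may contain spaces, i.e. an MWE).
        form = next(
            (text[i:j] for j in range(len(text), i, -1) if text[i:j] in lexicon),
            None,
        )
        if form is not None:
            analyses = lexicon[form]
        else:
            # Fallback: a word-like string that matches but gets no analyses.
            m = WORD.match(text, i)
            form = m.group() if m else None
            analyses = []
        if form is not None:
            flush()
            # An empty analysis list means "unknown token".
            out.append(("known" if analyses else "unknown", form, analyses))
            i += len(form)
        elif text[i].isspace():
            flush()
            i += 1
        else:
            pending += text[i]
            i += 1
    flush()
    return out


lexicon = {
    "in": ["in+Pr"],
    "in spite of": ["in spite of+Pr"],
    "rain": ["rain+N", "rain+V"],
}
result = tokenise("in spite of blorp rain §", lexicon)
for kind, form, analyses in result:
    print(kind, form, analyses)
```

Unlike the set-difference method, nothing extra is compiled into the lexicon here; unknowns are decided at tokenisation time, which is presumably why the binary does not grow.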

Moved here from top of gramcheck tokeniser header.

unhammer changed the title ~~Status for libdivvun dev~~ Status for pmatch-based analysis/tokenisation Feb 10, 2022
