Added basic look-alike correction for mixed Latin/Cyrillic tokens.

@uhermjakob uhermjakob released this 30 Nov 07:56
· 127 commits to master since this release

Version 0.6.1 corrects look-alike characters in mixed-script tokens, i.e. Latin and Cyrillic letters that look identical but have different Unicode code points, e.g.

  • aйды -> айды (mixed Cyrillic/Latin to pure Cyrillic)
  • жəне -> және (mixed Cyrillic/Latin to pure Cyrillic)
  • Austіn -> Austin (mixed Latin/Cyrillic to pure Latin)
  • хабарлайдыTengrinews.kzтілшісі -> хабарлайды Tengrinews.kz тілшісі (split mixed Cyrillic/Latin token)
  • http://kokshetau.akmo.gov.kz/және (deliberately keeping mixed Cyrillic/Latin URL unchanged)
  • Мінбер.kz (deliberately keeping mixed Cyrillic/Latin URL unchanged)
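
The idea behind the first three examples can be illustrated with a minimal Python sketch. This is not the tool's actual implementation: the look-alike table, the function names `script_of` and `fix_lookalikes`, and the crude URL/domain guard are all assumptions made for illustration. The sketch converts a mixed token to a single script only when every character of the minority script has a known look-alike in the other script and the direction is unambiguous; it does not attempt the token splitting shown in the fourth example.

```python
import unicodedata

# Hypothetical, deliberately small look-alike table (Latin -> Cyrillic).
# A real table would be larger and would also cover uppercase letters.
LATIN_TO_CYRILLIC = {
    "\u0061": "\u0430",  # a -> а
    "\u0063": "\u0441",  # c -> с
    "\u0065": "\u0435",  # e -> е
    "\u0069": "\u0456",  # i -> і
    "\u006f": "\u043e",  # o -> о
    "\u0070": "\u0440",  # p -> р
    "\u0078": "\u0445",  # x -> х
    "\u0079": "\u0443",  # y -> у
    "\u0259": "\u04d9",  # ə -> ә (Latin schwa -> Cyrillic schwa)
}
CYRILLIC_TO_LATIN = {cyr: lat for lat, cyr in LATIN_TO_CYRILLIC.items()}


def script_of(ch):
    """Return 'Latin', 'Cyrillic', or None, based on the Unicode character name."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    if name.startswith("LATIN"):
        return "Latin"
    if name.startswith("CYRILLIC"):
        return "Cyrillic"
    return None


def fix_lookalikes(token):
    """Normalize a mixed Latin/Cyrillic token to one script where that is safe."""
    # Crude guard (an assumption): leave URLs and domain-like tokens unchanged.
    if "://" in token or "." in token.strip("."):
        return token
    latin = [ch for ch in token if script_of(ch) == "Latin"]
    cyrillic = [ch for ch in token if script_of(ch) == "Cyrillic"]
    if not latin or not cyrillic:
        return token  # not a mixed-script token
    can_be_cyrillic = all(ch in LATIN_TO_CYRILLIC for ch in latin)
    can_be_latin = all(ch in CYRILLIC_TO_LATIN for ch in cyrillic)
    if can_be_cyrillic and not can_be_latin:
        return "".join(LATIN_TO_CYRILLIC.get(ch, ch) for ch in token)
    if can_be_latin and not can_be_cyrillic:
        return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in token)
    return token  # ambiguous or not correctable with this table


if __name__ == "__main__":
    for tok in ["a\u0439\u0434\u044b", "\u0436\u0259\u043d\u0435", "Aust\u0456n"]:
        print(tok, "->", fix_lookalikes(tok))
```

Under these assumptions, a token whose characters cannot all be mapped in one direction is returned unchanged, which is the conservative choice when the intended script cannot be determined from the token alone.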

About 4% of mixed Cyrillic/Latin tokens remain unprocessed. Some of these are at least theoretically correctable; others would require a language-specific language model. Examples:

  • Даkar
  • kүnge
  • Күнделік.кz
  • \r\nгаздандыру