Added basic look-alike correction for mixed Latin/Cyrillic tokens.
Version 0.6.1 corrects look-alike characters, i.e. letters that look identical in Latin and Cyrillic but have different Unicode code points, e.g.
- aйды -> айды (mixed Cyrillic/Latin to pure Cyrillic)
- жəне -> және (mixed Cyrillic/Latin to pure Cyrillic)
- Austіn -> Austin (mixed Latin/Cyrillic to pure Latin)
- хабарлайдыTengrinews.kzтілшісі -> хабарлайды Tengrinews.kz тілшісі (split mixed Cyrillic/Latin token)
- http://kokshetau.akmo.gov.kz/және (deliberately keeping mixed Cyrillic/Latin URL unchanged)
- Мінбер.kz (deliberately keeping mixed Cyrillic/Latin URL unchanged)
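
The look-alike corrections listed above can be illustrated with a minimal sketch. It is not the code shipped in 0.6.1: the function names, the small look-alike table, the majority-script rule, and the crude URL check are all assumptions made for illustration. Token splitting (the Tengrinews.kz example) is not covered.

```python
import unicodedata

# Hypothetical, deliberately incomplete look-alike table (Latin -> Cyrillic).
LATIN_TO_CYRILLIC = {
    "a": "\u0430",
    "c": "\u0441",
    "e": "\u0435",
    "i": "\u0456",
    "o": "\u043e",
    "p": "\u0440",
    "x": "\u0445",
    "y": "\u0443",
    "\u0259": "\u04d9",  # Latin schwa -> Cyrillic schwa
}
CYRILLIC_TO_LATIN = {cyr: lat for lat, cyr in LATIN_TO_CYRILLIC.items()}


def script_of(ch):
    """Return 'LATIN', 'CYRILLIC', or None, based on the Unicode character name."""
    name = unicodedata.name(ch, "")
    for script in ("LATIN", "CYRILLIC"):
        if name.startswith(script):
            return script
    return None


def fix_lookalikes(token):
    """Convert a mixed-script token to its majority script.

    The token is returned unchanged if it is not mixed, if it looks like a
    URL or domain name, or if some minority-script character has no entry
    in the look-alike table.
    """
    # Assumption: anything containing "://" or "." is a URL or domain name.
    if "://" in token or "." in token:
        return token

    scripts = [script_of(ch) for ch in token]
    n_latin = scripts.count("LATIN")
    n_cyrillic = scripts.count("CYRILLIC")
    if n_latin == 0 or n_cyrillic == 0:
        return token  # single-script token, nothing to do

    # Assumption: the majority script is the intended one; ties go to Cyrillic.
    if n_cyrillic >= n_latin:
        table, minority = LATIN_TO_CYRILLIC, "LATIN"
    else:
        table, minority = CYRILLIC_TO_LATIN, "CYRILLIC"

    out = []
    for ch, script in zip(token, scripts):
        if script == minority:
            if ch not in table:
                return token  # not fully convertible, leave as-is
            out.append(table[ch])
        else:
            out.append(ch)
    return "".join(out)
```

With this table, `fix_lookalikes("Aust\u0456n")` returns `"Austin"`, while a token containing `.` or `://` is passed through untouched. A real implementation would need a much larger table, including uppercase letters.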
About 4% of mixed Cyrillic/Latin tokens remain unprocessed. Some of these are at least theoretically correctable; others would require a language-specific language model. Examples:
- Даkar
- kүnge
- Күнделік.кz
- \r\nгаздандыру
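
A share of residual mixed tokens like the one quoted above could be estimated along the following lines. This is only a sketch and not necessarily how the figure was computed: `is_mixed` and `residual_share` are hypothetical helpers, and the character classes are an assumption (basic Latin letters plus the Latin schwa, and the main Cyrillic block U+0400 to U+04FF).

```python
import re

# Assumed character classes; extend as needed for other alphabets.
LATIN = re.compile(r"[A-Za-z\u0259]")
CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def is_mixed(token):
    """True if the token contains both Latin and Cyrillic letters."""
    return bool(LATIN.search(token)) and bool(CYRILLIC.search(token))


def residual_share(pairs):
    """Fraction of originally mixed tokens that are still mixed after correction.

    `pairs` is an iterable of (original, corrected) strings. A corrected
    string may contain spaces if the original token was split; it counts as
    still mixed if any of its whitespace-separated pieces is mixed.
    """
    mixed_before = 0
    mixed_after = 0
    for original, corrected in pairs:
        if not is_mixed(original):
            continue
        mixed_before += 1
        if any(is_mixed(piece) for piece in corrected.split()):
            mixed_after += 1
    return mixed_after / mixed_before if mixed_before else 0.0
```

Note that under this definition, URLs and domain names that are deliberately kept unchanged also count as still mixed, so they would have to be filtered out first if they should not contribute to the share.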