Releases: uhermjakob/wildebeest
New: wb_analysis.py
Added new wb_analysis.py
- In Python, inspired by old Perl script wb-analysis.pl, but with many improvements:
- Uses Unicode resources (more coverage, easily adopts new Unicode versions)
- More general handling of tokens with multiple scripts
- Offers Python function interface, access to wb.analysis result data structure
Restructured wb_normalize.py
- New default is a much more moderate normalization that focuses on character repair and UTF-8 encoding normalization. It is generally suitable for applications that largely need to preserve the original text.
- The old default of "apply all normalization steps" is available through the option --all. That setting is more suitable for many NLP applications.
- Both the CLI and Python function interface offer à la carte control of which normalization steps to include or exclude.
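The relationship between a moderate default, an "all steps" mode, and à la carte selection can be sketched in a few lines of Python. This is only an illustration, not the wb_normalize.py implementation: the step names, the `normalize` function, and its `add`, `skip`, and `all_steps` parameters are hypothetical and do not correspond to the tool's actual options.

```python
import re
import unicodedata

# Hypothetical step registry; names are illustrative, not wildebeest's own.
STEPS = {
    "nfc": lambda s: unicodedata.normalize("NFC", s),
    "delete-soft-hyphen": lambda s: s.replace("\u00AD", ""),
    "collapse-spaces": lambda s: re.sub(r" {2,}", " ", s),
}

# Assumed moderate default: repair-style steps only.
DEFAULT_STEPS = ("nfc", "delete-soft-hyphen")


def normalize(text, add=(), skip=(), all_steps=False):
    """Apply the selected steps in registry order."""
    unknown = [name for name in (*add, *skip) if name not in STEPS]
    if unknown:
        raise ValueError(f"unknown normalization steps: {unknown}")
    selected = set(STEPS) if all_steps else set(DEFAULT_STEPS)
    selected |= set(add)
    selected -= set(skip)
    for name, step in STEPS.items():
        if name in selected:
            text = step(text)
    return text


if __name__ == "__main__":
    sample = "co\u00ADoperate  now"
    print(repr(normalize(sample)))                  # moderate default
    print(repr(normalize(sample, all_steps=True)))  # every step
    print(repr(normalize(sample, skip=("delete-soft-hyphen",))))
```

Keeping the steps in one ordered registry means the order of application stays fixed no matter how the caller combines the include and exclude lists.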
Documentation
- Updated and substantially expanded
- Installation options now include pip install wildebeest-nlp.
Added Georgian character normalization
Map 3 non-standard Georgian scripts to standard script Mkhedruli (U+10D0--U+10FF)
- Asomtavruli (historic, oldest, used by Orthodox Church; U+10A0--U+10CF)
- Nuskhuri (historic, second oldest, used by Orthodox Church; U+2D00--U+2D2F)
- Mkhedruli Mtavruli (sometimes used for emphasis and in headlines, corresponds to English ALL-CAPS or bold; added to Unicode in 2018; similar in appearance to standard Mkhedruli; U+1C90--U+1CBF)
Map 5 archaic characters (ჱ/he, ჲ/hie, ჳ/vie, ჴ/qari, ჵ/hoe) in standard Georgian script to non-archaic forms per https://en.wikipedia.org/wiki/Georgian_scripts#Letters_removed_from_the_Georgian_alphabet
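The script mapping described above (not the archaic-letter mapping) can be sketched with a translation table, because the letters in all four Georgian ranges are laid out in the same order. This is a minimal sketch under that assumption and not the wildebeest code; the names `build_georgian_table` and `normalize_georgian` are hypothetical.

```python
import unicodedata

MKHEDRULI_START = 0x10D0
# Start of each non-standard range listed above.
SOURCE_STARTS = {
    "Asomtavruli": 0x10A0,
    "Nuskhuri": 0x2D00,
    "Mtavruli": 0x1C90,
}
RANGE_SIZE = 0x30  # each range spans 48 code points


def build_georgian_table():
    """Map each assigned source letter to the Mkhedruli letter at the same offset."""
    table = {}
    for start in SOURCE_STARTS.values():
        for offset in range(RANGE_SIZE):
            source = chr(start + offset)
            target = chr(MKHEDRULI_START + offset)
            # Skip code points that are unassigned on either side.
            if unicodedata.name(source, "") and unicodedata.name(target, ""):
                table[ord(source)] = target
    return table


GEORGIAN_TABLE = build_georgian_table()


def normalize_georgian(text):
    """Replace Asomtavruli, Nuskhuri, and Mtavruli letters with Mkhedruli."""
    return text.translate(GEORGIAN_TABLE)


if __name__ == "__main__":
    # "Tbilisi" written in Mtavruli, given as escapes for clarity.
    mtavruli = "\u1C97\u1C91\u1C98\u1C9A\u1C98\u1CA1\u1C98"
    print(normalize_georgian(mtavruli))
```

The Mtavruli range needs a Python build whose Unicode database includes Unicode 11 or later; on older builds those code points have no name and are simply left unchanged.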
Minor update: soft-hyphen, Cyrillic analysis
Delete soft hyphen (U+00AD) rather than map it to regular hyphen.
Added new tests to wildebeest analysis:
- MIXED_CYRILLIC_LATIN: Token contains mix of Cyrillic and Latin
- PUNCT_CYRILLIC: Token contains punctuation followed by Cyrillic
- CYRILLIC_PUNCT: Token contains Cyrillic followed by punctuation
- MIXED_CYRILLIC_PUNCT: Token contains mix of Cyrillic and Punctuation
- CYRILLIC_PLUS_PERIOD: Token contains Cyrillic and a period (possibly abbreviation)
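One way such token tests could be expressed is sketched below. The tag names come from the list above, but the exact conditions behind each tag are assumptions made for illustration, as are the helper names `char_class` and `token_tags`; the script of a letter is approximated from the first word of its Unicode character name.

```python
import unicodedata
from itertools import groupby


def char_class(ch):
    """Coarse class of one character, judged by Unicode name and category."""
    if ch.isalpha():
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            return "CYRILLIC"
        if name.startswith("LATIN"):
            return "LATIN"
        return "OTHER"
    if unicodedata.category(ch).startswith("P"):
        return "PUNCT"
    return "OTHER"


def token_tags(token):
    """Return the set of (assumed) analysis tags that apply to a token."""
    classes = [char_class(ch) for ch in token]
    runs = [cls for cls, _ in groupby(classes)]  # collapse repeated classes
    present = set(classes)
    tags = set()
    if {"CYRILLIC", "LATIN"} <= present:
        tags.add("MIXED_CYRILLIC_LATIN")
    if present == {"CYRILLIC", "PUNCT"}:
        if runs == ["PUNCT", "CYRILLIC"]:
            tags.add("PUNCT_CYRILLIC")
        elif runs == ["CYRILLIC", "PUNCT"]:
            tags.add("CYRILLIC_PUNCT")
        else:
            tags.add("MIXED_CYRILLIC_PUNCT")
        # Assumption: every punctuation mark in the token is a period.
        if all(ch == "." for ch, cls in zip(token, classes) if cls == "PUNCT"):
            tags.add("CYRILLIC_PLUS_PERIOD")
    return tags


if __name__ == "__main__":
    for token in ["\u0440\u0443\u0431.", "\u0442.\u0431.", "a\u0439\u0434\u044B"]:
        print(token, sorted(token_tags(token)))
```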
Added basic look-alike correction for mixed Latin/Cyrillic tokens.
Version 0.6.1 corrects letters that look identical in Latin and Cyrillic but have different Unicode code points, e.g.
- aйды -> айды (mixed Cyrillic/Latin to pure Cyrillic)
- жəне -> және (mixed Cyrillic/Latin to pure Cyrillic)
- Austіn -> Austin (mixed Latin/Cyrillic to pure Latin)
- хабарлайдыTengrinews.kzтілшісі -> хабарлайды Tengrinews.kz тілшісі (split mixed Cyrillic/Latin token)
- http://kokshetau.akmo.gov.kz/және (deliberately keeping mixed Cyrillic/Latin URL unchanged)
- Мінбер.kz (deliberately keeping mixed Cyrillic/Latin URL unchanged)
About 4% of mixed Cyrillic/Latin tokens remain unprocessed. Some of these are at least theoretically correctable; others would require a language-specific language model. Examples:
- Даkar
- kүnge
- Күнделік.кz
- \r\nгаздандыру
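The basic idea behind look-alike correction can be sketched as follows. This is a simplified illustration, not the wildebeest algorithm: the look-alike table is a small hand-picked subset, the majority-script heuristic and the crude URL guard are assumptions, and the function names are hypothetical. Token splitting, as in the Tengrinews.kz example, is not covered.

```python
import unicodedata

# Small, hand-picked subset of Latin letters with Cyrillic look-alikes.
LATIN_TO_CYRILLIC = {
    "a": "\u0430", "e": "\u0435", "o": "\u043E", "p": "\u0440",
    "c": "\u0441", "x": "\u0445", "y": "\u0443", "i": "\u0456",
    "\u0259": "\u04D9",  # Latin schwa -> Cyrillic schwa
}
CYRILLIC_TO_LATIN = {cyr: lat for lat, cyr in LATIN_TO_CYRILLIC.items()}


def script(ch):
    """Approximate the script of a character from its Unicode name."""
    name = unicodedata.name(ch, "")
    for candidate in ("CYRILLIC", "LATIN"):
        if name.startswith(candidate):
            return candidate
    return None


def fix_lookalikes(token):
    """Convert minority-script look-alikes to the token's majority script."""
    if "." in token or "/" in token:
        return token  # crude guard: leave URL-like tokens untouched
    latin = [ch for ch in token if script(ch) == "LATIN"]
    cyrillic = [ch for ch in token if script(ch) == "CYRILLIC"]
    if not latin or not cyrillic:
        return token  # not a mixed token
    if len(cyrillic) >= len(latin):
        minority, mapping = latin, LATIN_TO_CYRILLIC
    else:
        minority, mapping = cyrillic, CYRILLIC_TO_LATIN
    # Only rewrite when every minority-script letter has a known look-alike.
    if all(ch in mapping for ch in minority):
        return "".join(mapping.get(ch, ch) for ch in token)
    return token


if __name__ == "__main__":
    for token in ["a\u0439\u0434\u044B", "Aust\u0456n", "\u0414\u0430kar"]:
        print(token, "->", fix_lookalikes(token))
```

A token whose minority-script letters lack look-alikes is returned unchanged under this heuristic, which is in line with the kind of residual cases listed above.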