Releases · uhermjakob/wildebeest

20 Nov 05:20

uhermjakob

0.9.2

7e09dd4

New: wb_analysis.py

Latest

Added new wb_analysis.py

In Python, inspired by old Perl script wb-analysis.pl, but with many improvements:
Uses Unicode resources (more coverage, easily adopts new Unicode versions)
More general handling of tokens with multiple scripts
Offers Python function interface, access to wb.analysis result data structure
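
The idea of detecting tokens with multiple scripts from Unicode data can be sketched with the Python standard library alone. This is an illustration only, not the wb_analysis.py implementation: the function names are hypothetical, and the script is approximated from the first word of each character's Unicode name, because the standard library does not expose the Unicode Script property directly.

```python
import unicodedata


def char_script(ch):
    """Approximate the script of a letter from its Unicode character name.

    Assumption: the first word of the name (e.g. LATIN, CYRILLIC, GEORGIAN)
    is a good-enough stand-in for the real Unicode Script property.
    """
    if not ch.isalpha():
        return None  # digits, punctuation, etc. carry no script here
    name = unicodedata.name(ch, "")
    return name.split(" ")[0].capitalize() if name else None


def token_scripts(token):
    """Return the set of (approximate) scripts used by the letters of a token."""
    return {script for script in map(char_script, token) if script}


def mixed_script_tokens(text):
    """Map each whitespace-separated token that uses several scripts to those scripts."""
    result = {}
    for token in text.split():
        scripts = token_scripts(token)
        if len(scripts) > 1:
            result[token] = sorted(scripts)
    return result
```

Calling mixed_script_tokens on a line of text returns only the tokens that mix scripts, which is the kind of information an analysis report can then aggregate.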

Restructured wb_normalize.py

The new default is a much more moderate normalization that focuses on character repair and UTF-8 encoding normalization. It is generally suitable for applications that largely need to preserve the original text.
The old default of "apply all normalization steps" is available by using option --all. That option setting is more suitable for many NLP applications.
Both the CLI and Python function interface offer à la carte control of which normalization steps to include or exclude.
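
The difference between a moderate default, an "all steps" mode and à la carte selection can be pictured with a small sketch. Everything in it is hypothetical: the step names, the choice of default steps and the normalize function are invented for illustration and are not wildebeest's actual step identifiers or API.

```python
import unicodedata

# Hypothetical normalization steps, applied in this order.
# These names are NOT wildebeest's real step identifiers.
STEPS = {
    "nfc": lambda s: unicodedata.normalize("NFC", s),       # compose characters
    "soft-hyphen": lambda s: s.replace("\u00AD", ""),       # delete soft hyphens
    "whitespace": lambda s: " ".join(s.split()),            # collapse whitespace
}

# Assumed moderate default: only the step that barely alters the text.
DEFAULT_STEPS = {"nfc"}


def normalize(text, all_steps=False, add=(), skip=()):
    """Apply the selected steps: default or all, plus `add`, minus `skip`."""
    selected = set(STEPS) if all_steps else set(DEFAULT_STEPS)
    selected |= set(add)
    selected -= set(skip)
    for name, step in STEPS.items():
        if name in selected:
            text = step(text)
    return text
```

With this layout, normalize(text) stays close to the original text, normalize(text, all_steps=True) mirrors the "apply everything" mode, and the add and skip arguments give per-step control in either mode.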

Documentation

Updated and extensively expanded
Installation options now include pip install wildebeest-nlp.


22 Apr 04:59

uhermjakob

0.6.3

9ce9816

Added Georgian character normalization

Map 3 non-standard Georgian scripts to standard script Mkhedruli (U+10D0--U+10FF)

Asomtavruli (historic, oldest, used by Orthodox Church; U+10A0--U+10CF)
Nuskhuri (historic, second oldest, used by Orthodox Church; U+2D00--U+2D2F)
Mkhedruli Mtavruli (sometimes used for emphasis and in headlines, corresponds to English ALL-CAPS or bold; added to Unicode in 2018; similar in appearance to standard Mkhedruli; U+1C90--U+1CBF)

Map 5 archaic characters (ჱ/he, ჲ/hie, ჳ/vie, ჴ/qari, ჵ/hoe) in standard Georgian script to non-archaic forms per https://en.wikipedia.org/wiki/Georgian_scripts#Letters_removed_from_the_Georgian_alphabet
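
The script mapping described above can be sketched with a translation table, since the letters of each Georgian script block sit at a fixed code point offset from their Mkhedruli counterparts. This is a minimal sketch, not the wildebeest code: the name to_mkhedruli is hypothetical, only the main contiguous letter run of each block is covered, and the five archaic letters are deliberately left untouched here.

```python
# (first code point, last code point, offset to the Mkhedruli block at U+10D0)
_GEORGIAN_RANGES = [
    (0x10A0, 0x10C5, 0x10D0 - 0x10A0),  # Asomtavruli
    (0x2D00, 0x2D25, 0x10D0 - 0x2D00),  # Nuskhuri
    (0x1C90, 0x1CBA, 0x10D0 - 0x1C90),  # Mtavruli
]

# Translation table for str.translate: source code point -> Mkhedruli code point.
_TO_MKHEDRULI = {
    cp: cp + offset
    for first, last, offset in _GEORGIAN_RANGES
    for cp in range(first, last + 1)
}


def to_mkhedruli(text):
    """Map Asomtavruli, Nuskhuri and Mtavruli letters to standard Mkhedruli."""
    return text.translate(_TO_MKHEDRULI)
```

Text that is already in Mkhedruli, or in any other script, passes through unchanged because its code points are not in the table.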


06 Dec 06:25

uhermjakob

0.6.2

d8b361a

Minor update: soft-hyphen, Cyrillic analysis

Delete soft hyphen (U+00AD) rather than map it to regular hyphen.

Added new tests to wildebeest analysis:

MIXED_CYRILLIC_LATIN: Token contains mix of Cyrillic and Latin
PUNCT_CYRILLIC: Token contains punctuation followed by Cyrillic
CYRILLIC_PUNCT: Token contains Cyrillic followed by punctuation
MIXED_CYRILLIC_PUNCT: Token contains mix of Cyrillic and punctuation
CYRILLIC_PLUS_PERIOD: Token contains Cyrillic and a period (possibly abbreviation)
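
As a rough picture of how such tests might be evaluated per token, here is a minimal sketch. The test names come from the list above, but the exact conditions are assumptions made for illustration (for example, "followed by" is read as "immediately followed by"), and the function cyrillic_flags is hypothetical rather than part of wildebeest.

```python
import unicodedata


def is_cyrillic(ch):
    return unicodedata.name(ch, "").startswith("CYRILLIC")


def is_latin(ch):
    return ch.isalpha() and unicodedata.name(ch, "").startswith("LATIN")


def is_punct(ch):
    # Unicode general categories starting with "P" are punctuation.
    return unicodedata.category(ch).startswith("P")


def cyrillic_flags(token):
    """Return the names of the (assumed) Cyrillic tests that a token triggers."""
    flags = set()
    if not any(is_cyrillic(ch) for ch in token):
        return flags
    if any(is_latin(ch) for ch in token):
        flags.add("MIXED_CYRILLIC_LATIN")
    pairs = list(zip(token, token[1:]))  # adjacent character pairs
    if any(is_punct(a) and is_cyrillic(b) for a, b in pairs):
        flags.add("PUNCT_CYRILLIC")
    if any(is_cyrillic(a) and is_punct(b) for a, b in pairs):
        flags.add("CYRILLIC_PUNCT")
    if any(is_punct(ch) for ch in token):
        flags.add("MIXED_CYRILLIC_PUNCT")
    if "." in token:
        flags.add("CYRILLIC_PLUS_PERIOD")
    return flags
```

A token can trigger several tests at once under these assumed definitions; an abbreviation-like token with periods between Cyrillic letters, for instance, raises all four punctuation-related flags.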


30 Nov 07:56

uhermjakob

0.6.1

f31a298

Added basic look-alike correction for mixed Latin/Cyrillic tokens.

Version 0.6.1 corrects look-alike characters, i.e. Latin and Cyrillic letters that look identical but have different Unicode code points, e.g.

aйды -> айды (mixed Cyrillic/Latin to pure Cyrillic)
жəне -> және (mixed Cyrillic/Latin to pure Cyrillic)
Austіn -> Austin (mixed Latin/Cyrillic to pure Latin)
хабарлайдыTengrinews.kzтілшісі -> хабарлайды Tengrinews.kz тілшісі (split mixed Cyrillic/Latin token)
http://kokshetau.akmo.gov.kz/және (deliberately keeping mixed Cyrillic/Latin URL unchanged)
Мінбер.kz (deliberately keeping mixed Cyrillic/Latin URL unchanged)

About 4% of mixed Cyrillic/Latin tokens remain unprocessed. Some of these are at least theoretically correctable; others would require a language-specific language model. Examples:

Даkar
kүnge
Күнделік.кz
\r\nгаздандыру
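
The core of look-alike correction can be sketched as follows. This is not the wildebeest implementation: the look-alike table is a small illustrative subset, the function name fix_lookalikes is hypothetical, and the rule used here (convert only when every letter of one script has a look-alike in the other script, while the other script has at least one letter that does not) is a simplifying assumption. Token splitting is not covered.

```python
import unicodedata

# Illustrative subset of look-alike pairs: Latin letter -> Cyrillic letter.
LATIN_TO_CYRILLIC = {
    "\u0061": "\u0430",  # a
    "\u0063": "\u0441",  # c
    "\u0065": "\u0435",  # e
    "\u0069": "\u0456",  # i
    "\u006F": "\u043E",  # o
    "\u0070": "\u0440",  # p
    "\u0078": "\u0445",  # x
    "\u0079": "\u0443",  # y
    "\u0259": "\u04D9",  # schwa
}
CYRILLIC_TO_LATIN = {cyr: lat for lat, cyr in LATIN_TO_CYRILLIC.items()}


def script_of(ch):
    """Return "Latin", "Cyrillic" or None, based on the Unicode character name."""
    name = unicodedata.name(ch, "")
    if name.startswith("LATIN"):
        return "Latin"
    if name.startswith("CYRILLIC"):
        return "Cyrillic"
    return None


def fix_lookalikes(token):
    """Unify a mixed Latin/Cyrillic token when the direction is unambiguous."""
    if "://" in token:
        return token  # leave URLs with an explicit scheme alone
    latin = [ch for ch in token if script_of(ch) == "Latin"]
    cyrillic = [ch for ch in token if script_of(ch) == "Cyrillic"]
    if not latin or not cyrillic:
        return token  # not a mixed token
    latin_convertible = all(ch in LATIN_TO_CYRILLIC for ch in latin)
    cyrillic_convertible = all(ch in CYRILLIC_TO_LATIN for ch in cyrillic)
    if latin_convertible and not cyrillic_convertible:
        return "".join(LATIN_TO_CYRILLIC.get(ch, ch) for ch in token)
    if cyrillic_convertible and not latin_convertible:
        return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in token)
    return token  # ambiguous or uncorrectable: keep as is
```

Under this rule a token such as Даkar stays unchanged, because neither script can be fully converted into the other; resolving such cases is where a language-specific model would come in.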
