Releases: uhermjakob/wildebeest

New: wb_analysis.py

20 Nov 05:20

Added new wb_analysis.py

  • In Python, inspired by old Perl script wb-analysis.pl, but with many improvements:
  • Uses Unicode resources (more coverage, easily adopts new Unicode versions)
  • More general handling of tokens with multiple scripts
  • Offers Python function interface, access to wb.analysis result data structure

Restructured wb_normalize.py

  • The new default is a much more moderate normalization that focuses on character repair and UTF-8 encoding normalization. It is generally suitable for applications that largely need to preserve the original text.
  • The old default of "apply all normalization steps" is available by using option --all. That option setting is more suitable for many NLP applications.
  • Both the CLI and Python function interface offer à la carte control of which normalization steps to include or exclude.

Documentation

  • Updated and substantially expanded
  • Installation options now include pip install wildebeest-nlp.

Added Georgian character normalization

22 Apr 04:59

Map 3 non-standard Georgian scripts to standard script Mkhedruli (U+10D0--U+10FF)

  • Asomtavruli (historic, oldest, used by Orthodox Church; U+10A0--U+10CF)
  • Nuskhuri (historic, second oldest, used by Orthodox Church; U+2D00--U+2D2F)
  • Mkhedruli Mtavruli (sometimes used for emphasis and in headlines, corresponds to English ALL-CAPS or bold; added to Unicode in 2018; similar in appearance to standard Mkhedruli; U+1C90--U+1CBF)
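
The script mapping above can be sketched as a code-point translation table. This is only an illustration, not the tool's actual implementation: it assumes that the letters of each script sit at the same offset within their Unicode blocks, and it covers only the main contiguous letter runs (the few extra letters outside those runs are left out). The names `GEORGIAN_TO_MKHEDRULI` and `to_mkhedruli` are hypothetical.

```python
# Hypothetical sketch: map non-standard Georgian scripts to Mkhedruli
# by shifting each letter to the same position in the Mkhedruli block.
MKHEDRULI_START = 0x10D0


def _range_map(first, last):
    """Map code points first..last onto Mkhedruli, preserving letter order."""
    return {cp: MKHEDRULI_START + (cp - first) for cp in range(first, last + 1)}


GEORGIAN_TO_MKHEDRULI = {
    **_range_map(0x10A0, 0x10C5),  # Asomtavruli (main letter run only)
    **_range_map(0x2D00, 0x2D25),  # Nuskhuri (main letter run only)
    **_range_map(0x1C90, 0x1CBA),  # Mtavruli (main letter run only)
}


def to_mkhedruli(text):
    """Return text with the covered Georgian letters mapped to Mkhedruli."""
    return text.translate(GEORGIAN_TO_MKHEDRULI)
```

Characters outside the three ranges, including text that is already in Mkhedruli, pass through unchanged.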

Map 5 archaic characters (ჱ/he, ჲ/hie, ჳ/vie, ჴ/qari, ჵ/hoe) in standard Georgian script to non-archaic forms per https://en.wikipedia.org/wiki/Georgian_scripts#Letters_removed_from_the_Georgian_alphabet

Minor update: soft-hyphen, Cyrillic analysis

06 Dec 06:25

Delete soft hyphen (U+00AD) rather than map it to regular hyphen.
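
As a minimal sketch of the new behavior (the function name `delete_soft_hyphens` is hypothetical, not part of the package), deleting the soft hyphen amounts to:

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD, an invisible hyphenation hint


def delete_soft_hyphens(text):
    """Remove soft hyphens entirely instead of turning them into '-'."""
    return text.replace(SOFT_HYPHEN, "")
```

Deleting rather than mapping keeps a word such as "co\u00adoperate" as "cooperate" instead of producing "co-operate".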

Added new tests to wildebeest analysis:

  • MIXED_CYRILLIC_LATIN: Token contains mix of Cyrillic and Latin
  • PUNCT_CYRILLIC: Token contains punctuation followed by Cyrillic
  • CYRILLIC_PUNCT: Token contains Cyrillic followed by punctuation
  • MIXED_CYRILLIC_PUNCT: Token contains mix of Cyrillic and punctuation
  • CYRILLIC_PLUS_PERIOD: Token contains Cyrillic and a period (possibly abbreviation)
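
A rough sketch of how such token tests could be computed with Python's standard `unicodedata` module follows. The label names come from the list above, but the detection rules are assumptions made for illustration; the actual analysis may define them differently. The order-sensitive tests (PUNCT_CYRILLIC, CYRILLIC_PUNCT) are left out, and `script_of` and `analyze_token` are hypothetical names.

```python
import unicodedata


def script_of(ch):
    """Classify one character as 'PUNCT', 'CYRILLIC', 'LATIN', or 'OTHER'."""
    if unicodedata.category(ch).startswith("P"):
        return "PUNCT"
    name = unicodedata.name(ch, "")
    for script in ("CYRILLIC", "LATIN"):
        if name.startswith(script):
            return script
    return "OTHER"


def analyze_token(token):
    """Return the set of (assumed) test labels that apply to a token."""
    present = {script_of(ch) for ch in token}
    labels = set()
    if {"CYRILLIC", "LATIN"} <= present:
        labels.add("MIXED_CYRILLIC_LATIN")
    if {"CYRILLIC", "PUNCT"} <= present:
        labels.add("MIXED_CYRILLIC_PUNCT")
    if "CYRILLIC" in present and "." in token:
        labels.add("CYRILLIC_PLUS_PERIOD")
    return labels
```

For example, `analyze_token("Aust\u0456n")`, where `\u0456` is the Cyrillic letter і, reports only the mixed Cyrillic/Latin label.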

Added basic look-alike correction for mixed Latin/Cyrillic tokens.

30 Nov 07:56

New version 0.6.1 corrects letters that look identical in Latin and Cyrillic but have different Unicode code points, e.g.

  • aйды -> айды (mixed Cyrillic/Latin to pure Cyrillic)
  • жəне -> және (mixed Cyrillic/Latin to pure Cyrillic)
  • Austіn -> Austin (mixed Latin/Cyrillic to pure Latin)
  • хабарлайдыTengrinews.kzтілшісі -> хабарлайды Tengrinews.kz тілшісі (split mixed Cyrillic/Latin token)
  • http://kokshetau.akmo.gov.kz/және (deliberately keeping mixed Cyrillic/Latin URL unchanged)
  • Мінбер.kz (deliberately keeping mixed Cyrillic/Latin URL unchanged)
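
The basic idea behind the first three examples can be sketched as follows. This is a simplified illustration under stated assumptions, not the released implementation: the look-alike table `PAIRS` is a small hand-picked sample, the names `scripts` and `fix_lookalikes` are hypothetical, and token splitting and URL detection are not covered. A mixed token is rewritten only if swapping look-alikes makes it purely one script; Cyrillic is tried first.

```python
import unicodedata

# Illustrative (Latin, Cyrillic) look-alike pairs; an assumed sample table.
PAIRS = [
    ("a", "\u0430"), ("e", "\u0435"), ("o", "\u043e"), ("p", "\u0440"),
    ("c", "\u0441"), ("x", "\u0445"), ("y", "\u0443"), ("i", "\u0456"),
    ("\u0259", "\u04d9"),  # Latin schwa vs. Cyrillic schwa
    ("A", "\u0410"), ("E", "\u0415"), ("O", "\u041e"), ("P", "\u0420"),
    ("C", "\u0421"), ("X", "\u0425"), ("T", "\u0422"), ("M", "\u041c"),
]
LATIN_TO_CYRILLIC = str.maketrans({lat: cyr for lat, cyr in PAIRS})
CYRILLIC_TO_LATIN = str.maketrans({cyr: lat for lat, cyr in PAIRS})


def scripts(token):
    """Return the scripts of the letters in token, e.g. {'LATIN'}."""
    return {unicodedata.name(ch, "").split(" ")[0] for ch in token if ch.isalpha()}


def fix_lookalikes(token):
    """Swap look-alike letters if that makes a mixed token single-script."""
    if scripts(token) != {"LATIN", "CYRILLIC"}:
        return token  # not a mixed Latin/Cyrillic token
    for table, target in ((LATIN_TO_CYRILLIC, "CYRILLIC"),
                          (CYRILLIC_TO_LATIN, "LATIN")):
        candidate = token.translate(table)
        if scripts(candidate) == {target}:
            return candidate
    return token  # still mixed either way; leave unchanged
```

Tokens that remain mixed after both attempts are returned unchanged.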

About 4% of mixed Cyrillic/Latin tokens remain unprocessed. Some of these are at least theoretically correctable; others would require a language-specific language model. Examples:

  • Даkar
  • kүnge
  • Күнделік.кz
  • \r\nгаздандыру