Releases: adbar/simplemma
simplemma-1.1.1
simplemma-1.1.0
simplemma-1.0.0
Extensive refactoring by @juanjoDiaz:
- Series of modular classes
- Different lemmatization strategies available
- Customization of dictionary loading and handling (DictionaryFactory)
- LanguageDetector class with extended options
- See readme and detailed documentation
Breaking changes:
- The extensive argument is now greedy
- The langdetect submodule is now language_detector:
  from simplemma.langdetect import ... → from simplemma.language_detector import ...
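Code that has to run against both pre-1.0 and 1.0.0 installations could bridge the two renames with a small shim. The following is only a sketch: `load_language_detector` and `rename_extensive` are hypothetical helpers, not part of simplemma, and which functions accept the renamed keyword is left to the caller.

```python
import importlib


def load_language_detector():
    """Return simplemma's language detection module under whichever name exists."""
    # Try the 1.0.0 name first, then fall back to the pre-1.0 name.
    for name in ("simplemma.language_detector", "simplemma.langdetect"):
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None  # simplemma is not installed at all


def rename_extensive(kwargs):
    """Hypothetical helper: map the old `extensive` keyword to `greedy`."""
    updated = dict(kwargs)
    if "extensive" in updated:
        updated["greedy"] = updated.pop("extensive")
    return updated
```

For example, `rename_extensive({"lang": "de", "extensive": True})` yields a keyword dictionary that uses `greedy` instead, while dictionaries without the old keyword pass through unchanged.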
Fixes and improvements:
- is_known() function now restored to its state in v0.9.0 (full dictionary)
- More languages and better rules (with @juanjoDiaz)
- Use binary strings in dictionaries to save memory
- Dictionary sort before compression by @1over137
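The memory saving from binary strings can be illustrated in isolation. The sketch below does not reproduce simplemma's dictionary code; it only compares the in-memory size of a few sample words stored as `str` and as UTF-8 `bytes`. The word list is made up for illustration, and the sizes reported by `sys.getsizeof` are specific to CPython.

```python
import sys

# Sample words chosen for illustration only (ASCII and Cyrillic).
words = ["lemmatization", "gegangen", "разговаривали"]


def sizes(word):
    """Return (size as str, size as UTF-8 bytes) for one word, in bytes."""
    return sys.getsizeof(word), sys.getsizeof(word.encode("utf-8"))


for word in words:
    as_str, as_bytes = sizes(word)
    print(f"{word}: str={as_str} bytes={as_bytes}")
```

On CPython the `bytes` object has a smaller fixed overhead than the `str` object, so the difference adds up across a dictionary with many entries.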
Documentation:
- Classes and general doc pages by @juanjoDiaz
- Section on classes in the readme by @osma
simplemma-0.9.1
What's Changed
- smaller language data footprint with smallest possible impact on performance, using a combination of rules, upper limit on word length, and better data cleaning (#31)
- unsupervised approach to affixes activated by default for some languages
- reviewed rules for English and German (less greedy)
- added rules for Dutch, Finnish, Polish and Russian
- improved Russian and Ukrainian language data (#3)
- improved tokenizer
Full Changelog: v0.9.0...v0.9.1
simplemma-0.9.0
simplemma-0.8.2
- languages added: Albanian, Hindi, Icelandic, Malay, Middle English, Northern Sámi, Nynorsk, Serbo-Croatian, Swahili, Tagalog
- fix for slow language detection introduced in 0.7.0
Full Changelog: v0.8.1...v0.8.2
simplemma-0.8.1
- better rules for English and German
- inconsistencies fixed for cy, de, en, ga, sv (#16)
- docs: added language detection and citation info
Full Changelog: v0.8.0...v0.8.1
simplemma-0.8.0
- code fully type checked, optional pre-compilation with mypyc
- fixes: logging error (#11), input type (#12)
- code style: black
Full Changelog: v0.7.0...v0.8.0
simplemma-0.7.0
- breaking change: language data pre-loading now occurs internally, language codes are now directly provided in the lemmatize() call, e.g. simplemma.lemmatize("test", lang="en")
- faster lemmatization and result cache
- sentence-aware text_lemmatizer()
- optional iterators for tokenization and lemmatization
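With the new calling convention, a thin wrapper only needs to forward the language code on every call. The sketch below assumes simplemma 0.7.0 or later is installed when the default is used. `lemmatize_tokens` and its `lemmatize` parameter are hypothetical names introduced here so that the wrapper can also be exercised with a stand-in function.

```python
try:
    import simplemma  # assumed: simplemma >= 0.7.0
except ImportError:
    simplemma = None  # lets the sketch load without the package


def lemmatize_tokens(tokens, lang="en", lemmatize=None):
    """Lemmatize tokens, passing the language code directly in each call."""
    if lemmatize is None:
        if simplemma is None:
            raise RuntimeError("simplemma is not installed")
        lemmatize = simplemma.lemmatize
    # No separate language-data loading step: the code goes into the call.
    return [lemmatize(token, lang=lang) for token in tokens]
```

Calling `lemmatize_tokens(["test"], lang="en")` then amounts to one `simplemma.lemmatize("test", lang="en")` call per token.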
Full Changelog: v0.6.0...v0.7.0
simplemma-0.6.0
- improved language models
- improved tokenizer
- maintenance and code efficiency
- added basic language detection (undocumented)
Full Changelog: v0.5.0...v0.6.0