All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Upload full models to Hugging Face Hub.
- Automatic download of full models.
- Hide TensorFlow and Transformers logging messages in executable scripts.
- Redirect Keras prediction progress bar to stderr.
- Huge memory improvements during training.
- Speed improvements using padding `longest` instead of `max_length`.
- Models are less sensitive to the presence of a capital letter at the start of the sentence.
- Improved performance on HBS Cyrillic text by transliterating it in models that were poorly trained on Cyrillic.
- Basic test suite.
- Allow changing the base model for XLMR. Any XLMRoberta model can be used.
- Migrate to `pyproject.toml` and `src/` tree structure, comply with PEP517, PEP518 and PEP621.
- Update to Hardrules 2.6
- Rules can be parametrized with `--rules_config config.yaml`
- Some rules have been refactored with better names.
- `--run_all_rules` mode to run each rule instead of stopping at the first discard.
- Language identification with FastSpell
- Better Serbo-Croatian and Slovene language detection.
- Easier installation! Now KenLM comes pre-compiled.
- The `BICLEANER_AI_THREADS` environment variable now controls the number of threads.
- Update HF Transformers.
- Update TensorFlow minimum version.
- Remove the `glove-python` dependency and use a custom compilation instead.
- Improved download scripts, easier to install and use.
- Set inter/intra_op parallelism to 0 by default.
- Default block size set to 10k, which is a bit faster.
- Faster noise generation for small datasets with lower block size.
- Model argument can be provided with or without 'metadata.yaml'.
- Add citation info to README.
- Avoid generating empty sentences in omit noise.
- Restore capital letters at the beginning of the sentence in frequency noise.
- Backward compatibility with older models.
- Compatibility of `glove` with Python>=3.7.
- Fix loading lite models in Python versions other than 3.8.
- Fix unbound variable `lm_stats`.
- Other minor fixes.
- Update hardrules to 1.2: adds score-only mode.
- Bicleaner train changes:
- Separate most of the training logic into the `BaseModel` class.
- Refactor synthetic noise build function.
- Parallelize synthetic noise generation.
- Add fuzzy matching noise and neighbour noise.
- Add Decomposable Attention model.
- Add Transformer-like model.
- Add XLMRoberta model.
- Bicleaner classify changes:
- Replace the old classifier with new neural models.
- Move hardrules into a separate package.