0.4.0 (2024-11-01)
A significant update with various issues fixed and new enhancements. Provides scripts for automated corpus updates (see README).
- add stanza secondary_pipeline (1524851)
- various
0.3.1 (2024-06-21)
- stanza pipeline fix bad var name, refactor (a43c170)
0.3.0 (2024-06-20)
- redo stanza pipeline (070f60a)
- use stanza, deprecated fasttext for langid (2e2d5e5)
- working stanza pipeline (aa52528)
- add mupdf exception for pdf extraction (5f51358)
- conll to vert fix mwt handling (6731d04)
- improve stanza pipeline (e84e8bf)
- reduce export_text chunksize to 10000 (caac056)
- remove old stanza pipeline (86a77d1)
- update df.applymap to df.map (2a5393f)
- update gitignore (60ca73b)
- update pd.Timestamp format (f00606e)
- wip redo stanza pipeline (aa681b3)
- fix docstring (fc43853)
- update main deps (aad5949)
- update readme (5657223)
- update readme (b35e30c)
- update readme (fd0cc1f)
0.2.2 (2023-09-16)
- add date args for export_text (ea10b93)
- move corpus attributes to config yml (e0594cd)
- update freeling pipeline init_locale func (7d81c49)
- update readme (0199f41)
0.2.1 (2023-07-17)
- add FreeLing EN pipeline (783ac38)
- add pipeline/compare_vert script (c0db3e4)
- fix changelog release number (10eebe7)
0.2.0 (2023-07-07)
This release has various significant changes and is not backwards compatible with previous versions. See README.md for current workflow.
Version 0.2.0 has pipelines for building Spanish and French corpora with FreeLing. An English pipeline is currently being redesigned and will be integrated soon.
Corpora can now be built using both HTML and PDF content on ReliefWeb.
- added FreeLing NLP
- added language identification with fastText
- added PDF extraction module
- pipeline/ is now used for the final steps of corpus creation
- Various bug fixes and small improvements
0.1.1 (2022-12-14)
Includes various bug fixes and incremental improvements for making/managing a corpus.
- corpus: drop empty vert content before insert (1c1a2c4)
- corpus: export_attribute 'parameters' arg (c908447)
- corpus: remove drop_attr arg (879b441)
- corpus: update vertical content when outdated (2641324)
- corpus: use quoteattr, fix sql query syntax (70db0c6)
- corpus: vertical docstrings, 'update' arg (866eddf)
- db: add _about table (e437afd)
- db: add_missing_columns method (b80181a)
- source: add manual override to _set_wait (56ca19c)
- source: add_missing_cols & drop fields_id (f64682e)
- source: date.changed:asc - he-alike params (6481199)
- source: rw - replace run method with one (fc95364)
- source: rw, abort insert if empty df (701fd38)
- source: rw, add all, new methods (03f8ef6)
- source: rw, automatically set wait (9bfc159)
- source: rw, improve set limit behavior (d2ab0fd)
- source: rw, set default limit to 1000 (526aacc)
- source: update rw-en, rw-es API parameters (93bbc38)
- source: update variables (06bdadf)
- source: use SystemExit, fix if/else behavior (26e99ab)
- util: add clean_xml and xml_quoteattr methods (e835d66)
- util: add logging to convert.py (108b052)
- util: nan_to_none return a series of [None] (a37fa62)
- util: use UTC time for timestamps (9d1504e)
- corpus: standardize docstrings (dfc2ad1)
- db: standardize docstrings (593c790)
- source: standardize docstrings (511a1d1)
- standardize docstrings (e222e06)
- util: standardize docstrings (9b172e6)
Initial release