Releases: engisalor/corpusama
v0.4.0
v0.3.1
v0.3.0
0.3.0 (2024-06-20)
Features
- redo stanza pipeline (070f60a)
- use stanza, deprecated fasttext for langid (2e2d5e5)
- working stanza pipeline (aa52528)
Bug Fixes
- add mupdf exception for pdf extraction (5f51358)
- conll to vert fix mwt handling (6731d04)
- improve stanza pipeline (e84e8bf)
- reduce export_text chunksize to 10000 (caac056)
- remove old stanza pipeline (86a77d1)
- update df.applymap to df.map (2a5393f)
- update gitignore (60ca73b)
- update pd.Timestamp format (f00606e)
- wip redo stanza pipeline (aa681b3)
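For readers hitting the same pandas deprecation, the `df.applymap` to `df.map` change can be sketched roughly as follows. This is only an illustration, not code from the repository: the column names and the `elementwise` helper are hypothetical, and the fallback assumes that older pandas versions (before 2.1) only provide `applymap`.

```python
import pandas as pd

# Hypothetical sample data; not taken from the project.
df = pd.DataFrame(
    {"title": [" Flood update ", "Drought "], "source": [" OCHA", "WFP "]}
)

# DataFrame.map is the element-wise method in pandas >= 2.1;
# fall back to the deprecated applymap on older versions.
elementwise = df.map if hasattr(df, "map") else df.applymap

# Apply str.strip to every cell of the DataFrame.
cleaned = elementwise(str.strip)
print(cleaned)
```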
Documentation
v0.2.2
v0.2.1
v0.2.0
0.2.0 (2023-07-07)
This release has various significant changes and is not backwards compatible with previous versions. See README.md for the current workflow.
Version 0.2.0 has pipelines for building Spanish and French corpora with FreeLing. An English pipeline is currently being redesigned and will be integrated soon.
Corpora can now be built using both HTML and PDF content on ReliefWeb.
Features
- added FreeLing NLP
- added language identification with fastText
- added PDF extraction module
- pipeline/ is now used for the final steps of corpus creation
Bug Fixes
- Various bug fixes and small improvements
v0.1.1
0.1.1 (2022-12-14)
Includes various bug fixes and incremental improvements for making/managing a corpus.
Bug Fixes
- corpus: drop empty vert content before insert (1c1a2c4)
- corpus: export_attribute 'parameters' arg (c908447)
- corpus: remove drop_attr arg (879b441)
- corpus: update vertical content when outdated (2641324)
- corpus: use quoteattr, fix sql query syntax (70db0c6)
- corpus: vertical docstrings, 'update' arg (866eddf)
- db: add _about table (e437afd)
- db: add_missing_columns method (b80181a)
- source: add manual override to _set_wait (56ca19c)
- source: add_missing_cols & drop fields_id (f64682e)
- source: date.changed:asc - he-alike params (6481199)
- source: rw - replace run method with one (fc95364)
- source: rw, abort insert if empty df (701fd38)
- source: rw, add all, new methods (03f8ef6)
- source: rw, automatically set wait (9bfc159)
- source: rw, improve set limit behavior (d2ab0fd)
- source: rw, set default limit to 1000 (526aacc)
- source: update rw-en, rw-es API parameters (93bbc38)
- source: update variables (06bdadf)
- source: use SystemExit, fix if/else behavior (26e99ab)
- util: add clean_xml and xml_quoteattr methods (e835d66)
- util: add logging to convert.py (108b052)
- util: nan_to_none return a series of [None] (a37fa62)
- util: use UTC time for timestamps (9d1504e)
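Two of the fixes above, quoting attribute values with `quoteattr` and using UTC for timestamps, can be illustrated with the standard library alone. The sketch below is not the project's implementation: `doc_tag` and `utc_timestamp` are hypothetical helpers, and it assumes attributes end up in an XML-style opening `<doc>` tag.

```python
from datetime import datetime, timezone
from xml.sax.saxutils import quoteattr


def doc_tag(attrs: dict) -> str:
    """Build an opening <doc> tag with safely quoted attribute values (hypothetical helper)."""
    # quoteattr escapes &, <, > and chooses a quote character that suits the value.
    parts = [f"{name}={quoteattr(str(value))}" for name, value in attrs.items()]
    return "<doc " + " ".join(parts) + ">"


def utc_timestamp() -> str:
    """Return the current time as a timezone-aware ISO 8601 string in UTC (hypothetical helper)."""
    return datetime.now(timezone.utc).isoformat()


tag = doc_tag({"id": 1, "title": 'Floods & "flash" storms'})
print(tag)
print(utc_timestamp())
```

Quoting through `quoteattr` rather than manual string concatenation keeps ampersands and embedded quotes from breaking the tag, and a timezone-aware UTC value avoids ambiguity when records are compared across machines.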
Documentation
v0.1.0
0.1.0 (2022-11-29)
Initial release