
Releases: engisalor/corpusama

v0.4.0

01 Nov 18:11
fa7f603

0.4.0 (2024-11-01)

A significant update that fixes various issues and adds new enhancements. It also provides scripts for automated corpus updates (see README).

Features

  • add stanza secondary_pipeline (1524851)

Bug Fixes

  • various

Documentation

  • various

v0.3.1

21 Jun 11:13
de1e03d

0.3.1 (2024-06-21)

Bug Fixes

  • stanza pipeline fix bad var name, refactor (a43c170)

v0.3.0

20 Jun 19:20
5de6446

0.3.0 (2024-06-20)

Features

  • redo stanza pipeline (070f60a)
  • use stanza, deprecated fasttext for langid (2e2d5e5)
  • working stanza pipeline (aa52528)

Bug Fixes

  • add mupdf exception for pdf extraction (5f51358)
  • conll to vert fix mwt handling (6731d04)
  • improve stanza pipeline (e84e8bf)
  • reduce export_text chunksize to 10000 (caac056)
  • remove old stanza pipeline (86a77d1)
  • update df.applymap to df.map (2a5393f)
  • update gitignore (60ca73b)
  • update pd.Timestamp format (f00606e)
  • wip redo stanza pipeline (aa681b3)

Documentation

v0.2.2

16 Sep 10:19
7c0ad70

0.2.2 (2023-09-16)

Bug Fixes

  • add date args for export_text (ea10b93)
  • move corpus attributes to config yml (e0594cd)
  • update freeling pipeline init_locale func (7d81c49)

Documentation

v0.2.1

17 Jul 16:25
9f879db

0.2.1 (2023-07-17)

Bug Fixes

  • add FreeLing EN pipeline (783ac38)
  • add pipeline/compare_vert script (c0db3e4)
  • fix changelog release number (10eebe7)

v0.2.0

07 Jul 18:14
dcecf9c

0.2.0 (2023-07-07)

This release has various significant changes and is not backwards compatible with previous versions. See README.md for current workflow.

Version 0.2.0 includes pipelines for building Spanish and French corpora with FreeLing. An English pipeline is being redesigned and will be integrated soon.

Corpora can now be built using both HTML and PDF content on ReliefWeb.

Features

  • added FreeLing NLP
  • added language identification with fastText
  • added PDF extraction module
  • pipeline/ is now used for the final steps of corpus creation

Bug Fixes

  • Various bug fixes and small improvements

v0.1.1

14 Dec 17:57
d824fce

0.1.1 (2022-12-14)

Includes various bug fixes and incremental improvements for making/managing a corpus.

Bug Fixes

  • corpus: drop empty vert content before insert (1c1a2c4)
  • corpus: export_attribute 'parameters' arg (c908447)
  • corpus: remove drop_attr arg (879b441)
  • corpus: update vertical content when outdated (2641324)
  • corpus: use quoteattr, fix sql query syntax (70db0c6)
  • corpus: vertical docstrings, 'update' arg (866eddf)
  • db: add _about table (e437afd)
  • db: add_missing_columns method (b80181a)
  • source: add manual override to _set_wait (56ca19c)
  • source: add_missing_cols & drop fields_id (f64682e)
  • source: date.changed:asc - he-alike params (6481199)
  • source: rw - replace run method with one (fc95364)
  • source: rw, abort insert if empty df (701fd38)
  • source: rw, add all, new methods (03f8ef6)
  • source: rw, automatically set wait (9bfc159)
  • source: rw, improve set limit behavior (d2ab0fd)
  • source: rw, set default limit to 1000 (526aacc)
  • source: update rw-en, rw-es API parameters (93bbc38)
  • source: update variables (06bdadf)
  • source: use SystemExit, fix if/else behavior (26e99ab)
  • util: add clean_xml and xml_quoteattr methods (e835d66)
  • util: add logging to convert.py (108b052)
  • util: nan_to_none return a series of [None] (a37fa62)
  • util: use UTC time for timestamps (9d1504e)
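
For readers unfamiliar with the quoteattr fix listed above (70db0c6), here is a minimal sketch of what quoting attribute values with the standard library's xml.sax.saxutils.quoteattr looks like when building a document tag for a vertical file. The doc_tag helper and the sample attributes are hypothetical and for illustration only; they are not taken from corpusama's code.

```python
from xml.sax.saxutils import quoteattr


def doc_tag(attrs: dict) -> str:
    """Build an opening <doc> tag with safely quoted attribute values.

    Hypothetical helper for illustration; not corpusama's actual API.
    """
    # quoteattr escapes &, < and > and picks a quote character that
    # does not clash with quotes inside the value.
    parts = [f"{name}={quoteattr(str(value))}" for name, value in attrs.items()]
    return "<doc " + " ".join(parts) + ">"


tag = doc_tag({"id": 1, "title": 'Flood "update" & response'})
print(tag)
```

Because the title contains double quotes but no single quotes, quoteattr wraps it in single quotes and escapes the ampersand, so the tag stays well formed without manual string handling.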

Documentation

  • corpus: standardize docstrings (dfc2ad1)
  • db: standardize docstrings (593c790)
  • source: standardize docstrings (511a1d1)
  • standardize docstrings (e222e06)
  • util: standardize docstrings (9b172e6)

v0.1.0

29 Nov 11:09
d3b0668

0.1.0 (2022-11-29)

Initial release