# CHANGELOG #
## Version 1.8.1, 2022-10-26 ##
- Prefer the 'fork' method for creating the worker processes for
parallel tagging, if it is supported by the operating system. This
is much faster than the 'spawn' method that is the default on some
non-Linux systems (issue #14).
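
As a rough illustration of the idea (not SoMeWeTa's actual code), the start method can be chosen with the standard `multiprocessing` module. The helper names `pick_start_method` and `make_pool` are hypothetical and only serve this sketch.

```python
import multiprocessing


def pick_start_method(preferred="fork"):
    """Return `preferred` if the platform supports it, else the platform default."""
    # The first entry of get_all_start_methods() is the platform default.
    available = multiprocessing.get_all_start_methods()
    return preferred if preferred in available else available[0]


def make_pool(processes):
    """Create a worker pool that uses 'fork' where it is available."""
    ctx = multiprocessing.get_context(pick_start_method())
    return ctx.Pool(processes=processes)
```

Using a context object rather than `multiprocessing.set_start_method()` keeps the choice local to the pool and does not change the global default for other code in the same process.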
## Version 1.8.0, 2021-08-03 ##
- Add option --use-nfkc to the command line interface and option
use_nfkc to the constructor of ASPTagger (issue #11). If this option
is used, the internal representation of the input data uses Unicode
normalization form NFKC. This can be useful for social media input
that misuses mathematical symbols for their typographic effects
(e.g. “𝕴𝖒𝖕𝖋𝖆𝖚𝖘𝖜𝖊𝖎𝖘” instead of “Impfausweis”).
- Add option --sentence-tag to specify an XML tag in the input data
that marks sentence boundaries (issue #12). This is particularly
useful in combination with the --sentence-tag option of SoMaJo.
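
The effect of NFKC normalization on the example above can be reproduced with Python's standard `unicodedata` module. This is a minimal sketch of the normalization itself, not of how SoMeWeTa applies it internally; the function name `normalize_nfkc` is made up for the example.

```python
import unicodedata


def normalize_nfkc(text):
    """Map compatibility characters (e.g. mathematical letters) to their plain forms."""
    return unicodedata.normalize("NFKC", text)


styled = "𝕴𝖒𝖕𝖋𝖆𝖚𝖘𝖜𝖊𝖎𝖘"  # mathematical bold Fraktur letters
plain = normalize_nfkc(styled)
print(plain)  # prints "Impfausweis"
```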
## Version 1.7.3, 2021-03-18 ##
- Use less memory when loading a model if the ijson library is present
and the Python version is at least 3.7 (at least 3.6 for CPython)
(issue #9).
- Restructured code for parallel tagging (issue #8).
## Version 1.7.2, 2021-03-05 ##
- Bugfix: Do not choke on chunks of XML that do not contain actual
word tokens (usually at the end of a file).
- Updated regular expressions for emojis, emoticons, numbers and URLs.
## Version 1.7.1, 2019-11-07 ##
- Fixed an XML-related bug in STTS_IBK_postprocessor.
- Fixed a minor bug in the emoticon regex.
## Version 1.7.0, 2019-11-07 ##
- Added Reddit links and Reddit-specific emoticons.
- Moved the command-line interface to cli.py.
- Added a helper script for tagging multiple files
(somewe-tagger-multifile).
- Added a postprocessing script for some deterministic tagging
decisions in STTS_IBK, e.g. URLs and emoticons
(STTS_IBK_postprocessor).
## Version 1.6.2, 2019-10-17 ##
- Sanity-check input: Warn if there are extremely long sentences (≥
500 words) in the input as this might indicate missing sentence
boundaries.
- Use np.frombuffer() instead of np.fromstring() to fix a
DeprecationWarning.
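
A sanity check of this kind could look roughly like the sketch below. It assumes that sentences are lists of tokens; the function name `check_sentence_lengths` and the exact warning text are hypothetical, and only the threshold of 500 words comes from the entry above.

```python
import warnings


def check_sentence_lengths(sentences, threshold=500):
    """Warn about sentences with at least `threshold` tokens; return their indices."""
    too_long = [i for i, sent in enumerate(sentences) if len(sent) >= threshold]
    for i in too_long:
        warnings.warn(
            f"Sentence {i} has {len(sentences[i])} tokens; "
            "sentence boundaries might be missing."
        )
    return too_long
```

Warning instead of raising an error lets tagging continue on input that legitimately contains very long sentences.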
## Version 1.6.1, 2019-10-02 ##
- New option -v/--version to output version information.
- Explicitly specify input encoding as UTF-8.
- Fixed a bug in progress display.
## Version 1.6.0, 2019-07-02 ##
- New method tag_xml_sentence for simplified processing of SoMaJo's
output for XML files.
- Updated regular expressions for emojis (taken from SoMaJo).
- Fixed a bug where SoMeWeTa could not be installed when numpy was not
already there.
## Version 1.5.1, 2019-06-19 ##
- Got rid of FutureWarning about possible nested sets in regular
expression.
## Version 1.5.0, 2019-04-12 ##
- Added support for parallel tagging of XML input.
- New option --progress for showing tagging progress and remaining
time.
- Fixed calculation of confidence interval when reporting
crossvalidation results.
## Version 1.4.0, 2018-11-14 ##
- Replaced XML parsing with a shallower approach. When tagging an XML
file, we no longer have to keep the whole file in memory.
- Minor improvements regarding URLs and emojis.
## Version 1.3.1, 2018-03-27 ##
Bugfix: Sentence boundaries are correctly recognized when reading an
XML file.
## Version 1.3.0, 2018-03-23 ##
- SoMeWeTa now has XML support. To tag an XML file, use the option
-x/--xml. It is assumed that each XML tag is on a separate line.
- The implementation of the beam search algorithm has been slightly
improved.
## Version 1.2.0, 2018-02-23 ##
It is now possible to use the option --ignore-tag to specify a tag
that will not be learned during training and that will be ignored
during evaluation. Use case: Partially annotated data that use a
pseudo-tag for tokens without annotation.
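
To illustrate what ignoring a tag during evaluation means, the sketch below computes accuracy only over tokens whose gold tag is not the ignored pseudo-tag. It is an assumption-laden example, not SoMeWeTa's evaluation code: the function `accuracy` and the pseudo-tag `"_"` are hypothetical.

```python
def accuracy(gold, predicted, ignore_tag=None):
    """Share of correctly tagged tokens, skipping tokens whose gold tag is `ignore_tag`."""
    pairs = [(g, p) for g, p in zip(gold, predicted) if g != ignore_tag]
    if not pairs:
        return 0.0
    return sum(g == p for g, p in pairs) / len(pairs)


gold = ["NN", "_", "VVFIN", "_"]  # "_" marks tokens without annotation
predicted = ["NN", "ADJA", "NN", "NN"]
print(accuracy(gold, predicted, ignore_tag="_"))  # prints 0.5
```

Without `ignore_tag`, the unannotated tokens would count as errors and the same data would yield 0.25.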
## Version 1.1.2, 2017-10-27 ##
Bugfix: Using the --parallel option no longer changes the order of
the sentences.
## Version 1.1.1, 2017-10-25 ##
This version fixes a bug that made it impossible to use the --parallel
option when reading from STDIN.
## Version 1.1.0, 2017-10-24 ##
- Bugfix: Removed trailing space from last tag in sentence.
- The new option --parallel makes it possible to use a pool of worker
processes to speed up tagging.
- We also print a log message that indicates tagging speed.
## Version 1.0.0, 2017-09-15 ##
First release.