
Huge memory usage for some languages, e.g. Finnish #19

Closed
osma opened this issue Sep 29, 2022 · 19 comments
Labels: enhancement (New feature or request)

osma commented Sep 29, 2022

While testing NatLibFi/Annif#626, I noticed that Simplemma needs a lot of memory for Finnish language lemmatization and language detection. I did a little comparison to see how the choice of language affects memory consumption.

First the baseline, English:

#!/usr/bin/env python

from simplemma import text_lemmatizer

print(text_lemmatizer("she sells seashells on a seashore", lang='en'))

Results (partial) of running this with /usr/bin/time -v:

Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.14
Maximum resident set size (kbytes): 41652

So far so good. Then I tried other languages I know.

Estonian:

print(text_lemmatizer("jääääre kuuuurija töööö", lang='et'))
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.16
Maximum resident set size (kbytes): 41252
  • practically same as English

Swedish:

print(text_lemmatizer("sju sköna sjuksköterskor skötte sju sjösjuka sjömän", lang='sv'))
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.70
Maximum resident set size (kbytes): 190920
  • somewhat slower, needs 150MB more memory

...

Finnish:

print(text_lemmatizer("mustan kissan paksut posket", lang='fi'))
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.44
Maximum resident set size (kbytes): 1059328
  • oops, now takes at least an order of magnitude longer and 1GB of memory! Why?

I checked the sizes of the data files for these four languages:

 14M fi.plzma
2.6M sv.plzma
604K et.plzma
479K en.plzma

There seems to be a correlation with memory usage here. The data file for Finnish is much bigger than the others, and Swedish is also big compared to English and Estonian. But I think something must be wrong if a data file that is ~10MB larger than the others leads to 1GB of extra memory usage.

osma commented Sep 29, 2022

Looking closer, I see that this is probably just due to the way Simplemma stores token/lemma pairs in a plain dict that is pickled and compressed. There are a lot of entries for Finnish (around 5 million), so the dict becomes huge when loaded into memory. The lzma compression is quite efficient, so the data files end up at quite similar sizes despite the huge differences in the number of entries.
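
A rough back-of-the-envelope estimate (the overhead figures below are assumptions, not measurements of Simplemma's internals) shows why roughly 5 million short string pairs in a CPython dict land in the same ballpark as the observed 1 GB:

import sys

# Illustrative estimate only: approximate per-entry cost of a CPython dict
# holding ~5 million short token/lemma string pairs.
n_entries = 5_000_000
key_bytes = sys.getsizeof("mustan")   # ~55 bytes for a short ASCII str object
value_bytes = sys.getsizeof("musta")  # ~54 bytes
slot_bytes = 60                       # assumed dict overhead per entry (hash, pointers, spare capacity)

estimate = n_entries * (key_bytes + value_bytes + slot_bytes)
print(f"~{estimate / 2**30:.1f} GiB")  # about 0.8 GiB, the same order of magnitude as observed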

I wonder if some other in-memory representation would be more efficient?

Another idea would be to use something like lmdb to store the entries in a memory-mapped file instead of as a plain dict.
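
To illustrate the lmdb idea, a minimal sketch (assuming the third-party lmdb Python package; the path and the tiny stand-in dictionary are made up): lookups go through a memory-mapped file, so the full dictionary never has to be resident in memory.

import lmdb

# Tiny stand-in for the full token -> lemma dictionary.
dictionary = {"kissan": "kissa", "paksut": "paksu"}

# Build the store once; "simplemma_fi.lmdb" is a hypothetical path.
env = lmdb.open("simplemma_fi.lmdb", map_size=1 << 30)
with env.begin(write=True) as txn:
    for token, lemma in dictionary.items():
        txn.put(token.encode("utf-8"), lemma.encode("utf-8"))

# Look up a lemma; pages are mapped on demand, so resident memory stays small.
with env.begin() as txn:
    raw = txn.get("kissan".encode("utf-8"))
    print(raw.decode("utf-8") if raw else None)  # -> 'kissa'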

osma commented Oct 4, 2022

Some more ideas for reducing the memory usage:

  1. Generating and/or hand-crafting a set of rules for languages like Finnish and Polish, whose data dictionaries are huge due to their morphology, and then dropping from the dictionary the items that the rules can handle; this should reduce its size to a fraction. There are already rules for English and German, and I think it shouldn't be that hard to come up with similar rules for other languages.
  2. Using a more space-efficient data structure, such as a trie, to store the dictionary (see the sketch after this list). This would be a bit slower, though, as the Python built-in dict is heavily optimized.
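
For idea 2, a minimal sketch of what a packed trie lookup could look like, assuming the third-party marisa-trie package (not something Simplemma currently uses); it trades some lookup speed for a compact, prefix-sharing representation:

import marisa_trie  # third-party package: pip install marisa-trie

# Tiny stand-in for the full token -> lemma dictionary.
dictionary = {"kissan": "kissa", "kissat": "kissa", "paksut": "paksu"}

# BytesTrie stores (unicode key, bytes value) pairs in a static, memory-compact trie.
trie = marisa_trie.BytesTrie([(tok, lemma.encode("utf-8")) for tok, lemma in dictionary.items()])

if "kissan" in trie:
    print(trie["kissan"][0].decode("utf-8"))  # -> 'kissa'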

adbar added the enhancement (New feature or request) label Oct 4, 2022
adbar commented Oct 4, 2022

Hi @osma, thanks for the thorough analysis. This issue is indeed related to the way Simplemma stores information. It is very fast and also very convenient; however, the dictionaries can use a significant amount of memory. My impression is that it is still fine on modern systems.

  1. We could work on rules for languages such as Finnish; feel free to work on a draft, which I could integrate into the software.
  2. A trie structure would result in a loss of performance, and a key-value store would require external libraries, which could become a maintenance issue in the long term.

osma commented Oct 4, 2022

Thanks @adbar, I agree with your assessment. However, the README states this:

With its comparatively small footprint it is especially useful when speed and simplicity matter, in low-resource contexts, for educational purposes, or as a baseline system for lemmatization and morphological analysis.

I don't think 1GB of memory usage is compatible with "low-resource contexts", so it would be good to reduce the memory requirements of Simplemma. Creating more rules probably makes the most sense here, as it shouldn't cause any significant loss of performance, especially for unrelated languages.

Are the word/lemma lists that you've used to build the data dictionaries available somewhere or are they only available as serialized dicts?

osma commented Oct 4, 2022

I've looked a bit closer at the dictionary of Finnish words, and also experimented with rule generation with some partial success.

While doing that, I noticed that quite a few (around 14-15%) of the keys in the dictionary actually consist of two words. Here are some random examples:

'ovat nieleksineet' -> 'nieleksiä'
'olimme tuivertaneet' -> 'tuivertaa'
'eivät rohtuisi' -> 'rohtua'
'ei passattane' -> 'passata'
'et kultaisi' -> 'kullata'
'ei mässättäne' -> 'mässätä'
'ette putkahdelleet' -> 'putkahdella'

Is this intentional? Many of these are either negations (ei, et, eivät, ette) or forms of the verb olla ('to be', 'to have'), for example the first one ovat nieleksineet means '[they] have swallowed'. These are all valid expressions, but are these actually being used by Simplemma? I thought Simplemma looks at only one word at a time, so these entries would never match.

adbar commented Oct 4, 2022

Thanks @osma! These keys are indeed erroneous, I will remove them to make some space.

Another idea would be to remove overly long words from the dictionary (i.e. words longer than 18 characters); the longest words are arguably not very frequent, but those strings take the most space in memory.

So far I don't plan to release the word lists since they can be derived from the dictionary anyway. Feel free to look further into it.

osma commented Oct 5, 2022

Some preliminary results from rule generation.

Starting point: the current Finnish dictionary contains 5082701 entries. Removing multiword keys reduces this to 4348028 (around 15% less).

I generated simple suffix rules, similar in spirit to those currently implemented for English, except that the rules are stored in a dictionary instead of Python if/else statements. So far I have generated around 600 simple rules with 3-, 4-, 5- or 6-character suffixes and the corresponding replacements, for example 'kot' -> 'kko' and 'toina' -> 'to'. I then applied these rules to the dictionary and dropped all entries for which a rule produces the correct result; if no rule matches, or the result of the rule is incorrect, the entry is left in the dictionary. This reduces the number of entries to around 1.8 million: a reduction of 65% relative to the current situation (5.1M), and 59% relative to the 4.3M entries with single-word keys.
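
Roughly, the pruning step looks like this (a sketch of the experiment, not Simplemma code; the two rules shown are just the examples above):

# Suffix -> replacement rules, stored in a plain dict.
SUFFIX_RULES = {"kot": "kko", "toina": "to"}
MAX_SUFFIX_LEN = 6

def apply_rules(token):
    # Try the longest suffix first (6 down to 3 characters).
    for cut in range(min(MAX_SUFFIX_LEN, len(token) - 1), 2, -1):
        replacement = SUFFIX_RULES.get(token[-cut:])
        if replacement is not None:
            return token[:-cut] + replacement
    return None  # no rule matched

def prune(dictionary):
    # Keep only the entries that the rules do not already get right.
    return {tok: lemma for tok, lemma in dictionary.items() if apply_rules(tok) != lemma}

print(apply_rules("laatikot"))  # -> 'laatikko', so an entry 'laatikot' -> 'laatikko' could be dropped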

I haven't yet tried to add these rules to the Simplemma code base, just experimented in a Google Colab notebook that you may want to check out. I still need to fiddle around to find the best constraints for rule generation, but I don't expect to find a significantly better set than this.

Do you think this is going in the right direction?

osma commented Oct 5, 2022

Note that I've intentionally kept the rules extremely simple and pretty much language-agnostic: the same rule-generation approach could be used for other languages where most lemmas can be found by changing the suffix of the word. Of course, I know that Finnish has complicated consonant gradation mechanisms, but I thought I'd try to generate simple rules that cover the majority of "easy" cases and leave the difficult cases to the dictionary.

adbar commented Oct 5, 2022

I addressed the first issue and saved some space, especially for Finnish.

Concerning the rules, your data-driven approach is quite interesting. However, I would stick to cases for which one can be absolutely sure that the transformation is correct. I used such lists for inspiration for German and English; maybe you will find them useful as well:

osma commented Oct 5, 2022

I'm afraid that going by such lists would be a non-starter from my perspective. There are simply too many possible suffixes and suffix combinations in Finnish to do this manually. I'm sure someone has done it (in fact I know: the advanced lemmatizers for Finnish, such as Voikko and LMC, do something like this), but that's beyond my skills and resources. Also, I think such an approach would be out of scope for Simplemma, which, after all, aims to be very simple.

However, one of the criteria I've used in the rule generation is the accuracy of a potential rule, as measured on the original dictionary: there is a minimum threshold for how accurate a rule must be when applied to the matching dictionary entries. So far I've experimented with thresholds from 0.2 up to 0.9. I could also require perfect 100% accuracy from the generated rules, although that would probably reduce their number quite drastically, and thus also their overall effectiveness in reducing the size of the dictionary. What do you think? This wouldn't give absolute certainty, but since the accuracy is measured on a large dictionary, it's unlikely that many rules would have serious side effects.
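
Concretely, the accuracy of a candidate rule can be measured as the fraction of matching dictionary entries it reproduces exactly (a minimal sketch; the function and variable names are illustrative):

def rule_accuracy(suffix, replacement, dictionary):
    # Fraction of dictionary entries ending in `suffix` that the rule gets right.
    hits = total = 0
    for token, lemma in dictionary.items():
        if token.endswith(suffix):
            total += 1
            if token[:-len(suffix)] + replacement == lemma:
                hits += 1
    return hits / total if total else 0.0

# A candidate rule is kept only if it clears the chosen threshold, e.g.:
# rule_accuracy("toina", "to", dictionary) >= 0.9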

osma commented Oct 5, 2022

Just an illustrative example: the word unissammekaan translates roughly as 'even in our dreams'. It consists of the noun uni (dream) and the suffixes -ssa (in), -mme (our) and -kaan (even). So most of the word is actually just suffixes tacked onto the end :) It's not super rare either; an exact Google search gives more than 300 results, probably because it's part of an idiom (even in our dreams we didn't imagine...).

So can you just use simple rules to strip each suffix in turn? Maybe in this case: you would end up with uni, which happens to be the correct lemma here. But then consider haudassammekaan ('even in our graves'). If you do the same with this word, you end up with hauda. But this is wrong: the correct lemma is hauta (grave)! This happens because of consonant gradation: the stem appears with both d and t, depending on the form.
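
The failure mode is easy to reproduce with naive suffix stripping (illustrative only; str.removesuffix requires Python 3.9+):

def strip_suffixes(word, suffixes=("kaan", "mme", "ssa")):
    # Peel off the suffixes from the end, outermost first.
    for suffix in suffixes:
        word = word.removesuffix(suffix)  # Python 3.9+
    return word

print(strip_suffixes("unissammekaan"))    # -> 'uni'   (happens to be the correct lemma)
print(strip_suffixes("haudassammekaan"))  # -> 'hauda' (wrong: consonant gradation, the lemma is 'hauta')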

There are rule engines that can deal with this, for example HFST, and corresponding sets of rules for e.g. Finnish. But reimplementing all that in Simplemma would be very difficult and IMHO pointless.

adbar commented Oct 5, 2022

Thanks for the example, this looks like a typical FST problem indeed...

I'll try to experiment with limiting the length of the entries in the dictionary; if it doesn't impact the results, we could save some more space. I'm still not sure about the threshold you mention, even if 0.9 seems OK. If you have time, we could run experiments.

Yet another idea would be to remove infrequent words from the list.

As for general use of the Finnish data for now: unpacking the archive is slower, dictionary search is slightly slower, and it does indeed use 1 GB of memory. But the code remains quite portable, so all things considered I believe it is fine.

osma commented Oct 5, 2022

I've now generated 1631 rules which all have a minimum accuracy of 0.9. Together they allow dropping 39% of the remaining 4.3M dictionary entries, down to 2.7M entries. I also tried requiring perfect accuracy, but the result was pretty bad, only 106 rules and a reduction of a bit over 2%.

I think it's either doing something like this (I can still try tweaking the rule generation a little...) or forgetting about the idea altogether. If it helps, I can try turning this into a pull request that adds the generated rules to rules.py and drops the now-unnecessary dictionary entries. Then you could benchmark the result. What do you think?

I don't have a strong opinion about the infrequent words. I don't have access to the original word lists, only the dictionaries, which don't provide any frequency information. I suspect there are diminishing returns from adding new entries to the dictionary, but also only quite small reductions in accuracy if you remove the most infrequent word forms. Having the rules in place would perhaps also allow removing more of the infrequent words, as the rules may compensate for some of the lost word forms.

adbar commented Oct 6, 2022

1631 seems like a lot, maybe we could start with the 15-20 most frequent rules?

I have no clue about frequency either, but there are ways to find that information. As a side note, it usually correlates strongly with word length.

I ran a few experiments, and my best guess is to reduce the maximum length of the entries in the dictionary; that way we set an upper bound on vocabulary size and tackle frequency indirectly. From a computer science perspective, it also makes sense not to carry long strings along.

I can put an upper limit at 16 characters and save about 10-20% of the memory without impacting performance too much. For Finnish, the "greedy" mode partly compensates for the loss in vocabulary size, which means everything works as expected. If we only take even shorter words into account, the lemmatizer will not work as well; I think it would be a pity to lose information.
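
The length cap itself is a small filter over the existing dictionary (a sketch; the 16-character limit is the value mentioned above):

MAX_TOKEN_LEN = 16  # assumed cut-off, as discussed above

def cap_length(dictionary, max_len=MAX_TOKEN_LEN):
    # Drop entries whose (typically rare) keys exceed the length limit.
    return {tok: lemma for tok, lemma in dictionary.items() if len(tok) <= max_len}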

osma commented Oct 6, 2022

1631 seems like a lot, maybe we could start with the 15-20 most frequent rules?

Whatever you say... these 18 suffix rules can together handle around 6.5% of the dictionary:

{'isille': 'inen',
 'isiksi': 'inen',
 'isemme': 'inen',
 'iseksi': 'inen',
 'iselle': 'inen',
 'isenne': 'inen',
 'isten': 'inen',
 'iseen': 'inen',
 'isien': 'inen',
 'iseni': 'inen',
 'isesi': 'inen',
 'inne': 'i',
 'insa': 'i',
 'isen': 'inen',
 'iset': 'inen',
 'ini': 'i',
 'ain': 'a',
 'eja': 'i'}

I can turn those into a PR that adds them to rules.py and drops the matching entries from the dictionary. OK?

As for reducing the vocabulary size: fine with me, if you think you can find a better trade-off than the status quo.

adbar commented Oct 6, 2022

I wouldn't drop the entries from the dictionary; I'd simply add the rules to increase potential coverage. In that sense, you can make a PR to modify rules.py if you have time, and I'll test it then.

osma commented Oct 7, 2022

Sure, I will prepare a PR with the new rules. But my original motivation for doing this was also to reduce the size of the dictionary in order to save memory. I hope we will get to that eventually.

As I said above, these 18 rules will only cover around 6% of the dictionary entries, and they are some of the best I could find. You need a lot more rules to increase the coverage significantly. There is very little cost in adding more rules, as they will all be stored in a dictionary instead of complicated if/else structures. But I'll start with these 18 as a proof of concept. If it turns out that there is no benefit, or that there are unwanted consequences, we can abandon the whole idea of this kind of rule before spending too much time on it.

adbar commented Oct 7, 2022 via email

adbar commented Dec 21, 2022

I'm closing the issue since some progress has been made. We can look into it again in the future if necessary.

adbar closed this as completed Dec 21, 2022