
Huge memory usage for some languages, e.g. Finnish #19

Closed
osma opened this issue Sep 29, 2022 · 19 comments
Labels: enhancement (New feature or request)

osma commented Sep 29, 2022

While testing NatLibFi/Annif#626, I noticed that Simplemma needs a lot of memory for Finnish language lemmatization and language detection. I did a little comparison to see how the choice of language affects memory consumption.

First the baseline, English:

#!/usr/bin/env python

from simplemma import text_lemmatizer

print(text_lemmatizer("she sells seashells on a seashore", lang='en'))

Results (partial) of running this with /usr/bin/time -v:

Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.14
Maximum resident set size (kbytes): 41652

So far so good. Then I tried other languages I know.

Estonian:

print(text_lemmatizer("jääääre kuuuurija töööö", lang='et'))
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.16
Maximum resident set size (kbytes): 41252
  • practically same as English

Swedish:

print(text_lemmatizer("sju sköna sjuksköterskor skötte sju sjösjuka sjömän", lang='sv'))
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.70
Maximum resident set size (kbytes): 190920
  • somewhat slower, needs 150MB more memory

...

Finnish:

print(text_lemmatizer("mustan kissan paksut posket", lang='fi'))
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.44
Maximum resident set size (kbytes): 1059328
  • oops, now takes at least an order of magnitude longer and 1GB of memory! Why?

I checked the sizes of the data files for these four languages:

 14M fi.plzma
2.6M sv.plzma
604K et.plzma
479K en.plzma

There seems to be a correlation with memory usage here. The data file for Finnish is much bigger than the others, and Swedish is also big compared to English and Estonian. But I think something must be wrong if a data file that is ~10MB larger than the others leads to 1GB of extra memory usage.

osma commented Sep 29, 2022

Looking closer, I see that this is probably just due to the way Simplemma stores token/lemma pairs in a plain dict that is pickled and compressed. There are a lot of entries for Finnish (around 5 million), so the dict becomes huge when loaded into memory. The lzma compression is quite efficient, so the data files end up at quite similar sizes despite the huge differences in the number of entries.
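
A rough back-of-the-envelope estimate (the overhead figures below are assumptions, not measurements of Simplemma's internals) shows why roughly 5 million short string pairs in a CPython dict land in the same ballpark as the observed 1 GB:

import sys

# Illustrative estimate only: approximate per-entry cost of a CPython dict
# holding ~5 million short token/lemma string pairs.
n_entries = 5_000_000
key_bytes = sys.getsizeof("mustan")   # ~55 bytes for a short ASCII str object
value_bytes = sys.getsizeof("musta")  # ~54 bytes
slot_bytes = 60                       # assumed dict overhead per entry (hash, pointers, spare capacity)

estimate = n_entries * (key_bytes + value_bytes + slot_bytes)
print(f"~{estimate / 2**30:.1f} GiB")  # about 0.8 GiB, the same order of magnitude as observed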

I wonder if some other in-memory representation would be more efficient?

Another idea would be to use something like lmdb to store the entries in a memory-mapped file instead of as a plain dict.
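
To illustrate the lmdb idea, a minimal sketch (assuming the third-party lmdb Python package; the path and the tiny stand-in dictionary are made up): lookups go through a memory-mapped file, so the full dictionary never has to be resident in memory.

import lmdb

# Tiny stand-in for the full token -> lemma dictionary.
dictionary = {"kissan": "kissa", "paksut": "paksu"}

# Build the store once; "simplemma_fi.lmdb" is a hypothetical path.
env = lmdb.open("simplemma_fi.lmdb", map_size=1 << 30)
with env.begin(write=True) as txn:
    for token, lemma in dictionary.items():
        txn.put(token.encode("utf-8"), lemma.encode("utf-8"))

# Look up a lemma; pages are mapped on demand, so resident memory stays small.
with env.begin() as txn:
    raw = txn.get("kissan".encode("utf-8"))
    print(raw.decode("utf-8") if raw else None)  # -> 'kissa'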

osma commented Oct 4, 2022

Some more ideas for reducing the memory usage:

  1. Generating and/or hand-crafting a set of rules for languages like Finnish and Polish, whose data dictionaries are huge due to their morphology, and then dropping from the dictionary the items that the rules can handle; this should reduce its size to a fraction. There are already rules for English and German, and I think it shouldn't be that hard to come up with similar rules for other languages.
  2. Using a more space-efficient data structure, such as a trie, to store the dictionary (see the sketch after this list). This would be a bit slower, though, as the Python built-in dict is heavily optimized.
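
For idea 2, a minimal sketch of what a packed trie lookup could look like, assuming the third-party marisa-trie package (not something Simplemma currently uses); it trades some lookup speed for a compact, prefix-sharing representation:

import marisa_trie  # third-party package: pip install marisa-trie

# Tiny stand-in for the full token -> lemma dictionary.
dictionary = {"kissan": "kissa", "kissat": "kissa", "paksut": "paksu"}

# BytesTrie stores (unicode key, bytes value) pairs in a static, memory-compact trie.
trie = marisa_trie.BytesTrie([(tok, lemma.encode("utf-8")) for tok, lemma in dictionary.items()])

if "kissan" in trie:
    print(trie["kissan"][0].decode("utf-8"))  # -> 'kissa'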

adbar added the enhancement (New feature or request) label Oct 4, 2022
adbar commented Oct 4, 2022

Hi @osma, thanks for the thorough analysis. This issue is indeed related to the way Simplemma stores information. It is very fast and also very convenient; however, the dictionaries can use a significant amount of memory. My impression is that it is still fine on modern systems.

  1. We could work on rules for languages such as Finnish; feel free to work on a draft, which I could integrate into the software.
  2. A trie structure would result in a loss of performance, and a key-value store would require external libraries, which could become a maintenance issue in the long term.

osma commented Oct 4, 2022

Thanks @adbar, I agree with your assessment. However, the README states this:

With its comparatively small footprint it is especially useful when speed and simplicity matter, in low-resource contexts, for educational purposes, or as a baseline system for lemmatization and morphological analysis.

I don't think 1GB of memory usage is compatible with "low-resource contexts", so it would be good to reduce the memory requirements of Simplemma. Creating more rules probably makes the most sense here, as it shouldn't cause any significant loss of performance, especially for unrelated languages.

Are the word/lemma lists that you've used to build the data dictionaries available somewhere or are they only available as serialized dicts?

osma commented Oct 4, 2022

I've looked a bit closer at the dictionary of Finnish words, and also experimented with rule generation with some partial success.

While doing that, I noticed that quite a few (around 14-15%) of the keys in the dictionary actually consist of two words. Here are some random examples:

'ovat nieleksineet' -> 'nieleksiä'
'olimme tuivertaneet' -> 'tuivertaa'
'eivät rohtuisi' -> 'rohtua'
'ei passattane' -> 'passata'
'et kultaisi' -> 'kullata'
'ei mässättäne' -> 'mässätä'
'ette putkahdelleet' -> 'putkahdella'

Is this intentional? Many of these are either negations (ei, et, eivät, ette) or forms of the verb olla ('to be', 'to have'), for example the first one ovat nieleksineet means '[they] have swallowed'. These are all valid expressions, but are these actually being used by Simplemma? I thought Simplemma looks at only one word at a time, so these entries would never match.

adbar commented Oct 4, 2022

Thanks @osma! These keys are indeed erroneous, I will remove them to make some space.

Another idea would be to remove overly long words from the dictionary (i.e. words longer than 18 characters); the longest words are arguably not very frequent, but those strings take the most space in memory.

So far I don't plan to release the word lists since they can be derived from the dictionary anyway. Feel free to look further into it.

osma commented Oct 5, 2022

Some preliminary results from rule generation.

Starting point: the current Finnish dictionary contains 5082701 entries. Removing multiword keys reduces this to 4348028 (around 15% less).

I generated simple suffix rules, similar in spirit to those currently implemented for English, except that the rules are stored in a dictionary instead of Python if/else statements. So far I have generated around 600 simple rules with 3-, 4-, 5- or 6-character suffixes and the corresponding replacements, for example 'kot' -> 'kko' and 'toina' -> 'to'. I then applied these rules to the dictionary and dropped all entries for which a rule produces the correct result; if no rule matches, or the result of the rule is incorrect, the entry is left in the dictionary. This reduces the number of entries to around 1.8 million: a reduction of 65% relative to the current situation (5.1M), and 59% relative to the 4.3M entries with single-word keys.
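
Roughly, the pruning step looks like this (a sketch of the experiment, not Simplemma code; the two rules shown are just the examples above):

# Suffix -> replacement rules, stored in a plain dict.
SUFFIX_RULES = {"kot": "kko", "toina": "to"}
MAX_SUFFIX_LEN = 6

def apply_rules(token):
    # Try the longest suffix first (6 down to 3 characters).
    for cut in range(min(MAX_SUFFIX_LEN, len(token) - 1), 2, -1):
        replacement = SUFFIX_RULES.get(token[-cut:])
        if replacement is not None:
            return token[:-cut] + replacement
    return None  # no rule matched

def prune(dictionary):
    # Keep only the entries that the rules do not already get right.
    return {tok: lemma for tok, lemma in dictionary.items() if apply_rules(tok) != lemma}

print(apply_rules("laatikot"))  # -> 'laatikko', so an entry 'laatikot' -> 'laatikko' could be dropped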

I haven't yet tried to add these rules to the Simplemma code base, just experimented in a Google Colab notebook that you may want to check out. I still need to fiddle around to find the best constraints for rule generation, but I don't expect to find a significantly better set than this.

Do you think this is going in the right direction?

osma commented Oct 5, 2022

Note that I've intentionally kept the rules extremely simple and pretty much language-agnostic: the same rule-generation approach could be used for other languages where most lemmas can be found by changing the suffix of the word. Of course, I know that Finnish has complicated consonant gradation mechanisms, but I thought I'd try to generate simple rules that cover the majority of "easy" cases and leave the difficult cases to the dictionary.

adbar commented Oct 5, 2022

I addressed the first issue and saved some space, especially for Finnish.

Concerning the rules, your data-driven approach is quite interesting. However, I would stick to cases for which one can be absolutely sure that the transformation is correct. I used such lists for inspiration for German and English; maybe you will find them useful as well:

osma commented Oct 5, 2022

I'm afraid that going by such lists would be a non-starter from my perspective. There are simply too many possible suffixes and suffix combinations in Finnish to do this manually. I'm sure someone has done it (in fact I know: the advanced lemmatizers for Finnish, such as Voikko and LMC, do something like this), but that's beyond my skills and resources. Also, I think such an approach would be out of scope for Simplemma, which, after all, aims to be very simple.

However, one of the criteria I've used in the rule generation is the accuracy of a potential rule, as measured on the original dictionary: there is a minimum threshold for how accurate a rule must be when applied to the matching dictionary entries. So far I've experimented with thresholds from 0.2 up to 0.9. I could also require perfect 100% accuracy from the generated rules, although that would probably reduce their number quite drastically, and thus also their overall effectiveness in reducing the size of the dictionary. What do you think? This wouldn't give absolute certainty, but since the accuracy is measured on a large dictionary, it's unlikely that many rules would have serious side effects.
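
Concretely, the accuracy of a candidate rule can be measured as the fraction of matching dictionary entries it reproduces exactly (a minimal sketch; the function and variable names are illustrative):

def rule_accuracy(suffix, replacement, dictionary):
    # Fraction of dictionary entries ending in `suffix` that the rule gets right.
    hits = total = 0
    for token, lemma in dictionary.items():
        if token.endswith(suffix):
            total += 1
            if token[:-len(suffix)] + replacement == lemma:
                hits += 1
    return hits / total if total else 0.0

# A candidate rule is kept only if it clears the chosen threshold, e.g.:
# rule_accuracy("toina", "to", dictionary) >= 0.9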

osma commented Oct 5, 2022

Just an illustrative example: the word unissammekaan translates roughly as 'even in our dreams'. It consists of the noun uni (dream) and the suffixes -ssa (in), -mme (our) and -kaan (even). So most of the word is actually just suffixes tacked onto the end :) It's not super rare either; an exact Google search gives more than 300 results, probably because it's part of an idiom (even in our dreams we didn't imagine...).

So can you just use simple rules to strip each suffix in turn? Maybe in this case: you would end up with uni, which happens to be the correct lemma here. But then consider haudassammekaan ('even in our graves'). If you do the same with this word, you end up with hauda. But this is wrong: the correct lemma is hauta (grave)! This happens because of consonant gradation: the stem appears with both d and t, depending on the form.
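
The failure mode is easy to reproduce with naive suffix stripping (illustrative only; str.removesuffix requires Python 3.9+):

def strip_suffixes(word, suffixes=("kaan", "mme", "ssa")):
    # Peel off the suffixes from the end, outermost first.
    for suffix in suffixes:
        word = word.removesuffix(suffix)  # Python 3.9+
    return word

print(strip_suffixes("unissammekaan"))    # -> 'uni'   (happens to be the correct lemma)
print(strip_suffixes("haudassammekaan"))  # -> 'hauda' (wrong: consonant gradation, the lemma is 'hauta')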

There are rule engines that can deal with this, for example HFST, and corresponding sets of rules for e.g. Finnish. But reimplementing all that in Simplemma would be very difficult and IMHO pointless.

adbar commented Oct 5, 2022

Thanks for the example, this looks like a typical FST problem indeed...

I'll try to experiment with limiting the length of the entries in the dictionary; if it doesn't impact the results, we could save some more space. I'm still not sure about the threshold you mention, even if 0.9 seems OK. If you have time, we could run experiments.

Yet another idea would be to remove infrequent words from the list.

As for general use of the Finnish data for now: unpacking the archive is slower, dictionary search is slightly slower, and it does indeed use 1 GB of memory. But the code remains quite portable, so all things considered I believe it is fine.

osma commented Oct 5, 2022

I've now generated 1631 rules which all have a minimum accuracy of 0.9. Together they allow dropping 39% of the remaining 4.3M dictionary entries, down to 2.7M entries. I also tried requiring perfect accuracy, but the result was pretty bad, only 106 rules and a reduction of a bit over 2%.

I think it's either doing something like this (I can still try tweaking the rule generation a little...) or forgetting about the idea altogether. If it helps, I can try turning this into a pull request that adds the generated rules to rules.py and drops the now-unnecessary dictionary entries. Then you could benchmark the result. What do you think?

I don't have a strong opinion about the infrequent words. I don't have access to the original word lists, only the dictionaries, which don't provide any frequency information. I suspect there are diminishing returns from adding new entries to the dictionary, but also only quite small reductions in accuracy if you remove the most infrequent word forms. Having the rules in place would perhaps also allow removing more of the infrequent words, as the rules may compensate for some of the lost word forms.

adbar commented Oct 6, 2022

1631 seems like a lot, maybe we could start with the 15-20 most frequent rules?

I have no clue about frequency either, but there are ways to find that information. As a side note, it usually correlates strongly with word length.

I ran a few experiments, and my best guess is to reduce the maximum length of the entries in the dictionary; that way we set an upper bound on vocabulary size and tackle frequency indirectly. From a computer science perspective, it also makes sense not to carry long strings along.

I can put an upper limit at 16 characters and save about 10-20% of the memory without impacting performance too much. For Finnish, the "greedy" mode partly compensates for the loss in vocabulary size, which means everything works as expected. If we only take even shorter words into account, the lemmatizer will not work as well; I think it would be a pity to lose information.
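
The length cap itself is a small filter over the existing dictionary (a sketch; the 16-character limit is the value mentioned above):

MAX_TOKEN_LEN = 16  # assumed cut-off, as discussed above

def cap_length(dictionary, max_len=MAX_TOKEN_LEN):
    # Drop entries whose (typically rare) keys exceed the length limit.
    return {tok: lemma for tok, lemma in dictionary.items() if len(tok) <= max_len}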

osma commented Oct 6, 2022

1631 seems like a lot, maybe we could start with the 15-20 most frequent rules?

Whatever you say... these 18 suffix rules can together handle around 6.5% of the dictionary:

{'isille': 'inen',
 'isiksi': 'inen',
 'isemme': 'inen',
 'iseksi': 'inen',
 'iselle': 'inen',
 'isenne': 'inen',
 'isten': 'inen',
 'iseen': 'inen',
 'isien': 'inen',
 'iseni': 'inen',
 'isesi': 'inen',
 'inne': 'i',
 'insa': 'i',
 'isen': 'inen',
 'iset': 'inen',
 'ini': 'i',
 'ain': 'a',
 'eja': 'i'}

I can turn those into a PR that adds them to rules.py and drops the matching entries from the dictionary. OK?

As for reducing the vocabulary size: fine with me, if you think you can find a better trade-off than the status quo.

adbar commented Oct 6, 2022

I wouldn't drop the entries from the dictionary; I'd simply add the rules to increase potential coverage. In that sense, you can make a PR to modify rules.py if you have time, and I'll test it then.

osma commented Oct 7, 2022

Sure, I will prepare a PR with the new rules. But my original motivation for doing this was also to reduce the size of the dictionary in order to save memory. I hope we will get to that eventually.

As I said above, these 18 rules will only cover around 6% of the dictionary entries, and they are some of the best I could find. You need a lot more rules to increase the coverage significantly. There is very little cost in adding more rules, as they will all be stored in a dictionary instead of complicated if/else structures. But I'll start with these 18 as a proof of concept. If it turns out that there is no benefit, or that there are unwanted consequences, we can abandon the whole idea of this kind of rule before spending too much time on it.

adbar commented Oct 7, 2022 via email

adbar commented Dec 21, 2022

I'm closing the issue since some progress has been made. We can look into it again in the future if necessary.

adbar closed this as completed Dec 21, 2022