Huge memory usage for some languages, e.g. Finnish #19
Looking closer, I see that this is probably just due to the way Simplemma stores token/lemma pairs in a plain dict that is pickled and compressed. There are a lot of entries for Finnish (5 million I think) so the dict becomes huge when loaded into memory. The lzma compression is quite efficient so the data files end up having quite similar sizes despite the huge differences in the number of entries. I wonder if some other in-memory representation would be more efficient? Another idea would be to use something like lmdb to store the entries in a memory-mapped file instead of as a plain dict.
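For illustration, here is a minimal sketch of the storage scheme described above (this is not Simplemma's actual loading code, and the file name is hypothetical): a plain token -> lemma dict that is pickled and lzma-compressed on disk, then rebuilt in full when loaded into memory.

```python
import lzma
import pickle

def save_langdata(pairs: dict[str, str], path: str) -> None:
    # Compresses well on disk thanks to lzma...
    with lzma.open(path, "wb") as f:
        pickle.dump(pairs, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_langdata(path: str) -> dict[str, str]:
    # ...but decompression rebuilds millions of string objects in RAM.
    with lzma.open(path, "rb") as f:
        return pickle.load(f)

# langdata = load_langdata("fi.plzma")  # hypothetical file name; ~5M entries for Finnish
```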
Some more ideas for reducing the memory usage:
Hi @osma, thanks for the thorough analysis. This issue is indeed related to the way Simplemma stores information. It is very fast and also very convenient, however the dictionaries can use a significant amount of memory. My impression is that it is still fine on modern systems.
Thanks @adbar, I agree with your assessment. Though the README states this:
I don't think 1 GB of memory usage is compatible with "low-resource contexts", so it would be good to be able to reduce the memory requirements of Simplemma. Probably the idea of creating more rules makes the most sense here, as it shouldn't cause any significant loss of performance, especially for unrelated languages. Are the word/lemma lists that you've used to build the data dictionaries available somewhere, or are they only available as serialized dicts?
I've looked a bit closer at the dictionary of Finnish words, and also experimented with rule generation with partial success. While doing that, I noticed that quite a few (around 14-15%) of the keys in the dictionary actually consist of two words. Here are some random examples:
Is this intentional? Many of these are either negations or other two-word constructions.
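For reference, a quick way to measure the share of multiword keys, assuming langdata is the token -> lemma dict loaded as in the earlier sketch (names are illustrative, not from the Simplemma code base):

```python
def multiword_share(langdata: dict[str, str]) -> float:
    # Fraction of dictionary keys that contain whitespace, i.e. consist of more than one word.
    multiword = sum(1 for key in langdata if " " in key)
    return multiword / len(langdata)

# For the Finnish dictionary discussed here this comes out at roughly 0.14-0.15.
```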
Thanks @osma! These keys are indeed erroneous; I will remove them to make some space. Another idea would be to remove too long words from the dictionary (i.e. words longer than 18 characters): the longest words are arguably not so frequent, but their strings take the most space in memory. So far I don't plan to release the word lists since they can be derived from the dictionary anyway. Feel free to look further into it.
Some preliminary results from rule generation. Starting point: the current Finnish dictionary contains 5082701 entries. Removing multiword keys reduces this to 4348028 (around 15% less).

I generated simple suffix rules, similar in spirit to those currently implemented for English, except that the suffix rules are stored in a dictionary instead of Python if/else statements. Currently I have generated a total of around 600 simple rules with either 3-, 4-, 5- or 6-character suffixes and the corresponding replacement, for example 'kot' -> 'kko' and 'toina' -> 'to'. Then I've applied these rules to the dictionary and dropped all entries where the rule produces the correct result; if no rules match, or the result of the rule is incorrect, the entry is left in the dictionary. This reduces the number of entries to around 1.8 million, a reduction of 65% relative to the current situation (5.1M) and 59% relative to the 4.3M entries with single-word keys.

I haven't yet tried to add these rules to the Simplemma code base, just experimented in a Google Colab notebook that you may want to check out. I still need to fiddle around to find the best constraints for rule generation, but I don't expect to find a significantly better set than this. Do you think this is going in the right direction?
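A minimal sketch of the pruning step described above, under the stated assumptions (rules are suffix -> replacement pairs such as 'kot' -> 'kko', and langdata is the token -> lemma dict). This is illustrative code, not the Colab notebook itself:

```python
def apply_rules(word: str, rules: dict[str, str]) -> str | None:
    # Try longer suffixes first so that e.g. a 6-character rule wins over a 3-character one.
    for suffix in sorted(rules, key=len, reverse=True):
        if word.endswith(suffix):
            return word[: -len(suffix)] + rules[suffix]
    return None  # no rule matched

def prune_dictionary(langdata: dict[str, str], rules: dict[str, str]) -> dict[str, str]:
    # Keep only the entries that the rules miss or get wrong; drop what the rules already cover.
    return {
        word: lemma
        for word, lemma in langdata.items()
        if apply_rules(word, rules) != lemma
    }

# rules = {"kot": "kko", "toina": "to"}  # two of the ~600 generated rules mentioned above
```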
Note that I've intentionally kept the rules extremely simple and pretty much language-agnostic - the same rule generation approach could be used for other languages where most of the lemmas can be found by changing the suffix of the word. Of course I know that Finnish has complicated consonant gradation mechanisms, but I thought I'd try to generate simple rules that cover the majority of "easy" cases and leave the difficult cases to be handled by the dictionary.
I addressed the first issue and saved some space, especially for Finnish. Concerning the rules, your data-driven approach is quite interesting. However, I would stick to cases for which one can be absolutely sure that the transformation is correct. I used such lists to get inspiration for German and English; maybe you will find them useful as well:
I'm afraid that going by such lists would be a non-starter from my perspective. There are simply too many possible suffixes and suffix combinations in Finnish to do this manually. I'm sure someone else has done it (in fact I know - the advanced lemmatizers for Finnish such as Voikko and LMC do something like this), but that's beyond my skills and resources. Also, I think such an approach would be out of scope for Simplemma which, after all, aims to be very simple.

However, one of the criteria I've used in the rule generation is the accuracy of a potential rule, as measured on the original dictionary. There's a minimum threshold for how accurate the rule should be when applied to the matching dictionary entries. So far I've experimented with different thresholds from 0.2 up to 0.9. I could also require perfect 100% accuracy from the generated rules, although that would probably reduce their number quite drastically and thus also their overall effectiveness in reducing the size of the dictionary. What do you think? This wouldn't be absolutely sure, but the dictionary is quite large already, so it's unlikely that there would be many rules with serious side effects.
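A sketch of the accuracy criterion described above: a candidate rule is scored on the dictionary entries whose key it matches, and only kept if it reaches the chosen threshold (function and variable names are illustrative):

```python
def rule_accuracy(suffix: str, replacement: str, langdata: dict[str, str]) -> float:
    # Share of matching dictionary entries for which the rule produces the listed lemma.
    matches = correct = 0
    for word, lemma in langdata.items():
        if word.endswith(suffix):
            matches += 1
            if word[: -len(suffix)] + replacement == lemma:
                correct += 1
    return correct / matches if matches else 0.0

# Keep the candidate rule only if rule_accuracy(suffix, replacement, langdata) >= 0.9, for example.
```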
Just an illustrative example: a long inflected Finnish word can carry several stacked suffixes. So can you just use simple rules to strip each suffix in turn? Well, maybe in this particular case, but in general you could end up with a form that isn't the correct lemma, because the stem itself changes along the way. There are rule engines that can deal with this, for example HFST, and corresponding sets of rules for e.g. Finnish. But reimplementing all that in Simplemma would be very difficult and IMHO pointless.
Thanks for the example, this looks like a typical FST problem indeed... I'll try to experiment with limiting the length of the entries in the dictionary; if it doesn't impact the results, then we could save some more space. I'm still not sure about the threshold you mention, even if 0.9 seems OK. If you have time we could run experiments. Yet another idea would be to remove infrequent words from the list.

As for general use of the Finnish data for now: unpacking the archive is slower, dictionary search is slightly slower, and it uses 1 GB of memory indeed. But the code remains quite portable, so all things considered I believe it is fine.
I've now generated 1631 rules which all have a minimum accuracy of 0.9. Together they allow dropping 39% of the remaining 4.3M dictionary entries, down to 2.7M entries. I also tried requiring perfect accuracy, but the result was pretty bad: only 106 rules and a reduction of a bit over 2%. I think it's either doing something like this (I can still try tweaking the rule generation a little...) or forgetting about the idea. If it helps, I can try turning this into a pull request that adds the generated rules to the code base.

I don't have a strong opinion about the infrequent words. I don't have access to the original word lists, only the dictionaries, which don't provide any frequency information. I suspect there are going to be diminishing returns from adding new entries to the dictionary, but also quite small reductions in accuracy if you remove the most infrequent word forms. Having the rules in place would perhaps also allow removing more of the infrequent words, as the rules may compensate for some of the lost word forms.
1631 seems like a lot, maybe we could start with the 15-20 most frequent rules? I have no clue about frequency either, but there are ways to find that information. As a side note, it usually correlates strongly with word length.

I ran a few experiments and my best guess is to reduce the maximum length of the entries in the dictionary; that way we set an upper bound on vocabulary size and tackle frequency indirectly. From a computer science perspective, it also makes sense not to carry long strings along. I can put an upper limit at 16 characters and save about 10-20% memory without impacting performance too much. For Finnish the "greedy" mode partly compensates for the loss in vocabulary size, which means everything works as expected. If we only take even shorter words into account, the lemmatizer will not work as well; I think it would be a pity to lose information.
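A sketch of the length cap discussed above, assuming the same token -> lemma dict as before (the 16-character limit is the one mentioned in the comment; the function name is illustrative):

```python
def cap_entry_length(langdata: dict[str, str], max_len: int = 16) -> dict[str, str]:
    # Drop entries whose surface form is longer than max_len characters;
    # very long word forms tend to be infrequent but cost the most memory.
    return {word: lemma for word, lemma in langdata.items() if len(word) <= max_len}
```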
Whatever you say... these 18 suffix rules can together handle around 6.5% of the dictionary:
I can turn those into a PR which adds them into the rule set.

As for reducing the vocabulary size: fine with me, if you think you can find a better trade-off than the status quo.
I wouldn't drop the entries from the dictionary; I'd simply add the rules to increase potential coverage. In that sense, you can make a PR to modify the rules.
Sure, I will prepare a PR with the new rules. But my original motivation for doing this was also to reduce the size of the dictionary, in order to preserve memory. I hope we will get to that eventually. As I said above, these 18 rules will only cover around 6% of the dictionary entries, and they are some of the best I could find. You need a lot more rules to increase the coverage significantly. There is very little cost in adding more rules, as they will all be stored in a dictionary instead of complicated if/else structures.

But I'll start with these 18 as a proof of concept. If it turns out that there is no benefit or that there are unwanted consequences, we can abandon the whole idea of these kinds of rules before spending too much time on it.
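To make the intended behaviour concrete, here is a hedged sketch of how rules kept in a plain dict could complement the dictionary at lookup time, in line with the preference above (dictionary entries are not dropped; rules only extend coverage). The rule values and names are illustrative:

```python
SUFFIX_RULES = {"kot": "kko", "toina": "to"}  # example rules quoted earlier in the thread

def lemmatize_token(token: str, langdata: dict[str, str]) -> str:
    if token in langdata:
        return langdata[token]           # the dictionary always has priority
    for suffix in sorted(SUFFIX_RULES, key=len, reverse=True):
        if token.endswith(suffix):       # fall back to suffix rules for unknown tokens
            return token[: -len(suffix)] + SUFFIX_RULES[suffix]
    return token                         # last resort: return the token unchanged
```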
I would prefer to add rules in order to extend coverage, but your logic makes perfect sense. As you say, we could try these 18 rules first in order to see if they bring anything. Feel free to write a PR.
I'm closing the issue since some progress has been made. We can look into it again in the future if necessary.
While testing NatLibFi/Annif#626, I noticed that Simplemma needs a lot of memory for Finnish language lemmatization and language detection. I did a little comparison to see how the choice of language affects memory consumption.
First the baseline, English:
Results (partial) of running this with /usr/bin/time -v:

So far so good. Then other languages I know (a sketch for checking peak memory from inside Python follows at the end of this comment).
Estonian:
Swedish:
...
Finnish:
I checked the sizes of the data files for these four languages:
There seems to be a correlation with memory usage here. The data file for Finnish is much bigger than the others, and Swedish is also big compared to English and Estonian. But I think something must be wrong if a data file that is ~10MB larger than the others leads to 1GB of extra memory usage.
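For reference, here is a rough sketch of checking peak memory from inside Python instead of /usr/bin/time -v. It assumes a recent simplemma release that exposes lemmatize(token, lang=...); the exact API at the time of this issue may differ, so treat it as illustrative only:

```python
import resource
import simplemma

# Lemmatizing a single Finnish token forces the Finnish data to be loaded,
# which dominates the process's memory use.
print(simplemma.lemmatize("taloissa", lang="fi"))

peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # kilobytes on Linux
print(f"peak resident set size: {peak_kb / 1024:.0f} MiB")
```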