Add Polish language #234
Thank you for your contribution. It is a great dictionary, but indeed, 60 MB is overkill. I was thinking of breaking the dictionaries into lists of word roots, prefixes and suffixes, hoping to make it possible to use much more complete ones. Or even using a Trie, for that matter. I guess Polish would be a great starting point for experiments. 🙂 So let's hold off on this PR until we find a better solution.
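To make the Trie idea concrete, here is a minimal sketch in Python. It is only an illustration of why shared prefixes help with heavily declined languages, not how tt9 stores its dictionaries; the class names and the sample word forms are assumptions chosen for the example.

```python
class TrieNode:
    """One character position; children are keyed by the next character."""
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}
        self.is_word = False


class Trie:
    """Hypothetical prefix tree for a word list (illustration only)."""

    def __init__(self):
        self.root = TrieNode()
        self.node_count = 1  # counts the root node as well

    def insert(self, word):
        node = self.root
        for ch in word:
            nxt = node.children.get(ch)
            if nxt is None:
                nxt = TrieNode()
                node.children[ch] = nxt
                self.node_count += 1
            node = nxt
        node.is_word = True

    def contains(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

    def with_prefix(self, prefix):
        """Return all stored words that start with `prefix`, sorted."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        found = []
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.is_word:
                found.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(found)


# A few declined forms of the Polish noun "dom" share the stem "dom".
forms = ["dom", "domu", "domem", "domy", "domów"]
trie = Trie()
for form in forms:
    trie.insert(form)

total_chars = sum(len(form) for form in forms)
```

In this small example the five forms contain 21 characters in total, while the trie needs 10 nodes including the root, because the stem is stored once. Whether that translates into real savings on a phone depends on the per-node overhead of the actual storage format.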
Hi @rjolina, I'm not a subject matter expert, but I believe that Polish grammar is quite similar to Ukrainian/Bulgarian/Russian, so if those languages work with tt9, Polish should work as well. For each of these languages you can find dictionaries of various sizes (including huge ones). How about something smaller? Maybe we can just use, for example, the openboard dictionary with its 190k popular words (with declensions) and frequencies (I guess this is where you took the frequencies from). In comparison, tt9 currently has 230k words for Bulgarian, 290k for Ukrainian, and just 89k for Russian. I think this should be enough.
Not at all... 🙂 Bulgarian has no declensions and I still feel I am missing words every now and then. I suppose the optimum would be around 350k words. Ukrainian grammar resembles Polish and the dictionary may be OK-ish, but Russian probably feels quite incomplete. I think fewer than 200k is fine for a language like English, where words do not change much. I do use it and I am speaking from personal experience. I don't mind adding one of the Google dictionaries, but you will probably find yourselves adding new words all the time. Of course, I may be wrong. I'm fine with testing this option, if you want to. I would appreciate your feedback, too. Or, if you could find another word list of, say, 350k-500k words, we can go for that. But we cannot use this dictionary now; it is too big. Some lower-spec phones may crash trying to load it. Either way, as I said in my previous comment, I want to find a way of improving the dictionary storage mechanism and expanding the dictionaries for a better experience.
This one looks good: https://mirrors.tuna.tsinghua.edu.cn/ctan/systems/windows/winedt/dict/pl.zip (450k words). What do you think? It just uses the Windows cp1250 encoding, which needs to be converted to UTF-8.
It should be OK. And the encoding is fine, I may even keep it to save some disk space.
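For anyone who wants to do the conversion mentioned above, a minimal Python sketch follows. It relies only on the standard `cp1250` codec; the function names and file paths are placeholders made up for this example, not part of the project's tooling.

```python
def cp1250_to_utf8(data: bytes) -> bytes:
    """Re-encode raw cp1250 bytes as UTF-8."""
    return data.decode("cp1250").encode("utf-8")


def convert_file(src_path: str, dst_path: str) -> None:
    """Convert a cp1250 word list to UTF-8, line by line.

    newline="" keeps the original line endings untouched.
    """
    with open(src_path, "r", encoding="cp1250", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


# Size comparison for a word made only of Polish diacritics.
sample = "żółć"
size_cp1250 = len(sample.encode("cp1250"))
size_utf8 = len(sample.encode("utf-8"))
```

The disk-space remark also checks out on a small scale: each Polish diacritic takes one byte in cp1250 and two bytes in UTF-8, so `size_cp1250` is 4 and `size_utf8` is 8 for the sample word. Plain ASCII letters take one byte in both encodings, so the overall saving depends on how many diacritics the word list contains.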
This dictionary also looks OK to me. Should I make a new pull request or somehow modify this one?
I'm not a contributor to this project, I just want to have a good T9 keyboard on my phone :) How about adding word frequencies from openboard?
This is a must in order to have a good experience. If you have Node.js installed, you could use
I've noticed some supposedly nonsense words, such as: "fr", "gd", "nm", "Md", "np", "xi", "xv", "iii", "xxx" and so on. I would appreciate it if someone who speaks the language could review and clean up the two- and three-letter words. Otherwise, I'll just remove anything that looks weird.
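As a starting point for such a review, the short words could be pre-filtered with a rough heuristic. The Python sketch below is an assumption-laden illustration, not the cleanup that was actually performed: it flags words of up to three letters that either contain no Polish vowel or consist only of Roman-numeral letters. The allowlist `KNOWN_GOOD` is hypothetical and would need to be extended by a native speaker, since real words such as "w", "z" and "i" would otherwise be flagged.

```python
import re

POLISH_VOWELS = set("aąeęioóuy")
ROMAN_NUMERAL = re.compile(r"^[ivxlcdm]+$")
# Hypothetical allowlist of real short Polish words (to be extended by a reviewer).
KNOWN_GOOD = {"w", "z", "i"}


def is_suspicious(word, max_len=3):
    """Flag short words with no vowel or made only of Roman-numeral letters."""
    lower = word.lower()
    if len(lower) > max_len or lower in KNOWN_GOOD:
        return False
    has_vowel = any(ch in POLISH_VOWELS for ch in lower)
    looks_roman = ROMAN_NUMERAL.match(lower) is not None
    return (not has_vowel) or looks_roman


def split_for_review(words):
    """Return (keep, review): words that pass and words a human should check."""
    keep, review = [], []
    for word in words:
        (review if is_suspicious(word) else keep).append(word)
    return keep, review


keep, review = split_for_review(
    ["dom", "fr", "gd", "nm", "Md", "np", "xi", "xv", "iii", "xxx", "nie", "w"]
)
```

The heuristic only produces candidates. It will still flag some legitimate words (for example short pronouns spelled with Roman-numeral letters), so the `review` list is meant to be read by someone who speaks the language rather than deleted automatically.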
So, I've cleaned up the dictionary and tried it out by typing a couple of sample sentences from Wikipedia. It is mostly fine, even though I found one word was missing. I suppose the experience is good enough, so I will merge the PR. I suppose the optimal word count for Polish would be around 800k-1M words. @rjolina, @mikep-dev, @samurex, if you could find such a dictionary, feel free to open a new pull request.
Thank you guys for your cooperation! Can't wait for the release with my native language 🥰
I added the Polish language according to your instructions.
I am aware that the dictionary file is 10 times larger than the other dictionaries, but the Polish language has a lot of declensions and I do not know if it is possible to "trim" it somehow. This is also my first pull request on GitHub, so please be understanding :)