Add Polish language #234

Merged
merged 6 commits into from
May 16, 2023

Conversation

Contributor

rjolina commented Apr 1, 2023

I added the Polish language according to your instructions.
I am aware that the dictionary file is 10 times larger than the other dictionaries, but Polish has a lot of declensions and I do not know if it is possible to "trim" it somehow. This is also my first pull request on GitHub, so please be understanding :)

Owner

sspanak commented Apr 3, 2023

Thank you for your contribution. It is a great dictionary, but indeed, 60 MB is overkill.

I was thinking of breaking the dictionaries into a list of word roots, prefixes and suffixes, hoping to make it possible to use much more complete ones. Or even using a Trie for that matter. I guess Polish would be a great starting point for experiments. 🙂

So let's hold on with this PR until we find a better solution.
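
To illustrate the trie idea from this comment, here is a minimal sketch in Python. It is not how the project stores dictionaries; the `DigitTrie` class and the Polish key layout in `KEYS` are assumptions made for the example. Each node is keyed by a keypad digit, so all words that share a digit sequence end up at the same node.

```python
# Hypothetical key layout: which Polish letters sit on which digit is an
# assumption for this sketch, not taken from the project.
KEYS = {
    "2": "aąbcć", "3": "deęf", "4": "ghi", "5": "jklł",
    "6": "mnńoó", "7": "pqrsś", "8": "tuv", "9": "wxyzźż",
}
LETTER_TO_DIGIT = {ch: digit for digit, chars in KEYS.items() for ch in chars}


class DigitTrie:
    """Trie whose edges are keypad digits; words live at the node they spell."""

    def __init__(self):
        self.children = {}  # digit -> DigitTrie
        self.words = []     # words whose full digit sequence ends here

    def insert(self, word):
        node = self
        for ch in word.lower():
            # Raises KeyError for characters outside the assumed layout.
            digit = LETTER_TO_DIGIT[ch]
            node = node.children.setdefault(digit, DigitTrie())
        node.words.append(word)

    def lookup(self, digits):
        """Return the words typed by exactly this digit sequence."""
        node = self
        for digit in digits:
            node = node.children.get(digit)
            if node is None:
                return []
        return list(node.words)


trie = DigitTrie()
for w in ["dom", "fon", "kot", "łódź"]:
    trie.insert(w)

print(trie.lookup("366"))   # ['dom', 'fon']
print(trie.lookup("5639"))  # ['łódź']
```

Words with a common beginning share nodes, which is where the space saving for heavily inflected languages would come from; whether that is enough for a 60 MB list would need measuring.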

sspanak added the "languages" label (Dictionary or language related issues) on Apr 3, 2023

samurex commented Apr 4, 2023

Hi @rjolina, I'm not a subject matter expert, but I believe Polish grammar is quite similar to Ukrainian/Bulgarian/Russian, so if those languages work with tt9, Polish should work as well. For each of these languages you can find dictionaries of various sizes (including huge ones). How about something smaller? Maybe we could just use, for example, the OpenBoard dictionary: 190k popular words (with declensions) and frequencies (I guess this is where you took the frequencies from). In comparison, tt9 currently has 230k words for Bulgarian, 290k for Ukrainian, and just 89k for Russian. I think this should be enough.

Owner

sspanak commented Apr 4, 2023

> I believe that polish grammar is quite similar to ukrainian/bulgarian/russian

Not at all... 🙂 Bulgarian has no declensions and I still feel I am missing words every now and then. I suppose the optimum would be at around 350k words.

Ukrainian grammar resembles Polish and the dictionary may be ok-ish, but Russian probably feels quite incomplete. I think less than 200k is fine for a language like English, where words do not change much. I do use it and I am speaking from personal experience.

I don't mind adding one of the Google dictionaries, but you will probably find yourselves adding new words all the time. Of course, I may be wrong. I'm fine with testing this option, if you want to. I would appreciate your feedback, too. Or, if you could find another word list of, say, 350k-500k words, we can go for that.

But we cannot use this dictionary now; it is too big. Some lower-spec phones may crash trying to load it.

Either way, as I said in my previous comment, I want to find a way of improving the dictionary storage mechanism and expanding the dictionaries for better experience.


samurex commented Apr 4, 2023

> if you could find another word list of, say, 350k-500k words, we can go for that.

This one looks good: https://mirrors.tuna.tsinghua.edu.cn/ctan/systems/windows/winedt/dict/pl.zip, 450k words. What do you think? It just uses the Windows cp1250 encoding, which needs to be converted to UTF-8.
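
For reference, the cp1250 to UTF-8 conversion can be sketched in a few lines of Python, using the `cp1250` codec from the standard library. The function names and any file paths passed to them are placeholders for this example, not part of the project.

```python
def cp1250_to_utf8(raw: bytes) -> bytes:
    """Re-encode bytes from Windows cp1250 to UTF-8."""
    return raw.decode("cp1250").encode("utf-8")


def convert_file(src_path: str, dst_path: str) -> None:
    """Stream a cp1250 word list into a UTF-8 file, keeping line endings."""
    with open(src_path, "r", encoding="cp1250", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


sample = "żółw".encode("cp1250")
print(len(sample))                  # 4 bytes in cp1250
print(len(cp1250_to_utf8(sample)))  # 7 bytes in UTF-8
```

The size difference in the sample comes from cp1250 storing every Polish letter in one byte, while UTF-8 needs two bytes for each of the accented ones.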

Owner

sspanak commented Apr 4, 2023

> > if you could find another word list of, say, 350k-500k words, we can go for that.
>
> This one looks good https://mirrors.tuna.tsinghua.edu.cn/ctan/systems/windows/winedt/dict/pl.zip, 450k words. What do you think ? Just has fancy windows cp1250 encoding that needs to be converted to utf-8

It should be OK. And the encoding is fine, I may even keep it to save some disk space.

Contributor Author

rjolina commented Apr 4, 2023

> > > if you could find another word list of, say, 350k-500k words, we can go for that.
> >
> > This one looks good https://mirrors.tuna.tsinghua.edu.cn/ctan/systems/windows/winedt/dict/pl.zip, 450k words. What do you think ? Just has fancy windows cp1250 encoding that needs to be converted to utf-8
>
> It should be OK. And the encoding is fine, I may even keep it to save some disk space.

This dictionary also looks OK to me. Should I make a new pull request or somehow modify this one?


samurex commented Apr 4, 2023

I'm not a contributor to this project, I just want to have a good T9 keyboard on my phone :) How about adding word frequencies from OpenBoard?

Owner

sspanak commented Apr 5, 2023

> I'm not a contributor of this project, just want to have good T9 keyboard on my phone :) How about adding word frequencies from openboard ?

This is a must for a good experience. If you have Node.js installed, you could use scripts/inject-dictionary-frequencies.js for that. If not, I will do so before merging.
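
As a rough illustration of what attaching frequencies to a word list involves, here is a Python sketch. It does not reproduce scripts/inject-dictionary-frequencies.js; the input format (one `word frequency` pair per line), the tab-separated output, and the function name `attach_frequencies` are all assumptions for the example.

```python
def attach_frequencies(words, freq_lines):
    """Append a frequency to every word that has one in freq_lines.

    Assumed formats (hypothetical, not the project's):
      freq_lines: "word frequency" per line, whitespace-separated
      output:     "word<TAB>frequency", or the bare word if none is known
    """
    freqs = {}
    for line in freq_lines:
        parts = line.split()
        # Skip malformed lines instead of guessing what they mean.
        if len(parts) == 2 and parts[1].isdigit():
            freqs[parts[0].lower()] = int(parts[1])

    result = []
    for word in words:
        freq = freqs.get(word.lower())
        result.append(word if freq is None else f"{word}\t{freq}")
    return result


merged = attach_frequencies(
    ["dom", "kot", "Łódź"],
    ["dom 200", "łódź 50", "bad line here"],
)
print(merged)  # ['dom\t200', 'kot', 'Łódź\t50']
```

Matching is case-insensitive here so that a capitalized entry still picks up the frequency of its lowercase form; words absent from the frequency list are kept without a value.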

sspanak self-requested a review April 5, 2023 07:25
Owner

sspanak commented Apr 10, 2023

I've noticed some supposedly nonsense words, such as: "fr", "gd", "nm", "Md", "np", "xi", "xv", "iii", "xxx" and so on.

I would appreciate it if someone who speaks the language could review and clean up the two- and three-letter words. Otherwise, I'll just remove anything that looks weird.
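
One way to build a review list for that cleanup is sketched below in Python. The two heuristics (Roman numerals, and short words without a vowel) and the name `suspicious_short_words` are assumptions for the example, not the method used for this PR.

```python
import re

# Matches well-formed lowercase Roman numerals; the lookahead rejects "".
ROMAN = re.compile(
    r"(?=[ivxlcdm])m*(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})"
)
VOWELS = set("aąeęioóuy")


def suspicious_short_words(words, max_len=3):
    """Return short words that look like numerals or abbreviations."""
    flagged = []
    for word in words:
        lower = word.lower()
        if len(lower) > max_len:
            continue
        looks_roman = ROMAN.fullmatch(lower) is not None
        has_vowel = bool(set(lower) & VOWELS)
        if looks_roman or not has_vowel:
            flagged.append(word)
    return flagged


sample = ["fr", "gd", "nm", "Md", "np", "xi", "xv", "iii", "xxx",
          "dom", "on", "w"]
print(suspicious_short_words(sample))
# ['fr', 'gd', 'nm', 'Md', 'np', 'xi', 'xv', 'iii', 'xxx', 'w']
```

Note that the preposition "w" is flagged as well, and a real word such as "ci" would be too because it parses as a Roman numeral, so the output is a list for a Polish speaker to review rather than a list to delete blindly.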

sspanak self-requested a review May 16, 2023 07:53
Owner

sspanak commented May 16, 2023

So, I've cleaned up the dictionary and tried it out by typing a couple of sample sentences from Wikipedia. It is mostly fine, even though I found one word was missing. I suppose the experience is good enough, so I will merge the PR.

I suppose the optimal word count for Polish would be around 800k-1M words. @rjolina, @mikep-dev, @samurex, if you could find such a dictionary, feel free to open a new pull request.

sspanak merged commit adeae33 into sspanak:master May 16, 2023
Contributor Author

rjolina commented May 17, 2023

Thank you guys for your cooperation! Can't wait for the release with my native language 🥰

Labels
languages Dictionary or language related issues

4 participants