Diacritics reconstruction (restoration) for Slovak text, based on finding the best match among n-grams (an n-gram is a group of n words that usually occur together in the language). This program was created for a Bachelor's thesis at the Faculty of Management Science and Informatics, University of Žilina.
The program uses data from the Slovak National Corpus of the Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences. We used the language corpus prim-8.0-public-all, which consists of 1.5 billion tokens (namely the subcorpora of 4-grams, 3-grams, 2-grams and words). You can find them all here.

The algorithm reconstructs every word separately. It uses a trie data structure for fast access to the list of candidate n-grams for each word without diacritics. This list contains only n-grams that include the word. The list is grouped by n (from 4-grams down to 1-grams) and, within each group, sorted by absolute number of occurrences in the language. The n-grams are then compared one by one with the word and its surrounding words until a match is found, and the word is replaced with the diacritic form from the matching n-gram.
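The lookup described above can be illustrated with a small sketch. It is written in Python for brevity and is not the project's C# code. The `Trie` class, the function names, the lowercase whitespace tokenization and the sample counts are all assumptions made for illustration, and a word with no matching n-gram is simply left unchanged here.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks, e.g. 'súd' -> 'sud'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


class Trie:
    """Minimal character trie: stripped word -> list of candidate n-grams."""

    def __init__(self):
        self.children = {}
        self.candidates = []  # tuples of (ngram, index of the word in ngram, count)

    def node_for(self, key, create=False):
        node = self
        for ch in key:
            if ch not in node.children:
                if not create:
                    return None
                node.children[ch] = Trie()
            node = node.children[ch]
        return node


def build_index(ngram_counts):
    """Register every n-gram under the stripped form of each word it contains."""
    root = Trie()
    for ngram, count in ngram_counts.items():
        for i, word in enumerate(ngram):
            node = root.node_for(strip_diacritics(word), create=True)
            node.candidates.append((ngram, i, count))
    # Order each list: longest n-grams first, then most frequent first.
    stack = [root]
    while stack:
        node = stack.pop()
        node.candidates.sort(key=lambda c: (-len(c[0]), -c[2]))
        stack.extend(node.children.values())
    return root


def restore(text, index):
    """Restore diacritics word by word; input is assumed lowercase, no diacritics."""
    words = text.split()
    result = []
    for pos, word in enumerate(words):
        node = index.node_for(word)
        restored = word  # fallback when nothing matches
        for ngram, i, _count in (node.candidates if node else []):
            start = pos - i  # align the n-gram so that ngram[i] sits on this word
            if start < 0 or start + len(ngram) > len(words):
                continue
            if [strip_diacritics(w) for w in ngram] == words[start:start + len(ngram)]:
                restored = ngram[i]
                break
        result.append(restored)
    return " ".join(result)


# Hypothetical counts, not taken from the corpus.
SAMPLE_COUNTS = {
    ("sud",): 40,
    ("súd",): 90,
    ("najvyšší",): 50,
    ("drevený",): 20,
    ("najvyšší", "súd"): 30,
    ("drevený", "sud"): 5,
}
INDEX = build_index(SAMPLE_COUNTS)
```

With these sample counts, `restore("najvyssi sud", INDEX)` picks "súd" from the 2-gram context, while `restore("dreveny sud", INDEX)` keeps "sud" because the longer n-gram is tried before the more frequent single word.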
- Bachelor's thesis in Slovak language: Automatická rekonštrukcia diakritiky pre slovenčinu or in this repo here
- Conference Paper in English: Automatic restoration of diacritics based on word n-grams for Slovak texts or on IEEE Xplore
- Article in English: Diacritics restoration based on word n-grams for Slovak texts or on De Gruyter
- C#
- ASP.NET Core
- PBCD.DataStructures.Trie
There are two final versions of the program. The first, faster one (0.4 ms per word) uses RAM only and has a success rate of 98.07%. The second, slower one (4 ms per word) uses the hard disk and has a success rate of 98.17%.

Here you will find:
- DLL ready to use
- Simple web-site for easy, user-friendly interacting with the program
https://www.dropbox.com/s/7uraxif4ocfay8k/diacritics-reconstructor-necessary-files.zip?dl=0