You'll need a static web server to see anything useful, so try something like mongoose or whatever floats your boat. It's a GitHub Pages site anyway, so you can have a look that way.
Have a look at *A Mathematical Theory of Communication* by C. E. Shannon, published in 1948. It's a good read and should tell you everything you need to know.
Ok, so we use a Markov chain to guess the next letter based on the previous *n* letters. Our guesses are based on what we've seen before in some text that we've previously analysed. How big *n* is determines the accuracy. From the paper:
- Zero-order approximation (symbols independent and equiprobable). XFOML RXKHRJFFJUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD.
- First-order approximation (symbols independent but with frequencies of English text). OCRO HLI RGWR NMIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH BRL.
- Second-order approximation (digram structure as in English). ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE.
- Third-order approximation (trigram structure as in English). IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTAGIN IS REGOACTIONA OF CRE.
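The idea above can be sketched in a few lines of JavaScript. This is a minimal illustration, not the app's actual code: `buildTable` and `guessNext` are hypothetical names, and the table just counts which letter followed each *n*-letter prefix in the training text.

```javascript
// Build an order-n table: for each n-letter prefix, count the letters
// that followed it in the sample text.
function buildTable(text, n) {
  const table = {};
  for (let i = 0; i + n < text.length; i++) {
    const prefix = text.slice(i, i + n);
    const next = text[i + n];
    if (!table[prefix]) table[prefix] = {};
    table[prefix][next] = (table[prefix][next] || 0) + 1;
  }
  return table;
}

// Pick the next letter at random, weighted by how often each letter
// followed the given prefix.
function guessNext(table, prefix) {
  const counts = table[prefix];
  if (!counts) return null;
  const total = Object.values(counts).reduce((a, b) => a + b, 0);
  let r = Math.random() * total;
  for (const [letter, count] of Object.entries(counts)) {
    if ((r -= count) < 0) return letter;
  }
}
```

With n = 2 this gives you Shannon's "digram structure" approximation, n = 3 the trigram one, and so on.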
Not as easy as that, sadly. Recording all the probabilities given the previous *n* letters doesn't scale well. Say we have 26 letters, and we need to record every possible path for a 10-letter string of characters. That would be 26^10 = 141,167,095,653,376 steps in the paths, and a byte for each letter would total just over 1 petabyte.
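The back-of-the-envelope maths above can be checked in a couple of lines:

```javascript
// 26 letters, every possible path of length 10.
const paths = 26 ** 10;   // 141167095653376 possible 10-letter prefixes
const bytes = paths * 10; // one byte per letter in each path
console.log(paths);
console.log(bytes / 1e15); // ≈ 1.41, i.e. just over a petabyte
```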
Here *n* is 5, but we only record the top 10 most probable letters for each path. This halves the number of steps and takes the craziness out of the writer (no more weird words ending in 'q'), but does lead to a bit more repetition. This gives us JSON data files of around 2 megabytes each.
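The pruning step might look something like this, a hypothetical sketch assuming the counts-per-prefix table from before: sort the letters seen after a prefix by frequency and keep only the top few.

```javascript
// Keep only the `keep` most probable next letters for one prefix,
// dropping the long tail of rare (and weird) continuations.
function prune(counts, keep = 10) {
  const top = Object.entries(counts)
    .sort((a, b) => b[1] - a[1]) // most frequent first
    .slice(0, keep);
  return Object.fromEntries(top);
}
```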
Of course! And if you find something cool or make the app a little nicer, send a pull request. I'll probably accept it right away. :)