Replaces word list with a new, slightly longer list #17

Merged 2 commits on May 10, 2023

@sts10
Contributor

sts10 commented May 5, 2023

Thought I'd take a shot at replacing the word list with a new set of words. The current list has 1,633 words. My new list has a few more: 1,700 words.

Word lists are obviously subjective, but I did notice that the existing word list has some strange words on it, like names ("margo", "joshua", "freddie", "othello").

Some words this PR adds:

usual call nature girls free unable
tests occur leads advance review paid
shortly real gained slowly heat divine
purpose rules towards female assets engines
truth better failed science nature army

Some it removes:

erosion aspirin marion kansas bison philips
elastic cubic beatles sharon graph nato
harvest mike pupil phoenix russian lima
helium topic cool charm ramirez alert
zebra caramel logo austin lotus mystic

Prefix words and unique decodability

I noticed that the current list is free of prefix words and thus uniquely decodable. Seeing as Buttercup puts a hyphen between each word, this is unnecessary. A word list with prefix words included would be able to include shorter and more common words and fewer "rare" words.

That said, my new list was made uniquely decodable via a process I created.
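For reference, a "prefix word" is a list entry that some other entry starts with. A minimal sketch of a check for them follows, assuming the list is available as a plain Python list of strings. The function name `find_prefix_words` is hypothetical, and this is not the tooling used for this PR.

```python
def find_prefix_words(words):
    """Return the words that are a proper prefix of another word in the list."""
    sorted_words = sorted(set(words))
    prefix_words = []
    # After sorting, any word that is a prefix of another word is
    # immediately followed by a word that starts with it.
    for current, following in zip(sorted_words, sorted_words[1:]):
        if following.startswith(current):
            prefix_words.append(current)
    return prefix_words


print(find_prefix_words(["add", "adds", "cat"]))
print(find_prefix_words(["cat", "dog"]))
```

An empty result means the list is free of prefix words, which is a sufficient (but not necessary) condition for unique decodability.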

Comparing current list to new, proposed list

The current list's mean word length is 5.75 characters. Its shortest word is 3 characters and its longest is 7. Each word from the list gives 10.67 bits of entropy.

The new proposed list's mean word length is 5.50 characters. Its shortest word is 3 characters and its longest is 7. Each word from the list gives 10.73 bits of entropy.
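The per-word entropy figures above follow from log2 of the list length, assuming each word is drawn uniformly at random. A small sketch (the function name `entropy_per_word` is made up for illustration):

```python
import math


def entropy_per_word(list_length: int) -> float:
    """Bits of entropy per word when words are picked uniformly at random."""
    return math.log2(list_length)


for length in (1633, 1700):
    print(f"{length} words: {entropy_per_word(length):.2f} bits per word")
```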

Where the words on the new list came from

The words contained in this word list were taken from two sources: Google Books Ngram data (2012 data) and Wikipedia, via a Wikipedia word frequency project, taken on April 13, 2023.

I'd be happy to suggest/provide a longer list if we want a bit more entropy per word.

@sts10
Contributor Author

sts10 commented May 5, 2023

What lists other password managers use

So we might conclude that a list of 1,633 -- and even 1,700 -- words for passphrase generation is a bit short compared to the competition.

As I linked to above, I'm working on a few word lists of varying lengths that I'd pitch if we wanted a 7,776-word list or even a 17,576-word list.

@perry-mitchell
Member

@sts10 I'm honestly blown away by the depth to which you've gone in just writing this issue up.. Massive cheers for that.

I'm going to merge this and release it shortly, as I can see the improvement, but to elaborate on your comparisons: It seems like we're trailing behind here in terms of word count. Would you recommend that we go to something like the EFF long list? Or longer? What would be your reasoning behind hitting 17K vs just 7K words? I understand the entropy will be that bit higher but would it be worth the huge increase in file size? I somehow doubt it..

Regardless, I'd probably want a larger list to be loaded in asynchronously and via a separate file, meaning that I'd probably want to release them in separate bundles. If you had such a list handy, you might kindly consider adding it as words2.json so I might go about authoring a new release type to handle improved delivery of the larger list.

Anywho, thanks again for this!

@perry-mitchell perry-mitchell merged commit 3da5699 into buttercup:master May 10, 2023
@sts10 sts10 deleted the new_short_word_list branch May 10, 2023 19:01
@sts10
Contributor Author

sts10 commented May 10, 2023

> Would you recommend that we go to something like the EFF long list? Or longer? What would be your reasoning behind hitting 17K vs just 7K words? I understand the entropy will be that bit higher but would it be worth the huge increase in file size? I somehow doubt it..

I'm not familiar with Buttercup's internals, but I doubt that loading up an extra 10k words would cause a measurable issue? The way I think of it, the downside to using a larger list is that it usually introduces less common words into the passphrases (while, of course, the upside is that it creates stronger passphrases, at least theoretically).

For a related project, @atoponce calculated the practical entropy differences between passphrases from a 4,000-word list and an 8,000-word list. I've added two columns below: one for a 7,776-word list and another for a 17,576-word list.

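| Min entropy | 4,000 words | 7,776 words | 8,000 words | 17,576 words |
|-------------|-------------|-------------|-------------|--------------|
| 55 bits     | 5 words     | 5 words     | 5 words     | 4 words      |
| 60 bits     | 6 words     | 5 words     | 5 words     | 5 words      |
| 65 bits     | 6 words     | 6 words     | 6 words     | 5 words      |
| 70 bits     | 6 words     | 6 words     | 6 words     | 5 words      |
| 75 bits     | 7 words     | 6 words     | 6 words     | 6 words      |
| 80 bits     | 7 words     | 7 words     | 7 words     | 6 words      |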

As you can see, going beyond 4,000 words to 8k or 17k isn't a silver bullet, but a 6-word passphrase from a 17k list is a nice 84.6 bits, compared to just 71.8 bits if from a 4k list.
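The word counts in the table can be reproduced by dividing the target entropy by log2 of the list length and rounding up. A rough sketch, assuming uniformly random word selection (the name `words_needed` is hypothetical):

```python
import math


def words_needed(min_entropy_bits: float, list_length: int) -> int:
    """Smallest number of words whose combined entropy meets the target."""
    return math.ceil(min_entropy_bits / math.log2(list_length))


for target in (55, 60, 65, 70, 75, 80):
    row = [words_needed(target, n) for n in (4000, 7776, 8000, 17576)]
    print(f"{target} bits: {row}")

# Entropy of a 6-word passphrase from each list size.
for n in (4000, 17576):
    print(f"6 words from {n}: {6 * math.log2(n):.1f} bits")
```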

> What would be your reasoning behind hitting 17K vs just 7K words?

As for why I settled on 17,576 for my Orchard Street Long List: it was partially in competition with 1Password's 18k list, and partially because 26^3 fit nicely into another issue I was worried about, which I call the brute force line.

All this said, due to the mechanics of diceware, 7,776 (6^5) has become somewhat of a "standard" list length I'd say, with the 7,776-word EFF long list being a popular choice (as mentioned above, both KeePassXC and BitWarden use it). I'd say you can't go wrong defaulting to EFF long list for Buttercup (though you might consider the slightly modified version that KeePassXC uses).
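To illustrate where 7,776 comes from: five six-sided dice give 6^5 = 7,776 equally likely outcomes, so each sequence of five rolls can select exactly one word. A minimal sketch of that mapping, treating the rolls as a base-6 number (the function name `dice_to_index` is hypothetical and the index ordering is an assumption, not taken from any particular published list):

```python
def dice_to_index(rolls):
    """Map five dice rolls (each 1-6) to a zero-based index in 0..7775."""
    if len(rolls) != 5 or any(roll not in range(1, 7) for roll in rolls):
        raise ValueError("expected five rolls, each from 1 to 6")
    index = 0
    for roll in rolls:
        # Treat each roll as one base-6 digit.
        index = index * 6 + (roll - 1)
    return index


print(6 ** 5)
print(dice_to_index([1, 1, 1, 1, 1]), dice_to_index([6, 6, 6, 6, 6]))
```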

The EFF long list, as explained in this blog post, has a nice property: "We also ensured that no word is an exact prefix of any other word." This means the list is uniquely decodable, which means that words on the list can be safely combined without a punctuation delimiter, e.g. appendixextraditedreamlessconnectorhumiliatewilt.
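As an illustration of why a prefix-free list can be decoded without delimiters: reading left to right, the first list word that matches is the only one that can match. The sketch below uses only the six words from the example passphrase as a tiny stand-in for the full list; the function name `decode_prefix_free` is hypothetical.

```python
def decode_prefix_free(passphrase, words):
    """Split a delimiter-free passphrase into words from a prefix-free list."""
    word_set = set(words)
    max_len = max(len(word) for word in word_set)
    decoded = []
    start = 0
    while start < len(passphrase):
        limit = min(start + max_len, len(passphrase))
        for end in range(start + 1, limit + 1):
            candidate = passphrase[start:end]
            if candidate in word_set:
                # In a prefix-free list no longer word can also match here.
                decoded.append(candidate)
                start = end
                break
        else:
            raise ValueError(f"no list word found at position {start}")
    return decoded


sample_words = ["appendix", "extradite", "dreamless", "connector", "humiliate", "wilt"]
print(decode_prefix_free("appendixextraditedreamlessconnectorhumiliatewilt", sample_words))
```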

Adhering to this standard, I created my own 7,776-word list called the Orchard Street Medium list, which is free to use under the CC 3.0 BY-SA license. Like the EFF long list, it is uniquely decodable. However, rather than removing all prefix words, I employed a technique based on the Sardinas-Patterson algorithm, which I argue generally preserves more words when cutting a non-uniquely decodable list down into a uniquely decodable one. As mentioned above, this is the process I used on the 1,700-word list in this PR that we just merged into Buttercup.
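The comment does not show the implementation, so the following is only a rough sketch of the textbook Sardinas-Patterson test, not the author's code. It checks whether a list is uniquely decodable; it does not perform the word-removal step described above. The helper names are hypothetical, and the pairwise comparison is quadratic, so it is meant for illustration rather than speed.

```python
def dangling_suffixes(a_set, b_set):
    """Non-empty strings w such that a + w == b for some a in a_set, b in b_set."""
    return {
        b[len(a):]
        for a in a_set
        for b in b_set
        if len(b) > len(a) and b.startswith(a)
    }


def is_uniquely_decodable(words):
    """Sardinas-Patterson test: True if no dangling suffix is itself a word."""
    code = set(words)
    seen = set()
    frontier = dangling_suffixes(code, code)
    while frontier:
        if frontier & code:
            return False
        seen |= frontier
        next_suffixes = dangling_suffixes(code, frontier) | dangling_suffixes(frontier, code)
        # Only suffixes not examined before can lead to anything new.
        frontier = next_suffixes - seen
    return True


print(is_uniquely_decodable(["add", "adds", "cat"]))
print(is_uniquely_decodable(["news", "paper", "newspaper"]))
```

The first list contains a prefix word ("add") yet is still uniquely decodable, which is the kind of word that a remove-all-prefix-words approach would discard unnecessarily. The second list is ambiguous because "newspaper" can also be read as "news" followed by "paper".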

As a bit of a disclaimer: Having worked out the information theory and written the code myself, I will say that I hope for more eyeballs to check my work soon to ensure that the lists are definitely uniquely decodable.

I hope this helps!

@atoponce

On a bit of a tangent, Arnold Reinhold released an 8k word list specifically for software. Because RNGs typically have state sizes that fall on boundaries of powers of 2 (32-bits, 64-bits, etc.), this means that you don't need to test for random output and discard it if it falls outside of a uniform range of a multiple of the number of words in your word list. 7,776 words is specific to five 6-sided dice, but 8,192 words should be chosen if you're deploying a Diceware software application, as it simplifies the code and reduces the risk of bias bugs. It's unfortunate the EFF didn't also supply 8k word lists. Shrug.
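A hedged sketch of the difference being described, written in Python rather than Buttercup's own JavaScript: with 8,192 = 2^13 words, 13 random bits map directly onto an index, whereas a 7,776-word list needs out-of-range draws to be discarded to stay unbiased. The function names are hypothetical; `secrets.randbits` is from Python's standard library.

```python
import secrets


def pick_index_power_of_two(bits: int = 13) -> int:
    """Every 13-bit value is a valid index into an 8,192-word list."""
    return secrets.randbits(bits)


def pick_index_with_rejection(list_length: int) -> int:
    """Draw just enough bits to cover the list and discard out-of-range values."""
    bits = (list_length - 1).bit_length()
    while True:
        candidate = secrets.randbits(bits)
        if candidate < list_length:
            return candidate


print(pick_index_power_of_two())
print(pick_index_with_rejection(7776))
```

Taking the random value modulo the list length instead of rejecting would skew the distribution slightly toward low indices, which is the kind of bias bug the power-of-two list length avoids by construction.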

@sts10
Contributor Author

sts10 commented May 10, 2023

> Because RNGs typically have state sizes that fall on boundaries of powers of 2 (32-bits, 64-bits, etc.), this means that you don't need to test for random output and discard it if it falls outside of a uniform range of a multiple of the number of words in your word list.... 8,192 words should be chosen if you're deploying a Diceware software application,

Dang, I hadn't thought about this before! But I understand the logic.

I've whipped up an 8,192-word list for us to look at/consider. 13 bits per word is indeed a nice round number...

Attributes:

List length               : 8192 words
Mean word length          : 7.07 characters
Length of shortest word   : 3 characters (add)
Length of longest word    : 10 characters (worthwhile)
Free of prefix words?     : false
Free of suffix words?     : false
Uniquely decodable?       : true
Entropy per word          : 13.000 bits
Efficiency per character  : 1.838 bits
Assumed entropy per char  : 4.333 bits
Above brute force line?   : true
Shortest edit distance    : 1
Mean edit distance        : 6.969
Longest shared prefix     : 9
Unique character prefix   : 10

Word samples
------------
display informs embassy secretion sought recorded
membranes realistic fun softly introduced thesis
digging things modes hierarchy magnesium minute
hired later finally beats finds republican
disruption majesty eternity elephant lake retention
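Several of the attributes above can be derived from just three numbers. The sketch below is one reading of the output, not the tool's actual code: it assumes "efficiency per character" is entropy per word divided by mean word length, "assumed entropy per char" is entropy per word divided by the shortest word's length, and that a list is above the brute force line when that assumed value does not exceed log2(26), the entropy of one random lowercase letter. Under that reading, 26^3 words with a 3-character shortest word would sit right at the line.

```python
import math


def list_attributes(list_length, mean_word_length, shortest_word_length):
    """Derive entropy-related attributes (definitions assumed, see text)."""
    entropy = math.log2(list_length)
    assumed = entropy / shortest_word_length
    return {
        "entropy_per_word": entropy,
        "efficiency_per_character": entropy / mean_word_length,
        "assumed_entropy_per_char": assumed,
        # Assumption: compare against one uniformly random letter a-z.
        "above_brute_force_line": assumed <= math.log2(26),
    }


attrs = list_attributes(list_length=8192, mean_word_length=7.07, shortest_word_length=3)
for name, value in attrs.items():
    print(name, value)
```

The mean word length passed in is the rounded figure from the output above, so the efficiency value may differ from the reported one in the last digit.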

Happy to create a fresh PR with this list as words2.json if you like.

@sts10
Contributor Author

sts10 commented May 12, 2023

Ah, I think the 1,633-word list Buttercup used before this PR is the Mnemonicode word list. I believe that list is optimized for distinct sounding words, which would explain the inclusion of words like "margo", "joshua", "freddie", "othello".

But I think I still stand by my criticism that while those words may be easy to say and hear, they're not the best for using to create a little story in your head, as advised by the classic xkcd cartoon (I think of this subjective metric as "storyability").
