Replaces word list with a new, slightly longer list #17
…hink are more common.
What lists other password managers use
So we might conclude that a list of 1,633 words, or even 1,700, is a bit short for passphrase generation compared to the competition. As I linked to above, I'm working on a few word lists of varying lengths that I could pitch if we wanted a 7,776-word list or even a 17,576-word list.
@sts10 I'm honestly blown away by the depth to which you've gone in just writing this issue up. Massive cheers for that. I'm going to merge this and release it shortly, as I can see the improvement, but to elaborate on your comparisons: it seems like we're trailing behind here in terms of word count. Would you recommend that we go to something like the EFF long list? Or longer? What would be your reasoning behind hitting 17K vs just 7K words? I understand the entropy will be that bit higher, but would it be worth the huge increase in file size? I somehow doubt it. Regardless, I'd probably want a larger list to be loaded in asynchronously and via a separate file, meaning that I'd probably want to release them in separate bundles. If you had such a list handy, you might kindly consider adding it as words2.json so I can go about authoring a new release type to handle delivery of the larger list. Anywho, thanks again for this!
I'm not familiar with Buttercup's internals, but I doubt that loading up an extra 10k words would cause a measurable issue? The way I think of it, the downside to using a larger list is that it usually introduces less common words into the passphrases (while, of course, the upside is that it creates stronger passphrases, at least theoretically). For a related project, @atoponce calculated the practical entropy differences between passphrases from a 4,000-word list and an 8,000-word list. I've added two columns below: one for a 7,776-word list and another for a 17,576-word list.
As you can see, going beyond 4,000 words to 8k or 17k isn't a silver bullet, but a 6-word passphrase from a 17k list is a nice 84.6 bits, compared to just 71.8 bits if from a 4k list.
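For anyone who wants to reproduce those figures, here is a minimal sketch. It assumes each word is drawn uniformly and independently from the list, so the numbers are theoretical maximums; the function name is just for illustration and is not part of Buttercup.

```python
import math


def passphrase_entropy_bits(list_size: int, num_words: int) -> float:
    """Entropy in bits of num_words words drawn uniformly and independently
    from a list of list_size words (an idealized assumption)."""
    return num_words * math.log2(list_size)


# Compare the list lengths discussed in this thread.
for size in (4000, 7776, 8000, 17576):
    row = ", ".join(
        f"{n} words: {passphrase_entropy_bits(size, n):.1f} bits" for n in (4, 5, 6)
    )
    print(f"{size:>6}-word list -> {row}")
```

Each extra word adds log2(list_size) bits, so adding a word to the passphrase buys far more than doubling the list does (doubling adds only one bit per word).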
As for why I settled on 17,576 for my Orchard Street Long List: it was partially in competition with 1Password's 18k list, and partially because 26^3 fit nicely into another issue I was worried about, which I call the brute force line.
All this said, due to the mechanics of diceware, 7,776 (6^5) has become somewhat of a "standard" list length, I'd say, with the 7,776-word EFF long list being a popular choice (as mentioned above, both KeePassXC and BitWarden use it). I'd say you can't go wrong defaulting to the EFF long list for Buttercup (though you might consider the slightly modified version that KeePassXC uses).
The EFF long list, as explained in this blog post, has a nice property: "We also ensured that no word is an exact prefix of any other word." This means the list is uniquely decodable, which means that words on the list can be safely combined without a punctuation delimiter.
Adhering to this standard, I created my own 7,776-word list called the Orchard Street Medium list, which is free to use under the CC 3.0 BY-SA license. Like the EFF long list, it is uniquely decodable. However, rather than removing all prefix words, I employed a technique based on the Sardinas-Patterson algorithm, which I argue generally preserves more words when cutting down a non-uniquely decodable list into a uniquely decodable one. As mentioned above, this is the process I used on the 1,700-word list in this PR that we just merged into Buttercup.
As a bit of a disclaimer: having worked out the information theory and written the code myself, I will say that I hope for more eyeballs to check my work soon to ensure that the lists are definitely uniquely decodable.
I hope this helps!
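For anyone who wants to double-check unique decodability independently, below is a rough sketch of a textbook Sardinas-Patterson test. It is not the code used to build the Orchard Street lists, the function names are made up for illustration, and the pairwise loops are naive, so a list of several thousand words may take a while.

```python
def dangling_suffixes(a, b):
    """Return every non-empty w such that x + w == y for some x in a, y in b."""
    out = set()
    for x in a:
        for y in b:
            if len(y) > len(x) and y.startswith(x):
                out.add(y[len(x):])
    return out


def is_uniquely_decodable(words):
    """Sardinas-Patterson test: True if no concatenation of words from the
    list can be split into list words in two different ways."""
    code = set(words)
    current = dangling_suffixes(code, code)  # suffixes left over by prefix words
    seen = set()
    while current:
        if current & code:
            # A dangling suffix is itself a word, so some string parses two ways.
            return False
        seen |= current
        nxt = dangling_suffixes(code, current) | dangling_suffixes(current, code)
        current = nxt - seen  # only follow suffixes not examined before
    return True
```

A prefix-free list passes trivially because there are no dangling suffixes to start from. A list like `["a", "ab", "abb"]` also passes even though it contains prefix words, which is the kind of case where this test keeps more words than removing every prefix word would.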
On a bit of a tangent, Arnold Reinhold released an 8k word list specifically for software. Because RNGs typically produce output whose size falls on a power-of-2 boundary (32 bits, 64 bits, etc.), a list whose length is a power of 2 means you don't need to test each random output and discard it when it falls outside a range that maps uniformly onto your word list. 7,776 words is specific to five 6-sided dice, but 8,192 words should be chosen if you're deploying a Diceware software application, as it simplifies the code and reduces the risk of bias bugs. It's unfortunate the EFF didn't also supply 8k word lists. Shrug.
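To make the difference concrete, here is a small sketch using Python's standard `secrets` module. The function names are hypothetical, and this is not how Buttercup picks words; it only illustrates the two approaches.

```python
import secrets


def pick_index_rejection(list_size: int, bits: int = 13) -> int:
    """Uniform index for a list whose length is NOT a power of two.

    Draws `bits` random bits and retries whenever the value falls outside
    the list. Skipping the retry (e.g. using `value % list_size`) would
    make some words slightly more likely than others.
    """
    while True:
        candidate = secrets.randbits(bits)
        if candidate < list_size:
            return candidate


def pick_index_power_of_two(bits: int = 13) -> int:
    """Uniform index for a list of exactly 2**bits words: every draw is usable."""
    return secrets.randbits(bits)


print(pick_index_rejection(7776))   # some index in 0..7775
print(pick_index_power_of_two(13))  # some index in 0..8191
```

With 8,192 words the selection is a single call with no loop and no modulo, which is the simplification being described.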
Dang, I hadn't thought about this before! But I understand the logic. I've whipped up an 8,192-word list for us to look at/consider. 13 bits per word is indeed a nice round number... Attributes:
Happy to create a fresh PR with this list as |
Ah, I think the 1,633-word list Buttercup used before this PR is the Mnemonicode word list. I believe that list is optimized for distinct sounding words, which would explain the inclusion of words like "margo", "joshua", "freddie", "othello". But I think I still stand by my criticism that while those words may be easy to say and hear, they're not the best for using to create a little story in your head, as advised by the classic xkcd cartoon (I think of this subjective metric as "storyability").
Thought I'd take a shot at replacing the word list with a new set of words. The current list has 1,633 words. My new list has a few more: 1,700 words.
Word lists are obviously subjective, but I did notice that the existing word list has some strange words on it, like names ("margo", "joshua", "freddie", "othello").
Some words this PR adds:
Some it removes:
Prefix words and unique decodability
I noticed that the current list is free of prefix words and thus uniquely decodable. Seeing as Buttercup puts a hyphen between each word, this is unnecessary. A word list with prefix words included would be able to include shorter and more common words and fewer "rare" words.
That said, my new list was made uniquely decodable via a process I created.
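As an illustration of what "prefix words" means here, the sketch below lists every pair where one word is an exact prefix of another. The helper name and the tiny sample list are made up for the example; they are not taken from either word list.

```python
def find_prefix_words(words):
    """Return (prefix, longer_word) pairs where prefix is an exact prefix
    of longer_word. Relies on sorting: all words that start with w sit
    directly after w in lexicographic order."""
    ordered = sorted(set(words))
    pairs = []
    for i, w in enumerate(ordered):
        for other in ordered[i + 1:]:
            if not other.startswith(w):
                break  # past the block of words sharing this prefix
            pairs.append((w, other))
    return pairs


sample = ["car", "carpet", "cat", "pet"]
print(find_prefix_words(sample))  # [('car', 'carpet')]

# Without a delimiter the two passphrases below collide; with a hyphen they do not.
print("car" + "pet" == "carpet")          # True
print("-".join(["car", "pet"]) == "carpet")  # False
```

An empty result means the list is prefix-free, which is sufficient (though not necessary) for unique decodability.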
Comparing current list to new, proposed list
The current list's mean word length is 5.75 characters. Its shortest word is 3 characters and its longest is 7. Each word from the list gives 10.67 bits of entropy.
The new proposed list's mean word length is 5.50 characters. Its shortest word is 3 characters and its longest is 7. Each word from the list gives 10.73 bits of entropy.
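The per-list figures above can be recomputed with something like the following sketch. It assumes the list is available as a plain Python list of strings; the three-word sample is only a stand-in for the real list.

```python
import math
from statistics import mean


def list_stats(words):
    """Summary statistics of the kind quoted above for a word list."""
    lengths = [len(w) for w in words]
    return {
        "count": len(words),
        "mean_length": mean(lengths),
        "shortest": min(lengths),
        "longest": max(lengths),
        "bits_per_word": math.log2(len(words)),
    }


print(list_stats(["cat", "horse", "battery"]))

# Bits per word depend only on the list length:
print(round(math.log2(1633), 2))  # 10.67
print(round(math.log2(1700), 2))  # 10.73
```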
Where the words on new list came from
The words contained in this word list were taken from two sources: Google Books Ngram data (2012 data) and Wikipedia, via a Wikipedia word frequency project, taken on April 13, 2023.
I'd be happy to suggest/provide a longer list if we want a bit more entropy per word.