Replaces word list with a new, slightly longer list #17

sts10 · 2023-05-05T13:49:19Z

Thought I'd take a shot at replacing the word list with a new set of words. The current list has 1,633 words. My new list has a few more: 1,700 words.

Word lists are obviously subjective, but I did notice that the existing word list has some strange words on it, like names ("margo", "joshua", "freddie", "othello").

Some words this PR adds:

usual call nature girls free unable
tests occur leads advance review paid
shortly real gained slowly heat divine
purpose rules towards female assets engines
truth better failed science nature army

Some it removes:

erosion aspirin marion kansas bison philips
elastic cubic beatles sharon graph nato
harvest mike pupil phoenix russian lima
helium topic cool charm ramirez alert
zebra caramel logo austin lotus mystic

Prefix words and unique decodability

I noticed that the current list is free of prefix words and thus uniquely decodable. Since Buttercup puts a hyphen between each pair of words, unique decodability isn't actually necessary here. A word list that allowed prefix words could include more short, common words and fewer "rare" ones.

That said, my new list was made uniquely decodable via a process I created.
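
For anyone who wants to check the "free of prefix words" property on a candidate list, here is a minimal Python sketch. The helper name `find_prefix_words` and the sample words are made up for illustration; this is not Buttercup's code and not the tooling used to build the list in this PR.

```python
def find_prefix_words(words):
    """Return (prefix, longer_word) pairs where one word starts another."""
    ordered = sorted(set(words))
    pairs = []
    for i, word in enumerate(ordered):
        # In sorted order, every word that starts with `word` sits in one
        # contiguous run directly after it, so we can stop at the first miss.
        for other in ordered[i + 1:]:
            if not other.startswith(word):
                break
            pairs.append((word, other))
    return pairs


# Hypothetical sample: "free" and "real" are prefix words here.
sample = ["free", "freedom", "heat", "real", "really"]
print(find_prefix_words(sample))
# [('free', 'freedom'), ('real', 'really')]
```

An empty result means the list is free of prefix words, which is a sufficient (but not necessary) condition for unique decodability.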

Comparing current list to new, proposed list

The current list's mean word length is 5.75 characters. Its shortest word is 3 characters and its longest is 7. Each word from the list gives 10.67 bits of entropy.

The new proposed list's mean word length is 5.50 characters. Its shortest word is 3 characters and its longest is 7. Each word from the list gives 10.73 bits of entropy.
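
The entropy figures above are just the base-2 logarithm of the list length, assuming each word is picked uniformly at random. A quick sketch (the function name `bits_per_word` is illustrative):

```python
import math


def bits_per_word(list_length):
    """Entropy of one uniformly chosen word, in bits."""
    return math.log2(list_length)


for n in (1633, 1700):
    print(f"{n} words: {bits_per_word(n):.2f} bits per word")
# 1633 words: 10.67 bits per word
# 1700 words: 10.73 bits per word
```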

Where the words on new list came from

The words contained in this word list were taken from two sources: Google Books Ngram data (2012 data) and Wikipedia, via a Wikipedia word frequency project, taken on April 13, 2023.

I'd be happy to suggest/provide a longer list if we want a bit more entropy per word.

sts10 · 2023-05-05T18:19:11Z

What lists other password managers use

Bitwarden uses the EFF long list (source), a classic choice. It's 7,776 words long and free of prefix words, and thus uniquely decodable. You can read more about the EFF word lists here.
KeePassXC uses the EFF long list with some minor modifications. It's still 7,776 words long.
1Password has posted a few variations of their word list over the years, but they're all about 18,200 words. Here's one version that's public, and here's another. I personally don't think it's a great list, for reasons I outline here in my pitch to replace it.
Dashlane apparently doesn't allow users to generate passphrases at all (reasons).
Enpass claims they use a 14,400-word list that doesn't appear to be public.
NSA's RandPassGenerator uses a massive 117,828-word list.
I'll also mention the BIP39 English word list, which is 2,048 words. I don't love it because it has the British spelling of "artefact" on it, and it is not uniquely decodable.

So we might conclude that a list of 1,633 -- and even 1,700 -- words for passphrase generation is a bit short compared to the competition.

As I linked to above, I'm working on a few word lists of varying lengths that I'd pitch if we wanted a 7,776-word list or even a 17,576-word list.

perry-mitchell · 2023-05-10T18:23:37Z

@sts10 I'm honestly blown away by the depth to which you've gone in just writing this issue up.. Massive cheers for that.

I'm going to merge this and release it shortly, as I can see the improvement, but to elaborate on your comparisons: It seems like we're trailing behind here in terms of word count. Would you recommend that we go to something like the EFF long list? Or longer? What would be your reasoning behind hitting 17K vs just 7K words? I understand the entropy will be that bit higher but would it be worth the huge increase in file size? I somehow doubt it..

Regardless, I'd probably want a larger list to be loaded in asynchronously and via a separate file, meaning that I'd probably want to release them in separate bundles. If you had such a list handy, you might kindly consider adding it as words2.json so I might go about authoring a new release type to handle improved delivery of the larger list.

Anywho, thanks again for this!

sts10 · 2023-05-10T19:50:09Z

> Would you recommend that we go to something like the EFF long list? Or longer? What would be your reasoning behind hitting 17K vs just 7K words? I understand the entropy will be that bit higher but would it be worth the huge increase in file size? I somehow doubt it..

I'm not familiar with Buttercup's internals, but I doubt that loading up an extra 10k words would cause a measurable issue? The way I think of it, the downside to using a larger list is that it usually introduces less common words into the passphrases (while, of course, the upside is that it creates stronger passphrases, at least theoretically).

For a related project, @atoponce calculated the practical entropy differences between passphrases from a 4,000-word list and an 8,000-word list. I've added two columns below: one for a 7,776-word list and another for a 17,576-word list.

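| Min entropy | 4,000 words | 7,776 words | 8,000 words | 17,576 words |
|-------------|-------------|-------------|-------------|--------------|
| 55 bits     | 5 words     | 5 words     | 5 words     | 4 words      |
| 60 bits     | 6 words     | 5 words     | 5 words     | 5 words      |
| 65 bits     | 6 words     | 6 words     | 6 words     | 5 words      |
| 70 bits     | 6 words     | 6 words     | 6 words     | 5 words      |
| 75 bits     | 7 words     | 6 words     | 6 words     | 6 words      |
| 80 bits     | 7 words     | 7 words     | 7 words     | 6 words      |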

As you can see, going beyond 4,000 words to 8k or 17k isn't a silver bullet, but a 6-word passphrase from a 17k list is a nice 84.6 bits, compared to just 71.8 bits if from a 4k list.
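
The word counts in the table can be recomputed by dividing the target entropy by the bits per word and rounding up. A short sketch, assuming uniformly random word selection (the name `words_needed` is illustrative):

```python
import math


def words_needed(min_bits, list_length):
    """Smallest number of words whose combined entropy reaches min_bits."""
    return math.ceil(min_bits / math.log2(list_length))


list_lengths = (4000, 7776, 8000, 17576)
for target in (55, 60, 65, 70, 75, 80):
    print(target, [words_needed(target, n) for n in list_lengths])

# Entropy of a 6-word passphrase from the largest and smallest lists:
print(round(6 * math.log2(17576), 1))  # 84.6
print(round(6 * math.log2(4000), 1))   # 71.8
```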

> What would be your reasoning behind hitting 17K vs just 7K words?

As for why I settled on 17,576 for my Orchard Street Long List: it was partially in competition with 1Password's 18k list, and partially because 26³ fit nicely into another issue I was worried about, which I call the brute force line.

All this said, due to the mechanics of diceware, 7,776 (6⁵) has become somewhat of a "standard" list length, I'd say, with the 7,776-word EFF long list being a popular choice (as mentioned above, both KeePassXC and Bitwarden use it). I'd say you can't go wrong defaulting to the EFF long list for Buttercup (though you might consider the slightly modified version that KeePassXC uses).

The EFF long list, as explained in this blog post, has a nice property: "We also ensured that no word is an exact prefix of any other word." This means the list is uniquely decodable, which means that words on the list can be safely combined without a punctuation delimiter, e.g. appendixextraditedreamlessconnectorhumiliatewilt.
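
To illustrate why a list with no prefix words can be split back into words without delimiters, here is a small Python sketch. The six-word list is only a stand-in built from the example above, not the real EFF list, and `decode_prefix_free` is a hypothetical helper name.

```python
def decode_prefix_free(passphrase, words):
    """Split a delimiter-free passphrase into words from a prefix-free list.

    Because no word is a prefix of another, at most one word can match at
    any position, so the first match found is the only possible one.
    """
    word_set = set(words)
    max_len = max(len(w) for w in word_set)
    decoded = []
    pos = 0
    while pos < len(passphrase):
        for end in range(pos + 1, min(pos + max_len, len(passphrase)) + 1):
            if passphrase[pos:end] in word_set:
                decoded.append(passphrase[pos:end])
                pos = end
                break
        else:
            raise ValueError(f"no word matches at position {pos}")
    return decoded


# Stand-in list: just the six words from the example passphrase.
tiny_list = ["appendix", "extradite", "dreamless", "connector", "humiliate", "wilt"]
print(decode_prefix_free("appendixextraditedreamlessconnectorhumiliatewilt", tiny_list))
# ['appendix', 'extradite', 'dreamless', 'connector', 'humiliate', 'wilt']
```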

Adhering to this standard, I created my own 7,776-word list called the Orchard Street Medium list, which is free to use under the CC 3.0 BY-SA license. Like the EFF long list, it is uniquely decodable. However, rather than removing all prefix words, I employed a technique based on the Sardinas-Patterson algorithm, which I argue generally preserves more words when cutting a non-uniquely decodable list down into a uniquely decodable one. As mentioned above, this is the process I used on the 1,700-word list in this PR that we just merged into Buttercup.
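
For reference, the textbook Sardinas-Patterson test (which only answers "is this list uniquely decodable?") can be sketched in a few lines of Python. This is a generic version of the test with made-up function names and sample words; it is not the word-removal procedure used to build the Orchard Street lists or the list in this PR.

```python
def dangling_suffixes(a_set, b_set):
    """Non-empty leftovers when a string in a_set is a prefix of one in b_set."""
    return {b[len(a):] for a in a_set for b in b_set if b != a and b.startswith(a)}


def is_uniquely_decodable(words):
    """Sardinas-Patterson test: False if any dangling suffix is itself a word."""
    code = set(words)
    pending = dangling_suffixes(code, code)  # leftovers from word/word overlaps
    seen = set()
    while pending:
        suffix = pending.pop()
        if suffix in code:
            return False  # two different word sequences spell the same string
        seen.add(suffix)
        one = {suffix}
        # Follow the leftover in both directions: words that start it,
        # and words that it starts.
        new = dangling_suffixes(code, one) | dangling_suffixes(one, code)
        pending |= new - seen
    return True


# "carpet" can also be read as "car" + "pet", so this list is ambiguous.
print(is_uniquely_decodable(["car", "carpet", "pet"]))    # False
# "free" is a prefix of "freedom", but "dom" leads nowhere, so this is fine.
print(is_uniquely_decodable(["free", "freedom", "heat"]))  # True
```

The second example shows the point made above: a list can contain prefix words and still be uniquely decodable, so removing every prefix word is stricter than necessary.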

As a bit of a disclaimer: Having worked out the information theory and written the code myself, I will say that I hope for more eyeballs to check my work soon to ensure that the lists are definitely uniquely decodable.

I hope this helps!

atoponce · 2023-05-10T20:42:00Z

On a bit of a tangent, Arnold Reinhold released an 8k word list specifically for software. Because RNGs typically have state sizes that fall on boundaries of powers of 2 (32-bits, 64-bits, etc.), this means that you don't need to test for random output and discard it if it falls outside of a uniform range of a multiple of the number of words in your word list. 7,776 words is specific to five 6-sided dice, but 8,192 words should be chosen if you're deploying a Diceware software application, as it simplifies the code and reduces the risk of bias bugs. It's unfortunate the EFF didn't also supply 8k word lists. Shrug.
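
A rough Python sketch of the difference, using the standard-library `secrets` module. The function names are illustrative, and the rejection loop is one simple variant of the discard-and-retry idea, not any particular project's implementation.

```python
import secrets


def pick_index_power_of_two(list_length):
    """Pick a uniform index when list_length is an exact power of two."""
    bits = list_length.bit_length() - 1
    if 1 << bits != list_length:
        raise ValueError("list length is not a power of two")
    # Every bit pattern maps to a valid index, so nothing is ever discarded.
    return secrets.randbits(bits)


def pick_index_rejection(list_length):
    """Pick a uniform index for any list length by discarding overshoots."""
    bits = (list_length - 1).bit_length()
    while True:
        candidate = secrets.randbits(bits)
        # Taking candidate % list_length instead would favor low indexes
        # (for 7,776 words and 13 bits, the first 416), so retry instead.
        if candidate < list_length:
            return candidate


print(pick_index_power_of_two(8192))  # one draw of 13 bits, no loop
print(pick_index_rejection(7776))     # may need more than one draw
```

In Python specifically, `secrets.randbelow(n)` already does unbiased selection for any `n`, so the hand-written loop is only there to show the extra logic that a power-of-two list length avoids.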

sts10 · 2023-05-10T21:16:23Z

> Because RNGs typically have state sizes that fall on boundaries of powers of 2 (32-bits, 64-bits, etc.), this means that you don't need to test for random output and discard it if it falls outside of a uniform range of a multiple of the number of words in your word list.... 8,192 words should be chosen if you're deploying a Diceware software application,

Dang, I hadn't thought about this before! But I understand the logic.

I've whipped up an 8,192-word list for us to look at/consider. 13 bits per word is indeed a nice round number...

Attributes:

List length               : 8192 words
Mean word length          : 7.07 characters
Length of shortest word   : 3 characters (add)
Length of longest word    : 10 characters (worthwhile)
Free of prefix words?     : false
Free of suffix words?     : false
Uniquely decodable?       : true
Entropy per word          : 13.000 bits
Efficiency per character  : 1.838 bits
Assumed entropy per char  : 4.333 bits
Above brute force line?   : true
Shortest edit distance    : 1
Mean edit distance        : 6.969
Longest shared prefix     : 9
Unique character prefix   : 10

Word samples
------------
display informs embassy secretion sought recorded
membranes realistic fun softly introduced thesis
digging things modes hierarchy magnesium minute
hired later finally beats finds republican
disruption majesty eternity elephant lake retention

Happy to create a fresh PR with this list as words2.json if you like.

sts10 · 2023-05-12T17:08:10Z

Ah, I think the 1,633-word list Buttercup used before this PR is the Mnemonicode word list. I believe that list is optimized for distinct sounding words, which would explain the inclusion of words like "margo", "joshua", "freddie", "othello".

But I think I still stand by my criticism: while those words may be easy to say and hear, they're not the best for creating a little story in your head, as advised by the classic xkcd cartoon (I think of this subjective metric as "storyability").

replaces word list with a new, slightly longer list of words that I t…

437b38b

…hink are more common.

uses 4 spaces for json indent, in attempt to match existing code style

6316df1

sts10 mentioned this pull request May 6, 2023

replaces wordlist-5-dice with a new word list dmuth/diceware#39

Open

perry-mitchell merged commit 3da5699 into buttercup:master May 10, 2023

sts10 deleted the new_short_word_list branch May 10, 2023 19:01

sts10 mentioned this pull request May 12, 2023

Adds a longer, 8,192-word word list as words2.json #18

Merged
