French word list quality is subpar #6652

Closed
eaon opened this issue Oct 18, 2022 · 9 comments
Labels
hackathon help wanted Issues we would definitely appreciate volunteer help with i18n Anything related to translation or internationalization of SecureDrop UX

Comments

@eaon
Contributor

eaon commented Oct 18, 2022

Description

We do not allow our users to set their own passwords, whether they're journalists or sources. Instead, we generate Diceware passphrases for them. This means we rely on word lists to generate passphrases, and we want those to be good. Unfortunately, compared to its English counterpart, the French word list isn't very good.

All users who select French automatically get a passphrase derived from this list.

Distribution of word lengths

en.txt

{
  3: 82,
  4: 466,
  5: 902,
  6: 1347,
  7: 1563,
  8: 1739,
  9: 1502,
  10: 2
}

fr.txt

{
  1: 26,
  2: 676,
  3: 318,
  4: 1055,
  5: 2689,
  6: 2620
}

Three-letter words make up 1% of en.txt, while words of three letters or fewer make up about 14% of fr.txt (1,020 of 7,384 entries). fr.txt also contains all 26 single letters of the alphabet, along with various character sequences (for example vwx) that are neither words nor easy to remember. Worst of all, on the whole you're more likely to get a lower-quality passphrase from the French word list than from the English one.
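For anyone who wants to reproduce these figures, here is a minimal sketch of how such a distribution and the share of short words could be computed. The function names are hypothetical, and the counts are copied from the two tables above rather than read from the actual files.

```python
from collections import Counter
from typing import Iterable


def length_distribution(words: Iterable[str]) -> dict[int, int]:
    """Map each word length to the number of words of that length."""
    counts = Counter(len(word) for word in words)
    return dict(sorted(counts.items()))


def share_at_most(distribution: dict[int, int], max_length: int) -> float:
    """Fraction of words whose length is at most max_length."""
    total = sum(distribution.values())
    short = sum(n for length, n in distribution.items() if length <= max_length)
    return short / total


# Counts copied from the en.txt and fr.txt tables above.
EN = {3: 82, 4: 466, 5: 902, 6: 1347, 7: 1563, 8: 1739, 9: 1502, 10: 2}
FR = {1: 26, 2: 676, 3: 318, 4: 1055, 5: 2689, 6: 2620}

print(f"en.txt, 3 letters or fewer: {share_at_most(EN, 3):.1%}")
print(f"fr.txt, 3 letters or fewer: {share_at_most(FR, 3):.1%}")
```

With the real files, `length_distribution` could be fed the stripped lines of each word list directly.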

Comments

I don't think the quality is so terrible that it warrants immediate removal, but it'd be great if we found a free (AGPL-compatible) dictionary that we could use to generate a new list.

@eaon eaon added UX i18n Anything related to translation or internationalization of SecureDrop labels Oct 18, 2022
@eaon eaon added this to the Long Term Product Backlog milestone Oct 18, 2022
@legoktm
Member

legoktm commented Oct 18, 2022

It would be interesting to see if we could use the list of pages on the French Wiktionary, maybe with some more filtering to achieve our desired word length ratios: https://dumps.wikimedia.org/frwiktionary/20221001/frwiktionary-20221001-all-titles-in-ns0.gz.

This would make it straightforward to add in more languages (that have Wiktionaries at least :))
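A rough sketch of what that filtering might look like, assuming the dump is a gzip file with one page title per line. The function names, the length bounds, and the restriction to lowercase ASCII letters are all assumptions for illustration, not existing project requirements or code. Note that this does nothing to check whether a title is actually a French word.

```python
import gzip
from typing import Iterable


def filter_titles(titles: Iterable[str], min_len: int = 3, max_len: int = 9) -> list[str]:
    """Keep unique titles made only of lowercase ASCII letters within the length bounds."""
    kept = set()
    for raw in titles:
        title = raw.strip()
        if not (min_len <= len(title) <= max_len):
            continue
        # Assumption: only plain lowercase ASCII words are wanted.
        if title.isascii() and title.isalpha() and title.islower():
            kept.add(title)
    return sorted(kept)


def filter_dump(path: str, min_len: int = 3, max_len: int = 9) -> list[str]:
    """Read a gzip-compressed titles file (one title per line) and filter it."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return filter_titles(handle, min_len, max_len)


# Small in-memory example instead of the real dump.
print(filter_titles(["chat", "Paris", "a", "été", "maison", "chat"]))
```

Tuning `min_len` and `max_len` is where the desired word length ratios would come in.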

@legoktm
Member

legoktm commented Oct 19, 2022

I was surprised to learn today that we actually do have a bunch of checks at runtime(!) to supposedly verify the quality of wordlists: https://github.com/freedomofpress/securedrop/blob/6e4a4363b0da489e77d326d72208ddf56065e8f7/securedrop/passphrases.py. One important note from this is that 1 character "words" are dropped.

I think re-evaluating those checks should be part of this. For example, there's a check to verify that generated passphrases are long enough, but it does so by taking the shortest word in the list, multiplying its length by the number of words (7), adding in the spaces, and then verifying that the result clears the minimum passphrase length (20). In practice, this is a static calculation, since (2 * 7) + 7 = 21. Because we drop all "words" that are 1 character long, every possible word list will pass this check (note that there's a separate check that each list has at least 7300 words).

There's a check in the opposite direction, making sure the word list doesn't contain words that are too long. Again, it's a static calculation: the cap of 128 characters per passphrase means the longest word can't be more than 17 characters.

On top of that, I see no reason these checks should happen at runtime; a unit test seems more appropriate.
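To make the arithmetic behind both checks concrete, here is a small sketch. The constant and function names are hypothetical, the values (7 words, minimum 20, maximum 128) are the ones quoted in this comment, and the exact comparison used by the real code is an assumption.

```python
WORDS_PER_PASSPHRASE = 7
MIN_PASSPHRASE_LENGTH = 20
MAX_PASSPHRASE_LENGTH = 128


def length_bound(word_length: int) -> int:
    """Bound as described above: word length times word count, plus one per word for spaces."""
    return word_length * WORDS_PER_PASSPHRASE + WORDS_PER_PASSPHRASE


# Shortest case: 1-character "words" are dropped, so the shortest word has 2 characters.
shortest_bound = length_bound(2)
print(shortest_bound, shortest_bound > MIN_PASSPHRASE_LENGTH)

# Longest case: the largest word length whose bound stays under the cap
# (assuming a strict "less than" comparison).
longest_allowed = max(
    n for n in range(1, MAX_PASSPHRASE_LENGTH) if length_bound(n) < MAX_PASSPHRASE_LENGTH
)
print(longest_allowed, length_bound(longest_allowed), length_bound(longest_allowed + 1))
```

Neither result depends on the contents of a word list beyond its shortest and longest word, which is why both checks are effectively static.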

I think looking at the distribution of word lengths is more interesting, as it indicates how short/long the average passphrase will be.
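As a rough illustration of that point, the expected passphrase length can be estimated from the distributions in the issue description. This sketch assumes 7 words picked uniformly at random and joined by single spaces, and it leaves out the 1-character entries from fr.txt since those are dropped. The names are hypothetical.

```python
WORDS_PER_PASSPHRASE = 7

# Counts copied from the issue description; 1-character entries omitted for fr.txt.
EN = {3: 82, 4: 466, 5: 902, 6: 1347, 7: 1563, 8: 1739, 9: 1502, 10: 2}
FR = {2: 676, 3: 318, 4: 1055, 5: 2689, 6: 2620}


def mean_word_length(distribution: dict[int, int]) -> float:
    """Average word length, weighting each length by its word count."""
    total_chars = sum(length * n for length, n in distribution.items())
    return total_chars / sum(distribution.values())


def expected_passphrase_length(distribution: dict[int, int]) -> float:
    """Expected characters in a passphrase: the words plus the spaces between them."""
    spaces = WORDS_PER_PASSPHRASE - 1
    return WORDS_PER_PASSPHRASE * mean_word_length(distribution) + spaces


for name, dist in (("en.txt", EN), ("fr.txt", FR)):
    print(f"{name}: {expected_passphrase_length(dist):.1f} characters on average")
```

Under these assumptions the English list averages about 55 characters per passphrase and the French list about 40.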

@nabla-c0d3
Contributor

One reason to check at runtime, as is done now, is to prevent the app from starting at all if the word lists are badly insecure (for example, if the list files were somehow modified).

@legoktm
Member

legoktm commented Oct 24, 2022

> One reason to check at runtime, as is done now, is to prevent the app from starting at all if the word lists are badly insecure (for example, if the list files were somehow modified).

Indeed. I think that protects us against two situations:

  1. An instance administrator wants to use a different word list for whatever silly reason, and swaps it in.
  2. An attacker wants to weaken our word lists, somehow has access to write to the filesystem and swaps in their own word list.

No. 2 isn't stoppable, since the attacker (who somehow already has write access) could easily disable whatever runtime checks exist when they insert their own word list.

So then we're only protecting against situation No. 1, which I'm not sure is worth it (e.g. we don't do integrity verification of other files AFAIK).

That said, I doubt any new checks we come up with are going to be noticeably slow, so there's certainly no harm in continuing to check some things at runtime (they should just actually check useful things...).

@nabla-c0d3
Contributor

Yeah I was mainly thinking about 1. as 2. is a "game-over" scenario, as you mentioned.

It looks like being able to swap the word lists used to be a feature (there's a configuration key for it in config.py.example), but perhaps not anymore? In that case, the runtime checks might not be needed; at the same time, I don't think they add a ton of overhead either.

@zenmonkeykstop zenmonkeykstop added help wanted Issues we would definitely appreciate volunteer help with good first issue and removed good first issue labels Nov 2, 2022
@rmol
Contributor

rmol commented Nov 3, 2022

The entropy of Diceware passphrases is interesting, and came up in a very similar discussion a while back. The Diceware FAQ says that short words not only don't decrease the quality of the passphrases, but were included by design for usability. The French word list is weaker than the English one, but that's because it has fewer words, not because it has more short ones.
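For reference, the entropy of a passphrase built from uniformly random words depends only on the list size and the word count. A minimal sketch, assuming 7 words per passphrase and list sizes obtained by summing the distributions in the issue description (the function name is hypothetical):

```python
import math


def passphrase_entropy_bits(list_size: int, words: int = 7) -> float:
    """Entropy in bits of `words` independent uniform picks from `list_size` entries."""
    return words * math.log2(list_size)


EN_SIZE = 7603  # sum of the en.txt length counts
FR_SIZE = 7384  # sum of the fr.txt length counts (including 1-character entries)

en_bits = passphrase_entropy_bits(EN_SIZE)
fr_bits = passphrase_entropy_bits(FR_SIZE)
print(f"en.txt: {en_bits:.2f} bits")
print(f"fr.txt: {fr_bits:.2f} bits")
print(f"difference: {en_bits - fr_bits:.2f} bits")
```

Both come out at roughly 90 bits, with the French list lower by a fraction of a bit.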

I'm not sure what the current state of support for using other wordlists is, but I don't think it's unreasonable for admins to want sources to get passphrases in their preferred languages. Keeping the runtime checks might help prevent an attempt to support that from weakening those passphrases.

@epociask

@legoktm I did a light feasibility assessment of the French Wiktionary word list to see how easily it could be used as an improvement. After running some translation logic in a short Python script to sanitize non-conforming entries, there were still over 2 million words in the list. Manual inspection then showed that the French Wiktionary list also contains entries in an assortment of other languages. Unfortunately, that makes this list an unlikely solution: many entries are hard to understand, and many aren't French at all. The only way to sanitize it properly would be some form of NLP, which brings unnecessary complexity.

Also, the relevant validation logic for a word list appears to live in the SeedPhrase constructor, whereas it should probably be done in some heuristic tests that run pre-deployment, to avoid potential runtime failures.
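Something along these lines is what I have in mind, as a sketch only. The thresholds are the ones mentioned earlier in this thread (at least 7300 words, words of 2 to 17 characters); the function and test names are hypothetical, and the duplicate check is an extra assumption of mine.

```python
from typing import Iterable

# Thresholds taken from the discussion above; the names are hypothetical.
MIN_LIST_SIZE = 7300
MIN_WORD_LENGTH = 2
MAX_WORD_LENGTH = 17


def word_list_problems(words: Iterable[str]) -> list[str]:
    """Return human-readable problems found in a word list (empty if none)."""
    words = list(words)
    problems = []
    if len(words) < MIN_LIST_SIZE:
        problems.append(f"only {len(words)} words, need at least {MIN_LIST_SIZE}")
    if len(set(words)) != len(words):
        # Assumption: duplicates are unwanted because they shrink the effective list.
        problems.append("list contains duplicate words")
    if any(len(word) < MIN_WORD_LENGTH for word in words):
        problems.append(f"some words are shorter than {MIN_WORD_LENGTH} characters")
    if any(len(word) > MAX_WORD_LENGTH for word in words):
        problems.append(f"some words are longer than {MAX_WORD_LENGTH} characters")
    return problems


def test_word_list_is_acceptable():
    # Synthetic stand-in; a real test would load each shipped list instead.
    words = [f"w{i:05d}" for i in range(MIN_LIST_SIZE)]
    assert word_list_problems(words) == []


test_word_list_is_acceptable()
```

Returning a list of problems rather than raising on the first one would let a test report everything wrong with a list in a single run.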

NOTE

  • There appears to be an inverse relationship between comprehensibility and entropy, given the relative size of a language's word list.

@gonzalo-bulnes
Contributor

I had a chat with @epociask yesterday, and just caught up with the conversation in this issue.

First, I'd like to highlight what @rmol said: the security properties of a passphrase created from words randomly picked from a word list don't depend on the words themselves; they depend on the size of the list (how many combinations of n words can be created). I'd recommend an EFF Deeplinks article to anyone interested in that: EFF's New Wordlists for Random Passphrases.
Playing with the composition of the word list can be used to provide other desirable properties (e.g. memorability, brevity).

Because they're meant to be memorized over time, I do believe there is value in people having access to code names (passphrases) in their own language. I have thought extensively about how such word lists can be created, and I am currently convinced that it is something that should be done in coordination with language specialists. (By that I mean people very familiar with the target language and how it is commonly used.)

For illustration, I've written down my thoughts on creating a Spanish word list in the past. (Ongoing project.) Some of them may be useful inspiration as we think about non-English word lists for SecureDrop use. Keep in mind, though, that while a word list designed for use with dice to generate passphrases can be used for SecureDrop, a word list for SecureDrop doesn't necessarily have to meet all the constraints that may be desirable in a word list made to be used with dice.

@epociask

@gonzalo-bulnes Somewhat unrelated, but do you think it'd be valuable to add word-list-specific tests to the repository instead of performing data validation checks inside the Passphrase constructor?

7 participants