Find some way to tolerate hexadecimal better #326

nmathewson · 2021-08-03T19:22:32Z

I have a program that uses a fair amount of hex strings in order to represent binary sequences in its tests. But this doesn't sit so well with typos, which gives lots of false positives like:

error: `ba` should be `by`, `be`
  --> ./tor-llcrypto/src/pk.rs:77:21
   |
77 |              0816ed13ba3303ac5deb911548908025"

It seems that this is a byproduct of treading any sequence of non-numeric wordlike characters as a possible word, even if they occur adjacent to numbers, not specified by spaces.

The text was updated successfully, but these errors were encountered:

epage · 2021-08-03T19:36:01Z

0816ed13ba3303ac5deb911548908025"

Is there anything that can help identify these as hex?

For example, we exclude

0x<hexdigit>
UUIDs
Hashes between 40 (sha1) and 64 (sha256) characters
Base64 that is at least 90 characters long and contain the appropriate padding

We could exclude number prefixed but that will only help in 10/16 cases.

It seems that this is a byproduct of treading any sequence of non-numeric wordlike characters as a possible word, even if they occur adjacent to numbers, not specified by spaces.

This is because we are focused on spell checking code and want to get all words within identifiers, regardless of whether they split on case (e.g. PascalCase) or not (e.g. snake_case)

nmathewson · 2021-08-04T14:31:23Z

Hashes between 40 (sha1) and 64 (sha256) characters

Thinking aloud:

How about "strings of at least 32 characters containing only hexadecimal characters"?

Alternatively, excluding r"xxx" strings would probably work for me, and might make regex-users happier. (Though it might make somebody else unhappy.)

Alternatively, "strings within hex!()" would work here, though that is probably less general.

In the most general case, it might be okay to have a way to annotate something as "not text".

These would all work for my use case; do you think any of them are clean enough to use in practice?

epage · 2021-08-04T14:37:13Z

How about "strings of at least 32 characters containing only hexadecimal characters"?

While its unlikely for someone to spell out an identifier 32 characters long with only hex digits, I don't think its impossible. I guess the question is how likely it would happen since we do err on the side of not spell checking vs reporting false positives.

In the most general case, it might be okay to have a way to annotate something as "not text".

See #316

nmathewson · 2021-08-04T14:59:33Z

While its unlikely for someone to spell out an identifier 32 characters long with only hex digits, I don't think its impossible. I guess the question is how likely it would happen since we do err on the side of not spell checking vs reporting false positives.

I think this is rare in practice. To try to measure this empirically, I used this ripgrep command pipe:

rg -I -o '[a-fA-F0-9]{32,}' | sort |uniq > all_hex_strings

I then ran this command against my ~/src directory (where I put software that I work on), and found 105204 distinct hex strings. That's too many to scan by hand, so I took a random sample of 1000. I then read all of those 1000 strings, and did not find any that were real text.

I reproduced this same result in my ~/build directory (where I put software that I build but do not work on) and got the same results. (Sampled 1000 strings out of 189683 and found none with real text.)

So I think that this rule is probably safe? You might want to double-check it on our own collection of software, though, to be sure.

For more frequent updates, more and quicker checks. Not that for a project this size speed of spellcheck would matter at all. Stick with 1.0.11 due to crate-ci/typos#326 for now.

nmathewson · 2021-08-19T20:56:50Z

Friendly ping. If you agree with my analysis above, I'm happy to try to write a patch here. Alternatively, if the analysis isn't convincing, I'd be happy to re-run a similar experiment with a corpus, or sample size of your choosing.

epage · 2021-08-20T01:17:38Z

I think restricting ourselves to 32 same-case hex characters would work.

By experimentation (see ticket), it seems that same-case hexadecimal strings of 32 characters or longer are almost never intended to hold text. By treating such strings as ignored, we can resist a larger category of false positives. Closes crate-ci#326.

epage added the enhancement Improve the expected label Aug 20, 2021

nmathewson mentioned this issue Aug 20, 2021

Ignore hexadecimal "hashes" of length 32 or greater. #330

Merged

epage closed this as completed in #330 Aug 20, 2021

repi mentioned this issue May 1, 2022

/, _, ., etc before or after natively ignored item (guid, hash, etc) causes it to be spell checked #481

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Find some way to tolerate hexadecimal better #326

Find some way to tolerate hexadecimal better #326

nmathewson commented Aug 3, 2021

epage commented Aug 3, 2021

nmathewson commented Aug 4, 2021 •

edited

Loading

epage commented Aug 4, 2021

nmathewson commented Aug 4, 2021

nmathewson commented Aug 19, 2021

epage commented Aug 20, 2021

Find some way to tolerate hexadecimal better #326

Find some way to tolerate hexadecimal better #326

Comments

nmathewson commented Aug 3, 2021

epage commented Aug 3, 2021

nmathewson commented Aug 4, 2021 • edited Loading

epage commented Aug 4, 2021

nmathewson commented Aug 4, 2021

nmathewson commented Aug 19, 2021

epage commented Aug 20, 2021

nmathewson commented Aug 4, 2021 •

edited

Loading