-
Notifications
You must be signed in to change notification settings - Fork 109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Find some way to tolerate hexadecimal better #326
Comments
Is there anything that can help identify these as hex? For example, we exclude
We could exclude number prefixed but that will only help in 10/16 cases.
This is because we are focused on spell checking code and want to get all words within identifiers, regardless of whether they split on case (e.g. |
Thinking aloud: How about "strings of at least 32 characters containing only hexadecimal characters"? Alternatively, excluding Alternatively, "strings within In the most general case, it might be okay to have a way to annotate something as "not text". These would all work for my use case; do you think any of them are clean enough to use in practice? |
While its unlikely for someone to spell out an identifier 32 characters long with only hex digits, I don't think its impossible. I guess the question is how likely it would happen since we do err on the side of not spell checking vs reporting false positives.
See #316 |
I think this is rare in practice. To try to measure this empirically, I used this ripgrep command pipe:
I then ran this command against my ~/src directory (where I put software that I work on), and found 105204 distinct hex strings. That's too many to scan by hand, so I took a random sample of 1000. I then read all of those 1000 strings, and did not find any that were real text. I reproduced this same result in my ~/build directory (where I put software that I build but do not work on) and got the same results. (Sampled 1000 strings out of 189683 and found none with real text.) So I think that this rule is probably safe? You might want to double-check it on our own collection of software, though, to be sure. |
For more frequent updates, more and quicker checks. Not that for a project this size speed of spellcheck would matter at all. Stick with 1.0.11 due to crate-ci/typos#326 for now.
Friendly ping. If you agree with my analysis above, I'm happy to try to write a patch here. Alternatively, if the analysis isn't convincing, I'd be happy to re-run a similar experiment with a corpus, or sample size of your choosing. |
I think restricting ourselves to 32 same-case hex characters would work. |
By experimentation (see ticket), it seems that same-case hexadecimal strings of 32 characters or longer are almost never intended to hold text. By treating such strings as ignored, we can resist a larger category of false positives. Closes crate-ci#326.
By experimentation (see ticket), it seems that same-case hexadecimal strings of 32 characters or longer are almost never intended to hold text. By treating such strings as ignored, we can resist a larger category of false positives. Closes crate-ci#326.
By experimentation (see ticket), it seems that same-case hexadecimal strings of 32 characters or longer are almost never intended to hold text. By treating such strings as ignored, we can resist a larger category of false positives. Closes crate-ci#326.
By experimentation (see ticket), it seems that same-case hexadecimal strings of 32 characters or longer are almost never intended to hold text. By treating such strings as ignored, we can resist a larger category of false positives. Closes crate-ci#326.
I have a program that uses a fair amount of hex strings in order to represent binary sequences in its tests. But this doesn't sit so well with
typos
, which gives lots of false positives like:It seems that this is a byproduct of treading any sequence of non-numeric wordlike characters as a possible word, even if they occur adjacent to numbers, not specified by spaces.
The text was updated successfully, but these errors were encountered: