ci: run spell check in CI, fix remaining issues #4799

jsoref · 2020-03-04T04:47:44Z

Summary of the Pull Request

This adds an action (currently pinned to my latest release 0.0.10-alpha) which performs spell checking.

Spell checking can occur:

on commits (pushing to a branch)
on PRs (creating a pull request)
on commits in repository forks if the forked repository enables Actions (this does not happen by default)
asynchronously for PRs created from foreign repositories (this works around the limitation in the previous item)

References

PR Checklist

Closes #4473 (Consider adding spell checking to CI)
CLA signed. If not, go over here and sign the CLA
Tests added/passed
Requires documentation to be updated
I've discussed this with core contributors already. If not checked, I'm ready to accept this work might be rejected in favor of a different grand plan. Issue number where discussion took place: #4473 (Consider adding spell checking to CI)

Detailed Description of the Pull Request / Additional comments

Comments are added to commits/pull requests

such as:
jsoref@31d73b0#commitcomment-37624165

Unrecognized words will be annotated

When a new commit is added while the action is active, as in:
check-spelling/examples-testing@54bcdf0

Whitelisting words

The details in the comment provide commands one can run to update the whitelist for the cases where a word should be accepted.

Excluding files

The excludes.txt file allows one to exclude files by directory, extension, or other patterns -- each line in the file is a perl regular expression.
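As a rough illustration of how a pattern-per-line exclude file can behave, here is a minimal sketch. It uses Python's `re` module rather than Perl (the two dialects agree for simple patterns like these), and the patterns and paths are hypothetical, not taken from this PR's excludes.txt.

```python
import re

def load_patterns(lines):
    """Compile one regular expression per non-empty line."""
    return [re.compile(line.strip()) for line in lines if line.strip()]

def is_excluded(path, patterns):
    """A path is excluded if any pattern matches anywhere in it."""
    return any(p.search(path) for p in patterns)

# Hypothetical excludes.txt contents (assumptions for illustration only).
EXAMPLE_EXCLUDES = [
    r"^dep/",                        # everything under one directory
    r"\.(?:png|ico)$",               # by file extension
    r"(?:^|/)package-lock\.json$",   # one file name at any depth
]

patterns = load_patterns(EXAMPLE_EXCLUDES)
paths = [
    "dep/lib/readme.md",
    "res/icon.png",
    "src/host/main.cpp",
    "tools/package-lock.json",
]
kept = [p for p in paths if not is_excluded(p, patterns)]
print(kept)  # ['src/host/main.cpp']
```

Because the patterns are unanchored unless they say otherwise, `^` and `$` are what distinguish "this directory" from "any path containing this text".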

Scheduled runs

Due to GitHub constraints, PRs from forks don't have permission to comment/annotate, so instead I'm trying a schedule to check pull requests. -- Currently configured to look for PRs that have changed in the last hour (running hourly).
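The "changed in the last hour" selection could look something like the sketch below. This is not the action's actual code. It assumes PR records shaped like GitHub's REST API objects (a `number` and an ISO 8601 `updated_at`), and the sample data is made up.

```python
from datetime import datetime, timedelta, timezone

def recently_updated(pulls, now, window=timedelta(hours=1)):
    """Return the numbers of PRs whose 'updated_at' lies in the window ending at 'now'."""
    cutoff = now - window
    recent = []
    for pr in pulls:
        # Timestamps are assumed to be UTC in the form 2020-03-04T04:47:44Z.
        updated = datetime.strptime(
            pr["updated_at"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if cutoff <= updated <= now:
            recent.append(pr["number"])
    return recent

# Hypothetical data: one PR touched 13 minutes ago, one 90 minutes ago.
now = datetime(2020, 3, 4, 5, 0, 0, tzinfo=timezone.utc)
pulls = [
    {"number": 101, "updated_at": "2020-03-04T04:47:44Z"},
    {"number": 102, "updated_at": "2020-03-04T03:30:00Z"},
]
print(recently_updated(pulls, now))  # [101]
```

With an hourly schedule and a one-hour window, a run that starts late can miss a PR updated just before the previous cutoff, so a slightly wider window may be worth considering.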

Quirks

Roughly doubled action runs

The use of both push and pull_request generally means that for commits w/in a repository, one will get at least two sets of annotations. The comments should appear in two different places (one should end up on the PR and the other should end up on the commit).

I haven't figured out the best way to address this, as while one is working on a branch, immediate feedback is best. -- But if one has a PR, I think it's best to be able to see the output in the PR...

Miscellaneous

I've included a couple of additional spelling commits. If the changes are wrong, we can either whitelist the words or add the files to excludes.

Adjustment

I'm happy to take feedback and make adjustments :-)

Validation Steps Performed

This is an alpha release, and this repository would be an early adopter.
We're using a version of this and I'm slowly seeding it to projects.

You can test this code by applying it to your own fork in GitHub and then adding a new commit with a word that isn't in the dictionary ...

cc: @miniksa, @DHowett-MSFT

miniksa · 2020-03-04T18:08:00Z

Interesting. So we're going to have to whitelist all of our variable names and short-hands in here, it looks like. Have you noticed that being a big deal between commits? Or do folks tend to reuse the same shorthands in a given project? If on average people only have to add 0-5 "words" for their variables to the whitelist when they make a PR, then I think it's worth it.

I'm still all for trying this, but if it becomes too onerous, we might have to come up with some more rules or patterns on how we can identify things to exclude from the scanner INSIDE a given file type.

It's hard for me to mentally visualize what this is going to look like in practice from your description. I get the rough idea from some of your links what the annotations look like, but the whole end-to-end I'm struggling to understand without working through it myself. I think we're just going to have to try it and refine it from there.

The additional spelling fixes look correct to me as well.

jsoref · 2020-03-04T19:20:47Z

I think projects tend to reuse shorthands. And certainly w/in a commit or series.

I'm also open to ideas for how to write rules for extra exclusions. e.g. Ignore strings shorter than x characters, or longer than y characters.
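A length-based rule of that kind could be as small as the sketch below. The thresholds are placeholders for the unspecified x and y, and the function name is hypothetical.

```python
def filter_by_length(tokens, min_len=3, max_len=20):
    """Keep only tokens whose length is within [min_len, max_len]."""
    return [t for t in tokens if min_len <= len(t) <= max_len]

# Hypothetical tokens: a two-letter shorthand, two ordinary words,
# and a 32-character hash-like run.
sample = ["hr", "til", "whitelist", "0123456789abcdef" * 2]
print(filter_by_length(sample))  # ['til', 'whitelist']
```

The trade-off is that anything outside the range is never checked, so a typo in a two-letter token would go unnoticed.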

fwiw, there are 71 two-character (676 possible, 612 not in a dictionary*) and 399 three-character (17576 possible, 16991 not in a dictionary) tokens in the whitelist:

$ perl -pne 's/././g' .github/actions/spell-check/whitelist/*.txt|sort |uniq -c|sort -n|grep -A100 39
  39 ................
  47 ...............
  48 ..............
  71 ..
  93 .............
 105 ............
 150 ...........
 182 ..........
 240 .........
 309 ........
 311 ....
 324 .....
 348 .......
 356 ......
 399 ...

-- collectively, much more of the whitelist is 4+5+6 character sequences
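For anyone who would rather not read the perl pipeline: it replaces every character with a dot and counts identical lines, which amounts to a histogram of entry lengths sorted by count. A rough Python equivalent, run here on a made-up word list rather than the real whitelist files:

```python
from collections import Counter

def length_histogram(words):
    """Count entries per length, sorted by ascending count (like sort -n)."""
    counts = Counter(len(w) for w in words)
    return sorted(counts.items(), key=lambda kv: kv[1])

# Hypothetical entries standing in for the whitelist contents.
sample = ["ab", "cd", "abc", "abcd", "wxyz", "efgh"]
print(length_histogram(sample))  # [(3, 1), (2, 2), (4, 3)]

# Counts reported in the output above, keyed by token length.
reported = {2: 71, 3: 399, 4: 311, 5: 324, 6: 356}
short = reported[2] + reported[3]
mid = reported[4] + reported[5] + reported[6]
print(short, mid)  # 470 991
```

Plugging in the reported counts gives 470 entries of two or three characters against 991 entries of four to six characters.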

Most of the annoying sequences I run into are hashes / base64 encodings (e.g. PKI or data: or git shas). My general response to those is to just ignore the file entirely -- the most common types are either .key/.pem/.crt or things like package.json/package-lock.json.

One thing you should do is review the whitelist. In reviewing it just now, I realized that there were a couple of misspellings in it... which I'm addressing by fixing & removing.

One example is defing which could easily be written out as defining if one wanted to force the word (the context is #define / #undef).

There's also the odd case of TfEditses.cpp -- that resulted in Editses being in the whitelist, but the thing is EditSession and it probably should be TfEditSes.cpp. -- Addressing that would be a distinct PR as it's a significant code change (renaming a file) and I'd definitely want pre-approval before considering it.

Note that you'll probably see things like egistry in the whitelist. Since \r and \t are often used as escape characters, my code makes some assumptions at the expense of forcing people to whitelist some things which aren't actually misspelled in context. -- Windows unfortunately (for my code) tends to use \ as a path delimiter, which means that when my code looks at the string it's making the wrong guess and lopping off that r or t.
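To make the failure mode concrete, here is a small sketch of the kind of heuristic described. It is an assumption about the general approach, not the action's actual code: it treats `\r` and `\t` as escape sequences, blanks them out, and then extracts alphabetic runs.

```python
import re

# Assumed heuristic: a backslash followed by r or t is an escape sequence.
ESCAPE = re.compile(r"\\[rt]")

def tokens(text):
    """Blank out presumed escapes, then return the alphabetic runs."""
    cleaned = ESCAPE.sub(" ", text)
    return re.findall(r"[A-Za-z]+", cleaned)

# A genuine escape: the guess is right and the words come out intact.
print(tokens(r'puts("name\tvalue")'))   # ['puts', 'name', 'value']

# A Windows-style path: the same guess eats the leading r and t.
print(tokens(r"HKLM\registry\test"))    # ['HKLM', 'egistry', 'est']
```

The second call shows where a token like egistry comes from: the backslash was a path delimiter, but the heuristic cannot tell that from the text alone.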

Anyway, please feel free to try it for a bit. I'm definitely eager to improve the experience.

miniksa · 2020-03-06T18:08:09Z

> I think projects tend to reuse shorthands. And certainly w/in a commit or series.

Agreed, yeah probably.

> I'm also open to ideas for how to write rules for extra exclusions. e.g. Ignore strings shorter than x characters, or longer than y characters.

I would immediately gravitate toward regex, but some people hate that.
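A regex-based rule set for ignoring things inside a file might be sketched as follows. The two patterns (a full 40-character git SHA and a base64 `data:` URI) are hypothetical examples chosen to match the kinds of sequences mentioned earlier in the thread; nothing here is an existing feature of the action.

```python
import re

# Hypothetical in-file ignore patterns (assumptions for illustration).
MASKS = [
    re.compile(r"\b[0-9a-f]{40}\b"),                        # full git SHA-1
    re.compile(r"data:[\w/+.-]+;base64,[A-Za-z0-9+/=]+"),   # data: URI
]

def mask(text):
    """Replace every match of every ignore pattern with a space."""
    for pattern in MASKS:
        text = pattern.sub(" ", text)
    return text

def words(text):
    """Return the alphabetic runs left after masking."""
    return re.findall(r"[A-Za-z]+", mask(text))

sha = "0123456789abcdef0123456789abcdef01234567"
print(words("fixed in " + sha + " today"))
# ['fixed', 'in', 'today']
print(words("src=data:image/png;base64,iVBORw0KGgo= ok"))
# ['src', 'ok']
```

Masking before tokenizing keeps fragments such as abcdef from ever reaching the dictionary check, at the cost of maintaining the pattern list.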

> One thing you should do is review the whitelist. In reviewing it just now, I realized that there were a couple of misspellings in it... which I'm addressing by fixing & removing.

OK. I'll go read it as fully as I can.

> One example is defing which could easily be written out as defining if one wanted to force the word (the context is #define / #undef).

I don't think I'd be that critical, personally. But someone else might care more than me.

> There's also the odd case of TfEditses.cpp -- that resulted in Editses being in the whitelist, but the thing is EditSession and it probably should be TfEditSes.cpp. -- Addressing that would be a distinct PR as it's a significant code change (renaming a file) and I'd definitely want pre-approval before considering it.

It could even change straight up into TfEditSession.cpp. That's fine with me.

> Note that you'll probably see things like egistry in the whitelist. Since \r and \t are often used as escape characters, my code makes some assumptions at the expense of forcing people to whitelist some things which aren't actually misspelled in context. -- Windows unfortunately (for my code) tends to use \ as a path delimiter, which means that when my code looks at the string it's making the wrong guess and lopping off that r or t.

Eh, that's OK.

> Anyway, please feel free to try it for a bit. I'm definitely eager to improve the experience.

Yes, I think we'll refine it further after it's in.