Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci: run spell check in CI, fix remaining issues #4799

Merged
merged 48 commits into from
Mar 25, 2020

Conversation

jsoref
Copy link
Contributor

@jsoref jsoref commented Mar 4, 2020

Summary of the Pull Request

This adds an action (currently pinned to my latest release 0.0.10-alpha) which performs spell checking.

Spell checking can occur:

  • on commits (pushing to a branch)
  • on PRs (creating a pull request)
  • on commits in repository forks if the forked repository enables Actions (this does not happen by default)
  • asynchronously for PRs created from foreign repositories (this addresses the former)

References

PR Checklist

Detailed Description of the Pull Request / Additional comments

Comments are added to commits/pull requests

such as:
jsoref@31d73b0#commitcomment-37624165

Unrecognized words will be annotated

When a new commit is added while the action is active, as in:
check-spelling/examples-testing@54bcdf0

Whitelisting words

The details in the comment provide commands one can run to update the whitelist for the cases where a word should be accepted.

Excluding files

The excludes.txt file allows one to exclude files by directory, extension, or other patterns -- each line in the file is a perl regular expression.

Scheduled runs

Due to GitHub constraints, PRs from forks don't have permission to comment/annotate, so instead, I'm trying using a schedule to check pull requests. -- Currently configured to look for PRs that have changed in the last hour (running hourly).

Quirks

Roughly doubled action runs

The use of both push and pull_request generally means that for commits w/in a repository, one will get at least duplicate annotations. The comments should appear in two different places (one should end up on the PR and the other should end up on the commit).

I haven't figured out the best way to address this, as while one is working on a branch, immediate feedback is best. -- But if one has a PR, I think it's best to be able to see the output in the PR...

Miscellaneous

I've included a couple of additional spelling commits. If the changes are wrong, we can either whitelist the words or add the files to excludes.

Adjustment

I'm happy to take feedback and make adjustments :-)

Validation Steps Performed

This is an alpha release, and this repository would be an early adopter.
We're using a version of this and I'm slowly seeding it to projects.

You can test this code by applying it to your own fork in GitHub and then adding a new commit with a word that isn't in the dictionary ...


cc: @miniksa, @DHowett-MSFT

@miniksa
Copy link
Member

miniksa commented Mar 4, 2020

Interesting. So we're going to have to whitelist all of our variable names and short-hands in here, it looks like. Have you noticed that being a big deal between commits? Or do folks tend to reuse the same shorthands in a given project? If on average people only have to add 0-5 "words" for their variables to the whitelist when they make a PR, then I think it's worth it.

I'm still all for trying this, but if it becomes too onerous, we might have to come up with some more rules or patterns on how we can identify things to exclude from the scanner INSIDE a given file type.

It's hard for me to mentally visualize what this is going to look like in practice from your description. I get the rough idea from some of your links what the annotations look like, but the whole end-to-end I'm struggling to understand without working through it myself. I think we're just going to have to try it and refine it from there.

The additional spelling fixes look correct to me as well.

@miniksa miniksa added Area-CodeHealth Issues related to code cleanliness, linting, rules, warnings, errors, static analysis, etc. Issue-Task It's a feature request, but it doesn't really need a major design. Needs-Second It's a PR that needs another sign-off Product-Meta The product is the management of the products. labels Mar 4, 2020
@jsoref
Copy link
Contributor Author

jsoref commented Mar 4, 2020

I think projects tend to reuse shorthands. And certainly w/in a commit or series.

I'm also open to ideas for how to write rules for extra exclusions. e.g. Ignore strings shorter than x characters, or longer than y characters.

fwiw, there are 71 two-character (676 possible, 612 not in a dictionary*) and 399 three-character (17576 possible, 16991 not in a dictionary) tokens in the whitelist:

$ perl -pne 's/././g' .github/actions/spell-check/whitelist/*.txt|sort |uniq -c|sort -n|grep -A100 39
  39 ................
  47 ...............
  48 ..............
  71 ..
  93 .............
 105 ............
 150 ...........
 182 ..........
 240 .........
 309 ........
 311 ....
 324 .....
 348 .......
 356 ......
 399 ...

-- collectively, much more of the whitelist is 4+5+6 character sequences

Most of the annoying sequences I run into are hashes / base64 encodings (e.g. PKI or data: or git shas). My general response to those is to just ignore the file entirely -- the most common types are either .key/.pem/.crt or things like package.json/package-lock.json.

One thing you should do is review the whitelist, in reviewing it just now, I realized that there were a couple of misspellings in it... which I'm addressing by fixing & removing.

One example is defing which could easily be written out as defining if one wanted to force the word (the context is #define / #undef).

There's also the odd case of TfEditses.cpp -- that resulted in Editses being in the whitelist, but the thing is EditSession and it probably should be TfEditSes.cpp. -- Addressing that would be a distinct PR as it's a significant code change (renaming a file) and I'd definitely want pre-approval before considering it.

Note that you'll probably see things like egistry in the whitelist, since \r and \t are often used as escape characters, my code makes some assumptions at the expense of forcing people to whitelist some things which aren't actually misspelled in context. -- Windows unfortunately (for my code) tends to use \ as a path delimiter which means that when my code looks at the string it's making the wrong guess and lopping off that r or t.

Anyway, please feel free to try it for a bit. I'm definitely eager to improve the experience.

@miniksa
Copy link
Member

miniksa commented Mar 6, 2020

I think projects tend to reuse shorthands. And certainly w/in a commit or series.

Agreed, yeah probably.

I'm also open to ideas for how to write rules for extra exclusions. e.g. Ignore strings shorter than x characters, or longer than y characters.

I would immediately gravitate toward regex, but some people hate that.

One thing you should do is review the whitelist, in reviewing it just now, I realized that there were a couple of misspellings in it... which I'm addressing by fixing & removing.

OK. I'll go read it as fully as I can.

One example is defing which could easily be written out as defining if one wanted to force the word (the context is #define / #undef).

I don't think I'd be that critical, personally. But someone else might care more than me.

There's also the odd case of TfEditses.cpp -- that resulted in Editses being in the whitelist, but the thing is EditSession and it probably should be TfEditSes.cpp. -- Addressing that would be a distinct PR as it's a significant code change (renaming a file) and I'd definitely want pre-approval before considering it.

It could even change straight up into TfEditSession.cpp. That's fine with me.

Note that you'll probably see things like egistry in the whitelist, since \r and \t are often used as escape characters, my code makes some assumptions at the expense of forcing people to whitelist some things which aren't actually misspelled in context. -- Windows unfortunately (for my code) tends to use \ as a path delimiter which means that when my code looks at the string it's making the wrong guess and lopping off that r or t.

Eh, that's OK.

Anyway, please feel free to try it for a bit. I'm definitely eager to improve the experience.

Yes, I think we'll refine it further after it's in.

argc
args
argv
arigatoo
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've always seen arigatoo as arigatou given it comes from ありがとう and is usually to and is usually u.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// (hiragana doomo arigatoo)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, interesting. I would have written doomo as doumo too per -> do, -> u, -> mo.

BGRA
BHID
bhoj
Bhojwani
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You know, it might be nice to have another whitelist file that is just "names of contributors and their aliases" and make the true whitelist be a union of the files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. That's probably doable. If whitelist is a directory instead of a file, I can cat them all together. It'll take a bit to reason through, but ... yeah...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah if it's a directory, I'd have like
whitelist\everythingelse.txt
whitelist\names.txt
whitelist\t.txt
whitelist\n.txt
whitelist\r.txt
whitelist\README.md <-- to explain WTF the files mean
or whatnot.

DFFF
DFFFu
DFu
DGCZZu
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is stuff like DGCZZu coming from? That feels like nonsense that should be put in its own bucket and not applied universally.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, base64?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

YouTube:

* Windows Terminal Launch: [Terminal "Sizzle Video"](https://www.youtube.com/watch?v=8gw0rXPMMPE&list=PLEHMQNlPj-Jzh9DkNpqipDGCZZuOwrQwR&index=2&t=0s)

| 2019-05-07 | [Announcement](https://devblogs.microsoft.com/commandline/introducing-windows-terminal/) | Terminal announced & open-sourced ([Build 2019 Terminal session](https://www.youtube.com/watch?v=KMudkRcwjCw), ["Sizzle" video](https://www.youtube.com/watch?v=8gw0rXPMMPE&list=PLEHMQNlPj-Jzh9DkNpqipDGCZZuOwrQwR&index=2&t=0s)) |

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow that's curious. If we had regex, I'd be inclined to make it ignore all of a URL pattern for regexs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've found plenty of typos in URLs... Some can be fixed, e.g. some places include a numerical ID which means one can change the human readable portion :-), and others can be fixed by e.g. updating the website content...

That said, being able to exclude things like youtube urls does seem useful... I'll have to think about how to deal w/ that.

My imaginary pipeline is something like:

  1. select files -- excludes.txt
  2. filter input -- my spelling repo has code for things like this -- specifically to address data: URLs (a key source of base64) and general hex tokens -- but I haven't thought through how to do it here. As you can see, I'm using regular expressions, so yes, that's probably what I'd go w/. Again, probably one regexp per line.
  3. parse corpus
  4. drop words found in dictionary -- currently the dictionary isn't configurable in the action (I intend to support this), and the proposal to add things like curated apis would plug in here
  5. filter output -- whitelist.txt -- this phase is designed to warn about stale whitelist entries (and includes code to automatically remove entries from it as well as adding new entries).

Would you want multiple whitelist files, or would it be sufficient to have multiple "dictionary" files? -- The distinction is that dictionaries are treated as read-only -- the tool will not warn you about unused items in them. Whereas the tool will complain if you've whitelisted something and it can't find it in the whitelist.

It's easier for me to imagine code that handles multiple dictionaries and a single whitelist (especially if I add the code to handle instance counts). Although if I have to handle comments, then I can probably make things work using some fancy greps.
-- I think I might also drop support for having the whitelists live outside of the current branch, right now I'm still theoretically supporting an older model where the tool retrieves the whitelist+excludes files from remote locations, but that won't work well w/ enumerating files...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've finally implemented regexp support along w/ multiple file support and dictionaries. I've just published it and am letting it chew on this repository....

The lazy approach to generating a dictionary would be to create a repository with just the spellchecker and all the files whose contents you believe are ok, the warning output should be valid input as a dictionary file (instead of as a whitelist file). -- I'll have to document this approach.

If that's too inconvenient, I can think about other ways of handling the problem.

While you might be comfortable checking in a script into your repository, I worry that others might not.

@jsoref jsoref force-pushed the spell-check branch 3 times, most recently from 1f14140 to 56d9cfe Compare March 22, 2020 21:55
Copy link
Contributor Author

@jsoref jsoref left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I've released a newer version of the spell checker, and I've split things out into dictionaries (including Names and Fonts).

I've put together a couple of basic patterns which seem to work reasonably well, and improved my basic templates. In working to regularize these, I spotted a couple of additional typos in the project. (Hello rying.)

I think this should be good to go.

@@ -1281,7 +1281,7 @@ Pane::SnapSizeResult Pane::_CalcSnappedDimension(const bool widthOrHeight, const
// already snapped or minimum size.
// Arguments:
// - widthOrHeight: if true operates on width, otherwise on height.
// - sizeNode: a layouting node that corresponds to this pane.
// - sizeNode: a layout size node that corresponds to this pane.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling out this commit

@@ -182,7 +182,7 @@ class Pane : public std::enable_shared_from_this<Pane>
};

// Helper structure that builds a (roughly) binary tree corresponding
// to the pane tree. Used for layouting panes with snapped sizes.
// to the pane tree. Used for laying out panes with snapped sizes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling out this commit

@@ -765,7 +765,7 @@ void SCREEN_INFORMATION::SetViewportSize(const COORD* const pcoordSize)

const CONSOLE_INFORMATION& gci = ServiceLocator::LocateGlobals().getConsoleInformation();

// If we're in terminal scrolling mode, and we're rying to set the viewport
// If we're in terminal scrolling mode, and we're trying to set the viewport
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was hiding :-o

@@ -307,7 +307,7 @@ const bool RenderData::IsGridLineDrawingAllowed() noexcept
{
// Otherwise, for compatibility reasons with legacy applications that used the additional CHAR_INFO bits by accident or for their own purposes,
// we must enable grid line drawing only in a DBCS output codepage. (Line drawing historically only worked in DBCS codepages.)
// The only known instance of this is Image for Windows by TeraByte, Inc. (TeryByte Unlimited) which used the bits accidentally and for no purpose
// The only known instance of this is Image for Windows by TeraByte, Inc. (TeraByte Unlimited) which used the bits accidentally and for no purpose
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling this out.

@@ -316,7 +316,7 @@ void ProcessCtrlEvents()
{
/*
* Status will be non-successful if a process attached to this console
* vetos shutdown. In that case, we don't want to try to kill any more
* vetoes shutdown. In that case, we don't want to try to kill any more
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥔. 🥔🥔.

\d+x\d+Logo
Scro\&ll
# selectionInput.cpp
:\\windows\\syste\b
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

syste is otherwise reported as a false positive. This is because of an intentionally narrow example in:

:\windows\syste

@miniksa
Copy link
Member

miniksa commented Mar 24, 2020

There's no README.md for dictionaries. What exactly is the difference between Dictionaries and Whitelists again?

@jsoref
Copy link
Contributor Author

jsoref commented Mar 24, 2020

Dictionaries can include words that don't appear in the corpus.
Whitelisted items will be complained about if they are no longer in the corpus.

So, you could, e.g. have the entire English language in a dictionary, or a bunch of unused apis, or a list of employees.

You would only whitelist things that are currently used, but which aren't really words -- when they're removed, the tool will complain and suggest removing them from the whitelist.

@miniksa
Copy link
Member

miniksa commented Mar 25, 2020

I've added /.github/actions/spell-check/dictionary/README.md@e88a4da

Excellent clarification, thank you.

@DHowett-MSFT, I'd like to bring this in. Are you okay with that?

Copy link
Contributor

@DHowett-MSFT DHowett-MSFT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love this so much. Thanks.

@DHowett-MSFT DHowett-MSFT changed the title Spell check using GitHub Action ci: run spell check in CI, fix remaining issues Mar 25, 2020
@DHowett-MSFT DHowett-MSFT merged commit 5de9fa9 into microsoft:master Mar 25, 2020
@jsoref jsoref deleted the spell-check branch March 25, 2020 18:55
@german-one
Copy link
Contributor

@jsoref I really welcome the spell check. Especially because I'm not a native speaker and thus, I've been the source of quite a few of the findings already.
It's still not clear to me how to run the spell check locally though. Am I missing something or don't we have this possibility yet?

@jsoref
Copy link
Contributor Author

jsoref commented Mar 25, 2020

@german-one: if you push a branch to this repository (no need to create a PR), it should automatically run (you'll want your branch to be past the merge of course). If you have a fork of this repository, then it might run–you'd go to actions and possibly enable actions.

It is possible to run this in docker. The action log gives you the steps to run, it isn't particularly hard. I'll probably work on providing instructions for that soon. Please give a branch a try.

@german-one
Copy link
Contributor

@jsoref Outcome after resolving a merge conflict:
#4493 (comment)
I guess little to nothing is related to my PR though.

@jsoref
Copy link
Contributor Author

jsoref commented Mar 25, 2020

@miniksa / @DHowett-MSFT / @german-one : I'm thinking that projects will want to be able to insert custom advice / best practices into the bot message. Offhand, I expect that would be another file, I'm open to advice on naming it (also, where would the best place be to discuss this stuff?).

Tentatively, I think advice.txt would work as a file name, for this project, it could have something like:

If the items listed above are names, please add them to `.github/actions/spell-check/dictionary/names.txt`
If they're APIs, you can add them to a file in `.github/actions/spell-check/dictionary/`
If they're just things you're using, please add them to an appropriate file in `.github/actions/spell-check/whitelist`
If you need to use a specific token in one place and it shouldn't generally be used, you can
add an item in an appropriate file in `.github/actions/spell-check/patterns`

See the README.md in each directory for more information.

How does that sound (as a feature)?
(I'll need to play with it a bit, and you'll get to play with it if I implement it.)


I also need to look into the schedule hook, as it appears to be misbehaving.

@DHowett-MSFT
Copy link
Contributor

I like the idea of having bot advice!

@ghost ghost mentioned this pull request Mar 26, 2020
2 tasks
donno2048 added a commit to donno2048/terminal that referenced this pull request Sep 28, 2020
<!-- Enter a brief description/summary of your PR here. What does it fix/what does it change/how was it tested (even manually, if necessary)? -->
## Summary of the Pull Request
Translate automatically generated message text from German into English.

<!-- Other than the issue solved, is this relevant to any other issues/existing PRs? --> 
## References
#4799

<!-- Please review the items on the PR checklist before submitting-->
## PR Checklist
* [ ] Closes #xxx
* [x] CLA signed. If not, go over [here](https://cla.opensource.microsoft.com/microsoft/Terminal) and sign the CLA
* [ ] Tests added/passed
* [ ] Requires documentation to be updated
* [ ] I've discussed this with core contributors already. If not checked, I'm ready to accept this work might be rejected in favor of a different grand plan. Issue number where discussion took place: #xxx

<!-- Provide a more detailed description of the PR, other things fixed or any additional comments/features here -->
## Detailed Description of the Pull Request / Additional comments
@miniksa mentioned in microsoft/terminal#4799 (comment)
> Actually..... we would consider that block to be wrong. It should have had the English error text there. [...] It's just our practice to have everything be in English as that's our company's working language. [...]

The translation is based on the message text found in the official docs: https://docs.microsoft.com/en-us/nuget/Consume-Packages/Package-restore-troubleshooting

<!-- Describe how you validated the behavior. Add automated tests wherever possible, but list manual validation steps taken as well -->
## Validation Steps Performed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-CodeHealth Issues related to code cleanliness, linting, rules, warnings, errors, static analysis, etc. Issue-Task It's a feature request, but it doesn't really need a major design. Needs-Second It's a PR that needs another sign-off Product-Meta The product is the management of the products.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consider adding spell checking to CI
5 participants