Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Infra spell bot #35224

Merged
merged 10 commits into from
Aug 29, 2024
Merged

Infra spell bot #35224

merged 10 commits into from
Aug 29, 2024

Conversation

OnkarRuikar
Copy link
Contributor

@OnkarRuikar OnkarRuikar commented Jul 29, 2024

I've been running a spell-checker bot in my temp repo for over 1.5 years. The bot runs every Monday start and files an issue giving details about found typos/spelling mistakes in the mdn/content repo. The issue is resolved by fixing typos in mdn/content and adding new words in the project-words.txt file.

In the last weekly meeting, we got the green signal to move the bot to the content repo. The PR moves the bot to this repo.

The following changes have been made to existing files:

  • The project-words.txt file contains 5k+ ignored words gathered over the period.
  • The RegExs have been updated to handle umlauts in fragments, to allow valid=true case
  • There are multiple cases of favourite-colour. As the article is named in such a way, we have to ignore all the occurrences.

@OnkarRuikar OnkarRuikar requested review from a team as code owners July 29, 2024 13:22
@OnkarRuikar OnkarRuikar requested review from rebloor and pepelsbey and removed request for a team July 29, 2024 13:22
@github-actions github-actions bot added Content:WebExt WebExtensions docs system [PR only] Infrastructure and configuration for the project size/xl [PR only] >1000 LoC changed labels Jul 29, 2024
Copy link
Contributor

github-actions bot commented Jul 29, 2024

Preview URLs

External URLs (1)

URL: /en-US/docs/Web/HTTP/Status/204
Title: 204 No Content

(comment last updated: 2024-08-13 07:12:42)

Copy link
Contributor

@rebloor rebloor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the corrections to favourite-colour

Comment on lines 38 to 49
- name: Create an issue
if: env.OUTPUT != ''
uses: dacbd/create-issue-action@main
with:
token: ${{ secrets.GITHUB_TOKEN }}
title: Weekly spelling check
body: |
## Found typos and unknown words:
${{ env.OUTPUT }}

> [!TIP]
> Add valid words to the https://github.com/mdn/content/blob/main/.vscode/project-words.txt file, and fix the typos.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of creating one issue every week, is it possible to keep one dashboard issue that gets updated every week? No strong feelings, just because our issues tend to stack up and get forgotten—not everyone here has the habit of going through past issues.

Copy link
Contributor Author

@OnkarRuikar OnkarRuikar Aug 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope the typo issues will not linger more than a day. A maintainer won't even have to wait for a reviewer.

There are many things to handle. We'll have to reopen the issue every week and update it. We might need a custom implementation for this. If you know of any third-party action that does this out of the box, let me know.
How about we merge this as it is and get this bot going, then work on the recurring issue implementation?

.vscode/project-words.txt Outdated Show resolved Hide resolved
@OnkarRuikar OnkarRuikar marked this pull request as draft August 10, 2024 05:12
@Josh-Cena
Copy link
Member

@OnkarRuikar Perhaps you can split out the content change into a separate PR so it can be quickly merged? #35404

@github-actions github-actions bot added Content:HTML Hypertext Markup Language docs Content:HTTP HTTP docs and removed Content:WebExt WebExtensions docs labels Aug 13, 2024
@Josh-Cena
Copy link
Member

@OnkarRuikar Could you split all content changes into separate PRs? We don't want to loop more and more people into this.

@github-actions github-actions bot added size/m [PR only] 51-500 LoC changed and removed size/xl [PR only] >1000 LoC changed labels Aug 13, 2024
@github-actions github-actions bot added size/xl [PR only] >1000 LoC changed and removed Content:HTML Hypertext Markup Language docs Content:HTTP HTTP docs size/m [PR only] 51-500 LoC changed labels Aug 13, 2024
@OnkarRuikar OnkarRuikar marked this pull request as ready for review August 13, 2024 07:28
@Josh-Cena
Copy link
Member

Josh-Cena commented Aug 13, 2024

We already have project-words.txt right? Why two?

I see that they won't be added to editor suggestions. However I think in ignored-words.txt some of them are real words like "granularities". This is going to be tricky ;)

@OnkarRuikar
Copy link
Contributor Author

However I think in ignored-words.txt some of them are real words like "granularities". This is going to be tricky ;)

Because the bot was working standalone and the editor consideration was not there so all the words were accumulated in one file. There are 5578 words in the file, and classifying them is a huge time-consuming task. But we can do it later after the bot goes live.

If we could force contributors to use camel case variable names then a ton of words will get removed. 🙄 myvar vs myVar

OnkarRuikar and others added 2 commits August 27, 2024 14:56
Co-authored-by: Brian Thomas Smith <brian@smith.berlin>
@OnkarRuikar OnkarRuikar marked this pull request as ready for review August 27, 2024 13:33
@OnkarRuikar OnkarRuikar requested a review from bsmth August 27, 2024 13:33
@OnkarRuikar
Copy link
Contributor Author

I've successfully tested the workflow.
Created Issue: OnkarRuikar#29
The workflow run: https://github.com/OnkarRuikar/content/actions/runs/10591216755/job/29348284353#step:4:45

@bsmth
Copy link
Member

bsmth commented Aug 29, 2024

I've successfully tested the workflow. Created Issue: OnkarRuikar#29 The workflow run: https://github.com/OnkarRuikar/content/actions/runs/10591216755/job/29348284353#step:4:45

It looks good, tnx. I think we can merge this shortly. I would also echo the sentiment from Josh that we should try to reduce the 5k+ size dictionary by some means in a follow-up. Like:

  • moving actual technology terms from ignore-words to project-words
  • camelCasing variable names (requires content updates)
  • Some project documentation about the spellchecker, how to add words, what kinds of words go where (CONTRIBUTING.md?)

Copy link
Member

@bsmth bsmth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last thing I noticed; we have dupes in the ignore-words.txt file, could you uniq it first?

edit: Also if you think renaming the dictionaries will help with their purpose, we can do that in this PR, for instance ignored-words -> ignore-list?

@OnkarRuikar
Copy link
Contributor Author

I ran sort file | uniq file > file on the word list files. How about names project-words-list.txt and ignored-words-list.txt for the files?

@OnkarRuikar OnkarRuikar requested a review from bsmth August 29, 2024 10:38
@bsmth
Copy link
Member

bsmth commented Aug 29, 2024

I ran sort file | uniq file > file on the word list files.

nice, thank you

How about names project-words-list.txt and ignored-words-list.txt for the files?

IMO "words" is the part to get rid of in the ignore file, so I'd prefer to keep the original or do something like:

terms-abbreviations.txt
ignore-list.txt

What do you reckon?

@OnkarRuikar
Copy link
Contributor Author

@bsmth terms-abbreviations.txt and ignore-list.txt sound good I've renamed the files and updated rest of the content.

Copy link
Member

@bsmth bsmth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, let's give it a shot! Thanks, Onkar

@OnkarRuikar
Copy link
Contributor Author

Don't forget to trigger the workflow manually.

@bsmth bsmth merged commit 3f4f116 into mdn:main Aug 29, 2024
8 of 9 checks passed
@OnkarRuikar OnkarRuikar deleted the infra_spell_bot branch August 29, 2024 15:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size/xl [PR only] >1000 LoC changed system [PR only] Infrastructure and configuration for the project
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add US English spell checker/linter for markdown
4 participants