
Improve comments using LLMs. #4013

Closed

Conversation

dcompoze
Contributor

@dcompoze dcompoze commented Apr 5, 2024

Hello, I recently submitted changes to fix spelling mistakes in this repository (#3808).

Based on that discussion, I also wanted to improve the general quality of comments across the repository (e.g. grammatical errors, missing periods, capitalization, Markdown formatting).

Simple spelling mistakes are relatively easy to detect and fix with regular spell checkers and some manual intervention, but the formatting and grammar of comments are harder to fix with those kinds of tools.

The approach I took here is to create a Rust parser (using nom) to extract comments from Rust files and feed them to the gpt-4-turbo large language model to suggest improvements.
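To illustrate the extraction step, here is a minimal, hypothetical stdlib-only sketch. The actual tool uses a nom-based parser; this simplified version only handles plain `//` line comments and does not account for `//` appearing inside string literals, which a real parser must handle.

```rust
/// Extracts `//` line comments from Rust source text, paired with their
/// 1-based line numbers (the format later fed to the model).
/// Simplified sketch: a real parser must skip `//` inside string literals.
fn extract_comments(source: &str) -> Vec<(usize, String)> {
    source
        .lines()
        .enumerate()
        .filter_map(|(i, line)| {
            line.find("//")
                .map(|pos| (i + 1, line[pos..].trim_end().to_string()))
        })
        .collect()
}

fn main() {
    let src = "fn main() {\n    // prints a greeting\n    println!(\"hi\");\n}";
    for (line, comment) in extract_comments(src) {
        println!("{line}: {comment}"); // prints: 2: // prints a greeting
    }
}
```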

The current prompt for the model is:

```
You are provided a list of comments from a Rust file in the Polkadot repository.
Each comment starts with a line number followed by a colon and then the contents of the comment.
Your task is to improve the spelling and grammar of comments and markdown formatting of doc comments.
You should format each comment according to the following rules:
- Comments should be full sentences, begin with a capital letter and end with a period.
- Doc comments should follow markdown formatting rules (e.g. code items should always use backticks).
- Improve comment grammar if it can be improved.
- If you encounter a British or American spelling of a given word, you should keep the one that is generally more common.
- If a comment is already well formatted, you should not modify it.
- Prefer capital letters for abbreviations such as 'ID' vs 'id' unless the text is part of code, such as 'UserId'.
- A single comment cannot exceed a 100 character width limit.
```
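The "line number, colon, contents" convention the prompt describes can be sketched as follows. This is a hypothetical helper, not the tool's actual code:

```rust
/// Serializes extracted comments into the prompt body format described
/// above: one comment per line, as "<line number>: <contents>".
fn build_prompt_body(comments: &[(usize, &str)]) -> String {
    comments
        .iter()
        .map(|(line, text)| format!("{line}: {text}"))
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let comments = [(3, "// handles the edge case"), (7, "/// Returns the user ID.")];
    println!("{}", build_prompt_body(&comments));
    // prints:
    // 3: // handles the edge case
    // 7: /// Returns the user ID.
}
```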

The suggested changes are then manually reviewed by me one by one.

This draft PR includes changes to 12 files in the bridges directory just to showcase what the resulting changes look like.

The cost of the OpenAI API requests is about 0.45 USD per 10 files of average size.

Given that the whole repository has about 3570 files, the estimate for the full repository would be somewhere around 160 USD in API credits.
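As a back-of-the-envelope check of that figure, using the numbers quoted above:

```rust
fn main() {
    // ~0.45 USD per 10 files, ~3570 files in the repository.
    let cost_per_file = 0.45 / 10.0;
    let total = 3570.0 * cost_per_file;
    println!("estimated total: {total:.2} USD"); // estimated total: 160.65 USD
}
```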

Note: I also tried this with local Ollama models, but the results were not as good as GPT-4.

Also, it takes me about 30-60 seconds to review the changes to each file, so doing this for all the files would be somewhere in the ballpark of 30-50 hours.

The verification process looks something like this from my end:

(Screenshot: selection-1712353756)

I wanted to gauge interest in these changes and, if there is interest, ask whether some funding could be provided to cover the API and time costs.

@joepetrowski @bkchr What do you think?

@dcompoze dcompoze closed this Jul 22, 2024