Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to gengo #1305

Open
wants to merge 39 commits into
base: main
Choose a base branch
from

Conversation

spenserblack
Copy link
Collaborator

@spenserblack spenserblack commented Apr 8, 2024

See #1152 for previous conversation. With the drastic API changes to gengo it started to feel like more work to modify the existing PR instead of starting over.

To Do

Resolves #26
Closes #1152

To get this working with one language and then expand from there.
This doesn't do much but make the parts that used to use tokei panic.
Languages that aren't yet supported are commented out.
@spenserblack
Copy link
Collaborator Author

@o2sh Before this moves any further we need to determine if we want to move forward with analyzing a git ref (HEAD) or a folder (I think preferably we don't support both). I've started listing reasons for each here: #1303

src/info/blob_size.rs Outdated Show resolved Hide resolved
languages.get_statistics(&[&dir], &ignored, &tokei_config);
languages
) -> Result<gengo::Analysis, Box<dyn Error>> {
// TODO Determine best way to ignore files (and if that should continue to be handled by onefetch)
Copy link
Owner

@o2sh o2sh Apr 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to have language type and glob filtering for the V1

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or is this going to be handled via .gitattributes?

Copy link
Collaborator Author

@spenserblack spenserblack Apr 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That depends on the "file source." Right now the Git file source is highly opinionated and would use .gitattributes for ignoring files, overrides, etc.


I plan on the Directory file source being much less opinionated (can't get much more generic than just reading a folder 🙂). The functionality isn't there yet, but I was considering adding ways to configure excluding/including files. Right now .gitignore and .ignore files would ignore files via the ignore crate's default behavior. I was hoping to discuss the usage sometime, but it would probably either be like this:

let file_source = Directory::with_config("./", Config { ignored_files: Some(vec![]) });

or like this:

let file_source = Directory::new("./").extend_ignored_files(&[]);

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very clear, and it also aplies to the language type filter ?

Copy link
Collaborator Author

@spenserblack spenserblack Apr 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By default Gengo marks Programming, Markup, and Query as detectable, and Prose and Data as not detectable. detectable is basically a boolean that answers "should this factor in to the stats?" Also, documentation, generated files, and vendored files are not detectable by default. Assuming we use the Git file source with .gitattributes, this is example usage:

# Markdown is Prose, but we still want it in the stats
*.md gengo-detectable

# JavaScript is Programming, but we want to exclude it
*.js -gengo-detectable

# The contents of dist/ are not generated, so they will be detectable
dist/* -gengo-generated
# OR
# Even though dist/* is generated, it should still be included in the stats
dist/* gengo-detectable

By default any file that is not detectable is excluded from the Summary.


Again, the Git file source is highly opinionated and tries to behave a lot like github-linguist. The usage with the Directory file source is still undecided, and right now it doesn't implement any overrides. Or we could write our own file source (#1303 (reply in thread)) if we want very specific behavior that wouldn't be suitable for the gengo crate.

But, however we implement it, IMO the best way would be to inform gengo what files are and aren't detectable, and then make use of either the Analysis (detailed results) or Summary (simplified results).


tl;dr Yes, gengo would be the one handling included/excluded types, and we'd just pass configuration to it somehow.

languages
) -> Result<gengo::Analysis, Box<dyn Error>> {
// TODO Determine best way to ignore files (and if that should continue to be handled by onefetch)
let file_source = Git::new(dir, "HEAD")?;
Copy link
Owner

@o2sh o2sh Apr 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the pending changes taken into count for the language analysis?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be handled by gengo.
Remap languages
@spenserblack
Copy link
Collaborator Author

spenserblack commented Oct 22, 2024

Alright, after a long time procrastinating, I implemented the last of onefetch's languages (besides combining Bash, Zsh, and Sh into Shell).

One thing to note is that the MSRV has been bumped.

@spenserblack
Copy link
Collaborator Author

Bumping gix resulting in compiler errors (new usage). Just a moment.

@spenserblack spenserblack force-pushed the chore/26/switch-to-gengo-2 branch from afa0c1a to 47136f7 Compare October 23, 2024 22:11
@spenserblack spenserblack marked this pull request as ready for review October 23, 2024 22:12
@spenserblack spenserblack mentioned this pull request Oct 24, 2024
1 task
@spenserblack
Copy link
Collaborator Author

spenserblack commented Oct 24, 2024

After thinking about this a bit, I think an issue with tokei and other line-counting tools is that they specialize in counting lines, and, while language ID is a priority, not the top priority. Which leads to e.g. tokei having poor out-of-the-box support for Verilog and V (their *.v extensions were already taken). So line-counting tools are not great for properly identifying a language.

So I'd like to add line counts back at some point, possibly by using gengo as a dependency of some crate and doing something vaguely like

match Language {
    JupyterNotebook => my_super_special_jupyter_notebook_counter()
    _ => generic_counter(Language) // would use community-provided data similar to tokei's languages.json
}

And on that topic, I'd recommend looking into scc. It's a Go project similar to tokei, but also includes some really fun stats like "Estimated Cost to Develop".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect language detected (C++ as C, XML as TypeScript, etc.)
2 participants