Associate `*.[1-9]` file extensions with Text #4258
This is an attempt to fix widespread misclassification of any files that end in a numeric suffix, like `changelog.8` or `NEWS-1.2`.
Do non-Roff files ending with a number always (or most of the time) follow the pattern |
Release versions like those above do, yeah. But it's also pretty common for somebody to track a duplicate download, which gets saved with a numeric extension ( |
Is that something we should even try to detect though? Those files could be anything, right?
Yes, as I explained under the Solution section:
Classifying them as text is the safest thing to do, since the files in question are literally text. It's the closest we can get to an "unspecified".
I've read that. It's fine if the files' content is text from a human language (as in licenses for example). We cannot, however, add samples whose contents match another language under the Text entry because that would skew the Bayesian classifier, making it unable to distinguish properly between Text and the other language and lessening its ability to recognize Text. I think Roff is one of these cases that we are not going to be able to solve entirely without a real not-a-language option. Since you identified that many Text files with a number as extension have filenames ending with a version number, I was thinking we could use that as a much simpler heuristic rule. We would have to feed the filename as input to the Heuristic strategy though. What do you think? PS: The Text samples already contain two files whose contents match other languages. I'll make a pull request to move those.
Wait, we can do that? 💚 God I wish I'd known that sooner. Yes, deffo.
@Alhadis Do you want to take a stab at it?
Sure, I'll have a crack. :) @pchaigno Errr... how lenient do we wanna be with what can be considered a "version number"? =)
Something like this would be great, see my results while getting Roff files: https://github.com/smola/language-dataset/tree/master/samples/Roff Maybe detecting |
That sounds like it'd work. 👍 On it.
👍
Good point.
Oh, right. I forgot that!
Should I remove the current |
Actually, what I should be asking: how do I access the filename from within the heuristic's body?
elsif /[-.](?>[0-9]+\.)++[0-9]$/.match(WELL_OOPS) || !roff_match.match(data)
  Language["Text"]
end
Is the |
|
Okay, heuristic updated. I've amended the OP with the sources of the new sample files too. I've also added two aliases for change-log modelines. Emacs uses a hyphenated |
lib/linguist/heuristics.rb
Outdated
@@ -70,7 +71,7 @@ def matches?(filename, candidates)
       end

       # Internal: Perform the heuristic
-      def call(data)
+      def call(data, filename = nil)
         @heuristic.call(data)
Are you sure this works? It's weird that `filename` is not used here, no?
... I'm impressed at how I managed to overlook that, what the hell. 😂
I should really add a test for this.
Will this suffice?
def test_1to9_by_heuristics
  assert_heuristics({
    "Roff" => all_fixtures("Roff", "*.{1..9}"),
    "Text" => all_fixtures("Text", "*.{1..9}")
  })
end
👍
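For reference, the oversight flagged in this thread is that `call` accepts `filename` but never passes it on to the wrapped block. A minimal sketch of the corrected forwarding follows. It assumes a stand-alone `Heuristic` class that is a simplified, hypothetical stand-in rather than Linguist's actual implementation, and the `VERSION_SUFFIX` pattern is a non-possessive simplification of the regex quoted earlier in the conversation.

```ruby
# Hypothetical, simplified stand-in for the heuristic wrapper in the diff above.
class Heuristic
  def initialize(&block)
    @heuristic = block
  end

  # Forward both arguments so the block can inspect the filename.
  # Blocks that only declare |data| still work, because procs ignore extra arguments.
  def call(data, filename = nil)
    @heuristic.call(data, filename)
  end
end

# Assumed pattern: a "-" or "." followed by a dotted version number at the end of the name.
VERSION_SUFFIX = /[-.](?:[0-9]+\.)+[0-9]$/

by_name = Heuristic.new do |_data, filename|
  filename && VERSION_SUFFIX.match?(filename) ? "Text" : nil
end

content_only = Heuristic.new { |data| data.empty? ? nil : "Text" }

puts by_name.call("", "NEWS-1.8.7").inspect
puts by_name.call("", "atom.1").inspect
puts content_only.call("some text", "ignored.1").inspect
```

With the original `@heuristic.call(data)`, the block would always receive `nil` for `filename`, so a filename-based rule could never fire.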
Okay, merged and updated to accommodate our new heuristic format. @smola Could you review this, please?
Note: Some Forth files use |
I'd rather not. If github/markup#1196 is to go through, it needs to avoid rendering misclassified |
Ok. Do we have an evaluation of the Bayesian classifier's accuracy for these files?
On second thought, perhaps it's best to use both the heuristics and a strategy. At least for now. The strategy proposed by #4317 doesn't catch certain (legal) constructs in man page headers, such as macros containing content lines. It could, but that was where I drew the line between accuracy and complexity. Currently, it only identifies well-formed, "obvious" man pages with irregular extensions, entrusting other strategies to identify more common cases.
Actually, discard everything I wrote in the comment above. I'm working out a simpler solution which will help github/markup#1196 render the right pages and eliminate the need for an overly-specific manpage strategy. TL;DR: Just continue reviewing this PR as per normal. 😉 Quick, before the bot bites it.
I guess this PR is going to need reworking due to #4393, right?
Yes. Most definitely, especially since a strategy for unrecognised man page extensions is still warranted (stuff like |
This pull request has been automatically marked as stale because it has not had recent activity, and will be closed if no further activity occurs. If this pull request was overlooked, forgotten, or should remain open for any other reason, please reply here to call attention to it and remove the stale status. Thank you for your contributions.
Resolves conflicts with lib/linguist/heuristics.yml.
I have no idea what this means. 😭
There are fixtures at |
@@ -18,6 +18,7 @@
 #   regular expression (with union).
 # and - An and block merges multiple rules and checks that all of
 #   of them must match.
+# filename_pattern - Same as pattern, but tests filenames instead of content.
What's this for? It doesn't appear to be used anywhere.
Oh, that. This was an addition to the heuristics format to accommodate `Text` files which ended with version numbers (stuff like `GPL-2.2`). I must've overlooked the heuristic that used it when resolving merge conflicts, but on second thought, I'm not sure it's even needed...
I recall it was @pchaigno's suggestion. @pchaigno, do we still need this? If not, we can simplify the format by removing the `filename_pattern` feature, but it might come in handy in future...
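To make the `filename_pattern` idea concrete, here is a rough sketch of how a rule carrying that key could be evaluated next to a content `pattern`. Everything in it is an assumption for illustration: `rule_matches?` and the hash layout are hypothetical and not Linguist's implementation, and the version-number pattern is a simplified form of the regex quoted earlier in this conversation.

```ruby
# Hypothetical rule evaluator. A rule may carry "pattern" (tested against content),
# "filename_pattern" (tested against the file's basename), or both.
# A key that is absent is treated as "no constraint".
def rule_matches?(rule, data, filename)
  content_ok  = rule["pattern"].nil? ||
                Regexp.new(rule["pattern"]).match?(data)
  filename_ok = rule["filename_pattern"].nil? ||
                Regexp.new(rule["filename_pattern"]).match?(File.basename(filename.to_s))
  content_ok && filename_ok
end

# Assumed rule: treat names ending in a dotted version number as Text.
text_rule = {
  "language"         => "Text",
  "filename_pattern" => '[-.](?:[0-9]+\.)+[0-9]$'
}

puts rule_matches?(text_rule, "", "licenses/GFDL-1.2")
puts rule_matches?(text_rule, "", "man/atom.1")
```

Testing only the basename keeps version-like directory names from triggering the rule.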
  extensions:
  - ".txt"
  - ".1"
What about `.0`?
We don't associate that extension with Roff, because there's never been a `man0` section…
Revisiting this: I'm wondering if we shouldn't just remove the problematic suffixes altogether and entrust our man-page strategy to identify numeric file-extensions. @lildude, what do you think?
Makes sense to me if you think that'll do the trick. I'm also happy to close this if you prefer as I don't think anyone has raised an issue about the current behaviour 😉
Agreed. Probably better to start anew than turn this PR inside-out. 😂
This is an attempt to fix widespread misclassification of any file whose name ends in a dot-separated numeric suffix:
Description
Many files on GitHub receive an incorrect classification of "Roff", simply because their filenames end in a numeric suffix:
Change-logs and version-numbers are particularly problematic examples of these. The ValveSoftware/steam-runtime repository, for instance, was shown as 72.8% Roff (until I fixed it), because of three license files with version-numbers. haskell-CI/haskell-ci was another example (until fixed). You can find those and more by checking GitHub's list of trending Roff repositories.
Solution
To address this matter, I'm adding `.{1..9}` to the list of Text file extensions. We don't have `no language` as an option here, so using `Text` as an open-ended fallback for numeral-suffixed filenames (that aren't Roff) feels like a decent compromise.
The heuristics I've added basically go like this:
The commands I check for are those used for man-page authoring: `man` and `mdoc`. Other macro packages aren't accounted for because, by convention, documents using the `me` or `ms` macros have file extensions of that name.
Update: I've also added two `Text` aliases to accommodate change-logs. Emacs uses `-*- change-log -*-`, and Vim uses `changelog`.
Checklist:
I am fixing a misclassified language
I have included a new sample for the misclassified language:
atom.1
gather_profile_stats.1
socket.2
pcre32.3
textmate.5
dunelegacy.6
utf.7
libutf
libraryrc.8
cpumem_get.9
mkfontscale-1.1.1
AFL-1.2
GFDL-1.2
ChangeLog-1.9.3
COPYING
_md5sums-7.0.4
CC-BY-2.5
hierarchyThreshold.6
NEWS-1.8.7
random-numbers.8
pathological.9
I have included a change to the heuristics to distinguish my language from others using the same extension.
/cc @smola
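The content check described under Solution (look for `man` or `mdoc` macros, otherwise fall back to Text) can be sketched roughly as follows. The macro names are an assumption: `.TH` and `.SH` come from the `man` package and `.Dd`, `.Dt` and `.Sh` from `mdoc`, but the exact pattern this PR used is not shown in the thread, and `classify_numeric_suffix` is a hypothetical helper name.

```ruby
# Assumed macro set: .TH/.SH (man) and .Dd/.Dt/.Sh (mdoc).
# A control line starts with "." or "'" and may have blanks before the macro name.
MANPAGE_MACRO = /^[.'][ \t]*(?:TH|SH|Dd|Dt|Sh)(?=[ \t]|$)/

# Hypothetical helper: decide between Roff and Text for a file such as "foo.1".
def classify_numeric_suffix(data)
  MANPAGE_MACRO.match?(data) ? "Roff" : "Text"
end

man_page  = ".TH ATOM 1\n.SH NAME\natom \\- a text editor\n"
mdoc_page = ".Dd May 1, 2018\n.Dt SOCKET 2\n.Sh NAME\n"
changelog = "Version 1.2\n  * Fixed a bug in the parser.\n"

puts classify_numeric_suffix(man_page)
puts classify_numeric_suffix(mdoc_page)
puts classify_numeric_suffix(changelog)
```

A check this small will miss Roff documents that use neither macro package, which matches the description's note that other packages are left to their own file extensions.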