Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Associate *.[1-9] file extensions with Text #4258

Closed
wants to merge 14 commits into from
Closed

Conversation

Alhadis
Copy link
Collaborator

@Alhadis Alhadis commented Aug 31, 2018

This is an attempt to fix widespread misclassification of any file whose name ends in a dot-separated numeric suffix:

Description

Many files on GitHub receive an incorrect classification of "Roff", simply because their filenames end in a numeric suffix:

Gemfile-3.0.3
GFDL-1.2
AFL-1.2
mkfontscale-1.1.1

Change-logs and version-numbers are particularly problematic examples of these. The ValveSoftware/steam-runtime, for instance, was shown as 72.8% Roff (until I fixed it), because of three license files with version-numbers. haskell-CI/haskell-ci was another example (until fixed). You can find those and more by checking GitHub's list of trending Roff repositories.

Solution

To address this matter, I'm adding .{1..9} to the list of Text file extensions. We don't have no language as an option here, so using Text as an open-ended fallback for numeral-suffixed filenames (that aren't Roff) feels like a decent compromise.

The heuristics I've added basically go like this:

  1. Choose "Roff" if file contains a well-formed title macro:
    .TH PAGE_TITLE 1  \" man(7)
    .Dt PAGE_TITLE 3  \" mdoc(7)
  2. Choose "Text" if file has NO lines which match a well-formed Roff command.
    The commands I check for are those used for man-page authoring: man and mdoc. Other macro packages aren't accounted for because by convention, documents using the me or ms macros have file extensions of that name.
  3. If neither of the above conditions are met, fall back to the Bayesian classifier.

Update: I've also added two Text aliases to accommodate change-logs. Emacs uses -*- change-log -*-, and Vim uses changelog.

Checklist:

/cc @smola

This is an attempt to fix widespread misclassification of any files that
end in a numeric suffix, like `changelog.8' or `NEWS-1.2'.
@pchaigno
Copy link
Contributor

Do non-Roff files ending with a number always (or most of the time) follow the pattern .*-(\d+\.)+\d?

@Alhadis
Copy link
Collaborator Author

Alhadis commented Aug 31, 2018

Release versions like those above do, yeah.

But it's also pretty common for somebody to track a duplicate download, which gets saved with a numeric extension (wget does this, for instance).

@pchaigno
Copy link
Contributor

But it's also pretty common for somebody to track a duplicate download, which gets saved with a numeric extension (wget does this, for instance).

Is that something we should even try to detect though? Those files could be anything, right?

@Alhadis
Copy link
Collaborator Author

Alhadis commented Aug 31, 2018

Yes, as I explained under the Solution section:

We don't have no language as an option here, so using Text as an open-ended fallback for numeral-suffixed filenames (that aren't Roff) feels like a decent compromise.

Classifying them as text is the safest thing to do, since the files in question are literally text. It's as best as we can get to an "unspecified".

@pchaigno
Copy link
Contributor

pchaigno commented Sep 1, 2018

I've read that. It's fine if the files' content is text from a human language (as in licenses for example). We cannot, however, add samples whose contents match another language under the Text entry because that would skew the Bayesian classifier, making it unable to distinguish properly between Text and the other language and lessening its ability to recognize Text.

I think Roff is one of these cases that we are not going to be able to solve entirely without a real no-a-language option. Since you identified that many Text files with a number as extension have filenames ending with a version number, I was thinking we could use that as a much simpler heuristic rule. We would have to feed the filename as input to the Heuristic strategy though. What do you think?

PS: The Text samples already contain two files whose contents match other languages. I'll make a pull request to move those.

@Alhadis
Copy link
Collaborator Author

Alhadis commented Sep 1, 2018

We would have to feed the filename as input to the Heuristic strategy though. What do you think?

Wait, we can do that? 💚 God I wish I'd known that sooner. Yes, deffo.

@pchaigno
Copy link
Contributor

pchaigno commented Sep 1, 2018

@Alhadis Do you want to take a stab at it?

@Alhadis
Copy link
Collaborator Author

Alhadis commented Sep 1, 2018

Sure, I'll have a crack. :)

@pchaigno Errr... how lenient do we wanna be with what can be considered a "version number"? =) name1.1 or name-1.2.1 or... 😕

@smola
Copy link
Contributor

smola commented Sep 3, 2018

Something like this would be great, see my results while getting Roff files: https://github.com/smola/language-dataset/tree/master/samples/Roff

Maybe detecting [-.][0-9]+(\.[0-9]+)+ suffix would do a decent job?

@Alhadis
Copy link
Collaborator Author

Alhadis commented Sep 3, 2018

That sounds like it'd work. 👍

On it.

@pchaigno
Copy link
Contributor

pchaigno commented Sep 3, 2018

Maybe detecting [-.][0-9]+(\.[0-9]+)+ suffix would do a decent job?

👍
Maybe [-.](?:\d++\.)++\d$ if we care about performance a lot (I'm not completely sure about the second possessive quantifier).

@Alhadis
Copy link
Collaborator Author

Alhadis commented Sep 3, 2018

Good point.

[0-9] is probably faster than \d, BTW. Oniguruma is Unicode-aware by default, meaning \d matches a wider range of codepoints than simple ASCII.

@pchaigno
Copy link
Contributor

pchaigno commented Sep 3, 2018

Oh, right. I forgot that!

@Alhadis
Copy link
Collaborator Author

Alhadis commented Sep 3, 2018

Should I remove the current Text heuristic, or have it run after the filename is checked for version numbers?

@Alhadis
Copy link
Collaborator Author

Alhadis commented Sep 3, 2018

Actually, what I should be asking: how do I access the filename from within the heuristic's body?

elsif /[-.](?>[0-9]+\.)++[0-9]$/.match(WELL_OOPS) || !roff_match.match(data)
  Language["Text"]
end

Is the data variable just a String object, or does it have some additional file-specific properties that I can access?

@pchaigno
Copy link
Contributor

pchaigno commented Sep 3, 2018

Is the data variable just a String object, or does it have some additional file-specific properties that I can access?

data is just a String (or the Ruby equivalent), but all strategies have the blob. blob.name is the filename, which you'll have to pass to heuristic.call in addition to data. Or you can pass the blob to heuristic.call.

https://github.com/github/linguist/blob/b5e2687b2ebe02a6f30943b30ddd8ee08322881c/lib/linguist/heuristics.rb#L21-L24

@Alhadis
Copy link
Collaborator Author

Alhadis commented Sep 3, 2018

Okay, heuristic updated. I've amended the OP with the sources of the new sample files too.

I've also added two aliases for change-log modelines. Emacs uses a hyphenated change-log, while Vim uses changelog.

@@ -70,7 +71,7 @@ def matches?(filename, candidates)
end

# Internal: Perform the heuristic
def call(data)
def call(data, filename = nil)
@heuristic.call(data)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure this works? It's weird that filename is not used here, no?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... I'm impressed at how I managed to overlook that, what the hell. 😂

I should really add a test for this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this suffice?

  def test_1to9_by_heuristics
    assert_heuristics({
      "Roff" => all_fixtures("Roff", "*.{1..9}"),
      "Text" => all_fixtures("Text", "*.{1..9}")
    })
  end

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@Alhadis
Copy link
Collaborator Author

Alhadis commented Sep 7, 2018

Okay, merged and updated to accommodate our new heuristic format.

@smola Could you review this, please?

@larsbrinkhoff
Copy link
Contributor

Note: Some Forth files use .4. Maybe not many enough to care about. Then again, maybe it's easy to do it right?

@Alhadis
Copy link
Collaborator Author

Alhadis commented Nov 8, 2018

Maybe we could keep this heuristic but without the negative_pattern?

I'd rather not. If github/markup#1196 is to go through, it needs to avoid rendering misclassified .1 files as markup. Doing so will likely render their contents unintelligible, and users will voice confusion on why their log files have suddenly become mangled when viewed on GitHub...

@pchaigno
Copy link
Contributor

pchaigno commented Nov 8, 2018

Ok. Do we have an evaluation of the Bayesian classifier's accuracy for these files?

@Alhadis
Copy link
Collaborator Author

Alhadis commented Nov 9, 2018

On second thought, perhaps it's best to use both the heuristics and a strategy. At least for now.

The strategy proposed by #4317 doesn't catch certain (legal) constructs in man page headers, such as macros containing content lines. It could, but that was where I drew the line between accuracy and complexity. Currently, it only identifies well-formed, "obvious" man pages with irregular extensions, entrusting other strategies to identify more common cases.

@Alhadis
Copy link
Collaborator Author

Alhadis commented Dec 6, 2018

Actually, discard everything I wrote in the comment above. I'm working out a simpler solution which will help github/markup#1196 render the right pages and eliminate the need for an overly-specific manpage strategy.

TL;DR — Just continue reviewing this PR as per normal. 😉 Quick, before the bot bites it.

@lildude
Copy link
Member

lildude commented Jan 31, 2019

I guess this PR is going to need reworking due to #4393, right?

@Alhadis
Copy link
Collaborator Author

Alhadis commented Feb 1, 2019

Yes. Most definitely, especially since a strategy for unrecognised man page extensions is still warranted (stuff like .1foo, etc). I'd hold off merging this for now.

@stale
Copy link

stale bot commented Mar 15, 2019

This pull request has been automatically marked as stale because it has not had recent activity, and will be closed if no further activity occurs. If this pull request was overlooked, forgotten, or should remain open for any other reason, please reply here to call attention to it and remove the stale status. Thank you for your contributions.

@stale stale bot added the Stale label Mar 15, 2019
@Alhadis Alhadis removed the Stale label Mar 16, 2019
@Alhadis Alhadis self-assigned this Mar 16, 2019
Resolves conflicts with lib/linguist/heuristics.yml.
@Alhadis
Copy link
Collaborator Author

Alhadis commented Mar 16, 2019

I have no idea what this means. 😭

  1) Failure:
TestHeuristics#test_1to9_by_heuristics [/home/travis/build/github/linguist/test/test_heuristics.rb:23]:
no fixtures for Roff *.{1..9}

There are fixtures at samples/Roff/*.{1..9}... 😕

test/test_heuristics.rb Outdated Show resolved Hide resolved
@@ -18,6 +18,7 @@
# regular expression (with union).
# and - An and block merges multiple rules and checks that all of
# of them must match.
# filename_pattern - Same as pattern, but tests filenames instead of content.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this for? It doesn't appear to be used anywhere.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that. This was an addition to the heuristics format to accommodate Text files which ended with version numbers (stuff like GPL-2.2 and stuff). I must've overlooked the heuristic that used it when resolving merge conflicts, but on second thought, I'm not sure it's even needed...

I recall it was @pchaigno's suggestion. @pchaigno, do we still need this? If not, we can simplify the format by removing the filename_pattern feature, but it might come in handy in future...

@Alhadis Alhadis changed the title Associate .{1..9} file extensions with Text Associate *.[1-9] file extensions with Text Mar 16, 2019
Alhadis added a commit that referenced this pull request Aug 12, 2019
* Add strategy to identify Roff man pages

References: #4258, #4309, #4317

* Remove `# coding: utf-8` junk injected by accident
extensions:
- ".txt"
- ".1"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about .0?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't associate that extension with Roff, because there's never been a man0 section…

@Alhadis
Copy link
Collaborator Author

Alhadis commented Jul 22, 2020

Revisiting this: I'm wondering if we shouldn't just remove the problematic suffixes altogether and entrust our man-page strategy to identify numeric file-extensions. @lildude, what do you think?

@lildude
Copy link
Member

lildude commented Jul 23, 2020

Revisiting this: I'm wondering if we shouldn't just remove the problematic suffixes altogether and entrust our man-page strategy to identify numeric file-extensions. @lildude, what do you think?

Makes sense to me if you think that'll do the trick. I'm also happy to close this if you prefer as I don't think anyone has raised an issue about the current behaviour 😉

@Alhadis Alhadis closed this Jul 23, 2020
@Alhadis
Copy link
Collaborator Author

Alhadis commented Jul 23, 2020

Agreed. Probably better to start anew than turn this PR inside-out. 😂

@Alhadis Alhadis deleted the numeric-suffixes branch July 23, 2020 09:23
@github-linguist github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants