Associate `*.[1-9]` file extensions with Text #4258

Alhadis · 2018-08-31T18:38:20Z

This is an attempt to fix widespread misclassification of any file whose name ends in a dot-separated numeric suffix.

Description

Many files on GitHub receive an incorrect classification of "Roff", simply because their filenames end in a numeric suffix:

Gemfile-3.0.3
GFDL-1.2
AFL-1.2
mkfontscale-1.1.1

Change-logs and files with version numbers in their names are particularly problematic examples. The ValveSoftware/steam-runtime repository, for instance, was shown as 72.8% Roff (until I fixed it) because of three license files with version numbers in their names. haskell-CI/haskell-ci was another example (until fixed). You can find those and more by checking GitHub's list of trending Roff repositories.

Solution

To address this, I'm adding .{1..9} to the list of Text file extensions. We don't have "no language" as an option here, so using Text as an open-ended fallback for numeral-suffixed filenames (that aren't Roff) feels like a decent compromise.

The heuristics I've added basically go like this:

1. Choose "Roff" if the file contains a well-formed title macro:

    .TH PAGE_TITLE 1  \" man(7)
    .Dt PAGE_TITLE 3  \" mdoc(7)

2. Choose "Text" if the file has NO lines which match a well-formed Roff command. The commands I check for are those used for man-page authoring: man and mdoc. Other macro packages aren't accounted for because, by convention, documents using the me or ms macros have file extensions of that name.
3. If neither of the above conditions is met, fall back to the Bayesian classifier.
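The three steps above can be sketched roughly as follows. This is only an illustration, not the code from this PR: the `classify` method, both regular expressions, and the short list of man/mdoc requests are assumptions made for the example, and the real heuristic checks a larger set of commands.

```ruby
# Hypothetical sketch of the decision order described above.
# Neither the names nor the patterns come from Linguist itself.

# Title macros: `.TH` (man) or `.Dt` (mdoc) at the start of a line,
# followed by at least one argument.
TITLE_MACRO = /^\.[ \t]*(?:TH|Dt)[ \t]+\S/

# A deliberately small, assumed subset of man/mdoc requests.
ROFF_COMMAND = /^\.[ \t]*(?:SH|SS|PP|TP|IP|Sh|Ss|Pp|Nm|Nd|Os|Dd|BR|br|B|I)\b/

def classify(data)
  # Step 1: a well-formed title macro settles it.
  return "Roff" if data.match?(TITLE_MACRO)
  # Step 2: no recognisable Roff command anywhere means plain text.
  return "Text" unless data.match?(ROFF_COMMAND)
  # Step 3: undecided, so defer to the Bayesian classifier.
  nil
end

puts classify(".TH ATOM 1\n.SH NAME\n").inspect
puts classify("GNU Free Documentation License\nVersion 1.2\n").inspect
puts classify(".SH NAME\nfoo \\- bar\n").inspect
```

Returning `nil` for the undecided case is a design choice of this sketch; it simply marks the point where another strategy would take over.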

Update: I've also added two Text aliases to accommodate change-logs. Emacs uses -*- change-log -*-, and Vim uses changelog.

Checklist:

I am fixing a misclassified language

I have included a new sample for the misclassified language:

| Sample | Source | License |
|--------|--------|---------|
| `atom.1` | Alhadis/.files | ISC |
| `gather_profile_stats.1` | mozilla/home-snippets-server-lib | BSD (mentioned in page source) |
| `socket.2` | OpenBSD's source tree | ISC |
| `pcre32.3` | PCRE's manpages | BSD |
| `textmate.5` | Alhadis/.files | ISC |
| `dunelegacy.6` | spellbindguy/my_slackbuilds | Unclear, but "it may be freely used by anyone for any purpose" |
| `utf.7` | Plan9's `libutf` library | Lucent (Permissive) |
| `rc.8` | OpenBSD's source tree | ISC |
| `cpumem_get.9` | OpenBSD's source tree | ISC |
| `mkfontscale-1.1.1` | X11 App ports | X Consortium |
| `AFL-1.2` | mmoselhy/poky-ml507 | N/A? (Do literal licenses have their own licenses?) |
| `GFDL-1.2` | ValveSoftware/steam-runtime | See above |
| `ChangeLog-1.9.3` | johnl/deb-ruby1.9.1 (truncated) | BSD (assumed based on `COPYING`) |
| `_md5sums-7.0.4` | qrux/xlfs | N/A? (Data generated from a public list of files) |
| `CC-BY-2.5` | mmoselhy/poky-ml507 | N/A; assuming Creative Commons (?) |
| `hierarchyThreshold.6` | apache-log4j-1.2.16 | Apache |
| `NEWS-1.8.7` | ruby/ruby | BSD |
| `random-numbers.8` | Myself | ISC |
| `pathological.9` | Myself | ISC |

I have included a change to the heuristics to distinguish my language from others using the same extension.

/cc @smola

pchaigno · 2018-08-31T20:49:35Z

Do non-Roff files ending with a number always (or most of the time) follow the pattern `.*-(\d+\.)+\d`?

Alhadis · 2018-08-31T20:52:42Z

Release versions like those above do, yeah.

But it's also pretty common for somebody to track a duplicate download, which gets saved with a numeric extension (wget does this, for instance).

pchaigno · 2018-08-31T21:02:39Z

> But it's also pretty common for somebody to track a duplicate download, which gets saved with a numeric extension (wget does this, for instance).

Is that something we should even try to detect though? Those files could be anything, right?

Alhadis · 2018-08-31T21:30:15Z

Yes, as I explained under the Solution section:

> We don't have "no language" as an option here, so using Text as an open-ended fallback for numeral-suffixed filenames (that aren't Roff) feels like a decent compromise.

Classifying them as text is the safest thing to do, since the files in question are literally text. It's as close as we can get to an "unspecified".

pchaigno · 2018-09-01T12:27:51Z

I've read that. It's fine if the files' content is text from a human language (as in licenses for example). We cannot, however, add samples whose contents match another language under the Text entry because that would skew the Bayesian classifier, making it unable to distinguish properly between Text and the other language and lessening its ability to recognize Text.

I think Roff is one of these cases that we are not going to be able to solve entirely without a real not-a-language option. Since you identified that many Text files with a number as extension have filenames ending with a version number, I was thinking we could use that as a much simpler heuristic rule. We would have to feed the filename as input to the Heuristic strategy though. What do you think?

PS: The Text samples already contain two files whose contents match other languages. I'll make a pull request to move those.

Alhadis · 2018-09-01T12:33:16Z

> We would have to feed the filename as input to the Heuristic strategy though. What do you think?

Wait, we can do that? 💚 God I wish I'd known that sooner. Yes, deffo.

pchaigno · 2018-09-01T12:41:51Z

@Alhadis Do you want to take a stab at it?

Alhadis · 2018-09-01T12:51:18Z

Sure, I'll have a crack. :)

@pchaigno Errr... how lenient do we wanna be with what can be considered a "version number"? =) name1.1 or name-1.2.1 or... 😕

smola · 2018-09-03T08:00:23Z

Something like this would be great, see my results while getting Roff files: https://github.com/smola/language-dataset/tree/master/samples/Roff

Maybe detecting a `[-.][0-9]+(\.[0-9]+)+` suffix would do a decent job?

Alhadis · 2018-09-03T08:17:29Z

That sounds like it'd work. 👍

On it.

pchaigno · 2018-09-03T08:21:43Z

> Maybe detecting a `[-.][0-9]+(\.[0-9]+)+` suffix would do a decent job?

👍
Maybe `[-.](?:\d++\.)++\d$` if we care about performance a lot (I'm not completely sure about the second possessive quantifier).

Alhadis · 2018-09-03T08:48:31Z

Good point.

[0-9] is probably faster than \d, BTW. Oniguruma is Unicode-aware by default, meaning \d matches a wider range of codepoints than simple ASCII.
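To make the proposed rule concrete, here is a small sketch that applies smola's suggested pattern, written with `[0-9]`, to a few of the filenames mentioned in this thread. The `\z` anchor, the constant name, and the `versioned_name?` helper are assumptions added for the example; they are not part of the PR.

```ruby
# smola's suggested suffix pattern, anchored to the end of the name.
# The \z anchor and the names below are assumptions for this sketch.
VERSION_SUFFIX = /[-.][0-9]+(?:\.[0-9]+)+\z/

# True when the filename ends in something like "-1.2" or ".1.9.3",
# i.e. at least two dot-separated numeric components.
def versioned_name?(filename)
  filename.match?(VERSION_SUFFIX)
end

%w[Gemfile-3.0.3 GFDL-1.2 mkfontscale-1.1.1 atom.1 socket.2 name1.1].each do |name|
  puts format("%-20s %s", name, versioned_name?(name))
end
```

Under this pattern a lone numeric extension such as `atom.1` does not count as a version string, so genuine man pages would still be left to the content-based checks.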

pchaigno · 2018-09-03T08:53:03Z

Oh, right. I forgot that!

Alhadis · 2018-09-03T09:06:05Z

Should I remove the current Text heuristic, or have it run after the filename is checked for version numbers?

Alhadis · 2018-09-03T09:26:55Z

Actually, what I should be asking: how do I access the filename from within the heuristic's body?

elsif /[-.](?>[0-9]+\.)++[0-9]$/.match(WELL_OOPS) || !roff_match.match(data)
  Language["Text"]
end

Is the data variable just a String object, or does it have some additional file-specific properties that I can access?

pchaigno · 2018-09-03T09:44:54Z

> Is the data variable just a String object, or does it have some additional file-specific properties that I can access?

data is just a String (or the Ruby equivalent), but all strategies have the blob. blob.name is the filename, which you'll have to pass to heuristic.call in addition to data. Or you can pass the blob to heuristic.call.

https://github.com/github/linguist/blob/b5e2687b2ebe02a6f30943b30ddd8ee08322881c/lib/linguist/heuristics.rb#L21-L24
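As a rough illustration of the first option (passing the filename alongside `data`), the sketch below uses stand-in classes. `Blob`, `Heuristic`, and `text_by_name` here are hypothetical and much simpler than Linguist's real classes; only the idea of forwarding `blob.name` is taken from the comment above.

```ruby
# Hypothetical stand-ins; Linguist's real Blob and Heuristic differ.
Blob = Struct.new(:name, :data)

class Heuristic
  def initialize(&block)
    @heuristic = block
  end

  # Forward both arguments to the wrapped block. Accepting `filename`
  # without passing it on would leave the block unable to see it.
  def call(data, filename = nil)
    @heuristic.call(data, filename)
  end
end

# Example rule: pick "Text" when the name ends in a version string.
text_by_name = Heuristic.new do |_data, filename|
  "Text" if filename && filename.match?(/[-.][0-9]+(?:\.[0-9]+)+\z/)
end

blob = Blob.new("GFDL-1.2", "GNU Free Documentation License\n")
puts text_by_name.call(blob.data, blob.name).inspect
```

Keeping `filename` optional means existing callers that only pass `data` keep working, which is the same trade-off as a default argument of `nil`.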

Alhadis · 2018-09-03T10:01:11Z

Okay, heuristic updated. I've amended the OP with the sources of the new sample files too.

I've also added two aliases for change-log modelines. Emacs uses a hyphenated change-log, while Vim uses changelog.

pchaigno · 2018-09-03T11:46:38Z

lib/linguist/heuristics.rb

@@ -70,7 +71,7 @@ def matches?(filename, candidates)
    end

    # Internal: Perform the heuristic
-    def call(data)
+    def call(data, filename = nil)
      @heuristic.call(data)


Are you sure this works? It's weird that filename is not used here, no?

... I'm impressed at how I managed to overlook that, what the hell. 😂

I should really add a test for this.

Will this suffice?

def test_1to9_by_heuristics
  assert_heuristics({
    "Roff" => all_fixtures("Roff", "*.{1..9}"),
    "Text" => all_fixtures("Text", "*.{1..9}")
  })
end

Alhadis · 2018-09-07T09:19:44Z

Okay, merged and updated to accommodate our new heuristic format.

@smola Could you review this, please?

larsbrinkhoff · 2018-09-07T09:19:48Z

Note: Some Forth files use .4. Maybe not enough of them to care about. Then again, maybe it's easy to do it right?

Alhadis · 2018-11-08T17:11:09Z

> Maybe we could keep this heuristic but without the negative_pattern?

I'd rather not. If github/markup#1196 is to go through, it needs to avoid rendering misclassified .1 files as markup. Doing so will likely render their contents unintelligible, and users will voice confusion about why their log files have suddenly become mangled when viewed on GitHub...

pchaigno · 2018-11-08T18:30:19Z

Ok. Do we have an evaluation of the Bayesian classifier's accuracy for these files?

Alhadis · 2018-11-09T08:25:04Z

On second thought, perhaps it's best to use both the heuristics and a strategy. At least for now.

The strategy proposed by #4317 doesn't catch certain (legal) constructs in man page headers, such as macros containing content lines. It could, but that was where I drew the line between accuracy and complexity. Currently, it only identifies well-formed, "obvious" man pages with irregular extensions, entrusting other strategies to identify more common cases.

Alhadis · 2018-12-06T14:27:38Z

Actually, discard everything I wrote in the comment above. I'm working out a simpler solution which will help github/markup#1196 render the right pages and eliminate the need for an overly-specific manpage strategy.

TL;DR — Just continue reviewing this PR as per normal. 😉 Quick, before the bot bites it.

lildude · 2019-01-31T14:39:21Z

I guess this PR is going to need reworking due to #4393, right?

Alhadis · 2019-02-01T06:26:45Z

Yes. Most definitely, especially since a strategy for unrecognised man page extensions is still warranted (stuff like .1foo, etc). I'd hold off merging this for now.

stale · 2019-03-15T17:01:49Z

This pull request has been automatically marked as stale because it has not had recent activity, and will be closed if no further activity occurs. If this pull request was overlooked, forgotten, or should remain open for any other reason, please reply here to call attention to it and remove the stale status. Thank you for your contributions.

Alhadis · 2019-03-16T07:41:02Z

I have no idea what this means. 😭

  1) Failure:
TestHeuristics#test_1to9_by_heuristics [/home/travis/build/github/linguist/test/test_heuristics.rb:23]:
no fixtures for Roff *.{1..9}

There are fixtures at samples/Roff/*.{1..9}... 😕
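One plausible explanation, consistent with the later "Fix brace expansion syntax" commit and the retitled PR, is that `{1..9}` is a shell range expansion that Ruby's glob matching does not perform. The sketch below uses only `File.fnmatch` from Ruby's core library; the filename `atom.1` is just an example, and the test helper `all_fixtures` is not reproduced here.

```ruby
# `{1..9}` is a shell range expansion. Ruby's glob matching only
# understands comma-separated alternatives such as `{1,2,3}`, plus
# bracket ranges such as `[1-9]`.
range_style   = File.fnmatch("*.{1..9}", "atom.1", File::FNM_EXTGLOB)
bracket_style = File.fnmatch("*.[1-9]", "atom.1")
list_style    = File.fnmatch("*.{1,2,3}", "atom.1", File::FNM_EXTGLOB)

puts "range:   #{range_style}"
puts "bracket: #{bracket_style}"
puts "list:    #{list_style}"
```

If the fixture lookup globs the samples directory in a similar way, a `*.[1-9]` pattern would be the portable way to express "any single-digit extension from 1 to 9".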

lildude · 2019-03-16T09:43:46Z

lib/linguist/heuristics.yml

@@ -18,6 +18,7 @@
 #                       regular expression (with union).
 # and                 - An and block merges multiple rules and checks that all of
 #                       of them must match.
+# filename_pattern    - Same as pattern, but tests filenames instead of content.


What's this for? It doesn't appear to be used anywhere.

Oh, that. This was an addition to the heuristics format to accommodate Text files whose names end with version numbers (GPL-2.2 and the like). I must've overlooked the heuristic that used it when resolving merge conflicts, but on second thought, I'm not sure it's even needed...

I recall it was @pchaigno's suggestion. @pchaigno, do we still need this? If not, we can simplify the format by removing the filename_pattern feature, but it might come in handy in future...

smola · 2020-07-05T14:49:35Z

lib/linguist/languages.yml

  extensions:
  - ".txt"
+  - ".1"


What about .0?

We don't associate that extension with Roff, because there's never been a man0 section…

Alhadis · 2020-07-22T18:14:03Z

Revisiting this: I'm wondering if we shouldn't just remove the problematic suffixes altogether and entrust our man-page strategy to identify numeric file-extensions. @lildude, what do you think?

lildude · 2020-07-23T09:18:02Z

> Revisiting this: I'm wondering if we shouldn't just remove the problematic suffixes altogether and entrust our man-page strategy to identify numeric file-extensions. @lildude, what do you think?

Makes sense to me if you think that'll do the trick. I'm also happy to close this if you prefer as I don't think anyone has raised an issue about the current behaviour 😉

Alhadis · 2020-07-23T09:23:02Z

Agreed. Probably better to start anew than turn this PR inside-out. 😂

Alhadis added 2 commits September 1, 2018 01:21

Associate .{1..9} file extensions with Text

baa4cad

This is an attempt to fix widespread misclassification of any files that end in a numeric suffix, like `changelog.8' or `NEWS-1.2'.

Replace ambiguously licensed sample files

13eae15

Replace code-like Text samples with English texts

fb94d5e

Alhadis added 2 commits September 3, 2018 19:31

Merge branch 'master' into numeric-suffixes

96e6354

Add aliases for Emacs and Vim change-log modes

c3a414e

Check filenames for trailing version strings

faf7c6f

pchaigno reviewed Sep 3, 2018

Fix incorrigible blunder and add heuristic test

721f6e2

pchaigno mentioned this pull request Sep 3, 2018

Specify disambiguation rules in YAML #4087

Merged

2 tasks

Alhadis added 2 commits September 7, 2018 19:15

Merge branch 'master' into numeric-suffixes

efdb86f

Include support for Emacs's outline-mode

f2ce033

Alhadis mentioned this pull request Feb 13, 2019

Fix repository's incorrect language-classification on GitHub svi-opensource/libics#14

Merged

Alhadis added a commit that referenced this pull request Feb 24, 2019

Add strategy to identify Roff man pages

e79e407

References: #4258, #4309, #4317

Alhadis mentioned this pull request Feb 24, 2019

Add strategy to identify Roff man pages: Take 2 #4433

Merged

2 tasks

stale bot added the Stale label Mar 15, 2019

Alhadis removed the Stale label Mar 16, 2019

Alhadis self-assigned this Mar 16, 2019

Merge branch 'master' into numeric-suffixes

328b834

Resolves conflicts with lib/linguist/heuristics.yml.

lildude reviewed Mar 16, 2019

lildude reviewed Mar 16, 2019

Fix brace expansion syntax

d860473

Alhadis changed the title ~~Associate .{1..9} file extensions with Text~~ Associate *.[1-9] file extensions with Text Mar 16, 2019

Alhadis added a commit that referenced this pull request Aug 12, 2019

Add strategy to identify Roff man pages: Take 2 (#4433)

2991273

* Add strategy to identify Roff man pages

  References: #4258, #4309, #4317

* Remove `# coding: utf-8` junk injected by accident

Merge remote-tracking branch 'origin/master' into numeric-suffixes

4c5321c

smola reviewed Jul 5, 2020

Alhadis closed this Jul 23, 2020

Alhadis deleted the numeric-suffixes branch July 23, 2020 09:23

Alhadis mentioned this pull request Aug 5, 2020

Add support for generic file extensions #4936

Merged

Alhadis mentioned this pull request Oct 22, 2020

Mark .{1..9} as generic file-extensions #5059

Merged

2 tasks

github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024