Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add separate language-entry for Roff Manpages #4393

Merged
merged 4 commits into from
Feb 4, 2019
Merged

Conversation

Alhadis
Copy link
Collaborator

@Alhadis Alhadis commented Jan 25, 2019

This is a hack to support github/markup#1196 and @jordemort in his efforts to bring rendered man-pages to GitHub. This will be a warmly-received feature, but I sensed there'd be numerous pitfalls that would risk having the feature withdrawn unless we found some way to circumvent them.

I'm too exhausted to repeat the lengthy details I gave earlier, so I'm just copy+pasting everything from 11 hours ago. Hope it's coherent.


@jordemort Just a heads up I'm working on a solution, which I hope I can get landed before the next Linguist release. I'd appreciate a review from those involved here once it's done.

Basically, it's a hack. I'm adding a new language entry for Roff Manpages, which this PR should render instead of literally any Roff file, This new entry is "hidden" and grouped under Roff, meaning it'll continue to be classified as such. Ergo, nobody will notice any confusing distinction between "Roff" and "Roff Manpage" in language statistics.

I know the hack is inelegant, but all things considered, it's the only sane solution:

1. We don't want misclassified files with numeric suffixes being rendered as man-pages

People can forgive odd-looking highlighting in ChangeLog-1.9.3, or ignore the fact GFDL-1.2 is causing their repository to be classified as 49.5% Roff.

But people won't be as forgiving if those documents are no longer intelligible because of a botched attempt to render them as man-pages in mandoc(1). I opened a PR to address this issue a while back, but it's an awkward, everted solution. It could be drastically simplified to be made less airtight/paranoid, because we're no longer at risk of having random text-files rendered as Roff.

2. Mandoc can only render man-pages. That's it.

Roff, being a typesetting and document preparation language, has countless macro packages for things like memoranda, manuscripts, books, essays, theses, and virtually anything that involves arranging type for a printed medium. Mandoc is only designed to format manpages, meaning an .ms or .mm document will yield extremely shitty/unintelligible results were we to try rendering them with mandoc(1). So we need a way to ONLY render manpages, leaving any other Roff source as-is.

3. We can restrict rendering to well-formed man-pages only

... but we don't need to make our heuristics airtight, because if the manpage patterns don't match a valid .Dd & .Sh (if mdoc(7)) or a .TH & .SH (if man(7)), then it'll gracefully fall back to the "normal" Roff classification - e.g., what we have now.

There may be cases where either a man-page isn't using mdoc/man macros, or it uses the ancient original man macros, such as in historic repositories.

4. Authors have an "escape-hatch" if they don't want their man-pages to be rendered

By overriding the language of certain files, an author can exempt a manpage from being rendered by adding something like this to their .gitattributes:

# Don't render this as a document
includes/macros.1 linguist-language=Roff

It's important to remember Mandoc isn't a Roff interpreter the way Groff/Troff are. Many language elements are unsupported, and if an author decides mandoc's handling of their man-pages does no justice, they have a way to switch it off. We don't give them that opportunity if we render every Roff file the same way.

/cc @ischwarze, @RalphCorderoy

Checklist:

  • I am adding a "new" language.
    • The extensions of the "new" language is used in hundreds of repositories on GitHub.com.
    • I have updated the heuristics to distinguish my language from others using the same extension.
    • I have included a real-world usage sample for all extensions added in this PR:
Sample file License Link to source
samples/Roff/he.mdoc MIT Source
samples/Roff/as.1 Caldera Source
samples/Roff/roff.1in Caldera Source
samples/Roff/man.1m Caldera Source
samples/Roff/dsw.1x Caldera Source
samples/Roff/fstat.2 Caldera Source
samples/Roff/pow.3 Caldera Source
samples/Roff/printf.3in Caldera Source
samples/Roff/switch.3m Caldera Source
samples/Roff/qsort.3qt Caldera Source
samples/Roff/vt.3x Caldera Source
samples/Roff/cat.4 Caldera Source
samples/Roff/core.5 Caldera Source
samples/Roff/moo.6 Caldera Source
samples/Roff/greek.7 Caldera Source
samples/Roff/ino.8 Caldera Source
samples/Roff/vs.9 Caldera Source
samples/Roff Manpage/gather_profile_stats.man BSD Source
samples/Roff Manpage/jg-coding-style.7 ISC Source
samples/Roff Manpage/lyxclient.1in GPL 2.0 Source
samples/Roff Manpage/wump.6 Caldera Source

Every other manpage not listed was randomly plucked from OpenBSD's /usr/share/man/* directory,
with some random file-extension assigned to fulfil the required file-extension coverage for both languages.

@Alhadis
Copy link
Collaborator Author

Alhadis commented Jan 26, 2019

@lildude Once this goes through, I can begin work on a cleaner and simple strategy to identify man-pages with unrecognised file extensions. I ditched my earlier attempt for sane reasons, and the new one will be many times less complex because there's less concern over "well-formedness".

Copy link
Member

@lildude lildude left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not even going to try read and test all these regexes and will take your word for it 😄

LGTM.

@@ -28,6 +28,29 @@
#
---
disambiguations:
- extensions: ['.1', '.1in', '.1m', '.1x', '.2', '.3', '.3in', '.3m', '.3qt', '.3x', '.4', '.5', '.6', '.7', '.8', '.9', '.man', '.mdoc']
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Alhadis would it work also for manpages if their file name ends with .8.in?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, because handling .in and other unrecognised extensions will be the responsibility of a dedicated man page strategy. 👍

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So support for .8.in manpages would become later, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. 👍

- pattern: '^[.''][ \t]*Dd[ \t]+(?:[^"\s]+|"[^"\s]+")'
- pattern: '^[.''][ \t]*Dt[ \t]+(?:[^"\s]+|"[^"\s]+")[ \t]+"?(?:[1-9]|@[^\s@]+@)'
- pattern: '^[.''][ \t]*Os(?=$|[ \t])'
- pattern: '^[.''][ \t]*Sh[ \t]+(?i:NAME|"NAME"?)[ \t]*$'

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anything i'm saying here is neither an objection (because i think getting this started is valuable
and details can be polished later when needed) nor an OK (because i know next to nothing
about Linguist).

That said, your treatment of whitespace, in particular tab characters, is confusing.
In general, in roff macro invocations, a tab does *not* start a new argument
but is part of the preceding argument like any other character, without requiring
any quoting or escaping.  Try stuff like (in man(7)) ".BI b<tab>i": everything ends up
bold, nothing italic.  Similarly, ".BI b<space><tab><space>i" is not two
arguments "b" and "i", but *three* three arguments "b", "<tab>", and "i".
Furthermore, the whole point of quoting arguments is that quoted arguments can
contain spaces, so this:  |"[^"\s]+"  makes no sense to me.

So your regexes are too ambitious (trying to match various roff syntax details)
and too sloppy (getting roff syntax wrong in various respects) at the same time.
But i think they are good enough for getting started.
If you want to polish them, i'd recommend simplifying them.
There is little danger that any line starting with "^[.''][ \t]*Dd[ \t]" is not a .Dt macro, for example.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, in roff macro invocations, a tab does not start a new argument

I keep forgetting about that... oops. 😅

There is little danger that any line starting with "^[.''][ \t]*Dd[ \t]" is not a .Dt macro, for example.

Yes, that's true. Will amend.

- pattern: '^[.''][ \t]*Dd[ \t]+(?:[^"\s]+|"[^"\s]+")'
- pattern: '^[.''][ \t]*Dt[ \t]+(?:[^"\s]+|"[^"\s]+")[ \t]+"?(?:[1-9]|@[^\s@]+@)'
- pattern: '^[.''][ \t]*Os(?=$|[ \t])'
- pattern: '^[.''][ \t]*Sh[ \t]+(?i:NAME|"NAME"?)[ \t]*$'

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NAME|"NAME" fails to match non-English manual pages.
While that kind of failure is hard to avoid in apropos(1), being so picky here seems gratuitous:
you are not actually attempting to parse the section,
so there is little point in rejecting ".Sh NOMBRE".

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't aware there were manual pages circulating with non-English sections, so I've amended it match only files with a .Sh macro.

- language: Roff Manpage
and:
- pattern: '^[.''][ \t]*Dd[ \t]+(?:[^"\s]+|"[^"\s]+")'
- pattern: '^[.''][ \t]*Dt[ \t]+(?:[^"\s]+|"[^"\s]+")[ \t]+"?(?:[1-9]|@[^\s@]+@)'

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have occasionally (admittedly somewhat rarely) seen manual pages
where the second argument to .Dt was forgotten.
It might be better to not require it here.

- ".3in"
- ".3m"
- ".3qt"
- ".3x"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.3p is somewhat widespread for Perl manual pages.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And also .3pm is used for Perl modules manual pages.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neither of those extensions are currently associated with Roff, so adding them would be changing the scope of this pull-request. We can add them in a follow-up PR; I just want to get this merged before the next Linguist release, so @jordemort has a chance to test the new changes in #1196.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've been over this 'chase the suffix' thing elsewhere. There are many amongst man pages. I try and avoid installing software, but even on this Arch Linux I have a few.

$ find /usr/share/man -type f -printf '%f\n' |
> sed 's/\.gz$//; s/.*\.//' |
> sort -Vu |
> awk '$0+0 != l {print ""; l = $0+0} 1' |
> fmt |
> grep .
0p
1 1m 1p 1perl 1ssl 1x
2
3 3am 3bsd 3guile 3lua 3ocaml 3p 3pcap 3perl 3pm 3python 3r 3sharp 3ssl
3t 3tcl 3tiff 3x
4
5 5ssl
6
7 7ssl
8
n
$

Copy link
Collaborator Author

@Alhadis Alhadis Feb 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are simply (a subset of) the same extensions which the site already associates with Roff. =) Not adding anything new here; just making sure that only the right Roff files are rendered as manpages.

We'll tackle the missing extensions in a follow-up PR.

@@ -0,0 +1,78 @@
.Dd April 5, 2016
.Dt he 1
.Sh NAME

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this misclassified under samples/Roff?
Yeah sure, the author forgot the .Os macro, but I'd say .Dd, .Dt, .Sh is still sufficient to identify mdoc(7).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Linguist requires a sample for each file extension registered for a language; this was the best I could do. 😞

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've edited it to use a malformed .Dt line. These samples are needed for regression tests, so it needs to be invalid as a man-page.

# ".so links" are used like symlinks in manpage directories.
# Match only if target file uses a relevant file extension.
- language: Roff Manpage
pattern: '\A[.''][ \t]*so[ \t]+\S+?\.((?:[1-9](?![0-9])[a-z_0-9]*|0p|n|man|mdoc)(?:\.in)?)\s*\z'

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be better (and simpler and safer) to show the typical roff files containing
nothing but a single .so line verbatim rather than trying to render the file pointed to?
If the user wants to see the rendered manual page, they can always look at the target file.
It seems valuable to clearly show the file containing .so as a roff-style symlink.

Besides, i see nothing here requiring that the file only contains the single line.
So any file containing .so pointing to a man-page-style filename will be treated
as a manual page, even if there is much other stuff in the file before or after the .so?

Finally, calling mandoc(1) on such a link in the present context is very likely to fail.
These links typically read ".so man1/foo.1" but there is almost certainly no "man1"
directory in the source tree, so mandoc(1) cannot include the target file.  Besides,
mandoc is very picky, for security reasons, about which .so paths it will follow, and
if the path looks odd, it will render something like  "See the file man1/foo.1" which
is less useful than seeing the raw file content.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the user wants to see the rendered manual page, they can always look at the target file.

Okay, yeah, that's true. 😀 And the part about failing based on include paths is a good point, too. Will remove. 👍

Besides, i see nothing here requiring that the file only contains the single line.

The \A and \z anchors are like ^ and $, except they're guaranteed to match only the start and end of the entire input, respectively.

@Alhadis
Copy link
Collaborator Author

Alhadis commented Feb 1, 2019

@ischwarze, if the changes in fe3c124 look good to you, I'll go ahead and merge since @lildude has given approval. 👍

We can reiterate on this later if refinements are needed.

@Alhadis
Copy link
Collaborator Author

Alhadis commented Feb 4, 2019

Gonna merge this as-is, as it's blocking another PR that I'd like to get in before the next release (due sometime this week, I believe). @ischwarze, don't hesitate to speak up if the fixes look wrong to you. 👍

@Alhadis Alhadis merged commit 77e10e8 into master Feb 4, 2019
@Alhadis Alhadis deleted the manpage-markup-hack branch February 4, 2019 00:29
@ischwarze
Copy link

Gonna merge this as-is, as it's blocking another PR that I'd like to get in before the next release (due sometime this week, I believe).

I'm totally fine with you moving ahead as needed.

@ischwarze, don't hesitate to speak up if the fixes look wrong to you.

Your changes look reasonable to me.

If i read your REs correctly, the following would still be rejected:

.Sh "QUOTED TITLE

Note there is no quotation mark at the end, which is arguably not good style, but valid roff syntax.

I believe something like

^[.''][ \t]*Sh +\S

(assuming there are Perl REs), i.e. "Sh followed by at least one space character followed by something that is non-blank" would be simpler and more robust. Roff quoting rules are really nasty, and i don't see the point in even trying to re-implement part of them in a regular expression.

@Alhadis
Copy link
Collaborator Author

Alhadis commented Feb 5, 2019

If i read your REs correctly, the following would still be rejected:

Yes, I deliberately chose not to concern myself with nonstandard quoting, as that would greatly increase the complexity of the heuristics (and for little gain, since most in-the-wild usages generally have cleaner quotes than that).

Roff quoting rules are really nasty

Yep, don't I know it. 😉 The following highlighting was implemented using only regular expressions:

.BI "B
.BI "B""B
.BI "B" "I" "B
.BI "BBB B""B"II
.BI "BBB BBB BBB" III
.BI "BBB"" BBB \" CCC" CCC
.BI BBB""BBB III BBB" III
.BI BBB"""BBB III
.BI "BBB"""III BBB
.BI BBB\ BBB
.BI "BBB\ BBB"
.BI "BBB\\ BBB"
.BI "BBB\\\ BBB"

Alhadis added a commit to Cutlery-Drawer/markup that referenced this pull request Oct 23, 2020
@github-linguist github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants