Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Format RFC links with anchors nicely #11809

Merged
merged 8 commits into from
Oct 6, 2024

Conversation

jstasiak
Copy link
Contributor

Feature or Bugfix

  • Feature

Purpose

Previously if we wrote

:rfc:`1234#section-2`

links would be formatted like

RFC 1234#section-2

which isn't quite nice. After this patch:

RFC 1234 Section 2

which looks and reads better.

There are some non-section anchors that I discovered when
looking at a random RFC, I included them all in unit tests.

Detail

Relates

Resolves: #7027

Previously if we wrote

    :rfc:`1234#section-2`

links would be formatted like

    RFC 1234#section-2

which isn't quite nice. After this patch:

    RFC 1234 Section 2

which looks and reads better.

Resolves: sphinx-doc#7027
@jstasiak
Copy link
Contributor Author

I'm not entirely sure if the build failures are actually related to the patch

sphinx/roles.py Outdated
"""
parts = target.replace('#', ' ').replace('-', ' ').split()
if len(parts) >= 2:
parts[1] = parts[1].title()
Copy link

@verhovsky verhovsky Dec 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RFC links can have section names, for example

https://datatracker.ietf.org/doc/html/rfc9076.html#name-risks-in-the-dns-data

and title casing arbitrary English text well is complicated, for example the above would be formatted as "RFC 9076 Name Risks in the Dns Data" which is worse than "RFC 9076#name-risks-in-the-dns-data" in my opinion. So you should only do this formatting for exact matches for a few specific cases, I think the ones you put in the test "Section" "Page" "Appendix" are good.

Or if you extract all possible links from all the RFCs and their corresponding natural language text, then you could write and test that this function perfectly extracts the original titles from all possible existing # links.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I didn't know that was possible – thank you.

What do you think we should do in that case? Leaving it be and rendering as RFC 9076#name-risks-in-the-dns-data is easy but it doesn't look great.

On the other hand I can't think of a good capitalization logic that would convert things like "risks-in-the-dns-data" to "Risks in the DNS data" reliably.

Something like this would be easy (when a name- prefix is detected in the anchor): RFC 9076 Section risks-in-the-dns-data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PS.

the above would be formatted as "Name Risks in the Dns Data"

Note that I'm currenly applying title() to the first word in that sequence, not the whole thing.

Copy link

@verhovsky verhovsky Dec 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good capitalization logic

If we're only limiting ourselves to links that appear on datatracker.ietf.org then as I said, we have a specific number of text documents (RFCs) with a specific number of human readable section names that are being lowercased and .replace(' ', '-')'d and we want to recover that specific set of original strings. The way we can do it is to extract every single possible link on the website and the actual original human readable text, which should be doable locally (i.e. without scraping) (I did something similar once here and here) since it's an open source website. There's like 9000 RFCs, if each has 100 links then that's a list of 900,000 2 item test cases that your function needs to convert correctly. Title casing (without "and" "as" "but" "for" "if" "nor" "or" "so" "yet" "a" "an" "the" "as" "at" "by" "for" "in" "of" "off" "on" "per" "to" "up" "via") + a list of acronyms + 100 hard coded special cases would probably do it. Then you make the extraction easily repeatable so that people can periodically re-run the extraction code as the RFCs are edited. Also, I think for section names it might make sense to also add a colon like "RFC 9076: Name Risks in the DNS Data".

Or you can just do it for sections, to me it's better to leave section names in URL syntax than do a bad job making them human readable because it's a link.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to leave anchors with names (and anything else not already determined to be safe) alone for now, probably not worth blocking this PR because of that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally,

  • RFC 1234, Section 2.
  • RFC 1234, Appendix 2, page 23.
  • RFC 1234: The Section Title (capitalize words except common conjunctions)

But I'm not sure whether it's worth the case-handling.

@AA-Turner AA-Turner merged commit e04d042 into sphinx-doc:master Oct 6, 2024
23 checks passed
@AA-Turner
Copy link
Member

Thank you @jstasiak, this is a nice improvement.

A

@AA-Turner AA-Turner added this to the 8.1.x milestone Oct 6, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 7, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

:rfc: should format links to sections
4 participants