Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[INFRA] fixing "remove_internal_links" for PDF build #855

Closed

Conversation

DimitriPapadopoulos
Copy link
Collaborator

@DimitriPapadopoulos DimitriPapadopoulos commented Aug 18, 2021

This regular expression includes duplicate character \w in a set of characters.

Fixes #852.

Or at least attempts to fix that issue. This patch should not modify the current behaviour of the program. Yet, please review it carefully as the duplication might hide a conceptual error - but then I suppose such an error would have been detected by now!

@DimitriPapadopoulos DimitriPapadopoulos changed the title Regex [INFRA] LGTM warning: Duplication in regular expression character class Aug 18, 2021
Copy link
Collaborator

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO looks LGTM

elif link_type == 'same':
# regex that matches references sections within the same markdown
primary_pattern = re.compile(r'\[([\w\s.\(\)`*/–]+)\]\(([#\w\-._\w]+)\)')
primary_pattern = re.compile(r'\[([\w\s.\(\)`*/–]+)\]\(([#\w\-._]+)\)')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
primary_pattern = re.compile(r'\[([\w\s.\(\)`*/–]+)\]\(([#\w\-._]+)\)')
primary_pattern = re.compile(r'\[([\w\s\.\(\)`*/–]+)\]\(([#\w\-\._]+)\)')

These .s should be escaped, too.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, I have fixed it. I have fixed the previous regex too. Can you have a look?

Copy link
Member

@sappelhoff sappelhoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that there is something more basic that is broken with remove_internal_links.

The purpose of this function is a bit better described in #596, but the TLDR is: for the PDF build only, we need to remove Markdown links that refer to headings within the specification text.

However, looking at the "build_docs_pdf" build artifact (see screenshot):

image

... you can see that there are several "internal" links that are NOT removed and that are broken (try clicking one).

Without having looked deeper into it, I suspect that either (i) the regexp doesn't properly work for the majority of our cases, or (ii) it doesn't work for markdown links that are declared at the bottom of the page and referred to via this syntax: [I am a link text][I-am-a-ref-to-be-declared-somewhere-else], or (iii) both of the above.

Could you look into this @DimitriPapadopoulos?

EDIT: For example, look at page 167 in the pdf build: dataset-level metadata is included in Derived dataset and pipeline description. The link on that text should have been removed --- but it's there and broken.

@DimitriPapadopoulos
Copy link
Collaborator Author

DimitriPapadopoulos commented Aug 23, 2021

Yes, that's why I had asked for a careful review. I suspect that the [] classes might need to be changed into actual () groups or something like that. I will have a look at #596.

@sappelhoff
Copy link
Member

Yes, that's why I had asked for a careful review. I suspect that the [] classes might need to be changed into actual () groups or something like that. I will have a look at #596.

I know you are currently busy with two fresh PRs @DimitriPapadopoulos - but do you have news on this one? Or an expected timeline?

@DimitriPapadopoulos
Copy link
Collaborator Author

Not really. Perhaps a few weeks? I believe it's not an easy one, but maybe I haven't taken the time to find the right angle yet.

@sappelhoff
Copy link
Member

I also think it might be hard. So no rush, any work here is highly appreciated!

@sappelhoff sappelhoff changed the title [INFRA] LGTM warning: Duplication in regular expression character class [INFRA] fixing "remove_internal_links" for PDF build Sep 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

"LGTM" tool warns about regexp for "remove_internal_links" function (for PDF build)
4 participants