[INFRA] fixing "remove_internal_links" for PDF build #855

DimitriPapadopoulos · 2021-08-18T16:24:20Z

This regular expression includes duplicate character \w in a set of characters.

Fixes #852.

Or at least attempts to fix that issue. This patch should not modify the current behaviour of the program. Yet, please review it carefully as the duplication might hide a conceptual error - but then I suppose such an error would have been detected by now!
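The claim that the patch leaves behaviour unchanged can be checked directly: a repeated `\w` inside a character class adds nothing to the set of accepted characters. The following is a minimal sketch, assuming the two patterns exactly as they appear in the diff for `pdf_build_src/process_markdowns.py`; the sample sentence and variable names are made up for illustration.

```python
import re

# Pattern before the patch (duplicate \w in the second character class).
OLD = re.compile(r'\[([\w\s.\(\)`*/–]+)\]\(([#\w\-._\w]+)\)')
# Pattern after the patch (duplicate removed).
NEW = re.compile(r'\[([\w\s.\(\)`*/–]+)\]\(([#\w\-._]+)\)')

# Hypothetical sample text: two same-page links and one external link.
sample = (
    "See [Participants file](#participants-file) and "
    "[the `.json` file](#sidecar_json.v1), "
    "but not [external](https://example.org)."
)

old_hits = OLD.findall(sample)
new_hits = NEW.findall(sample)

# Both patterns find the same (link text, target) pairs.
print(old_hits == new_hits)
for text, target in new_hits:
    print(f"{text!r} -> {target!r}")
```

A check like this only confirms that the two patterns agree with each other. It says nothing about whether the pattern matches every link it is supposed to match.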

yarikoptic

IMHO looks LGTM

effigies · 2021-08-20T20:01:12Z

pdf_build_src/process_markdowns.py

    elif link_type == 'same':
        # regex that matches references sections within the same markdown
-        primary_pattern = re.compile(r'\[([\w\s.\(\)`*/–]+)\]\(([#\w\-._\w]+)\)')
+        primary_pattern = re.compile(r'\[([\w\s.\(\)`*/–]+)\]\(([#\w\-._]+)\)')


Suggested change

-        primary_pattern = re.compile(r'\[([\w\s.\(\)`*/–]+)\]\(([#\w\-._]+)\)')
+        primary_pattern = re.compile(r'\[([\w\s\.\(\)`*/–]+)\]\(([#\w\-\._]+)\)')

These .s should be escaped, too.
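As a side note for readers: in Python's `re` module, special characters lose their special meaning inside a character class, so `.` and `\.` inside `[...]` both match only a literal dot. Escaping here is therefore a matter of readability and consistency rather than behaviour. A small sketch, using only the link-target character class from the patterns above; the candidate strings are invented for illustration.

```python
import re

# Link-target character class with and without the escaped dot.
UNESCAPED = re.compile(r'[#\w\-._]+')
ESCAPED = re.compile(r'[#\w\-\._]+')

# Hypothetical candidates; '/' and ' ' are outside the class on purpose.
candidates = ['#file.name', '#file/name', '#a.b-c_d', 'a b', '...']


def accepted(pattern):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if pattern.fullmatch(c)]


unescaped_hits = accepted(UNESCAPED)
escaped_hits = accepted(ESCAPED)

# The dot inside [...] is literal either way, so the results agree.
print(unescaped_hits == escaped_hits)
print(unescaped_hits)
```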

DimitriPapadopoulos · 2021-08-21

Indeed, I have fixed it. I have fixed the previous regex too. Can you have a look?

sappelhoff

It seems to me that there is something more basic that is broken with remove_internal_links.

The purpose of this function is a bit better described in #596, but the TLDR is: for the PDF build only, we need to remove Markdown links that refer to headings within the specification text.

However, looking at the "build_docs_pdf" build artifact (see screenshot):

... you can see that there are several "internal" links that are NOT removed and that are broken (try clicking one).

Without having looked deeper into it, I suspect that either (i) the regexp doesn't properly work for the majority of our cases, or (ii) it doesn't work for markdown links that are declared at the bottom of the page and referred to via this syntax: [I am a link text][I-am-a-ref-to-be-declared-somewhere-else], or (iii) both of the above.

Could you look into this @DimitriPapadopoulos?

EDIT: For example, look at page 167 in the PDF build: "dataset-level metadata" is included in "Derived dataset and pipeline description". The link on that text should have been removed, but it is still there and broken.
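Both suspicions can be probed with a few lines of Python. The sketch below assumes the inline-link pattern as it appears in this PR, with the dash in the first character class being an en dash rather than a hyphen, as shown in the diff. `REFERENCE_LINK` and `strip_links` are hypothetical helpers written for this illustration, not part of `process_markdowns.py`, and replacing a link with its bare text is an assumption about what "removing" a link should mean.

```python
import re

# Inline-link pattern as shown in this PR (the dash in the first class is an en dash).
INLINE_LINK = re.compile(r'\[([\w\s\.\(\)`*/–]+)\]\(([#\w\-\._]+)\)')
# Hypothetical pattern for reference-style links: [text][ref]. Not from the PR.
REFERENCE_LINK = re.compile(r'\[([^\]]+)\]\[([^\]]+)\]')


def strip_links(pattern, text):
    """Replace every link matched by `pattern` with its bare link text."""
    return pattern.sub(lambda m: m.group(1), text)


# Invented sample sentences.
plain_inline = 'See [Common principles](#common-principles).'
hyphen_inline = 'See [dataset-level metadata](#dataset-description).'
reference = 'See [dataset-level metadata][dataset-description].'

# (i) The inline pattern handles simple link text ...
plain_result = strip_links(INLINE_LINK, plain_inline)
# ... but a hyphen in the link text is outside its first character class.
hyphen_result = strip_links(INLINE_LINK, hyphen_inline)
# (ii) Reference-style links contain no "](", so the inline pattern skips them.
reference_untouched = strip_links(INLINE_LINK, reference)
# A separate pattern is needed to catch the reference-style form.
reference_result = strip_links(REFERENCE_LINK, reference)

for label, value in [
    ('plain inline', plain_result),
    ('hyphenated inline', hyphen_result),
    ('reference, inline pattern', reference_untouched),
    ('reference, reference pattern', reference_result),
]:
    print(f'{label}: {value}')
```

If the sketch reflects the real inputs, both (i) and (ii) would contribute to the leftover links. Whether that is what happens in the actual build would need to be confirmed against the specification sources.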

DimitriPapadopoulos · 2021-08-23T12:05:23Z

Yes, that's why I had asked for a careful review. I suspect that the [] classes might need to be changed into actual () groups or something like that. I will have a look at #596.

sappelhoff · 2021-09-13T10:14:43Z

> Yes, that's why I had asked for a careful review. I suspect that the [] classes might need to be changed into actual () groups or something like that. I will have a look at #596.

I know you are currently busy with two fresh PRs @DimitriPapadopoulos - but do you have news on this one? Or an expected timeline?

DimitriPapadopoulos · 2021-09-13T10:44:56Z

Not really. Perhaps a few weeks? I believe it's not an easy one, but maybe I haven't taken the time to find the right angle yet.

sappelhoff · 2021-09-13T10:47:27Z

I also think it might be hard. So no rush, any work here is highly appreciated!

DimitriPapadopoulos force-pushed the regex branch from 226284d to 48940ff Compare August 18, 2021 16:25

DimitriPapadopoulos changed the title ~~Regex~~ [INFRA] LGTM warning: Duplication in regular expression character class Aug 18, 2021

yarikoptic approved these changes Aug 20, 2021

effigies reviewed Aug 20, 2021

DimitriPapadopoulos force-pushed the regex branch from fc2cf9d to 2e84258 Compare August 23, 2021 08:03

sappelhoff reviewed Aug 23, 2021

sappelhoff marked this pull request as draft August 31, 2021 15:18

sappelhoff mentioned this pull request Aug 31, 2021

"LGTM" tool warns about regexp for "remove_internal_links" function (for PDF build) #852

Closed

DimitriPapadopoulos force-pushed the regex branch from 2e84258 to a03342c Compare September 1, 2021 06:47

DimitriPapadopoulos added 2 commits September 13, 2021 19:37

[INFRA] LGTM warning: Duplication in regular expression character class

771cc30

This regular expression includes duplicate character '\w' in a set of characters.

Another regex fix: . → \.

6529560

DimitriPapadopoulos force-pushed the regex branch from a03342c to 6529560 Compare September 13, 2021 17:38

sappelhoff changed the title ~~[INFRA] LGTM warning: Duplication in regular expression character class~~ [INFRA] fixing "remove_internal_links" for PDF build Sep 28, 2021

sappelhoff mentioned this pull request Oct 28, 2021

[INFRA] PDF version of spec: fix handling of internal links #915

Merged

3 tasks

sappelhoff closed this in #915 Nov 15, 2021

DimitriPapadopoulos deleted the regex branch November 18, 2021 05:41
