-
Notifications
You must be signed in to change notification settings - Fork 163
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[INFRA] fixing "remove_internal_links" for PDF build #855
[INFRA] fixing "remove_internal_links" for PDF build #855
Conversation
226284d
to
48940ff
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO looks LGTM
pdf_build_src/process_markdowns.py
Outdated
elif link_type == 'same': | ||
# regex that matches references sections within the same markdown | ||
primary_pattern = re.compile(r'\[([\w\s.\(\)`*/–]+)\]\(([#\w\-._\w]+)\)') | ||
primary_pattern = re.compile(r'\[([\w\s.\(\)`*/–]+)\]\(([#\w\-._]+)\)') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
primary_pattern = re.compile(r'\[([\w\s.\(\)`*/–]+)\]\(([#\w\-._]+)\)') | |
primary_pattern = re.compile(r'\[([\w\s\.\(\)`*/–]+)\]\(([#\w\-\._]+)\)') |
These .
s should be escaped, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, I have fixed it. I have fixed the previous regex too. Can you have a look?
fc2cf9d
to
2e84258
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me that there is something more basic that is broken with remove_internal_links
.
The purpose of this function is a bit better described in #596, but the TLDR is: for the PDF build only, we need to remove Markdown links that refer to headings within the specification text.
However, looking at the "build_docs_pdf" build artifact (see screenshot):
... you can see that there are several "internal" links that are NOT removed and that are broken (try clicking one).
Without having looked deeper into it, I suspect that either (i) the regexp doesn't properly work for the majority of our cases, or (ii) it doesn't work for markdown links that are declared at the bottom of the page and referred to via this syntax: [I am a link text][I-am-a-ref-to-be-declared-somewhere-else]
, or (iii) both of the above.
Could you look into this @DimitriPapadopoulos?
EDIT: For example, look at page 167 in the pdf build: dataset-level metadata is included in Derived dataset and pipeline description.
The link on that text should have been removed --- but it's there and broken.
Yes, that's why I had asked for a careful review. I suspect that the |
2e84258
to
a03342c
Compare
I know you are currently busy with two fresh PRs @DimitriPapadopoulos - but do you have news on this one? Or an expected timeline? |
Not really. Perhaps a few weeks? I believe it's not an easy one, but maybe I haven't taken the time to find the right angle yet. |
I also think it might be hard. So no rush, any work here is highly appreciated! |
This regular expression includes duplicate character '\w' in a set of characters.
a03342c
to
6529560
Compare
This regular expression includes duplicate character
\w
in a set of characters.Fixes #852.
Or at least attempts to fix that issue. This patch should not modify the current behaviour of the program. Yet, please review it carefully as the duplication might hide a conceptual error - but then I suppose such an error would have been detected by now!