Failing getting the full link #1

Closed
Ainali opened this issue May 10, 2023 · 1 comment

Comments

@Ainali
Member

Ainali commented May 10, 2023

In this run, the script reports the link for https://commons.wikimedia.org/wiki/File:Trains_icons_(evolution)_SVG.svg as https://commons.wikimedia.org/wiki/File:Trains_icons_, most likely because it stops matching at the first (.
Should the script URL-encode the link, or is there another solution?

@ericherman
Member

ericherman commented May 11, 2023

I see this as tricky because the text we're searching is:

* Graphic: [Trains icons (evolution) SVG](https://commons.wikimedia.org/wiki/File:Trains_icons_(evolution)_SVG.svg)

I think that if the URL were https://commons.wikimedia.org/wiki/File:Trains_icons_%28evolution%29_SVG.svg it would be correctly identified by the reg-ex which matches the URL.

grep --extended-regexp --only-matching --text '(http|https)://[a-zA-Z0-9\./\?=_%:\-]*' $FILE_NAME
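To make the behaviour concrete, here is a minimal sketch that runs the same pattern over both forms of the line. It assumes GNU grep (for the long options); the `extract_urls` helper and the variable names are made up for this illustration and are not part of the script.

```shell
#!/bin/sh
# The pattern is copied verbatim from the grep call above.
pattern='(http|https)://[a-zA-Z0-9\./\?=_%:\-]*'

# Hypothetical helper: print every URL-like match found on stdin.
extract_urls() {
    grep --extended-regexp --only-matching --text "$pattern"
}

# The line as it appears in the searched text, with literal parentheses.
raw_line='* Graphic: [Trains icons (evolution) SVG](https://commons.wikimedia.org/wiki/File:Trains_icons_(evolution)_SVG.svg)'

# The same line with the parentheses in the URL percent-encoded.
encoded_line='* Graphic: [Trains icons (evolution) SVG](https://commons.wikimedia.org/wiki/File:Trains_icons_%28evolution%29_SVG.svg)'

raw_match=$(printf '%s\n' "$raw_line" | extract_urls)
encoded_match=$(printf '%s\n' "$encoded_line" | extract_urls)

printf 'raw:     %s\n' "$raw_match"
printf 'encoded: %s\n' "$encoded_match"
```

The raw line yields https://commons.wikimedia.org/wiki/File:Trains_icons_ because ( is not in the bracket expression, while the encoded line yields the full URL and still stops before the ) that closes the Markdown link.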

Looking at RFC 3986, Uniform Resource Identifier (URI): Generic Syntax, I think we should be able to make sure that we match URLs correctly. We should keep in mind that some characters may appear only in very specific places in a URL and must otherwise be URL-encoded.

At a glance, we should probably add ~, !, @, +, the comma, and ;, and maybe (, ), [, and ] as well.
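As a rough sketch of what the first group of additions might look like, the pattern below adds ~, !, @, +, the comma, and ; to the bracket expression. It assumes GNU grep; the sample URL and the variable names are invented for the illustration, and the widened pattern is only a candidate, not something checked against every rule in RFC 3986.

```shell
#!/bin/sh
# Pattern currently used by the script (copied from the grep call above).
current_pattern='(http|https)://[a-zA-Z0-9\./\?=_%:\-]*'

# Candidate pattern (an assumption, not the script's): same idea, plus
# ~ ! @ + , ; inside the bracket expression. The "-" stays last so that it
# is read as a literal character rather than as a range.
candidate_pattern='(http|https)://[a-zA-Z0-9./?=_%:~!@+,;-]*'

# Made-up sample text with a URL that uses the extra characters.
sample='see https://example.org/~user/a+b,c;d@e!f#frag for details'

current_match=$(printf '%s\n' "$sample" |
    grep --extended-regexp --only-matching --text "$current_pattern")
candidate_match=$(printf '%s\n' "$sample" |
    grep --extended-regexp --only-matching --text "$candidate_pattern")

printf 'current:   %s\n' "$current_match"
printf 'candidate: %s\n' "$candidate_match"
```

With this sample, the current pattern stops at https://example.org/ and the candidate returns https://example.org/~user/a+b,c;d@e!f. Neither includes # or what follows it.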

The regex intentionally does not match # or anything after it, since that denotes the fragment (anchor), which we ignore for this purpose.

I think we could add ( and ) as valid URL characters to capture, but the match would then also include the trailing ) that closes the Markdown link, and we would need to discard that as a special case. I expect [ and ] will prove similarly challenging if we need to add those.
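One possible way to handle that special case, sketched under assumptions: match ( and ) as URL characters, then drop trailing ) characters for as long as the match contains more ) than (. The `strip_unbalanced_close` function and the variable names are hypothetical, GNU grep is assumed, and this heuristic is only one option; it would not help with a URL that legitimately ends in an unbalanced ).

```shell
#!/bin/sh
# Candidate pattern (an assumption): the script's character set plus ( and ).
pattern='(http|https)://[a-zA-Z0-9./?=_%:()-]*'

# Hypothetical helper: remove trailing ")" characters while the URL has
# more ")" than "(". Note that plain sh has no local variables, so "url",
# "opens" and "closes" are global.
strip_unbalanced_close() {
    url=$1
    while :; do
        # Stop as soon as the URL no longer ends in ")".
        case $url in
            *')') ;;
            *) break ;;
        esac
        # Count each kind of parenthesis; $(( )) normalises wc's output.
        opens=$(( $(printf '%s' "$url" | tr -cd '(' | wc -c) ))
        closes=$(( $(printf '%s' "$url" | tr -cd ')' | wc -c) ))
        [ "$closes" -gt "$opens" ] || break
        url=${url%')'}
    done
    printf '%s\n' "$url"
}

line='* Graphic: [Trains icons (evolution) SVG](https://commons.wikimedia.org/wiki/File:Trains_icons_(evolution)_SVG.svg)'

# The raw match includes the ")" that closes the Markdown link.
raw=$(printf '%s\n' "$line" |
    grep --extended-regexp --only-matching --text "$pattern")

fixed=$(strip_unbalanced_close "$raw")
printf '%s\n' "$fixed"
```

For the line from this issue, the raw match ends in .svg) and the cleaned result is https://commons.wikimedia.org/wiki/File:Trains_icons_(evolution)_SVG.svg. The balanced pair inside the file name is kept because only surplus closing parentheses are removed.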
