Failing getting the full link #1

Ainali · 2023-05-10T11:23:50Z

In this run, the script records the link https://commons.wikimedia.org/wiki/File:Trains_icons_(evolution)_SVG.svg as https://commons.wikimedia.org/wiki/File:Trains_icons_, likely cutting it off at the first (.
Should the script URL-encode the link, or what could be a solution?

ericherman · 2023-05-11T07:28:16Z

I see this as tricky because the text we're searching is

* Graphic: [Trains icons (evolution) SVG](https://commons.wikimedia.org/wiki/File:Trains_icons_(evolution)_SVG.svg)

I think that if the URL were https://commons.wikimedia.org/wiki/File:Trains_icons_%28evolution%29_SVG.svg, it would be correctly identified by the regex which matches the URL:

grep --extended-regexp --only-matching --text '(http|https)://[a-zA-Z0-9\./\?=_%:\-]*' $FILE_NAME
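
To make the difference concrete, here is a minimal sketch that runs the regex quoted above against both forms of the link. The `extract_urls` helper and the two sample variables are hypothetical names used only for illustration; the script itself reads from `$FILE_NAME`.

```shell
# Sample lines: the Markdown as it appears in the file, and the same link
# with the parentheses percent-encoded. (Hypothetical test data.)
line='* Graphic: [Trains icons (evolution) SVG](https://commons.wikimedia.org/wiki/File:Trains_icons_(evolution)_SVG.svg)'
encoded='* Graphic: [Trains icons (evolution) SVG](https://commons.wikimedia.org/wiki/File:Trains_icons_%28evolution%29_SVG.svg)'

# Hypothetical wrapper around the grep call quoted above; reads from stdin.
extract_urls() {
    grep --extended-regexp --only-matching --text \
        '(http|https)://[a-zA-Z0-9\./\?=_%:\-]*'
}

# "(" is not in the bracket expression, so the match stops right before it.
printf '%s\n' "$line" | extract_urls
# prints: https://commons.wikimedia.org/wiki/File:Trains_icons_

# "%", "2", "8" and "9" are all in the bracket expression, so the whole
# percent-encoded URL matches, and the closing ")" of the Markdown link is left out.
printf '%s\n' "$encoded" | extract_urls
# prints: https://commons.wikimedia.org/wiki/File:Trains_icons_%28evolution%29_SVG.svg
```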

Looking at RFC 3986, Uniform Resource Identifier (URI): Generic Syntax, I think we should be able to ensure that we are matching URLs correctly. We should take care that some characters can only appear in very specific spots in a URL and must otherwise be URL-encoded.

At a glance, we should probably add `~`, `!`, `@`, `+`, `,`, `;` and maybe `(`, `)`, `[`, `]` as well.

The regex intentionally does not grab # or anything that follows it, as that denotes the fragment (the anchor within the page), which we ignore for this purpose.

I think we could add ( and ) as valid URL characters to capture, but then the regex would also capture the trailing ) that closes the Markdown link, and we would need to discard that as a special case. I expect [ and ] will also prove challenging if we need to add those.
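
One way the special case could look, as a rough sketch rather than a description of what the script actually does: add ( and ) to the bracket expression, then drop trailing ) characters for as long as the match has more closing than opening parentheses. The function name `extract_urls_with_parens` is hypothetical, and the balancing rule is an assumption about what "discard" should mean here.

```shell
# Hypothetical helper: reads text on stdin, prints one URL per line.
extract_urls_with_parens() {
    # Same regex as before, with "(" and ")" added to the bracket expression.
    grep --extended-regexp --only-matching --text \
        '(http|https)://[a-zA-Z0-9\./\?=_%:()\-]*' |
    while IFS= read -r url; do
        # Count the opening and closing parentheses in the match.
        opens=$(( $(printf '%s' "$url" | tr -cd '(' | wc -c) ))
        closes=$(( $(printf '%s' "$url" | tr -cd ')' | wc -c) ))
        # Strip trailing ")" while closing parentheses outnumber opening ones;
        # such a ")" most likely belongs to the surrounding Markdown.
        while [ "$closes" -gt "$opens" ]; do
            case $url in
                *')') url=${url%?}; closes=$((closes - 1)) ;;
                *) break ;;
            esac
        done
        printf '%s\n' "$url"
    done
}

printf '%s\n' '* Graphic: [Trains icons (evolution) SVG](https://commons.wikimedia.org/wiki/File:Trains_icons_(evolution)_SVG.svg)' |
    extract_urls_with_parens
# prints: https://commons.wikimedia.org/wiki/File:Trains_icons_(evolution)_SVG.svg
```

This keeps balanced parentheses that are part of the URL and only removes the surplus ones at the end. It would still get a URL wrong if that URL legitimately ends in an unbalanced ), which is one reason percent-encoding remains the more robust option.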

ericherman closed this as completed in 08a795f May 11, 2023

ericherman mentioned this issue May 11, 2023

Regex for matching URLs does not fully conform to RFC 3986 URI Generic Syntax #4

Open
