Links in HTML source followed by "), ">, ') or '> returns a wrong end index #3

manuelleduc · 2015-11-19T09:48:59Z

I tried to extract links from https://fr.yahoo.com/?p=us for a test.

Some of the returned links contained too many characters at the end, for example :

I'm trying to understand UrlScanner's code but I'm not sure to be able to fix it.

I'll send a pull request later if I manage to fix it.

robinst · 2015-12-01T02:41:55Z

Hey! Sorry for the late reply (I was on holidays).

It looks like you're trying to extract links from HTML source? That's not a goal of this library at the moment. It's for extracting links from plain text written by humans, such as markup formats.

If you want to extract links from the text contents of HTML, I recommend using e.g. jsoup to parse the HTML, and then running autolink-java on the text nodes.

Note that rinku (which was the inspiration for this library) tries to detect HTML tags to exclude them. I would be open to having an option to enable such a feature, in case you want to contribute it. See here how this is implemented in rinku.

frmz · 2016-02-01T15:41:59Z

Very very dirty solution but adding a replaceAll("["'].*", "") to the resulting URL fixes this problem, also, library wise, it wouldn't be too difficult to stop whenever one of those 2 chars is detected

* `;` is now treated the same as `,` and `:` * `<` and `>` now also need to match, same as other brackets * `/` can still be within or at the end of an URL, but if it's within a group of other delimiters, it behaves as a delimiter Together, these new rules result in `">`, `"/>` and `");` to be excluded at the end of links, while hopefully not messing with the overall heuristics too much.

robinst · 2016-02-05T06:56:44Z

Ok, I've taken a stab at implementing this.

@manuelleduc and @frmz, feel free to look at the test cases in PR #4 and give feedback. Basically, all the above cases now work as expected, except this one: https://s.yimg.com/os/uh-icons/0.1.16/uh/fonts/uh.eot?);src:url(https://s.yimg.com/os/uh-icons/0.1.16/uh/fonts/uh.eot?#iefix

frmz · 2016-02-05T10:10:27Z

That works however i don't understand why you cannot just stop when a " or ' is found, i mean, a standard, properly encoded URL shouldn't contain those 2 chars, do you have situations where URLs contains single / double quotes?

robinst · 2016-02-07T13:39:49Z

@frmz Sure, how about this one?: https://en.wiktionary.org/wiki/it's (note how GitHub recognizes it)

Both RFC 3986 and 1738 allow '. If you can point me to a document that says otherwise, I'd love to read it. The situation for " is less clear, though it seems like they should be treated the same.

frmz · 2016-02-08T08:57:20Z

Yeah you are right, i was actually thinking about double quotes ("), single quotes in HTML shouldn't really be used (although they are), even i found quite a lot of links with single quotes around i did not found any with a non escaped double quote inside, so i would argue that, at least ", can be safely considered a delimiter.

robinst · 2016-02-08T11:48:32Z

Not sure. What about this?: https://en.wiktionary.org/wiki/"_"

frmz · 2016-02-08T12:28:41Z

If you look at this very page source code you will see that inside the href the link is correctly escaped as "https://en.wiktionary.org/wiki/%22_%22", the quotes are only in the text part of the link which is not the link itself (off course a text dump of this page will still have the quotes). Real problem i see is that my browser is not escaping them in the URL bar so it might still represent an usable char.

Anyway i am still fine with implementaion above, that works in most cases, even if when scraping html probably a better solution would be using something like s/href="([^"]*)"/\1/ to get the link (but i see we might go out of topic in this case being the library a generic implementation)

robinst · 2016-02-09T13:14:08Z

Sure, for HTML this is mostly true, although the following also works (not sure if actually valid according to spec): <a href='https://en.wiktionary.org/wiki/"_"'>test</a>.

But, as I've said in my first comment: This library is not about extracting links from HTML. Use a HTML parser for that. This library is for extracting links from plain text that a user might write, such as markup text or a GitHub comment. If it happens to also work for some forms of HTML, that is fine, but not an explicit goal.

Treat more special characters as trailing delimiters (#3)

robinst · 2016-02-09T13:15:21Z

Merged PR #4, closing this.

robinst · 2016-02-10T13:06:09Z

These changes are now released in version 0.4.0.

robinst closed this as completed Dec 1, 2015

robinst changed the title ~~Links followed by "), ">, ') or '> returns a wrong end index~~ Links in HTML source followed by "), ">, ') or '> returns a wrong end index Dec 1, 2015

robinst reopened this Feb 5, 2016

robinst added a commit that referenced this issue Feb 9, 2016

Merge pull request #4 from robinst/issue-3-angle-brackets-and-slash

9a6c3f9

Treat more special characters as trailing delimiters (#3)

robinst closed this as completed Feb 9, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Links in HTML source followed by "), ">, ') or '> returns a wrong end index #3

Links in HTML source followed by "), ">, ') or '> returns a wrong end index #3

manuelleduc commented Nov 19, 2015

robinst commented Dec 1, 2015

frmz commented Feb 1, 2016

robinst commented Feb 5, 2016

frmz commented Feb 5, 2016

robinst commented Feb 7, 2016

frmz commented Feb 8, 2016

robinst commented Feb 8, 2016

frmz commented Feb 8, 2016

robinst commented Feb 9, 2016

robinst commented Feb 9, 2016

robinst commented Feb 10, 2016

Links in HTML source followed by "), ">, ') or '> returns a wrong end index #3

Links in HTML source followed by "), ">, ') or '> returns a wrong end index #3

Comments

manuelleduc commented Nov 19, 2015

robinst commented Dec 1, 2015

frmz commented Feb 1, 2016

robinst commented Feb 5, 2016

frmz commented Feb 5, 2016

robinst commented Feb 7, 2016

frmz commented Feb 8, 2016

robinst commented Feb 8, 2016

frmz commented Feb 8, 2016

robinst commented Feb 9, 2016

robinst commented Feb 9, 2016

robinst commented Feb 10, 2016