
feat: check URL fragments #94

Merged 1 commit into deadlinks:master on Nov 19, 2020

Conversation

zregvart
Contributor

@zregvart zregvart commented Nov 9, 2020

Note: I'm fairly certain that this can be polished; please consider the direction of the implementation first. I'm happy to include any suggestions. I appreciate the time it takes to review this largish PR. Thanks!

Update: rebased and changed to check fragments in HTTP URLs

This implements URL fragment checking. Parsing is expanded to also
collect the values of all id attributes.

For missing fragments in files, a CheckError::Anchor will be returned
containing a path and the missing fragment.

In Rust documentation, non-existent fragments of the form #n-m are used
to highlight a range of lines in the source code. For fragments in that
form, an additional check is performed to test whether all fragments n
through m exist. In case of missing items in a ranged fragment, a
CheckError::AnchorRange will be returned containing a path, the missing
ranged fragment, and the missing items from the range.
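
For illustration, a minimal sketch of such a ranged check might look like the following; the names are hypothetical, not the PR's actual API, which lives in src/check.rs:

```rust
use std::collections::HashSet;

/// Given a ranged fragment like "10-20" and the set of `id` values found in
/// the document, returns the ids from the range that are missing, if any.
/// Hypothetical helper for illustration only.
fn missing_from_range(fragment: &str, ids: &HashSet<String>) -> Option<Vec<String>> {
    let (start, end) = fragment.split_once('-')?;
    let (start, end) = (start.parse::<u32>().ok()?, end.parse::<u32>().ok()?);
    let missing: Vec<String> = (start..=end)
        .map(|n| n.to_string())
        .filter(|n| !ids.contains(n))
        .collect();
    if missing.is_empty() {
        None
    } else {
        Some(missing)
    }
}
```

A link such as src/lib.rs.html#10-20 (an illustrative example) would then be reported only if one of the ids 10 through 20 is absent.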

Bear in mind that this introduces the overhead of performing n+1 parses
or HTTP fetches: multiple links pointing to the same URL with different
fragments end up fetching/parsing the same content for each fragment. To
help with that, fragment lookups are memoized using the cached crate.
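
Stripped of the cached crate's macro syntax, the idea behind the memoization is roughly the following hand-rolled sketch (illustrative only; the PR uses the crate's macros rather than a plain map):

```rust
use std::collections::{HashMap, HashSet};

/// Returns the cached set of `id` values for a document, fetching and parsing
/// it only on the first request. With this in place, n links into the same
/// document cost one fetch/parse instead of n.
fn ids_for<'a>(
    cache: &'a mut HashMap<String, HashSet<String>>,
    url: &str,
    fetch_and_parse: impl FnOnce(&str) -> HashSet<String>,
) -> &'a HashSet<String> {
    cache.entry(url.to_string()).or_insert_with(|| fetch_and_parse(url))
}
```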

Fixes #7

Contributor

@jyn514 jyn514 left a comment


Thanks for working on this, but I kind of wish you'd talked to me about it first ... this is not exactly the approach I would have suggested.

@zregvart zregvart marked this pull request as draft November 11, 2020 09:41
@zregvart
Contributor Author

@jyn514 I've converted this to draft; I'm happy to rebase this and tidy it up with the suggestions above. I don't want to push forward if you're uncomfortable with the direction this is heading, so let me know whether this is the right direction.

@zregvart zregvart changed the title from "feat: check URL fragments for file URLs" to "feat: check URL fragments" Nov 11, 2020
@zregvart
Contributor Author

@jyn514 this is my attempt at using the algorithm described in #94 (comment), based on lol_html, with fragment checks for HTTP URLs. As noted above, I'm aware that there are drawbacks to this approach. Restructuring for a queue/async algorithm might be a bigger refactor than this functionality warrants, so I stayed away from that.
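
For reference, collecting id attributes with lol_html looks roughly like this. This is a sketch against lol_html's current (1.x) API rather than the PR's exact code, and the crate version targeted in 2020 may differ slightly:

```rust
use std::collections::HashSet;

use lol_html::{element, HtmlRewriter, Settings};

/// Streams an HTML document through lol_html and collects every `id` value,
/// which can later be matched against URL fragments (illustrative sketch).
fn collect_ids(html: &[u8]) -> Result<HashSet<String>, Box<dyn std::error::Error>> {
    let mut ids = HashSet::new();
    let mut rewriter = HtmlRewriter::new(
        Settings {
            element_content_handlers: vec![element!("*[id]", |el| {
                if let Some(id) = el.get_attribute("id") {
                    ids.insert(id);
                }
                Ok(())
            })],
            ..Settings::default()
        },
        // We only inspect the document; the rewritten output is discarded.
        |_: &[u8]| {},
    );
    rewriter.write(html)?;
    rewriter.end()?;
    Ok(ids)
}
```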

Contributor

@jyn514 jyn514 left a comment


Ok, after reading your comments you've convinced me this is the right approach :) I forgot that you have to actually parse the other HTML file to see what IDs it has.

@jyn514 jyn514 added the C-enhancement (Category: This is a new feature) and S-waiting-on-review (Status: Waiting for either a maintainer or issue author to review) labels Nov 12, 2020
@zregvart
Contributor Author

@jyn514 thanks for the review, all on point. I'll polish this further.

@zregvart
Contributor Author

@jyn514, I incorporated most of the changes; the bit that I didn't is around caching the IO when reading files and HTTP responses that are then parsed and searched for fragments. I think it makes sense to cache the IO; I think that's the expensive part here.

I'm not particularly keen on the duplication of logic when handling HEAD vs GET requests. The resp.synthetic() and resp.ok() bits are almost completely the same on the two branches; I wonder if you have any suggestions on how to eliminate that duplication?

@zregvart zregvart marked this pull request as ready for review November 13, 2020 12:39
@zregvart
Contributor Author

I'm not entirely unhappy with this version. Thanks for all the feedback and suggestions @jyn514, would you mind having another look?

Contributor

@jyn514 jyn514 left a comment


Left a bunch more comments on the implementation, but I think this is getting close to ready :) The main thing is that I want the caching to work differently: right now it caches on both the HTML file and the fragment, which means it fetches the same file over and over again.

src/check.rs Outdated
Comment on lines 163 to 249
    if resp.synthetic() {
        Err(CheckError::Http(Box::new(HttpError::Fetch(
            url.clone(),
            resp.into_synthetic_error().unwrap(),
        ))))
    } else if resp.ok() {
        Ok(resp)
    } else {
        Err(CheckError::Http(Box::new(HttpError::UnexpectedStatus(
            url.clone(),
            resp,
        ))))
Contributor


note to self: simplify this in a follow-up

@jyn514 jyn514 added the S-waiting-on-author (Status: Waiting on the author of this PR or issue) label and removed the S-waiting-on-review (Status: Waiting for either a maintainer or issue author to review) label Nov 13, 2020
@zregvart
Contributor Author

I've almost rewritten this from scratch, taking in the feedback (thanks @jyn514) and trying a few different avenues, so this now needs a complete review (again). In short, I've made CheckError have four different kinds of errors; there is now a single cached method, using closures to implement protocol-specific logic (reading a file vs issuing a request over HTTP); and I've created a Link enum to remove some duplication.

Thank you for your suggestions thus far, @jyn514. Please take your time in reviewing this; I'm happy to change course and incorporate any suggestions.
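
To give a feel for the closure-based shape described above (types and names here are illustrative, not the ones in src/check.rs): the fragment check itself is protocol-agnostic, and the caller decides how the document's ids are obtained.

```rust
use std::collections::HashSet;

use url::Url;

/// A link is either a local file path or an HTTP(S) URL; sketched after the
/// PR's Link enum, simplified for illustration.
pub enum Link {
    File(String),
    Http(Url),
}

/// Protocol-agnostic fragment check: the caller passes a closure that knows
/// how to produce the document's set of ids (read + parse a file, or GET +
/// parse a response), and this function only checks membership.
/// Hypothetical signature; the PR's CheckError type is richer than a String.
fn check_fragment<F>(link: &Link, fragment: &str, fetch_ids: F) -> Result<(), String>
where
    F: FnOnce(&Link) -> Result<HashSet<String>, String>,
{
    let ids = fetch_ids(link)?;
    if ids.contains(fragment) {
        Ok(())
    } else {
        Err(format!("missing fragment #{fragment}"))
    }
}
```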

@zregvart
Contributor Author

Okay, so I think that's all of the suggestions incorporated. Thank you @jyn514 for reviewing this; I've learned a lot. Mind having another look (for the n-th time :))? Thanks!

@jyn514
Contributor

jyn514 commented Nov 15, 2020

In short, I've made CheckError have four different kinds of errors; there is now a single cached method, using closures to implement protocol-specific logic (reading a file vs issuing a request over HTTP)

What's the reason for having a separate IoError compared to the AnchorError?

@zregvart
Contributor Author

What's the reason for having a separate IoError compared to the AnchorError?

I thought that handling IO issues separately made sense: it's not really that the fragment is missing; rather, we can't check whether it's there due to an IO problem (disk read error, network issue). I think I went down that road once I wanted to unify the two (http/file) fragment-parsing functions into one by using a closure to fetch the HTML content. It can be dropped if you think that would be better.

Contributor

@jyn514 jyn514 left a comment


Mostly nits now; the shape of this looks right. I might leave a few more comments as I try it out locally.

src/check.rs Outdated
@@ -60,17 +100,128 @@ pub fn is_available(url: &Url, ctx: &CheckContext) -> Result<(), CheckError> {
}
}
}
    cached_key_result! {
        CHECK_FILE: SizedCache<String, HashSet<String>> = SizedCache::with_size(100);
        Key = { link.to_string() };
Contributor


Suggested change (add an explanatory comment above the cache key):

    /// `fetch_html` is different depending on whether the link is being loaded from disk or from the network.
    /// However, the set of URLs from disk and from the web are disjoint, so it's OK to cache based on the name of the link alone.
    Key = { link.to_string() };

Contributor Author


The cached_key_result macro doesn't really allow rustdoc, so I've added this as a regular comment.

Comment on lines +44 to +46

    pub enum Link {
        File(String),
        Http(Url),
Contributor


This should let you get rid of the clone()s.

Suggested change (borrow instead of owning, to avoid the clones):

    pub enum Link {          // before
        File(String),
        Http(Url),

    pub enum Link<'a> {      // after
        File(&'a str),
        Http(&'a Url),

If this gives you too much trouble, no problem, I can fix it in a follow-up.

Contributor Author


Yeah, this trickles down to CheckError as well, since CheckError contains Link via Fragment. Perhaps something to clean up later?

Contributor


Sounds good, this PR is already big enough 😅

src/check.rs Outdated
Comment on lines 265 to 266

    let mut url = url.clone();
    url.set_fragment(None);
Contributor


Is this necessary? What happens if you don't call set_fragment?

Contributor Author


Yeah, I refactored that wrong. I do need to fix the caching for HTTP Links, which I think this was doing before the refactor to Link: the cache needs to be keyed by the URL without the fragment, otherwise it will miss for different fragments.
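
Concretely, keying the cache on the URL with its fragment cleared could look like this (the url crate's set_fragment is real; the surrounding cache and the helper's name are illustrative):

```rust
use url::Url;

/// Cache key for an HTTP link: `page#a` and `page#b` must hit the same cache
/// entry, so the fragment is stripped before the key is built.
fn cache_key(url: &Url) -> String {
    let mut key = url.clone();
    key.set_fragment(None);
    key.to_string()
}
```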

@jyn514
Contributor

jyn514 commented Nov 17, 2020

It might also be nice to add a --ignore-fragments flag, otherwise people will suddenly have failing CI with no way to opt out.

Contributor

@jyn514 jyn514 left a comment


Wow, I just tried this on exonum (755 dependencies) and it finished almost immediately :) great job with the caching!

Comment on lines +251 to +278

    let resp = ureq::head(url.as_str()).call();

    handle_response(url, resp).map(|_: ureq::Response| ())
Contributor


As a follow-up, not necessarily here, it might be nice to cache URLs that don't have fragments too.

@jyn514
Contributor

jyn514 commented Nov 18, 2020

This also needs a rebase because of #87 (sorry!)

@jyn514
Contributor

jyn514 commented Nov 18, 2020

@zregvart I think you forgot to push your changes after replying ;)

@zregvart
Contributor Author

@jyn514 should be up to date now.

@zregvart
Contributor Author

It might also be nice to add a --ignore-fragments flag, otherwise people will suddenly have failing CI with no way to opt out.

I agree; I can do that in a follow-up PR if that's okay.
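
A rough sketch of what that opt-out might look like, without committing to any particular CLI parser (the struct here is a hypothetical, pared-down stand-in for the crate's CheckContext, which has other fields):

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for CheckContext, reduced to the opt-out flag
/// under discussion; the field name is illustrative, not the crate's.
pub struct CheckContext {
    pub check_fragments: bool,
}

/// When --ignore-fragments is passed (check_fragments == false), fragment
/// lookups are skipped entirely, so existing CI setups keep passing until
/// a project opts in.
fn fragment_ok(ctx: &CheckContext, fragment: &str, ids: &HashSet<String>) -> bool {
    !ctx.check_fragments || ids.contains(fragment)
}
```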

This implements URL fragment checking. Parsing is expanded to also
collect the values of all `id` attributes.

For missing fragments in files, a `CheckError::Fragment` will be returned
containing a path and the missing fragment.

In Rust documentation, non-existent fragments of the form `#n-m` are used
to highlight a range of lines in the source code. For fragments in that
form, an additional check is performed to test whether all fragments `n`
through `m` exist. In case of missing items in a ranged fragment, a
`CheckError::Fragment` will be returned containing a path, the missing
ranged fragment, and the missing items from the range.

Bear in mind that this introduces the overhead of performing n+1 parses
or HTTP fetches: multiple links pointing to the same URL with different
fragments end up fetching/parsing the same content for each fragment. To
help with that, fragment lookups are memoized using the cached crate.

Fixes deadlinks#7
@jyn514 jyn514 merged commit 343d01d into deadlinks:master Nov 19, 2020
@jyn514
Contributor

jyn514 commented Nov 19, 2020

Thanks so much for sticking with this!

If you're interested in addressing some of the follow-ups, that would be great; --ignore-fragments especially would be a nice addition, I think :) If you don't have time, though, I'm happy to add it myself.

@zregvart
Contributor Author

@jyn514 you're welcome, and thanks for reviewing and merging this. I think I have one or two more touch-ups for this; happy to work on them.

@zregvart zregvart deleted the pr/check-fragments branch November 19, 2020 18:42