
feat: check URL fragments #94

Merged 1 commit into deadlinks:master on Nov 19, 2020

Conversation

zregvart
Contributor

@zregvart zregvart commented Nov 9, 2020

Note: I'm fairly certain that this can be polished; please consider the direction of the implementation first. I'm happy to include any suggestions. I appreciate the time it takes to review this largish PR. Thanks!

Update: rebased and changed to check fragments in HTTP URLs

This implements URL fragment checking. Parsing is expanded to also
collect the values of all id attributes.

For missing fragments in files, a CheckError::Anchor will be returned
containing a path and the missing fragment.

In Rust documentation, non-existent fragments of the form #n-m are used
to highlight a range of lines in the source code. For fragments in that
form, an additional check is performed to test whether all fragments n
through m exist. In case of missing items in a ranged fragment, a
CheckError::AnchorRange will be returned containing a path, the missing
ranged fragment, and the missing items from the range.
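
For illustration, a minimal sketch of such a ranged check might look like the following; the names are hypothetical, not the PR's actual API, which lives in src/check.rs:

```rust
use std::collections::HashSet;

/// Given a ranged fragment like "10-20" and the set of `id` values found in
/// the document, returns the ids from the range that are missing, if any.
/// Hypothetical helper for illustration only.
fn missing_from_range(fragment: &str, ids: &HashSet<String>) -> Option<Vec<String>> {
    let (start, end) = fragment.split_once('-')?;
    let (start, end) = (start.parse::<u32>().ok()?, end.parse::<u32>().ok()?);
    let missing: Vec<String> = (start..=end)
        .map(|n| n.to_string())
        .filter(|n| !ids.contains(n))
        .collect();
    if missing.is_empty() {
        None
    } else {
        Some(missing)
    }
}
```

A link such as src/lib.rs.html#10-20 (an illustrative example) would then be reported only if one of the ids 10 through 20 is absent.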

Bear in mind that this introduces the overhead of performing n+1 parses
or HTTP fetches: multiple links pointing to the same URL with different
fragments end up fetching/parsing the same content for each fragment. To
help with that, fragment lookups are memoized using the cached crate.
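
Stripped of the cached crate's macro syntax, the idea behind the memoization is roughly the following hand-rolled sketch (illustrative only; the PR uses the crate's macros rather than a plain map):

```rust
use std::collections::{HashMap, HashSet};

/// Returns the cached set of `id` values for a document, fetching and parsing
/// it only on the first request. With this in place, n links into the same
/// document cost one fetch/parse instead of n.
fn ids_for<'a>(
    cache: &'a mut HashMap<String, HashSet<String>>,
    url: &str,
    fetch_and_parse: impl FnOnce(&str) -> HashSet<String>,
) -> &'a HashSet<String> {
    cache.entry(url.to_string()).or_insert_with(|| fetch_and_parse(url))
}
```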

Fixes #7

Contributor

@jyn514 jyn514 left a comment


Thanks for working on this, but I kind of wish you'd talked to me about it first ... this is not exactly the approach I would have suggested.

@zregvart zregvart marked this pull request as draft November 11, 2020 09:41
@zregvart
Contributor Author

@jyn514 I've converted this to draft; I'm happy to rebase this and tidy it up with the suggestions above. I don't want to push forward if you're uncomfortable with the direction this is heading, so let me know whether this is the right direction.

@zregvart zregvart changed the title from "feat: check URL fragments for file URLs" to "feat: check URL fragments" Nov 11, 2020
@zregvart
Contributor Author

@jyn514 this is my attempt at using the algorithm described in #94 (comment), based on lol_html, with fragment checks for HTTP URLs. As noted above, I'm aware that there are drawbacks to this approach. Restructuring for a queue/async algorithm might be a bigger refactor than this functionality warrants, so I stayed away from that.
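
For reference, collecting id attributes with lol_html looks roughly like this. This is a sketch against lol_html's current (1.x) API rather than the PR's exact code, and the crate version targeted in 2020 may differ slightly:

```rust
use std::collections::HashSet;

use lol_html::{element, HtmlRewriter, Settings};

/// Streams an HTML document through lol_html and collects every `id` value,
/// which can later be matched against URL fragments (illustrative sketch).
fn collect_ids(html: &[u8]) -> Result<HashSet<String>, Box<dyn std::error::Error>> {
    let mut ids = HashSet::new();
    let mut rewriter = HtmlRewriter::new(
        Settings {
            element_content_handlers: vec![element!("*[id]", |el| {
                if let Some(id) = el.get_attribute("id") {
                    ids.insert(id);
                }
                Ok(())
            })],
            ..Settings::default()
        },
        // We only inspect the document; the rewritten output is discarded.
        |_: &[u8]| {},
    );
    rewriter.write(html)?;
    rewriter.end()?;
    Ok(ids)
}
```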

Contributor

@jyn514 jyn514 left a comment


Ok, after reading your comments you've convinced me this is the right approach :) I forgot that you have to actually parse the other HTML file to see what IDs it has.

@jyn514 jyn514 added the C-enhancement (Category: This is a new feature) and S-waiting-on-review (Status: Waiting for either a maintainer or issue author to review) labels Nov 12, 2020
@zregvart
Contributor Author

@jyn514 thanks for the review, all on point. I'll polish this further.

@zregvart
Contributor Author

@jyn514, I incorporated most of the changes; the bit that I didn't is around caching the IO when reading files and HTTP responses that are then parsed and searched for fragments. I think it makes sense to cache the IO; I think that's the expensive part here.

I'm not particularly keen on the duplication of logic when handling HEAD vs GET requests. The resp.synthetic() and resp.ok() bits are almost completely the same on the two branches; I wonder if you have any suggestions on how to eliminate that duplication?

@zregvart zregvart marked this pull request as ready for review November 13, 2020 12:39
@zregvart
Contributor Author

I'm not entirely unhappy with this version. Thanks for all the feedback and suggestions @jyn514, would you mind having another look?

Contributor

@jyn514 jyn514 left a comment


Left a bunch more comments on the implementation, but I think this is getting close to ready :) The main thing is that I want the caching to work differently: right now it caches on both the HTML file and the fragment, which means it fetches the same file over and over again.

src/check.rs Outdated
Comment on lines 163 to 249
    if resp.synthetic() {
        Err(CheckError::Http(Box::new(HttpError::Fetch(
            url.clone(),
            resp.into_synthetic_error().unwrap(),
        ))))
    } else if resp.ok() {
        Ok(resp)
    } else {
        Err(CheckError::Http(Box::new(HttpError::UnexpectedStatus(
            url.clone(),
            resp,
        ))))
Contributor


note to self: simplify this in a follow-up

@jyn514 jyn514 added the S-waiting-on-author (Status: Waiting on the author of this PR or issue) label and removed the S-waiting-on-review (Status: Waiting for either a maintainer or issue author to review) label Nov 13, 2020
@zregvart
Contributor Author

I've almost rewritten this from scratch, taking in the feedback (thanks @jyn514) and trying a few different avenues, so this now needs a complete review (again). In short, I've made CheckError have four different kinds of errors; there is now a single cached method, using closures to implement protocol-specific logic (reading a file vs issuing a request over HTTP); and I've created a Link enum to remove some duplication.

Thank you for your suggestions thus far, @jyn514. Please take your time in reviewing this; I'm happy to change course and incorporate any suggestions.
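
To give a feel for the closure-based shape described above (types and names here are illustrative, not the ones in src/check.rs): the fragment check itself is protocol-agnostic, and the caller decides how the document's ids are obtained.

```rust
use std::collections::HashSet;

use url::Url;

/// A link is either a local file path or an HTTP(S) URL; sketched after the
/// PR's Link enum, simplified for illustration.
pub enum Link {
    File(String),
    Http(Url),
}

/// Protocol-agnostic fragment check: the caller passes a closure that knows
/// how to produce the document's set of ids (read + parse a file, or GET +
/// parse a response), and this function only checks membership.
/// Hypothetical signature; the PR's CheckError type is richer than a String.
fn check_fragment<F>(link: &Link, fragment: &str, fetch_ids: F) -> Result<(), String>
where
    F: FnOnce(&Link) -> Result<HashSet<String>, String>,
{
    let ids = fetch_ids(link)?;
    if ids.contains(fragment) {
        Ok(())
    } else {
        Err(format!("missing fragment #{fragment}"))
    }
}
```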

@zregvart
Contributor Author

Okay, so I think that's all of the suggestions incorporated. Thank you @jyn514 for reviewing this; I've learned a lot. Mind having another look (for the n-th time :))? Thanks!

@jyn514
Contributor

jyn514 commented Nov 15, 2020

In short, I've made CheckError have four different kinds of errors; there is now a single cached method, using closures to implement protocol-specific logic (reading a file vs issuing a request over HTTP)

What's the reason for having a separate IoError compared to the AnchorError?

@zregvart
Contributor Author

What's the reason for having a separate IoError compared to the AnchorError?

I thought that handling IO issues separately made sense: it's not really that the fragment is missing; rather, we can't check whether it's there due to an IO problem (disk read error, network issue). I think I went down that road once I wanted to unify the two (http/file) fragment-parsing functions into one by using a closure to fetch the HTML content. It can be dropped if you think that would be better.

Contributor

@jyn514 jyn514 left a comment


Mostly nits now; the shape of this looks right. I might leave a few more comments as I try it out locally.

src/check.rs Outdated
@@ -60,17 +100,128 @@ pub fn is_available(url: &Url, ctx: &CheckContext) -> Result<(), CheckError> {
}
}
}
    cached_key_result! {
        CHECK_FILE: SizedCache<String, HashSet<String>> = SizedCache::with_size(100);
        Key = { link.to_string() };
Contributor


Suggested change (add an explanatory comment above the cache key):

    /// `fetch_html` is different depending on whether the link is being loaded from disk or from the network.
    /// However, the set of URLs from disk and from the web are disjoint, so it's OK to cache based on the name of the link alone.
    Key = { link.to_string() };

Contributor Author


The cached_key_result macro doesn't really allow rustdoc, so I've added this as a regular comment.

Comment on lines +44 to +46

    pub enum Link {
        File(String),
        Http(Url),
Contributor


This should let you get rid of the clone()s.

Suggested change (borrow instead of owning, to avoid the clones):

    pub enum Link {          // before
        File(String),
        Http(Url),

    pub enum Link<'a> {      // after
        File(&'a str),
        Http(&'a Url),

If this gives you too much trouble, no problem, I can fix it in a follow-up.

Contributor Author


Yeah, this trickles down to CheckError as well, since CheckError contains Link via Fragment. Perhaps something to clean up later?

Contributor


Sounds good, this PR is already big enough 😅

src/check.rs Outdated
Comment on lines 265 to 266

    let mut url = url.clone();
    url.set_fragment(None);
Contributor


Is this necessary? What happens if you don't call set_fragment?

Contributor Author


Yeah, I refactored that wrong. I do need to fix the caching for HTTP Links, which I think this was doing before the refactor to Link: the cache needs to be keyed by the URL without the fragment, otherwise it will miss for different fragments.
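
Concretely, keying the cache on the URL with its fragment cleared could look like this (the url crate's set_fragment is real; the surrounding cache and the helper's name are illustrative):

```rust
use url::Url;

/// Cache key for an HTTP link: `page#a` and `page#b` must hit the same cache
/// entry, so the fragment is stripped before the key is built.
fn cache_key(url: &Url) -> String {
    let mut key = url.clone();
    key.set_fragment(None);
    key.to_string()
}
```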

@jyn514
Contributor

jyn514 commented Nov 17, 2020

It might also be nice to add a --ignore-fragments flag, otherwise people will suddenly have failing CI with no way to opt out.

Contributor

@jyn514 jyn514 left a comment


Wow, I just tried this on exonum (755 dependencies) and it finished almost immediately :) great job with the caching!

Comment on lines +251 to +278

    let resp = ureq::head(url.as_str()).call();

    handle_response(url, resp).map(|_: ureq::Response| ())
Contributor


As a follow-up, not necessarily here, it might be nice to cache URLs that don't have fragments too.

@jyn514
Contributor

jyn514 commented Nov 18, 2020

This also needs a rebase because of #87 (sorry!)

@jyn514
Contributor

jyn514 commented Nov 18, 2020

@zregvart I think you forgot to push your changes after replying ;)

@zregvart
Contributor Author

@jyn514 should be up to date now.

@zregvart
Contributor Author

It might also be nice to add a --ignore-fragments flag, otherwise people will suddenly have failing CI with no way to opt out.

I agree; I can do that in a follow-up PR if that's okay.
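
A rough sketch of what that opt-out might look like, without committing to any particular CLI parser (the struct here is a hypothetical, pared-down stand-in for the crate's CheckContext, which has other fields):

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for CheckContext, reduced to the opt-out flag
/// under discussion; the field name is illustrative, not the crate's.
pub struct CheckContext {
    pub check_fragments: bool,
}

/// When --ignore-fragments is passed (check_fragments == false), fragment
/// lookups are skipped entirely, so existing CI setups keep passing until
/// a project opts in.
fn fragment_ok(ctx: &CheckContext, fragment: &str, ids: &HashSet<String>) -> bool {
    !ctx.check_fragments || ids.contains(fragment)
}
```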

This implements URL fragment checking. Parsing is expanded to also
collect the values of all `id` attributes.

For missing fragments in files, a `CheckError::Fragment` will be returned
containing a path and the missing fragment.

In Rust documentation, non-existent fragments of the form `#n-m` are used
to highlight a range of lines in the source code. For fragments in that
form, an additional check is performed to test whether all fragments `n`
through `m` exist. In case of missing items in a ranged fragment, a
`CheckError::Fragment` will be returned containing a path, the missing
ranged fragment, and the missing items from the range.

Bear in mind that this introduces the overhead of performing n+1 parses
or HTTP fetches: multiple links pointing to the same URL with different
fragments end up fetching/parsing the same content for each fragment. To
help with that, fragment lookups are memoized using the cached crate.

Fixes deadlinks#7
@jyn514 jyn514 merged commit 343d01d into deadlinks:master Nov 19, 2020
@jyn514
Contributor

jyn514 commented Nov 19, 2020

Thanks so much for sticking with this!

If you're interested in addressing some of the follow-ups, that would be great; --ignore-fragments especially would be a nice addition, I think :) If you don't have time, though, I'm happy to add it myself.

@zregvart
Contributor Author

@jyn514 you're welcome, and thanks for reviewing and merging this. I think I have one or two more touch-ups for this; happy to work on them.

@zregvart zregvart deleted the pr/check-fragments branch November 19, 2020 18:42