feat: check URL fragments #94
Conversation
Force-pushed from c71d76e to b8e3240
Thanks for working on this, but I kind of wish you'd talked to me about it first ... this is not exactly the approach I would have suggested.
@jyn514 I've converted this to draft, I'm happy to rebase this and tidy up with the suggestions above. I don't want to push forward if you're uncomfortable with the direction this is heading, so let me know if this is the right direction to go forward.
Force-pushed from b8e3240 to c4a106b
Force-pushed from c4a106b to c5cc36c
@jyn514 this is my attempt at using the algorithm as described in #94 (comment), based on lol_html, with fragment checks for HTTP URLs. As noted above, I'm aware that there are drawbacks to this approach. Restructuring for a queue/async algorithm might be a bigger refactor than this functionality itself, so I stayed away from that.
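For reference, collecting the `id` values with lol_html looks roughly like this (a sketch, not the PR's actual parsing code; `collect_ids` is a hypothetical helper):

```rust
use std::collections::HashSet;
use std::error::Error;

use lol_html::{element, HtmlRewriter, Settings};

// Gather the values of every `id` attribute in an HTML document, so that a
// fragment check afterwards becomes a simple set lookup.
fn collect_ids(html: &str) -> Result<HashSet<String>, Box<dyn Error>> {
    let mut ids = HashSet::new();
    let mut rewriter = HtmlRewriter::new(
        Settings {
            element_content_handlers: vec![element!("*[id]", |el| {
                if let Some(id) = el.get_attribute("id") {
                    ids.insert(id);
                }
                Ok(())
            })],
            ..Settings::default()
        },
        |_: &[u8]| {}, // we only inspect the document, so discard the rewritten output
    );
    rewriter.write(html.as_bytes())?;
    rewriter.end()?;
    Ok(ids)
}
```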
Ok, after reading your comments you've convinced me this is the right approach :) I forgot that you have to actually parse the other HTML file to see what IDs it has.
@jyn514 thanks for the review, all on point. I'll polish this further.
Force-pushed from c5cc36c to 8da785e
@jyn514, I incorporated most of the changes; the bit that I didn't is caching the IO when reading files and HTTP responses before parsing and looking for fragments. I think it makes sense to cache the IO, since that's the expensive part here. I'm not particularly keen on the duplication of logic when handling HEAD vs GET requests. The …
Force-pushed from 8da785e to 0e1fc32
I'm not entirely unhappy with this version. Thanks for all the feedback and suggestions @jyn514, would you mind having another look?
Left a bunch more comments on the implementation, but I think this is getting close to ready :) the main thing is that I want the caching to work differently - right now it caches on both the HTML file and the fragment, which means it fetches the same file over and over again.
src/check.rs (outdated):
```rust
if resp.synthetic() {
    Err(CheckError::Http(Box::new(HttpError::Fetch(
        url.clone(),
        resp.into_synthetic_error().unwrap(),
    ))))
} else if resp.ok() {
    Ok(resp)
} else {
    Err(CheckError::Http(Box::new(HttpError::UnexpectedStatus(
        url.clone(),
        resp,
    ))))
}
```
note to self: simplify this in a follow-up
Force-pushed from 0e1fc32 to c7b4642
I've almost rewritten this from scratch, taking in the feedback (thanks @jyn514) and trying a few different avenues. So this now needs a complete review (again). In short: `CheckError` now has four different kinds of errors; there is now a single cached method, using closures to implement protocol-specific logic (reading a file vs issuing a request over HTTP); and I created a `Link` enum to remove some duplication. Thank you for your suggestions thus far @jyn514, please take your time in reviewing this, I'm happy to change course and incorporate any suggestions.
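Sketched out, the four error kinds look something like this (the variant payloads here are illustrative, assembled from the snippets in this PR, not copied verbatim from src/check.rs):

```rust
use std::io;
use std::path::PathBuf;

use url::Url;

// `Link` and `HttpError` as they appear elsewhere in this PR (simplified).
pub enum Link {
    File(String),
    Http(Url),
}

pub enum HttpError {
    Fetch(Url, ureq::Error),
    UnexpectedStatus(Url, ureq::Response),
}

// Illustrative shape of the four error kinds described above.
pub enum CheckError {
    /// The linked file does not exist.
    File(PathBuf),
    /// The document exists, but the fragment (and, for ranged fragments,
    /// some ids within the range) is missing.
    Fragment(Link, String, Option<Vec<String>>),
    /// The HTTP request failed or returned an unexpected status.
    Http(Box<HttpError>),
    /// The document could not be read due to an IO problem.
    Io(Box<io::Error>),
}
```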
Force-pushed from c7b4642 to 74e0f24
Okay, so I think that's all of the suggestions incorporated. Thank you @jyn514 for reviewing this, I've learned a lot. Mind having another look (for the n-th time :)), thanks!
What's the reason for having a separate `IoError` compared to the `AnchorError`?
I thought that handling IO issues separately made sense: it's not that the fragment is missing; rather, we can't check whether it's there due to an IO problem (disk read error, network issue). I think I went down that road once I wanted to unify the two (http/file) fragment-parsing functions into one by using a closure to fetch the HTML content. It can be dropped if you think that would be better.
Mostly nits now, the shape of this looks right. I might leave a few more comments as I try it out locally.
src/check.rs (outdated):
```
@@ -60,17 +100,128 @@ pub fn is_available(url: &Url, ctx: &CheckContext) -> Result<(), CheckError> {
        }
    }
}
cached_key_result! {
    CHECK_FILE: SizedCache<String, HashSet<String>> = SizedCache::with_size(100);
    Key = { link.to_string() };
```
Suggested change:
```diff
+/// `fetch_html` is different depending on whether the link is being loaded from disk or from the network.
+/// However, the set of URLs from disk and from the web are disjoint, so it's OK to cache based on the name of the link alone.
 Key = { link.to_string() };
```
The `cached_key_result` macro doesn't really allow rustdoc, so I've added this as a regular comment.
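For context, the memoized helper might look roughly like this with the `cached` crate's `cached_key_result!` macro (the function name `fragments_from` and the `parse_ids` helper are assumptions for this sketch):

```rust
use std::collections::HashSet;

use cached::cached_key_result;
use cached::SizedCache;

cached_key_result! {
    CHECK_FILE: SizedCache<String, HashSet<String>> = SizedCache::with_size(100);
    Key = { link.to_string() };
    // `fetch_html` is different depending on whether the link is being loaded
    // from disk or from the network. However, the set of URLs from disk and
    // from the web are disjoint, so it's OK to cache based on the name of the
    // link alone.
    fn fragments_from(
        link: &Link,
        fetch_html: fn(&Link) -> Result<String, CheckError>,
    ) -> Result<HashSet<String>, CheckError> = {
        // Fetch and parse once; later calls for the same link, regardless of
        // which fragment is requested, hit the cache. Only Ok results are cached.
        Ok(parse_ids(&fetch_html(link)?))
    }
}
```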
```rust
pub enum Link {
    File(String),
    Http(Url),
```
This should let you get rid of the `clone()`s.
Suggested change:
```diff
-pub enum Link {
-    File(String),
-    Http(Url),
+pub enum Link<'a> {
+    File(&'a str),
+    Http(&'a Url),
```
If this gives you too much trouble, no problem, I can fix it in a follow-up.
Yeah, this trickles down to `CheckError` as well, since `CheckError` contains `Link` via `Fragment`. Perhaps something to clean up later?
Sounds good, this PR is already big enough 😅
src/check.rs (outdated):
```rust
let mut url = url.clone();
url.set_fragment(None);
```
Is this necessary? What happens if you don't call `set_fragment`?
Yeah, I refactored that wrong, though I do need to fix the caching for HTTP links, which I think this was doing before the refactor to `Link`. The cache needs to be keyed to the URL without the fragment, otherwise it'll miss for different fragments.
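Concretely, something along these lines (a sketch; `cache_key` is a hypothetical helper):

```rust
use url::Url;

// Derive the cache key from the URL with its fragment stripped, so links that
// differ only in the fragment share a single cached fetch of the same document.
fn cache_key(url: &Url) -> String {
    let mut key = url.clone();
    key.set_fragment(None);
    key.to_string()
}
```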
It might also be nice to add a …
Wow, I just tried this on `exonum` (755 dependencies) and it finished almost immediately :) great job with the caching!
```rust
let resp = ureq::head(url.as_str()).call();
handle_response(url, resp).map(|_: ureq::Response| ())
```
As a follow-up, not necessarily here, it might be nice to cache URLs that don't have fragments too.
This also needs a rebase because of #87 (sorry!)
@zregvart I think you forgot to push your changes after replying ;)
Force-pushed from 74e0f24 to 1d617ab
@jyn514 should be up to date now.
I agree, I can do that in a follow-up PR if that's okay.
This implements URL fragment checking. Parsing is expanded to also collect the values of all `id` attributes. For missing fragments in files a `CheckError::Fragment` will be returned containing a path and the missing fragment.

In Rust documentation, non-existent fragments of the form `#n-m` are used to highlight a range of lines in the source code. For fragments in that form an additional check is performed to test whether all fragments `n` through `m` exist. In case of missing items in a ranged fragment a `CheckError::Fragment` will be returned containing a path, the missing ranged fragment, and the missing items from the range.

Bear in mind that this introduces the overhead of performing n+1 parses or HTTP fetches: multiple links pointing to the same URL with different fragments end up fetching/parsing the same content for each fragment. To help with that, fragment caching uses memoization via the `cached` crate.

Fixes deadlinks#7
Force-pushed from 1d617ab to bf38772
Thanks so much for sticking with this! If you're interested in addressing some of the follow-ups, that would be great -
@jyn514 you're welcome and thanks for reviewing and merging this, I think I have one or two more touch-ups on this, happy to work on them.
Note: I'm fairly certain that this can be polished; please consider the direction of the implementation first. I'm happy to include any suggestions. I appreciate the time it takes to review this largish PR. Thanks!
Update: rebased and changed to check fragments in HTTP URLs
This implements URL fragment checking. Parsing is expanded to also collect the values of all `id` attributes.

For missing fragments in files a `CheckError::Anchor` will be returned containing a path and the missing fragment.

In Rust documentation, non-existent fragments of the form `#n-m` are used to highlight a range of lines in the source code. For fragments in that form an additional check is performed to test whether all fragments `n` through `m` exist. In case of missing items in a ranged fragment a `CheckError::AnchorRange` will be returned containing a path, the missing ranged fragment, and the missing items from the range.

Bear in mind that this introduces the overhead of performing n+1 parses or HTTP fetches: multiple links pointing to the same URL with different fragments end up fetching/parsing the same content for each fragment. To help with that, fragment caching uses memoization via the `cached` crate.

Fixes #7
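To illustrate the ranged-fragment rule, a check along these lines (a hypothetical helper, not the PR's code) would verify the `n-m` part of a fragment against the collected ids, relying on rustdoc source pages giving each line a numeric id:

```rust
use std::collections::HashSet;

// Hypothetical sketch: given the "n-m" part of a ranged fragment and the set
// of ids collected from the target document, return the ids in the range that
// are missing. `None` means the fragment is not of the ranged form at all.
fn missing_in_range(fragment: &str, ids: &HashSet<String>) -> Option<Vec<String>> {
    let (start, end) = fragment.split_once('-')?;
    let (start, end) = (start.parse::<u32>().ok()?, end.parse::<u32>().ok()?);
    Some(
        (start..=end)
            .map(|n| n.to_string())
            .filter(|line_id| !ids.contains(line_id))
            .collect(),
    )
}
```

An empty `Vec` then means the whole range is present; a non-empty one maps onto the "missing items from the range" payload described above.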