PDF link with a hash causes PDF-parsing by accident #663
But the main issue is that PDFs shouldn't be parsed by Nokogiri. (Edit: Deleted a sequence of errors on my part.)
There is a workaround:
HTMLProofer.check_directory(
  "./path",
  { url_ignore: [/.*\.pdf(#.*)?/] }
).run
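To see which links the `url_ignore` pattern above would skip, it can be tried against a few sample URLs in plain Ruby. This is a minimal sketch, assuming html-proofer applies the pattern as an ordinary unanchored regex match; the sample URLs and the stricter `PDF_IGNORE_ANCHORED` variant are illustrative assumptions, not something proposed in this thread.

```ruby
# Pattern copied from the workaround above.
PDF_IGNORE = /.*\.pdf(#.*)?/

# Hypothetical stricter variant (an assumption, not from the thread):
# the path must end in .pdf, optionally followed by a fragment.
PDF_IGNORE_ANCHORED = /\A[^#]*\.pdf(#.*)?\z/

# Sample URLs chosen for illustration only.
urls = ["doc.pdf", "doc.pdf#page=2", "doc.pdf.html", "index.html#doc.pdf"]

urls.each do |url|
  puts format(
    "%-20s original=%-5s anchored=%s",
    url, PDF_IGNORE.match?(url), PDF_IGNORE_ANCHORED.match?(url)
  )
end
```

Because the original pattern is unanchored, it also matches URLs that merely contain `.pdf`, such as `doc.pdf.html`. Note as well that an ignored link is presumably not checked at all, so a missing PDF would go unreported.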
@stevecheckoway
@scivola, @stevecheckoway, here is a simple minimal reproducible example.
Bash:
touch exists.pdf
wget https://file-examples-com.github.io/uploads/2017/10/file-example_PDF_1MB.pdf -O large.pdf
echo '<!doctype html>
<html>
<head><title>Example</title></head>
<body>
<a href="exists.pdf#page=2">Exists</a>
<a href="large.pdf#page=2">Large</a>
<a href="missing.pdf#page=2">Missing</a>
</body>
</html>' > example.html
Ruby:
require 'html-proofer'
HTMLProofer.check_file(
"example.html",
{ :url_swap => {/(?<=\.pdf)#.*$/ => ""} }
).run
# * internally linking to missing.pdf#page=2, which does not exist (line 6)
#   <a href="missing.pdf#page=2">Missing</a>
Note that the error reports the original full link without the hash stripped, but the link is in fact checked post-
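The effect of the `url_swap` pattern above can be checked in isolation with plain Ruby string substitution. This is a minimal sketch, assuming html-proofer applies the pattern and replacement roughly like `String#sub`; the helper name `strip_pdf_fragment` is hypothetical and only for illustration.

```ruby
# Pattern copied from the url_swap option above: a fragment that
# directly follows ".pdf" (the lookbehind keeps ".pdf" itself).
STRIP_PDF_FRAGMENT = /(?<=\.pdf)#.*$/

# Hypothetical helper that mimics the swap on a single URL.
def strip_pdf_fragment(url)
  url.sub(STRIP_PDF_FRAGMENT, "")
end

%w[exists.pdf#page=2 missing.pdf#page=2 page.html#section exists.pdf].each do |url|
  puts "#{url} -> #{strip_pdf_fragment(url)}"
end
```

Only fragments on `.pdf` links are removed, so fragments on HTML links such as `page.html#section` are left in place and can still be checked as anchors.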
@riccardoporreca
Thanks @riccardoporreca! Hopefully, html-proofer itself will be fixed, but in the meantime, that's a good workaround.
If we want to link to a specific page of a PDF file, we can write it as follows:
But this causes HTMLProofer to parse the PDF file as an HTML file.
There are three problems.
[1] This useless parsing wastes CPU time and memory.
[2] htmlproofer will report a failure as follows:
[3] Sometimes HTMLProofer raises an ArgumentError such as:
The exception tends to occur for relatively complicated PDF files.
But I have yet to find the exact conditions for reproducing it.
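One way to avoid the unwanted parsing described above would be to look at the link target's path, without its fragment, before deciding to parse it as HTML. The following is only a sketch of that idea under stated assumptions: `html_target?` and the extension list are hypothetical and are not part of html-proofer.

```ruby
require "uri"

# Hypothetical extension list (an assumption for illustration).
HTML_EXTENSIONS = %w[.html .htm .xhtml].freeze

# Hypothetical helper: true only if the link's path (fragment and
# query excluded) has an HTML-like extension.
def html_target?(href)
  path = URI.parse(href).path.to_s
  HTML_EXTENSIONS.include?(File.extname(path).downcase)
end

%w[exists.pdf#page=2 index.html#top guide.HTM].each do |href|
  puts "#{href}: parse as HTML? #{html_target?(href)}"
end
```

A real check would also need to handle extensionless paths and directory links, which this sketch simply treats as non-HTML.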