PDF link with a hash causes PDF-parsing by accident #663

Closed
scivola opened this issue Nov 26, 2021 · 6 comments · Fixed by #677

@scivola

scivola commented Nov 26, 2021

To link to a specific page of a PDF file, we can write:

<a href="foo.pdf#page=2">foo</a>

But this causes HTMLProofer to parse the PDF file as if it were an HTML file.

This leads to three problems.

[1] This useless parsing wastes CPU time and memory.

[2] htmlproofer will report a failure as follows:

linking to internal hash #page=2 that does not exist

[3] Sometimes HTMLProofer raises an ArgumentError such as:

/Users/XXXXX/.rbenv/versions/3.0.2/lib/ruby/gems/3.0.0/gems/nokogiri-1.12.5-arm64-darwin/lib/nokogiri/html5/document.rb:68:in `parse': Document tree depth limit exceeded (ArgumentError)

The exception tends to occur for relatively complicated PDF files,
but I have yet to find the exact conditions for reproducing it.

@stevecheckoway
Contributor

stevecheckoway commented Dec 6, 2021

The ArgumentError occurs because Nokogiri's HTML5 parser has a default maximum tree depth of 400, and every sequence of bytes parses as an HTML document. Something about the PDF is causing that limit to be exceeded.

But the main issue is that PDFs shouldn't be parsed by Nokogiri. (Edit: Deleted a sequence of errors on my part.)
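
One way to picture the kind of guard this implies: decide from the link target's path, with any fragment removed, whether the file should be handed to an HTML parser at all. This is only a rough sketch. The helper `parseable_as_html?` and the extension list are hypothetical and are not html-proofer's actual logic, and treating extensionless paths as HTML is an assumption made for the example.

```ruby
require "uri"

# Hypothetical allow-list of extensions worth parsing as HTML.
HTML_EXTENSIONS = %w[.html .htm .xhtml].freeze

# Returns true when the href, ignoring any "#fragment", points at a path
# whose extension suggests HTML. Extensionless paths (e.g. "docs/") are
# treated as HTML here, which is an assumption of this sketch.
def parseable_as_html?(href)
  path = URI.parse(href).path.to_s
  ext = File.extname(path).downcase
  ext.empty? || HTML_EXTENSIONS.include?(ext)
end

%w[foo.pdf#page=2 page.html#section index.htm docs/].each do |href|
  puts "#{href}: #{parseable_as_html?(href)}"
end
```

With a check like this, `foo.pdf#page=2` would never reach the parser, while `page.html#section` would still have its internal hash verified.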

@stevecheckoway
Contributor

There is a workaround.

  require "html-proofer"

  HTMLProofer.check_directory(
    "./path",
    { url_ignore: [/.*\.pdf(#.*)?/] }
  ).run

@scivola
Author

scivola commented Dec 7, 2021

@stevecheckoway
Thank you for your advice.
But the workaround disables the check for existence of PDFs, too.
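
The reason can be seen by evaluating the ignore pattern on its own. The sketch below does not run html-proofer; it only applies the regex to a few made-up hrefs, assuming that `url_ignore` skips every URL the pattern matches.

```ruby
# The ignore pattern from the workaround.
IGNORE_PATTERN = /.*\.pdf(#.*)?/

# Illustrative hrefs; the file names are invented for this example.
hrefs = ["exists.pdf#page=2", "missing.pdf#page=2", "missing.pdf", "index.html"]

hrefs.each do |href|
  status = href.match?(IGNORE_PATTERN) ? "ignored (never checked)" : "checked"
  puts "#{href}: #{status}"
end
```

Every PDF link matches, with or without a hash, so a link to a PDF that does not exist would go unreported.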

@riccardoporreca
Collaborator

riccardoporreca commented Dec 8, 2021

@scivola, @stevecheckoway, url_swap might be a better fit here: it lets you check links to PDFs while stripping the hash.

Simple minimal reproducible example
EDIT: the example now includes an actual PDF file that would cause the ArgumentError from Nokogiri w/o url_swap.

Bash:

touch exists.pdf
wget https://file-examples-com.github.io/uploads/2017/10/file-example_PDF_1MB.pdf -O large.pdf
echo '<!doctype html>
<html>
<head><title>Example</title></head>
<body>
<a href="exists.pdf#page=2">Exists</a>
<a href="large.pdf#page=2">Large</a>
<a href="missing.pdf#page=2">Missing</a>
</body>
</html>' > example.html

Ruby:

require 'html-proofer'
HTMLProofer.check_file(
    "example.html",
    { :url_swap => {/(?<=\.pdf)#.*$/ => ""} }
).run

#   *  internally linking to missing.pdf#page=2, which does not exist (line 6)
#     <a href="missing.pdf#page=2">Missing</a>

Note that the error reports the original full link, with the hash still attached, but the link is in fact checked after url_swap has been applied.
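
To make the effect of the pattern concrete, here is a small sketch that applies it with plain `String#sub`. It assumes url_swap behaves like a regex substitution of each key with its value; `strip_pdf_hash` is a hypothetical helper, not part of html-proofer.

```ruby
# Pattern from the url_swap option: a "#..." suffix that directly
# follows ".pdf" (the lookbehind keeps ".pdf" itself out of the match).
PDF_HASH = /(?<=\.pdf)#.*$/

# Hypothetical helper that mimics swapping the match for "".
def strip_pdf_hash(href)
  href.sub(PDF_HASH, "")
end

puts strip_pdf_hash("missing.pdf#page=2")  # missing.pdf
puts strip_pdf_hash("page.html#section")   # page.html#section
```

Because of the lookbehind, hashes on non-PDF links are left alone, so internal anchors in HTML pages are still checked as usual.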

@scivola
Author

scivola commented Dec 9, 2021

@riccardoporreca
I was able to resolve the problem using the url_swap option. I have adopted /(?<=\.pdf)#.*/i.
Thank you for the advice.
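
The `i` flag matters for files with an upper-case extension. A quick comparison of the two patterns on a made-up href, using plain `String#sub` rather than html-proofer itself:

```ruby
CASE_SENSITIVE   = /(?<=\.pdf)#.*$/
CASE_INSENSITIVE = /(?<=\.pdf)#.*/i

# Illustrative href with an upper-case extension.
href = "REPORT.PDF#page=2"

puts href.sub(CASE_SENSITIVE, "")    # REPORT.PDF#page=2
puts href.sub(CASE_INSENSITIVE, "")  # REPORT.PDF
```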

@stevecheckoway
Contributor

Thanks @riccardoporreca! Hopefully, html-proofer itself will be fixed, but in the meantime, that's a good workaround.
