PDF link with a hash causes PDF-parsing by accident #663

scivola · 2021-11-26T01:00:07Z

If we want to link to a specific page of a PDF file, we can write it as follows:

<a href="foo.pdf#page=2">foo</a>

But this causes HTMLProofer to parse the PDF file as if it were an HTML file.

There are three problems.

[1] This useless parsing wastes CPU time and memory.

[2] htmlproofer will report a failure as follows:

linking to internal hash #page=2 that does not exist

[3] Sometimes HTMLProofer raises an ArgumentError such as:

/Users/XXXXX/.rbenv/versions/3.0.2/lib/ruby/gems/3.0.0/gems/nokogiri-1.12.5-arm64-darwin/lib/nokogiri/html5/document.rb:68:in `parse': Document tree depth limit exceeded (ArgumentError)

The exception tends to occur with relatively complicated PDF files.
But I have yet to find the exact conditions for reproducing it.


stevecheckoway · 2021-12-06T23:20:14Z

The ArgumentError occurs because Nokogiri has a default maximum HTML tree depth of 400 and the HTML5 parser will build a document from any sequence of bytes. Something about the PDF is causing that limit to be exceeded.

But the main issue is that PDFs shouldn't be parsed by Nokogiri. (Edit: Deleted a sequence of errors on my part.)

stevecheckoway · 2021-12-07T02:50:39Z

There is a workaround.

  HTMLProofer.check_directory(
    "./path",
    { url_ignore: [/.*\.pdf(#.*)?/] }
  ).run

scivola · 2021-12-07T07:33:47Z

@stevecheckoway
Thank you for your advice.
But the workaround disables the check for existence of PDFs, too.
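
To illustrate the trade-off, here is a small sketch in plain Ruby that applies the same pattern to a few link targets. It does not call html-proofer at all: the `ignored?` helper and the sample links are hypothetical, and the helper only approximates how an ignore list might be applied.

```ruby
# Pattern taken from the url_ignore workaround above.
URL_IGNORE = [/.*\.pdf(#.*)?/].freeze

# Hypothetical helper (not html-proofer's code): true when any
# ignore pattern matches the URL.
def ignored?(url, patterns = URL_IGNORE)
  patterns.any? { |pattern| pattern.match?(url) }
end

links = [
  "exists.pdf#page=2",  # PDF link with a hash
  "missing.pdf#page=2", # broken PDF link with a hash
  "missing.pdf",        # broken PDF link without a hash
  "page.html#intro"     # ordinary HTML link
]

links.each do |url|
  puts "#{url}: #{ignored?(url) ? 'skipped' : 'checked'}"
end
```

Every PDF link is skipped, including the two that point to a missing file, so a broken PDF link would go unreported under this pattern.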

riccardoporreca · 2021-12-08T22:21:36Z

@scivola, @stevecheckoway, url_swap might be your better friend here: it lets you check links to PDFs while stripping the hash.

Simple minimal reproducible example
EDIT: the example now includes an actual PDF file that would cause the ArgumentError from Nokogiri w/o url_swap.

Bash:

touch exists.pdf
wget https://file-examples-com.github.io/uploads/2017/10/file-example_PDF_1MB.pdf -O large.pdf
echo '<!doctype html>
<html>
<head><title>Example</title></head>
<body>
<a href="exists.pdf#page=2">Exists</a>
<a href="large.pdf#page=2">Large</a>
<a href="missing.pdf#page=2">Missing</a>
</body>
</html>' > example.html

Ruby:

require 'html-proofer'
HTMLProofer.check_file(
    "example.html",
    { :url_swap => {/(?<=\.pdf)#.*$/ => ""} }
).run

#   *  internally linking to missing.pdf#page=2, which does not exist (line 7)
#     <a href="missing.pdf#page=2">Missing</a>

Note that the error reports the original full link, with the hash not stripped, but the link is in fact checked after url_swap has been applied.

scivola · 2021-12-09T00:44:34Z

@riccardoporreca
I was able to resolve the problem using the url_swap option. I have adopted /(?<=\.pdf)#.*/i.
Thank you for the advice.
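
For readers comparing the two patterns in this thread, the following plain-Ruby sketch shows what each one does to a link. It does not use html-proofer: `strip_pdf_hash` is a hypothetical helper that replaces the match with an empty string, which is assumed to be roughly the effect of a url_swap entry mapping the pattern to "".

```ruby
# The two patterns quoted in this thread.
CASE_SENSITIVE   = /(?<=\.pdf)#.*$/
CASE_INSENSITIVE = /(?<=\.pdf)#.*/i

# Hypothetical helper: replace the first match of the pattern with "".
def strip_pdf_hash(url, pattern)
  url.sub(pattern, "")
end

# The lookbehind (?<=\.pdf) requires ".pdf" immediately before the "#",
# so only the fragment is removed and only for PDF links.
puts strip_pdf_hash("foo.pdf#page=2", CASE_SENSITIVE)    # foo.pdf
puts strip_pdf_hash("page.html#intro", CASE_SENSITIVE)   # page.html#intro

# An upper-case extension is only handled by the /i variant.
puts strip_pdf_hash("FOO.PDF#page=2", CASE_SENSITIVE)    # FOO.PDF#page=2
puts strip_pdf_hash("FOO.PDF#page=2", CASE_INSENSITIVE)  # FOO.PDF
```

The /i flag is the only practical difference between the two: it also covers files named with an upper-case .PDF extension.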

stevecheckoway · 2021-12-10T15:24:21Z

Thanks @riccardoporreca! Hopefully, html-proofer itself will be fixed, but in the meantime, that's a good workaround.

gjtorikian added the next-gen label Dec 17, 2021

gjtorikian mentioned this issue Dec 31, 2021

Change extensions logic #677

Merged

gjtorikian closed this as completed in #677 Dec 31, 2021

stevecheckoway mentioned this issue Jan 8, 2023

Crashing when checking the external hashes of URLs #787

Closed
