Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML5 parsing and error checking #362

Merged
merged 11 commits into from
Dec 13, 2019
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 3 additions & 1 deletion Gemfile
Original file line number Diff line number Diff line change
@@ -1,3 +1,5 @@
source 'http://rubygems.org'
source 'https://rubygems.org'
jeremy marked this conversation as resolved.
Show resolved Hide resolved

gem 'nokogumbo', git: 'https://github.com/rubys/nokogumbo'
jeremy marked this conversation as resolved.
Show resolved Hide resolved

gemspec
10 changes: 5 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -72,7 +72,7 @@ Below is mostly comprehensive list of checks that HTMLProofer can perform.

### HTML

* Whether your HTML markup is valid. This is done via [Nokogiri](http://www.nokogiri.org/tutorials/ensuring_well_formed_markup.html) to ensure well-formed markup.
* Whether your HTML markup is valid. This is done via [Nokogumbo](https://github.com/rubys/nokogumbo) to validate well-formed HTML5 markup.

## Usage

Expand Down Expand Up @@ -215,7 +215,7 @@ htmlproofer --assume-extension ./_site

### Using through Docker

If you have trouble with (or don't want to) install Ruby/Nokogiri, the command-line tool can be run through Docker. See [html-proofer-docker](https://github.com/18F/html-proofer-docker) for more information.
If you have trouble with (or don't want to) install Ruby/Nokogumbo, the command-line tool can be run through Docker. See [html-proofer-docker](https://github.com/18F/html-proofer-docker) for more information.

## Ignoring content

Expand Down Expand Up @@ -245,7 +245,7 @@ The `HTMLProofer` constructor takes an optional hash of additional options:
| `check_external_hash` | Checks whether external hashes exist (even if the webpage exists). This slows the checker down. | `false` |
| `check_favicon` | Enables the favicon checker. | `false` |
| `check_opengraph` | Enables the Open Graph checker. | `false` |
| `check_html` | Enables HTML validation errors from Nokogiri | `false` |
| `check_html` | Enables HTML validation errors from Nokogumbo | `false` |
| `check_img_http` | Fails an image if it's marked as `http` | `false` |
|`checks_to_ignore`| An array of Strings indicating which checks you'd like to not perform. | `[]`
| `directory_index_file` | Sets the file to look for when a link refers to a directory. | `index.html` |
Expand Down Expand Up @@ -275,13 +275,13 @@ See below for more information.

### Configuring HTML validation rules

If `check_html` is `true`, Nokogiri performs additional validation on your HTML.
If `check_html` is `true`, Nokogumbo performs additional validation on your HTML.

You can pass in additional options to configure this validation.

| Option | Description | Default |
| :----- | :---------- | :------ |
| `report_invalid_tags` | When `check_html` is enabled, HTML markup that is unknown to Nokogiri are reported as errors. | `false`
| `report_invalid_tags` | When `check_html` is enabled, HTML markup that is unknown to Nokogumbo are reported as errors. | `false`
| `report_missing_names` | When `check_html` is enabled, HTML markup that are missing entity names are reported as errors. | `false`
| `report_script_embeds` | When `check_html` is enabled, `script` tags containing markup [are reported as errors](http://git.io/vOovv). Enabling this option ignores those errors. | `false`

Expand Down
10 changes: 6 additions & 4 deletions bin/htmlproofer
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,7 @@ Mercenary.program(:htmlproofer) do |p|
p.option 'checks_to_ignore', '--checks-to-ignore check1,[check2,...]', Array, ' An array of Strings indicating which checks you\'d like to not perform.'
p.option 'check_external_hash', '--check-external-hash', 'Checks whether external hashes exist (even if the webpage exists). This slows the checker down (default: `false`).'
p.option 'check_favicon', '--check-favicon', 'Enables the favicon checker (default: `false`).'
p.option 'check_html', '--check-html', 'Enables HTML validation errors from Nokogiri (default: `false`).'
p.option 'check_html', '--check-html', 'Enables HTML validation errors from Nokogumbo (default: `false`).'
p.option 'check_img_http', '--check-img-http', 'Fails an image if it\'s marked as `http` (default: `false`).'
p.option 'check_opengraph', '--check-opengraph', 'Enables the Open Graph checker (default: `false`).'
p.option 'check_sri', '--check-sri', 'Check that `<link>` and `<script>` external resources do use SRI (default: `false`).'
Expand All @@ -34,9 +34,10 @@ Mercenary.program(:htmlproofer) do |p|
p.option 'external_only', '--external_only', 'Only checks problems with external references'
p.option 'file_ignore', '--file-ignore file1,[file2,...]', Array, 'A comma-separated list of Strings or RegExps containing file paths that are safe to ignore'
p.option 'http_status_ignore', '--http-status-ignore 123,[xxx, ...]', Array, 'A comma-separated list of numbers representing status codes to ignore.'
p.option 'report_invalid_tags', '--report-invalid-tags', 'Ignore `check_html` errors associated with unknown markup (default: `false`)'
p.option 'report_missing_names', '--report-missing-names', 'Ignore `check_html` errors associated with missing entities (default: `false`)'
p.option 'report_script_embeds', '--report-script-embeds', 'Ignore `check_html` errors associated with `script`s (default: `false`)'
p.option 'report_invalid_tags', '--report-invalid-tags', 'Report `check_html` errors associated with unknown markup (default: `false`)'
p.option 'report_missing_names', '--report-missing-names', 'Report `check_html` errors associated with missing entities (default: `false`)'
p.option 'report_script_embeds', '--report-script-embeds', 'Report `check_html` errors associated with `script`s (default: `false`)'
p.option 'report_missing_doctype', '--report-missing-doctype', 'Report `check_html` errors associated with missing or out-of-order DOCTYPE (default: `false`)'
p.option 'log_level', '--log-level <level>', String, 'Sets the logging level, as determined by Yell. One of `:debug`, `:info`, `:warn`, `:error`, or `:fatal`. (default: `:info`)'
p.option 'only_4xx', '--only-4xx', 'Only reports errors for links that fall within the 4xx status code range'
p.option 'timeframe', '--timeframe <time>', String, 'A string representing the caching timeframe.'
Expand Down Expand Up @@ -79,6 +80,7 @@ Mercenary.program(:htmlproofer) do |p|
options[:validation][:report_script_embeds] = opts['report_script_embeds']
options[:validation][:report_missing_names] = opts['report_missing_names']
options[:validation][:report_invalid_tags] = opts['report_invalid_tags']
options[:validation][:report_missing_doctype] = opts['report_missing_doctype']

options[:cache] = {}
options[:cache][:timeframe] = opts['timeframe'] unless opts['timeframe'].nil?
Expand Down
2 changes: 1 addition & 1 deletion html-proofer.gemspec
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@ Gem::Specification.new do |gem|
gem.require_paths = ['lib']

gem.add_dependency 'mercenary', '~> 0.3.2'
gem.add_dependency 'nokogiri', '~> 1.8.1'
gem.add_dependency 'nokogumbo', '>= 2.0.0.alpha', '< 3'
jeremy marked this conversation as resolved.
Show resolved Hide resolved
gem.add_dependency 'colorize', '~> 0.8'
gem.add_dependency 'typhoeus', '~> 1.3'
gem.add_dependency 'yell', '~> 2.0'
Expand Down
31 changes: 17 additions & 14 deletions lib/html-proofer/check/html.rb
Original file line number Diff line number Diff line change
@@ -1,26 +1,29 @@
class HtmlCheck < ::HTMLProofer::Check
# tags embedded in scripts are used in templating languages: http://git.io/vOovv
SCRIPT_EMBEDS_MSG = /Element script embeds close tag/
INVALID_TAG_MSG = /Tag ([\w\-:]+) invalid/
INVALID_PREFIX = /Namespace prefix/
PARSE_ENTITY_REF = /htmlParseEntityRef: no name/
DOCTYPE_MSG = /The doctype must be the first token in the document/

def run
@html.errors.each do |error|
message = error.message
line = error.line

if message =~ INVALID_TAG_MSG || message =~ INVALID_PREFIX
next unless options[:validation][:report_invalid_tags]
end

if message =~ PARSE_ENTITY_REF
next unless options[:validation][:report_missing_names]
end

# tags embedded in scripts are used in templating languages: http://git.io/vOovv
next if !options[:validation][:report_script_embeds] && message =~ SCRIPT_EMBEDS_MSG
add_issue(error.message, line: error.line) if report?(error.message)
end
end

add_issue(message, line: line)
def report?(message)
case message
when SCRIPT_EMBEDS_MSG
options[:validation][:report_script_embeds]
when INVALID_TAG_MSG, INVALID_PREFIX
options[:validation][:report_invalid_tags]
when PARSE_ENTITY_REF
options[:validation][:report_missing_names]
when DOCTYPE_MSG
options[:validation][:report_missing_doctype]
else
true
jeremy marked this conversation as resolved.
Show resolved Hide resolved
end
end
end
2 changes: 1 addition & 1 deletion lib/html-proofer/check/links.rb
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,7 @@ def run
if missing_href?
next if @link.allow_missing_href?
# HTML5 allows dropping the href: http://git.io/vBX0z
next if @html.internal_subset.name == 'html' && @html.internal_subset.external_id.nil?
next if @html.internal_subset.nil? || (@html.internal_subset.name == 'html' && @html.internal_subset.external_id.nil?)
add_issue('anchor has no href attribute', line: line, content: content)
next
end
Expand Down
3 changes: 2 additions & 1 deletion lib/html-proofer/configuration.rb
Original file line number Diff line number Diff line change
Expand Up @@ -49,7 +49,8 @@ module Configuration
VALIDATION_DEFAULTS = {
report_script_embeds: false,
report_missing_names: false,
report_invalid_tags: false
report_invalid_tags: false,
report_missing_doctype: false
}.freeze

CACHE_DEFAULTS = {}.freeze
Expand Down
14 changes: 2 additions & 12 deletions lib/html-proofer/utils.rb
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
require 'nokogiri'
require 'nokogumbo'

module HTMLProofer
module Utils
Expand All @@ -13,7 +13,7 @@ def create_nokogiri(path)
path
end

Nokogiri::HTML(clean_content(content))
Nokogiri::HTML5(content)
end
module_function :create_nokogiri

Expand All @@ -24,15 +24,5 @@ def swap(href, replacement)
href
end
module_function :swap

# address a problem with Nokogiri's parsing URL entities
# problem from http://git.io/vBYU1
# solution from http://git.io/vBYUi
def clean_content(string)
string.gsub(%r{(?:https?:)?//([^>]+)}i) do |url|
url.gsub(/&(?!amp;)/, '&amp;')
end
end
module_function :clean_content
end
end
4 changes: 2 additions & 2 deletions spec/html-proofer/command_spec.rb
Original file line number Diff line number Diff line change
Expand Up @@ -87,9 +87,9 @@
end

it 'works with check-html' do
broken = "#{FIXTURES_DIR}/html/invalid_tag.html"
broken = "#{FIXTURES_DIR}/html/unmatched_end_tag.html"
output = make_bin('--check-html --report-invalid-tags', broken)
expect(output).to match('1 failure')
expect(output).to match('HTML-Proofer finished successfully')
end

it 'works with empty-alt-ignore' do
Expand Down
16 changes: 8 additions & 8 deletions spec/html-proofer/element_spec.rb
Original file line number Diff line number Diff line change
Expand Up @@ -2,54 +2,54 @@

describe HTMLProofer::Element do
before(:each) do
@check = HTMLProofer::Check.new('', '', Nokogiri::HTML(''), HTMLProofer::Configuration::PROOFER_DEFAULTS)
@check = HTMLProofer::Check.new('', '', Nokogiri::HTML5(''), HTMLProofer::Configuration::PROOFER_DEFAULTS)
end

describe '#initialize' do
it 'accepts the xmlns attribute' do
nokogiri = Nokogiri::HTML('<a xmlns:cc="http://creativecommons.org/ns#">Creative Commons</a>')
nokogiri = Nokogiri::HTML5('<a xmlns:cc="http://creativecommons.org/ns#">Creative Commons</a>')
checkable = HTMLProofer::Element.new(nokogiri.css('a').first, @check)
expect(checkable.instance_variable_get(:@xmlns_cc)).to eq 'http://creativecommons.org/ns#'
end

it 'assignes the text node' do
nokogiri = Nokogiri::HTML('<p>One')
nokogiri = Nokogiri::HTML5('<p>One')
checkable = HTMLProofer::Element.new(nokogiri.css('p').first, @check)
expect(checkable.instance_variable_get(:@text)).to eq 'One'
end

it 'accepts the content attribute' do
nokogiri = Nokogiri::HTML('<meta name="twitter:card" content="summary">')
nokogiri = Nokogiri::HTML5('<meta name="twitter:card" content="summary">')
checkable = HTMLProofer::Element.new(nokogiri.css('meta').first, @check)
expect(checkable.instance_variable_get(:@content)).to eq 'summary'
end
end

describe '#ignores_pattern_check' do
it 'works for regex patterns' do
nokogiri = Nokogiri::HTML('<script src=/assets/main.js></script>')
nokogiri = Nokogiri::HTML5('<script src=/assets/main.js></script>')
checkable = HTMLProofer::Element.new(nokogiri.css('script').first, @check)
expect(checkable.ignores_pattern_check([%r{\/assets\/.*(js|css|png|svg)}])).to eq true
end

it 'works for string patterns' do
nokogiri = Nokogiri::HTML('<script src=/assets/main.js></script>')
nokogiri = Nokogiri::HTML5('<script src=/assets/main.js></script>')
checkable = HTMLProofer::Element.new(nokogiri.css('script').first, @check)
expect(checkable.ignores_pattern_check(['/assets/main.js'])).to eq true
end
end

describe '#url' do
it 'works for src attributes' do
nokogiri = Nokogiri::HTML('<img src=image.png />')
nokogiri = Nokogiri::HTML5('<img src=image.png />')
checkable = HTMLProofer::Element.new(nokogiri.css('img').first, @check)
expect(checkable.url).to eq 'image.png'
end
end

describe '#ignore' do
it 'works for twitter cards' do
nokogiri = Nokogiri::HTML('<meta name="twitter:url" data-proofer-ignore content="http://example.com/soon-to-be-published-url">')
nokogiri = Nokogiri::HTML5('<meta name="twitter:url" data-proofer-ignore content="http://example.com/soon-to-be-published-url">')
checkable = HTMLProofer::Element.new(nokogiri.css('meta').first, @check)
expect(checkable.ignore?).to eq true
end
Expand Down
2 changes: 1 addition & 1 deletion spec/html-proofer/fixtures/html/weird_onclick.html
Original file line number Diff line number Diff line change
@@ -1 +1 @@
<a href="http://www.google.com" onClick="window.open('https://calendar.google.com/calendar/embed?src=o3uckm13gkcc5tu4u3846ik0pc%40group.calendar.google.com&ctz=America/Los_Angeles','Team Calendar','')">
<a href="http://www.google.com" onClick="window.open('https://calendar.google.com/calendar/embed?src=o3uckm13gkcc5tu4u3846ik0pc%40group.calendar.google.com&ctz=America/Los_Angeles','Team Calendar','')"></a>
30 changes: 15 additions & 15 deletions spec/html-proofer/html_spec.rb
Original file line number Diff line number Diff line change
Expand Up @@ -19,40 +19,40 @@
expect(proofer.failed_tests).to eq []
end

it 'fails for an invalid tag' do
it 'works for custom tags' do
html = "#{FIXTURES_DIR}/html/invalid_tag.html"
proofer = run_proofer(html, :file, check_html: true, validation: { report_invalid_tags: true })
expect(proofer.failed_tests.first).to match(/Tag myfancytag invalid \(line 2\)/)
expect(proofer.failed_tests).to eq []
end

it 'fails for an unmatched end tag' do
it 'allows an unmatched end tag' do
html = "#{FIXTURES_DIR}/html/unmatched_end_tag.html"
proofer = run_proofer(html, :file, check_html: true)
expect(proofer.failed_tests.first).to match(/Unexpected end tag : div \(line 3\)/)
expect(proofer.failed_tests).to eq []
end

it 'fails for an unescaped ampersand in attribute' do
it 'allows an unescaped ampersand in attribute' do
html = "#{FIXTURES_DIR}/html/unescaped_ampersand_in_attribute.html"
proofer = run_proofer(html, :file, check_html: true)
expect(proofer.failed_tests.first).to match(/htmlParseEntityRef: expecting ';' \(line 2\)/)
expect(proofer.failed_tests).to eq []
end

it 'fails for mismatch between opening and ending tag' do
it 'allows mismatch between opening and ending tag' do
html = "#{FIXTURES_DIR}/html/opening_and_ending_tag_mismatch.html"
proofer = run_proofer(html, :file, check_html: true)
expect(proofer.failed_tests.first).to match(/Opening and ending tag mismatch: p and strong/)
expect(proofer.failed_tests).to eq []
end

it 'fails for div inside head' do
it 'allows div inside head' do
html = "#{FIXTURES_DIR}/html/div_inside_head.html"
proofer = run_proofer(html, :file, check_html: true)
expect(proofer.failed_tests.first).to match(/Unexpected end tag : head \(line 5\)/)
expect(proofer.failed_tests).to eq []
end

it 'fails for missing closing quotation mark in href' do
it 'allows missing closing quotation mark in href' do
html = "#{FIXTURES_DIR}/html/missing_closing_quotes.html"
proofer = run_proofer(html, :file, check_html: true)
expect(proofer.failed_tests.to_s).to match(/Couldn't find end of Start Tag a \(line 6\)/)
expect(proofer.failed_tests).to eq []
end

it "doesn't fail for single ampersand" do
Expand All @@ -61,10 +61,10 @@
expect(proofer.failed_tests).to eq []
end

it 'fails for single ampersand when asked' do
it "allows single ampersand" do
html = "#{FIXTURES_DIR}/html/single_amp.html"
proofer = run_proofer(html, :file, check_html: true, validation: { report_missing_names: true })
expect(proofer.failed_tests.first).to match('htmlParseEntityRef: no name')
expect(proofer.failed_tests).to eq []
end

it 'ignores embeded scripts when asked' do
Expand All @@ -78,7 +78,7 @@
opts = { check_html: true, validation: { report_script_embeds: true } }
ignorable_script = "#{FIXTURES_DIR}/html/ignore_script_embeds.html"
proofer = run_proofer(ignorable_script, :file, opts)
expect(proofer.failed_tests.length).to eq 2
expect(proofer.failed_tests).to eq []
end

it 'does not fail for weird iframe sources' do
Expand Down
Loading