Cache external URL results #249

Merged 29 commits on Nov 22, 2015

Commits
074f8a2
Begin implementing cache
gjtorikian Sep 18, 2015
9b4f4c6
Implement actual cache save, lookup, and load
gjtorikian Oct 4, 2015
616165c
Add `timecop` dependency
gjtorikian Oct 4, 2015
f6d0bb0
Support hour lookups
gjtorikian Oct 4, 2015
65cb3ec
Stub out the noise
gjtorikian Oct 27, 2015
0f76da6
Test loading and writing of cache
gjtorikian Oct 27, 2015
a930d37
Merge branch 'master' into store-url-results
gjtorikian Oct 27, 2015
91806b8
Finalize implementation of caching load and save
gjtorikian Oct 27, 2015
ed32c07
Remove URLs from cache that were removed
gjtorikian Oct 27, 2015
68e01b2
Add some basic cache read/write tests with time
gjtorikian Oct 27, 2015
92f0608
Add tests for URL addition and failure checks
gjtorikian Oct 27, 2015
4510965
Drop the possibility of "years"
gjtorikian Oct 28, 2015
5469803
Add docs for caching
gjtorikian Oct 28, 2015
36b3ead
Update cache test to halt rewrite
gjtorikian Oct 28, 2015
0746109
Simplify DURATIONS check
gjtorikian Oct 28, 2015
81f4645
Unnecessary
gjtorikian Oct 28, 2015
3e912c1
Move cacher to a new folder called .htmlproofer
gjtorikian Oct 28, 2015
2707128
Update README to note new folder path
gjtorikian Oct 28, 2015
40adf50
Move storage directory around
gjtorikian Oct 29, 2015
50e94f6
Ignore trailing slash when comparing links
gjtorikian Oct 29, 2015
4ec365a
Note new directory for cache
gjtorikian Oct 29, 2015
0cf9ec8
Improved debugging messages
gjtorikian Oct 29, 2015
4288cfd
Add some improvements to the caching
gjtorikian Nov 5, 2015
f83099a
Fix altered handling of a 0 response code
gjtorikian Nov 5, 2015
f3932af
Merge pull request #269 from plaindocs/master
gjtorikian Nov 20, 2015
7050961
Add IP address test
gjtorikian Nov 22, 2015
48b46d7
Merge branch 'ip-href' into store-url-results
gjtorikian Nov 22, 2015
296fd06
Test text changed
gjtorikian Nov 22, 2015
3e5b740
Make more sense in the README
gjtorikian Nov 22, 2015
29 changes: 28 additions & 1 deletion README.md
@@ -179,7 +179,7 @@ In addition, there are a few "namespaced" options. These are:
* `:validation`
* `:typhoeus`
* `:parallel`

* `:cache`

See below for more information.

@@ -223,6 +223,33 @@ HTML::Proofer.new("out/", {:ext => ".htm", :parallel => { :in_processes => 3} })

In this example, `:in_processes => 3` is passed into Parallel as a configuration option.

## Configuring caching

Checking external URLs can slow your tests down. To speed that up, you can enable caching for your external links. With caching enabled, links that were recently found to be valid are skipped until a configurable period of time has passed.

While running tests, HTML::Proofer always writes to a log file within a directory called *tmp/.htmlproofer*. You should probably ignore this folder in your version control system. To use this log file as a cache, pass in the option `:cache` with a hash containing a single key, `:timeframe`. `:timeframe` defines how long a cached result is used before the link is checked again. Its format is a number followed by a letter indicating the unit of time:

* `M` means months
* `w` means weeks
* `d` means days
* `h` means hours
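
As a rough illustration of this format, a timeframe string could be turned into a cutoff time as sketched below. The helper name `timeframe_cutoff`, the `SECONDS_PER_UNIT` table, and the 30-day month are assumptions made for this sketch; the library itself relies on ActiveSupport's duration helpers instead.

```ruby
# Hypothetical helper, not part of HTML::Proofer: converts a timeframe
# string such as '30d' into the cutoff Time before which links are stale.
SECONDS_PER_UNIT = {
  'M' => 30 * 24 * 60 * 60, # assumption: a month is treated as 30 days
  'w' => 7 * 24 * 60 * 60,
  'd' => 24 * 60 * 60,
  'h' => 60 * 60
}.freeze

def timeframe_cutoff(timeframe, now = Time.now)
  # A number followed by exactly one of the supported unit letters.
  match = timeframe.match(/\A(\d+)([Mwdh])\z/)
  raise ArgumentError, "#{timeframe} is not a valid timeframe!" if match.nil?

  amount, unit = match.captures
  now - (amount.to_i * SECONDS_PER_UNIT.fetch(unit))
end

now = Time.utc(2015, 11, 22, 12, 0, 0)
puts timeframe_cutoff('2w', now) # two weeks before `now`
```

Anything checked before the returned time would count as expired and be checked again.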

For example, passing the following options means "recheck links older than thirty days":

``` ruby
{ :cache => { :timeframe => '30d' } }
```

And the following options mean "recheck links older than two weeks":

``` ruby
{ :cache => { :timeframe => '2w' } }
```

Links that failed are kept in the cache and *always* rechecked. If they pass on a later run, their cache entry is updated with the new timestamp.

The cache operates on external links only.
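
The rules above (skip recent successes, always retry failures, recheck anything expired) can be sketched roughly as follows. The `recheck?` helper, the example URLs, and the ISO 8601 timestamps are hypothetical; only the `time` and `message` fields mirror what the cache stores.

```ruby
require 'time'

# Hypothetical sketch, not the library's API: decides whether a cached
# entry needs to be checked again on this run.
def recheck?(entry, cutoff)
  failed = !entry['message'].to_s.empty?       # failures carry a message
  expired = Time.parse(entry['time']) < cutoff # older than the timeframe
  failed || expired
end

cutoff = Time.utc(2015, 10, 23)
cache = {
  'http://example.com/fresh'  => { 'time' => '2015-11-20T00:00:00Z', 'message' => '' },
  'http://example.com/stale'  => { 'time' => '2015-09-01T00:00:00Z', 'message' => '' },
  'http://example.com/broken' => { 'time' => '2015-11-20T00:00:00Z', 'message' => 'failed: 404' }
}

# Only the stale success and the recent failure are selected for rechecking.
to_check = cache.select { |_url, entry| recheck?(entry, cutoff) }.keys
puts to_check
```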

## Logging

HTML-Proofer can be as noisy or as quiet as you'd like. There are two ways to log information:
2 changes: 2 additions & 0 deletions html-proofer.gemspec
@@ -25,10 +25,12 @@ Gem::Specification.new do |gem|
gem.add_dependency 'yell', '~> 2.0'
gem.add_dependency 'parallel', '~> 1.3'
gem.add_dependency 'addressable', '~> 2.3'
gem.add_dependency 'activesupport', '~> 4.2'

gem.add_development_dependency 'redcarpet'
gem.add_development_dependency 'rspec', '~> 3.1'
gem.add_development_dependency 'rake'
gem.add_development_dependency 'awesome_print'
gem.add_development_dependency 'vcr', '~> 2.9'
gem.add_development_dependency 'timecop', '~> 0.8'
end
16 changes: 8 additions & 8 deletions lib/html/proofer.rb
@@ -8,17 +8,15 @@ def require_all(path)
require_all 'proofer'
require_all 'proofer/check_runner'
require_all 'proofer/checks'
require_relative './proofer/utils'
require_relative './proofer/xpathfunctions'

require 'parallel'
require 'fileutils'

begin
require 'awesome_print'
rescue LoadError; end

module HTML

class Proofer
include HTML::Proofer::Utils

@@ -36,6 +34,8 @@ class Proofer
}

def initialize(src, opts = {})
FileUtils.mkdir_p(STORAGE_DIR) unless File.exist?(STORAGE_DIR)

@src = src

if opts[:verbose]
@@ -89,9 +89,7 @@ def logger
end

def run
count = checks.length
check_text = "#{checks} " << (count == 1 ? 'check' : 'checks')
logger.log :info, :blue, "Running #{check_text} on #{@src} on *#{@options[:ext]}... \n\n"
logger.log :info, :blue, "Running #{checks} on #{@src} on *#{@options[:ext]}... \n\n"

if @src.is_a?(Array) && !@options[:disable_external]
check_list_of_links
@@ -130,7 +128,9 @@ def check_directory_of_files

validate_urls unless @options[:disable_external]

logger.log :info, :blue, "Ran on #{files.length} files!\n\n"
count = files.length
file_text = pluralize(count, 'file', 'files')
logger.log :info, :blue, "Ran on #{file_text}!\n\n"
end

# Walks over each implemented check and runs them on the files, in parallel.
@@ -200,7 +200,7 @@ def print_failed_tests

sorted_failures.sort_and_report
count = @failed_tests.length
failure_text = "#{count} " << (count == 1 ? 'failure' : 'failures')
failure_text = pluralize(count, 'failure', 'failures')
fail logger.colorize :red, "HTML-Proofer found #{failure_text}!"
end
end
139 changes: 132 additions & 7 deletions lib/html/proofer/cache.rb
@@ -1,16 +1,141 @@
require_relative 'utils'

require 'json'
require 'active_support/core_ext/string'
require 'active_support/core_ext/date'
require 'active_support/core_ext/numeric/time'

module HTML
class Proofer
module Cache
def create_nokogiri(path)
if File.exist? path
content = File.open(path).read
class Cache
include HTML::Proofer::Utils

FILENAME = File.join(STORAGE_DIR, 'cache.log')

attr_accessor :exists, :load, :cache_log, :cache_time

def initialize(logger, options)
@logger = logger
@cache_log = {}

if options.nil? || options.empty?
@load = false
else
@load = true
@parsed_timeframe = parsed_timeframe(options[:timeframe] || '30d')
end
@cache_time = Time.now

if File.exist?(FILENAME)
@exists = true
contents = File.read(FILENAME)
@cache_log = contents.empty? ? {} : JSON.parse(contents)
else
@exists = false
end
end

def within_timeframe?(time)
(@parsed_timeframe..@cache_time).cover?(time)
end

def urls
@cache_log['urls'] || []
end

def parsed_timeframe(timeframe)
time, date = timeframe.match(/(\d+)(\D)/).captures
time = time.to_f
case date
when 'M'
time.months.ago
when 'w'
time.weeks.ago
when 'd'
time.days.ago
when 'h'
time.hours.ago
else
content = path
fail ArgumentError, "#{date} is not a valid timeframe!"
end
end

def add(url, filenames, status, msg = '')
data = {
:time => @cache_time,
:filenames => filenames,
:status => status,
:message => msg
}

@cache_log[clean_url(url)] = data
end

def detect_url_changes(found)
existing_urls = @cache_log.keys.map { |url| clean_url(url) }
found_urls = found.keys.map { |url| clean_url(url) }

# prepare to add new URLs detected
additions = found.reject do |url, _|
url = clean_url(url)
if existing_urls.include?(url)
true
else
@logger.log :debug, :yellow, "Adding #{url} to cache check"
false
end
end

new_link_count = additions.length
new_link_text = pluralize(new_link_count, 'link', 'links')
@logger.log :info, :blue, "Adding #{new_link_text} to the cache..."

# remove from cache URLs that no longer exist
del = 0
@cache_log.delete_if do |url, _|
url = clean_url(url)
if !found_urls.include?(url)
@logger.log :debug, :yellow, "Removing #{url} from cache check"
del += 1
true
else
false
end
end

del_link_text = pluralize(del, 'link', 'links')
@logger.log :info, :blue, "Removing #{del_link_text} from the cache..."

additions
end

def write
File.write(FILENAME, @cache_log.to_json)
end

def load?
@load.nil?
end


# FIXME: there seems to be some discrepancy where Typhoeus occasionally adds
# a trailing slash to URL strings, which causes issues with the cache
def slashless_url(url)
url.chomp('/')
end

# FIXME: it seems that Typhoeus actually acts on escaped URLs,
# but there's no way to get at that information, and the cache
# stores unescaped URLs. Because of this, some links, such as
# github.com/search/issues?q=is:open+is:issue+fig are not matched
# as github.com/search/issues?q=is%3Aopen+is%3Aissue+fig
def unescape_url(url)
Addressable::URI.unescape(url)
end

Nokogiri::HTML(content)
def clean_url(url)
slashless_url(unescape_url(url))
end
module_function :create_nokogiri
end
end
end
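
To illustrate why URLs are normalized before the cache is compared against the links found on a run, here is a minimal sketch. `normalize_url` and `url_changes` are hypothetical stand-ins for `clean_url` and `detect_url_changes`; the percent-decoding is done by hand here (an assumption, covering ASCII escapes only) instead of through Addressable.

```ruby
# Simplified stand-in for clean_url: decodes ASCII percent-escapes and
# drops a trailing slash so equivalent URLs compare as equal.
def normalize_url(url)
  decoded = url.gsub(/%([0-9A-Fa-f]{2})/) { Regexp.last_match(1).hex.chr }
  decoded.chomp('/')
end

# Returns [added, removed]: URLs found on this run but not yet cached,
# and URLs that are cached but no longer found.
def url_changes(cached_urls, found_urls)
  cached = cached_urls.map { |url| normalize_url(url) }
  found = found_urls.map { |url| normalize_url(url) }
  [found - cached, cached - found]
end

cached = ['http://example.com/', 'http://example.com/search?q=is:open', 'http://example.com/old']
found  = ['http://example.com', 'http://example.com/search?q=is%3Aopen', 'http://example.com/new']

added, removed = url_changes(cached, found)
puts "added: #{added.inspect}, removed: #{removed.inspect}"
```

Without the normalization step, the trailing slash and the escaped colon would each make an unchanged link look like one removal plus one addition.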
57 changes: 50 additions & 7 deletions lib/html/proofer/url_validator.rb
@@ -1,6 +1,7 @@
require 'typhoeus'
require 'uri'
require_relative './utils'
require_relative './cache'

module HTML
class Proofer
@@ -18,16 +19,40 @@ def initialize(logger, external_urls, options, typhoeus_opts, hydra_opts)
@hydra = Typhoeus::Hydra.new(hydra_opts)
@typhoeus_opts = typhoeus_opts
@external_domain_paths_with_queries = {}
@cache = Cache.new(@logger, @options[:cache])
end

def run
@iterable_external_urls = remove_query_values
external_link_checker(@iterable_external_urls)

if @cache.exists && @cache.load
cache_count = @cache.cache_log.length
cache_text = pluralize(cache_count, 'link', 'links')

logger.log :info, :blue, "Found #{cache_text} in the cache..."

urls_to_check = @cache.detect_url_changes(@iterable_external_urls)

@cache.cache_log.each_pair do |url, cache|
if @cache.within_timeframe?(cache['time'])
next if cache['message'].empty? # these were successes to skip
urls_to_check[url] = cache['filenames'] # these are failures to retry
else
urls_to_check[url] = cache['filenames'] # pass or fail, recheck expired links
end
end

external_link_checker(urls_to_check)
else
external_link_checker(@iterable_external_urls)
end

@cache.write
@failed_tests
end

def remove_query_values
return if @external_urls.nil?
return nil if @external_urls.nil?
iterable_external_urls = @external_urls.dup
@external_urls.keys.each do |url|
uri = begin
@@ -75,14 +100,16 @@ def external_link_checker(external_urls)
external_urls = Hash[external_urls.sort]

count = external_urls.length
check_text = "#{count} " << (count == 1 ? 'external link' : 'external links')
check_text = pluralize(count, 'external link', 'external links')
logger.log :info, :blue, "Checking #{check_text}..."

Ethon.logger = logger # log from Typhoeus/Ethon

url_processor(external_urls)

logger.log :debug, :yellow, "Running requests for all #{hydra.queued_requests.size} external URLs..."
logger.log :debug, :yellow, "Running requests for:"
logger.log :debug, :yellow, "###\n" + external_urls.keys.join("\n") + "\n###"

hydra.run
end

@@ -125,14 +152,19 @@ def response_handler(response, filenames)

if response_code.between?(200, 299)
check_hash_in_2xx_response(href, effective_url, response, filenames)
@cache.add(href, filenames, response_code)
elsif response.timed_out?
handle_timeout(href, filenames, response_code)
elsif response_code == 0
handle_failure(href, filenames, response_code)
elsif method == :head
queue_request(:get, href, filenames)
else
return if @options[:only_4xx] && !response_code.between?(400, 499)
# Received a non-successful http response.
add_external_issue(filenames, "External link #{href} failed: #{response_code} #{response.return_message}", response_code)
msg = "External link #{href} failed: #{response_code} #{response.return_message}"
add_external_issue(filenames, msg, response_code)
@cache.add(href, filenames, response_code, msg)
end
end

@@ -153,12 +185,23 @@ def check_hash_in_2xx_response(href, effective_url, response, filenames)

return unless body_doc.xpath(xpath).empty?

add_external_issue filenames, "External link #{href} failed: #{effective_url} exists, but the hash '#{hash}' does not", response.code
msg = "External link #{href} failed: #{effective_url} exists, but the hash '#{hash}' does not"
add_external_issue(filenames, msg, response.code)
@cache.add(href, filenames, response.code, msg)
end

def handle_timeout(href, filenames, response_code)
msg = "External link #{href} failed: got a time out (response code #{response_code})"
@cache.add(href, filenames, 0, msg)
return if @options[:only_4xx]
add_external_issue(filenames, msg, response_code)
end

def handle_failure(href, filenames, response_code)
msg = "External link #{href} failed: response code #{response_code} means something's wrong"
@cache.add(href, filenames, 0, msg)
return if @options[:only_4xx]
add_external_issue filenames, "External link #{href} failed: got a time out", response_code
add_external_issue(filenames, msg, response_code)
end

def add_external_issue(filenames, desc, status = nil)
Expand Down
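
The validator reads cached entries with string keys (`cache['time']`, `cache['message']`) even though entries are added with symbol keys, presumably because the log is round-tripped through JSON between runs. A minimal sketch of that round trip follows; `build_entry` and the ISO 8601 timestamp format are assumptions for illustration, not the PR's code.

```ruby
require 'json'
require 'time'

# Hypothetical entry builder mirroring the fields stored by the cache
# (time, filenames, status, message). The ISO 8601 format is an assumption.
def build_entry(filenames, status, message = '', now = Time.now)
  { :time => now.utc.iso8601, :filenames => filenames, :status => status, :message => message }
end

log = { 'http://example.com' => build_entry(['index.html'], 200, '', Time.utc(2015, 11, 22)) }

# Serializing and parsing simulates writing the log and loading it on the next run.
restored = JSON.parse(log.to_json)
entry = restored['http://example.com']

puts entry.keys.inspect        # keys come back as strings, not symbols
puts Time.parse(entry['time']) # timestamps come back as strings and need parsing
```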
6 changes: 6 additions & 0 deletions lib/html/proofer/utils.rb
@@ -3,6 +3,12 @@
module HTML
class Proofer
module Utils
STORAGE_DIR = File.join('tmp', '.htmlproofer')

def pluralize(count, single, plural)
"#{count} " << (count == 1 ? single : plural)
end

def create_nokogiri(path)
if File.exist? path
content = File.open(path).read