Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache external URL results #249

Merged
merged 29 commits into from
Nov 22, 2015
Merged

Cache external URL results #249

merged 29 commits into from
Nov 22, 2015

Conversation

gjtorikian
Copy link
Owner

Closes #237.

This implements a cache file, .htmlproofer.log, which stores the results of all the external URL lookups. Proofer will consult this file if asked to, given a certain range.

Everything appears to be working, but I need to write more tests before being certain.

@msh100
Copy link

msh100 commented Oct 15, 2015

undefined method `[]' for nil:NilClass

As with the tests I am getting back this unfortunately.

I think the issue (not a Ruby person) is because a .htmlproofer.log is not loaded as it does not exist, ergo

@cache_log_timestamp ||= Date.strptime(@cache_log['time'])

cache_log['time'] is not defined and cache_log_timestamp doesn't try and check this.

@gjtorikian
Copy link
Owner Author

Had a chance to spend some time on this today and it's working out the way I want it to. I'm writing this down so that I remember to update the docs/tests this week:

  • The format of cache log is a hash of URL keys with values such as the timestamp of the last check and the success/failure of the check.
  • On a Proofer run, if no cache exists, one is written. If a user asks to rely on the cache, it's loaded up.
  • Any URLs that were successful and within the timeframe provided (eg, 30d) are ignored
  • Any URLs that were failures are queued up for a recheck
  • Any URLs, successful or failure, found in the cache that exceed the timeframe provided are requeued
  • New URLs found in files are added for recheck
  • Old URLs found in files are removed from the cache

@gjtorikian gjtorikian self-assigned this Oct 27, 2015
@gjtorikian
Copy link
Owner Author

Hmm. Everything appears to be working locally, but while testing on Travis, nothing seems to be caching. It does keep saying that html-proofer's log is being cached, but never that it's being used.

@akoeplinger
Copy link
Contributor

I think you need to add the directory to cache: http://docs.travis-ci.com/user/caching/#Arbitrary-directories

@gjtorikian
Copy link
Owner Author

@gjtorikian
Copy link
Owner Author

This is now working as expected on Travis! Part of it was my problem, and part of it was theirs. 😁

I'm heading out on vacation so I'll keep this around for a few days more for comments and will likely merge it in next week.

@msh100
Copy link

msh100 commented Oct 30, 2015

Thanks for this @gjtorikian ❤️ This is looking good! There is one thing that I have encountered which does seem a little odd so far.

When I test (with 10h cache) and it passes, then I test again with no changes 100's of links still need to be tested. The cache is working for some links but doesn't look like for them all. Do you have any ideas?

Thanks!

msh100@Marcus ~/ClusterHQ/clusterhq.github.io $ cat Gemfile
source 'https://rubygems.org'
gem 'github-pages'
gem 'html-proofer', :github => 'gjtorikian/html-proofer', :branch => 'store-url-results'
msh100@Marcus ~/ClusterHQ/clusterhq.github.io $ bundle exec rake
bundle exec jekyll build
Configuration file: /Users/msh100/ClusterHQ/clusterhq.github.io/_config.yml
            Source: /Users/msh100/ClusterHQ/clusterhq.github.io
       Destination: /Users/msh100/ClusterHQ/clusterhq.github.io/_site
      Generating...
                    done.
 Auto-regeneration: disabled. Use --watch to enable.
"running htmlproofer with typhoeus for parallel link checking"
"edit rake file to toggle verbose"
Running ["HtmlCheck", "ImageCheck", "LinkCheck", "ScriptCheck"] checks on ./_site on *.html...


Checking 321 external links...
Ran on 116 files!


HTML-Proofer finished successfully.
msh100@Marcus ~/ClusterHQ/clusterhq.github.io $ bundle exec rake
bundle exec jekyll build
Configuration file: /Users/msh100/ClusterHQ/clusterhq.github.io/_config.yml
            Source: /Users/msh100/ClusterHQ/clusterhq.github.io
       Destination: /Users/msh100/ClusterHQ/clusterhq.github.io/_site
      Generating...
                    done.
 Auto-regeneration: disabled. Use --watch to enable.
"running htmlproofer with typhoeus for parallel link checking"
"edit rake file to toggle verbose"
Running ["HtmlCheck", "ImageCheck", "LinkCheck", "ScriptCheck"] checks on ./_site on *.html...


Found 314 links in the cache
Acting on 115 links
Checking 115 external links...
Ran on 116 files!


HTML-Proofer finished successfully.

@gjtorikian
Copy link
Owner Author

@msh100 Check the cache. Are there any non-2xx errors? If so, those are rechecked. I just now realized that 3xx (like a redirect) is rechecked too, which may not be desirable, and might be happening to you. But I definitely want failures to be rechecked so that the cache can update.

@msh100
Copy link

msh100 commented Oct 30, 2015

I checked that, I think they're all 2xx.

I have just invited you for access to our website repository, if you check link-cache-117 branch you can see the changes I have made for this.

@gjtorikian
Copy link
Owner Author

Thanks! I'll dig into it 🔜.

@gjtorikian
Copy link
Owner Author

So I've spent some time on the problem yesterday and today and have almost nailed the problem down. There's one weird edge case but I want to clean up the code a bit before pushing this branch up. Thanks for the real-world scenario, it's exposed some inconsistencies between Typheous and the caching!

Mostly, this deals with inconsistencies regarding the way Typheous
handles/changes URLs. The URLs available in the actual HTML text is not
the same as the URL that Typheous considers the “actual” URL. Typheous
also does not give access to its “actual” URL, and because of this, we
must do some workarounds to get as close as possible to the Typheous
URL.
@gjtorikian
Copy link
Owner Author

Ehh scratch that, I've pushed up my changes. If you'd like you can wipe out our cache file, fetch the new branch, and rerun proofer. You should see (on subsequent runs) the cache mostly working. As I said though there's still one more edge case to iron out.

@gjtorikian
Copy link
Owner Author

After much grinding of teeth I couldn't figure a solution for the edge case.

Basically, there's a single URL called https://appdevelopermagazine.com/3147/2015/9/9/Hedvig%2527s-New-Solution-for-Storage-Provisioning-and-Data-Persistence-for-Docker/. Typheous seems to have a hard time parsing this URL, and so it's always being added and removed from the cache. The "fix" is to just rewrite the URL to https://appdevelopermagazine.com/3147/2015/9/9/Hedvigs-New-Solution-for-Storage-Provisioning-and-Data-Persistence-for-Docker/. It's not the greatest solution but I'm totally baffled as to why the string encoding is throwing things off.

In any case, @msh100 I invite you to try using this branch once more and if you find it acceptable I'll do a merge + release. Thanks!

Regex example for clarity
@msh100
Copy link

msh100 commented Nov 20, 2015

Thanks @gjtorikian, looks good. I think this cache seems to work reliably, thanks for all your effort!

Though this does raise questions about how tests pass with URLs which are IPs that time out (this is likely another issue?)/

@gjtorikian
Copy link
Owner Author

Though this does raise questions about how tests pass with URLs which are IPs that time out (this is likely another issue?)

I'm not sure what you mean here. Are hrefs that are IPs passing, when they should be failing? Was that happening before the cache was introduced, or just now?

@msh100
Copy link

msh100 commented Nov 20, 2015

I'm not sure what you mean here. Are hrefs that are IPs passing, when they should be failing? Was that happening before the cache was introduced, or just now?

Yes, and I guess it was always passing as I have never seen them fail. Look out our repo again, you will see some like http://1.2.3.4/ (I am fairly confident this is unrelated to the caching changes).

@gjtorikian
Copy link
Owner Author

I suspect this might be related to your local network? When I run the tests locally, I get:

- ./_site/flocker/try-flocker/simple-tutorial/index.html
  *  External link http://1.2.3.4/ failed: got a time out (response code 0)
  *  External link http://5.6.7.8/ failed: got a time out (response code 0)

To validate that, in 7050961, I added the same IP addresses. These are failing (correctly) for Travis CI.

gjtorikian added a commit that referenced this pull request Nov 22, 2015
@gjtorikian gjtorikian merged commit 1ff5471 into master Nov 22, 2015
@gjtorikian gjtorikian deleted the store-url-results branch November 22, 2015 20:42
@msh100
Copy link

msh100 commented Nov 22, 2015

Interesting, maybe something to do with how I am running the tests. It passes locally and on travis for me.

Thanks for your work on this PR, it's much appreciated ❤️

@msh100
Copy link

msh100 commented Nov 23, 2015

Update: Ignore the issue I was having, I was only checking for 404 - there's no bug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants