403 error for drupal.org domain #197

Closed
kostajh opened this issue Apr 3, 2015 · 14 comments · Fixed by #201

Comments

@kostajh

kostajh commented Apr 3, 2015

Any idea why html-proofer is returning "403 No error" for drupal.org links? See for example the output from https://travis-ci.org/savaslabs/savaslabs.github.io/builds/57088282

Other https links on that site work just fine.

@gjtorikian
Owner

Such a link is behind authentication; when I visit it, I get https://www.drupal.org/user/login?destination=user/3013749.

I had some ideas around this in #86. For now, if it's an annoyance, I would suggest setting the href_ignore option to skip over these links. 😢
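
As a rough illustration of that workaround, the sketch below shows what an ignore list for the authenticated drupal.org profile links might look like and which links it would skip. The `IGNORED_HREFS` constant, the `ignored?` helper, and the `example-user` link are hypothetical and exist only for this example. The commented-out call is an assumption about how the html-proofer release of that time was invoked, so check the README of your installed version for the exact option name.

```ruby
# Hypothetical ignore list for this sketch. Regexes match by pattern,
# plain strings must equal the href exactly.
IGNORED_HREFS = [
  %r{\Ahttps?://(www\.)?drupal\.org/u/},
  %r{\Ahttps?://(www\.)?drupal\.org/user/}
].freeze

# Returns true when the href matches any entry in the ignore list.
def ignored?(href, patterns = IGNORED_HREFS)
  patterns.any? do |pattern|
    pattern.is_a?(Regexp) ? pattern.match?(href) : pattern == href
  end
end

links = [
  "https://www.drupal.org/u/example-user", # hypothetical profile link
  "https://www.drupal.org/project/drupal",
  "https://github.com/gjtorikian/html-proofer"
]

# Only the links that are not ignored would still be checked.
puts links.reject { |href| ignored?(href) }

# Assumed invocation, left commented out so the sketch runs without the gem:
# HTML::Proofer.new("./_site", href_ignore: IGNORED_HREFS).run
```

Anchoring the patterns to the profile paths keeps links such as https://www.drupal.org/project/drupal in scope, so a mistyped project URL would still be reported.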

@gjtorikian
Owner

Also closing as a dupe of #86

@kostajh
Author

kostajh commented Apr 6, 2015

@gjtorikian thanks, that makes sense. I didn't even realize that the /u/{username} links on drupal.org required authentication; the same page (aliased to a different URL) loads fine for an unauthenticated user.

What's odd, though, is that the last two errors in the build definitely do not require authentication but are yielding 403 errors:

If you have thoughts on those two please let me know.

@gjtorikian
Owner

Now that is indeed interesting. 😄

@gjtorikian gjtorikian reopened this Apr 6, 2015
@benbalter
Contributor

I'm seeing this on several sites, including codex.wordpress.org. I wonder if travis's IP is blacklisted? See https://travis-ci.org/benbalter/benbalter.github.com/builds/59152824 for an example.

@benbalter
Contributor

Both this and #200 (which may be related) seem domain based. Perhaps a test-wide domain blacklist?

@gjtorikian
Owner

This is definitely related to #200. Compare

curl -A 'Typhoeus' -I https://www.drupal.org/
HTTP/1.1 403 Forbidden
Cache-Control: no-cache

to

curl -I https://www.drupal.org/
HTTP/1.1 200 OK
Age: 1640
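
The same comparison can be reproduced without touching drupal.org. The sketch below is an illustration only: it starts a toy local server that rejects one User-Agent, then sends a HEAD request with two different agents. `start_blocking_server` and `head_status` are hypothetical helpers written for this example, and the toy server is a stand-in that says nothing about how drupal.org actually implements its block.

```ruby
require "net/http"
require "socket"

# Toy server standing in for a site that blocks one User-Agent.
# It reads the request headers and answers 403 or 200 with an empty body.
def start_blocking_server(blocked_agent)
  server = TCPServer.new("127.0.0.1", 0) # port 0 asks the OS for a free port
  thread = Thread.new do
    loop do
      client = server.accept
      headers = []
      while (line = client.gets) && line != "\r\n"
        headers << line
      end
      pattern = /\AUser-Agent:.*#{Regexp.escape(blocked_agent)}/i
      status = headers.any? { |h| h.match?(pattern) } ? "403 Forbidden" : "200 OK"
      client.write("HTTP/1.1 #{status}\r\nContent-Length: 0\r\nConnection: close\r\n\r\n")
      client.close
    end
  end
  [server.addr[1], thread]
end

# Sends a HEAD request with the given User-Agent and returns the status
# code as a String. The nil argument disables any proxy taken from the
# environment, since this is a loopback request.
def head_status(host, port, path, user_agent)
  Net::HTTP.start(host, port, nil) do |http|
    http.head(path, "User-Agent" => user_agent).code
  end
end

port, _thread = start_blocking_server("Typhoeus")
puts head_status("127.0.0.1", port, "/", "Typhoeus")    # 403
puts head_status("127.0.0.1", port, "/", "curl/7.40.0") # 200
```

Pointing `head_status` at a real host would show whether a 403 depends on the User-Agent alone, which is the question the two curl commands answer.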

@benbalter
Contributor

@gjtorikian you're 100% right. Reached out to friends at Drupal and WP to see what's up.

@gjtorikian
Owner

I appreciate that. Still, the problem seems to expand beyond just the scope of those large providers:

curl -I http://www.law.cornell.edu/supct/html/92-1292.ZS.html
HTTP/1.1 301 Moved Permanently
Server: nginx/1.1.19
Date: Sun, 19 Apr 2015 21:16:48 GMT

curl -A 'Typhoeus' -I http://www.law.cornell.edu/supct/html/92-1292.ZS.html
curl: (52) Empty reply from server

I'm thinking of trying the initial HEAD and then, if proofer receives an empty reply, trying it again with a new User-Agent.
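
A minimal sketch of that retry idea follows. It is not the fix that later shipped in html-proofer: `FALLBACK_AGENTS`, `DEFAULT_REQUESTER`, and `head_with_fallback` are hypothetical names, and the second User-Agent string is made up for the example. The requester is injectable so the example can run offline against a stub.

```ruby
require "net/http"
require "uri"

# Hypothetical agent list for this sketch, tried in order.
FALLBACK_AGENTS = ["Typhoeus", "Mozilla/5.0 (compatible; link-checker)"].freeze

# Default requester: a HEAD request via Net::HTTP. Returns the status code
# as an Integer, or nil when the server closes the connection without
# replying (what curl reports as "Empty reply from server").
DEFAULT_REQUESTER = lambda do |url, agent|
  uri = URI(url)
  begin
    Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
      http.head(uri.request_uri, "User-Agent" => agent).code.to_i
    end
  rescue EOFError, Errno::ECONNRESET
    nil
  end
end

# Tries each User-Agent in turn and returns [status, agent] for the first
# attempt that gets any reply. Returns [nil, nil] if every attempt is empty.
def head_with_fallback(url, agents = FALLBACK_AGENTS, requester = DEFAULT_REQUESTER)
  agents.each do |agent|
    status = requester.call(url, agent)
    return [status, agent] unless status.nil?
  end
  [nil, nil]
end

# Stub requester so the example runs offline: it stays silent for
# "Typhoeus" and answers 301 for anything else.
stub = ->(_url, agent) { agent == "Typhoeus" ? nil : 301 }
p head_with_fallback("http://www.law.cornell.edu/supct/html/92-1292.ZS.html", FALLBACK_AGENTS, stub)
# prints [301, "Mozilla/5.0 (compatible; link-checker)"]
```

This version retries only on an empty reply, as proposed above. Whether a 403 should also trigger a retry is a separate design decision that the sketch leaves open.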

/cc @i0rek as well since he may be unaware of this.

@hanshasselberg

I was indeed unaware. That's interesting, thanks for the heads up. Is there anything I can do?

jcemer added a commit to jcemer/personal-website that referenced this issue Apr 20, 2015
Links to Tumblr or Wordpress doesnt allow ping.
Check this issue for more information gjtorikian/html-proofer#197.
@gjtorikian
Owner

I don't think so, at least as it pertains to this project. But you might ask future users to try changing their user-agent if Typhoeus fails.

@joshuami

I checked with my engineering team. We blocked the Typhoeus user agent back on April 2nd because we had a user who was causing a denial of service on Drupal.org by attempting to mirror every piece of the site.

That's not the fault of html-proofer per se, but there was definitely a user abusing Drupal.org with it.

Given the size of Drupal.org and the age of some of our pages, I'm not sure that running html-proofer on the entire domain is best practice.

If you want to do something like this for research, let me know and we can set up a rate-limited test on our staging site rather than production.

If I'm missing why there is a need to run Drupal.org through html-proofer, let me know. We are open to working on something that improves the quality of our site.

@kostajh
Author

kostajh commented Apr 20, 2015

@joshuami thanks for your reply. I'm definitely not interested in running all of Drupal.org through html-proofer. Rather, I use html-proofer to check the validity of external links on my company's site. I'd like html-proofer to confirm that, for example, I linked correctly to https://drupal.org/project/drupal instead of https://drupal.org/project/drupl

Check out the build at the top of this issue for more detail: https://travis-ci.org/savaslabs/savaslabs.github.io/builds/57088282

@gjtorikian
Owner

@kostajh The next version of html-proofer will have a fix for this.

@joshuami Thanks a bunch for the reply. I think my only concern here is to notify Typhoeus of the User-Agent ban so that legit users/projects (like this one!) are aware. It sucks that someone abused Typhoeus in that way.

5 participants