Respect robots.txt #127
Comments
If this is implemented, I think there should be an option to turn it off as well. It might also be worth considering that if a site disallows all bots except the Wayback Machine, or specifically allows the Internet Archive without disallowing HTTP Archive, it might be okay to crawl it.
Could you elaborate? I'm not sure I follow re: the option to turn it off. It seems iffy to assume that sites that allow archive.org implicitly want to allow httparchive.org. If we go this route, I'd prefer to take the site's config literally to be sure.
I mean allowing someone to run the crawl without obeying robots.txt. We could potentially avoid people getting mad by changing the user agent when that option is on. Since HTTP Archive is a smaller project than the Wayback Machine, people are less likely to list HTTP Archive in their robots.txt file. For example, github.com/robots.txt excludes all directories except for certain bots. This is the bottom of their robots.txt file:
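(The original snippet wasn't preserved in this thread. As a hedged illustration only — the bot name is a placeholder, not GitHub's actual list — a robots.txt that blocks everything except named bots typically ends like this:)

```
User-agent: ExampleBot
Allow: /

User-agent: *
Disallow: /
```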
It's likely that many websites have a similar configuration, so respecting robots.txt would have the side effect of excluding many websites.
Well, if anyone else is running their own HTTP Archive crawl, they're free to change any configs in their fork :) Before making a change like this, we would definitely run a test to evaluate its impact. If it turns out we're losing a significant portion of sites, we'd rethink it.
FWIW, I expect that this would be done in a pre-crawl filtering pass of some kind when we are determining the URLs to crawl (same with removing dead domains, etc). |
FWIW, if/when we switch over from Alexa to the Chrome UX Report, non-indexable sites are filtered out by default.
Obsoleted by CrUX corpus |
If a site to be tested explicitly disallows crawlers, we should remove it from the test URL list.
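A pre-crawl filtering pass like the one described above could be sketched with Python's stdlib robots.txt parser. The function and agent names here are illustrative assumptions, not HTTP Archive's actual code:

```python
from urllib import robotparser

# Assumed crawler token appended to the user agent (see discussion below).
USER_AGENT = "HTTP Archive"

def is_allowed(robots_txt: str, path: str, agent: str = USER_AGENT) -> bool:
    """True if the given robots.txt body permits `agent` to fetch `path`."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

# A site that explicitly disallows all crawlers:
blocked = "User-agent: *\nDisallow: /\n"
# A site that blocks everyone except a named bot:
selective = (
    "User-agent: HTTP Archive\n"
    "Allow: /\n"
    "\n"
    "User-agent: *\n"
    "Disallow: /\n"
)

# Hypothetical test URL list: keep only sites whose robots.txt permits us.
urls = {"https://a.example/": blocked, "https://b.example/": selective}
crawlable = [u for u, robots in urls.items() if is_allowed(robots, "/")]
print(crawlable)  # ['https://b.example/']
```

In a real pass, the robots.txt body would be fetched per origin before the crawl list is finalized, alongside the dead-domain removal mentioned below.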
WPT also allows appending to the User-Agent, and we might want to set that to "HTTP Archive" so it's obvious in the logs who initiated the crawl on a site.
This would also make it easier for sites to explicitly disallow HTTP Archive, as opposed to all crawlers or all WPT agents (via the PTST/1 UA string). The browser version changes the literal UA string frequently enough that matching on the full string isn't practical. A site can still use a UA wildcard that applies to all bots/crawlers, and we should respect that and exclude the site.
Edit: It turns out the UA field can be a partial match, so a site can specify "HTTP Archive" and we should follow its disallow rules regardless of browser version.
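The partial-match behavior described in the edit above can be sketched as a case-insensitive substring check — a simplification of full robots.txt group matching, for illustration only (the UA string below is hypothetical):

```python
def ua_matches(robots_token: str, crawler_ua: str) -> bool:
    """Does a robots.txt User-agent token apply to this crawler's UA?"""
    # "*" applies to every crawler; otherwise match case-insensitively on a
    # substring, so a version-bearing UA still matches a stable token.
    return robots_token == "*" or robots_token.lower() in crawler_ua.lower()

full_ua = "Mozilla/5.0 Chrome/120.0 PTST/1 HTTP Archive"  # hypothetical appended UA
print(ua_matches("HTTP Archive", full_ua))  # True: matches despite browser version
print(ua_matches("*", full_ua))             # True: wildcard applies to all bots
```

This is why a stable "HTTP Archive" token works even as the browser portion of the UA string churns with each release.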
Related to #115