This repository has been archived by the owner on Jan 4, 2023. It is now read-only.

Respect robots.txt #127

Closed
rviscomi opened this issue Aug 11, 2017 · 7 comments

Comments

@rviscomi
Member

rviscomi commented Aug 11, 2017

If a site to be tested explicitly disallows crawlers, we should remove it from the test URL list.

WPT also allows appending to the User-Agent, and we might want to consider setting that to "HTTP Archive" so it's obvious in a site's logs who initiated the crawl.
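For example, a rough sketch of what that might look like, assuming WebPageTest's runtest.php HTTP API and its appendua parameter (the host and API key here are placeholders, not the actual crawl configuration):

```python
# Rough sketch: submit a WPT test with "HTTP Archive" appended to the UA,
# assuming the runtest.php API's appendua parameter. Host and key are placeholders.
import requests

params = {
    "url": "https://example.com/",
    "k": "WPT_API_KEY",           # placeholder API key
    "f": "json",
    "appendua": "HTTP Archive",   # extra string appended to the browser's UA string
}
resp = requests.get("https://wpt-server.example.org/runtest.php", params=params)
print(resp.json().get("data", {}).get("testId"))
```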

I'd say this would also make it easier for sites to explicitly disallow HTTP Archive, as opposed to all crawlers or even all WPT agents (via the PTST/1 UA string), but the browser version changes the literal UA string frequently enough that it's not really going to matter. A site can still use a UA wildcard that applies to all bots/crawlers, so we should respect that and blacklist the site.

Edit: Turns out the robots.txt User-agent field can be a partial match, so a site can specify "HTTP Archive" and we should follow its disallow rules regardless of browser version.
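For illustration, a minimal pre-crawl filtering sketch along these lines, assuming Python's standard-library robots.txt parser (whose User-agent matching is also a substring match, so an "HTTP Archive" group would apply regardless of browser version); the names USER_AGENT, allowed_by_robots, and filter_urls are hypothetical:

```python
# Minimal sketch of dropping disallowed sites from the test URL list.
# Assumes Python's standard-library parser; the names here are hypothetical.
from urllib import robotparser
from urllib.parse import urlparse

USER_AGENT = "HTTP Archive"  # token a site could target in its robots.txt

def allowed_by_robots(url):
    """False if the site's robots.txt disallows HTTP Archive (or all bots)."""
    parts = urlparse(url)
    rp = robotparser.RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()
    except OSError:
        return True  # robots.txt unreachable: treat as allowed
    # can_fetch() falls back to the "User-agent: *" group when there is no
    # group matching "HTTP Archive", so wildcard disallows are respected too.
    return rp.can_fetch(USER_AGENT, url)

def filter_urls(urls):
    return [u for u in urls if allowed_by_robots(u)]
```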

Related to #115

@hook321

hook321 commented Aug 16, 2017

If this is implemented, I think there should be an option to turn it off as well. However, it might be worth considering that if a site disallows all bots except the Wayback Machine, it might be okay to crawl it, or if it specifically allows the Internet Archive and doesn't disallow HTTP Archive.

@rviscomi
Member Author

Could you elaborate? Not sure I follow re: option to turn it off.

It seems iffy to assume that sites that allow archive.org implicitly want to allow httparchive.org. If we go this route I'd prefer to take the site's config literally to be sure.

@hook321

hook321 commented Aug 16, 2017

I mean allowing someone to run a crawl without respecting robots.txt. We could potentially avoid people getting mad by changing the user agent when that option is on.

Since HTTP Archive is a smaller project than the Wayback Machine, people are less likely to put HTTP Archive into their robots.txt file. For example, github.com/robots.txt excludes all directories except for certain bots. This is the bottom of their robots.txt file:

User-agent: *
Allow: /humans.txt
Disallow: /

It's likely that many websites have a similar configuration, so respecting robots.txt would have the side effect of excluding many of them.
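To illustrate the effect of rules like the ones above, a hypothetical check using Python's standard-library parser (HTTP Archive has no group of its own here, so it falls back to the wildcard group):

```python
# Hypothetical check of the wildcard rules quoted above, using Python's
# standard-library parser.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /humans.txt",
    "Disallow: /",
])
print(rp.can_fetch("HTTP Archive", "/"))            # False: wildcard disallow applies
print(rp.can_fetch("HTTP Archive", "/humans.txt"))  # True: explicitly allowed
```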

@rviscomi
Member Author

Well if anyone else is running their own HTTP Archive crawl, they're free to change any configs in their fork :)

Before we make a change like this, we would definitely run a test to evaluate its impact. If it turns out that we're losing a significant portion of sites, we'd rethink it.

@pmeenan
Member

pmeenan commented Aug 21, 2017

FWIW, I expect that this would be done in a pre-crawl filtering pass of some kind when we are determining the URLs to crawl (same with removing dead domains, etc.).
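A toy sketch of what such a pass might look like: keep only URLs that pass every predicate, e.g. a dead-domain check plus the robots.txt check sketched earlier in the thread. Everything here is hypothetical, not the actual pipeline:

```python
# Toy sketch of a pre-crawl filtering pass composed of simple predicates.
import socket
from urllib.parse import urlparse

def resolves(url):
    """Drop dead domains: keep only hostnames that still resolve in DNS."""
    try:
        socket.getaddrinfo(urlparse(url).hostname, 443)
        return True
    except socket.gaierror:
        return False

def precrawl_filter(urls, predicates):
    return [u for u in urls if all(p(u) for p in predicates)]

# e.g. precrawl_filter(urls, [resolves, allowed_by_robots])
```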

@rviscomi
Member Author

FWIW if/when we switch over from Alexa to Chrome UX Report, non-indexable sites are filtered out by default.

@rviscomi
Member Author

Obsoleted by CrUX corpus
