This repository has been archived by the owner on Jan 4, 2023. It is now read-only.

Respect robots.txt #127

Closed
rviscomi opened this issue Aug 11, 2017 · 7 comments

Comments

@rviscomi
Member

rviscomi commented Aug 11, 2017

If a site to be tested explicitly disallows crawlers, we should remove it from the test URL list.

WPT also allows appending to the User-Agent, and we might want to consider setting that to "HTTP Archive" so it's obvious in a site's logs who initiated the crawl.
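For example, a rough sketch of what that might look like, assuming WebPageTest's runtest.php HTTP API and its appendua parameter (the host and API key here are placeholders, not the actual crawl configuration):

```python
# Rough sketch: submit a WPT test with "HTTP Archive" appended to the UA,
# assuming the runtest.php API's appendua parameter. Host and key are placeholders.
import requests

params = {
    "url": "https://example.com/",
    "k": "WPT_API_KEY",           # placeholder API key
    "f": "json",
    "appendua": "HTTP Archive",   # extra string appended to the browser's UA string
}
resp = requests.get("https://wpt-server.example.org/runtest.php", params=params)
print(resp.json().get("data", {}).get("testId"))
```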

I'd say this would also make it easier for sites to explicitly disallow HTTP Archive, as opposed to all crawlers or even all WPT agents (via the PTST/1 UA string), but the browser version changes the literal UA string frequently enough that it's not really going to matter. A site can still use a UA wildcard that applies to all bots/crawlers, so we should respect that and blacklist the site.

Edit: Turns out the robots.txt User-agent field can be a partial match, so a site can specify "HTTP Archive" and we should follow its disallow rules regardless of browser version.
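For illustration, a minimal pre-crawl filtering sketch along these lines, assuming Python's standard-library robots.txt parser (whose User-agent matching is also a substring match, so an "HTTP Archive" group would apply regardless of browser version); the names USER_AGENT, allowed_by_robots, and filter_urls are hypothetical:

```python
# Minimal sketch of dropping disallowed sites from the test URL list.
# Assumes Python's standard-library parser; the names here are hypothetical.
from urllib import robotparser
from urllib.parse import urlparse

USER_AGENT = "HTTP Archive"  # token a site could target in its robots.txt

def allowed_by_robots(url):
    """False if the site's robots.txt disallows HTTP Archive (or all bots)."""
    parts = urlparse(url)
    rp = robotparser.RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()
    except OSError:
        return True  # robots.txt unreachable: treat as allowed
    # can_fetch() falls back to the "User-agent: *" group when there is no
    # group matching "HTTP Archive", so wildcard disallows are respected too.
    return rp.can_fetch(USER_AGENT, url)

def filter_urls(urls):
    return [u for u in urls if allowed_by_robots(u)]
```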

Related to #115

@hook321

hook321 commented Aug 16, 2017

If this is implemented, I think there should be an option to turn it off as well. However, it might be worth considering that if a site disallows all bots except the Wayback Machine, it might be okay to crawl it, or if it specifically allows the Internet Archive and doesn't disallow HTTP Archive.

@rviscomi
Member Author

Could you elaborate? Not sure I follow re: option to turn it off.

It seems iffy to assume that sites that allow archive.org implicitly want to allow httparchive.org. If we go this route I'd prefer to take the site's config literally to be sure.

@hook321

hook321 commented Aug 16, 2017

I mean allowing someone to run a crawl without respecting robots.txt. We could potentially avoid people getting mad by changing the user agent when that option is on.

Since HTTP Archive is a smaller project than the Wayback Machine, people are less likely to put HTTP Archive into their robots.txt file. For example, github.com/robots.txt excludes all directories except for certain bots. This is the bottom of their robots.txt file:

User-agent: *
Allow: /humans.txt
Disallow: /

It's likely that many websites have a similar configuration, so respecting robots.txt would have the side effect of excluding many of them.
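To illustrate the effect of rules like the ones above, a hypothetical check using Python's standard-library parser (HTTP Archive has no group of its own here, so it falls back to the wildcard group):

```python
# Hypothetical check of the wildcard rules quoted above, using Python's
# standard-library parser.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /humans.txt",
    "Disallow: /",
])
print(rp.can_fetch("HTTP Archive", "/"))            # False: wildcard disallow applies
print(rp.can_fetch("HTTP Archive", "/humans.txt"))  # True: explicitly allowed
```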

@rviscomi
Member Author

Well if anyone else is running their own HTTP Archive crawl, they're free to change any configs in their fork :)

Before we make a change like this, we would definitely run a test to evaluate its impact. If it turns out that we're losing a significant portion of sites, we'd rethink it.

@pmeenan
Member

pmeenan commented Aug 21, 2017

FWIW, I expect that this would be done in a pre-crawl filtering pass of some kind when we are determining the URLs to crawl (same with removing dead domains, etc.).
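A toy sketch of what such a pass might look like: keep only URLs that pass every predicate, e.g. a dead-domain check plus the robots.txt check sketched earlier in the thread. Everything here is hypothetical, not the actual pipeline:

```python
# Toy sketch of a pre-crawl filtering pass composed of simple predicates.
import socket
from urllib.parse import urlparse

def resolves(url):
    """Drop dead domains: keep only hostnames that still resolve in DNS."""
    try:
        socket.getaddrinfo(urlparse(url).hostname, 443)
        return True
    except socket.gaierror:
        return False

def precrawl_filter(urls, predicates):
    return [u for u in urls if all(p(u) for p in predicates)]

# e.g. precrawl_filter(urls, [resolves, allowed_by_robots])
```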

@rviscomi
Member Author

FWIW if/when we switch over from Alexa to Chrome UX Report, non-indexable sites are filtered out by default.

@rviscomi
Member Author

Obsoleted by CrUX corpus
