
[SEO Audits] Integrate robots.txt analysis #4356

Closed
2 tasks done
rviscomi opened this issue Jan 25, 2018 · 4 comments

@rviscomi
Member

rviscomi commented Jan 25, 2018

Using a JS-based robots.txt parser (like this one), validate the file itself and apply existing SEO audits whenever applicable.

This integration has two parts:

  • robots.txt is valid
  • Page is not blocked from indexing

robots.txt is valid (new audit)

Audit group: Crawling and indexing
Description: robots.txt is valid
Failure description: robots.txt is not valid
Help text: If your robots.txt file is malformed, crawlers may not be able to understand how you want your website to be crawled or indexed. Learn more.

Success conditions (a rough sketch of the check follows the list):

  • robots.txt doesn't exist, otherwise:
    • the response code is not 5xx
    • it can be parsed without errors
    • it does not contain conflicting directives, e.g. all and noindex
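
A minimal sketch of how those conditions might translate into audit logic, assuming a gatherer that records the robots.txt response as {status, content} and a hypothetical parseRobots() helper that returns parse/conflict errors (the real audit and parser APIs may differ):

  // Sketch only: `robotsTxt` is assumed to be {status, content} collected by a gatherer,
  // and parseRobots() is a hypothetical parser returning a list of {line, message} errors.
  function auditRobotsTxtIsValid(robotsTxt) {
    if (robotsTxt.status === 404) {
      return {rawValue: true};               // no robots.txt at all -> pass
    }
    if (robotsTxt.status >= 500) {
      return {rawValue: false, debugString: `request for /robots.txt returned HTTP ${robotsTxt.status}`};
    }
    const errors = parseRobots(robotsTxt.content);  // parse errors, conflicting directives, etc.
    return {rawValue: errors.length === 0, details: errors};
  }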

Page is not blocked from indexing

Add the following success condition:

  • The robots.txt file does not include directives in the blocklist.

Note that directives may be applied to the site as a whole or a specific page. Only fail if the current page is blocked from indexing (directly or indirectly).
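
A rough sketch of that page-level check, with isPageBlocked() and the {directive, value} rule shape assumed for illustration rather than taken from the actual implementation:

  // Sketch: `rules` is assumed to be the {directive, value} entries that apply to the crawler
  // (e.g. the `*` group). Returns true if `pageUrl` is blocked from indexing.
  function isPageBlocked(pageUrl, rules) {
    const path = new URL(pageUrl).pathname;
    return rules.some(rule => {
      if (rule.directive === 'noindex') return true;                 // explicit noindex
      if (rule.directive !== 'disallow' || rule.value === '') return false;
      return path.startsWith(rule.value.replace(/\*.*$/, ''));       // crude prefix match; ignores wildcards and $
    });
  }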

@kdzwinel
Collaborator

kdzwinel commented Feb 8, 2018

Do we want to let the user know if /robots.txt fails with something like HTTP 500? IMO if the response code is in the HTTP 500-599 range we can safely report it as an issue.

@kdzwinel
Collaborator

User-agent: Googlebot
Disallow: / # everything is blocked for googlebot

User-agent: *
Disallow: # but allowed for everyone else

Should we fail in such a case? How about a robots.txt that only blocks, e.g., Googlebot-Image, or Bing, Yandex, DuckDuckGo? 🤔

@rviscomi
Member Author

For consistency with #3182, let's try to avoid distinguishing between crawlers. If the common case is the * user agent, then that's the one we should check. If possible, it would be great to warn with something like "you passed, but you're blocking googlebot".

The alternative is to fail the audit when seeing anything resembling noindex, which seems too strict.

I'd also love to see the contents echoed back in the extra info table or similar. Just showing it to users is a sort of manual validation, even if the audit passes. As a secondary benefit, this would be great for data mining later.
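
A sketch of that pass-with-warning idea, reusing the illustrative isPageBlocked() helper from above and assuming the parser exposes one {userAgent, rules} group per user-agent (names and result shape are illustrative):

  // Sketch: fail only on the `*` group, but warn when a crawler-specific group (e.g. Googlebot)
  // blocks the page anyway.
  function checkIndexability(pageUrl, groups) {
    const wildcard = groups.find(g => g.userAgent === '*');
    const result = {rawValue: !(wildcard && isPageBlocked(pageUrl, wildcard.rules))};
    const googlebot = groups.find(g => /googlebot/i.test(g.userAgent));
    if (result.rawValue && googlebot && isPageBlocked(pageUrl, googlebot.rules)) {
      result.debugString = 'Page is indexable, but robots.txt blocks Googlebot specifically.';
    }
    return result;
  }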

@kdzwinel
Collaborator

kdzwinel commented Mar 22, 2018

For the record, here is the full set of rules I've put together from various sources and implemented in the robots.txt validator (a sketch of a few of these checks follows the list):

Rules

  1. request for /robots.txt doesn't return HTTP 500+

  2. robots.txt file is smaller than 500 KB (gzipped) - this one is WIP

  3. only empty lines, comments and directives (matching "name: value" format) are allowed

  4. only directives from the safelist are allowed:

    'user-agent', 'disallow', // standard
    'allow', 'sitemap', // universally supported
    'crawl-delay', // yahoo, bing, yandex
    'clean-param', 'host', // yandex
    'request-rate', 'visit-time', 'noindex' // not officially supported, but used in the wild

  5. there are no 'allow' or 'disallow' directives before 'user-agent'

  6. 'user-agent' can't have empty value

  7. 'sitemap' must provide an absolute URL with http/https/ftp scheme

  8. 'allow' and 'disallow' values should either be empty, or start with "/" or "*"

  9. 'allow' and 'disallow' should not use '$' in the middle of a value (e.g. "allow: /file$html")
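
A minimal sketch of a few of the per-line checks above (rules 3, 4, 7 and 8), with function names and error shapes chosen for illustration rather than taken from the actual validator:

  const SAFELIST = new Set([
    'user-agent', 'disallow',                  // standard
    'allow', 'sitemap',                        // universally supported
    'crawl-delay',                             // yahoo, bing, yandex
    'clean-param', 'host',                     // yandex
    'request-rate', 'visit-time', 'noindex',   // used in the wild
  ]);

  // Validates a single robots.txt line and returns a list of {lineNumber, message} errors.
  function validateLine(line, lineNumber) {
    const errors = [];
    const trimmed = line.replace(/#.*$/, '').trim();   // comments are allowed (rule 3)
    if (trimmed === '') return errors;                 // empty lines are allowed (rule 3)
    const match = trimmed.match(/^([^:]+):\s*(.*)$/);
    if (!match) {
      errors.push({lineNumber, message: 'expected "name: value" format'});
      return errors;
    }
    const directive = match[1].trim().toLowerCase();
    const value = match[2].trim();
    if (!SAFELIST.has(directive)) {                                       // rule 4
      errors.push({lineNumber, message: `unknown directive "${directive}"`});
    }
    if (directive === 'sitemap' && !/^(https?|ftp):\/\//i.test(value)) {  // rule 7
      errors.push({lineNumber, message: 'sitemap should be an absolute URL (http/https/ftp)'});
    }
    if ((directive === 'allow' || directive === 'disallow') &&
        value !== '' && !/^[\/*]/.test(value)) {                          // rule 8
      errors.push({lineNumber, message: 'pattern should be empty, or start with "/" or "*"'});
    }
    return errors;
  }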

Test

I ran my validator against the top 1000 domains and got the following errors for 39 of them: https://gist.github.com/kdzwinel/b791967eb66d0e2925ea22c8ca14233a

Resources

Various docs:

and online validators:
