- Allow
- Cache-delay
- Clean-param
- Comment
- Crawl-delay
- Disallow
- Host
- NoIndex
- Request-rate
- Robot-version
- Sitemap
- User-agent
- Visit-time
The Allow directive specifies paths that may be accessed by the designated crawlers. When no path is specified, the directive is ignored.
robots.txt:
allow: [path]
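For illustration, Allow is typically combined with Disallow to re-open a path inside an otherwise blocked directory (the paths below are hypothetical):
user-agent: *
disallow: /private/
allow: /private/overview.html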
References:
- Google robots.txt specifications
- Yandex robots.txt specifications
- Sean Conner: "An Extended Standard for Robot Exclusion"
- Martijn Koster: "A Method for Web Robots Control"
The Cache-delay directive specifies the minimum interval (in seconds) that a robot should wait after caching one page before starting to cache another.
robots.txt:
cache-delay: [seconds]
Note: This is an unofficial directive.
Library specific: When the value is requested but not found, the value of Crawl-delay is returned to maintain compatibility.
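As an illustration, the directive might appear in a group record like this (the robot name and the 30-second value are made up):
user-agent: ExampleBot
cache-delay: 30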
If page addresses contain dynamic parameters that do not affect the content (e.g. session identifiers, user identifiers, referrers, etc.), they can be described using the Clean-param directive.
robots.txt:
clean-param: [parameter]
clean-param: [parameter] [path]
clean-param: [parameter1]&[parameter2]&[...]
clean-param: [parameter1]&[parameter2]&[...] [path]
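For illustration, a record stripping a session identifier and a referrer parameter from a forum path might look like this (the parameter and path names are hypothetical):
clean-param: sid&ref /forum/showthread.php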
Comments that are intended to be sent back to the author/user of the robot. They can be used to, for example, provide contact information for white-listing requests, or to explain the robot policy of a site.
robots.txt:
comment: [text]
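A sketch of a possible use, with a hypothetical contact address:
user-agent: *
comment: Contact webmaster@example.com for white-listing requests.
disallow: /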
The Crawl-delay directive specifies the minimum interval (in seconds) that a robot should wait after loading one page before starting to load another.
robots.txt:
crawl-delay: [seconds]
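An illustrative record asking a crawler to wait at least 10 seconds between page loads (the robot name and the value are made up):
user-agent: ExampleBot
crawl-delay: 10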
The Disallow directive specifies paths that must not be accessed by the designated crawlers. When no path is specified, the directive is ignored.
robots.txt:
disallow: [path]
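For illustration, a record blocking two hypothetical directories for all crawlers:
user-agent: *
disallow: /admin/
disallow: /tmp/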
References:
- Google robots.txt specifications
- Yandex robots.txt specifications
- W3C Recommendation HTML 4.01 specification
- Sean Conner: "An Extended Standard for Robot Exclusion"
- Martijn Koster: "A Method for Web Robots Control"
- Martijn Koster: "A Standard for Robot Exclusion"
If a site has mirrors, the Host directive is used to indicate which site is the main one.
robots.txt:
host: [host]
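An illustrative entry marking the preferred mirror (the host name is hypothetical):
host: www.example.com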
The NoIndex directive is used to completely remove all traces of matching site URLs from the search engines.
robots.txt:
noindex: [path]
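A hedged example for a hypothetical path:
noindex: /drafts/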
The Request-rate directive specifies the maximum rate at which a robot may request pages, optionally restricted to a time window given as UTC timestamps.
robots.txt:
request-rate: [rate]
request-rate: [rate] [time]-[time]
Library specific: When the value is requested but not found, the value of Crawl-delay is returned to maintain compatibility.
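As an illustration, a rate of one page per five seconds, plus a rate of one page per ten seconds limited to a UTC time window (all values are hypothetical):
request-rate: 1/5
request-rate: 1/10 0600-0845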
Specifies which version of the Robot Exclusion Standard to use for parsing.
robots.txt:
robot-version: [version]
Note: Due to the different interpretations and robot-specific extensions of the Robot Exclusion Standard, it has been suggested that the version number is present more for documentation purposes than for content negotiation.
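For illustration, a file written against version 2.0 of the extended standard might declare (the version value is illustrative):
robot-version: 2.0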
The Sitemap directive is used to list URLs which describe the site structure.
robots.txt:
sitemap: [url]
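An illustrative entry pointing to a hypothetical sitemap location:
sitemap: http://example.com/sitemap.xml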
The User-agent directive is used as a start-of-group record, and specifies which user agent(s) the following rules apply to.
robots.txt:
user-agent: [name]
user-agent: [name]/[version]
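For illustration, a group addressing one specific crawler followed by a catch-all group (the robot name and paths are hypothetical):
user-agent: ExampleBot
disallow: /not-for-examplebot/

user-agent: *
disallow: /private/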
References:
- Google robots.txt specifications
- Yandex robots.txt specifications
- W3C Recommendation HTML 4.01 specification
- Sean Conner: "An Extended Standard for Robot Exclusion"
- Martijn Koster: "A Method for Web Robots Control"
- Martijn Koster: "A Standard for Robot Exclusion"
The robot is requested to visit the site only within the given Visit-time window.
robots.txt:
visit-time: [time]-[time]
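An illustrative window restricting visits to between 06:00 and 08:45 UTC:
visit-time: 0600-0845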