Ad networks often switch to new, random domains, which makes them hard to keep track of manually. This crawler automates that process.
Every day the circumvention monitor runs automatically and generates two files:
- report/report.md - human-readable report.
- report/rules.txt - blocking rules for the domains discovered by the crawler.
To monitor a new ad system, add a new JS object to the configuration:
{
  "name": "AD SYSTEM NAME",
  "criteria": [
    {
      "urlPattern": "URL PATTERN",
      "contentPattern": "CONTENT PATTERN",
      "contentType": "script",
      "thirdParty": true,
      "ruleProperties": {
        "modifiers": ["third-party"],
        "scope": "registeredDomain"
      }
    }
  ],
  "pages": ["https://example.net/", "https://example.com/"]
}
- `name` - ad system name. Will be used in the report to identify this ruleset.
- `criteria` - a list of criteria that will be used to identify ad requests.
  - `urlPattern` (optional) - the ad request URL must match this pattern. It can be a string, a wildcard, or a regular expression. Examples:
    - `test` - string, matches all URLs that contain this string.
    - `*test*test*` - wildcard, the URL must match this wildcard.
    - `/.*test.*/` - regular expression. Note that the enclosing `/` characters are just delimiters and not a part of the regular expression.
  - `contentPattern` (optional) - the response content must match this pattern. Just like `urlPattern`, it can be a string, a wildcard, or a regular expression.
  - `contentType` (optional) - one of this list.
  - `thirdParty` (optional) - if specified, we check whether the request is third-party or not.
  - `ruleProperties` (optional) - additional properties for the rules generated by the compiler.
    - `modifiers` (optional) - an array of modifiers that should be added to the rule.
    - `scope` (optional) - rule scope. Possible values are:
      - `domain` - full domain name (`||exact.domain.name^`)
      - `registeredDomain` - registered domain name (eTLD+1) (`||domain.name^`)
      - `domainAndPath` - domain + path (`||exact.domain.name/path/without/query`)
- `pages` - a list of web pages that will be crawled in order to extract this ad system's domains.
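
To make the options above more concrete, here is a minimal sketch of how the three pattern forms and the three `scope` values could be interpreted. It is not the crawler's actual code. The helper names (`patternToRegExp`, `matchesPattern`, `naiveRegisteredDomain`, `buildRule`) are hypothetical, and the `domain` default scope and the `$modifier` suffix format are assumptions. The registered domain is approximated by the last two hostname labels, which is wrong for suffixes such as `co.uk`. A real implementation would need a public suffix list.

```javascript
// Hypothetical helpers for illustration only; not taken from the crawler's source.

// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Convert a configuration pattern into a RegExp:
// - "/.../" is a regular expression (the slashes are only delimiters),
// - a pattern containing "*" is a wildcard that must match the whole string,
// - anything else is a plain substring.
function patternToRegExp(pattern) {
  if (pattern.length > 2 && pattern.startsWith('/') && pattern.endsWith('/')) {
    return new RegExp(pattern.slice(1, -1));
  }
  if (pattern.includes('*')) {
    const body = pattern.split('*').map(escapeRegExp).join('[\\s\\S]*');
    return new RegExp('^' + body + '$');
  }
  return new RegExp(escapeRegExp(pattern));
}

// An omitted (optional) pattern is treated as "matches everything".
function matchesPattern(pattern, text) {
  return pattern === undefined || patternToRegExp(pattern).test(text);
}

// Assumption: eTLD+1 is approximated by the last two labels of the hostname.
function naiveRegisteredDomain(hostname) {
  return hostname.split('.').slice(-2).join('.');
}

// Build a blocking rule for a matched request URL according to ruleProperties.
// Assumption: "domain" is used when no scope is given.
function buildRule(requestUrl, ruleProperties = {}) {
  const { scope = 'domain', modifiers = [] } = ruleProperties;
  const url = new URL(requestUrl);
  let target;
  switch (scope) {
    case 'domain':
      target = url.hostname + '^';
      break;
    case 'registeredDomain':
      target = naiveRegisteredDomain(url.hostname) + '^';
      break;
    case 'domainAndPath':
      target = url.hostname + url.pathname; // the query string is dropped
      break;
    default:
      throw new Error('Unknown scope: ' + scope);
  }
  const suffix = modifiers.length > 0 ? '$' + modifiers.join(',') : '';
  return '||' + target + suffix;
}

const adUrl = 'https://ads.cdn.example.com/path/ad.js?x=1';
console.log(matchesPattern('*test*test*', 'https://test.example/test.js')); // true
console.log(buildRule(adUrl, { scope: 'registeredDomain', modifiers: ['third-party'] }));
// ||example.com^$third-party
console.log(buildRule(adUrl, { scope: 'domainAndPath' }));
// ||ads.cdn.example.com/path/ad.js
```

The wildcard is anchored at both ends because the description says the URL must match it, whereas a plain string only needs to be contained in the URL.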
- `yarn install` - install dependencies
- `yarn monitor` - run the crawler with default arguments

Run `yarn monitor -v` to make it print the verbose log.
- Make basic rules modifiers configurable (see report.js)
- Allow monitoring DOM state (I need examples where this is needed)
- "criteria" should allow blocking or adding custom CSS to test pages so that we could trigger circumvention scripts