Gather pagegraph data from all over the internet

brave/pagegraph-crawl

pagegraph-crawl

Command line tool for crawling web pages with PageGraph.

Install

Building the tool requires tsc (the TypeScript compiler) to be installed.

npm install
npm run build

Test

npm run test

The tests are defined in test/test.js. Test parameters are defined in test/config.js and can be overridden via environment variables. You must specify the path to a PageGraph binary.

Usage

Since PageGraph is built as part of Brave Nightly, you can point the binary path at your local Brave Nightly installation.

npm run crawl -- \
    -b /Applications/Brave\ Browser\ Nightly.app/Contents/MacOS/Brave\ Browser\ Nightly \
    -u https://brave.com \
    -t 5 \
    -o output/ \
    --debug debug

The -t option specifies how many seconds to crawl the URL given in -u, using the PageGraph binary given in -b. The -o option sets the output directory.
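To crawl several URLs in a row, the invocation above can be wrapped in a small shell script. The following is a minimal sketch, not part of this repository: the function names (crawl_url, crawl_all) and the environment variables (BRAVE_BINARY, CRAWL_SECONDS, OUTPUT_DIR, DRY_RUN) are hypothetical names chosen for illustration, and only the -b, -u, -t and -o flags shown above are used. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: crawl several URLs one after another with pagegraph-crawl.
# All variable and function names here are illustrative assumptions.
set -eu

# Hypothetical defaults; override them in the environment as needed.
BRAVE_BINARY="${BRAVE_BINARY:-/Applications/Brave Browser Nightly.app/Contents/MacOS/Brave Browser Nightly}"
CRAWL_SECONDS="${CRAWL_SECONDS:-5}"
OUTPUT_DIR="${OUTPUT_DIR:-output/}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the command, anything else = run it

# Crawl a single URL using only the flags documented above.
crawl_url() {
  if [ "$DRY_RUN" = "1" ]; then
    printf 'npm run crawl -- -b "%s" -u "%s" -t %s -o "%s"\n' \
      "$BRAVE_BINARY" "$1" "$CRAWL_SECONDS" "$OUTPUT_DIR"
  else
    npm run crawl -- -b "$BRAVE_BINARY" -u "$1" -t "$CRAWL_SECONDS" -o "$OUTPUT_DIR"
  fi
}

# Crawl every URL passed as an argument, in order.
crawl_all() {
  for url in "$@"; do
    crawl_url "$url"
  done
}

crawl_all "$@"
```

Saved as, say, crawl-many.sh (a hypothetical filename) in the repository root, it could be run as `DRY_RUN=0 sh crawl-many.sh https://brave.com https://example.com`. Each URL is crawled in its own invocation of the tool, so a failure on one URL stops the script (because of set -e) rather than silently continuing.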

You can see all supported options:

npm run crawl -- -h

NOTE: PageGraph currently does not track Puppeteer or other automation scripts, so modifying or interacting with the document through DevTools or Puppeteer while recording a PageGraph file will fail.