chrome-web-store-scraper

This is a scraper for the Chrome Web Store that extracts data about extensions, themes, and apps. A regularly updated dataset built with it is available at fourleafsearch.com; alternatively, you can scrape the data yourself using this scraper.

Installation

Setup

  1. Clone the repository
  2. Create a virtual environment
  3. Install the dependencies:
pip install -r requirements.txt
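
Assuming a standard git-plus-venv workflow (the repository URL below is inferred from the project name, so adjust if needed), the full sequence might look like:

git clone https://github.com/XavierZambrano/chrome-web-store-scraper.git
cd chrome-web-store-scraper
python -m venv venv
source venv/bin/activate  # on Windows: venv\Scripts\activate
pip install -r requirements.txt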

Set proxy (optional)

Set HTTP_PROXY and HTTPS_PROXY in the .env file.

HTTP_PROXY=http://host:port
HTTPS_PROXY=http://host:port
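
Scrapy's built-in HttpProxyMiddleware picks up HTTP_PROXY and HTTPS_PROXY from the process environment, so the .env file has to be loaded before the crawl starts. A minimal sketch, assuming the project uses python-dotenv for this (check settings.py):

# Hypothetical snippet: load .env into os.environ so Scrapy's
# HttpProxyMiddleware can see HTTP_PROXY and HTTPS_PROXY.
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory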

Setup DynamoDB Pipeline (optional)

The DynamoDbPipeline saves the scraped items to a DynamoDB table.

  1. Deploy the AWS resources using the SAM CLI and copy the AccessKeyId and SecretAccessKey:
sam build && sam deploy
  2. Set the env vars in the .env file:
AWS_ACCESS_KEY_ID=YOUR_AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY=YOUR_AWS_SECRET_ACCESS_KEY
AWS_REGION_NAME=YOUR_AWS_REGION
DYNAMODB_TABLE_NAME=YOUR_DYNAMODB_TABLE_NAME
  3. Configure settings.py: uncomment the DynamoDbPipeline entry so that ITEM_PIPELINES reads:
ITEM_PIPELINES = {
   "chrome_web_store_scraper.pipelines.DynamoDbPipeline": 300,
}
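
For reference, a minimal sketch of what such a pipeline typically looks like, assuming boto3 (this is illustrative, not the repository's actual implementation):

# Illustrative sketch only; the real DynamoDbPipeline lives in
# chrome_web_store_scraper/pipelines.py. Assumes boto3 is installed.
import os

import boto3

class DynamoDbPipeline:
    def open_spider(self, spider):
        # boto3 reads AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY from the
        # environment; the region and table name come from the vars above.
        self.table = boto3.resource(
            "dynamodb",
            region_name=os.environ["AWS_REGION_NAME"],
        ).Table(os.environ["DYNAMODB_TABLE_NAME"])

    def process_item(self, item, spider):
        # One put per scraped item; attribute names follow the item fields
        self.table.put_item(Item=dict(item))
        return item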

Setup PostgresqlPipeline (optional)

The PostgresqlPipeline saves the scraped items to a PostgreSQL table.

  1. Create the PostgreSQL database
  2. Set the env vars in the .env file:
PGHOST=YOUR_PGHOST
PGDATABASE=YOUR_PGDATABASE
PGUSER=YOUR_PGUSER
PGPASSWORD=YOUR_PGPASSWORD
  3. Create the table by running scripts/create_postgresql_table.py
  4. Configure settings.py: uncomment the PostgresqlPipeline entry so that ITEM_PIPELINES reads:
ITEM_PIPELINES = {
  "chrome_web_store_scraper.pipelines.PostgresqlPipeline": 301,
}
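
As with DynamoDB, here is a minimal sketch of a PostgreSQL pipeline, assuming psycopg2 (the table and column names are hypothetical; the real implementation is in chrome_web_store_scraper/pipelines.py):

# Illustrative sketch only. Assumes psycopg2 is installed and that an
# "items" table with matching columns already exists.
import os

import psycopg2

class PostgresqlPipeline:
    def open_spider(self, spider):
        self.conn = psycopg2.connect(
            host=os.environ["PGHOST"],
            dbname=os.environ["PGDATABASE"],
            user=os.environ["PGUSER"],
            password=os.environ["PGPASSWORD"],
        )

    def process_item(self, item, spider):
        # The connection context manager commits the transaction on success
        with self.conn, self.conn.cursor() as cur:
            cur.execute(
                "INSERT INTO items (id, name) VALUES (%s, %s)",
                (item.get("id"), item.get("name")),
            )
        return item

    def close_spider(self, spider):
        self.conn.close()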

Usage

Note: Remember to activate the virtual environment before running these commands.

Scrape the data and let any enabled pipelines save it to a database:

scrapy crawl chromewebstore

Scrape the data and save it to a CSV file (if a pipeline is enabled, the data is also saved to the corresponding database):

scrapy crawl chromewebstore -O output.csv

Scrape the data and save it to a JSON file (if a pipeline is enabled, the data is also saved to the corresponding database):

scrapy crawl chromewebstore -O output.json
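
Scrapy infers the feed format from the file extension; -O overwrites the output file, while the lowercase -o appends to it. Since appending to a plain JSON file produces invalid JSON, JSON Lines is the safer target for incremental runs:

scrapy crawl chromewebstore -o output.jsonl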

For more information about scrapy crawl arguments, refer to the Scrapy docs.
